Jaguar Integration with SparkR

Once you have the R and SparkR packages installed, you can start SparkR by executing the following commands:



export JAVA_HOME=/usr/lib/java/jdk1.7.0_75
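# $JDBCJAR and $LDLIBPATH are assumed to point at the Jaguar JDBC jar and the
# Jaguar native library directory; for example (paths borrowed from elsewhere
# in this guide):
export JDBCJAR=$HOME/jaguar/lib/jaguar-jdbc-2.0.jar
export LDLIBPATH=$HOME/jaguar/lib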

sparkR \
  --driver-class-path $JDBCJAR \
  --driver-library-path $LDLIBPATH \
  --conf spark.executor.extraClassPath=$JDBCJAR \
  --conf spark.executor.extraLibraryPath=$LDLIBPATH


Then, at the SparkR command-line prompt, you can execute the following R commands:



library(RJDBC)

sc <- sparkR.init(master="spark://mymaster:7077", appName="MyTest")

sqlContext <- sparkRSQL.init(sc)

drv <- JDBC("", "/home/exeray/jaguar/lib/jaguar-jdbc-2.0.jar", "`")

conn <- dbConnect(drv, "jdbc:jaguar://localhost:8888/test", "test", "test")


df <- dbGetQuery(conn, "select * from int10k where uid > 'anxnfkjj2329' limit 5000;")

head(df)

> cor(df$uid, df$score)
[1] 0.05107418

# build the simple linear regression
> model <- lm(uid ~ score, data = df)
> model

Call:
lm(formula = uid ~ score, data = df)

Coefficients:
(Intercept)        score
  2.115e+07    1.025e-03

# get the names of all of the attributes
> attributes(model)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"



Jaguar's successful integration with Spark and SparkR allows a wide range of data analytics over the fast underlying Jaguar data engine.



Jaguar Supports R

R is a powerful language and environment for statistical computing and graphics.  Jaguar’s JDBC API can integrate with R for extensive data modelling and analysis.  To use R with Jaguar, the RJDBC library needs to be installed first:



$ sudo apt-get install r-cran-rjava

$ sudo R

> install.packages("RJDBC", dep=TRUE)

> q()


$ unset JAVA_HOME

$ R

> library(RJDBC)

> drv <- JDBC("", "/pathtomy/jaguar-jdbc-2.0.jar", "`")

> conn <- dbConnect(drv, "jdbc:jaguar://localhost:8888/test", "test", "test")

> dbListTables(conn)

> dbGetQuery(conn, "select count(*) from mytable;")

> d <- dbReadTable(conn, "mytable")

> q()

Jaguar Supports Spark

Now that Jaguar provides JDBC connectivity, developers can use Apache Spark to load data from Jaguar and perform data analytics and machine learning. The advantage of Jaguar is that Spark can load data from it faster than from other data sources, especially when the data must satisfy complex conditions. The following code is based on two tables with the following structure:

create table int10k ( key: uid int(16), score float(16.3), value: city char(32) );
create table int10k_2 ( key: uid int(16), score float(16.3), value: city char(32) );

Scala program:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.log4j.Logger
import org.apache.log4j.Level

object TestScalaJDBC {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("TestScalaJDBC")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // load the first table through the JDBC source, partitioned on uid
    val people ="jdbc").options(
      Map( "url" -> "jdbc:jaguar://",
           "dbtable" -> "int10k",
           "user" -> "test",
           "password" -> "test",
           "partitionColumn" -> "uid",
           "lowerBound" -> "2",
           "upperBound" -> "2000000",
           "numPartitions" -> "4",
           "driver" -> "" )).load()

    // load the second table the same way
    val people2 ="jdbc").options(
      Map( "url" -> "jdbc:jaguar://",
           "dbtable" -> "int10k_2",
           "user" -> "test",
           "password" -> "test",
           "partitionColumn" -> "uid",
           "lowerBound" -> "2",
           "upperBound" -> "2000000",
           "numPartitions" -> "4",
           "driver" -> "" )).load()

    // sort by columns
    people.sort($"score".desc, $"uid".asc).show()
    people.orderBy($"score".desc, $"uid".asc).show()

    // select by expression
    people.selectExpr("score", "uid").show()
    people.selectExpr("score", "uid as keyone").show()
    people.selectExpr("score", "uid as keyone", "abs(score)").show()

    // select a few columns
    val uid2 ="uid", "score")

    // filter rows
    people.filter(people("uid") > 20990397).show()

    // group by
    people.groupBy("city").count().show()

    // group by and average (the grouping column is assumed to be city;
    // it was lost from the original listing)
    people.groupBy("city").agg(Map(
      "score" -> "avg",
      "uid" -> "max"
    )).show()

    // rollup
    people.rollup("city").agg(Map(
      "uid" -> "avg",
      "score" -> "max"
    )).show()

    // cube
    people.cube("city").agg(Map(
      "uid" -> "avg",
      "score" -> "max"
    )).show()

    // describe statistics
    people.describe("uid", "score").show()

    // find frequent items
    people.stat.freqItems(Seq("uid")).show()

    // join two tables
    people.join(people2, "uid").show()
    people.join(people2, "score").show()
    people.join(people2).where( people("uid") === people2("uid") ).show()
    people.join(people2).where( people("city") === people2("city") ).show()
    people.join(people2).where( people("uid") === people2("uid") and people("city") === people2("city") ).show()
    people.join(people2).where( people("uid") === people2("uid") && people("city") === people2("city") ).show()
    people.join(people2).where( people("uid") === people2("uid") && people("city") === people2("city") ).limit(3).show()

    // union
    people.unionAll(people2).show()

    // intersection
    people.intersect(people2).show()

    // except: rows in people that are not in people2
    people.except(people2).show()

    // take samples (with replacement, 10% fraction, seed 100)
    people.sample(true, 0.1, 100).show()

    // distinct
    people.distinct.show()

    // same as distinct
    people.dropDuplicates().show()

    // cache and persist
    people.cache()
    people.persist()

    // SQL dataframe: register the table first so SQL can reference it
    people.registerTempTable("int10k")
    val df = sqlContext.sql("SELECT * FROM int10k where uid < 200000000 and city between 'Alameda' and 'Berkeley' ")

    sc.stop()
  }
}

The class generated from the above Scala program can be submitted to Spark as follows:

/bin/spark-submit --class TestScalaJDBC \
  --master spark://masterhost:7077 \
  --driver-class-path /path/to/your/jaguar-jdbc-2.0.jar \
  --driver-library-path $HOME/jaguar/lib \
  --conf spark.executor.extraClassPath=/path/to/your/jaguar-jdbc-2.0.jar \
  --conf spark.executor.extraLibraryPath=$HOME/jaguar/lib \
  /path/to/your/application.jar


A Very Useful Tool in a Cluster Environment

Distributed Shell (dsh) is a very powerful tool for system administrators in a cluster environment. Here are some tips for installing and using it:

On Debian/Ubuntu:

sudo apt-get install dsh

On RedHat/CentOS:

sudo yum install dsh

In /etc/dsh/dsh.conf, change remoteshell:

remoteshell = ssh

Here is how to make your public key if you do not have one yet (~/.ssh/

$ ssh-keygen -t rsa -P ""
(no passphrase; ~/.ssh/ will be created)
$ ssh-copy-id -i ~/.ssh/ ALL_OTHER_HOSTS

Then in /etc/dsh/machines.list, put all your hosts, one per line:
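For instance, with three placeholder hostnames:

node1
node2
node3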


Finally, you can issue commands to ALL the hosts in your cluster:

$ dsh -aM -c YOUR_COMMAND

For example:
$ dsh -aM -c uptime



Jaguar Benchmark Against Spark

Jaguar is the distributed version of ArrayDB. Simply installed over a distributed file system such as Gluster or Ceph, ArrayDB performs extremely well.

We set up a cluster of servers running GlusterFS, a distributed file system capable of scaling to several petabytes and handling thousands of clients, and mounted the Gluster volume on the clustered servers. Spark 1.3.1 was also installed on these same servers to benchmark SQL operations on a data set consisting of two million key-value pairs.

In the Spark testing, the procedure to compile and execute the Spark Scala program was as follows:

$ vim MyTest.scala

$ sbt package

$ spark-submit --class MyTest --master yarn-client target/scala-2.11/mytest_2.11-1.0.jar
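For reference, here is a minimal build.sbt that would compile the program below; it is a sketch that assumes Spark 1.3.1 and Scala 2.11 to match the versions mentioned in this post (the exact Scala patch version is an assumption):

// build.sbt — a minimal sketch; versions assumed to match this post
name := "mytest"

version := "1.0"

scalaVersion := "2.11.6"

// "provided" because the Spark jars are supplied by the cluster at run time
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1" % "provided"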

In the Jaguar testing, the data directory in $HOME/arraydb/ was soft-linked to the mounted directory of the Gluster volume, and client programs were then started on the different clustered hosts.
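For example, a sketch of the soft link, assuming the Gluster volume is mounted at /mnt/gluster and the data directory is $HOME/arraydb/data (both paths are assumptions):

$ ln -s /mnt/gluster $HOME/arraydb/data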

1. Joining two tables, each consisting of 2,000,000 data items (32-byte keys, 48-byte values).


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object MyTest {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("MyTest")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val people1 = sc.textFile("hdfs://HD3:9000/home/exeray/2M.txt")

    val people2 = sc.textFile("hdfs://HD3:9000/home/exeray/2M2.txt")

    val schemaString = "uid v1 v2 v3"
    val schema = StructType( schemaString.split(" ").map( fieldName => StructField(fieldName, StringType, true) ) )
    val rowRDD1 =",") ).map( p => Row(p(0), p(1), p(2), p(3)) )
    val rowRDD2 =",") ).map( p => Row(p(0), p(1), p(2), p(3)) )
    val peopleSchemaRDD1 = sqlContext.applySchema(rowRDD1, schema)
    val peopleSchemaRDD2 = sqlContext.applySchema(rowRDD2, schema)
    peopleSchemaRDD1.registerTempTable("people1")
    peopleSchemaRDD2.registerTempTable("people2")

    val res = sqlContext.sql("SELECT * FROM people1 join people2 on people1.uid=people2.uid ")
    // an action forces execution; count() is assumed here for the benchmark
    println(res.count())
  }
}




The equivalent join in Jaguar:

adb> select * join ( 2M, 2M2 );

Result: Spark took 356 seconds, Jaguar took 168 seconds.


2. Joining two tables with a condition

The Spark Scala program adds a where clause:

val res = sqlContext.sql("SELECT * FROM people1 join people2 on people1.uid=people2.uid where people1.uid >= 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' and people1.uid <= 'gggggggggggggggggggggggggggggggg' ")

So does the Jaguar query:

adb> select * join ( 2M, 2M2) where 2M.uid >= 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' and 2M.uid <= 'gggggggggggggggggggggggggggggggg';

Result: Spark took 85 seconds, Jaguar took 13 seconds.


3. Counting items by key

Spark: SELECT count(*) FROM people1 where uid >= 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk' and uid <= 'mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'

Jaguar: select count(*) from 2M2 where uid >= 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk' and uid <= 'mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm' limit 999999999;

Result: Spark took 52 seconds, Jaguar took 0.1 seconds.

4. Point queries


Spark:

val res1 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'yGW4r5thqpu7Bb4TCmxtdTpxXTxcOjhk' ")
val res2 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'lZ1wt3llixT0r5jujuwfcKYb0Og2JF05' ")
val res3 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'YKyoRLuBuYBGTpmQauGgnPZg3FGI3GxZ' ")
val res4 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'YOzKDhmtCN095BVtyJRESRjhamhbJD1H' ")
val res5 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'w0zDgzD2BdWE5sgFxgEL6zBjZckY6mnA' ")
val res6 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'HhB6p8srwRG4PpHCgT1IG1jKJU0PXDJE' ")
val res7 = sqlContext.sql("SELECT count(*) FROM people1 where uid = '4Fpqf8JLORNavhwnthF7olySkAk0ggOj' ")
val res8 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'tvApMnTzxc8SCkyRiSnTWtIYUHJQc91E' ")
val res9 = sqlContext.sql("SELECT count(*) FROM people1 where uid = '1Sg72G7ubanKSiYkzOqaGf9VvQjIVDLV' ")
val res10 = sqlContext.sql("SELECT count(*) FROM people1 where uid = 'omo64Q5VxjzhDs148tNzrW4sGk4ouASS' ")



select * from 2M where uid=yGW4r5thqpu7Bb4TCmxtdTpxXTxcOjhk;
select * from 2M where uid=lZ1wt3llixT0r5jujuwfcKYb0Og2JF05;
select * from 2M where uid=YKyoRLuBuYBGTpmQauGgnPZg3FGI3GxZ;
select * from 2M where uid=YOzKDhmtCN095BVtyJRESRjhamhbJD1H;
select * from 2M where uid=w0zDgzD2BdWE5sgFxgEL6zBjZckY6mnA;
select * from 2M where uid=HhB6p8srwRG4PpHCgT1IG1jKJU0PXDJE;
select * from 2M where uid=4Fpqf8JLORNavhwnthF7olySkAk0ggOj;
select * from 2M where uid=tvApMnTzxc8SCkyRiSnTWtIYUHJQc91E;
select * from 2M where uid=1Sg72G7ubanKSiYkzOqaGf9VvQjIVDLV;
select * from 2M where uid=omo64Q5VxjzhDs148tNzrW4sGk4ouASS;

Result: Spark took 134 seconds, Jaguar took 0.7 seconds.


Conclusion: For conditional queries, especially when indexes are used, Jaguar performs much faster than Spark.

ArrayDB 1.0 Official Release

After six months of hard work on ArrayDB, we are proud to announce ArrayDB 1.0, a next-generation NewSQL data store that delivers high performance based on our revolutionary array-indexing technology.

Some key features of ArrayDB are:

  • High Performance: 5,000,000 data items ingested per minute while indexes are built at the same time. High write performance allows data to be stored at high velocity.
  • Fast Join: multiple tables can be joined at the same time, and at high speed, thanks to the fast merge-join operation over the unique array-indexed tables.
  • Configurable memory usage: memory usage is easy to configure for fast data loading and table joins in environments where DRAM resources can be leveraged.
  • More client bindings: in addition to the C binding, Java and JDBC client APIs are provided. Any Java application can call the native ArrayDB Java API or ArrayDB JDBC to query the fast ArrayDB server (see the sketch after this list).
  • Semi-structured data support: keys in a table have a schema, but the value fields in a table are schema-less. This feature allows flexible storage of unstructured data as well as fast lookup of key data.
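
As an illustration of the JDBC binding, here is a minimal Scala sketch that queries ArrayDB through the standard java.sql API. The URL, credentials, and table name reuse the example values from the R and Spark sections above, and the jaguar-jdbc jar is assumed to be on the classpath:

import java.sql.DriverManager

object ArrayDBJdbcExample {
  def main(args: Array[String]) {
    // connection values reuse the examples from earlier sections
    val conn = DriverManager.getConnection("jdbc:jaguar://localhost:8888/test", "test", "test")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("select count(*) from mytable;")
    while ( {
      println(rs.getString(1))   // print the count returned by the server
    }
    rs.close()
    stmt.close()
    conn.close()
  }
}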

We will continue to improve our product and make ArrayDB scalable. Future work will include integrating our fast indexing engine with big data platforms to offer a spectrum of computing functionality.

ArrayDB Beta Version Has Been Released

Today we proudly announce that the beta version of the ArrayDB analytical database has been released to the general public. In the past few months, Exeray has developed ArrayDB, a new cloud-based enterprise database for analytical processing of big data. By leveraging our breakthrough technological invention (ArrayIndexing™), which can speed up the indexing of data by orders of magnitude, we provide customers with an exceptionally fast query engine for gaining deep insights into data. Our software is valuable as both a transaction engine and an analytical engine, and we believe ArrayDB is clearly superior to its competitors in big data analytics. For the first time, a database that employs revolutionary array-based indexing technology has been developed and released. The ArrayDB product package can be downloaded from our GitHub repository: