©2015 Couchbase Inc. 3
Apache Spark… is a fast and general-purpose engine for small- and large-scale data processing …
Components: Spark SQL
Structured through DataFrames
Distributed querying with SQL
How does it work?
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark Benefits
Linearly scalable to 1000+ worker nodes
Simpler to use than Hadoop MR
Only partial recompute on failure
For developers and data scientists
– Machine learning
– R integration
Tight but not mandatory Hadoop integration
– Sources, Sinks
– Scheduler
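The "only partial recompute on failure" point can be illustrated with a toy model of lineage. The sketch below is plain Python and does not use the Spark API; `ToyDataset`, `lose_partition`, and `demo` are hypothetical names invented for this illustration. It assumes each partition can be rebuilt from its immutable source rows plus the recorded chain of functions, so losing one partition means rebuilding only that one.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in (not Spark) for a partitioned dataset that remembers its lineage."""

    def __init__(self, source_partitions: List[List[int]]):
        self.source = source_partitions                 # immutable input, one list per partition
        self.lineage: List[Callable[[int], int]] = []   # functions applied so far, in order
        self.cache: List[Optional[List[int]]] = [None] * len(source_partitions)
        self.recomputed: List[int] = []                 # ids of partitions that were (re)built

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # A transformation only extends the lineage; nothing is computed yet.
        new = ToyDataset(self.source)
        new.lineage = self.lineage + [fn]
        return new

    def _compute(self, pid: int) -> List[int]:
        # Replay the lineage for a single partition.
        rows = self.source[pid]
        for fn in self.lineage:
            rows = [fn(x) for x in rows]
        self.recomputed.append(pid)
        return rows

    def collect(self) -> List[int]:
        # Build only the partitions that are missing from the cache.
        out: List[int] = []
        for pid in range(len(self.source)):
            if self.cache[pid] is None:
                self.cache[pid] = self._compute(pid)
            out.extend(self.cache[pid])
        return out

    def lose_partition(self, pid: int) -> None:
        # Simulate a worker failure that drops one cached partition.
        self.cache[pid] = None


def demo():
    ds = ToyDataset([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * 10).map(lambda x: x + 1)
    first = ds.collect()          # computes all three partitions
    ds.recomputed.clear()
    ds.lose_partition(1)          # "failure" of the worker holding partition 1
    second = ds.collect()         # rebuilds partition 1 only
    return first, second, list(ds.recomputed)
```

Calling `demo()` returns the same rows before and after the simulated failure, and the list of recomputed partition ids contains only the lost partition.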
Spark vs Hadoop
Spark is mainly RAM bound, while Hadoop is mainly HDFS (disk) bound
Fully compatible with Hadoop Input/Output
Easier to develop against thanks to functional composition
Hadoop is certainly more mature, but the Spark ecosystem is growing fast
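To make "functional composition" concrete, here is a minimal word count written as a chain of small transformations. It is plain Python rather than Spark code, and `word_count` and `merge` are hypothetical names used only for this sketch. In Spark the analogous chain would typically be `flatMap`, `map`, and `reduceByKey` on an RDD.

```python
from functools import reduce
from typing import Dict, Iterable, Tuple


def merge(acc: Dict[str, int], pair: Tuple[str, int]) -> Dict[str, int]:
    # Fold one (word, count) pair into the running totals.
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Step 1 (flatMap-like): split every line into lowercase words.
    words = (w.lower() for line in lines for w in line.split())
    # Step 2 (filter + map-like): keep alphabetic words and pair each with 1.
    pairs = ((w, 1) for w in words if w.isalpha())
    # Step 3 (reduce-like): sum the counts per word.
    return reduce(merge, pairs, {})
```

Each step is a small function applied to the output of the previous one, which is the style the slide contrasts with hand-written MapReduce jobs.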
Couchbase in the Spark Landscape
Transparent generation and persistence of
– RDDs
– DataFrames
– DStreams
Spark SQL and N1QL are a natural fit
Linearly scale your data and application layer
Share data between Spark applications
The perfect storage companion for your Spark applications.
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Cluster Communication
[Diagram: six Couchbase Server nodes (Couchbase Server 1 to 6). Each node runs a Cluster Manager plus the Data Service, Index Service, and Query Service, with a Managed Cache and Storage holding shards. Two Spark Workers are shown alongside the cluster.]
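The Cluster Communication slide shows data spread across shards on six servers, with Spark Workers next to them. One way to picture why that layout matters is to group document keys by the node that owns their shard, so each group could be fetched from a single server. The sketch below is an illustration only: the CRC32-modulo hash, the shard count of 1024, the round-robin shard placement, and the node names are all assumptions made for this example, not the actual mapping, which is owned by the Couchbase cluster itself.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_SHARDS = 1024  # assumption for this sketch
# Six servers as in the diagram; the names are hypothetical.
NODES = [f"couchbase-server-{i}" for i in range(1, 7)]


def shard_for(key: str) -> int:
    # Stand-in hash: deterministic, but not Couchbase's real key-to-shard function.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def node_for(shard: int) -> str:
    # Assumed round-robin placement of shards on nodes.
    return NODES[shard % len(NODES)]


def group_keys_by_node(keys: Iterable[str]) -> Dict[str, List[str]]:
    # Bucket keys by owning node so each bucket can be read from one server.
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        groups[node_for(shard_for(key))].append(key)
    return dict(groups)
```

A worker that knows this grouping can issue one batch of reads per server instead of scattering every lookup across the whole cluster.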
Ecosystem Flexibility
[Diagram labels: RDBMS, Streams, Web APIs; Couchbase access paths DCP, KV, N1QL, Views; Batching, Data Archive, OLTP Data.]
Couchbase Connector
Spark Core
– Automatic Cluster and Resource Management
– Creating and Persisting RDDs
– Java APIs in addition to Scala
Spark SQL
– Easy JSON handling and querying
– Tight N1QL Integration
Spark Streaming
– Persisting DStreams
– DCP source (experimental)
Facts
Current Version: 1.0.0-beta
Code: https://github.com/couchbaselabs/couchbase-spark-connector
Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html