Ema Orhian @emaorhian
Jaws - Data Warehouse with Spark SQL
About me
• Big Data analytics / Machine Learning
• 4+ years exp with the Hadoop ecosystem
• 2 years exp with Spark
http://bigdataresearch.io/
• Co-founder of Big Data Research Group
• Provides open source solutions around Big Data analytics
http://atigeo.com/
Agenda
• jaws-spark-sql-rest (Jaws) intro
• Main features
• Architecture
• Scaling
• Resource manager
• Working with Tachyon
• Working with Parquet files
• Configure Spark SQL context
• Demo
• Shared Spark SQL context
• Concurrent query runs
• Query history
• Paged results
• Query editor
Jaws
• Highly scalable and resilient data warehouse explorer
• RESTful alternative to Spark SQL over JDBC, and more
• Support for Spark 0.9.1/Shark through Spark 1.5
• Support for Hive/MapReduce
https://github.com/atigeo/jaws-spark-sql-rest
Main features
• Submit queries concurrently and asynchronously
• Provides persisted logs, query history, and paged results
• Pluggable persistent layer (Cassandra/HDFS)
• Supports load balancing with query cancellation
• Provides a metadata browser
• In-memory Parquet warehouse with Tachyon
• Configuration file to fine tune Spark context
• Pluggable UI
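As a sketch of the asynchronous submission feature: the snippet below builds a query-submission request against Jaws' REST interface. The endpoint path, port, and parameter names (`/jaws/run`, `limited`, `numberOfResults`) are assumptions based on typical jaws-spark-sql-rest usage; check the project README for the exact API of your version.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Jaws endpoint -- verify against the jaws-spark-sql-rest README.
JAWS_RUN_URL = "http://localhost:8181/jaws/run"

def build_run_request(hql, limited=True, num_results=100):
    """Build (but do not send) an asynchronous query-submission request.

    The HiveQL statement goes in the POST body; `limited` controls whether
    the result set is capped and persisted (Cassandra/HDFS) or unlimited
    (HDFS/Tachyon).
    """
    params = urlencode({"limited": str(limited).lower(),
                        "numberOfResults": num_results})
    return Request(f"{JAWS_RUN_URL}?{params}",
                   data=hql.encode("utf-8"),
                   method="POST")

req = build_run_request("SELECT * FROM logs LIMIT 10")
# Actually sending it requires a running Jaws server:
#   from urllib.request import urlopen
#   query_id = urlopen(req).read().decode()
# The returned id is then used to poll logs and fetch paged results.
```

Because submission is asynchronous, the call returns a query id immediately; logs and results are retrieved later, which is what makes concurrent submissions and cancellation possible.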
Jaws architecture
Scaling
• Standalone mode
• Mesos
• YARN
Fine-grained mode
Coarse-grained mode
Canceling a query
Results persistence
• Queries with a limited number of results:
  ‣ Cassandra
  ‣ HDFS
• Queries with an unlimited number of results:
  ‣ HDFS
  ‣ Tachyon
Working with Tachyon
• Persists unlimited results in Tachyon
• Registers tables over Parquet files from Tachyon
Tachyon benefits:
★ in-memory storage system
★ shares data between applications at memory speed
Working with Parquet files
• Register tables on top of Parquet files
Parquet:
★ columnar format
★ nested data structures
★ supports schema evolution
★ efficient compression
• Files stored on HDFS or Tachyon
• Table metadata stored in Cassandra (approach used before Spark 1.3)
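Registering a table on top of Parquet files can be done with a Spark SQL data-source DDL statement (available since Spark 1.2), which could be submitted through Jaws like any other query. The snippet below only builds such a statement; the table name and Tachyon path are made up for illustration.

```python
def parquet_table_ddl(table, path):
    """Spark SQL statement that maps a temporary table onto existing
    Parquet files (on HDFS or Tachyon) without copying the data."""
    return (f"CREATE TEMPORARY TABLE {table} "
            f"USING org.apache.spark.sql.parquet "
            f'OPTIONS (path "{path}")')

# Hypothetical warehouse layout: Parquet files already written to Tachyon.
ddl = parquet_table_ddl(
    "events", "tachyon://master:19998/warehouse/events.parquet")
# Once this DDL runs in the shared SQL context, `events` is queryable
# with plain SQL, and schema evolution/compression are handled by Parquet.
```

Because the SQL context is shared, a table registered this way by one query run is visible to subsequent concurrent queries.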
Configuring Jaws
• Cassandra
• HDFS
• Spray
• Application
• Spark
sparkConfiguration {
  spark-master = "spark://devbox.local:7077"
  # or "mesos://devbox.local:5050" / "yarn-client"
  spark-mesos-coarse = false  # or true
  spark-cores-max = 100
  spark-executor-instances = 10
}
Demo