Ema Orhian @emaorhian
Jaws - Data Warehouse with Spark SQL
About me
• Big Data analytics / Machine Learning
• 4+ years exp with the Hadoop ecosystem
• 2 years exp with Spark
http://bigdataresearch.io/
• Co-founder of Big Data Research Group
• Provides open source solutions around Big Data analytics
http://atigeo.com/
Agenda
• jaws-spark-sql-rest (Jaws) intro
• Main features
• Architecture
• Scaling
• Resource manager
• Working with Tachyon
• Working with Parquet files
• Configure Spark SQL context
• Demo
• Shared Spark SQL context
• Concurrent query runs
• Query history
• Paged results
• Query editor
Jaws
• Highly scalable and resilient data warehouse explorer
• RESTful alternative to Spark SQL over JDBC, and more
• Support for Spark 0.9.1/Shark through Spark 1.5
• Support for Hive/MapReduce
https://github.com/atigeo/jaws-spark-sql-rest
Main features
• Submit queries concurrently and asynchronously
• Provides persisted logs, query history, and paged results
• Pluggable persistent layer (Cassandra/HDFS)
• Supports load balancing with query cancellation
• Provides a metadata browser
• In-memory Parquet warehouse with Tachyon
• Configuration file to fine tune Spark context
• Pluggable UI
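As a sketch of the asynchronous submission feature: the snippet below builds a query-submission request against Jaws' REST interface. The endpoint path, port, and parameter names (`/jaws/run`, `limited`, `numberOfResults`) are assumptions based on typical jaws-spark-sql-rest usage; check the project README for the exact API of your version.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Jaws endpoint -- verify against the jaws-spark-sql-rest README.
JAWS_RUN_URL = "http://localhost:8181/jaws/run"

def build_run_request(hql, limited=True, num_results=100):
    """Build (but do not send) an asynchronous query-submission request.

    The HiveQL statement goes in the POST body; `limited` controls whether
    the result set is capped and persisted (Cassandra/HDFS) or unlimited
    (HDFS/Tachyon).
    """
    params = urlencode({"limited": str(limited).lower(),
                        "numberOfResults": num_results})
    return Request(f"{JAWS_RUN_URL}?{params}",
                   data=hql.encode("utf-8"),
                   method="POST")

req = build_run_request("SELECT * FROM logs LIMIT 10")
# Actually sending it requires a running Jaws server:
#   from urllib.request import urlopen
#   query_id = urlopen(req).read().decode()
# The returned id is then used to poll logs and fetch paged results.
```

Because submission is asynchronous, the call returns a query id immediately; logs and results are retrieved later, which is what makes concurrent submissions and cancellation possible.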
Jaws architecture
Scaling
• Standalone mode
• Mesos
• YARN
Fine-grained mode
Coarse-grained mode
Canceling a query
Results persistence
• Queries with a limited number of results:
  ‣ Cassandra
  ‣ HDFS
• Queries with an unlimited number of results:
  ‣ HDFS
  ‣ Tachyon
Working with Tachyon
• Persists unlimited results in Tachyon
• Registers tables over Parquet files from Tachyon
Tachyon benefits:
★ in-memory storage system
★ shares data between applications at memory speed
Working with Parquet files
• Register tables on top of Parquet files
Parquet:
★ columnar format
★ nested data structures
★ supports schema evolution
★ efficient compression
• Files stored on HDFS or Tachyon
• Table metadata stored in Cassandra (approach used before Spark 1.3)
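Registering a table on top of Parquet files can be done with a Spark SQL data-source DDL statement (available since Spark 1.2), which could be submitted through Jaws like any other query. The snippet below only builds such a statement; the table name and Tachyon path are made up for illustration.

```python
def parquet_table_ddl(table, path):
    """Spark SQL statement that maps a temporary table onto existing
    Parquet files (on HDFS or Tachyon) without copying the data."""
    return (f"CREATE TEMPORARY TABLE {table} "
            f"USING org.apache.spark.sql.parquet "
            f'OPTIONS (path "{path}")')

# Hypothetical warehouse layout: Parquet files already written to Tachyon.
ddl = parquet_table_ddl(
    "events", "tachyon://master:19998/warehouse/events.parquet")
# Once this DDL runs in the shared SQL context, `events` is queryable
# with plain SQL, and schema evolution/compression are handled by Parquet.
```

Because the SQL context is shared, a table registered this way by one query run is visible to subsequent concurrent queries.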
Configuring Jaws
• Cassandra
• HDFS
• Spray
• Application
• Spark
sparkConfiguration {
  spark-master = "spark://devbox.local:7077"
  # or "mesos://devbox.local:5050" / "yarn-client"
  spark-mesos-coarse = false  # or true
  spark-cores-max = 100
  spark-executor-instances = 10
}
Demo