Spark SQL: Answers to the top 5 most pressing questions
Speaker: Venkat Nannapaneni, June 9, 2016
About Me - Venkat Nannapaneni
▪ MetiStream Sr. Big Data Engineer. Consulted at Monsanto, State Farm, and Premiere Health to name a few.
▪ Put Spark into production at 3 customers. Databricks Certified Spark Developer and Sun Certified Java Programmer.
▪ Just moved from Austin, TX. Would love tips on good eats and places to see!
MetiStream
▪ Founded in 2014
▪ Real-time Streaming
▪ Open Source
▪ Partnered with IBM to educate more than 1 million data scientists and data engineers
▪ Certified Spark Systems Integrator and Trainer
▪ Hottest Startup 2016 Nominee
▪ SMWOB
▪ 1900+ members
▪ 400+ members
▪ DataStart Awardee (funded by the National Science Foundation)
What is Spark SQL?
Spark SQL is a Spark module for structured data processing. Spark SQL provides Spark with more information about the structure of both the data and the computation being performed.
Capture truth in data
Spark SQL: Answers to the top 5 most pressing questions.
We take the top 5 most hotly debated and discussed Spark SQL questions and walk you through our recommended answers.
How did we come up with our Top 5?
▪ Stackoverflow top voted and viewed questions
▪ Articles & YouTube
▪ Personal experience
Top 5 most pressing questions
1. How do you update schema?
2. How do you define partitioning of a Spark DataFrame?
3. Should I use RDD, DataFrames, or Datasets? (choosing the right API)
4. What is the difference between Apache Spark SQLContext vs. HiveContext?
5. Should I use Spark SQL or Impala?
1. How do you update schema?
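The demo slides for this question are screenshots that did not survive the transcript. As a stand-in, here is a minimal sketch of two common ways to "update" a DataFrame schema in the Spark 1.6 era; the input path `events.json` and the column names are hypothetical, and `sc` is an existing SparkContext.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
val df = sqlContext.read.json("events.json") // hypothetical input

// 1) Cast a column: schemas are immutable, so withColumn returns a
//    new DataFrame with the updated column type.
val casted = df.withColumn("price", df("price").cast(DoubleType))

// 2) Rename a column.
val renamed = casted.withColumnRenamed("price", "unit_price")

// 3) Re-apply an explicit schema by rebuilding the DataFrame from its
//    underlying RDD (column order must match the Row layout).
val schema = StructType(Seq(
  StructField("product", StringType, nullable = true),
  StructField("unit_price", DoubleType, nullable = true)))
val reSchemad = sqlContext.createDataFrame(renamed.rdd, schema)
reSchemad.printSchema()
```

Each step returns a new DataFrame; the original is never mutated in place.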
Zeppelin
▪ https://zeppelin.incubator.apache.org/
▪ Apache Incubator Project
▪ Started by NFLabs
▪ Supports Spark
▪ Integrated Spark SQL support & visualizations
▪ Markdown
▪ BASH Shell
▪ Python, R
Using Zeppelin
▪ Default interpreter is Scala with a pre-bound SparkContext
  – sc: SparkContext
  – sqlContext: SQLContext (HiveContext)
▪ %sql – Inline SQL queries
▪ %sh – Inline Bash
▪ %md – Markdown
▪ Execute panel (shortcut: Shift+Enter)
▪ Toggle output display
▪ Panel settings
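To make the interpreter list above concrete, here is a sketch of a typical two-paragraph notebook flow, assuming the Spark 1.6-era API; the file path and table name are hypothetical.

```scala
// Scala paragraph: Zeppelin pre-binds sc (SparkContext) and sqlContext.
val sales = sqlContext.read.json("/tmp/sales.json") // hypothetical path
sales.registerTempTable("sales")                    // Spark 1.6 API

// A following paragraph can then query the temp table via the SQL
// interpreter, and Zeppelin renders the result as a table or chart:
//   %sql
//   SELECT product, SUM(revenue) AS total FROM sales GROUP BY product
```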
2. How do you define partitioning of a Spark DataFrame?
▪ Repartition is an expensive process, but has its benefits.
▪ Better to read partitioned data to avoid repartitioning at later stages.
  – HDFS: dfs.block.size
  – Cassandra: input.split.size_in_mb
  – Pushing predicates
▪ Repartition
  – Spark 1.6+: repartition(10, df.col("product"))
  – Spark < 1.6.0: repartition(numPartitions: Int), or create a DataFrame from a custom pre-partitioned RDD
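The repartition variants above can be sketched as follows; this is not from the slides, and the parquet path and `product` column are hypothetical.

```scala
import org.apache.spark.HashPartitioner

val df = sqlContext.read.parquet("/data/sales") // hypothetical path

// Spark 1.6+: hash-partition by a column so equal keys co-locate,
// which helps later joins/aggregations on that column.
val byProduct = df.repartition(10, df.col("product"))

// Spark < 1.6: DataFrame.repartition only takes a partition count...
val evenly = df.repartition(10)

// ...so a column-based layout requires a custom pre-partitioned RDD,
// then rebuilding the DataFrame from it with the same schema.
val prePartitioned = df.rdd
  .keyBy(row => row.getAs[String]("product"))
  .partitionBy(new HashPartitioner(10))
  .values
val rebuilt = sqlContext.createDataFrame(prePartitioned, df.schema)
```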
3. Should I use RDD, DataFrames, or Datasets?
Sample Revenue Data
RDD
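The code on this slide is an image not preserved in the transcript; a sketch of the RDD version, using hypothetical sample revenue records:

```scala
// RDD API: typed objects and functional transformations, but no schema
// visible to the engine, so no Catalyst optimization.
case class Sale(product: String, revenue: Double)

val sales = sc.parallelize(Seq(
  Sale("widget", 10.0), Sale("gadget", 25.0), Sale("widget", 5.0)))

val revenueByProduct = sales
  .map(s => (s.product, s.revenue))
  .reduceByKey(_ + _)     // total revenue per product
revenueByProduct.collect()
```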
DataFrame
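Again the slide's code is lost; the same hypothetical revenue aggregation expressed with the DataFrame API looks roughly like this:

```scala
// DataFrame API: named columns and a schema the Catalyst optimizer can
// use; the aggregation is declarative rather than a hand-written fold.
import org.apache.spark.sql.functions.sum

case class Sale(product: String, revenue: Double)

val salesDf = sqlContext.createDataFrame(Seq(
  Sale("widget", 10.0), Sale("gadget", 25.0), Sale("widget", 5.0)))

val byProduct = salesDf.groupBy("product").agg(sum("revenue").as("total"))
byProduct.show()
```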
Dataset
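A sketch of the Dataset version, assuming the experimental Spark 1.6 Dataset API (typed `groupBy` plus `mapGroups`); same hypothetical revenue data:

```scala
// Dataset API (experimental in Spark 1.6): DataFrame-style execution
// with compile-time types via encoders.
import sqlContext.implicits._

case class Sale(product: String, revenue: Double)

val ds = Seq(
  Sale("widget", 10.0), Sale("gadget", 25.0), Sale("widget", 5.0)).toDS()

// Typed grouping: the lambdas are checked at compile time.
val byProduct = ds.groupBy(_.product).mapGroups {
  (product, sales) => (product, sales.map(_.revenue).sum)
}
```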
4. What is the difference between Apache Spark SQLContext vs. HiveContext?
Why HiveContext?
▪ HiveContext is a superset of SQLContext.
▪ Ability to write queries using the more complete HiveQL parser, access Hive UDFs, and read data from Hive tables.
▪ Native window functions require HiveContext on Spark < 1.5.

Other Considerations
▪ HiveContext comes with large dependencies.
▪ SQLContext is being brought up to feature parity with HiveContext.
▪ spark.sql.dialect: hiveql or sql
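A minimal sketch of the two points above; the `sales` table is hypothetical, and `sc` is an existing SparkContext.

```scala
// HiveContext unlocks HiveQL-only features, e.g. window functions
// on Spark < 1.5, plus Hive UDFs and Hive tables.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

hiveContext.sql("""
  SELECT product, revenue,
         rank() OVER (ORDER BY revenue DESC) AS revenue_rank
  FROM sales
""")

// The parser dialect is also switchable per context.
hiveContext.setConf("spark.sql.dialect", "hiveql") // or "sql"
```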
5. Should I use Spark SQL or Impala?
AtScale Blog - Posted by Trystan Leftwich on Feb 24, 2016
http://blog.atscale.com/how-different-sql-on-hadoop-engines-satisfy-bi-workloads
http://www.slideshare.net/cloudera/hive-impala-and-spark-oh-my-sqlonhadoop-in-cloudera-55
What does Cloudera say?
http://www.slideshare.net/cloudera/hive-impala-and-spark-oh-my-sqlonhadoop-in-cloudera-55
http://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
In Summary
It depends…
▪ No single SQL-on-Hadoop engine is best for ALL queries.
▪ BI and SQL analytics at interactive latencies.
▪ Impala scales with concurrency better than Hive and Spark.
▪ Broader procedural applications, e.g. ETL.