Page 1: Speaker: Venkat Nannapaneni questions Spark SQL: Answers to …files.meetup.com/14077672/Spark SQL - Answers to the top... · 2016-06-10 · No single SQL-on-Hadoop engine is best

Spark SQL: Answers to the top 5 most pressing questions

Speaker: Venkat Nannapaneni

June 9, 2016


About Me - Venkat Nannapaneni

▪ MetiStream Sr. Big Data Engineer. Consulted at Monsanto, State Farm, and Premiere Health to name a few.

▪ Put Spark into production at 3 customers. Databricks Certified Spark Developer and Sun Certified Java Programmer.

▪ Just moved from Austin, TX. Would love tips on good eats and places to see!


Founded in 2014

Real-time Streaming

Partnered with IBM to educate more than 1 million data scientists and data engineers

Certified Spark Systems Integrator and Trainer

HOTTEST STARTUP 2016 Nominee

Open Source

SMWOB

1900+ members

400+ members

DataStart Awardee (funded by the National Science Foundation)


What is Spark SQL?

Spark SQL is a Spark module for structured data processing. Spark SQL provides Spark with more information about the structure of both the data and the computation being performed.

Capture truth in data


Spark SQL: Answers to the top 5 most pressing questions.

We take the top 5 most hotly debated and discussed Spark SQL questions and walk you through our recommended answers.

How did we come up with our Top 5?

▪ Stack Overflow top voted and viewed questions
▪ Articles & YouTube
▪ Personal experience


Top 5 most pressing questions

1. How do you update schema?
2. How do you define partitioning of a Spark DataFrame?
3. Should I use RDD, DataFrames, or Datasets? (choosing the right API)
4. What is the difference between Apache Spark SQLContext vs. HiveContext?
5. Should I use Spark SQL or Impala?


1. How do you update schema?


Zeppelin

▪ https://zeppelin.incubator.apache.org/
▪ Apache Incubator Project
▪ Started by NFLabs
▪ Supports Spark
▪ Integrated Spark SQL support & visualizations
▪ Markdown
▪ BASH Shell
▪ Python, R


Using Zeppelin

Default interpreter is Scala w/ SparkContext
▪ sc: SparkContext
▪ sqlContext: SQLContext (HiveContext)

▪ %sql – Inline SQL queries
▪ %sh – Inline Bash
▪ %md – Markdown

Execute panel (Shortcut: shift-enter)

Toggle output display

Panel settings


2. How do you define partitioning of a Spark DataFrame?


▪ Repartition is an expensive process, but has its benefits.
▪ Better to read partitioned data to avoid repartitioning at later stages.
  – HDFS - dfs.block.size
  – Cassandra - input.split.size_in_mb
  – Pushing predicates
▪ Repartition
  – Spark 1.6+
    • repartition(10, df.col("product"))
  – Spark < 1.6.0
    • repartition(numPartitions: Int)
    • Create DataFrame from a custom pre-partitioned RDD
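
Repartitioning by a column, as in the `repartition(10, df.col("product"))` call, places all rows that share a `product` value in the same partition. The plain-Python sketch below is not Spark code; it only illustrates that idea. The `hash_partition` helper, the sample rows, and the choice of CRC32 as the hash are assumptions made for illustration, and Spark uses its own hash function internally.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Group rows into at most num_partitions buckets by hashing row[key].

    CRC32 is used only because it is deterministic across runs;
    it is an assumption for this sketch, not the hash Spark uses.
    """
    partitions = defaultdict(list)
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return dict(partitions)


# Hypothetical sample rows, not taken from the talk.
rows = [
    {"product": "tablet", "revenue": 100},
    {"product": "phone", "revenue": 250},
    {"product": "tablet", "revenue": 75},
    {"product": "laptop", "revenue": 900},
    {"product": "phone", "revenue": 300},
]

partitions = hash_partition(rows, key="product", num_partitions=10)

# Print each non-empty partition index with the products it holds.
for index in sorted(partitions):
    print(index, [r["product"] for r in partitions[index]])
```

Because every row for a given key ends up in one partition, a later aggregation or join on that key can often proceed without moving data again, which is the benefit to weigh against the cost of the repartition itself.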


3. Should I use RDD, DataFrames, or Datasets?


Sample Revenue Data


RDD


DataFrame


Dataset


4. What is the difference between Apache Spark SQLContext vs. HiveContext?


Why HiveContext?

▪ HiveContext is a superset of the SQLContext.
▪ Ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
▪ Native window functions < Spark 1.5


Other Considerations

▪ Comes with large dependencies
▪ Bringing SQLContext up to feature parity with a HiveContext
▪ spark.sql.dialect - hiveql, sql


5. Should I use Spark SQL or Impala?


AtScale Blog - Posted by Trystan Leftwich on Feb 24, 2016

http://blog.atscale.com/how-different-sql-on-hadoop-engines-satisfy-bi-workloads


In Summary

It depends…
▪ No single SQL-on-Hadoop engine is best for ALL queries
▪ BI and SQL analytics at interactive latencies.
▪ Impala scales with concurrency better than Hive and Spark
▪ Broader procedural applications - ETL
