Spark SQL: Answers to the top 5 most pressing questions
Speaker: Venkat Nannapaneni, June 9, 2016
About Me - Venkat Nannapaneni
▪ MetiStream Sr. Big Data Engineer. Consulted at Monsanto, State Farm, and Premiere Health to name a few.
▪ Put Spark into production at 3 customers. Databricks Certified Spark Developer and Sun Certified Java Programmer.
▪ Just moved from Austin, TX. Would love tips on good eats and places to see!
MetiStream
▪ Founded in 2014
▪ Real-time Streaming
▪ Open Source
▪ Partnered with IBM to educate more than 1 million data scientists and data engineers
▪ Certified Spark Systems Integrator and Trainer
▪ Hottest Startup 2016 Nominee
▪ SMWOB
▪ 1900+ members
▪ 400+ members
▪ DataStart Awardee (funded by the National Science Foundation)
What is Spark SQL?
Spark SQL is a Spark module for structured data processing. Spark SQL provides Spark with more information about the structure of both the data and the computation being performed.
Capture truth in data
Spark SQL: Answers to the top 5 most pressing questions.
We take the top 5 most hotly debated and discussed Spark SQL questions and walk you through our recommended answers.
How did we come up with our Top 5?
▪ Stackoverflow top voted and viewed questions
▪ Articles & YouTube
▪ Personal experience
Top 5 most pressing questions
1. How do you update schema?
2. How do you define partitioning of a Spark DataFrame?
3. Should I use RDD, DataFrames, or Datasets? (choosing the right API)
4. What is the difference between Apache Spark SQLContext vs. HiveContext?
5. Should I use Spark SQL or Impala?
1. How do you update schema?
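The demo slides for this question are screenshots that did not survive the transcript. As a stand-in, here is a minimal sketch of two common ways to "update" a DataFrame schema in the Spark 1.6 era; the input path `events.json` and the column names are hypothetical, and `sc` is an existing SparkContext.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
val df = sqlContext.read.json("events.json") // hypothetical input

// 1) Cast a column: schemas are immutable, so withColumn returns a
//    new DataFrame with the updated column type.
val casted = df.withColumn("price", df("price").cast(DoubleType))

// 2) Rename a column.
val renamed = casted.withColumnRenamed("price", "unit_price")

// 3) Re-apply an explicit schema by rebuilding the DataFrame from its
//    underlying RDD (column order must match the Row layout).
val schema = StructType(Seq(
  StructField("product", StringType, nullable = true),
  StructField("unit_price", DoubleType, nullable = true)))
val reSchemad = sqlContext.createDataFrame(renamed.rdd, schema)
reSchemad.printSchema()
```

Each step returns a new DataFrame; the original is never mutated in place.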
Zeppelin
▪ https://zeppelin.incubator.apache.org/
▪ Apache Incubator Project
▪ Started by NFLabs
▪ Supports Spark
▪ Integrated Spark SQL support & visualizations
▪ Markdown
▪ BASH Shell
▪ Python, R
Using Zeppelin
▪ Default interpreter is Scala with a pre-bound SparkContext
  – sc: SparkContext
  – sqlContext: SQLContext (HiveContext)
▪ %sql – Inline SQL queries
▪ %sh – Inline Bash
▪ %md – Markdown
▪ Execute panel (shortcut: Shift+Enter)
▪ Toggle output display
▪ Panel settings
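To make the interpreter list above concrete, here is a sketch of a typical two-paragraph notebook flow, assuming the Spark 1.6-era API; the file path and table name are hypothetical.

```scala
// Scala paragraph: Zeppelin pre-binds sc (SparkContext) and sqlContext.
val sales = sqlContext.read.json("/tmp/sales.json") // hypothetical path
sales.registerTempTable("sales")                    // Spark 1.6 API

// A following paragraph can then query the temp table via the SQL
// interpreter, and Zeppelin renders the result as a table or chart:
//   %sql
//   SELECT product, SUM(revenue) AS total FROM sales GROUP BY product
```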
2. How do you define partitioning of a Spark DataFrame?
▪ Repartition is an expensive process, but has its benefits.
▪ Better to read partitioned data to avoid repartitioning at later stages.
  – HDFS: dfs.block.size
  – Cassandra: input.split.size_in_mb
  – Pushing predicates
▪ Repartition
  – Spark 1.6+: repartition(10, df.col("product"))
  – Spark < 1.6.0: repartition(numPartitions: Int), or create a DataFrame from a custom pre-partitioned RDD
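The repartition variants above can be sketched as follows; this is not from the slides, and the parquet path and `product` column are hypothetical.

```scala
import org.apache.spark.HashPartitioner

val df = sqlContext.read.parquet("/data/sales") // hypothetical path

// Spark 1.6+: hash-partition by a column so equal keys co-locate,
// which helps later joins/aggregations on that column.
val byProduct = df.repartition(10, df.col("product"))

// Spark < 1.6: DataFrame.repartition only takes a partition count...
val evenly = df.repartition(10)

// ...so a column-based layout requires a custom pre-partitioned RDD,
// then rebuilding the DataFrame from it with the same schema.
val prePartitioned = df.rdd
  .keyBy(row => row.getAs[String]("product"))
  .partitionBy(new HashPartitioner(10))
  .values
val rebuilt = sqlContext.createDataFrame(prePartitioned, df.schema)
```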
3. Should I use RDD, DataFrames, or Datasets?
Sample Revenue Data
RDD
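The code on this slide is an image not preserved in the transcript; a sketch of the RDD version, using hypothetical sample revenue records:

```scala
// RDD API: typed objects and functional transformations, but no schema
// visible to the engine, so no Catalyst optimization.
case class Sale(product: String, revenue: Double)

val sales = sc.parallelize(Seq(
  Sale("widget", 10.0), Sale("gadget", 25.0), Sale("widget", 5.0)))

val revenueByProduct = sales
  .map(s => (s.product, s.revenue))
  .reduceByKey(_ + _)     // total revenue per product
revenueByProduct.collect()
```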
DataFrame
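Again the slide's code is lost; the same hypothetical revenue aggregation expressed with the DataFrame API looks roughly like this:

```scala
// DataFrame API: named columns and a schema the Catalyst optimizer can
// use; the aggregation is declarative rather than a hand-written fold.
import org.apache.spark.sql.functions.sum

case class Sale(product: String, revenue: Double)

val salesDf = sqlContext.createDataFrame(Seq(
  Sale("widget", 10.0), Sale("gadget", 25.0), Sale("widget", 5.0)))

val byProduct = salesDf.groupBy("product").agg(sum("revenue").as("total"))
byProduct.show()
```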
Dataset
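A sketch of the Dataset version, assuming the experimental Spark 1.6 Dataset API (typed `groupBy` plus `mapGroups`); same hypothetical revenue data:

```scala
// Dataset API (experimental in Spark 1.6): DataFrame-style execution
// with compile-time types via encoders.
import sqlContext.implicits._

case class Sale(product: String, revenue: Double)

val ds = Seq(
  Sale("widget", 10.0), Sale("gadget", 25.0), Sale("widget", 5.0)).toDS()

// Typed grouping: the lambdas are checked at compile time.
val byProduct = ds.groupBy(_.product).mapGroups {
  (product, sales) => (product, sales.map(_.revenue).sum)
}
```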
4. What is the difference between Apache Spark SQLContext vs. HiveContext?
Why HiveContext?
▪ HiveContext is a superset of SQLContext.
▪ Ability to write queries using the more complete HiveQL parser, access Hive UDFs, and read data from Hive tables.
▪ Native window functions require HiveContext on Spark < 1.5.

Other Considerations
▪ HiveContext comes with large dependencies.
▪ SQLContext is being brought up to feature parity with HiveContext.
▪ spark.sql.dialect: hiveql or sql
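A minimal sketch of the two points above; the `sales` table is hypothetical, and `sc` is an existing SparkContext.

```scala
// HiveContext unlocks HiveQL-only features, e.g. window functions
// on Spark < 1.5, plus Hive UDFs and Hive tables.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

hiveContext.sql("""
  SELECT product, revenue,
         rank() OVER (ORDER BY revenue DESC) AS revenue_rank
  FROM sales
""")

// The parser dialect is also switchable per context.
hiveContext.setConf("spark.sql.dialect", "hiveql") // or "sql"
```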
5. Should I use Spark SQL or Impala?
AtScale Blog - Posted by Trystan Leftwich on Feb 24, 2016
http://blog.atscale.com/how-different-sql-on-hadoop-engines-satisfy-bi-workloads
http://www.slideshare.net/cloudera/hive-impala-and-spark-oh-my-sqlonhadoop-in-cloudera-55
What does Cloudera say?
http://www.slideshare.net/cloudera/hive-impala-and-spark-oh-my-sqlonhadoop-in-cloudera-55
http://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
In Summary
It depends…
▪ No single SQL-on-Hadoop engine is best for ALL queries.
▪ BI and SQL analytics at interactive latencies.
▪ Impala scales with concurrency better than Hive and Spark.
▪ Broader procedural applications, e.g. ETL.