Date posted: 07-Jan-2017
Category: Software
Uploaded by: knoldus-software-llp
Introduction to Apache Spark 2.0
Himanshu GuptaSr. Software ConsultantKnoldus Software LLP
Agenda
Part 1 (SparkSession)
Part 2 (Structured Streaming)
What is Apache Spark?
● A fast and general engine for large-scale data processing.
● Offers a rich set of APIs and libraries
– In Scala, Java, Python and R.
● The most active Apache Big Data project.
Spark Survey 2015
● Reflected the answers and opinions of over 1,417 respondents from 842 organizations.
● Indicated rapid growth of the Spark community.
● Displayed a positive attitude towards a concise and unified API for Big Data processing.
● Source: https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
Apache Spark 2.0
● Released in July 2016
– In fact, version 2.1.0 is already under development.
● Provides a unified API for SQL, Streaming and Graph operations.
SparkSession
What is SparkSession?
Before Spark 2.0, each API had its own entry point:
SparkContext – for the Core API
StreamingContext – for the Streaming API
SQLContext – for the SQL API
SparkSession unifies all of these behind a single API.
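As a minimal sketch (assuming Spark 2.x on the classpath; the app name and input file are illustrative), the single entry point is created through a builder:

```scala
import org.apache.spark.sql.SparkSession

// One entry point for the SQL, streaming and Dataset APIs in Spark 2.0
val spark = SparkSession
  .builder()
  .appName("IntroToSpark2")
  .master("local[*]")          // local mode, for demonstration only
  .getOrCreate()

// The old SparkContext is still reachable through the session when needed
val sc = spark.sparkContext

// SQL-style reads hang directly off the session (people.json is hypothetical)
val df = spark.read.json("people.json")
df.show()
```

Note that `getOrCreate()` returns an existing session if one is already running, which is what makes it safe to call from multiple places in an application.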
Benefits of Spark 2.0
● Unified DataFrames and Datasets
– DataFrame = Dataset[Row]
● Up to 10x faster than Spark 1.6
– Due to Whole-Stage Code Generation.
● Smarter than Spark Streaming 1.6
– As streaming is structured too.
Img Src: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
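A short sketch of the DataFrame/Dataset unification (assuming a SparkSession named `spark`; the case class and sample rows are illustrative):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._   // brings in encoders for case classes and primitives

case class Person(name: String, age: Long)

// A typed Dataset of case-class rows
val people: Dataset[Person] =
  Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// In Spark 2.0, DataFrame is literally a type alias:
//   type DataFrame = Dataset[Row]
val df: DataFrame = people.toDF()

// So typed and untyped views are two faces of the same data
val adults: Dataset[Person] = people.filter(_.age >= 18)
val names: DataFrame        = df.select("name")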
Why is Spark 2.0 Faster?
The reason is
“Whole-Stage Code Generation”
Example
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
What’s wrong here?
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
For the answer, let’s compare the generated code with hand-written code:
System-Generated vs. Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model vs Hand-Written Code
(Benchmark chart comparing Volcano-model code with hand-written code.)
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Solution
Of course:
Whole-Stage Code Generation
It provides the performance of hand-written code with the functionality of a general-purpose engine.
What is Whole-Stage Code Generation?
● It generates code through the same process as the Volcano Model.
● The difference is the scope:
– Earlier, Spark applied code generation only to expression evaluation (e.g., “1 + a”); now it generates code for the entire query.
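To illustrate the difference, here is a hand-written sketch (not Spark’s actual generated code): the Volcano model evaluates a query through chained iterators with a virtual call per row, whereas whole-stage code generation fuses the operators into one tight loop.

```scala
// Volcano-style: each operator is an iterator; one virtual next() call per row
trait Operator { def next(): Option[Long] }

class Scan(data: Array[Long]) extends Operator {
  private var i = 0
  def next(): Option[Long] =
    if (i < data.length) { i += 1; Some(data(i - 1)) } else None
}

class Filter(child: Operator, p: Long => Boolean) extends Operator {
  def next(): Option[Long] = {
    var row = child.next()
    while (row.exists(r => !p(r))) row = child.next()   // skip non-matching rows
    row
  }
}

// Whole-stage style: the same scan + filter + count fused into a single loop,
// with no virtual dispatch and no per-row Option allocation
def fusedCount(data: Array[Long], p: Long => Boolean): Long = {
  var count = 0L
  var i = 0
  while (i < data.length) {
    if (p(data(i))) count += 1
    i += 1
  }
  count
}
```

The fused loop is what a developer would write by hand, which is exactly the performance target the previous slide describes.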
Spark 1.x vs Spark 2.0
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Demo 1
Agenda
Part 1 (SparkSession)
Part 2 (Structured Streaming)
Questions ??
Streaming Applications
One class of systems offers:
● Pros: consistent, in-order data, no shuffling
● Cons: not scalable, not fault tolerant
The other offers:
● Pros: scalable, fault tolerant
● Cons: inconsistent, out-of-order data, too much shuffling
Continuous Application
Img Src: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
How to Achieve it?
Solution:
Structured Streaming
Structured Streaming guarantees that at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.
How ?
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Conceptually, Structured Streaming treats all the data arriving as an infinite input table.
How?
● The developer defines a query on the input table
– As if it were a static table.
● Results are computed into a Result Table
– Which is then written to an output sink.
● Finally, the developer defines triggers
– To control when the result table is updated.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Incremental Execution
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
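The steps above can be sketched as a streaming word count (assuming a SparkSession named `spark` and a text source on `localhost:9999`; host, port and trigger interval are illustrative, and `ProcessingTime` is the Spark 2.0-era trigger API):

```scala
import org.apache.spark.sql.streaming.ProcessingTime
import spark.implicits._

// 1. The input table: each line arriving on the socket becomes a new row
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// 2. A query on the input table, written as if it were a static table
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// 3. The result table is written to a sink; the trigger controls how often
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()
```

On every trigger, Spark incrementally updates the counts rather than recomputing over all data seen so far — the incremental execution shown above.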
Output Modes
● Append
– Only the new rows appended to the result table since the last trigger are written to external storage.
● Complete
– The entire updated result table is written to external storage.
● Update
– Only the rows updated in the result table since the last trigger are changed in external storage.
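The mode is chosen on the stream writer; a sketch (assuming a streaming aggregate named `wordCounts` from earlier in the talk — the name is illustrative):

```scala
// "complete" re-emits the whole result table on every trigger;
// "append" would emit only rows added since the last trigger.
// (Availability of each mode depends on the query and the Spark version.)
val query = wordCounts.writeStream
  .outputMode("complete")   // or "append" / "update"
  .format("console")
  .start()
```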
Other Benefits
● Easy to use
– As it is simply Spark’s DataFrame/Dataset API.
● Uses Spark’s existing DataFrame/Dataset API
– So we can map, filter and aggregate data just as we do in Spark SQL.
● Joins streams with static data
– A stream can be joined directly with a static DataFrame.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
There are many more...
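A sketch of the stream–static join (assuming a SparkSession named `spark`; the file paths, schema and column names are all illustrative):

```scala
import org.apache.spark.sql.types._

// Static reference data, read once as an ordinary DataFrame
val customers = spark.read.option("header", "true").csv("customers.csv")

// Streaming input: JSON files appearing in a directory
// (a file stream must be given its schema up front)
val eventSchema = new StructType()
  .add("customerId", StringType)
  .add("amount", DoubleType)

val events = spark.readStream.schema(eventSchema).json("events/")

// Join the stream with the static DataFrame, then aggregate —
// the same join/groupBy calls used on static DataFrames in Spark SQL
val enriched = events.join(customers, "customerId")
  .groupBy("customerId")
  .sum("amount")

enriched.writeStream.outputMode("complete").format("console").start()
```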
Requirements
● Input sources must be replayable
– So that recent data can be re-read if the job crashes.
● Output sinks must support transactional updates
– So that the system can make a set of records appear atomically.
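In practice, these guarantees are wired together through checkpointing; a sketch (assuming a streaming Dataset named `events`; the sink and paths are illustrative):

```scala
// Offsets and operator state are recorded in the checkpoint directory,
// so a restarted query replays from where it left off rather than
// reprocessing or dropping data (paths are illustrative)
val query = events.writeStream
  .format("parquet")
  .option("path", "output/")
  .option("checkpointLocation", "checkpoints/")
  .start()
```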
Comparison with Other Engines
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Demo 2
Questions ??
References
● https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
● https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
● https://www.youtube.com/watch?v=ZFBgY0PwUeY
● http://spark.apache.org/docs/latest/
Thank You !!!