An Overview of Spark DataFrames with Scala
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Who am I?
● Himanshu Gupta (@himanshug735)
● Spark Certified Developer
● Apache Spark third-party package contributor - spark-streaming-gnip
● Sr. Software Consultant at Knoldus Software LLP
Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg
Agenda
● What is Spark?
● What is a DataFrame?
● Why do we need DataFrames?
● A brief example
● Demo
Apache Spark
● Distributed compute engine for large-scale data processing.
● Up to 100x faster than Hadoop MapReduce for in-memory workloads.
● Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4).
● Combines SQL, streaming and complex analytics.
● Runs on Hadoop, Mesos, or in the cloud.
Img src - http://spark.apache.org/
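For the Scala snippets that follow, here is a minimal bootstrap sketch using the Spark 1.x API of these slides; the app name and local master URL are placeholder assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Local mode for experimentation; use a real master URL on a cluster.
val conf = new SparkConf().setAppName("dataframes-overview").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)   // entry point for DataFrames

These two values, sparkContext and sqlContext, are the ones assumed by the later snippets.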
Spark DataFrames

● Distributed collection of data organized into named columns (formerly known as SchemaRDD).
● Domain-specific language for common tasks:
  ➢ UDFs
  ➢ Sampling
  ➢ Project, filter, aggregate, join, …
  ➢ Metadata
● Available in Python, Scala, Java and R (in Spark 1.4).
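A short sketch of that DSL in Scala, assuming a hypothetical people.json input with name and age columns:

import org.apache.spark.sql.functions.{col, udf}

val people = sqlContext.read.json("people.json")            // hypothetical input file
people.printSchema()                                        // metadata: column names and types

val toUpper = udf((s: String) => s.toUpperCase)             // user-defined function
people.select(toUpper(col("name")).as("name"), col("age"))  // project
  .filter(col("age") > 21)                                  // filter
  .groupBy("age").count()                                   // aggregate
  .sample(withReplacement = false, fraction = 0.5)          // sampling
  .show()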
Google Trends for DataFrames

Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30
Speed of Spark DataFrames!

Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
RDD API vs DataFrame API
RDD API:

val linesRDD = sparkContext.textFile("file.txt")
val wordCountRDD = linesRDD.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Average count per word: pair each count with a 1, then sum both parts.
val (word, (sum, n)) = wordCountRDD
  .map { case (word, count) => (word, (count, 1)) }
  .reduce { case ((_, (count1, n1)), (_, (count2, n2))) =>
    ("", (count1 + count2, n1 + n2))
  }
val average = sum.toDouble / n

DataFrame API:

import sqlContext.implicits._                  // enables .toDF on RDDs
import org.apache.spark.sql.functions.avg

val linesDF = sparkContext.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
val average = wordCountDF.agg(avg("count"))
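Both versions compute the average word frequency, but the DataFrame version states what to compute rather than how, which is what lets the Catalyst optimizer plan the execution.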
Catalyst Optimizer
Catalyst's optimization and execution pipeline is shared by DataFrames and Spark SQL.
Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Analysis

● Begins with a relation to be computed.
● Builds an "Unresolved Logical Plan".
● Applies Catalyst rules to resolve attributes and types.

DataFrame → Unresolved Logical Plan → Catalyst Rules → Logical Plan
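For intuition, a small sketch: analysis is where column references are resolved against the schema, so a typo fails fast with an AnalysisException (the input file is the same sample used in the later example; the column name is deliberately bogus):

import org.apache.spark.sql.AnalysisException

val df = sqlContext.read.json("tweets.json")
try {
  df.select("no_such_column").show()           // unresolved attribute reference
} catch {
  case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
}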
Logical Optimizations

● Applies standard rule-based optimizations to the logical plan.
● Includes operations like:
  ➢ Constant folding
  ➢ Projection pruning
  ➢ Predicate pushdown
  ➢ Boolean expression simplification

object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits in a Long */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = {
    plan transformAllExpressions {
      case Sum(e @ DecimalType.Expression(prec, scale)) if prec + 10 <= MAX_LONG_DIGITS =>
        MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
    }
  }
}

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
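A quick way to see one of these rules at work; a sketch assuming a hypothetical DataFrame df with an integer column n:

import org.apache.spark.sql.functions.{col, lit}

// 1 + 2 is folded into the literal 3 at planning time, before any row is read.
df.filter(col("n") > lit(1) + lit(2)).explain(true)
// The optimized logical plan shows the folded comparison, e.g. Filter (n#0 > 3).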
Physical Planning

● Generates one or more physical plans from the logical plan.
● Selects a plan using a cost model.

Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan
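One visible effect of cost-based physical planning is join selection: below a size threshold, Spark can broadcast the smaller side instead of shuffling both. A sketch with hypothetical DataFrames largeDF and smallDF sharing an "id" column:

// Tables smaller than this threshold (in bytes) are candidates for broadcasting.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

largeDF.join(smallDF, "id").explain()
// The physical plan may show a BroadcastHashJoin instead of a shuffle-based join.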
Code Generation

● Generates Java bytecode for speed of execution.
● Uses Scala quasiquotes.
● Quasiquotes allow programmatic construction of abstract syntax trees (ASTs).
def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
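Quasiquotes come from scala-reflect; a minimal standalone sketch of building an AST programmatically, assuming Scala 2.11 (independent of Spark):

import scala.reflect.runtime.universe._

val one  = q"1"                 // AST for the literal 1
val tree = q"$one + 2"          // splice it into a larger expression AST
println(showCode(tree))         // prints: 1 + 2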
Example
val tweets = sqlContext.read.json("tweets.json")
tweets
  .select("tweetId", "username", "timestamp")
  .filter("timestamp > 0")
  .explain(extended = true)
== Parsed Logical Plan ==
'Filter ('timestamp > 0)
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Analyzed Logical Plan ==
tweetId: bigint, username: string, timestamp: bigint
Filter (timestamp#14L > cast(0 as bigint))
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Optimized Logical Plan ==
Project [tweetId#15L,username#16,timestamp#14L]
 Filter (timestamp#14L > 0)
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Physical Plan ==
Filter (timestamp#14L > 0)
 Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]
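Note how optimization pushed the Filter below the Project (predicate pushdown), and how the physical plan reads only the three needed columns straight from the JSON scan (projection pruning).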
Example (contd.)
Plan trees for the example (diagram):

Logical Plan:            filter → project → tweets
Optimized Logical Plan:  project → filter → tweets
Physical Plan:           filter → Scan (tweets)
Download Code
https://github.com/knoldus/spark-dataframes-meetup
References
http://spark.apache.org/
Spark Summit EU 2015
Deep Dive into Spark SQL’s Catalyst Optimizer
Spark SQL: Relational Data Processing in Spark
Spark SQL and DataFrame Programming Guide
Introducing DataFrames in Spark for Large Scale Data Science
Beyond SQL: Speeding up Spark with DataFrames
Presenter:[email protected]
@himanshug735
Presenter:[email protected]
@himanshug735
Organizer:@Knolspeak
http://www.knoldus.comhttp://blog.knoldus.com
Organizer:@Knolspeak
http://www.knoldus.comhttp://blog.knoldus.com
Thanks