Spark 2
Alexey Zinovyev, Java/Big Data Trainer at EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With EPAM since 2015
Contacts
E-mail : [email protected]
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Sprk Dvlprs! Let’s start!
< SPARK 2.0
Modern Java in 2016
Big Data in 2014

Big Data in 2017

Machine Learning EVERYWHERE

Machine Learning vs Traditional Programming

Something is wrong with HADOOP

Hadoop is not SEXY

Whaaaat?

MapReduce Job Writing

MR code

Hadoop Developers Right Now

Iterative Calculations
10x – 100x

MapReduce vs Spark

SPARK INTRO
Loading

// from a local collection
val localData = Seq(5, 7, 1, 12, 10, 25)
val ourFirstRDD = sc.parallelize(localData)

// from a file
val textFile = sc.textFile("hdfs://...")
Word Count

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

SQL

val hive = new HiveContext(sc)

hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")

val results = hive.hql("FROM src SELECT key, value").collect()
RDD Factory
DStream

val conf = new SparkConf().setMaster("local[2]")
  .setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

ssc.start()
ssc.awaitTermination()
It’s very easy to understand Graph algorithms
def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)
             (vprog: (VertexID, VD, A) => VD,
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
              mergeMsg: (A, A) => A): Graph[VD, ED] = { <Your Pregel Computations> }
SPARK 2.0 DISCUSSION

Spark Family

Case #0: How to avoid DStreams and their RDD-like API?

Continuous Applications

Continuous Application cases

• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning

Write Batches

Catch Streaming

The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.
Batch

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")

Real Time

// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")

// Transform with the DataFrame API and save the stream
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://out/checkpoint") // illustrative path; required for file sinks
  .start("s3://out")
Unbounded Table

WordCount from Socket

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

Don’t forget to start the Streaming query
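The slide stops before the sink, so here is a minimal sketch (not from the original deck) of what “start the Streaming query” means: attach a sink and call start(), otherwise nothing runs. The console sink is only illustrative.

val query = wordCounts.writeStream
  .outputMode("complete")   // keep the full running counts
  .format("console")        // any supported sink works the same way
  .start()

query.awaitTermination()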
WordCount with Structured Streaming
Structured Streaming provides …

• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming

Structured Streaming provides (in dreams) …

• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming
Let’s UNION streaming and static DataSets

org.apache.spark.sql.AnalysisException:
Union between streaming and batch DataFrames/Datasets is not supported;

Go to UnsupportedOperationChecker.scala and check your operation
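For reference, a minimal sketch (not from the slides, with hypothetical paths and schema) of the kind of code that trips this check:

import org.apache.spark.sql.types.StructType

val schema = new StructType().add("user", "string").add("url", "string")

val staticLogs    = spark.read.schema(schema).json("/data/logs-archive")
val streamingLogs = spark.readStream.schema(schema).json("/data/logs-incoming")

// Throws AnalysisException: Union between streaming and batch
// DataFrames/Datasets is not supported
val allLogs = streamingLogs.union(staticLogs)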
Case #1: We still have to think about optimization in RDD terms

Single-threaded collection

No perf issues, right?

The main concept

more partitions = more parallelism
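A minimal sketch (not from the slides; the numbers are illustrative) of how the partition count caps parallelism:

// Local collection: at most one task per partition (slice) runs in parallel
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

println(rdd.getNumPartitions)   // 8

// Too few partitions? Repartition (triggers a shuffle)
val wider = rdd.repartition(64)

// Too many tiny partitions? Coalesce without a full shuffle
val narrower = wider.coalesce(16)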
Do it in parallel

I’d like NARROW

Map, filter, filter

GroupByKey, join
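As an illustration (not from the slides): map and filter are narrow transformations that stay inside a partition, while groupByKey and join are wide and shuffle data across the cluster; where possible, reduceByKey combines locally before the shuffle.

val pairs = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))        // narrow
  .filter(_.nonEmpty)           // narrow
  .map(word => (word, 1))       // narrow

// Wide: ships every (word, 1) pair over the network
val grouped = pairs.groupByKey().mapValues(_.sum)

// Wide, but pre-aggregates on each partition first – usually much cheaper
val reduced = pairs.reduceByKey(_ + _)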
Case #2: DataFrames let you mix SQL and Scala functions

History of Spark APIs

rdd.filter(_.age > 21)            // RDD
df.filter("age > 21")             // DataFrame SQL-style
df.filter(df.col("age").gt(21))   // Expression style
dataset.filter(_.age > 21)        // Dataset API
Case #2: DataFrames refer to data attributes by name

DataSet = RDD’s types + DataFrame’s Catalyst

• RDD API
• compile-time type safety
• off-heap storage mechanism
• performance benefits of the Catalyst query optimizer
• Tungsten

Structured APIs in SPARK

Unified API in Spark 2.0

DataFrame = Dataset[Row]

A DataFrame is now simply an untyped Dataset of Row: it still carries a schema, but there is no compile-time type checking.
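A minimal sketch (not from the slides, reusing the User case class defined just below) of moving between the untyped and typed views:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val df: DataFrame     = spark.read.json("/home/tmp/datasets/users.json")  // Dataset[Row]
val ds: Dataset[User] = df.as[User]                                       // typed view
val back: DataFrame   = ds.toDF()                                         // untyped again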
Define a case class, read JSON, filter by field

case class User(email: String, footSize: Long, name: String)

// DataFrame -> DataSet of Users
val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT
Case #3: Spark has too many contexts

Spark Session

• New entry point in Spark for creating Datasets
• Replaces SQLContext, HiveContext and StreamingContext
• The move from SparkContext to SparkSession signals a move away from RDDs
Spark Session

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

val df = sparkSession.read
  .option("header", "true")
  .csv("src/main/resources/names.csv")

df.show()
No, I want to create my lovely RDDs
Where’s parallelize() method?
RDD?

userDS.rdd // IF YOU REALLY WANT
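And parallelize() itself has not gone anywhere; a minimal sketch (not from the slides) of reaching it through the session:

// SparkSession still wraps the good old SparkContext
val sc  = sparkSession.sparkContext
val rdd = sc.parallelize(Seq(5, 7, 1, 12, 10, 25))

// Or skip RDDs and build a Dataset straight from a local collection
import sparkSession.implicits._
val ds = Seq(5, 7, 1, 12, 10, 25).toDS()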
Case #4: Spark uses Java serialization A LOT

Two choices to distribute data across the cluster

• Java serialization
  By default, with ObjectOutputStream
• Kryo serialization
  You should register your classes (it does not rely on java.io.Serializable)
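A minimal sketch (not from the slides) of switching an application to Kryo and registering classes; User is the case class from the earlier slides:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")  // fail fast on unregistered classes
  .registerKryoClasses(Array(classOf[User]))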
The main problem: serialization overhead

Each serialized object contains the class structure as well as the values.

Don’t forget about GC.

Tungsten Compact Encoding

The Encoder concept

Generate bytecode to interact with off-heap data
&
give access to attributes without ser/deser

Encoders
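For context, a minimal sketch (not from the slides) of the encoders that ship with Spark 2.0; SomeOtherClass is a hypothetical non-case class:

import org.apache.spark.sql.{Encoder, Encoders}

class SomeOtherClass(val id: Int)                      // hypothetical non-case class

val userEnc: Encoder[User] = Encoders.product[User]    // case classes (usually resolved via spark.implicits._)
val longEnc: Encoder[Long] = Encoders.scalaLong        // primitives

// Escape hatch for arbitrary classes: stored as opaque binary, so columns are not queryable
val otherEnc: Encoder[SomeOtherClass] = Encoders.kryo[SomeOtherClass]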
No custom encoders
Case #5: Not enough storage levels

Caching in Spark

• A frequently used RDD can be stored in memory
• One method, one shortcut: persist(), cache()
• SparkContext keeps track of cached RDDs
• Serialized or deserialized Java objects
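A minimal sketch (not from the slides) of the two calls with an explicit storage level:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
logs.persist(StorageLevel.MEMORY_AND_DISK_SER)

logs.count()        // the first action materializes the cache
logs.unpersist()    // drop it when you are done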
Full list of options

• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
Developer API to make new Storage Levels

What’s the most popular file format in BigData?

Case #6: External libraries needed to read CSV

Easy to read CSV

val data = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()
data.createOrReplaceTempView("users")

display(data)   // Databricks notebook helper
Case #7: How to measure Spark performance?

You’d measure performance!

How to benchmark Spark

Special Tool from Databricks

Benchmark Tool for SparkSQL
https://github.com/databricks/spark-sql-perf

Spark 2 vs Spark 1.6

Case #8: What’s faster, SQL or the DataSet API?

Job Stages in old Spark

Scheduler Optimizations

Catalyst Optimizer for DataFrames

Unified Logical Plan

DataFrame = Dataset[Row]

Bytecode

DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)
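For orientation, a hedged sketch (not from the slides; file and column names are illustrative) of the kind of query whose physical plan looks like the one above: average price per cut and color, joined back on color, read from CSV.

val diamonds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/datasets/diamonds.csv")

diamonds.groupBy("cut", "color").avg("price")
  .join(diamonds, "color")
  .explain()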
Case #9: Why does explain() show so many Tungsten things?

How to be effective with the CPU

• Runtime code generation
• Exploiting cache locality
• Off-heap memory management

Tungsten’s goal

Push performance closer to the limits of modern hardware

Maybe something UNSAFE?

UnsafeRow format

• A bit set for tracking null values
• Small values are inlined
• Variable-length values are stored as a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
Case #10: Can I influence Memory Management in Spark?

Case #11: Should I tune the GC generations?

Cached Data

During operations

For your needs

For Dark Lord
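A hedged sketch (not from the slides, with illustrative values) of the knobs behind Cases #10 and #11:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Case #10: unified memory management (Spark 1.6+)
  .set("spark.memory.fraction", "0.6")          // share of (heap - 300MB) for execution + storage
  .set("spark.memory.storageFraction", "0.5")   // part of that share protected for cached data
  // Case #11: generation/GC tuning happens at the JVM level
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails")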
IN CONCLUSION

We still don’t have …

• a way to join structured streams with other sources and handle the result
• one unified ML API
• a GraphX rethinking and redesign
• custom encoders
• Datasets everywhere
• integration with something important

Roadmap

• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset as the single DSL for all operations
• GraphFrames + Structured MLlib
• Tungsten: custom encoders
• The RDD-based MLlib API is expected to be removed in Spark 3.0
And we can DO IT!

[Reference architecture diagram: real-time and batch data ingestion from internal, external and social sources; real-time ETL & CEP plus batch ETL with a raw area (all raw data backup is stored here, HDFS → CFS as an option); a scheduler drives the batch side; downstream: real-time and batch data marts, a relations graph with ontology metadata (Titan & KairosDB store data in Cassandra), time-series data, a search index, real-time dashboarding, and events & alarms pushed via Email, SNMP etc.]

First Part

Second Part
Contacts
E-mail : [email protected]
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Any questions?