Spark Streaming & Spark SQL
Yousun Jeong [email protected]
History - Spark
Developed in 2009 at UC Berkeley AMPLab and open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations.
"Organizations that are looking at big data challenges, including collection, ETL, storage, exploration and analytics, should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis."
Gartner, Advanced Analytics and Data Science (2014)
History - Spark
Some key points about Spark:
- handles batch, interactive, and real-time workloads within a single framework
- native integration with Java, Python, and Scala
- programming at a higher level of abstraction
- multi-step Directed Acyclic Graphs (DAGs): a job can have many stages, compared to just the Map and Reduce stages in Hadoop (see the sketch below)
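As a minimal sketch of such a multi-stage DAG, assuming an existing SparkContext sc (the file paths and field layout are illustrative):

// Each transformation below adds a node to the DAG; Spark only
// executes the whole graph when an action (saveAsTextFile) runs.
val logs   = sc.textFile("hdfs://logs/access.log")        // read input
val errors = logs.filter(_.contains("ERROR"))             // same stage: no shuffle
val byHost = errors.map(line => (line.split(" ")(0), 1))  // key by host field
val counts = byHost.reduceByKey(_ + _)                    // shuffle starts a new stage
counts.saveAsTextFile("hdfs://logs/error-counts")         // action triggers the DAG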
Data Sharing in MR
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Spark
Benchmark Test
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
RDD
Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
There are currently two types:
- parallelized collections: take an existing Scala collection and run functions on it in parallel
- Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop
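A minimal sketch of both types, assuming an existing SparkContext sc (the file path is illustrative):

// Type 1: parallelized collection from an existing Scala collection
val nums    = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2).collect()   // Array(2, 4, 6, 8, 10)

// Type 2: Hadoop dataset, one record per line of an HDFS file
val lines      = sc.textFile("hdfs://data/input.txt")
val totalChars = lines.map(_.length).sum()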
Fault Tolerance
An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD tracks its lineage info, so lost data can be rebuilt.
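That lineage can be inspected with toDebugString; a small sketch, again assuming a SparkContext sc:

val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)
// Prints the chain of parent RDDs (the lineage) that Spark would
// replay to recompute any partition lost to a failure.
println(rdd.toDebugString)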
Benefit of Spark
Spark helps us gain processing speed and implement various big data applications easily and quickly, with:
- support for event stream processing
- fast data queries in real time
- improved programmer productivity
- fast batch processing of large data sets
Why I Use Spark
Big Data
Big Data is not just big
The 3Vs of Big Data (Volume, Velocity, Variety)
Big Data Processing
1. Batch Processing: processing data en masse; big & complex; higher latencies (ex: MapReduce)
2. Stream Processing: one-at-a-time processing; computations are relatively simple and generally independent; sub-second latency (ex: Storm)
3. Micro-Batching: small batch sizes (batch + streaming)
Spark Streaming Integration
Spark Streaming In Action

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
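Here sparkConf, serverIP and serverPort are assumed to be defined elsewhere in the program; to try the example locally you can point serverPort at a netcat listener (e.g. nc -lk 9999) and type words into it.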
Spark UI
Spark SQL
Spark SQL In Action

// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}
val model = new LogisticRegressionWithSGD().run(trainingData)
Spark SQL In Action

val allCandidates = sql("""
  SELECT userId, age, latitude, longitude
  FROM Users
  WHERE subscribed = FALSE""")

// Results of ML algorithms can be used as tables
// in subsequent SQL statements.
case class Score(userId: Int, score: Double)
val scores = allCandidates.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  Score(row(0), model.predict(features))
}
scores.registerAsTable("Scores")
MR vs RDD - Compute an Average
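The MapReduce side needs a full job with a mapper and a reducer; a minimal sketch in Scala against the Hadoop API, assuming tab-separated name/age input lines (compare with the RDD and DataFrame versions on the next slides):

import org.apache.hadoop.io.{DoubleWritable, IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map: parse "name<TAB>age" lines and emit (name, age)
class AvgMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split("\t")
    ctx.write(new Text(fields(0)), new IntWritable(fields(1).toInt))
  }
}

// Reduce: sum the ages per name and divide by the count
class AvgReducer extends Reducer[Text, IntWritable, Text, DoubleWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, DoubleWritable]#Context): Unit = {
    var sum = 0; var count = 0
    val it = values.iterator()
    while (it.hasNext) { sum += it.next().get(); count += 1 }
    ctx.write(key, new DoubleWritable(sum.toDouble / count))
  }
}

Job setup (driver class, input/output paths) is omitted; the point is the amount of ceremony compared with the few RDD lines that follow.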
RDD vs DF - Compute an Average
Using RDDs

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames

sqlCtx.table("people").groupBy("name").agg(avg("age")).collect()
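The DataFrame version is not only shorter: because the query goes through the Spark SQL optimizer, it also avoids the hand-rolled sum/count bookkeeping of the RDD version.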
Spark 2.0: Structured Streaming
Structured Streaming
High-level streaming API built on Spark SQL engine
Runs the same queries on DataFrames
Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Change queries at runtime
Build and apply ML models
Spark 2.0 Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: every 5 sec
Output mode: update-in-place, into MySQL sink
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id) \
    .agg(sum(logs.time)) \
    .write.format("jdbc") \
    .stream("jdbc:mysql://...")
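The snippet above uses the pre-release Spark 2.0 API from the original slides. As a sketch of the same page-view count with the readStream/writeStream API that shipped in Spark 2.x (the broker address, topic name and schema are assumptions, and a console sink stands in for the MySQL sink):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("PageViewCount").getOrCreate()
import spark.implicits._

// Read page-view records from Kafka (needs the spark-sql-kafka package)
val views = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // assumed broker address
  .option("subscribe", "pageviews")                  // assumed topic name
  .load()

// select count(*) group by page, minute(evtime): treat the Kafka message
// value as the page and the Kafka timestamp as the event time,
// bucketed into 1-minute windows
val counts = views
  .selectExpr("CAST(value AS STRING) AS page", "timestamp AS evtime")
  .groupBy($"page", window($"evtime", "1 minute"))
  .count()

// Emit updated counts every 5 seconds (console sink as a stand-in)
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

query.awaitTermination()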
Spark 2.0 Use Case: Fraud Detection
Spark 2.0 Performance
Q & A
Thank You!