Apache Spark RDDs


Description

Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk first shows how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala. http://www.meetup.com/Scala-Bay/events/209740892/

Transcript

Apache Spark RDDs
Dean Chen, eBay Inc.

http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

Spark

• 2010 paper from Berkeley's AMPLab

• resilient distributed datasets (RDDs)

• Generalized distributed computation engine/platform

• Fault-tolerant in-memory caching

• Extensible interface for various workloads


https://amplab.cs.berkeley.edu/software/

RDDs

• Resilient distributed datasets

• "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"

• Familiar Scala collections API for distributed data and computation

• Monadic expression of lazy transformations, but not monads

Spark Shell

• Interactive queries and prototyping

• Local, YARN, Mesos

• Static type checking and auto-complete

• Lambdas

val titles = sc.textFile("titles.txt")

val countsRdd = titles
  .flatMap(tokenize)
  .map(word => (cleanse(word), 1))
  .reduceByKey(_ + _)

val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
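The tokenize and cleanse helpers are not shown on the slide; one hypothetical way they could be defined (names are from the slide, the bodies are assumptions):

// Hypothetical helpers assumed by the pipeline above; not from the talk
def tokenize(line: String): Seq[String] = line.split("\\s+").toSeq
def cleanse(word: String): String = word.toLowerCase.filter(_.isLetterOrDigit)

Note that nothing executes when countsRdd is defined; the whole chain only runs when collect is called at the end of the second expression.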

Transformations

map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian

Actions

reduce, collect, count, first, take, takeSample, saveAsTextFile, foreach
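Transformations are lazy and only describe a new RDD; actions force evaluation of the whole lineage. A minimal sketch, reusing countsRdd from above:

// Transformation: lazy, just builds the plan, nothing executes yet
val frequent = countsRdd.filter { case (_, total) => total > 100 }

// Action: count forces the full lineage to run
val n = frequent.count()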

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Count(word: String, total: Int)

val schemaRdd = countsRdd.map(c => Count(c._1, c._2))

val count = schemaRdd
  .where('word === "scala")
  .select('total)
  .collect

schemaRdd.registerTempTable("counts")

sql(" SELECT total FROM counts WHERE word = 'scala' ").collect

schemaRdd
  .filter(_.word == "scala")
  .map(_.total)
  .collect

registerFunction("LEN", (_: String).length)

val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")

queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)

Spark Streaming

• Real-time computation, similar to Storm

• Input distributed to memory for fault tolerance

• Streaming input into sliding windows of RDDs

• Kafka, Flume, Kinesis, HDFS

TwitterUtils.createStream()
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
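The slide's snippet is abbreviated; a fuller sketch of the setup it assumes (batch and window durations are illustrative, and in the real API createStream takes a StreamingContext and an auth option, while countByWindow also takes a slide duration):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(1))  // 1-second batch interval

val sparkMentions = TwitterUtils.createStream(ssc, None)  // None: use default OAuth config
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(1))  // 5-second window, sliding every batch

sparkMentions.print()
ssc.start()
ssc.awaitTermination()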

GraphX

• Optimally partitions and indexes vertices and edges represented as RDDs

• APIs to join and traverse graphs

• PageRank, connected components, triangle counting

val graph = Graph(userIdRDD, assocRDD)

val ranks = graph.pageRank(0.0001).vertices

val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
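The slide leaves assocRDD undefined; a hypothetical way the edge RDD could be built (the file name, line format, and edge attribute are assumptions):

import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical input where each line is: <followerId> <followeeId>
val assocRDD = sc.textFile("graphx/data/followers.txt").map { line =>
  val fields = line.split(" ")
  Edge(fields(0).toLong, fields(1).toLong, "follows")
}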

MLlib

• Machine learning library similar to Mahout

• Statistics, regression, decision trees, clustering, PCA, gradient descent

• Iterative algorithms much faster due to in-memory caching

val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble))
  )
}
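Since SGD makes many passes over the data, caching the parsed input before training avoids re-reading and re-parsing it on every iteration, which is the in-memory speedup the bullet above refers to. A one-line addition, assuming the parsedData above:

// Keep the training set in memory across SGD iterations
parsedData.cache()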

val model = LinearRegressionWithSGD.train(parsedData, 100)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }
  .mean()

RDDs

• Resilient distributed datasets

• Familiar Scala collections API

• Distributed data and computation

• Monadic expression of transformations

• But not monads

Pseudo Monad

• Wraps an iterator plus the distribution of partitions

• Keeps track of history for fault tolerance

• Lazily evaluated, chaining of expressions
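A toy sketch of the idea (illustrative only, not Spark's actual classes): each transformation returns a new node that records its parent and a function, and nothing runs until an action walks the chain:

// Toy model of lazy lineage; not Spark's API
sealed trait ToyRDD[A] {
  def compute(): Iterator[A]
  def map[B](f: A => B): ToyRDD[B] = MappedToyRDD(this, f)  // lazy: just records lineage
  def collect(): Seq[A] = compute().toSeq                   // action: forces evaluation
}

case class SourceToyRDD[A](data: Seq[A]) extends ToyRDD[A] {
  def compute(): Iterator[A] = data.iterator
}

case class MappedToyRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  // Fault tolerance falls out of lineage: a lost partition can be
  // recomputed by replaying the parent chain
  def compute(): Iterator[B] = parent.compute().map(f)
}

// SourceToyRDD(Seq(1, 2, 3)).map(_ * 2).collect() == Seq(2, 4, 6)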

https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf

RDD Interface

• compute: transformation applied to iterable(s)

• getPartitions: partition data for parallel computation

• getDependencies: lineage of parent RDDs and if shuffle is required
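Abridged from the Spark 1.x source (signatures simplified):

import org.apache.spark.{Dependency, Partition, TaskContext}

abstract class RDD[T](/* SparkContext, dependencies */) {
  // Compute one partition's data as an iterator
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // How the data is split for parallel execution
  protected def getPartitions: Array[Partition]

  // Lineage: parent RDDs, and whether a shuffle is required
  protected def getDependencies: Seq[Dependency[_]]
}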

HadoopRDD

• compute: read HDFS block or file split

• getPartitions: HDFS block or file split

• getDependencies: None

MappedRDD

• compute: compute parent and map result

• getPartitions: parent partition

• getDependencies: single dependency on parent
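Essentially how MappedRDD looked in the Spark 1.x source (simplified):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // Same partitioning as the parent
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Compute the parent's iterator for this split, then map over it
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}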

CoGroupedRDD

• compute: compute, shuffle, then group parent RDDs

• getPartitions: one per reduce task

• getDependencies: shuffle each parent RDD

Summary

• Simple Unified API through RDDs

• Interactive Analysis

• Hadoop Integration

• Performance

References

• http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

• https://www.youtube.com/watch?v=HG2Yd-3r4-M

• https://www.youtube.com/watch?v=e-Ys-2uVxM0

• RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream

deanchen5@gmail.com

@deanchen