Apache Spark RDDs


Description

Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk first shows how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala. http://www.meetup.com/Scala-Bay/events/209740892/

Transcript

Apache Spark RDDs
Dean Chen, eBay Inc.

http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-Matei-Zaharia1.pdf

Spark

• 2010 paper from Berkeley's AMPLab

• resilient distributed datasets (RDDs)

• Generalized distributed computation engine/platform

• Fault-tolerant in-memory caching

• Extensible interface for various workloads


https://amplab.cs.berkeley.edu/software/

RDDs

• Resilient distributed datasets

• "read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost"

• Familiar Scala collections API for distributed data and computation

• Monadic expression of lazy transformations, but not monads

Spark Shell

• Interactive queries and prototyping

• Local, YARN, Mesos

• Static type checking and auto-complete

• Lambdas

val titles = sc.textFile("titles.txt")

val countsRdd = titles
  .flatMap(tokenize)
  .map(word => (cleanse(word), 1))
  .reduceByKey(_ + _)

val counts = countsRdd
  .filter { case (_, total) => total > 10000 }
  .sortBy { case (_, total) => total }
  .filter { case (word, _) => word.length >= 5 }
  .collect
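The tokenize and cleanse helpers are not shown on the slide; one hypothetical way they could be defined (names are from the slide, the bodies are assumptions):

// Hypothetical helpers assumed by the pipeline above; not from the talk
def tokenize(line: String): Seq[String] = line.split("\\s+").toSeq
def cleanse(word: String): String = word.toLowerCase.filter(_.isLetterOrDigit)

Note that nothing executes when countsRdd is defined; the whole chain only runs when collect is called at the end of the second expression.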

Transformations

map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian

Actions

reduce, collect, count, first, take, takeSample, saveAsTextFile, foreach
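Transformations are lazy and only describe a new RDD; actions force evaluation of the whole lineage. A minimal sketch, reusing countsRdd from above:

// Transformation: lazy, just builds the plan, nothing executes yet
val frequent = countsRdd.filter { case (_, total) => total > 100 }

// Action: count forces the full lineage to run
val n = frequent.count()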

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Count(word: String, total: Int)

val schemaRdd = countsRdd.map(c => Count(c._1, c._2))

val count = schemaRdd
  .where('word === "scala")
  .select('total)
  .collect

schemaRdd.registerTempTable("counts")

sql(" SELECT total FROM counts WHERE word = 'scala' ").collect

schemaRdd
  .filter(_.word == "scala")
  .map(_.total)
  .collect

registerFunction("LEN", (_: String).length)

val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")

queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)

Spark Streaming

• Real-time computation, similar to Storm

• Input distributed to memory for fault tolerance

• Streaming input into sliding windows of RDDs

• Kafka, Flume, Kinesis, HDFS

TwitterUtils.createStream()
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
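The slide's snippet is abbreviated; a fuller sketch of the setup it assumes (batch and window durations are illustrative, and in the real API createStream takes a StreamingContext and an auth option, while countByWindow also takes a slide duration):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(1))  // 1-second batch interval

val sparkMentions = TwitterUtils.createStream(ssc, None)  // None: use default OAuth config
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5), Seconds(1))  // 5-second window, sliding every batch

sparkMentions.print()
ssc.start()
ssc.awaitTermination()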

GraphX

• Optimally partitions and indexes vertices and edges represented as RDDs

• APIs to join and traverse graphs

• PageRank, connected components, triangle counting

val graph = Graph(userIdRDD, assocRDD)

val ranks = graph.pageRank(0.0001).vertices

val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
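The slide leaves assocRDD undefined; a hypothetical way the edge RDD could be built (the file name, line format, and edge attribute are assumptions):

import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical input where each line is: <followerId> <followeeId>
val assocRDD = sc.textFile("graphx/data/followers.txt").map { line =>
  val fields = line.split(" ")
  Edge(fields(0).toLong, fields(1).toLong, "follows")
}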

MLlib

• Machine learning library similar to Mahout

• Statistics, regression, decision trees, clustering, PCA, gradient descent

• Iterative algorithms much faster due to in-memory caching

val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble))
  )
}
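Since SGD makes many passes over the data, caching the parsed input before training avoids re-reading and re-parsing it on every iteration, which is the in-memory speedup the bullet above refers to. A one-line addition, assuming the parsedData above:

// Keep the training set in memory across SGD iterations
parsedData.cache()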

val model = LinearRegressionWithSGD.train(parsedData, 100)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }
  .mean()

RDDs

• Resilient distributed datasets

• Familiar Scala collections API

• Distributed data and computation

• Monadic expression of transformations

• But not monads

Pseudo Monad

• Wraps an iterator plus the distribution of partitions

• Keeps track of history for fault tolerance

• Lazily evaluated, chaining of expressions
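A toy sketch of the idea (illustrative only, not Spark's actual classes): each transformation returns a new node that records its parent and a function, and nothing runs until an action walks the chain:

// Toy model of lazy lineage; not Spark's API
sealed trait ToyRDD[A] {
  def compute(): Iterator[A]
  def map[B](f: A => B): ToyRDD[B] = MappedToyRDD(this, f)  // lazy: just records lineage
  def collect(): Seq[A] = compute().toSeq                   // action: forces evaluation
}

case class SourceToyRDD[A](data: Seq[A]) extends ToyRDD[A] {
  def compute(): Iterator[A] = data.iterator
}

case class MappedToyRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  // Fault tolerance falls out of lineage: a lost partition can be
  // recomputed by replaying the parent chain
  def compute(): Iterator[B] = parent.compute().map(f)
}

// SourceToyRDD(Seq(1, 2, 3)).map(_ * 2).collect() == Seq(2, 4, 6)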

https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf

RDD Interface

• compute: transformation applied to iterable(s)

• getPartitions: partition data for parallel computation

• getDependencies: lineage of parent RDDs and if shuffle is required
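Abridged from the Spark 1.x source (signatures simplified):

import org.apache.spark.{Dependency, Partition, TaskContext}

abstract class RDD[T](/* SparkContext, dependencies */) {
  // Compute one partition's data as an iterator
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // How the data is split for parallel execution
  protected def getPartitions: Array[Partition]

  // Lineage: parent RDDs, and whether a shuffle is required
  protected def getDependencies: Seq[Dependency[_]]
}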

HadoopRDD

• compute: read HDFS block or file split

• getPartitions: HDFS block or file split

• getDependencies: None

MappedRDD

• compute: compute parent and map result

• getPartitions: parent partition

• getDependencies: single dependency on parent
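Essentially how MappedRDD looked in the Spark 1.x source (simplified):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // Same partitioning as the parent
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Compute the parent's iterator for this split, then map over it
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}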

CoGroupedRDD

• compute: compute, shuffle, then group parent RDDs

• getPartitions: one per reduce task

• getDependencies: shuffle each parent RDD

Summary

• Simple Unified API through RDDs

• Interactive Analysis

• Hadoop Integration

• Performance

References

• http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

• https://www.youtube.com/watch?v=HG2Yd-3r4-M

• https://www.youtube.com/watch?v=e-Ys-2uVxM0

• RDD, MappedRDD, SchemaRDD, RDDFunctions, GraphOps, DStream

deanchen5@gmail.com

@deanchen