Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Presentation by Zbigniew Chlebicki based on paper by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica; University of California, Berkeley.
Some images and code samples are taken from the paper, the NSDI presentation, or the Spark project website ( http://spark-project.org/ ).
MapReduce in Hadoop
Resilient Distributed Datasets (RDD)
● Immutable, partitioned collection of records
● Created by deterministic coarse-grained transformations
● Materialized on action
● Fault-tolerant through lineage
● Controllable persistence and partitioning
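The properties above can be sketched in standalone Python (no Spark required; `LineageRDD` and all names here are hypothetical): each dataset records the coarse-grained transformation that produced it, transformations are lazy, and an action replays the lineage chain — the same replay that recovers a lost partition after a failure.

```python
# Hypothetical sketch of lineage-based recomputation (not Spark's API).
class LineageRDD:
    def __init__(self, parent=None, transform=None, data=None):
        self.parent = parent        # upstream dataset in the lineage graph
        self.transform = transform  # deterministic, coarse-grained function
        self.data = data            # source records (only for the root RDD)

    def map(self, f):
        # Transformations are lazy: we only record the lineage edge.
        return LineageRDD(parent=self, transform=lambda recs: [f(r) for r in recs])

    def filter(self, pred):
        return LineageRDD(parent=self, transform=lambda recs: [r for r in recs if pred(r)])

    def compute(self):
        # Actions materialize the dataset by replaying lineage from the root;
        # replaying the same chain recovers a lost partition.
        if self.parent is None:
            return list(self.data)
        return self.transform(self.parent.compute())

base = LineageRDD(data=[1, 2, 3, 4])
result = base.map(lambda x: x * 10).filter(lambda x: x > 15)
print(result.compute())  # [20, 30, 40]
```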
Example: Log mining
val file = spark.textFile("hdfs://…")
val errors = file.filter(
  line => line.contains("ERROR")
).cache()
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
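The same pipeline can be mirrored in plain Python to see what the operations compute (a standalone analogue, not Spark; the sample log lines are invented for illustration):

```python
# Plain-Python analogue of the log-mining example above.
lines = [
    "INFO starting up",
    "ERROR MySQL connection refused",
    "ERROR disk full",
    "ERROR MySQL timeout",
]

# filter(...).cache(): keep the error lines in memory for reuse
errors = [line for line in lines if "ERROR" in line]

print(len(errors))                                    # count all the errors -> 3
mysql = [line for line in errors if "MySQL" in line]
print(len(mysql))                                     # errors mentioning MySQL -> 2
print(mysql)                                          # collect the matching lines
```

In Spark the point is that `errors` is cached in cluster memory after the first action, so the two follow-up queries reuse it without rereading HDFS.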
Example: Logistic Regression
val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
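The loop above can be run end-to-end in plain Python with the same gradient formula (a standalone sketch; the toy data set is invented, and there is no Spark involved):

```python
import math
import random

# Standalone sketch of the logistic-regression loop, same update rule:
# gradient = sum over points of (1/(1+exp(-y*(w.x))) - 1) * y * x;  w -= gradient
random.seed(0)
D = 2
ITERATIONS = 100
# (x, y) pairs; y is the +1/-1 label
points = [([2.0, 1.0], 1.0), ([-2.0, -1.0], -1.0),
          ([1.5, 2.0], 1.0), ([-1.0, -2.5], -1.0)]

w = [random.random() for _ in range(D)]  # current separating plane

for _ in range(ITERATIONS):
    gradient = [0.0] * D
    for x, y in points:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wi - gi for wi, gi in zip(w, gradient)]

print("Final separating plane:", w)
```

The Spark version wins on iteration time because `points` stays cached in memory across the loop, while Hadoop MapReduce would reload it from disk every iteration.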
Example: PageRank
links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
ranks = links.join(ranks).flatMap {
  case (url, (links, rank)) =>
    links.map(dest => (dest, rank / links.size))
}.reduceByKey(_ + _)
}
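The join/flatMap/reduceByKey loop maps directly onto plain Python dicts (a standalone sketch of the simplified algorithm shown, without damping; the tiny link graph is invented):

```python
# Standalone sketch of the PageRank loop above using dicts.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # (url, neighbors) pairs
ranks = {url: 1.0 for url in links}                # (url, rank) pairs
ITERATIONS = 20

for _ in range(ITERATIONS):
    contribs = {}
    # join(links, ranks) + flatMap: each page sends rank/len(neighbors)
    # to each of its neighbors
    for url, neighbors in links.items():
        for dest in neighbors:
            contribs[dest] = contribs.get(dest, 0.0) + ranks[url] / len(neighbors)
    # reduceByKey(_ + _) is the summation above; these become the new ranks
    ranks = contribs

print(ranks)
```

In Spark, co-partitioning `links` and `ranks` by URL makes the per-iteration join local, avoiding a shuffle of the (static) link table each round.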
Representation
// Abstract members of the RDD interface:
def compute(split: Split): Iterator[T]
val dependencies: List[spark.Dependency[_]]
def splits: Array[Split]
val partitioner: Option[Partitioner]
def preferredLocations(split: Split): Seq[String]
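To make the interface concrete, here is a hypothetical Python rendering (not Spark's real classes; `partitioner` and `preferredLocations` are omitted for brevity): a derived RDD answers the same questions — its partitions, its dependencies, and how to compute one partition from its parent's iterator.

```python
# Hypothetical miniature of the RDD interface above (not Spark's classes).
class SourceRDD:
    def __init__(self, partitions):
        self._partitions = partitions  # list of lists of records

    def splits(self):
        return list(range(len(self._partitions)))

    def dependencies(self):
        return []                      # a source RDD has no parents

    def compute(self, split):
        return iter(self._partitions[split])

class FilteredRDD:
    def __init__(self, parent, pred):
        self.parent, self.pred = parent, pred

    def splits(self):
        return self.parent.splits()    # same partitioning as the parent

    def dependencies(self):
        # One-to-one ("narrow") dependency on the parent's partitions.
        return [("narrow", self.parent)]

    def compute(self, split):
        # compute(split) pipelines over the parent's iterator for that split.
        return (r for r in self.parent.compute(split) if self.pred(r))

src = SourceRDD([[1, 2, 3], [4, 5, 6]])
evens = FilteredRDD(src, lambda r: r % 2 == 0)
print([list(evens.compute(s)) for s in evens.splits()])  # [[2], [4, 6]]
```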
Scheduling
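The scheduler's key idea from the paper — pipeline chains of narrow dependencies inside one stage and cut a new stage at every wide (shuffle) dependency — can be sketched in a few lines of Python (a hedged illustration of the idea, not Spark's actual scheduler; the helper and its input format are made up):

```python
# Hedged sketch of stage building: narrow dependencies pipeline into one
# stage; a wide (shuffle) dependency starts a new stage.
def build_stages(rdd_chain):
    """rdd_chain: list of (name, dep_type) from source to final RDD,
    where dep_type is 'narrow', 'wide', or None for the source."""
    stages, current = [], []
    for name, dep_type in rdd_chain:
        if dep_type == "wide" and current:
            stages.append(current)  # shuffle boundary ends the stage
            current = []
        current.append(name)
    stages.append(current)
    return stages

chain = [("textFile", None), ("map", "narrow"),
         ("reduceByKey", "wide"), ("filter", "narrow")]
print(build_stages(chain))  # [['textFile', 'map'], ['reduceByKey', 'filter']]
```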
Evaluation: PageRank
Scalability
Fault Recovery (k-means)
Behavior with Insufficient RAM (logistic regression)
User Applications
● Conviva, data mining (40x speedup)
● Mobile Millennium, traffic modeling
● Twitter, spam classification
● ...
Expressing other Models
● MapReduce, DryadLINQ
● Pregel graph processing
● Iterative MapReduce
● SQL
Conclusion
● RDDs are an efficient, general, and fault-tolerant abstraction for cluster computing
● 20x faster than Hadoop for memory-bound applications
● Can be used for interactive data mining
● Available as Open Source at http://spark-project.org