Post on 27-Jan-2015
Spark: Next Generation Hadoop
Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D,
Innovation Labs, Impetus
Contents
Big Data Computations
Hadoop 2.0 (Hadoop YARN)
Berkeley data analytics stack
• BDAS Spark
• BDAS Discretized Streams
PMML Scoring for Naïve Bayes
• PMML Primer
• Naïve Bayes Primer
Big Data Computations
Computations/Operations
Giant 1 (simple stats) is perfect for Hadoop 1.0.
For Giants 2 (linear algebra), 3 (N-body) and 4 (optimization), Spark from UC Berkeley is efficient.
Logistic regression, kernel SVMs, conjugate gradient
descent, collaborative filtering, Gibbs sampling, alternating least squares.
An example is the social group-first approach for consumer churn analysis [2].
Interactive/On-the-fly data processing – Storm.
OLAP – data cube operations. Dremel/Drill
Data sets – not embarrassingly parallel?
Deep Learning Artificial Neural Networks
Machine vision from Google [3]
Speech analysis from Microsoft
Giant 5 – Graph processing – GraphLab, Pregel, Giraph
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Yossi Richter, Elad Yom-Tov, Noam Slonim: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240.
Hadoop YARN Requirements or 1.0 shortcomings
R1: Scalability
• Single cluster limitation
R2: Multi-tenancy
• Addressed by Hadoop-on-Demand
• Security, quotas
R3: Locality awareness
• Shuffle of records
R4: Shared cluster utilization
• Hogging by users
• Typed slots
R5: Reliability/Availability
• Job Tracker bugs
R6: Iterative machine learning
Vinod Kumar Vavilapalli, Arun C Murthy , Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe , Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator”, ACM Symposium on Cloud Computing, Oct 2013, ACM Press.
Iterative ML Algorithms
What are iterative algorithms?
Those that need communication among the computing entities
Examples – neural networks, PageRank algorithms, network traffic analysis
Conjugate gradient descent
Commonly used to solve systems of linear equations
[CB09] tried implementing CG on dense matrices
DAXPY – Multiplies vector x by constant a and adds y.
DDOT – Dot product of 2 vectors
MatVec – Multiply matrix by vector, produce a vector.
1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to tens of GBs of communication even for small matrices.
Other iterative algorithms – fast Fourier transform, block tridiagonal
[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
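To make the primitive count on this slide concrete, here is a minimal single-machine sketch of conjugate gradient built only from DAXPY, DDOT and MatVec. It is not the [CB09] implementation; the object and method names are assumptions made for illustration. In this common formulation each iteration makes six primitive calls (1 MatVec, 2 DDOT, 3 DAXPY), which is consistent with the "6 MRs per CG iteration" figure if every primitive becomes one MapReduce job.

```scala
object ConjugateGradientSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // DAXPY: multiplies vector x by constant a and adds y.
  def daxpy(a: Double, x: Vec, y: Vec): Vec =
    x.zip(y).map { case (xi, yi) => a * xi + yi }

  // DDOT: dot product of two vectors.
  def ddot(x: Vec, y: Vec): Double =
    x.zip(y).map { case (xi, yi) => xi * yi }.sum

  // MatVec: multiplies a matrix by a vector, producing a vector.
  def matVec(m: Mat, x: Vec): Vec = m.map(row => ddot(row, x))

  // Solves A x = b for a symmetric positive-definite A (assumed, not checked).
  def solve(a: Mat, b: Vec, maxIter: Int = 100, tol: Double = 1e-10): Vec = {
    var x: Vec = Array.fill(b.length)(0.0)
    var r: Vec = b.clone() // residual b - A x, with x = 0 initially
    var p: Vec = r.clone() // search direction
    var rsOld = ddot(r, r)
    var iter = 0
    while (iter < maxIter && math.sqrt(rsOld) > tol) {
      val ap = matVec(a, p)            // primitive 1: MatVec
      val alpha = rsOld / ddot(p, ap)  // primitive 2: DDOT
      x = daxpy(alpha, p, x)           // primitive 3: DAXPY
      r = daxpy(-alpha, ap, r)         // primitive 4: DAXPY
      val rsNew = ddot(r, r)           // primitive 5: DDOT
      p = daxpy(rsNew / rsOld, p, r)   // primitive 6: DAXPY
      rsOld = rsNew
      iter += 1
    }
    x
  }
}
```

Calling `ConjugateGradientSketch.solve(Array(Array(4.0, 1.0), Array(1.0, 3.0)), Array(1.0, 2.0))` solves a small 2x2 system. On a cluster, every one of the six commented steps would be a separate pass over the data, which is where the communication cost comes from.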
Hadoop YARN Architecture
YARN Internals
Application Master
• Sends ResourceRequests to the YARN RM
• Captures containers, resources per container, locality preferences.
YARN RM
• Generates tokens and containers
• Global view of cluster – monolithic scheduling.
Node Manager
• Node health monitoring, advertise available resources through heartbeats to RM.
Berkeley Big-data Analytics Stack (BDAS)
BDAS: Spark
[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
Transformations/Actions | Description
map(function f1) | Pass each element of the RDD through f1 in parallel and return the resulting RDD.
filter(function f2) | Select elements of the RDD that return true when passed through f2.
flatMap(function f3) | Similar to map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
union(RDD r1) | Returns the union of the RDD r1 with self.
sample(flag, p, seed) | Returns a randomly sampled (with seed) p percentage of the RDD.
groupByKey(noTasks) | Can only be invoked on key-value paired data; returns data grouped by key. No. of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks) | Aggregates the result of applying f4 on elements with the same key. No. of parallel tasks is the second argument.
join(RDD r2, noTasks) | Joins RDD r2 with self; computes all possible pairs for a given key.
groupWith(RDD r3, noTasks) | Joins RDD r3 with self and groups by key.
sortByKey(flag) | Sorts the self RDD in ascending or descending order based on flag.
reduce(function f5) | Aggregates the result of applying function f5 on all elements of the self RDD.
collect() | Returns all elements of the RDD as an array.
count() | Counts the no. of elements in the RDD.
take(n) | Gets the first n elements of the RDD.
first() | Equivalent to take(1).
saveAsTextFile(path) | Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
saveAsSequenceFile(path) | Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
foreach(function f6) | Runs f6 in parallel on elements of the self RDD.
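The operators listed above closely mirror the methods on ordinary Scala collections, so their semantics can be tried out without a cluster. The sketch below is a local stand-in, not Spark code: `RddStyleOps`, `wordCounts` and this `reduceByKey` helper are hypothetical names, and the data lives in a plain `Seq` rather than an RDD.

```scala
object RddStyleOps {
  // Local stand-in for reduceByKey: groups pairs by key, then folds the
  // values of each key with f. (On an RDD this would trigger a shuffle.)
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Word count expressed as flatMap -> filter -> map -> reduceByKey.
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty) // flatMap: one line, many words
    val pairs = words.map(w => (w, 1))                         // map: word -> (word, 1)
    reduceByKey(pairs)(_ + _)                                  // sum the 1s per word
  }
}
```

For example, `RddStyleOps.wordCounts(Seq("a b a", "b c"))` counts "a" and "b" twice and "c" once. The transformations (flatMap, filter, map) only describe new datasets, while the final aggregation plays the role of the step that produces a result.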
BDAS: Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates vs. on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries on Cassandra 130 secs, 60 ms in Spark.
Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.
Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN.
• Hadoop – batch processing
• Spark – iterative processing
• Storm – on-the-fly processing
• Content recommendation – collaborative filtering
PMML Primer
Predictive Model Markup Language
Developed by DMG (Data Mining Group)
XML representation of a model.
PMML offers a standard to define a model, so that a model generated in tool-A can be directly used in tool-B.
May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
Naïve Bayes Primer
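A simple probabilistic classifier based on Bayes' theorem.
Given features X1, X2, …, Xn, predict a label Y by calculating the probability of every possible value of Y:
$P(Y \mid X_1, \dots, X_n) = \frac{P(X_1, \dots, X_n \mid Y)\, P(Y)}{P(X_1, \dots, X_n)}$
where $P(X_1, \dots, X_n \mid Y)$ is the likelihood, $P(Y)$ is the prior and $P(X_1, \dots, X_n)$ is the normalization constant.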
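As a short worked step behind the primer: the "naïve" part is the assumption that the features are conditionally independent given the label. Under that assumption the likelihood factorizes into per-feature terms, and since the normalization constant does not depend on Y it can be dropped when only the most probable label is needed.

```latex
P(Y \mid X_1,\dots,X_n)
  = \frac{P(Y)\,\prod_{i=1}^{n} P(X_i \mid Y)}{P(X_1,\dots,X_n)}
  \;\propto\; P(Y)\,\prod_{i=1}^{n} P(X_i \mid Y),
\qquad
\hat{Y} = \arg\max_{y}\; P(Y = y)\,\prod_{i=1}^{n} P(X_i \mid Y = y)
```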
PMML Scoring for Naïve Bayes
• Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
• This can theoretically be used in any data processing framework by invoking the API.
• Deployed a Naïve Bayes PMML generated from R into the Storm, Spark and Samza frameworks.
• Real-time predictions with the above APIs.
Header
• Version and timestamp
• Model development environment information
Data Dictionary
• Variable types, missing, valid and invalid values
Data Munging/Transformation
• Normalization, mapping, discretization
Model
• Model-specific attributes
• Mining Schema
• Treatment for missing and outlier values
• Targets
• Prior probability and default
• Outputs
• List of computed output fields
• Post-processing
• Definition of model architecture/parameters.
<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>
(ctd on the next slide)
<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>
  <BayesInputs>
(ctd on the next page)
  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2"> *
    <BayesInput fieldName="V3"> *
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>
Definition of elements:
• DataDictionary: definitions for fields as used in mining models (Class, V1, V2, V3).
• NaiveBayesModel: indicates that this is a Naïve Bayes PMML.
• MiningSchema: lists fields as used in that model. Class is the “predicted” field; V1, V2, V3 are “active” predictor fields.
• Output: describes a set of result values that can be returned from a model.
Definition of elements (ctd.):
• BayesInputs: for each input field and each of its values, contains the counts per target value.
• BayesOutput: contains the counts associated with the values of the target field.
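The counts in BayesInputs and BayesOutput are all that is needed to score a record. The following is a minimal sketch of that computation, not the scoring engine described in these slides: the object name, method name and nested-map layout are assumptions, and the use of `threshold` in place of a zero-count probability reflects my reading of the PMML Naïve Bayes convention rather than anything stated here.

```scala
object NaiveBayesCountsSketch {
  // pairCounts:   field -> input value -> target value -> count (BayesInputs)
  // targetCounts: target value -> count                       (BayesOutput)
  // input:        field -> observed value for the record being scored
  // Assumes every field/value in `input` is present in pairCounts.
  def score(
      pairCounts: Map[String, Map[String, Map[String, Double]]],
      targetCounts: Map[String, Double],
      threshold: Double,
      input: Map[String, String]): Map[String, Double] = {
    val unnormalized = targetCounts.map { case (target, total) =>
      val likelihood = input.foldLeft(1.0) { case (acc, (field, value)) =>
        val count = pairCounts(field)(value).getOrElse(target, 0.0)
        // Assumed convention: a zero count is replaced by the model threshold.
        val p = if (count == 0.0) threshold else count / total
        acc * p
      }
      target -> total * likelihood // class count (prior) times likelihood
    }
    val norm = unnormalized.values.sum // normalization constant
    unnormalized.map { case (t, v) => t -> v / norm }
  }
}
```

Using only the V1 counts shown in the PMML excerpt (the V2 and V3 counts are elided there), a record with V1 = "y" gets an unnormalized score of 73 for democrat and 23 for republican, so the democrat probability is 73/96.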
Sample Input
Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y
• 1st, 2nd and 3rd columns: predictor variables (attribute “name” in element MiningField).
• Using these, we predict whether the output is democrat or republican (PMML element BayesOutput).
• 3-node Storm cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)
Number of records (in millions) | Time taken (seconds)
0.1 | 4
0.4 | 7
1.0 | 12
2.0 | 21
10 | 129
25 | 310
• 3-node Spark cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)
Number of records (in millions) | Time taken
0.1 | 1 min 47 sec
0.2 | 3 min 35 sec
0.4 | 6 min 40 sec
1.0 | 35 min 17 sec
10 | More than 3 hrs
Future of Spark
Domain specific language approach from Stanford.
• Forge [AKS13] – a meta DSL for high performance DSLs.
• 40X faster than Spark!
Spark
• Explore BLAS libraries for efficiency.
[AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13). ACM, New York, NY, USA, 145-154.
Conclusion
• Beyond Hadoop Map-Reduce philosophy
• Optimization and other problems.
• Real-time computation
• Processing specialized data structures
• PMML scoring
• Spark for batch computations
• Spark streaming and Storm for real-time.
• Allows traditional analytical tools/algorithms to be re-used.
Thank You!
Mail • vijay.sa@impetus.co.in
LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs • blogs.impetus.com
Twitter • @a_vijaysrinivas
Back up slides
GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
Goals – targeted at machine learning.
• Model graph dependencies, be asynchronous, iterative, dynamic.
Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
Update functions – live on each vertex
• Transform data in scope of vertex.
• Can choose to trigger neighbours (for example only if Rank changes drastically)
• Run asynchronously till convergence – no global barrier.
Consistency is important in ML algorithms (some do not even converge when there are inconsistent updates – collaborative filtering).
• GraphLab – provides varying level of consistency. Parallelism vs. consistency.
Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.
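The update-function idea can be illustrated with a small single-threaded sketch: a PageRank-style update runs on one vertex at a time, reads only the data in that vertex's scope, and reschedules its neighbours only when its own rank changed noticeably. This is not the GraphLab API; `VertexUpdateSketch`, the work queue and the un-normalized rank formula are assumptions chosen for illustration, and a real engine would run updates in parallel under a chosen consistency model.

```scala
import scala.collection.mutable

object VertexUpdateSketch {
  // outEdges: vertex -> out-neighbours. Assumes every vertex appears as a key
  // and there are no duplicate edges.
  def pageRank(outEdges: Map[Int, Seq[Int]],
               damping: Double = 0.85,
               tol: Double = 1e-6): Map[Int, Double] = {
    val vertices = outEdges.keys.toSeq
    val inEdges: Map[Int, Seq[Int]] = vertices.map { v =>
      v -> vertices.filter(u => outEdges(u).contains(v))
    }.toMap

    val rank = mutable.Map.empty[Int, Double]
    val queue = mutable.Queue.empty[Int]
    val scheduled = mutable.Set.empty[Int]
    vertices.foreach { v => rank(v) = 1.0; queue.enqueue(v); scheduled += v }

    // No global barrier: vertices are processed one by one until none is scheduled.
    while (queue.nonEmpty) {
      val v = queue.dequeue()
      scheduled -= v
      // Update function: reads only data in the scope of v (its in-neighbours).
      val incoming = inEdges(v).map(u => rank(u) / outEdges(u).size).sum
      val updated = (1 - damping) + damping * incoming
      val change = math.abs(updated - rank(v))
      rank(v) = updated
      // Trigger neighbours only if the rank changed by more than tol.
      if (change > tol) {
        for (w <- outEdges(v) if !scheduled(w)) {
          queue.enqueue(w)
          scheduled += w
        }
      }
    }
    rank.toMap
  }
}
```

For instance, `VertexUpdateSketch.pageRank(Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0)))` runs until no vertex is scheduled any more. Vertices whose neighbourhood has settled are simply never revisited, which is the dynamic scheduling behaviour described above.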
GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
GraphLab could not scale to the Altavista web graph 2002 (1.4B vertices, 6.7B edges).
• Most graph-parallel abstractions assume small neighbourhoods – low degree vertices.
• But natural graphs (LinkedIn, Facebook, Twitter) are power law graphs.
• Hard to partition power law graphs; high degree vertices limit parallelism.
PowerGraph provides a new way of partitioning power law graphs.
• Edges are tied to machines; vertices (esp. high degree ones) span machines.
• Execution split into 3 phases: gather, apply and scatter.
Triangle counting on Twitter graph
• Hadoop MR took 423 minutes on 1536 machines.
• GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).
BDAS: Discretized Streams
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
Treats streams as series of small time interval batch computations
Event based APIs – stream handling
How to make interval granularity very low (milliseconds)?
• Built over Spark RDDs – in-memory distributed cache
Fault-tolerance is based on RDD lineage (series of transformations that can be stored and recomputed on failure).
• Parallel recovery – re-computations happen in parallel across the cluster.
BDAS: D-Streams Streaming Operators
words = sentences.flatMap(s => s.split(" "))
pairs = words.map(w => (w, 1))
counts = pairs.reduceByKey((a, b) => a + b)
Windowing
• pairs.window("5s").reduceByKey(_+_)
Incremental aggregation
• pairs.reduceByWindow("5s", (a, b) => a + b)
Time skewed joins
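The core D-Streams idea, a stream treated as a series of small batch computations, can be mimicked locally with plain Scala collections. The sketch below is not the Spark Streaming API: each inner `Seq[String]` stands for the sentences that arrived in one interval, and `MicroBatchSketch`, `runningCounts` and `windowedCounts` are hypothetical names.

```scala
object MicroBatchSketch {
  type Counts = Map[String, Int]

  // Batch computation for one interval: a word count over its sentences.
  def countBatch(sentences: Seq[String]): Counts =
    sentences.flatMap(_.split(" ")).filter(_.nonEmpty)
      .groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Merges two count maps by adding the counts key by key.
  def merge(a: Counts, b: Counts): Counts =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap

  // Running reduce: cumulative counts after each interval.
  def runningCounts(batches: Seq[Seq[String]]): Seq[Counts] =
    batches.map(countBatch).scanLeft(Map.empty[String, Int])(merge).tail

  // Windowing: counts over the last `size` intervals, one result per window.
  def windowedCounts(batches: Seq[Seq[String]], size: Int): Seq[Counts] =
    batches.map(countBatch).sliding(size).map(_.reduce(merge)).toList
}
```

Each interval's result depends only on that interval's input and on earlier results, which is what lets lost state be recomputed from lineage instead of being replicated.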
Representation of an RDD
Information | HadoopRDD | FilteredRDD | JoinedRDD
Set of partitions | 1 per HDFS block | Same as parent | 1 per reduce task
Set of dependencies | None | 1-to-1 on parent | Shuffle on each parent
Function to compute data set based on parents | Read corresponding block | Compute parent and filter it | Read and join shuffled data
Meta-data on location (preferredLocations) | HDFS block location from namenode | None (parent) | None
Meta-data on partitioning (partitioningScheme) | None | None | HashPartitioner
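The rows of this table can be read as an interface that every RDD implements. Below is a toy sketch of that interface with two implementations, loosely modelled on the HadoopRDD and FilteredRDD columns. `ToyRdd`, `SourceRdd` and `FilteredRdd` are hypothetical names and not Spark's classes; partitioning metadata is omitted for brevity.

```scala
object RddRepresentationSketch {
  trait ToyRdd[T] {
    def numPartitions: Int                                   // set of partitions
    def dependencies: Seq[ToyRdd[_]]                         // set of dependencies (lineage)
    def compute(partition: Int): Seq[T]                      // function to compute a partition
    def preferredLocations(partition: Int): Seq[String] = Nil // location metadata
    // Materializes all partitions in order.
    def collect(): Seq[T] = (0 until numPartitions).flatMap(compute)
  }

  // Stand-in for a HadoopRDD: one partition per block, no parents,
  // preferred location = the host holding that block (hosts has one entry per block).
  class SourceRdd[T](blocks: Seq[Seq[T]], hosts: Seq[String]) extends ToyRdd[T] {
    def numPartitions: Int = blocks.size
    def dependencies: Seq[ToyRdd[_]] = Nil
    def compute(partition: Int): Seq[T] = blocks(partition)
    override def preferredLocations(partition: Int): Seq[String] = Seq(hosts(partition))
  }

  // Stand-in for a FilteredRDD: same partitions as the parent, 1-to-1 dependency,
  // computed by computing the parent partition and filtering it.
  class FilteredRdd[T](parent: ToyRdd[T], keep: T => Boolean) extends ToyRdd[T] {
    def numPartitions: Int = parent.numPartitions
    def dependencies: Seq[ToyRdd[_]] = Seq(parent)
    def compute(partition: Int): Seq[T] = parent.compute(partition).filter(keep)
  }
}
```

Because `FilteredRdd` stores only its parent and a function, a lost partition can be rebuilt by calling `compute` again, which is the lineage-based recovery the D-Streams slides rely on.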
Some Spark(ling) examples
Scala code (serial)
var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)
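Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle (a fraction of roughly PI/4). Hence, you get an approximate value for PI.
This is based on PS/PC = AS/AC = 4/PI, where PS and PC are the numbers of points in the square and in the circle and AS and AC are the corresponding areas, so PI = 4 * (PC/PS).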
Some Spark(ling) examples
Spark code (parallel)
val spark = new SparkContext(<Mesos master>)
// Shared counter that tasks can only add to
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
// Read the accumulated total on the driver
println("Pi is roughly " + 4 * count.value / 100000.0)
Notable points:
1. Spark context created – talks to the Mesos master (see note).
2. Count becomes a shared variable – an accumulator.
3. The for loop runs over an RDD – parallelize breaks the Scala range object (1 to 100000) into 12 slices.
4. Scala translates the for loop into a call to the foreach method of the RDD returned by parallelize.
Note: Mesos is an Apache incubated clustering system – http://mesosproject.org
Logistic Regression in Spark: Serial Code
// Read data file and convert it into Point objects
// (toList materializes the lines, since an Iterator can only be traversed once
// and the points are re-read in every iteration)
val lines = scala.io.Source.fromFile("data.txt").getLines().toList
val points = lines.map(x => parsePoint(x))
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)
Logistic Regression in Spark
// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)
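The slide code depends on helpers that are not shown (parsePoint, Vector, D, ITERATIONS). For readers who want to run the same gradient loop locally, here is a self-contained sketch using plain arrays. It keeps the slide's `scale` formula but makes two assumptions of its own: weights start at zero instead of random values so the run is deterministic, and a hypothetical `stepSize` parameter scales each update. Labels are assumed to be -1.0 or +1.0.

```scala
object LogisticRegressionSketch {
  // y is the label, assumed to be -1.0 or +1.0; x is the feature vector.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def train(points: Seq[Point], dims: Int, iterations: Int,
            stepSize: Double = 0.1): Array[Double] = {
    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to iterations) {
      // Full pass over the data per iteration; this is the part Spark keeps
      // in memory via cache() and parallelizes.
      val gradient = Array.fill(dims)(0.0)
      for (p <- points) {
        val scale = (1 / (1 + math.exp(-p.y * dot(w, p.x))) - 1) * p.y
        for (j <- 0 until dims) gradient(j) += scale * p.x(j)
      }
      w = w.zip(gradient).map { case (wj, gj) => wj - stepSize * gj }
    }
    w
  }

  // Predicts +1.0 or -1.0 from the sign of w . x.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0
}
```

Every iteration rereads all points, which is why the cached RDD in the Spark version matters: only the small weight vector and gradient move between iterations, while the data stays in memory on the workers.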