Machine Learning with Apache Spark
Mathijs Kattenberg, Jeroen Schot
PTC workshop, 2018-02-13
Transcript
Page 1:

Machine Learning with Apache Spark

Mathijs Kattenberg, Jeroen Schot

PTC workshop, 2018-02-13

Page 2:

About us

Mathijs Kattenberg

Technical consultant at SURFsara since 2013

● Working with Big Data technologies (Hadoop, Spark, Kafka)

Before:

● Scientific programmer at VU Amsterdam
● MSc Artificial Intelligence at VU Amsterdam

Jeroen Schot

Technical consultant at SURFsara since 2012

● Working with Big Data technologies (Hadoop, Spark, Kafka)

Before:

● MSc Physics at Utrecht University

Page 3:

Program for today

09:00 - 09:15 Welcome & introduction
09:15 - 10:30 Apache Spark core and structured APIs
10:30 - 10:45 Coffee break
10:45 - 12:00 Hands-on Jupyter notebooks
12:00 - 13:00 Lunch
13:00 - 14:30 Apache Spark MLlib
14:30 - 14:45 Coffee break
14:45 - 16:15 Hands-on Jupyter notebooks
16:15 - 16:30 Coffee break
16:30 - 17:00 Practical advice, summary

Page 4:

Apache Spark core and structured APIs

● Differences with traditional HPC approaches

● Distributed data processing

● Resilient Distributed Datasets (RDDs)

● DataFrames (DFs)

Page 5:

“Traditional” (scientific) software applications

Application developed as:

• Stand-alone binary application

• Assumes a specific environment (e.g. Linux OS, CLI)

• Operates on input files and parameters

• Produces output files

• Researcher specifies input files and params via CLI

Page 6:

Scaling “traditional” applications

Now whoever runs the application needs to:

• Distribute and split data

• Handle the faults and errors inherent in running at scale

• Submit and track applications

Page 7:

An example

Consider: from a tweet, we are interested in finding:

• Names of persons

• Names of organisations

• Locations and places

“I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!”

Page 8:

A straightforward implementation

• Store tweets on disk

• Small Python program uses NLTK and Stanford NER to tag

• Write output back to disk

Page 9:

But…

http://bit.ly/1rxKY0n

Page 10:

Scaling Bottlenecks

• Store tweets on disk: it will eventually fill, many readers

• Small Python program: it can process a tweet every few milliseconds to seconds, so we need to run many separate processes

• Write output back to disk: it will eventually fill, many writers

• Run separate processes: they all need input

Page 11:

Scalability: Design

• Data is growing faster than computing power and I/O

=> distributed computing necessary

• Most standard applications cannot run in a distributed fashion

=> applications need to be designed with scalability from the start

Page 12:

Machine Limits

Current system limits:

• ~256 CPU cores

• 2TB of RAM

• ~500 TB disk space

= expensive (cost does not scale linearly)

Scale out instead of scale up!

Page 13:

Parallel programming is hard

Page 14:

Scalability: Design

Idea: take a step back and consider:

• Work without mutable state

• Restrict the programming interface so that more can be done automatically.

Turns out: we can use ideas from functional programming and declarative languages

Page 15:

Scalable programs

Consider: declarative vs. imperative

Page 16:

Functional Programming

Restrict the programming interface so that the system can do more automatically. Use ideas from functional programming:

“Here is a function, apply it to all of the data”

• I don't care where it runs (the system should handle that)

• Feel free to run it twice on different nodes (no side effects!)

Page 17:
Page 18:

Map

Map takes as input a function and a list:

map(len, ['I', 'like', 'traffic', 'lights'])

which in Python returns [1, 4, 7, 6]

Page 19:

Reduce

Reduce takes as input a binary function and a list. A binary function is a function with two arguments, like add, subtract, multiply, etc.

def add(x, y):
    return x + y

reduce(add, [47, 11, 42, 13]) returns 113
(intermediate steps: 47 + 11 = 58, 58 + 42 = 100, 100 + 13 = 113)

Page 20:
Page 21:
Page 22:

MapReduce programming model

Input: set of input key/value pairs, a Map function and a Reduce function
Output: set of output key/value pairs

Map function is applied to every input pair to produce an intermediate key/value pair

All intermediate pairs are grouped by key

Reduce function is applied to every key and set of values for that key
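To make the model concrete, here is a minimal word-count sketch in plain Python (not from the original slides; map_fn and reduce_fn are illustrative names):

from collections import defaultdict

def map_fn(key, value):
    # Emit an intermediate (word, 1) pair for every word in the line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all counts for one word into a single output pair.
    yield (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):   # map phase
            grouped[ik].append(iv)    # group by intermediate key
    out = []
    for k, vs in grouped.items():     # reduce phase
        out.extend(reduce_fn(k, vs))
    return out

print(mapreduce([(0, "the cat sat"), (1, "the mat")], map_fn, reduce_fn))
# [('the', 2), ('cat', 1), ('sat', 1), ('mat', 1)]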

Page 23:

Hadoop MapReduce

Page 24:

MapReduce strengths

MapReduce framework handles a lot of work for its end-user:

● Splitting work in independent tasks
● Task scheduling, retrying on failure
● Data grouping/shuffling, in-memory/spilling to disk

Page 25:

MapReduce limitations

● Very low level: decomposing problems into (multiple) MapReduce jobs is hard

● Batch-oriented: unsuited for interactive use or realtime processing

● Disk sync: performance issues when chaining jobs (iterative algorithms)

Page 26:

Higher Level Frameworks

● SQL on Hadoop
● Pig: a dataflow DSL
● Dataflow API in Java
● Graph processing

All translated to MapReduce jobs

Page 27:

Apache Spark: a general framework

• Spark can be seen as a successor to Hadoop MapReduce: a simplified framework for writing large-scale, data-intensive applications

• Write programs in terms of distributed datasets and operations on them

• Accessible from multiple programming languages: Scala, Java, Python, and R (only via DataFrames)

Page 28:

Spark components

Page 29:

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 30:

Resilient Distributed Dataset (RDD)

• Abstraction for a collection of objects/elements/records

• Spread over many machines

• Built through parallel transformations

• Immutable

Page 31:

Creation of RDDs

• By transforming an existing RDD
• Through the SparkContext:
  - from an internal data structure
  - from reading in a file (HDFS or otherwise)

text = "This is a sample text."
textRDD = sc.parallelize(text)

lines = sc.textFile('../data/links.tsv')

Page 32:

Operations on RDDs

Transformations:

• Create new RDD

• Lazily computed

• Example: ‘map’, ‘filter’

Actions:

• Return some value or side-effect

• Triggers computation

• Example: ‘count’, ‘saveAsTextFile’
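A quick PySpark illustration of this laziness (assuming a SparkContext sc, as in the hands-on notebooks):

nums = sc.parallelize(range(1000))

# Transformation: only builds the lineage, nothing is computed yet.
squares = nums.map(lambda x: x * x)

# Actions: trigger the actual distributed computation.
print(squares.count())  # 1000
print(squares.take(3))  # [0, 1, 4]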

Page 33:

Transformations

RDDs can be created from other RDDs using transformations:

• map(f) Apply function f to each element of the RDD

• flatMap(f) Apply function f to each element of the RDD and unpack lists etc.

• filter(pred) Apply predicate pred to each element of the RDD and return those that pass pred

• distinct() Remove duplicate entries in RDD
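A small illustration of these transformations (again assuming sc; the collect() results are shown as comments):

words = sc.parallelize(["to be", "or not", "to be"])

words.map(lambda s: s.upper()).collect()
# ['TO BE', 'OR NOT', 'TO BE']
words.flatMap(lambda s: s.split()).collect()
# ['to', 'be', 'or', 'not', 'to', 'be']
words.filter(lambda s: "not" in s).collect()
# ['or not']
words.flatMap(lambda s: s.split()).distinct().collect()
# ['to', 'be', 'or', 'not']  (order may vary)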

Page 34:

Actions

• collect() Returns all elements of the RDD in a list

• count() Returns the number of elements in the RDD

• take(n) Returns the first n elements of the RDD

• reduce(f) Returns the result of combining all elements of the RDD with f
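The same four actions on a toy RDD:

rdd = sc.parallelize([47, 11, 42, 13])

rdd.collect()                   # [47, 11, 42, 13]
rdd.count()                     # 4
rdd.take(2)                     # [47, 11]
rdd.reduce(lambda x, y: x + y)  # 113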

Page 35:

map vs. flatMap

Page 36:

Example: reduce

• Takes a function with two arguments

• Different from Reduce as in MapReduce

• Aggregates all elements to a single value

Page 37:

reduceByKey

x is not the key but the accumulated value!

Page 38:

Pseudo set operations

Page 39:

Pair RDDs

• The elements of a Pair RDD are pairs (k,v)

• k is interpreted as the key, v as the value

• Very much like Hadoop’s MapReduce

• Pair RDDs have extra methods

Page 40:

Pair RDD transformations

• groupByKey() Returns an RDD with elements (key, list of values)

• reduceByKey(f(x,y)) Applies f to all values of each key (similar to Hadoop MapReduce)

• join(RDD) Joins two RDDs on their keys

• mapValues(f) Apply f to the values, not the keys of the RDD
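A sketch of these pair-RDD transformations (output ordering may vary):

sales = sc.parallelize([("NL", 3), ("DE", 5), ("NL", 2)])
names = sc.parallelize([("NL", "Netherlands"), ("DE", "Germany")])

sales.groupByKey().mapValues(list).collect()
# [('NL', [3, 2]), ('DE', [5])]
sales.reduceByKey(lambda x, y: x + y).collect()
# [('NL', 5), ('DE', 5)]
sales.join(names).collect()
# [('NL', (3, 'Netherlands')), ('NL', (2, 'Netherlands')), ('DE', (5, 'Germany'))]
sales.mapValues(lambda v: v * 10).collect()
# [('NL', 30), ('DE', 50), ('NL', 20)]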

Page 41:

Actions on pair RDDs
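Typical actions on pair RDDs include, for example (a sketch; these may differ from the slide's original table):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

pairs.countByKey()    # {'a': 2, 'b': 1}
pairs.collectAsMap()  # {'a': 3, 'b': 2}  (later values overwrite earlier ones)
pairs.lookup("a")     # [1, 3]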

Page 42:

Word Count

Input:

the cat sat on the mat
the aardvark sat on the sofa

Output:

aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4

Page 43:

lines = sc.textFile(file)
words = lines.flatMap(lambda s: s.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
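Nothing is computed until an action runs; for example (output path hypothetical):

counts.saveAsTextFile("output/wordcounts")  # action: triggers the whole chain
# or, for small results only:
print(sorted(counts.collect()))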

Page 44:
Page 45:
Page 46:
Page 47:
Page 48:

Actions vs. transformations

● Try to do as much as possible on executors

● Prefer transformations over actions

● Use collect() only on small data sets

Page 49:

RDD limitations

• Low level: a lot of key-value juggling

• Little room for optimizations by Spark (it cannot assume structure on the data - no schema)

• Good for unstructured data (text), but what if our data has structure (csv, json, table etc.)?

Page 50:

DataFrames

● A DataFrame is a distributed collection of data organized into named columns. Conceptually equivalent to a table in a relational database or a dataframe in R/Python Pandas.

● DataFrames can be constructed from a wide array of sources, such as structured data files, external databases, or existing RDDs.

Page 51:

DataFrames

• Collection of Row objects with schema

• Like RDDs, DataFrames are immutable

• Also distributed over machines in cluster

• Transformations and actions

• Lazy but schema is checked eagerly

• Spark makes use of schema information for query optimization
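A short sketch (assuming a SparkSession named spark): the schema is checked and available immediately, while the data itself is only materialized by actions.

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],
)

df.printSchema()               # schema is known eagerly
df.filter(df.age > 30).show()  # action: triggers computation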

Page 52:
Page 53:

Operations on DataFrames

• Not arbitrary functions, but a fixed set of operations that Spark understands and can optimize

• Like RDDs, transformations and actions

• Transformations like relational operators

• Also an SQL interface

Page 54:

DataFrame API

Relational operators, for example:

select
where
join
limit
groupBy
orderBy
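For example, on the hypothetical df with name and age columns from above:

(df.select("name", "age")
   .where(df.age > 30)
   .orderBy("age", ascending=False)
   .limit(10)
   .show())

df.groupBy("age").count().show()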

Page 55:

SparkSQL
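The SQL interface works roughly like this (a sketch, continuing with the hypothetical df and spark session):

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()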

Page 56:

Can you forget about RDDs?

In practice RDDs are used quite often together with DataFrames:

• When we need to tweak schemas
• When we need to clean or wrangle data
• When we want more control
• When we deal with unstructured data
• When we want something just a bit different

Page 57:

Hands-on with Jupyter notebooks

● https://prace.jove.surfsara.nl
● Username/password: see handout

Page 58:

About the environment

● You will be working in a Jupyter notebook environment
● The notebooks run on hardware at SURFsara and are accessible via the browser
● Spark is not connected to a cluster, but runs in local mode
● Each of you has an environment with:
  ○ 2 cores
  ○ 6 GB memory
● If you run multiple notebooks simultaneously you can run into out-of-memory errors, so shut down each notebook before starting the next

Page 59:

Apache Spark - MLlib and advanced

Page 60:

Spark: What Runs Where?

• At first glance: Spark code and RDD variables look local

• Important to keep track of local variables and references to distributed data (variables of type RDD)

Page 61:

An Executing Application

Page 62:

An Executing Application

Page 63:

An Executing Application

Page 64:

PySpark & Py4J

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Page 65:

Spark modes

● SparkContext: contains information about the cluster and is the link between your code and the cluster
● Local mode: single machine, using multiple cores. For testing and training purposes.
● Cluster mode:
  ○ Stand-alone: dedicated Spark cluster
  ○ Hadoop/YARN: cluster per application
  ○ Mesos: cluster per application, coarse- or fine-grained modes
  ○ Kubernetes: experimental
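Creating a context in local mode is a one-liner (a minimal sketch; the app name is arbitrary):

from pyspark import SparkContext

# 'local[2]' runs Spark on the current machine with 2 worker threads.
sc = SparkContext(master="local[2]", appName="workshop")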

Page 66:

Spark on ‘classical’ HPC clusters (SGE/Slurm/PBS)

● Spark was not designed to run on ‘classical’ HPC clusters
● Standard recipe:
  ○ create a multi-node job submission script
  ○ start the Spark master on node 0
  ○ start Spark executors on the other nodes
● Filesystem access/assumptions:
  ○ Access to a shared file system from all executors (bulk R/W)
  ○ Per-executor fast local disk for small-file I/O
● Hard to get this secure!
● Helper scripts:
  ○ https://github.com/LLNL/magpie
  ○ https://github.com/glennklockwood/myhadoop

Page 67:

From High Performance Spark, Holden Karau and Rachel Warren

Page 68:
Page 69:

Machine Learning: MLlib

Why another one?

Page 70:

Spark MLlib

• Scale: many data sets/models become too big for single machine

• Spark is good at training models in a distributed fashion

• Not so good at predicting with very low latency (startup overhead of Spark jobs)

Page 71:

Machine learning

1. Data exploration

2. Data preprocessing

3. Model training

4. Model evaluation

5. Model inspection

Page 72:

MLlib: Spark’s Machine Learning library

● The Apache Spark core distribution has included a machine learning library, ‘MLlib’, since its inception
● MLlib was based on the RDD API
● Spark 1.2 introduced a new package called spark.ml
● spark.ml is a high-level interface based on DataFrames
● Since Spark 2.0 both are called MLlib
  ○ The DataFrames API is the primary API
  ○ The RDD API is in maintenance mode
  ○ The RDD API is expected to be deprecated in 2.3 and removed in 3.0

Page 73:

MLlib data types

MLlib (RDD) uses some numerical data types backed by Breeze:

● Local vector
  ○ Dense and sparse vectors of doubles
● Labeled point
  ○ Local vector + a label, used by supervised learning algorithms
● Local matrix
  ○ Dense and sparse matrices stored on a single machine
● Distributed matrix
  ○ Row, column indices with double values stored in one or more RDDs
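In code (RDD-based API; a minimal sketch):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

dense = Vectors.dense([1.0, 0.0, 3.0])
# Sparse vector: size 3, non-zeros at indices 0 and 2.
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])

point = LabeledPoint(1.0, sparse)  # label + feature vector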

Page 74:

MLlib

Common machine learning algorithms on top of Spark:

• classification: SVM, Naive Bayes, Random Forests

• regression: logistic regression, decision trees, isotonic regression

• clustering: K-means, PIC, LDA

• collaborative filtering: alternating least squares

• dimensionality reduction: SVD, PCA

Page 75:

Pipeline stages

The pipeline concept is the basis of spark.ml and is based on the same idea as in scikit-learn. There are three main components:

● Transformer: transforms a DataFrame into a new DataFrame
● Estimator: needs fitting on data to produce a model (which is a Transformer)
● Pipeline: chains multiple Transformers and Estimators together

Page 76:

Pipeline

tok = Tokenizer().setInputCol("text").setOutputCol("words")

htf = HashingTF().setInputCol("words") \
    .setOutputCol("features") \
    .setNumFeatures(200)

lr = LogisticRegression().setMaxIter(10) \
    .setRegParam(0.3) \
    .setElasticNetParam(0.8)

pipeline = Pipeline().setStages([tok, htf, lr])
model = pipeline.fit(training_data)

Page 77:

Pipeline

[...]
pipeline = Pipeline().setStages([tok, htf, lr])
model = pipeline.fit(training_data)

predictions = model.transform(test_data)

Page 78:

Model selection (hyperparameter tuning)

Model selection can be done using the CrossValidator and TrainValidationSplit tools. They use as input:

● Estimator or Pipeline: the algorithm to optimize
● Set of parameter maps: the parameter grid
● Evaluator: metric of the performance of a model

CrossValidator and TrainValidationSplit are Estimators themselves!

Page 79:

Model selection

paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()

trainValidationSplit = TrainValidationSplit() \
    .setEstimator(pipeline) \
    .setEvaluator(RegressionEvaluator()) \
    .setEstimatorParamMaps(paramGrid) \
    .setTrainRatio(0.8)

model = trainValidationSplit.fit(training_data)

predictions = model.transform(test_data)

Page 80:

Extending MLlib

● You can write your own Estimators and Transformers
● They need to implement the pipeline interfaces
● They can be used in Pipelines and mixed with existing ones
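A minimal sketch of a custom Transformer (the class is hypothetical; a production version would also declare proper Params):

from pyspark.ml import Transformer

class ColumnDropper(Transformer):
    """Drops one column from the input DataFrame."""

    def __init__(self, col_to_drop):
        super(ColumnDropper, self).__init__()
        self.col_to_drop = col_to_drop

    def _transform(self, dataset):
        # Subclasses implement _transform(); transform() is inherited.
        return dataset.drop(self.col_to_drop)

# Can be mixed with built-in stages, e.g.:
# pipeline = Pipeline().setStages([tok, ColumnDropper("words"), htf, lr])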

Page 81:

Model selection as Pipeline

paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()

trainValidationSplit = TrainValidationSplit() \
    .setEstimator(pipeline) \
    .setEvaluator(RegressionEvaluator()) \
    .setEstimatorParamMaps(paramGrid) \
    .setTrainRatio(0.8)

model = trainValidationSplit.fit(training_data)

predictions = model.transform(test_data)

Page 82:

Additional I/O

● Reading/writing labeled data in LIBSVM format
  ○ Using MLUtils.loadLibSVMFile() and MLUtils.saveAsLibSVMFile()
● Models can be persisted after a job
  ○ Using the model.save() / model.load() methods
  ○ Internal Spark-only format
● Some models can be exported in PMML format
  ○ KMeansModel, LassoModel, LinearRegressionModel, LogisticRegressionModel, RidgeRegressionModel, SVMModel, StreamingKMeansModel
  ○ Importing PMML is not supported

https://www.csie.ntu.edu.tw/~cjlin/libsvm/
https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
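For example, with the RDD-based API (a sketch; paths and the model class are hypothetical):

from pyspark.mllib.util import MLUtils

# Load labeled points from a LIBSVM-format file.
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

# Persist a trained model and load it back later:
# model.save(sc, "models/my_model")
# same_model = SomeModelClass.load(sc, "models/my_model")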

Page 83:

Using MLlib workflow

1. Read the official website documentation:
   https://spark.apache.org/docs/2.1.1/ml-guide.html
2. Read the Python API docs:
   https://spark.apache.org/docs/2.1.1/api/python/index.html
3. Read the Scala API docs:
   https://spark.apache.org/docs/2.1.1/api/scala/index.html
4. Read the Scala source code:
   https://github.com/apache/spark/tree/master/mllib/src

Optional: consult Google, Stack Overflow, the Spark JIRA.

Make sure you read the documentation of your Spark version!

Page 84:

Alternatives to Spark MLlib

These libraries can use Spark as a backend and have their own API:

● Sparkling Water (H2O) - https://www.h2o.ai/sparkling-water/● DL4J - https://deeplearning4j.org/● Apache Mahout - https://mahout.apache.org/

Page 85:

Hands-on with Jupyter notebooks

● https://prace.jove.surfsara.nl
● Username/password: see handout

Page 86:

Practical advice & summary

Page 87:

Scala (vs. Python/Java)

To get the most out of Spark you should use Scala (or at least know a little):

● Scala performs better than Python
  ○ Python pays for dynamic typing and JVM communication
● The Scala API is nicer than the Java API
  ○ although this has improved with Java 8

But there are companies running PySpark in production, and with DataFrames the performance gap is smaller than with RDDs.

Page 88:

Community packages

spark-packages.org is an index of third-party Spark packages. Examples:

● graphframes: DataFrame-based Graphs
● elasticsearch-hadoop: integration with ElasticSearch
● thunder: neural data analysis framework
● spark-nlp: Natural Language Processing for Spark

Currently ‘only’ 394 packages; quality varies.

Page 89:

Serialization

Spark stores intermediate data in memory when needed/possible. There are three options:

● In-memory as deserialized Java objects
  ○ Fast, but can be inefficient with respect to space
● In-memory as serialized data (using Kryo)
  ○ More CPU-intensive, but memory-efficient
  ○ Not needed/possible for Python
● On-disk
  ○ When it doesn’t fit in memory, write to disk
  ○ Slow, but fault-tolerant

Page 90:

Serialization - manual control

● When using an RDD multiple times inside the same job, you want to control where/how this RDD is persisted
● Based on five attributes:
  ○ useDisk
  ○ useMemory
  ○ useOffHeap
  ○ deserialized
  ○ replication
● Controlled by calling rdd.persist(TYPE)
● When memory or disk are full, Spark uses a Least Recently Used (LRU) policy to delete partitions

Page 91:

Serialization - manual control

● Example: rdd.persist(DISK_ONLY_2)
  ○ useDisk = True
  ○ useMemory = False
  ○ useOffHeap = False
  ○ deserialized = False
  ○ replication = 2

Page 92:

Serialization - manual control

● Example: rdd.persist(MEMORY_ONLY_SER)
  ○ useDisk = False
  ○ useMemory = True
  ○ useOffHeap = False
  ○ deserialized = False
  ○ replication = 1

Page 93:

Serialization - when reading/writing

Reading data can be made more performant by writing it in a good format:

● A compression codec that favors (de)compression speed over compression ratio
  ○ Because of this, BZip2 is usually a bad choice
● A serialization format that stores the structure of the data

General advice:

● For RDDs, use Hadoop SequenceFile or ORCFile with LZO or Snappy compression
● For DataFrames, use the Parquet format
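For DataFrames this is a one-liner (paths hypothetical; Snappy is Parquet's default codec in recent Spark versions):

df.write.parquet("output/data.parquet")          # columnar, compressed
df2 = spark.read.parquet("output/data.parquet")  # schema travels with the data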

Page 94:

HDF5 / netCDF

● Official HDF5 Spark Connector (Beta) - https://www.hdfgroup.org/downloads/spark-connector
● Loading netCDF / HDF using SciSpark (NASA JPL) - https://scispark.jpl.nasa.gov/
● H5Spark - https://github.com/valiantljk/h5spark

Page 95:

MapPartitions

● RDD operations seen so far work on
  ○ single records (map, filter)
  ○ whole RDDs (join, union)
● mapPartitions works on a whole partition
● Allows sharing state between multiple records in the same partition
● Use this to share ‘expensive’ operations
  ○ creating a DB connection, initializing a Tokenizer, …
● Use this for secondary sorting or custom aggregations (be careful)

Page 96:

MapPartitions

● Input: iterator over the records in a single partition
● Output: iterator over the transformed records of this partition
● The full partition might not fit in memory, so avoid creating a list / full buffering

def tokenize(iter):
    tokenizer = StringTokenizer()  # expensive to start
    for line in iter:
        yield tokenizer.tokenize(line)

tokenized_rdd = rdd.mapPartitions(tokenize)

Page 97:

MapPartitions

● mapPartitions has an optional argument preservesPartitioning (False by default)

● Set this to True iff the function works on a PairRDD and doesn’t modify the keys

Page 98:

MapPartitions

mapPartitions can be used to implement many other transformations, such as map, flatMap and filter:

def do_map(iter):
    for i in iter:
        yield f(i)

def do_filter(iter):
    for i in iter:
        if p(i):
            yield i

def do_flatmap(iter):
    for i in iter:
        for x in f(i):  # f maps each element to an iterable, which is flattened
            yield x

Page 99:

RDD - data sources

SparkContext methods to read different data formats:

● textFile(path)
● wholeTextFiles(path)
● binaryFiles(path)
● binaryRecords(path, recordLength)
● newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass)
● sequenceFile(path)

Data can be on any Hadoop-supported filesystem (local, HDFS, S3) accessible to all executors.
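For example (paths hypothetical):

lines = sc.textFile("hdfs:///data/logs/*.log")     # RDD of lines
files = sc.wholeTextFiles("data/docs/")            # RDD of (filename, content)
blobs = sc.binaryFiles("data/images/")             # RDD of (filename, bytes)
records = sc.binaryRecords("data/fixed.bin", 128)  # fixed-length 128-byte records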

Page 100:

RDD - InputFormat example: the WARC file format

● ‘Standard’ file format for web archives
  ○ Used by the Internet Archive, the Library of Congress, CommonCrawl
● WARC file: concatenation of WARC records (separated by two newlines)
● WARC record: a header and a content block
● The header contains information such as type, date, length

How to read this into Spark?

https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
http://commoncrawl.org/2014/04/navigating-the-warc-file-format/

Page 101:

WARC file format

WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close

...HTML Content...

Page 102:

Reading WARC - attempt 1

● Read in whole files with sc.wholeTextFiles or sc.binaryFiles
● Use an existing WARC parsing library
● Use this library within flatMap or mapPartitions

Page 103:

Reading WARC - attempt 1

● Read in whole files with sc.wholeTextFiles or sc.binaryFiles
● Use an existing WARC parsing library
● Use this library within flatMap or mapPartitions

import warc
from pyspark import SparkContext

sc = SparkContext()
warc_files = sc.binaryFiles("input/*.warc")
warc_records = warc_files.flatMap(lambda f: [r for r in warc.read(f)])

Page 104:

Reading WARC - attempt 1

● Read in whole files with sc.wholeTextFiles or sc.binaryFiles
● Use an existing WARC parsing library
● Use this library within flatMap or mapPartitions

Concerns:

● What if a single WARC file is 1 GB in size? 10 GB? 100 GB?
● (Can my WARC library read from a byte array?)

Page 105:

Reading WARC - attempt 2

● Find or write a Hadoop InputFormat/RecordReader (in Java/Scala)
  ○ Using an existing WARC parsing library
● Read the data using sc.newAPIHadoopFile

from pyspark import SparkContext

sc = SparkContext()
warc_records = sc.newAPIHadoopFile(
    "input/*.warc",
    "nl.surfsara.warcutils.WarcInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
)

Page 106:

public class WarcRecordReader extends RecordReader<LongWritable, WarcRecord> {
    private DataInputStream in;
    private long start;
    private long pos;
    private long end;
    private Seekable filePosition;

    private CompressionCodecFactory compressionCodecs = null;
    private CompressionCodec codec;
    private Decompressor decompressor;

    private LongWritable key = null;
    private WarcRecord value = null;
    private WarcReader warcReader;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = context.getConfiguration();
        final Path file = split.getPath();

        start = split.getStart();
        end = start + split.getLength();
        compressionCodecs = new CompressionCodecFactory(conf);
        codec = compressionCodecs.getCodec(file);

        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(split.getPath());

        if (isCompressedInput()) {
            in = new DataInputStream(codec.createInputStream(fileIn, decompressor));
            filePosition = fileIn;
        } else {
            fileIn.seek(start);
            in = fileIn;
            filePosition = fileIn;
        }

        warcReader = WarcReaderFactory.getReaderUncompressed(in);

        warcReader.setWarcTargetUriProfile(WarcIOConstants.URIPROFILE);
        warcReader.setBlockDigestEnabled(WarcIOConstants.BLOCKDIGESTENABLED);
        warcReader.setPayloadDigestEnabled(WarcIOConstants.PAYLOADDIGESTENABLED);
        warcReader.setRecordHeaderMaxSize(WarcIOConstants.HEADERMAXSIZE);
        warcReader.setPayloadHeaderMaxSize(WarcIOConstants.PAYLOADHEADERMAXSIZE);

        this.pos = start;
    }

https://github.com/sara-nl/warcutils

Page 107:

    public boolean nextKeyValue() throws IOException {
        if (key == null) {
            key = new LongWritable();
        }
        pos = filePosition.getPos();
        key.set(pos);

        value = warcReader.getNextRecord();
        if (value == null) {
            return false;
        }
        return true;
    }

    @Override
    public LongWritable getCurrentKey() {
        return key;
    }

    @Override
    public WarcRecord getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() throws IOException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (getFilePosition() - start) / (float) (end - start));
        }
    }

    @Override
    public synchronized void close() throws IOException {
        try {
            if (in != null) {
                in.close();
            }
        } finally {
            if (decompressor != null) {
                CodecPool.returnDecompressor(decompressor);
            }
        }
    }

[...]

Page 108:

Reading WARC - attempt 2

● Find or write a Hadoop InputFormat/RecordReader (in Java/Scala)
  ○ Using an existing WARC parsing library
● Read the data using sc.newAPIHadoopFile

Concerns:

● What if a single WARC record is 1 GB in size? 10 GB? 100 GB?
  ○ Not suitable for any form of distributed computing?
● What if I don’t know Java/Scala?

Page 109:

Real example: unique IDs

● Problem: an algorithm expects records to have a unique integer ID for some field, but your dataset has a unique string column (email, username, …)
● Solution(?): Use the MonotonicallyIncreasingID function to add a new column to the DataFrame

from pyspark.sql.functions import monotonically_increasing_id

df = spark.read.csv("input/*")

df_with_ids = df.withColumn("new_id", monotonically_increasing_id())

Page 110:

Real example: unique IDs

● Problem: MonotonicallyIncreasingID generates 64-bit numbers, and the algorithm expects 32-bit numbers…
● Solution(?): We have fewer than 2^32 items, so just cast from Long to Int

from pyspark.sql.functions import monotonically_increasing_id

df = spark.read.csv("input/*")

df_with_ids = df.withColumn("new_id", monotonically_increasing_id().cast("int"))

Page 111:

Real example: unique IDs“The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.”

https://spark.apache.org/docs/2.1.1/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id

● Problem: by casting to Int we will have overlapping IDs!
● Solution: use the MLlib StringIndexer instead
● Alternative: convert to an RDD and use zipWithIndex

Page 112:

Real example: unique IDs“StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.”

https://spark.apache.org/docs/latest/ml-features.html#stringindexer

from pyspark.ml.feature import StringIndexer

df = spark.read.csv("input/*")
string_indexer = StringIndexer(inputCol="id", outputCol="new_id", handleInvalid="error")
model = string_indexer.fit(df)
df_with_ids = model.transform(df)

Page 113:

Real example: unique IDs“The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.”

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

df = spark.read.csv("input/*")
rdd_with_ids = df.rdd.zipWithIndex().map(...)
df_with_ids = rdd_with_ids.toDF(schema)

Page 114:

Real example: ALS & RegressionEvaluator

● After building an ALS model we try to use the standard RegressionEvaluator to calculate the RMSE when predicting the validation set
● RegressionEvaluator seems to always return ‘nan’
● Searching the Internet reveals:

“When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.

The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user of the validation set is missing in the training set, hence generating a few NaN estimation with transform method and NaN RegressionEvaluator's metrics too.”

https://issues.apache.org/jira/browse/SPARK-14489

● Reported 08-04-2016
● Fixed 28-02-2017 in version 2.2.0
● But we are running version 2.1.1 :(

Page 115:

Real example: ALS & RegressionEvaluator

● In version 2.2.0 ALS has an extra parameter coldStartStrategy, which can be nan (old behaviour) or drop (drop all rows with NaN predictions)
● Workaround for version 2.1.1: drop NaN rows manually between transform and evaluate, or subclass the model or estimator

predictions = model.transform(validate).dropna()
evaluator = RegressionEvaluator()
rmse = evaluator.evaluate(predictions)

class MyEvaluator(RegressionEvaluator):
    def evaluate(self, df, params=None):
        df = df.dropna()
        return super().evaluate(df, params)

predictions = model.transform(validate)
evaluator = MyEvaluator()
rmse = evaluator.evaluate(predictions)

Page 116:
Page 117:

Narrow transformations

Each child partition depends on a known subset of parent partitions

Page 118:

Wide transformations

● Wide transformations are the most expensive and should be avoided or optimized
● Wide transformations are caused by operations such as groupByKey, reduceByKey, sort and join
● Examples for optimizing:
  ○ Filter first
  ○ Use reduceByKey instead of groupByKey + map (see the sketch after this list)
● Join is one of the most expensive operations in Spark
  ○ Use distinct to prevent data explosion
  ○ Use cogroup instead of join

Page 119:

Joins with DataFrames

● Less control with DataFrames
● The execution plan is determined by Catalyst
● You cannot change the Partitioner
● The advice on preventing joins with non-unique keys still holds!
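One knob that does remain with DataFrames is the broadcast hint, which tells Catalyst to ship a small table to every executor instead of shuffling both sides (a sketch; small_df and big_df are hypothetical):

from pyspark.sql.functions import broadcast

# Broadcast the small side; avoids a full shuffle join.
joined = big_df.join(broadcast(small_df), "key")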

