Spark ML by Xebia (Spark Meetup, 11 June 2015)

Post on 27-Jul-2015


SPARK ML: A New High-Level API for MLlib

Spark 1.4.0 preview

Matthieu Blanc

Instructor, Spark Developer Training

@matthieublanc

MLLIB: Makes Machine Learning Easy and Scalable

Selection of Machine Learning Algorithms

Several design flaws:

• no first-class support for machine learning workflows/pipelines

• hard to make MLlib itself a scalable project

• lack of homogeneity across algorithm APIs

org.apache.spark.ml to the rescue!

Machine Learning

[Diagram: Train Dataset → Feature Engineering → (label, features) → ML Algorithm → Model; Test Dataset → Feature Engineering → (features) → Model → Predictions]

Machine Learning Pipeline

• Simple construction of ML workflow

• Inspect and debug it

• Tune parameters

• Re-run it on new data

DataFrames

org.apache.spark.ml

Key concepts

• DataFrames as ML datasets

• Abstractions:

• Transformers

• Estimators

• Evaluators

• Parameters API -> CrossValidator
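As a minimal sketch of the first key concept (assuming a Spark 1.4-era `sqlContext` is already in scope; the column names "label" and "features" are the spark.ml conventions), an ML dataset is nothing more than a DataFrame with the right columns:

```scala
import org.apache.spark.mllib.linalg.Vectors

// sqlContext is assumed to exist, as in a spark-shell session
import sqlContext.implicits._

// A two-row ML dataset: a double "label" column and a vector "features" column
val dataset = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
).toDF("label", "features")
```

Because the dataset is an ordinary DataFrame, the usual DataFrame operations (select, filter, show) remain available alongside the ML abstractions.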

Transformers

[Diagram: DataFrame → Transformer → DataFrame]

def transform(dataset: DataFrame): DataFrame

[Input columns: colA, colB, …, colX → output columns: colA, colB, …, colX, newCol]

Transformer Usage

// Add a categoryVec column to the DataFrame
// by applying a OneHotEncoder transformation to the column category
val classEncoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")
val newDataFrame = classEncoder.transform(dataFrame)

dataFrame: colA | colB | … | category: double

newDataFrame: colA | colB | … | category: double | categoryVec: vector

Transformers Examples

Normalizer

VectorAssembler

PolynomialExpansion

Model

Tokenizer

OneHotEncoder

HashingTF

Binarizer
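Transformers like these compose naturally by chaining `transform` calls, each one adding a column. A hedged sketch (the input DataFrame `textDF` with a "text" column is assumed, and the feature count is illustrative):

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split the "text" column into a "words" column
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Hash the words into a fixed-size feature vector
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1000)

// textDF is assumed to exist; each transform returns a new DataFrame
val withWords = tokenizer.transform(textDF)
val withFeatures = hashingTF.transform(withWords)
```

This manual chaining is exactly what the Pipeline abstraction, shown later, automates.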

Estimators

[Diagram: DataFrame → Estimator → Model]

def fit(dataset: DataFrame): Model

[Input columns: label: double, features: vector, …]

Model extends Transformer

Model is a Transformer

[Diagram: DataFrame → Model → DataFrame]

def transform(dataset: DataFrame): DataFrame

[Input columns: features: vector, … → output columns: features: vector, prediction: double, …]

Estimator + Model Usage

// Apply logisticRegression on a training dataset to create a model
// used to compute predictions on a test dataset
val logisticRegression = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
// train
val lrModel = logisticRegression.fit(trainDF)
// predict
val newDataFrameWithPredictions = lrModel.transform(testDF)

Estimators Examples

StringIndexer

StandardScaler

CrossValidator

Pipeline

LinearRegression

LogisticRegression

DecisionTreeClassifier

RandomForestClassifier

GBTClassifier

ALS
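StringIndexer, for instance, illustrates why even a feature-engineering step can be an Estimator: it must first scan the data to build its category-to-index mapping, and the fitted Model then reapplies that same mapping elsewhere. A hedged sketch (`trainDF` and `testDF` with a "class" column are assumed):

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("class")
  .setOutputCol("classIndex")

// fit() scans trainDF to build the string-to-index mapping
val indexerModel = indexer.fit(trainDF)

// the Model is a Transformer: the SAME mapping is applied to the test data
val indexedTest = indexerModel.transform(testDF)
```

Fitting on the training set and reusing the Model on the test set keeps the encoding consistent between the two.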

Evaluators

[Diagram: DataFrame → Evaluator → Metric (Double)]

Example metrics: area under ROC curve, area under PR curve, root mean squared error

def evaluate(dataset: DataFrame): Double

[Input columns: label: Double, prediction: Double, …]

Evaluator Usage

// Area under the ROC curve for the validation set
val evaluator = new BinaryClassificationEvaluator()
println(evaluator.evaluate(dataFrameWithLabelAndPrediction))

Evaluators Examples

RegressionEvaluator

BinaryClassificationEvaluator

Pipeline

[Diagram: Train Dataset → Feature Engineering → ML Algorithm → Model; Test Dataset → Feature Engineering → Model → Predictions]

Pipeline

[Diagram: a Pipeline chains Transformers and Estimators and is fit on a DataFrame; the result is a PipelineModel, itself a chain of Transformers that maps a DataFrame to a DataFrame]

Pipeline is an Estimator

Pipeline Usage

// The stages of our pipeline
val classEncoder = new OneHotEncoder()
  .setInputCol("class")
  .setOutputCol("classVec")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("age", "fare", "classVec"))
  .setOutputCol("features")
val logisticRegression = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
// the pipeline
val pipeline = new Pipeline()
  .setStages(Array(classEncoder, vectorAssembler, logisticRegression))
// train
val pipelineModel = pipeline.fit(trainSet)
// predict
val validationPredictions = pipelineModel.transform(testSet)

CrossValidator

Given:

• an Estimator

• a Parameter Grid

• an Evaluator

Find the Model with the best Parameters

CrossValidator is also an Estimator

CrossValidator Usage

// We will cross-validate our pipeline
val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
// The params we want to test
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(2, 5, 1000))
  .addGrid(logisticRegression.regParam, Array(1, 0.1, 0.01))
  .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
  .build()
crossValidator.setEstimatorParamMaps(paramGrid)
// We will use a 3-fold cross-validation
crossValidator.setNumFolds(3)
// train
val cvModel = crossValidator.fit(trainSet)
// predict with the best model
val testSetWithPrediction = cvModel.transform(testSet)

DEMO: https://github.com/mblanc/spark-ml

Conclusion

[Diagram, today vs. tomorrow: o.a.spark.ml builds on DataFrames; o.a.spark.mllib builds on RDDs]

Summary

• Integration with DataFrames

• Familiar API based on scikit-learn

• Simple parameter tuning

• Schema validation

• User-defined Transformers and Estimators

• Composable and DAG Pipelines
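A user-defined Transformer can be sketched with the `UnaryTransformer` helper, which handles the schema plumbing so only the per-value function and output type need to be supplied. This is a hedged sketch: the class name `Lowercaser` and its column semantics are illustrative, not part of the presented API.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Illustrative custom Transformer: lowercases a string column
class Lowercaser(override val uid: String)
    extends UnaryTransformer[String, String, Lowercaser] {

  def this() = this(Identifiable.randomUID("lowercaser"))

  // the function applied to each value of the input column
  override protected def createTransformFunc: String => String =
    _.toLowerCase

  // the type of the generated output column
  override protected def outputDataType: DataType = StringType
}
```

Once defined, it is configured and used exactly like the built-in Transformers (`setInputCol`, `setOutputCol`, `transform`), and can be placed in a Pipeline alongside them.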

1.4.1? 1.5.0?

MERCI (Thank you!)