Márton Balassi Streaming ML with Flink-

Date post: 16-Apr-2017
Upload: flink-forward
1 © Cloudera, Inc. All rights reserved. Marton Balassi | Solutions Architect @ Cloudera* @MartonBalassi | [email protected] Judit Feher | Data Scientist @ MTA SZTAKI [email protected] *Work carried out while employed by MTA SZTAKI on the Streamline project. Streaming ML with Flink This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.
Transcript
Page 1: Márton Balassi Streaming ML with Flink-

© Cloudera, Inc. All rights reserved.

Marton Balassi | Solutions Architect @ Cloudera* | @MartonBalassi | [email protected]

Judit Feher | Data Scientist @ MTA SZTAKI | [email protected]

*Work carried out while employed by MTA SZTAKI on the Streamline project.

Streaming ML with Flink

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.

Page 2: Márton Balassi Streaming ML with Flink-

Outline

• Current FlinkML API through an example
• Adding streaming predictors
• Online learning
• Use cases in the Streamline project
• Summary

Page 3: Márton Balassi Streaming ML with Flink-

FlinkML example usage

import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment
// (user, item, rating) triples
val trainData = env.readCsvFile[(Int, Int, Double)](trainFile)
val testData = env.readTextFile(testFile).map(_.toInt)

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(trainData)

val prediction = model.predict(testData)
prediction.print()

Given a historical (training) dataset of user preferences, let us recommend desirable items for a set of users.

Design motivated by the scikit-learn API. More at http://arxiv.org/abs/1309.0238.

Page 4: Márton Balassi Streaming ML with Flink-

FlinkML example usage

val env = ExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int, Int, Double)](trainFile)
val testData = env.readTextFile(testFile).map(_.toInt)  // batch input

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(trainData)

val prediction = model.predict(testData)
prediction.print()

Given a historical (training) dataset of user preferences, let us recommend desirable items for a set of users.

This is a batch input. But does it need to be?

Page 5: Márton Balassi Streaming ML with Flink-

A little recommender theory

[Figure: the user-item matrix R (users U by items I) with user factors P and item factors Q, plus user side information and item side information]

• R is potentially huge, approximate it with P ∗ Q
• Prediction is TopK(user’s row ∗ Q)
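
The TopK prediction above can be sketched in plain Scala. This is a minimal illustration, not FlinkML code: the object name `TopKSketch`, the in-memory `Map` holding the item factors, and the tie-breaking rule are all assumptions made for the example.

```scala
object TopKSketch {
  type Vec = Array[Double]

  // Dot product of two factor vectors of equal length.
  def dot(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.indices.map(k => a(k) * b(k)).sum
  }

  // Scores every item for one user and keeps the ids of the k best items.
  // userFactors is the user's row of P; itemFactors maps item id -> column of Q.
  def topK(userFactors: Vec, itemFactors: Map[Int, Vec], k: Int): Seq[Int] =
    itemFactors.toSeq
      .map { case (item, q) => (item, dot(userFactors, q)) }
      // Highest score first; ties are broken by the smaller item id (an assumption).
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
}
```

Only one user vector and the item factors are needed per request, which is why a prediction can be served record by record once the factors exist.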

Page 6: Márton Balassi Streaming ML with Flink-

Prediction is a natural fit for streaming.

Page 7: Márton Balassi Streaming ML with Flink-

A closer (schematic) look at the API

trait PredictDataSetOperation[Self, Testing, Prediction] {
  // DataSet-level prediction
  def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
}

trait PredictOperation[Instance, Model, Testing, Prediction] {
  def getModel(instance: Instance): DataSet[Model]
  // Record-level prediction: one input value, one prediction
  def predict(value: Testing, model: Model): Prediction
}

A DataSet-level and a record-level API to implement the algorithm (prediction is always done on a model already trained)

The record-level version is arguably more convenient. It is wrapped into a default DataSet-level implementation.

Page 8: Márton Balassi Streaming ML with Flink-

A closer (schematic) look at the API

trait Estimator {
  def fit[Training](training: DataSet[Training])(implicit f: FitOperation[Training]) = {
    f.fit(training)
  }
}

trait Transformer extends Estimator {
  def transform[I, O](input: DataSet[I])(implicit t: TransformDataSetOperation[I, O]) = {
    t.transform(input)
  }
}

trait Predictor extends Estimator {
  def predict[Testing](testing: DataSet[Testing])(implicit p: PredictDataSetOperation[Testing]) = {
    p.predict(testing)
  }
}

Three well-picked traits go a long way
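
The implicit-operation pattern behind these traits can be tried without Flink. The sketch below is an illustration under stated assumptions: `Data[T]` is a plain `Seq` standing in for `DataSet`, and `PredictOp`, `ConstantModel` and `predictInts` are hypothetical names invented for the example.

```scala
object TypeClassSketch {
  // Stand-in for DataSet[T]; an assumption that keeps the sketch free of Flink.
  type Data[T] = Seq[T]

  // The operation is a type class: one instance per (algorithm, input type) pair.
  trait PredictOp[Self, Testing, Prediction] {
    def predictData(instance: Self, input: Data[Testing]): Data[Prediction]
  }

  trait Predictor[Self] { self: Self =>
    // The compiler selects the matching operation from implicit scope.
    def predict[Testing, Prediction](input: Data[Testing])(
        implicit op: PredictOp[Self, Testing, Prediction]): Data[Prediction] =
      op.predictData(this, input)
  }

  // Hypothetical toy algorithm: predicts the same rating for every user id.
  class ConstantModel(val rating: Double) extends Predictor[ConstantModel]

  object ConstantModel {
    // Found automatically because it lives in the companion object.
    implicit val predictInts: PredictOp[ConstantModel, Int, (Int, Double)] =
      new PredictOp[ConstantModel, Int, (Int, Double)] {
        def predictData(m: ConstantModel, input: Data[Int]): Data[(Int, Double)] =
          input.map(user => (user, m.rating))
      }
  }
}
```

Calling `new ConstantModel(4.5).predict(Seq(1, 2))` resolves `predictInts` at compile time, while `predict(Seq("a"))` does not compile because no operation exists for `String` input. Supporting a new input type means adding another implicit instance, not changing the model class.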

Page 9: Márton Balassi Streaming ML with Flink-

Could we share the model with a streaming job?

Page 10: Márton Balassi Streaming ML with Flink-

Learn in batch, predict in streaming

val env = ExecutionEnvironment.getExecutionEnvironment
val strEnv = StreamExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int, Int, Double)](trainFile)
// The test data now arrives as a stream, so it is read from the streaming environment
val testData = strEnv.socketTextStream(...).map(_.toInt)

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(trainData)

val prediction = model.predictStream(testData)
prediction.print()

Page 11: Márton Balassi Streaming ML with Flink-

A closer (schematic) look at the streaming API

trait PredictDataSetOperation[Self, Testing, Prediction] {
  def predictDataSet(instance: Self, input: DataSet[Testing]): DataSet[Prediction]
}

trait PredictDataStreamOperation[Self, Testing, Prediction] {
  def predictDataStream(instance: Self, input: DataStream[Testing]): DataStream[Prediction]
}

• Implicit conversions from the batch Predictors to StreamPredictors
• The model is stored, then loaded into a stateful RichMapFunction processing the input stream
• Default wrapper implementations to support both the DataStream-level and the record-level implementations
• Adding the streaming predictor implementation for an algorithm given the batch one is trivial
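
The default-wrapper idea can be illustrated in plain Scala: the algorithm supplies only a record-level operation, and generic wrappers lift it to batch and to streaming input. This is a sketch under assumptions, not the talk's implementation: `Seq` stands in for `DataSet`, `Iterator` for `DataStream`, and `RecordPredictOp`, `BiasModel` and `biasPredict` are hypothetical names.

```scala
object StreamPredictSketch {
  // Record-level operation: the only part an algorithm has to implement.
  trait RecordPredictOp[Model, Testing, Prediction] {
    def predict(value: Testing, model: Model): Prediction
  }

  // Default batch wrapper (Seq stands in for DataSet).
  def predictBatch[M, T, P](model: M, input: Seq[T])(
      implicit op: RecordPredictOp[M, T, P]): Seq[P] =
    input.map(op.predict(_, model))

  // Default streaming wrapper (Iterator stands in for DataStream).
  // The model is captured once and applied lazily to each arriving record,
  // loosely mirroring a map function that holds the model as state.
  def predictStream[M, T, P](model: M, input: Iterator[T])(
      implicit op: RecordPredictOp[M, T, P]): Iterator[P] =
    input.map(op.predict(_, model))

  // Hypothetical model: a per-user bias assumed to be learned elsewhere in batch.
  final case class BiasModel(bias: Map[Int, Double], default: Double)

  implicit val biasPredict: RecordPredictOp[BiasModel, Int, Double] =
    new RecordPredictOp[BiasModel, Int, Double] {
      def predict(user: Int, model: BiasModel): Double =
        model.bias.getOrElse(user, model.default)
    }
}
```

After `import StreamPredictSketch._`, the same `BiasModel` serves both `predictBatch(model, Seq(1, 2))` and `predictStream(model, Iterator(2, 1))` with no algorithm-specific streaming code.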

Page 12: Márton Balassi Streaming ML with Flink-

Recommender systems in batch vs online learning

• “30M” music listening dataset crawled by the CrowdRec team
• Implicit, timestamped music listening dataset
• Each record contains: [timestamp, user, artist, album, track, …]
• We always recommend and learn when the user interacts with an item for the first time
• ~50,000 users, ~100,000 artists, ~500,000 tracks
• This happens when we shuffle the time
• A partially batch online system
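
The "recommend, then learn, on the first interaction" protocol can be sketched as a single pass over the events in time order. The model here is a simple item-popularity counter, an assumption chosen only to keep the example short; it is not the recommender used in the talk. `Event`, `hitRate` and `PrequentialSketch` are hypothetical names.

```scala
object PrequentialSketch {
  final case class Event(timestamp: Long, user: Int, item: Int)

  // Fraction of first-time events whose item appeared in the top-k list
  // produced just before that event was used for learning.
  def hitRate(events: Seq[Event], k: Int): Double = {
    val counts = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
    val seen = scala.collection.mutable.Set.empty[(Int, Int)]
    var hits = 0
    var evaluated = 0
    for (e <- events.sortBy(_.timestamp)) {
      if (seen.add((e.user, e.item))) { // true only for a first interaction
        // 1. Recommend with the model as it was before this event.
        val topK = counts.toSeq
          .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
          .take(k)
          .map(_._1)
        if (topK.contains(e.item)) hits += 1
        evaluated += 1
        // 2. Learn from the event immediately afterwards.
        counts(e.item) += 1
      }
    }
    if (evaluated == 0) 0.0 else hits.toDouble / evaluated
  }
}
```

Because every event is scored before it is learned from, the measure depends on the order of the events, which is what makes shuffled timestamps a meaningful comparison.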

Page 13: Márton Balassi Streaming ML with Flink-

Use cases in the Streamline project

Judit Fehér, Hungarian Academy of Sciences

Page 14: Márton Balassi Streaming ML with Flink-

How iALS works and why it is different from ALS

ALS

Problem to solve: $R_i = P^T Q_i$ – Linear regression

Error function

$$L = \|R - \hat{R}\|_{Frob}^2 + \lambda_P \|P\|_{Frob}^2 + \lambda_Q \|Q\|_{Frob}^2$$

Implicit error function

$$L = \sum_{u=1,i=1}^{S_U,S_I} w_{u,i} \left(\hat{r}_{u,i} - r_{u,i}\right)^2 + \lambda_P \sum_{u=1}^{S_U} \|P_u\|^2 + \lambda_Q \sum_{i=1}^{S_I} \|Q_i\|^2$$

• Weighted MSE

• $w_{u,i} = \begin{cases} w_{u,i} & \text{if } (u,i) \in T \\ w_0 & \text{otherwise} \end{cases}$, with $w_0 \ll w_{u,i}$

• Typical weights: $w_0 = 1$, $w_{u,i} = 100 \cdot supp(u,i)$

• What does it mean?
– Create two matrices from the events
– (1) Preference matrix
  • Binary
  • 1 represents the presence of an event
– (2) Confidence matrix
  • Interprets our certainty on the corresponding values in the first matrix
  • Negative feedback is much less certain
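
The two matrices and the weighted error can be made concrete with a small plain-Scala sketch. It evaluates the implicit loss for given factors and does not perform the alternating least squares solve. The dense double loop, the `events` map of event counts, and the parameter names `w0` and `scale` are assumptions for illustration; the defaults follow the typical weights quoted above.

```scala
object ImplicitLossSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(k => a(k) * b(k)).sum
  def sqNorm(a: Vec): Double = dot(a, a)

  // events: (user, item) -> number of observed events for that pair, i.e. supp(u, i).
  // p(u) and q(i) are the factor vectors of user u and item i.
  def loss(events: Map[(Int, Int), Int], p: Array[Vec], q: Array[Vec],
           lambdaP: Double, lambdaQ: Double,
           w0: Double = 1.0, scale: Double = 100.0): Double = {
    val errors = for {
      u <- p.indices
      i <- q.indices
    } yield {
      val supp = events.getOrElse((u, i), 0)
      val r = if (supp > 0) 1.0 else 0.0         // preference matrix entry (binary)
      val w = if (supp > 0) scale * supp else w0 // confidence matrix entry
      val rHat = dot(p(u), q(i))
      w * (rHat - r) * (rHat - r)
    }
    // Weighted squared error plus L2 regularization of both factor sets.
    errors.sum + lambdaP * p.map(sqNorm).sum + lambdaQ * q.map(sqNorm).sum
  }
}
```

Unlike the explicit case, every user-item pair contributes a term: unobserved pairs count as weak negatives with weight `w0`, so a real implementation has to avoid this dense loop.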

Page 15: Márton Balassi Streaming ML with Flink-

Machine learning: batch, streaming? Combined?

Streaming recommender

• Online learning

• Update immediately, e.g. with large learning rate

• Data streaming

• Read training/testing data only once, no chance to store

• Real time / Interactive

+ More timely, adapts fast

- Challenging to implement

Batch recommender

• Read all training data multiple times

• Stochastic gradient: use multiple times in random order

• Elaborate optimization procedures, e.g. SVM

+ More accurate (?)

+ Easy to implement (?)
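
As an illustration of "update immediately", the sketch below performs one stochastic gradient step of matrix factorization for a single event. It assumes a squared-error objective with L2 regularization; `OnlineSgdSketch`, `update` and the parameter names are hypothetical and not taken from the talk.

```scala
object OnlineSgdSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(k => a(k) * b(k)).sum

  // One gradient step on (r - p.q)^2 + lambda * (|p|^2 + |q|^2).
  // Both vectors are updated in place, each using the other's value from before the step.
  def update(p: Vec, q: Vec, r: Double, learningRate: Double, lambda: Double): Unit = {
    val err = r - dot(p, q)
    for (k <- p.indices) {
      val pk = p(k)
      val qk = q(k)
      p(k) = pk + learningRate * (err * qk - lambda * pk)
      q(k) = qk + learningRate * (err * pk - lambda * qk)
    }
  }
}
```

A streaming recommender would call `update` once per arriving event and then discard the event, whereas a batch recommender would loop over the stored events several times in random order.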

Page 16: Márton Balassi Streaming ML with Flink-

Contextualized recommendation (NMusic)

Social recommendation Geo recommendation

R. Palovics, A. A. Benczur, L. Kocsis, T. Kiss, E. Frigo. "Exploiting temporal influence in online recommendation", ACM RecSys (2014)

Palovics, Szalai, Kocsis, Pap, Frigo, Benczur. "Location-Aware Online Learning for Top-k Hashtag Recommendation", LocalRec (2015)

Page 17: Márton Balassi Streaming ML with Flink-

Internet Memory Research use cases

Identify events that influence consumer behavior (product purchases, media consumption)

Events influence people
– Before a football match, people buy beer, chips, …

Specific events influence specific people (requires user profiles)
– A football fan does not play Angry Birds during a football match

Annotation by logistic regression
– Train over data at rest
– Streaming prediction at crawl time
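
The split between training at rest and predicting on the stream can be sketched in plain Scala. This is a minimal illustration under assumptions: feature vectors are dense arrays, training is plain stochastic gradient descent on the log-loss, and `LogisticSketch`, `train` and `annotate` are hypothetical names, not the project's implementation.

```scala
object LogisticSketch {
  type Vec = Array[Double]

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability that the record x belongs to the positive class.
  def predict(w: Vec, x: Vec): Double =
    sigmoid(w.indices.map(k => w(k) * x(k)).sum)

  // Batch training: several SGD passes over data at rest. Labels are 0.0 or 1.0.
  def train(data: Seq[(Vec, Double)], dim: Int, epochs: Int, learningRate: Double): Vec = {
    val w = Array.fill(dim)(0.0)
    for (_ <- 1 to epochs; (x, y) <- data) {
      val err = y - predict(w, x)
      for (k <- w.indices) w(k) += learningRate * err * x(k)
    }
    w
  }

  // Streaming prediction: the trained weights are applied to each record as it arrives.
  def annotate(w: Vec, stream: Iterator[Vec]): Iterator[Boolean] =
    stream.map(x => predict(w, x) >= 0.5)
}
```

The weights are computed once from stored data and then only read, so `annotate` never needs the training set and can run per record as items are crawled.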

Page 18: Márton Balassi Streaming ML with Flink-

Portugal Telecom use cases

MEO quadruple-play

Features
– Internet
– TV (IPTV)
– Mobile phone
– Landline phone

Current challenges
– Heterogeneous data
– Heterogeneous technical solutions
– Customer profiling
– Cross-domain recommendation
– 1 TB/day

Page 19: Márton Balassi Streaming ML with Flink-

Rovio use cases

Page 20: Márton Balassi Streaming ML with Flink-

Development at SZTAKI

iALS
- Flink already has explicit ALS
- The implementation of the implicit version is done
- Currently testing the algorithm's accuracy

Matrix factorization
- Distributed algorithm*
- We have a working prototype tested on smaller matrices, but it still needs optimization

Logistic regression
- Implementation in progress
- It is based on stochastic gradient descent, but in Flink there is only a batch version
- Currently working on the gradient descent implementation

Metrics
- Implementation and testing is finished
- We need to create a pull request

*R. Gemulla et al., "Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent", KDD 2011.

Page 21: Márton Balassi Streaming ML with Flink-

Summary

• Scala is a great tool for building DSLs
• FlinkML’s API is motivated by scikit-learn
• Streaming is a natural fit for ML predictors
• Online learning can outperform batch in certain cases
• The Streamline project builds on Flink, aims to contribute back as much of the results as possible

