Date post: 22-Sep-2020
IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan

IoT Big Data Stream Mining

Tutorial BDA 2017

Albert Bifet (@abifet), Gianmarco De Francisci Morales,

Joao Gama and Wei Fan

Outline

• IoT Fundamentals of Stream Mining
  • IoT Setting
  • Classification
  • Concept Drift
  • Regression
  • Clustering
  • Frequent Itemset Mining
• IoT Distributed Stream Mining
  • Distributed Stream Processing Engines
  • Classification
  • Regression
• Open Source Tools
• Applications
• Conclusions

Part I: IoT Fundamentals of Stream Mining

IoT Setting

INTERNET OF THINGS

IoT: sensors and actuators connected by networks to computing systems.

• Gartner predicts 20.8 billion IoT devices by 2020.
• IDC projects 32 billion IoT devices by 2020.

Applications IoT Analytics

IoT versus Big Data

IoT Applications For Energy Management

IoT Applications For Connected/Smart Home

IoT Applications For Smart Cities

IoT Applications For Industrial Automation

Analytic Standard Approach

Finite training sets

Static models

[Figure: a classifier algorithm builds a Model from a Data Set]

Data Stream Approach

Infinite training sets

Dynamic models

[Figure: a stream of data chunks D, each followed by an "Update Model" step that yields a new model M]

Pain Points

• Need to retrain!
• Things change over time
• How often?
• Data unused until next update!
• Value of data wasted

IoT Stream Mining

• Maintain models online
• Incorporate data on the fly
• Unbounded training sets
• Resource efficient
• Detect changes and adapt
• Dynamic models

Streaming Algorithms

Example

Approximation Algorithms

• General idea, good for streaming algorithms
• Small error ε with high probability 1-δ
• True hypothesis H, and learned hypothesis Ĥ
• Pr[ |H - Ĥ| < ε|H| ] > 1-δ

Approximation Algorithms

• What is the largest number that we can store in 8 bits?

Predictive Learning

• Classification
• Regression
• Concept Drift

Classification

Definition

Given a set of training examples belonging to n_C different classes, a classifier algorithm builds a model that predicts, for every unlabeled instance x, the class C to which it belongs.

Examples
• Email spam filter
• Activities of smartphone users

Photo: Stephen Merity http://smerity.com

Process

• One example at a time, used at most once
• Limited memory
• Limited time
• Anytime prediction

Naïve Bayes

• Based on Bayes’ theorem
• Probability of observing feature x_i given class C
• Prior class probability P(C)
• Just counting!

$\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}}$

$P(C \mid x) = \dfrac{P(x \mid C)\, P(C)}{P(x)}$

$P(C \mid x) \propto P(C) \prod_{x_i \in x} P(x_i \mid C)$

$C = \arg\max_{C} P(C \mid x)$
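The "just counting" remark can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the class name `StreamingNaiveBayes`, the restriction to categorical features, and the Laplace smoothing term are assumptions added here.

```python
from collections import defaultdict
import math


class StreamingNaiveBayes:
    """Incremental Naive Bayes for categorical features (illustrative sketch)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing              # Laplace smoothing (assumption)
        self.class_counts = defaultdict(int)    # n(C)
        self.feature_counts = defaultdict(int)  # n(i, value, C)
        self.values = defaultdict(set)          # distinct values seen per feature
        self.n = 0                              # examples seen so far

    def learn_one(self, x, y):
        """Update the counters with one labeled example; x is a tuple of values."""
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[(i, v, y)] += 1
            self.values[i].add(v)

    def predict_one(self, x):
        """Return argmax_C of log P(C) + sum_i log P(x_i | C)."""
        best, best_score = None, -math.inf
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / self.n)
            for i, v in enumerate(x):
                num = self.feature_counts[(i, v, c)] + self.smoothing
                den = n_c + self.smoothing * max(len(self.values[i]), 1)
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best
```

Each example touches only a handful of counters and is then discarded, which matches the one-pass, limited-memory process described earlier.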

Perceptron

[Figure: five inputs (Attribute 1 to Attribute 5), weighted by w_1, …, w_5, feed a single output $h_{\vec{w}}(\vec{x}_i)$]

• Linear classifier
• Data stream: $\langle x_i, y_i \rangle$
• $\tilde{y}_i = h_w(x_i) = \sigma(w_i^T x_i)$
• $\sigma(x) = 1/(1+e^{-x})$, $\sigma'(x) = \sigma(x)(1-\sigma(x))$
• Minimize MSE: $J(w) = \frac{1}{2}\sum (y_i - \tilde{y}_i)^2$
• SGD: $w_{i+1} = w_i - \eta \nabla J\, x_i$
• $\nabla J = -(y_i - \tilde{y}_i)\,\tilde{y}_i (1 - \tilde{y}_i)$
• $w_{i+1} = w_i + \eta (y_i - \tilde{y}_i)\,\tilde{y}_i (1 - \tilde{y}_i)\, x_i$

Perceptron Learning

PERCEPTRON LEARNING(Stream, η)
  for each class
    do PERCEPTRON LEARNING(Stream, class, η)

PERCEPTRON LEARNING(Stream, class, η)
  ▷ Let w_0 and w be randomly initialized
  for each example (x, y) in Stream
    do if class = y
         then δ = (1 − h_w(x)) · h_w(x) · (1 − h_w(x))
         else δ = (0 − h_w(x)) · h_w(x) · (1 − h_w(x))
       w = w + η · δ · x

PERCEPTRON PREDICTION(x)
  return arg max_class h_{w_class}(x)
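The pseudocode above could be written in Python roughly as below. This is a sketch under assumptions: the class name `OnlinePerceptron`, the explicit bias weight stored last in each vector, the initialization range, and the default `eta` are illustrative choices, not taken from the slides.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlinePerceptron:
    """One sigmoid perceptron per class, trained one example at a time."""

    def __init__(self, n_features, classes, eta=0.5, seed=0):
        rng = random.Random(seed)
        self.eta = eta
        # One weight vector per class; the last entry is the bias w0.
        self.w = {
            c: [rng.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
            for c in classes
        }

    def _h(self, c, x):
        w = self.w[c]
        # zip stops at len(x), so the bias w[-1] is added separately.
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

    def learn_one(self, x, y):
        for c, w in self.w.items():
            h = self._h(c, x)
            target = 1.0 if c == y else 0.0
            delta = (target - h) * h * (1.0 - h)
            for j, xj in enumerate(x):
                w[j] += self.eta * delta * xj
            w[-1] += self.eta * delta  # bias update (input fixed to 1)

    def predict_one(self, x):
        return max(self.w, key=lambda c: self._h(c, x))
```

Prediction picks the class whose perceptron gives the highest output, as in PERCEPTRON PREDICTION.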

Decision Tree

• Each node tests a feature
• Each branch represents a value
• Each leaf assigns a class
• Greedy recursive induction
• Sort all examples through tree
• x_i = most discriminative attribute
• New node for x_i, new branch for each value, leaf assigns majority class
• Stop if no error | limit on #instances

[Figure: example tree for "Car deal?" testing Road Tested? (Yes/No), Mileage? (High/Low) and Age? (Old/Recent), with ✅ / ❌ leaves]

Very Fast Decision Tree

• AKA Hoeffding Tree
• A small sample can often be enough to choose a near-optimal decision
• Collect sufficient statistics from a small set of examples
• Estimate the merit of each alternative attribute
• Choose the sample size that allows differentiating between the alternatives

Pedro Domingos, Geoff Hulten: “Mining high-speed data streams”. KDD ’00

Leaf Expansion

• When should we expand a leaf?
• Let x1 be the most informative attribute, x2 the second most informative one
• Is x1 a stable option?
• Hoeffding bound
• Split if $G(x_1) - G(x_2) > \epsilon = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$
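A small numeric sketch of the split test may help. The function names are hypothetical; `value_range` stands for R (the range of the merit function G), `delta` for δ, and `n` for the number of examples seen at the leaf.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g_best, g_second, value_range, delta, n):
    """Split when the observed merit gap exceeds the Hoeffding bound."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)
```

With R = 1, δ = 1e-7 and n = 1000, ε is roughly 0.09, so a merit gap of 0.15 would trigger a split while a gap of 0.05 would wait for more examples.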

HT Induction

HT(Stream, δ)
  ▷ Let HT be a tree with a single leaf (root)
  ▷ Init counts n_ijk at root
  for each example (x, y) in Stream
    do HTGROW((x, y), HT, δ)

HTGROW((x, y), HT, δ)
  ▷ Sort (x, y) to leaf l using HT
  ▷ Update counts n_ijk at leaf l
  if examples seen so far at l are not all of the same class
    then ▷ Compute G for each attribute
         if G(Best Attr.) − G(2nd best) > $\sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$
           then ▷ Split leaf on best attribute
                for each branch
                  do ▷ Start new leaf and initialize counts

Properties

• Number of examples needed to expand a node depends only on the Hoeffding bound (ε decreases as 1/√n)
• Low variance model (stable decisions with statistical support)
• Low overfitting (examples processed only once, no need for pruning)
• Theoretical guarantees on error rate with high probability
• Hoeffding algorithms are asymptotically close to the batch learner: expected disagreement δ/p (p = probability that an instance falls into a leaf)
• Ties: broken when ε < τ even if ΔG < ε

Regression

Definition

Given a set of training examples with a numeric label, a regression algorithm builds a model that predicts, for every unlabeled instance x, the value y = ƒ(x) with high accuracy.

Examples
• Stock price
• Airplane delay

Photo: Stephen Merity http://smerity.com

Perceptron

[Figure: five inputs (Attribute 1 to Attribute 5), weighted by w_1, …, w_5, feed a single output $h_{\vec{w}}(\vec{x}_i)$]

• Linear regressor
• Data stream: $\langle x_i, y_i \rangle$
• $\tilde{y}_i = h_w(x_i) = w^T x_i$
• Minimize MSE: $J(w) = \frac{1}{2}\sum (y_i - \tilde{y}_i)^2$
• SGD: $w' = w - \eta \nabla J\, x_i$
• $\nabla J = -(y_i - \tilde{y}_i)$
• $w' = w + \eta (y_i - \tilde{y}_i)\, x_i$
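The update rule w' = w + η(y_i − ỹ_i)x_i translates almost directly into code. The sketch below is illustrative: the class name is hypothetical, and the bias is handled by the caller appending a constant 1.0 feature, which is an assumption of this example.

```python
class OnlineLinearRegressor:
    """Linear perceptron for regression, trained by per-example SGD."""

    def __init__(self, n_features, eta=0.01):
        self.w = [0.0] * n_features
        self.eta = eta

    def predict_one(self, x):
        return sum(wj * xj for wj, xj in zip(self.w, x))

    def learn_one(self, x, y):
        error = y - self.predict_one(x)  # (y_i - y~_i)
        self.w = [wj + self.eta * error * xj for wj, xj in zip(self.w, x)]
```

For example, feeding points from y = 2·x1 + 1 as `((x1, 1.0), y)` drives the weights towards (2, 1), provided η is small relative to the squared norm of the inputs.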

Regression Tree

• Same structure as decision tree
• Predict = average target value or linear model at leaf (vs majority)
• Gain = reduction in standard deviation (vs entropy)

$\sigma = \sqrt{\sum_i (y_i - \bar{y})^2 / (N - 1)}$
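As a sketch of the gain criterion, the snippet below computes σ as defined above and the reduction obtained by a candidate split. Weighting each partition by its share of the examples is an assumption of this example (it is the usual convention, but the slide does not spell it out), and the function names are hypothetical.

```python
import math


def std_dev(ys):
    """Sample standard deviation: sqrt(sum((y - mean)^2) / (N - 1))."""
    n = len(ys)
    if n < 2:
        return 0.0
    mean = sum(ys) / n
    return math.sqrt(sum((y - mean) ** 2 for y in ys) / (n - 1))


def sd_reduction(ys, partitions):
    """sigma(ys) minus the size-weighted sigma of the partitions."""
    n = len(ys)
    weighted = sum(len(p) / n * std_dev(p) for p in partitions)
    return std_dev(ys) - weighted
```

A split that separates low targets from high targets yields a large reduction, while a split that mixes them yields little or none.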

Rules

• Problem: very large decision trees have context that is complex and hard to understand
• Rules: self-contained, modular, easier to interpret, no need to cover universe
• Each rule keeps sufficient statistics to:
  • make predictions
  • expand the rule
  • detect changes and anomalies

Adaptive Model Rules

• Ruleset: ensemble of rules
• Rule prediction: mean, linear model
• Ruleset prediction
• Weighted avg. of predictions of rules covering instance x
• Weights inversely proportional to error
• Default rule covers uncovered instances

E.g.: x = [4, −1, 1, 2]

$f(x) = \sum_{R_l \in S(x_i)} \theta_l\, y_l$

E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams." ECML-PKDD ‘13
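The ruleset prediction can be sketched as a weighted average. In this illustration the weights θ_l are taken as inverse errors normalized to sum to 1; the normalization, the small `eps` guard against division by zero, and the function name are assumptions, not details from the slide.

```python
def ruleset_predict(rule_outputs, default_prediction, eps=1e-9):
    """rule_outputs: list of (prediction, error) for the rules covering x.

    Falls back to the default rule's prediction when no rule covers x.
    """
    if not rule_outputs:
        return default_prediction
    inv = [1.0 / (err + eps) for _, err in rule_outputs]
    total = sum(inv)
    return sum(w / total * pred for w, (pred, _) in zip(inv, rule_outputs))
```

For instance, two covering rules predicting 10 and 20 with errors 1 and 3 give weights 0.75 and 0.25, so the ruleset predicts about 12.5.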

AMRules Induction

Algorithm 1: Training AMRules
Input: S: Stream of examples
begin
  R ← {}, D ← 0
  foreach (x, y) ∈ S do
    foreach Rule r ∈ S(x) do
      if ¬IsAnomaly(x, r) then
        if PHTest(error_r, λ) then
          Remove the rule from R
        else
          Update sufficient statistics L_r
          ExpandRule(r)
    if S(x) = ∅ then
      Update L_D
      ExpandRule(D)
      if D expanded then
        R ← R ∪ D
        D ← 0
  return (R, L_D)

• Rule creation: default rule expansion
• Rule expansion: split on attribute maximizing σ reduction
• Hoeffding bound $\epsilon = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$
• Expand when σ1st/σ2nd < 1 - ε
• Evict rule when P-H test error large
• Detect and explain local anomalies

Concept Drift

Definition

Given an input sequence $\langle x_1, x_2, \dots, x_t \rangle$, output at instant t an alarm signal if there is a distribution change, and a prediction $\hat{x}_{t+1}$ minimizing the error $|\hat{x}_{t+1} - x_{t+1}|$.

Outputs
• Alarm indicating change
• Estimate of parameter

Photo: http://www.logsearch.io

Application

[Figure: the input x_t feeds an Estimator, a Change Detector and a Memory; the outputs are an Estimation and an Alarm]

• Change detection on evaluation of model
• Training error should decrease with more examples
• Change in distribution of training error
• Input = stream of real/binary numbers
• Trade-off between detecting true changes and avoiding false alarms

Cumulative Sum

• Alarm when mean of input data differs from zero
• Memoryless heuristic (no statistical guarantee)
• Parameters: threshold h, drift speed v
• $g_0 = 0$, $g_t = \max(0,\, g_{t-1} + \epsilon_t - v)$
• if $g_t > h$ then alarm; $g_t = 0$
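The recurrence above fits in a few lines. The class and method names below are hypothetical; `h` and `v` are the threshold and drift speed from the slide, and `eps_t` is the monitored input at time t.

```python
class CusumDetector:
    """CUSUM: g_t = max(0, g_{t-1} + eps_t - v); alarm when g_t > h."""

    def __init__(self, h, v):
        self.h = h
        self.v = v
        self.g = 0.0

    def update(self, eps_t):
        """Feed one observation; return True if an alarm is raised."""
        self.g = max(0.0, self.g + eps_t - self.v)
        if self.g > self.h:
            self.g = 0.0  # reset after the alarm
            return True
        return False
```

With h = 5 and v = 0.5, a stream of zeros never raises an alarm, while a sustained input of 1.5 raises one after six observations.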

Page-Hinckley Test

• Similar structure to Cumulative Sum
• $g_0 = 0$, $g_t = g_{t-1} + (\epsilon_t - v)$
• $G_t = \min_t(g_t)$
• if $g_t - G_t > h$ then alarm; $g_t = 0$
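A sketch of the Page-Hinckley recurrence follows. The names are hypothetical, and resetting the running minimum G_t together with g_t after an alarm is an assumption of this example (the slide only states g_t = 0).

```python
class PageHinckley:
    """Page-Hinckley: alarm when g_t - min(g_1..g_t) exceeds h."""

    def __init__(self, h, v):
        self.h = h
        self.v = v
        self.g = 0.0      # cumulative sum g_t
        self.g_min = 0.0  # running minimum G_t

    def update(self, eps_t):
        """Feed one observation; return True if an alarm is raised."""
        self.g += eps_t - self.v
        self.g_min = min(self.g_min, self.g)
        if self.g - self.g_min > self.h:
            self.g = 0.0
            self.g_min = 0.0  # assumption: restart the minimum as well
            return True
        return False
```

Unlike CUSUM, g_t is allowed to go negative; the test compares it with its own historical minimum instead of clipping it at zero.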

Statistical Process Control

[Figure: error rate vs. number of examples processed (time), marking p_min + s_min, the warning level, the drift level, the concept drift and the new window]

• Monitor error in sliding window
• Null hypothesis: no change between windows
• If error > warning level, learn a new model in parallel on the current window
• If error > drift level, substitute the new model for the old one

J. Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with Drift Detection”. SBIA '04
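The warning and drift levels can be sketched in the style of the cited drift detection method. The concrete thresholds used here (p_min + 2·s_min for warning, p_min + 3·s_min for drift) and the 30-example warm-up follow the cited paper rather than the slide text; the class name and the strict comparisons are choices of this sketch.

```python
import math


class DDM:
    """Monitors a stream of 0/1 prediction errors for drift."""

    def __init__(self, warning=2.0, drift=3.0, min_examples=30):
        self.warning = warning
        self.drift = drift
        self.min_examples = min_examples
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """error is 1 if misclassified, else 0. Returns 'ok', 'warning' or 'drift'."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its standard deviation
        if self.n < self.min_examples:
            return "ok"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift * self.s_min:
            self.reset()  # start monitoring the new concept from scratch
            return "drift"
        if p + s > self.p_min + self.warning * self.s_min:
            return "warning"
        return "ok"
```

A caller would typically start training a replacement model on "warning" and swap it in on "drift".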

Concept-adapting VFDT

• Model consistent with sliding window on stream
• Keep sufficient statistics also at internal nodes
• Recheck periodically if splits pass Hoeffding test
• If test fails, grow alternate subtree and swap it in when accuracy of alternate is better
• Processing updates O(1) time, +O(W) memory
• Increase counters for incoming instance, decrease counters for instance leaving the window

G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ‘01


VFDTc: Adapting to Change

• Monitor error rate

• When drift is detected

• Start learning alternative subtree in parallel

• When accuracy of alternative is better

• Swap subtree

• No need for window of instances


J. Gama, R. Fernandes, R. Rocha: “Decision Trees for Mining Data Streams”. IDA (2006)

Hoeffding Adaptive Tree

• Replace frequency counters by estimators
• No need for window of instances
• Sufficient statistics kept by estimators separately
• Parameter-free change detector + estimator with theoretical guarantees for subtree swap (ADWIN)
• Keeps sliding window consistent with “no-change hypothesis”

A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams” IDA (2009)

A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ‘07

Clustering

Definition

Given a set of unlabeled instances, distribute them into homogeneous groups according to some common relations or affinities.

Examples
• Sensor segmentation
• Social network communities

Photo: W. Kandinsky - Several Circles (edited)

Approaches

• Distance based (CluStream)
• Density based (DenStream)
• Kernel based, coreset based, and many more…
• Most approaches combine online + offline phase
• Formally: minimize cost function over a partitioning of the data

Micro-Clusters

• AKA Cluster Features (CF): statistical summary structure
• Maintained in online phase, input for offline phase
• Data stream ⟨x_i⟩, d dimensions
• Cluster feature vector:
  • N: number of points
  • LS_j: sum of values (for dim. j)
  • SS_j: sum of squared values (for dim. j)
• Easy to update, easy to merge
• Constant space irrespective of the number of examples!

Tian Zhang, Raghu Ramakrishnan, Miron Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96
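A cluster feature can be sketched as three running sums. The class name and the `centroid`/`radius` helpers are illustrative additions; they show that such summaries can be derived from (N, LS, SS) alone, without storing any point.

```python
import math


class ClusterFeature:
    """CF vector (N, LS, SS) for d-dimensional points."""

    def __init__(self, d):
        self.n = 0
        self.ls = [0.0] * d  # per-dimension sum of values
        self.ss = [0.0] * d  # per-dimension sum of squared values

    def add(self, x):
        """Absorb one point in O(d) time."""
        self.n += 1
        for j, xj in enumerate(x):
            self.ls[j] += xj
            self.ss[j] += xj * xj

    def merge(self, other):
        """Merging two CFs is a component-wise addition."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square deviation of the points from the centroid."""
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))
```

The memory footprint is 2d + 1 numbers regardless of how many points were absorbed.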

CluStream

• Timestamped data stream ⟨t_i, x_i⟩, represented in d+1 dimensions
• Seed algorithm with q micro-clusters (k-means on initial data)
• Online phase. For each new point, either:
  • Update one micro-cluster (point within maximum boundary)
  • Create a new micro-cluster (delete/merge other micro-clusters)
• Offline phase. Determine k macro-clusters on demand:
  • K-means on micro-clusters (weighted pseudo-points)
  • Time-horizon queries via pyramidal snapshot mechanism

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ‘03

DBSCAN

• ε-n(p) = set of points at distance ≤ ε from p
• Core object q: ε-n(q) has weight ≥ μ
• p is directly density-reachable from q
  • p ∈ ε-n(q) ∧ q is a core object
• p_n is density-reachable from p_1
  • chain of points p_1,…,p_n such that p_{i+1} is directly density-reachable from p_i
• Cluster = set of points that are mutually density-connected

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96

DenStream

• Based on DBSCAN
• Core-micro-cluster: CMC(w,c,r) with weight w > μ, center c, radius r < ε
• Potential/outlier micro-clusters
• Online: merge point into p (or o) micro-cluster if new radius r' < ε
• Promote outlier to potential if w > βμ
• Else create new o-micro-cluster
• Offline: DBSCAN

Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ‘06

Static Evaluation

• Internal (validation)
  • Sum of squared distance (point to centroid)
  • Dunn index (on distance d): D = min(inter-cluster d) / max(intra-cluster d)
• External (ground truth)
  • Rand = #agreements / #choices = 2(TP+TN)/(N(N-1))
  • Purity = #majority class per cluster / N
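The two external measures can be sketched directly from their definitions. The function names are hypothetical; `labels_true` holds ground-truth classes and `labels_pred` the cluster assignments, both indexed by point.

```python
from collections import Counter
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree (TP + TN)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def purity(labels_true, labels_pred):
    """Sum of majority-class counts per cluster, divided by N."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)
```

The pairwise loop is quadratic in N, which is acceptable for a static evaluation but is one reason streaming settings need different measures.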

Streaming Evaluation

• Clusters may: appear, fade, move, merge
• Missed points (unassigned)
• Misplaced points (assigned to different cluster)
• Noise
• Cluster Mapping Measure CMM
• External (ground truth)
• Normalized sum of penalties of these errors

H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer: “An effective evaluation measure for clustering on evolving data streams”. KDD ’11

Frequent Itemset Mining

Definition

Given a collection of sets of items, find all the subsets that occur frequently, i.e., at least a minimum number of times (the minimum support).

Examples
• Market basket mining
• Item recommendation

Fundamentals

• Dataset D, set of items t ∈ D, constant s (minimum support)
• Support(t) = number of sets in D that contain t
• Itemset t is frequent if support(t) ≥ s
• Frequent Itemset problem:
  • Given D and s, find all frequent itemsets

Example

Dataset example (minimal support = 3)

Document   Patterns
d1         abce
d2         cde
d3         abce
d4         acde
d5         abcde
d6         bcd

Supporting documents   Support   Frequent itemsets
d1,d2,d3,d4,d5,d6      6         c
d1,d2,d3,d4,d5         5         e, ce
d1,d3,d4,d5            4         a, ac, ae, ace
d1,d3,d5,d6            4         b, bc
d2,d4,d5,d6            4         d, cd
d1,d3,d5               3         ab, abc, abe, be, bce, abce
d2,d4,d5               3         de, cde
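The table can be reproduced by brute force on this small dataset. The sketch below enumerates every candidate itemset, which is only feasible for toy inputs; the function and variable names are illustrative.

```python
from itertools import combinations

DATASET = {
    "d1": "abce", "d2": "cde", "d3": "abce",
    "d4": "acde", "d5": "abcde", "d6": "bcd",
}


def frequent_itemsets(dataset, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    items = sorted(set().union(*map(set, dataset.values())))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= set(t) for t in dataset.values())
            if support >= min_support:
                result["".join(candidate)] = support
    return result


FREQUENT = frequent_itemsets(DATASET, min_support=3)
```

With minimal support 3 this yields the 19 itemsets listed in the table, for example `FREQUENT["ce"] == 5` and `FREQUENT["abce"] == 3`.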

Variations

• A priori property: t ⊆ t' ➝ support(t) ≥ support(t')
• Closed: none of its supersets has the same support
  • Can generate all freq. itemsets and their support
• Maximal: none of its supersets is frequent
  • Can generate all freq. itemsets (without support)
• Maximal ⊆ Closed ⊆ Frequent
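To make the closed and maximal definitions concrete, the sketch below applies them to the six-document example dataset with minimal support 3. It is a brute-force illustration with hypothetical names, not an efficient miner.

```python
from itertools import combinations

TRANSACTIONS = ["abce", "cde", "abce", "acde", "abcde", "bcd"]
MIN_SUPPORT = 3


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(set(itemset) <= set(t) for t in TRANSACTIONS)


items = sorted(set("".join(TRANSACTIONS)))
frequent = {
    frozenset(c): support(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= MIN_SUPPORT
}

# Closed: no proper superset has the same support. Infrequent supersets have
# strictly lower support, so checking frequent supersets is enough.
closed = {
    s for s, sup in frequent.items()
    if not any(s < t and sup == frequent[t] for t in frequent)
}

# Maximal: no proper superset is frequent.
maximal = {s for s in frequent if not any(s < t for t in frequent)}
```

On this dataset, 19 frequent itemsets reduce to 7 closed ones (c, ce, ace, bc, cd, abce, cde) and 2 maximal ones (abce, cde), which illustrates Maximal ⊆ Closed ⊆ Frequent.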



Itemset Streams

• Support as fraction of stream length

• Exact vs approximate

• Incremental, sliding window, adaptive

• Frequent, closed, maximal


Pattern mining in streams

Key data structure: lattice of patterns, with counts

{A},20  {B},18  {C},18  {D},25
{A,B},15  {A,C},12  {B,C},10  {A,D},5  {B,D},12  {C,D},12
{A,B,C},4  {A,B,D},3  {A,C,D},3  {B,C,D},8
{A,B,C,D},2

[Figure legend: the lattice is divided into patterns with count > 7 and patterns with count ≤ 7]

Pattern mining in streams

The vast majority of stream pattern mining algorithms (implicitly or explicitly) build and update the pattern lattice.

General scheme:

let L be initial, empty lattice;
forever do {
    collect a batch of items of size B;
    build a summary S of the batch;
    merge S into L;
}

Lossy Counting

• Keep data structure D with tuples (x, freq(x), error(x))
• Conceptually divide the stream into buckets of size ⌈1/ε⌉
• For each itemset x in the stream, let Bid = current sequential bucket id, starting from 1
  • if x ∈ D, freq(x)++
  • else D ← D ∪ (x, 1, Bid - 1)
• Prune D at bucket boundaries: evict x if freq(x) + error(x) ≤ Bid

G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB '02
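The steps above can be sketched for a stream of single items as follows. The class and method names are hypothetical, and the query rule in `frequent` (report x when freq(x) ≥ (s − ε)·N) comes from the cited paper rather than from the slide.

```python
import math


class LossyCounter:
    """Lossy Counting over a stream of hashable items."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)  # bucket size
        self.n = 0                             # items seen so far
        self.entries = {}                      # x -> [freq, error]

    def add(self, x):
        self.n += 1
        bucket_id = math.ceil(self.n / self.width)
        if x in self.entries:
            self.entries[x][0] += 1
        else:
            self.entries[x] = [1, bucket_id - 1]
        if self.n % self.width == 0:
            # Bucket boundary: evict x when freq(x) + error(x) <= Bid.
            self.entries = {
                k: fe for k, fe in self.entries.items()
                if fe[0] + fe[1] > bucket_id
            }

    def frequent(self, support):
        """Items whose stored count reaches (support - epsilon) * N."""
        threshold = (support - self.epsilon) * self.n
        return {k: f for k, (f, _) in self.entries.items() if f >= threshold}
```

Rare items are evicted at the next bucket boundary, so the table stays small while heavy hitters keep their counts.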

Moment

• Keeps track of boundary below frequent itemsets in a window
• Closed Enumeration Tree (CET) (~ prefix tree)
  • Infrequent gateway nodes (infrequent)
  • Unpromising gateway nodes (frequent non-closed, child non-closed)
  • Intermediate nodes (frequent non-closed, child closed)
  • Closed nodes (frequent)
• By adding/removing transactions closed/infreq. do not change

Y. Chi, H. Wang, P. Yu, R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ‘04


FP-Stream

• Multiple time granularities

• Based on FP-Growth (depth-first search over itemset lattice)

• Pattern-tree + Tilted-time window

• Time sensitive queries, emphasis on recent history

• High time and memory complexity


C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)


Itemset mining

CLOSTREAM (Yen+ 09) (Sliding window, all closed, exact)

MFI (Li+ 09) (Transaction-sensitive window, frequent closed, exact)

IncMine (Cheng+ 08) (Sliding window, frequent closed, approximate; faster for moderate approximate ratios)


Sequence, tree, graph mining

MILE (Chen+ 05), SMDS (Marascu-Masseglia 06), SSBE (Koper-Nguyen 11): Frequent subsequence (aka sequential pattern) mining

Bifet+ 08: Frequent closed unlabeled subtree mining

Bifet+ 11: Frequent closed labeled subtree mining

Bifet+ 11: Frequent closed subgraph mining

Page 80: IoT Big Data Stream Mining Tutorial · IoT Big Data Stream Mining Tutorial BDA 2017 Albert Bifet (@abifet), Gianmarco De Francisci Morales, Joao Gama and Wei Fan

IoT Distributed Stream Mining

Part II

Page 81

Outline

• IoT Fundamentals of Stream Mining

• IoT Setting

• Classification

• Concept Drift

• Regression

• Clustering

• Frequent Itemset Mining

• Concept Evolution

• Limited Labeled Learning

• IoT Distributed Stream Mining

• Distributed Stream Processing Engines

• Classification

• Regression

• Open Source Tools

• Applications

• Conclusions

Page 82

Distributed Stream Processing Engines

Page 83

A Tale of two Tribes

[Figure: diagram with the labels Data, DB, App, Database, Faster, Larger]

M. Stonebraker, U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05

Page 84

SPE Evolution

[Figure: timeline of stream processing engines from 2003 to 2013, grouped into 1st generation, 2nd generation, and 3rd generation]

• 2003: Aurora. Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003

• 2004: STREAM. Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004

• 2005: Borealis. Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05

• 2006: SPC. Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06

• 2008: SPADE. Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08

• 2010: S4. Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10

• 2011: Storm. http://storm.apache.org

• 2013: Samza. http://samza.incubator.apache.org

Page 85

Actors Model

[Figure: live streams (Stream 1, Stream 2, Stream 3) are routed as events through a network of processing elements (PE) to Output 1, Output 2, and an External Persister]

Page 86

S4 Example

status.text: "Introducing #S4: a distributed #stream processing system"

Events (EV / KEY / VAL):

• EV: RawStatus, KEY: null, VAL: text="Int..."

• EV: Topic, KEY: topic="S4", VAL: count=1

• EV: Topic, KEY: topic="stream", VAL: count=1

• EV: Topic, KEY: reportKey="1", VAL: topic="S4", count=4

Processing elements:

• TopicExtractorPE (PE1): extracts hashtags from status.text

• TopicCountAndReportPE (PE2-3): keeps counts for each topic across all tweets. Regularly emits a report event if the topic count is above a configured threshold.

• TopicNTopicPE (PE4): keeps counts for top topics and outputs the top-N topics to an external persister
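The hashtag-counting pipeline described on this slide can be sketched in plain Python. This is a single-process imitation and does not use the S4 API: the PE names follow the slide, while the method names, the `run` helper, and the choice to emit a report on every update at or above the threshold (instead of at regular intervals) are assumptions made for brevity.

```python
import re

HASHTAG = re.compile(r"#(\w+)")


def topic_extractor_pe(status_text):
    """PE1: turn one RawStatus event into Topic events keyed by hashtag."""
    return [("Topic", topic, 1) for topic in HASHTAG.findall(status_text)]


class TopicCountAndReportPE:
    """PE2-3: one instance per topic key; counts and reports above a threshold."""

    def __init__(self, topic, threshold):
        self.topic = topic
        self.threshold = threshold
        self.count = 0

    def process(self, value):
        self.count += value
        if self.count >= self.threshold:
            return {"reportKey": "1", "topic": self.topic, "count": self.count}
        return None


class TopicNTopicPE:
    """PE4: single instance (all reports share one key); keeps the top-N topics."""

    def __init__(self, n):
        self.n = n
        self.counts = {}

    def process(self, report):
        self.counts[report["topic"]] = report["count"]

    def top(self):
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.n]


def run(statuses, threshold=2, n=2):
    """Route events between the PEs the way key-based routing would."""
    counters = {}
    top = TopicNTopicPE(n)
    for text in statuses:
        for _event, topic, value in topic_extractor_pe(text):
            if topic not in counters:  # PE instances are created per key on demand
                counters[topic] = TopicCountAndReportPE(topic, threshold)
            report = counters[topic].process(value)
            if report is not None:
                top.process(report)
    return top.top()
```

In a real deployment the per-topic instances would live on different machines, and the shared `reportKey` is what funnels every report to the same final PE.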

Page 87

Groupings

[Figure: two PEs and four PEIs]

• Key Grouping (hashing)

• Shuffle Grouping (round-robin)

• All Grouping (broadcast)
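The three groupings differ only in how a destination instance is chosen for each event. A minimal Python sketch, assuming destination instances are numbered from 0 to n-1; the function and class names are hypothetical and not taken from any engine's API.

```python
import zlib


def key_grouping(key, n_instances):
    """Hashing: events with the same key always reach the same instance."""
    # crc32 is used because Python's built-in hash() of str varies between runs.
    return [zlib.crc32(key.encode("utf-8")) % n_instances]


class ShuffleGrouping:
    """Round-robin: events are spread evenly, regardless of content."""

    def __init__(self, n_instances):
        self.n_instances = n_instances
        self._next = 0

    def route(self):
        target = self._next
        self._next = (self._next + 1) % self.n_instances
        return [target]


def all_grouping(n_instances):
    """Broadcast: every instance receives a copy of the event."""
    return list(range(n_instances))
```

Key grouping is what lets a stateful instance own all events for its keys, shuffle grouping balances load for stateless work, and all grouping is used for control messages every instance must see.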

Page 91

Big Data Processing Engines

• Low latency

• High latency (not real time)

Apache Storm

Storm characteristics for real-time data processing workloads:

1. Fast

2. Scalable

3. Fault-tolerant

4. Reliable

5. Easy to operate

Apache Samza from LinkedIn

Storm and Samza are fairly similar. Both systems provide:

1. a partitioned stream model,

2. a distributed execution environment,

3. an API for stream processing,

4. fault tolerance,

5. Kafka integration

Real-time computation: streaming computation

MapReduce Limitations

Example: how to compute in real time (latency less than 1 second):

1. predictions

2. frequent items such as Twitter hashtags

3. sentiment analysis

Apache Spark Streaming

Spark Streaming is an extension of Spark that allows processing data streams using micro-batches of data.

Page 92

Kappa Architecture

• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

Page 93

Apache Spark

• Spark Streaming is an extension of Spark that allows processing data streams using micro-batches of data.

Page 94

Apache Spark

• Discretized Stream or DStream represents a continuous stream of data

• either the input data stream received from source, or

• the processed data stream generated by transforming the input stream.

• Internally, a DStream is represented by a continuous series of RDDs

Page 95

Apache Spark

• Any operation applied on a DStream translates to operations on the underlying RDDs
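The relationship between a DStream and its underlying batches can be imitated in plain Python. The sketch below does not use the Spark API: `ToyDStream` and `micro_batches` are hypothetical names, each Python list stands in for one RDD, and batches are cut by item count, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an iterable into consecutive lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class ToyDStream:
    """A lazy sequence of batches; each batch plays the role of one RDD."""

    def __init__(self, batches):
        self.batches = batches

    def map(self, fn):
        # The stream-level operation is just the same operation on every batch.
        return ToyDStream([fn(x) for x in batch] for batch in self.batches)

    def filter(self, predicate):
        return ToyDStream([x for x in batch if predicate(x)] for batch in self.batches)

    def collect(self):
        return [list(batch) for batch in self.batches]
```

For example, `ToyDStream(micro_batches(range(1, 8), 3)).map(lambda x: x * x).collect()` squares the items batch by batch, keeping the batch boundaries intact.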

Page 96

Apache Flink

• Streaming engine

Page 98

Apache Beam

Page 99

Apache Beam

• Apache Beam code can run on:

• Apache Flink

• Apache Spark

• Google Cloud Dataflow

• Google Cloud Dataflow replaced MapReduce:

• It is based on FlumeJava and MillWheel, a stream engine like Storm and Samza

• It reads from and writes to Google Pub/Sub, a service similar to Kafka

Page 100

Apache Beam

Page 101

Lambda Architecture

Page 102

Kappa Architecture

Page 103

Classification

Page 104

Hadoop AllReduce

• MPI AllReduce on MapReduce

• Parallel SGD + L-BFGS

• Aggregate + Redistribute

• Each node computes partial gradient

• Aggregate (sum) complete gradient

• Each node gets updated model

• Hadoop for data locality (map-only job)

A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)
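The aggregate-and-redistribute loop listed above (partial gradient per node, summed gradient, updated model on every node) can be sketched in a single process. This is an illustration under several assumptions: plain batch gradient descent on squared error stands in for the SGD + L-BFGS combination, the nodes are simulated by a Python list, and all function names are hypothetical.

```python
def partial_gradient(weights, shard):
    """Sum of squared-error gradients over one node's local examples."""
    grad = [0.0] * len(weights)
    for features, target in shard:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def all_reduce_sum(vectors):
    """Every node contributes a vector and every node gets back the sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def train(shards, n_features, learning_rate=0.1, epochs=200):
    """Each node keeps its own model copy; the copies stay identical."""
    n_examples = sum(len(shard) for shard in shards)
    models = [[0.0] * n_features for _ in shards]
    for _ in range(epochs):
        grads = [partial_gradient(m, s) for m, s in zip(models, shards)]
        reduced = all_reduce_sum(grads)  # aggregate + redistribute
        for model, grad in zip(models, reduced):
            for j in range(n_features):
                model[j] -= learning_rate * grad[j] / n_examples
    return models
```

Because each node only ever reads its own shard, the data never moves; only gradient vectors of model size cross the network.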

Page 105

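AllReduce

Reduction Tree

[Figure: reduction tree over nodes holding 7, 5, 1, 4, 9, 3, 8; the values are summed up the tree to 37 at the root, and 37 is then sent back down to every node]

Upward = Reduce, Downward = Broadcast (All)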
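The two passes of a tree AllReduce can be traced with the seven node values from the slide, which sum to 37. The sketch assumes a binary tree stored in array order (children of node i at 2i+1 and 2i+2); the exact placement of the values in the slide's tree is not recoverable, so that layout is an assumption.

```python
def tree_all_reduce(values):
    """values[i] is node i's local number; children of i are 2i+1 and 2i+2.

    Returns (partial, result): the subtree sums after the upward pass and
    the value every node holds after the downward pass.
    """
    n = len(values)
    partial = list(values)
    # Upward pass (reduce): children have larger indices than their parent,
    # so visiting nodes in reverse order finishes each subtree first.
    for node in range(n - 1, 0, -1):
        parent = (node - 1) // 2
        partial[parent] += partial[node]
    # Downward pass (broadcast): each node copies the total from its parent.
    result = [None] * n
    result[0] = partial[0]
    for node in range(1, n):
        result[node] = result[(node - 1) // 2]
    return partial, result
```

Each node talks only to its parent and children, so both passes take a number of steps proportional to the tree depth.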

Page 106

Parallel Decision Trees

• Which kind of parallelism?

• Task

• Data

• Horizontal

• Vertical

[Figure: data table of Instances by Attributes with a Class column; one Instance is a row of Attributes]

Page 107

Horizontal Partitioning

[Figure: Stream, Instances, three Stats nodes, Histograms, Model, Model Updates]

• Aggregation to compute splits

• Single attribute tracked in multiple nodes

Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)

Page 108

Hoeffding Tree Profiling

Training time for 100 nominal + 100 numeric attributes:

• Learn: 70 %

• Split: 24 %

• Other: 6 %

Page 109

Vertical Partitioning

[Figure: Stream, Model, Attributes, three Stats nodes, Splits]

• Single attribute tracked in single node

N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo: “VHT: Vertical Hoeffding Tree”, 2016 https://arxiv.org/abs/1607.08325
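Vertical partitioning can be sketched as follows: each attribute is routed to exactly one statistics worker, each worker proposes its best local split, and the model picks the global winner. This is a simplified single-process illustration, not the VHT implementation: it handles nominal attributes only, uses information gain, omits the Hoeffding bound, and routes by `attribute % n_workers` as a stand-in for key grouping. All class names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(counts):
    """Shannon entropy (bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


class StatsWorker:
    """Holds class counts for the attributes assigned to this worker."""

    def __init__(self):
        # counts[attribute][value][label] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, attribute, value, label):
        self.counts[attribute][value][label] += 1

    def best_local_split(self):
        """Return (information_gain, attribute) for the best local attribute."""
        best = None
        for attribute, by_value in self.counts.items():
            label_totals = defaultdict(int)
            for labels in by_value.values():
                for label, c in labels.items():
                    label_totals[label] += c
            n = sum(label_totals.values())
            remainder = sum(
                sum(labels.values()) / n * entropy(labels.values())
                for labels in by_value.values()
            )
            gain = entropy(label_totals.values()) - remainder
            if best is None or gain > best[0]:
                best = (gain, attribute)
        return best


class VerticalModel:
    """Splits each instance into attributes and keys them to workers."""

    def __init__(self, n_workers):
        self.workers = [StatsWorker() for _ in range(n_workers)]

    def train(self, instance, label):
        for attribute, value in enumerate(instance):
            worker = self.workers[attribute % len(self.workers)]
            worker.update(attribute, value, label)

    def best_split(self):
        candidates = [w.best_local_split() for w in self.workers]
        return max(c for c in candidates if c is not None)
```

Since no attribute's statistics are replicated, the workers can evaluate their candidate splits in parallel and only the small (gain, attribute) pairs travel back to the model.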

Page 110

Vertical Hoeffding Tree

[Figure: topology of Source (n), Model (n), Stats (n), and Evaluator (1), linked by the Instance, Control, Split, and Result streams; the legend distinguishes Shuffle Grouping, Key Grouping, and All Grouping]

Page 111

Advantages of Vertical Parallelism

• High number of attributes => high level of parallelism (e.g., documents)

• vs. task parallelism

• Parallelism observed immediately

• vs. horizontal parallelism

• Reduced memory usage (no model replication)

• Parallelized split computation

Page 112

Regression

Page 113

VAMR

[Figure: Instances enter the Model Aggregator, which outputs Predictions; New Rules and Rule Updates flow between the Model Aggregator and Learner 1, Learner 2, ..., Learner p]

• Vertical AMRules

• Model: rule body + head

• Target mean updated continuously with covered instances for predictions

• Default rule (creates new rules)

• Learner: statistics

• Vertical: Learner tracks statistics of independent subset of rules

• One rule tracked by only one Learner

• Model -> Learner: key grouping on rule ID

A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ‘14
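The division of labor in VAMR (the model holds rule bodies and heads, each learner holds the statistics of its own rules) can be sketched as below. This is a heavily reduced illustration, not the algorithm from the paper: rule bodies are a single hypothetical `attribute <= threshold` test, the rule head is just the running target mean, predictions average the heads of all covering rules, and the default rule and rule expansion are omitted.

```python
class Rule:
    """Rule body: a single test `instance[attribute] <= threshold`."""

    def __init__(self, rule_id, attribute, threshold):
        self.rule_id = rule_id
        self.attribute = attribute
        self.threshold = threshold

    def covers(self, instance):
        return instance[self.attribute] <= self.threshold


class Learner:
    """Tracks statistics for the rules that are keyed to this learner."""

    def __init__(self):
        self.stats = {}  # rule_id -> [n_covered, sum_of_targets]

    def update(self, rule_id, target):
        entry = self.stats.setdefault(rule_id, [0, 0.0])
        entry[0] += 1
        entry[1] += target
        return entry[1] / entry[0]  # rule update: the new target mean


class ModelAggregator:
    """Holds rule bodies and heads; forwards covered instances to learners."""

    def __init__(self, rules, n_learners):
        self.rules = rules
        self.learners = [Learner() for _ in range(n_learners)]
        self.head = {rule.rule_id: 0.0 for rule in rules}

    def predict(self, instance):
        heads = [self.head[r.rule_id] for r in self.rules if r.covers(instance)]
        return sum(heads) / len(heads) if heads else None

    def train(self, instance, target):
        for rule in self.rules:
            if rule.covers(instance):
                # Key grouping on rule ID: one rule, exactly one learner.
                learner = self.learners[rule.rule_id % len(self.learners)]
                self.head[rule.rule_id] = learner.update(rule.rule_id, target)
```

Keying on the rule ID is what guarantees that the statistics of a rule never have to be merged across learners.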

Page 114

HAMR

• The single model in VAMR is a bottleneck

• Hybrid AMRules (Vertical + Horizontal)

• Shuffle among multiple Models for parallelism

• Problem: distributed default rule decreases performance

• Separate dedicated Learner for default rule

A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ’14

[Figure: HAMR topologies. Instances go to Model Aggregators 1, 2, ..., r, which output Predictions and exchange New Rules and Rule Updates with the Learners; a second diagram adds a Default Rule Learner that sends New Rules to the Model Aggregators]

Page 115

Open Source Tools

Page 116

MOA

• {M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

• It is closely related to WEKA

• It includes a collection of offline and online methods as well as tools for evaluation:

• classification, regression, clustering

• outlier detection, frequent pattern mining

• Easy to extend, design and run experiments

http://moa.cms.waikato.ac.nz/

Page 117

streamDM C++

http://huawei-noah.github.io/streamDM-Cpp/

Page 118

Vision

[Figure: Streaming, Distributed, IoT Big Data Stream Mining]

Page 119

SAMOA

http://samoa-project.net

Data Mining

• Distributed, Batch: Hadoop, Mahout

• Distributed, Stream: Storm, S4, Samza, SAMOA

• Non Distributed, Batch: R, WEKA, …

• Non Distributed, Stream: MOA

G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)

Page 120

StreamDM

http://huawei-noah.github.io/streamDM

Page 121

Conclusions

Page 122

IOT AND INDUSTRY 4.0

• Interoperability: IoT

• Information transparency: virtual copy of the physical world

• Technical assistance: support human decisions

• Decentralized decisions: make decisions on their own

Page 123

INTERNET OF THINGS

• EMC Digital Universe, 2014

[Figure: EMC Digital Universe, 2014]

Page 124

INTERNET OF THINGS

• EMC Digital Universe, 2014

Page 125

IOT (MCKINSEY)

Page 126

Summary

• IoT streaming is useful for finding approximate solutions in a reasonable amount of time and with limited resources

• Algorithms for classification, regression, clustering, frequent itemset mining

• Distributed systems for very large streams

Page 127

Open Challenges

• Structured output

• Multi-target learning

• Millions of classes

• Representation learning

• Ease of use

Page 128

Thanks!

Page 129

References

Page 130

• IDC’s Digital Universe Study. EMC (2011)

• P. Domingos, G. Hulten: “Mining high-speed data streams”. KDD ’00

• J. Gama, P. Medas, G. Castillo, P. Rodrigues: “Learning with drift detection”. SBIA ’04

• G. Hulten, L. Spencer, P. Domingos: “Mining Time-Changing Data Streams”. KDD ’01

• J. Gama, R. Fernandes, R. Rocha: “Decision trees for mining data streams”. IDA (2006)

• A. Bifet, R. Gavaldà: “Adaptive Parameter-free Learning from Evolving Data Streams”. IDA (2009)

• A. Bifet, R. Gavaldà: “Learning from Time-Changing Data with Adaptive Windowing”. SDM ’07

• E. Almeida, C. Ferreira, J. Gama: “Adaptive Model Rules from Data Streams”. ECML-PKDD ’13

• H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer: “An effective evaluation measure for clustering on evolving data streams”. KDD ’11

• T. Zhang, R. Ramakrishnan, M. Livny: “BIRCH: An Efficient Data Clustering Method for Very Large Databases”. SIGMOD ’96

• C. C. Aggarwal, J. Han, J. Wang, P. S. Yu: “A Framework for Clustering Evolving Data Streams”. VLDB ’03

• M. Ester, H. Kriegel, J. Sander, X. Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. KDD ’96

Page 131

• F. Cao, M. Ester, W. Qian, A. Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ’06

• G. S. Manku, R. Motwani: “Approximate frequency counts over data streams”. VLDB ’02

• Y. Chi, H. Wang, P. Yu, R. Muntz: “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window”. ICDM ’04

• C. Giannella, J. Han, J. Pei, X. Yan, P. S. Yu: “Mining frequent patterns in data streams at multiple time granularities”. NGDM (2003)

• M. Stonebraker, U. Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone”. ICDE ’05

• A. Agarwal, O. Chapelle, M. Dudík, J. Langford: “A Reliable Effective Terascale Linear Learning System”. JMLR (2014)

• Y. Ben-Haim, E. Tom-Tov: “A Streaming Parallel Decision Tree Algorithm”. JMLR (2010)

• A. T. Vu, G. De Francisci Morales, J. Gama, A. Bifet: “Distributed Adaptive Model Rules for Mining Big Data Streams”. BigData ’14

• G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)

• J. Gama: “Knowledge Discovery from Data Streams”. Chapman and Hall (2010)

• J. Gama: “Data Stream Mining: the Bounded Rationality”. Informatica 37(1): 21-25 (2013)
