Efficient Data Stream Classification via Probabilistic Adaptive Windows
Albert Bifet1, Jesse Read2, Bernhard Pfahringer3, Geoff Holmes3
1 Yahoo! Research Barcelona, 2 Universidad Carlos III, Madrid, Spain
3University of Waikato, Hamilton, New Zealand
SAC 2013, 19 March 2013
Data Streams
Big Data & Real Time
Data Streams
I Sequence is potentially infinite
I High amount of data: sublinear space
I High speed of arrival: sublinear time per example
I Once an element from a data stream has been processed, it is discarded or archived
Approximation algorithms
I Small error rate with high probability
I An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ
Data Stream Sliding Window
Sampling algorithms
I Giving equal weight to old and new examples: RESERVOIR SAMPLING
I Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW
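As an illustrative sketch (not part of the slides; the function name is mine), classic reservoir sampling keeps a uniform sample of size k from a stream of unknown length using only O(k) memory:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream
    of unknown length, using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # pick a slot in [0, i]
            if j < k:                    # accept with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```

Every item ends up in the sample with the same probability k/n, which is exactly the "equal weight to old and new examples" behaviour named above.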
8 Bits Counter
1 0 1 0 1 0 1 0
What is the largest number we can store in 8 bits?
8 Bits Counter
[Plot of f(x) = log(1 + x)/log(2) for x ∈ [0, 100]; f(0) = 0, f(1) = 1]
8 Bits Counter
[Plot of f(x) = log(1 + x)/log(2) for x ∈ [0, 10]; f(0) = 0, f(1) = 1]
8 Bits Counter
[Plot of f(x) = log(1 + x/30)/log(1 + 1/30) for x ∈ [0, 10]; f(0) = 0, f(1) = 1]
8 Bits Counter
[Plot of f(x) = log(1 + x/30)/log(1 + 1/30) for x ∈ [0, 100]; f(0) = 0, f(1) = 1]
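To see what these scaling functions buy, invert f: if an 8-bit counter stores c = f(n), the largest representable count is f⁻¹(255). A quick sketch (function names are mine; s is the scale parameter, so s = 1 recovers log(1 + x)/log(2) and s = 30 the second curve):

```python
import math

def stored_value(n, s):
    """f(n) = log(1 + n/s) / log(1 + 1/s), so f(0) = 0 and f(1) = 1."""
    return math.log(1 + n / s) / math.log(1 + 1 / s)

def max_count(s, bits=8):
    """Invert f at the largest storable code, 2**bits - 1."""
    c = 2 ** bits - 1
    return s * ((1 + 1 / s) ** c - 1)
```

With s = 1 the 8-bit counter reaches 2^255 − 1 (huge range, coarse resolution); with s = 30 it reaches roughly 1.3 × 10^5 (smaller range, finer resolution). The scale parameter trades range for accuracy.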
8 bits Counter

MORRIS APPROXIMATE COUNTING ALGORITHM
1 Init counter c ← 1
2 for every event in the stream
3   do rand = random number between 0 and 1
4      if rand < p
5        then c ← c + 1

What is the largest number we can store in 8 bits?
8 bits Counter

With p = 1/2 we can store 2 × 256 with standard deviation σ = √n/2
8 bits Counter

With p = 2^−c, E[2^c] = n + 2 with variance σ² = n(n + 1)/2
8 bits Counter

If p = b^−c, then E[b^c] = n(b − 1) + b with variance σ² = (b − 1)n(n + 1)/2
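A minimal Python sketch of the Morris counter described above. Inverting E[b^c] = n(b − 1) + b gives the estimate (b^c − b)/(b − 1); the inversion and parameter names are mine:

```python
import random

def morris_counter(n_events, b=2.0):
    """Morris approximate counting: increment c with probability b**-c.
    Since E[b**c] = n*(b - 1) + b, the value (b**c - b)/(b - 1)
    is an unbiased estimate of the true count n."""
    c = 1  # counter initialised at 1, matching E[b^c] = n(b-1) + b
    for _ in range(n_events):
        if random.random() < b ** (-c):
            c += 1
    return (b ** c - b) / (b - 1)
```

With b = 2 the stored value c only grows to about log2(n), so 8 bits suffice for astronomically large counts, at the cost of the variance quoted above.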
PROBABILISTIC APPROXIMATE WINDOW
1 Init window w ← ∅
2 for every instance i in the stream
3   do store the new instance i in window w
4      for every instance j in the window
5        do rand = random number between 0 and 1
6           if rand > b^−1
7             then remove instance j from window w
PAW maintains a sample of instances in logarithmic memory, giving greater weight to newer instances
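A sketch of one PAW update, following the pseudocode above (the function name, the choice b = 1.001, and the usage loop are my own): each stored instance survives a step with probability b⁻¹, so older instances are exponentially less likely to remain.

```python
import random

def paw_step(window, instance, b=1.001):
    """One PAW update: add the new instance, then keep each stored
    instance with probability 1/b (i.e. remove it when rand > b**-1)."""
    window.append(instance)
    return [j for j in window if random.random() <= 1 / b]

# Feed a stream of 10,000 instances through the window.
window = []
for i in range(10000):
    window = paw_step(window, i)
```

Under these update rules the expected window size settles around b/(b − 1), so b = 1.001 gives roughly 1000 instances, matching the w = 1000 used in the experiments, while recent instances dominate the sample.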
Experiments: Methods
Abbr.    Classifier                     Parameters
NB       Naive Bayes
HT       Hoeffding Tree
HTLB     Leveraging Bagging with HT     n = 10
kNN      k Nearest Neighbour            w = 1000, k = 10
kNNW     kNN with PAW                   w = 1000, k = 10
kNNWA    kNN with PAW+ADWIN             w = 1000, k = 10
kNNLBW   Leveraging Bagging with kNNW   n = 10

The methods we consider. Leveraging Bagging methods use n models. kNNWA empties its window (of max w) when drift is detected (using the ADWIN drift detector).
Experimental Evaluation
Table: The window size for kNN and corresponding accuracy.

Accuracy      -w 100  -w 500  -w 1000  -w 5000
Real Avg.      77.88   77.78    79.59    78.23
Synth. Avg.    57.99   81.93    84.74    86.03
Overall Avg.   62.53   80.28    82.59    83.11
Results
Experimental Evaluation
Table: The window size for kNN and corresponding time.

Time (seconds)  -w 100  -w 500  -w 1000  -w 5000
Real Tot.          297     998     1754     7900
Synth. Tot.        371    1297     2313    10671
Overall Tot.       668    2295     4067    18570
Experimental Evaluation
Table: The window size for kNN and corresponding RAM-Hours.

RAM-Hours     -w 100  -w 500  -w 1000  -w 5000
Real Tot.      0.007   0.082    0.269    5.884
Synth. Tot.    0.002   0.026    0.088    1.988
Overall Tot.   0.009   0.108    0.357    7.872
Experimental Evaluation
Table: Summary of Efficiency: Accuracy and RAM-Hours.

             NB     HT    HTLB    kNN   kNNW  kNNWA  kNNLBW
Accuracy  56.19  73.95   83.75  82.59  82.92  83.19   84.67
RAM-Hrs    0.02   1.57  300.02   0.36   8.08   8.80  250.98
Conclusions
Sampling algorithms for kNN
I Giving equal weight to old and new examples: RESERVOIR SAMPLING
I Giving more weight to recent examples: PROBABILISTIC APPROXIMATE WINDOW
Thanks!