Harnessing the Strengths of Anytime Algorithms
for Constant Data Streams
Philipp Kranen, Thomas Seidl
Data Management and Data Exploration Group
RWTH Aachen University, Germany
Philipp Kranen, Thomas Seidl – Harnessing the Strengths of Anytime Algorithms for Constant Data Streams – ECML PKDD ’09
Agenda
Problem statement
Formal model
Novel approaches
Evaluation
Conclusion
Motivation – data streams in everyday life …
[Figure: items of type 1, type 2, …, type m arrive on a constant data stream with arrival interval ta; the points tf and td are marked along the stream; plot of quality over time]
Problem statement
Data streams are ubiquitous
  Network traffic
  Sensor measurements
  Customer data
  Surveillance data
  …
Constant streams: budget algorithms
  Tailored to a specific application
  - no result in less time
  - no improvement with additional time
Varying streams: anytime algorithms
  Natural choice for varying streams
  + result after any time
  + exploit additional time
Goal: improve the resulting quality on constant streams over that of budget algorithms
Basic idea: spend less time on “confident” items
Prerequisite: a confidence measure for the current result
Model – premise
Example: classification of items on a conveyor belt
Given
  Anytime classification algorithm (e.g. anytime nearest neighbor)
  Confidence measure
  (td – tf) ≥ ta
Time is normalized to [0, 1]
  t = 0: corresponds to the result just after initialization
  t = 1: complete model has been read, no further improvement possible
  n improvement steps (e.g. n training set items for nearest neighbor)
Confidence measure ranges from 0 to 1
  0: no confidence
  1: certain
First: assume linear dependency between confidence and accuracy
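To make the premise concrete, the following is a minimal sketch of an interruptible nearest neighbor classifier with n improvement steps and time normalized to [0, 1]. The class name `AnytimeNN`, its methods, and the majority-label fallback at t = 0 are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter


class AnytimeNN:
    """Toy anytime 1-NN: every improvement step reads one more training item."""

    def __init__(self, training_set):
        # training_set: iterable of (feature_tuple, label) pairs
        self.training_set = list(training_set)
        self.n = len(self.training_set)
        # Assumption: the t = 0 result is the majority label of the training set.
        labels = Counter(label for _, label in self.training_set)
        self.default_label = labels.most_common(1)[0][0]

    def init_state(self, query):
        """Result just after initialization (t = 0)."""
        return {"query": query, "steps": 0, "best": math.inf,
                "label": self.default_label}

    def improve(self, state):
        """Perform one improvement step; return False once t = 1 is reached."""
        if state["steps"] >= self.n:
            return False
        features, label = self.training_set[state["steps"]]
        dist = math.dist(features, state["query"])
        if dist < state["best"]:
            state["best"], state["label"] = dist, label
        state["steps"] += 1
        return True

    def time(self, state):
        """Normalized time spent on this item, in [0, 1]."""
        return state["steps"] / self.n


# Usage: interrupt the classifier once half of the model has been read.
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
clf = AnytimeNN(train)
state = clf.init_state((0.1, 0.0))
while clf.time(state) < 0.5 and clf.improve(state):
    pass
print(state["label"], clf.time(state))
```

The current label can be read from the state after any number of steps, which is the property the stream approaches on the later slides rely on.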
Model – assumptions
[Figure: confidence over time, with labels: budget confidence = μ(t), scattering = σ(t), ĉ, F(ĉ, t)]
Individual confidences are scattered around the mean value (budget confidence)
$F(\hat{c}, t) = \int_{\hat{c}}^{\infty} g(\mu(t), \sigma(t), x)\,dx$
Model – expected time to reach confidence ĉ
F(ĉ, t) is the probability that the confidence at time t is larger than ĉ
Use F(ĉ, t) as a cumulative distribution function (n steps!)
h(ĉ, tj) is the probability that we first exceeded ĉ from tj-1 to tj
Determine the expected time needed to reach ĉ by
$F(\hat{c}, t) = p(c(t) > \hat{c})$
$t(\hat{c}) = F(\hat{c}, 0) \cdot \frac{1}{n} + \sum_{j=1}^{n} h(\hat{c}, t_j) \cdot \frac{j}{n}$
$h(\hat{c}, t_j) = F(\hat{c}, t_j) - F(\hat{c}, t_{j-1})$
[Figure: F(0.3, t) and 1 − F(0.3, t) plotted over time; a second plot over confidence compares the traditional budget approach with the batch approach]
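As a worked illustration of this slide, the sketch below turns an exceedance probability F(ĉ, t) into step probabilities h(ĉ, t_j) and an expected time. It rests on several assumptions that are not stated on the slides: g is taken to be a Gaussian density, μ(t) is linear, σ(t) is constant, items already above ĉ at t = 0 are charged time 0, and items that never reach ĉ are charged the full time 1. It is not claimed to reproduce the slide's exact formula, in particular for those boundary cases.

```python
import math


def exceed_probability(c_hat, mu, sigma):
    """F(c_hat, t): Gaussian tail mass above c_hat for mean mu and spread sigma."""
    return 0.5 * math.erfc((c_hat - mu) / (sigma * math.sqrt(2.0)))


def expected_time(c_hat, mu_of_t, sigma_of_t, n):
    """Expected normalized time until the confidence first exceeds c_hat."""
    times = [j / n for j in range(n + 1)]
    F = [exceed_probability(c_hat, mu_of_t(t), sigma_of_t(t)) for t in times]
    # Force F to be non-decreasing so it can serve as a CDF over the n steps.
    for j in range(1, n + 1):
        F[j] = max(F[j], F[j - 1])
    total = 0.0  # assumption: items above c_hat at t = 0 cost no further time
    for j in range(1, n + 1):
        h_j = F[j] - F[j - 1]  # probability of first exceeding c_hat in step j
        total += h_j * times[j]
    # Assumption: items that never reach c_hat use the full time t = 1.
    total += (1.0 - F[n]) * 1.0
    return total


# Hypothetical budget confidence and scattering, chosen only for illustration.
mu = lambda t: 0.2 + 0.6 * t
sigma = lambda t: 0.1

for c_hat in (0.3, 0.5, 0.7):
    print(c_hat, round(expected_time(c_hat, mu, sigma, n=100), 3))
```

Under these assumptions a higher target confidence ĉ leads to a larger expected time, since fewer items reach it early.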
Batch approach
To improve the overall quality of the results, we have to process several items in parallel
[Figure: batch approach with a buffer, shown at time t0 and at time t0 + 5∙ta; items of type 1, type 2, …, type m arrive with arrival interval ta; tf and td are marked on the stream]
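One way to read the batch idea is sketched below: a batch of buffered items shares a common budget of improvement steps, and each step goes to the item that is currently least confident. The function name `process_batch`, the callback interface, and the selection rule are assumptions for illustration; the slides do not spell out the procedure.

```python
def process_batch(states, improve, confidence, total_steps):
    """Spend total_steps improvement steps on a batch, least confident item first.

    states:      list of per-item states, each already initialized (t = 0)
    improve:     callable doing one improvement step; returns False if exhausted
    confidence:  callable mapping a state to a confidence in [0, 1]
    Returns the number of steps spent on each item.
    """
    active = list(range(len(states)))
    spent = [0] * len(states)
    steps_left = total_steps
    while steps_left > 0 and active:
        i = min(active, key=lambda k: confidence(states[k]))
        if improve(states[i]):
            spent[i] += 1
            steps_left -= 1
        else:
            active.remove(i)  # item is fully processed (t = 1)
    return spent


# Toy items: confidence grows by a fixed amount per step (purely illustrative).
def toy_improve(state):
    if state["conf"] >= 1.0:
        return False
    state["conf"] = min(1.0, state["conf"] + 0.125)
    return True


batch = [{"conf": 0.875}, {"conf": 0.25}, {"conf": 0.5}]
# A budget algorithm would spend 2 steps on each of the 3 items.
print(process_batch(batch, toy_improve, lambda s: s["conf"], total_steps=6))
```

The total time is the same as for a budget algorithm with two steps per item, but the already confident item receives none of it.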
FiFo approach
Use FiFo queue with capacity of r
  Initialize and insert newly arriving items
  Remove eldest item on overflow
  Improve item s with lowest time-weighted confidence
  If confidences are similar, give priority to older items
Queue capacity: $r = (t_d - t_f) / t_a$
Time-weighted confidence: $\mathrm{conf}(s) \cdot w(t_r(s))$
[Figure: two plots of weight over remaining time]
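The queue handling listed on this slide can be sketched as follows. The class `FifoProcessor`, the callback interface, and the linear weight function `time_weight` with its `mtf` parameter are assumptions for illustration; remaining time is measured here in arrivals until eviction, which is also an assumption.

```python
from collections import deque


def time_weight(remaining, horizon, mtf):
    """Weight in [mtf, 1] growing linearly with remaining time (assumed shape)."""
    return mtf + (1.0 - mtf) * (remaining / horizon)


class FifoProcessor:
    def __init__(self, capacity, init_item, improve, confidence, mtf=1.0):
        self.capacity = capacity      # queue capacity r
        self.init_item = init_item    # builds the t = 0 result for a raw item
        self.improve = improve        # one improvement step; False if exhausted
        self.confidence = confidence  # maps an item state to [0, 1]
        self.mtf = mtf                # assumed: mtf = 1.0 means confidence alone
        self.queue = deque()          # index 0 holds the eldest item
        self.emitted = []             # items that have left the queue

    def arrive(self, raw_item):
        """Initialize and insert a new item; remove the eldest item on overflow."""
        self.queue.append(self.init_item(raw_item))
        if len(self.queue) > self.capacity:
            self.emitted.append(self.queue.popleft())

    def improve_one(self):
        """Improve the item with the lowest time-weighted confidence."""
        newest = len(self.queue) - 1

        def score(pos):
            remaining = self.capacity - (newest - pos)  # arrivals until eviction
            weight = time_weight(remaining, self.capacity, self.mtf)
            return self.confidence(self.queue[pos]) * weight

        # sorted() is stable, so equal scores keep the eldest item first.
        for pos in sorted(range(len(self.queue)), key=score):
            if self.improve(self.queue[pos]):
                return pos
        return None


# Toy items whose confidence grows by a fixed amount per step (illustrative).
def toy_improve(item):
    if item["conf"] >= 1.0:
        return False
    item["conf"] = min(1.0, item["conf"] + 0.125)
    return True


proc = FifoProcessor(2, lambda c: {"conf": c}, toy_improve, lambda it: it["conf"])
proc.arrive(0.5)
proc.arrive(0.25)
proc.improve_one()   # improves the less confident, newer item
proc.arrive(0.75)    # overflow: the eldest item leaves the queue
print(list(proc.queue), proc.emitted)
```

With `mtf` below 1, an older item gets a smaller weight than a newer one of equal confidence, so it is improved first.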
Evaluation – classifiers and confidence measures
Anytime nearest neighbor classifier (ordered w.r.t. leave-one-out performance on training set)
Anytime support vector machine (m times one class versus all)
Anytime Bayesian classification (hierarchy of mixture densities per class)
$\mathrm{conf}_{knn}(s) = e^{-\sum_{i=1}^{k} d(s, nn_i(s))}$
$\mathrm{conf}_{svm}(s) = 1 - e^{-(h_{j_1}(s) - h_{j_2}(s))}$
$\mathrm{conf}_{bt}(s) = P(C_{i_1} \mid s) - P(C_{i_2} \mid s)$
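Two simple confidence measures of the kind used here can be sketched as gaps between the best and the second-best class. Both functions below are assumptions for illustration: `posterior_gap` takes the difference of the two largest class posteriors, and `exp_score_gap` maps the difference of the two largest decision values into [0, 1) with an exponential. The function names and the exact form are not taken from the slides.

```python
import math


def posterior_gap(posteriors):
    """Gap between the two largest class posteriors (needs at least 2 classes)."""
    ranked = sorted(posteriors, reverse=True)
    return ranked[0] - ranked[1]


def exp_score_gap(scores):
    """1 - exp(-gap) for the two largest decision values (at least 2 classes)."""
    ranked = sorted(scores, reverse=True)
    return 1.0 - math.exp(-(ranked[0] - ranked[1]))


# Usage with made-up posteriors and decision values.
print(posterior_gap([0.75, 0.25]))
print(exp_score_gap([3.0, 1.0, -2.0]))
```

Both values are 0 when the two best classes are tied and grow as the best class separates from the runner-up.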
Evaluation – batch approach
Throughout: 4-fold cross validation, time scaled to [0, 1]
Budget: performance increases with allotted time
Batch: accuracy increases with growing window size (equal time)
Largest (relative) improvement for small window sizes
Evaluation – batch approach and model
Results confirm theoretic model: “linear” dependency between accuracy and confidence
Expected time t(ĉ) decreases with growing window size
[Figure: plot over confidence comparing the traditional budget approach with the batch approach]
Evaluation – FiFo approach
FiFo approach also outperforms the respective budget algorithm
Accuracy increases with larger minimal time factor mtf
Confidence alone yields the best distribution of time allowance
[Figure: weight over remaining time; the weight axis is marked at mtf and 1]
Time-weighted confidence: $\mathrm{conf}(s) \cdot w(t_r(s))$
Evaluation – comparison
FiFo approach performs better than the batch approach in comparable settings throughout all experiments
Performance improvement even for small window/queue sizes
Conclusion
Data streams are ubiquitous
So far: budget algorithms on constant streams
Achievement: quality improvement over budget algorithms by harnessing the strengths of anytime algorithms
Two simple yet effective approaches
Evaluation using three prominent classifiers and simple confidence measures
Both approaches outperform the respective budget algorithms
Results confirm theoretic model and motivate further research
  Anytime algorithms
  Confidence measures
Poster session tonight
Discuss the paper
Investigate stream data items …