A Framework for Clustering Evolving Data Streams
Yueshen Xu
Zhejiang Univ
CCNT, Middleware
Middleware, CCNT, ZJU 04/10/23
Stream Processing
Data Stream Mining
Event Stream Processing
Complex Event Processing
In-memory Computing
Real-time Computing
Big data
Computing Mode
Real Application
SAP…
Taobao
Yahoo: S4
Baidu
Brown & MIT
We are endeavoring!
The paper itself
Published in VLDB 2003
Has been cited 635 times
By C. C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu
Watson, UIUC, THU, UIC
A standard, a bible, as well as obligatory reading
Expert, Pundit!!
Data Stream & Streaming Data
What is a data stream? A data set that behaves just like flowing water (I think)
An infinite process consisting of data which continuously evolves with time (C. C. Aggarwal, Jiawei Han et al.)
The formalized description
$X_i = (V_i, t_i), \quad V_i = (v_i^1, \ldots, v_i^d)$
$V_i$ is a multi-dimensional record, and $t_i$ is the corresponding time stamp.
The data model has a determining influence on the computing model. How?
Principles
Very different from those for static data sets (my own thought)
One-pass scan: you have only one chance to see each record
No storage for primitive data: the stream is infinite (another form of big data), and storing it is not necessary
In-memory mining
Instantaneous
Preference for newly arriving data (user point of view)
Approximate results
You must change your old ideas about traditional static data sets
Ordered, Countable, Enumerable, Infinite, no-storage
Data Model
Vital!
The Framework
The methodology: the core value of the paper
Micro- and macro-clustering process: necessary and inevitable under this framework
The pyramidal time frame: balancing accuracy against storage capacity
The principle of approximate results
Cluster Feature Vector
Additivity
The micro-clusters
Is it sophistic?
Why were they chosen?
Cluster Feature Vector
Definition: CFV is defined as a $(2d + 3)$ tuple $(CF2^x, CF1^x, CF2^t, CF1^t, n)$
$CF2^x$: the sum of the squares of the data values ($\sum x^2$, per dimension)
$CF1^x$: the sum of the data values ($\sum x$, per dimension)
$CF2^t$: the sum of the squares of the time stamps ($\sum t^2$)
$CF1^t$: the sum of the time stamps ($\sum t$)
$n$: the number of data items belonging to the cluster
Why CFV?
User-oriented
Additivity
Not originally proposed by Prof. Han et al.
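To make the additivity point concrete, here is a minimal Python sketch of a cluster feature vector. The class name `CFV` and its method names are illustrative assumptions, not taken from the paper. It shows that merging two clusters is a component-wise sum, and that the centroid and RMS deviation can be recovered from the sums alone, without the raw points.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class CFV:
    """Hypothetical container for the (2d + 3) components of a cluster feature vector."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points

    @classmethod
    def from_point(cls, v, t):
        # A single record (V_i, t_i) is a cluster of size one.
        return cls([x * x for x in v], list(v), t * t, t, 1)

    def __add__(self, other):
        # Additivity: the CFV of a union is the component-wise sum.
        return CFV(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_deviation(self):
        # sqrt of the summed per-dimension variance, E[x^2] - E[x]^2
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))


merged = CFV.from_point([1.0, 2.0], 1.0) + CFV.from_point([3.0, 4.0], 2.0)
print(merged.centroid(), merged.rms_deviation())
```

Because only sums are kept, the memory per cluster is fixed at 2d + 3 numbers no matter how many points it has absorbed.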
Pyramidal Time Frame(1)
Snapshots are classified into different orders, which can vary from 1 to $\log_\alpha(T)$
Snapshots of the $i$-th order are taken at time intervals of $\alpha^i$
Only the last $\alpha^l + 1$ snapshots of order $i$ are stored
An example: $\alpha = 2$, $l = 2$
Worst case
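The storage rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's implementation: the function name `pyramidal_snapshots` is made up, orders are numbered from 0, and the retained snapshots are computed in closed form instead of by simulating the stream.

```python
def pyramidal_snapshots(T, alpha=2, l=2):
    """Return {order: snapshot times, newest first} retained at clock time T.

    Assumes integer alpha >= 2 and T >= 1.
    """
    keep = alpha ** l + 1                # snapshots retained per order
    frame = {}
    order = 0
    while alpha ** order <= T:
        step = alpha ** order            # order-i snapshots occur every alpha^i ticks
        times = list(range(step, T + 1, step))
        frame[order] = times[-keep:][::-1]
        order += 1
    return frame


frame = pyramidal_snapshots(55, alpha=2, l=2)
for order, times in frame.items():
    print(order, times)
```

With these parameters the same clock time can appear under several orders (48 is kept at orders 1 through 4), which is the redundancy this version of the time frame tolerates.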
Pyramidal Time Frame(2)
The difference from his book
Divisible by $\alpha^i$, but not by $\alpha^{i+1}$
The number of orders is constant
Best case: no redundancy
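A small Python sketch of the non-redundant rule, where each clock time belongs to exactly one order (the largest $i$ such that $\alpha^i$ divides it). The function names and the `keep` parameter are assumptions for illustration, and the sketch does not cap the number of orders.

```python
from collections import deque


def snapshot_order(t, alpha=2):
    """Largest i such that alpha^i divides t (assumes t >= 1, alpha >= 2)."""
    order = 0
    while t % (alpha ** (order + 1)) == 0:
        order += 1
    return order


def nonredundant_frame(T, alpha=2, keep=5):
    """Simulate clock ticks 1..T, keeping the last `keep` snapshots per order."""
    frame = {}
    for t in range(1, T + 1):
        order = snapshot_order(t, alpha)
        frame.setdefault(order, deque(maxlen=keep)).append(t)
    return {order: list(times)[::-1] for order, times in sorted(frame.items())}


frame = nonredundant_frame(55)
for order, times in frame.items():
    print(order, times)
```

Since every time stamp is assigned to a single order, no snapshot is stored twice.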
Why (my own thought)
The newer is kept, and the older is discarded
The lower orders are not friendly to old snapshots, but the higher ones are
The scheme not only punishes the older snapshots, but also protects them
Micro-Cluster(1)------Procedure
[Figure: timeline with labels t, h, h' and T, showing micro-clusters (CFV) stored in snapshots]
Micro-Cluster(2)------Initialization
What is to be initialized?
Micro-clusters
The number of micro-clusters maintained in each snapshot is constant
Determined by the amount of memory available
Larger than the natural number of clusters, but smaller than the number of data points in the data stream
Each cluster owns a unique id
Supported by the experiment
Reasonable?
Micro-Cluster(3)------Updating
A new data point arrives; what will be done?
Join, Delete & Merge
Join: find the nearest micro-cluster; the point joins it if it falls within that cluster's boundary
RMS & Distance
Delete: find the oldest one
The average time stamp of the last m data points
Take the time stamps contained in the CFV as the approximation
Merge: find the closest two clusters
They don't explain how
idlist
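The join / delete / merge decision can be sketched as follows. This is a simplified illustration, not the paper's procedure: the names `MicroCluster` and `update`, and the parameters `boundary_factor`, `min_boundary` and `stale_before`, are assumptions. In particular, the age of a cluster is approximated here by its mean time stamp (`cf1t / n`) rather than by the time stamp of the last m points, and the sketch assumes `q >= 2`.

```python
import math


class MicroCluster:
    def __init__(self, cid, v, t):
        self.ids = [cid]                      # idlist: ids folded into this cluster
        self.n = 1
        self.cf1x = list(v)
        self.cf2x = [x * x for x in v]
        self.cf1t = t
        self.cf2t = t * t

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms(self):
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))

    def mean_time(self):
        return self.cf1t / self.n

    def absorb(self, v, t):
        self.n += 1
        self.cf1x = [a + b for a, b in zip(self.cf1x, v)]
        self.cf2x = [a + b * b for a, b in zip(self.cf2x, v)]
        self.cf1t += t
        self.cf2t += t * t

    def merge(self, other):
        self.ids += other.ids                 # remember which ids were merged
        self.n += other.n
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def update(clusters, v, t, new_id, q, boundary_factor=2.0,
           min_boundary=1.0, stale_before=0.0):
    """Process one point; returns 'join', 'new', 'delete' or 'merge'."""
    if clusters:
        nearest = min(clusters, key=lambda c: dist(c.centroid(), v))
        # Boundary: a multiple of the RMS deviation; min_boundary is a
        # hypothetical floor so that single-point clusters can still absorb.
        boundary = max(boundary_factor * nearest.rms(), min_boundary)
        if dist(nearest.centroid(), v) <= boundary:
            nearest.absorb(v, t)
            return "join"
    action = "new"
    if len(clusters) >= q:                    # budget of q micro-clusters is full
        oldest = min(clusters, key=lambda c: c.mean_time())
        if oldest.mean_time() < stale_before:
            clusters.remove(oldest)           # delete the stale cluster
            action = "delete"
        else:
            pairs = [(dist(a.centroid(), b.centroid()), i, j)
                     for i, a in enumerate(clusters)
                     for j, b in enumerate(clusters) if i < j]
            _, i, j = min(pairs)              # closest two clusters
            clusters[i].merge(clusters[j])
            del clusters[j]
            action = "merge"
    clusters.append(MicroCluster(new_id, v, t))
    return action
```

Either way the number of micro-clusters never exceeds `q`: a point that cannot join an existing cluster forces one deletion or one merge before its own cluster is created.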
Macro-Cluster(1)------Find the approximate time stamp
What is the analyst's behavior?
Find clusters over a past time horizon of h
All about the additivity property
I don't understand how they cope with fault tolerance
Only two snapshots are necessary
$S(t_c) - S(t_c - h')$
What is to be clustered?
CFV
Not user-friendly
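The snapshot subtraction can be illustrated with a short Python sketch. The data layout is an assumption made for the example: each snapshot is a dict mapping a frozenset of ids (the idlist) to a flat tuple of additive statistics whose last component is n. The function name `subtract_snapshots` is hypothetical.

```python
def subtract_snapshots(current, past):
    """Approximate the micro-clusters of the horizon as S(t_c) - S(t_c - h').

    current, past: {frozenset(idlist): (cf2x..., cf1x..., cf2t, cf1t, n)}
    """
    result = {}
    for ids, cf in current.items():
        remaining = list(cf)
        for past_ids, past_cf in past.items():
            if past_ids <= ids:              # past cluster was folded into this one
                remaining = [a - b for a, b in zip(remaining, past_cf)]
        if remaining[-1] > 0:                # keep clusters that gained points
            result[ids] = tuple(remaining)
    return result


# 1-D example, statistics ordered as (cf2x, cf1x, cf2t, cf1t, n).
past = {
    frozenset({1}): (1.0, 1.0, 1.0, 1.0, 1),    # x = 1 at t = 1
    frozenset({2}): (4.0, 2.0, 4.0, 2.0, 1),    # x = 2 at t = 2
}
current = {
    frozenset({1, 2}): (14.0, 6.0, 14.0, 6.0, 3),  # clusters 1 and 2 merged, plus x = 3 at t = 3
    frozenset({3}): (25.0, 5.0, 16.0, 4.0, 1),     # x = 5 at t = 4
}
horizon = subtract_snapshots(current, past)
print(horizon)
```

The idlist is what lets the subtraction find the right partners after a merge: the merged cluster {1, 2} has both older clusters removed from it, leaving only the point that arrived inside the horizon.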
Macro-Cluster(2)------modified k-means
What has been modified in k-means?
The micro-clusters are treated as pseudo-points
The seeds are no longer picked randomly
The more points, the more important
Experiments are sufficient
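The modifications to k-means listed above can be sketched in plain Python. This is a minimal illustration under assumptions: the function name `weighted_kmeans` and its parameters are made up, each micro-cluster is represented by its centroid with its point count as weight, seeds are sampled with probability proportional to the weights, and centers are weighted means.

```python
import random


def weighted_kmeans(points, weights, k, seed=0, max_iter=50):
    """k-means over pseudo-points (micro-cluster centroids) weighted by point count."""
    rng = random.Random(seed)
    # Seeding: draw k distinct pseudo-points, heavier ones being more likely.
    candidates = list(range(len(points)))
    centers = []
    for _ in range(k):
        w = [weights[i] for i in candidates]
        pick = rng.choices(candidates, weights=w, k=1)[0]
        candidates.remove(pick)
        centers.append(list(points[pick]))
    dim = len(points[0])
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[nearest].append(i)
        # Update step: weighted centroid of each group.
        new_centers = []
        for c, members in enumerate(groups):
            if not members:
                new_centers.append(centers[c])   # keep an empty center in place
                continue
            total = sum(weights[i] for i in members)
            new_centers.append([
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(dim)])
        if new_centers == centers:
            break
        centers = new_centers
    return centers


centroids = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
counts = [10, 30, 20, 20]
macro = sorted(weighted_kmeans(centroids, counts, k=2))
print(macro)
```

In the example the first macro-cluster center is pulled toward (0, 1) because that micro-cluster holds three times as many points as (0, 0).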
Q&A