Date post: 03-Jan-2016
Memory-Constrained Data Mining Slobodan Vucetic Assistant Professor Department of Computer and Information Sciences Center for Information Science and Technology Temple University, Philadelphia

Memory-Constrained Data Mining

Slobodan Vucetic

Assistant Professor
Department of Computer and Information Sciences

Center for Information Science and Technology
Temple University, Philadelphia


Scientific Data Mining Lab
Dr. Slobodan Vucetic, Assistant Professor
CIS Department, IST Center, Temple University, Philadelphia, USA

Need (see Nature of March 23, 2006):
• Amount of data in science every year
• Shift from computers supporting scientists to playing a central role in the testing, and even formulation, of scientific hypotheses

Lab Mission:
• Developing an interface between data analysis and applied sciences
• Working on collaborative projects at the interface between computer science and other disciplines (sciences, engineering, business)
• Training students to become computational research scientists

Research Tasks:
• Predictive Modeling
• Pattern Discovery
• Summarization


Spatial and temporal dependency

High dimensional data

Data collection bias

Data and knowledge fusion from multiple sources

Large-scale data

Missing/noisy/unstable attributes …

Scientific Data Mining Lab:

Research Challenges


Data Mining
• Resource-Constrained Data Mining (NSF)

Earth Science Applications
• Estimation of geophysical parameters from satellite data (NSF)

Biomedical Applications
• Gene expression data analysis (NIH, PA Dept. of Health)
• Bioinformatics of protein disorder (PA Dept. of Health)
• Bioinformatics core facility (PA Dept. of Health)
• Text mining and information retrieval (NSF)
• Spatial modeling of disease and infection spread

Spatial and Temporal Knowledge Discovery
• Spatial-temporal data reduction (NSF)
• Analysis of deregulated electricity markets
• Analysis of highway traffic data

Scientific Data Mining Lab:

Current Projects


Aim: Accurate and efficient estimation of geophysical parameters from MISR and MODIS instruments on Terra satellite and ground based observations (huge data streams)

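[Figure: MISR viewing geometry with nine cameras (Df, Cf, Bf, Af, An, Aa, Ba, Ca, Da) at view angles of 70.5º, 60.0º, 45.6º, 26.1º, 0.0º, 26.1º, 45.6º, 60.0º, and 70.5º; a distance of 2800 km is marked.]

MISR: Multi-angle Imaging Spectro-Radiometer
• 9 view angles at Earth surface
• 4 spectral bands
• 400-km swath width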

Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., A Data Mining Approach for the Validation of Aerosol Retrievals, IEEE Geoscience and Remote Sensing Letters, 2008.

Scientific Data Mining Lab:

Multiple-Source Spatial-Temporal Data Analysis


Result: several pricing regimes existed in the California market

Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) “Price-Load Relationships in California's Electricity Market,” IEEE Trans. on Power Systems.

[Figure: two time-series panels over hours 0 to 12000, with date markers in 1998 and 1999.]

Regime   Size (hours)   Price prediction R2 (Local)   Price prediction R2 (Global)   Price Volatility
1        5707           0.79                          0.76                           40
2        4630           0.81                          0.75                           19
3        1425           0.72                          -0.49                          9
4        1191           0.48                          -0.03                          56

[Figure: scatter plot of Price [$/MWh] versus Load [GWh], with regions labeled by regime 1 to 4.]

Scientific Data Mining Lab:

Temporal Data Mining

Aim: analyze price vs. load dependences by discovering semi-stationary segments in multivariate time series


When a topic is difficult to express as a query, often
• No relevant articles are found by keyword search
• Too many irrelevant articles are returned

Biomedical Example:
• “Apurinic/apyrimidinic endonuclease”: 638 citations returned by PubMed
• “Apurinic/apyrimidinic endonuclease disorder”: 1 citation (irrelevant) returned

Result: Large lift of relevant retrievals in top 10

Han, B., Obradovic, Z., Hu, Z.Z., Wu, C. H. and Vucetic, S. (2006) “Substring Selection for Biomedical Document Classification,” Bioinformatics.

Scientific Data Mining Lab: Text Mining: Re-Ranking of Articles Retrieved by a Search Engine


Scientific Data Mining Lab: Collaborative Filtering

Aim: Predict the preferences of an active customer given his/her preferences on some items and a database of preferences of other customers

Result: The regression-based collaborative filtering algorithm is superior to the neighbor-based approach: it is two orders of magnitude faster in online prediction, more accurate, and more robust to a small number of observed votes.

Vucetic, S., Obradovic, Z., Collaborative Filtering Using a Regression-Based Approach, Knowledge and Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.


Aim: Understanding protein disorder and its functions

Results:
• Protein disorders are very common (contrary to a 20th century belief)
• Fraction of disorder varies a lot by genomes
• Different types of disorder exist in proteins
• Involved with many important functions

Vucetic, S., Brown, C., Dunker, A.K. and Obradovic, Z., Flavors of Protein Disorder, Proteins: Structure, Function and Genetics, Vol. 52, pp. 573-584, 2003.

Kissinger et al., 1995

Scientific Data Mining Lab: Bioinformatics: Protein Disorder Analysis


Scientific Data Mining Lab: Analysis of Highway Traffic Data

Aim: understand traffic patterns, predict traffic congestion and delays

In progress…


Scientific Data Mining Lab: Spatio-Temporal Disease Modelling

Aim: predict infection or disease risk, given the information about population movement

Figure 1. Illustration of location clusters and the associated risks. [Plots not reproduced; axis labels: Location Type, Activity Type, Infection Risk.]

Result: movement information is very useful in prediction of the infection risk

Vucetic, S., Sun, H., Aggregation of Location Attributes for Prediction of Infection Risk, Workshop on Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.


Scientific Data Mining Lab:

Resource-Constrained Data Mining

Aim: Efficient knowledge discovery from large data by limited-capacity computing devices

Approach: Integration of data mining and data compression

Figure 1. Left: noisy checkerboard data; the goal is to discriminate between black and yellow dots, and the achievable accuracy is 90%. Middle: 100 randomly selected examples and the trained prediction model, which has 76% accuracy. Right: 100 examples selected by the reservoir algorithm and the trained prediction model, which has 88% accuracy.


Resource-Constrained Data Mining: Motivation

Data mining objective:
• Efficient and accurate algorithms for learning from large data

Performance measures:
• Accuracy
• Scaling with data size (# examples, # attributes)

Mainstream data mining:
• Many accurate learning algorithms that scale linearly or even sub-linearly with data size and dimension, in both runtime and space

Caveat:
• Linear space scaling is often not sufficient: it implies an unbounded growth in memory with data size

Challenge:
• How to learn from large, or practically infinite, data sets/streams using limited memory resources


Resource-Constrained Data Mining: Learning Scenario

• Examples are observed sequentially in a single pass
• Data stream examples are independent and identically distributed (IID)
• A data summary could be stored in a reservoir with fixed memory


Resource-Constrained Data Mining: Approaches

Model-Free: Reservoir Approach
• Maintains a random sample of size R from the data stream
• Add $x_t$ with probability min(1, R/t), removing a randomly chosen stored example to make room
• Caveat: random sampling is often not optimal

Data-Free: Online algorithms
• Update the model as examples are observed
• Perceptron: $w_{t+1} = w_t + (y_t - f(x_t)) x_t$, where $f(x) = w^T x$
• Caveat: sensitive to data ordering

Hybrid: Data + Model
• Implicitly done with Support Vector Machines (SVMs)
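
The two simplest approaches above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's code: the function names `reservoir_sample` and `online_linear_update` are hypothetical, and the reservoir follows the standard textbook scheme (accept $x_t$ with probability min(1, R/t), evict a uniformly random resident).

```python
import random

def reservoir_sample(stream, R, rng=None):
    """Keep a uniform random sample of at most R items from a single-pass stream."""
    rng = rng or random.Random()
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= R:
            # min(1, R/t) equals 1 while the reservoir is not yet full
            reservoir.append(x)
        elif rng.random() < R / t:
            # accept x_t with probability R/t and evict a random stored item
            reservoir[rng.randrange(R)] = x
    return reservoir

def online_linear_update(w, x, y):
    """One step of w <- w + (y - f(x)) x with f(x) = w^T x, as written on the slide."""
    f = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + (y - f) * xi for wi, xi in zip(w, x)]
```

The reservoir uses O(R) memory regardless of stream length, while the online update stores only the weight vector; neither keeps both data and model, which is the gap the hybrid approach targets.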


Resource-Constrained Data Mining: Objective

Develop a memory-constrained SVM algorithm

What is SVM?
• Popular data mining algorithm for classification
• The most accurate on many problems
• Theoretically and practically appealing
• Computationally expensive
  • Cubic training time cost, $O(N^3)$ (e.g. neural nets are $O(N)$)
  • Quadratic training memory cost, $O(N^2)$ (e.g. neural nets are $O(N)$)
  • Linear prediction cost, $O(N)$ (e.g. neural nets are $O(1)$)


Resource-Constrained Data Mining: SVM Overview

Goal:
• Use $x_1$ and $x_2$ to predict class $y \in \{-1, 1\}$
• Assume a linear prediction function $f(x) = w_1 x_1 + w_2 x_2 + b$
• $\mathrm{sign}(f(x))$ is the final prediction

Challenge:
• Which is better, $f_1(x)$ or $f_2(x)$?
• What is the best choice for $f(x)$?

Answer:
• The best $f(x)$ has the most wiggle space: it has the largest margin

[Figure: two candidate separating lines, $f_1(x)$ and $f_2(x)$, in the $(x_1, x_2)$ plane.]
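
To make the "wiggle space" idea concrete, the sketch below compares the geometric margin of two linear functions on a toy data set. The data points, the two candidate functions, and the helper name `geometric_margin` are assumptions made for illustration; they are not taken from the slides.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * f(x) / ||w|| over the data set."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy 2-D data: negatives on the left, positives on the right (illustrative only)
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, 1, 1]

f1 = ((1.0, 0.0), -2.0)  # boundary at x1 = 2, midway between the classes
f2 = ((1.0, 0.0), -2.8)  # boundary at x1 = 2.8, close to the positive class

m1 = geometric_margin(*f1, points, labels)
m2 = geometric_margin(*f2, points, labels)
```

Both functions classify every point correctly, but `f1` leaves more room on each side of the boundary, so by the margin criterion it is the better choice.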


Resource-Constrained Data Mining: SVM Overview

Maximizing margin is equivalent to:
minimize $\|w\|^2$
such that $y_i f(x_i) \ge 1$

What if data are noisy?
minimize $\|w\|^2 + C \sum_i \xi_i$
such that $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$

What if problem is nonlinear?
$x \to \Phi(x)$
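
As a small worked example of the noisy (soft-margin) objective, the sketch below evaluates the slack variables and the objective value for a fixed linear function. The data, the choice C = 1, and the helper name `soft_margin_objective` are illustrative assumptions; for a fixed (w, b) the smallest feasible slack is $\xi_i = \max(0, 1 - y_i f(x_i))$.

```python
def soft_margin_objective(w, b, C, points, labels):
    """Return (||w||^2 + C * sum of slacks, slacks) for a fixed linear f."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    # Smallest slack satisfying y_i f(x_i) >= 1 - xi_i and xi_i >= 0
    slacks = [max(0.0, 1.0 - y * f(x)) for x, y in zip(points, labels)]
    objective = sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Four clean points plus one noisy negative example on the wrong side
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0), (2.5, 0.0)]
labels = [-1, -1, 1, 1, -1]

objective, slacks = soft_margin_objective((1.0, 0.0), -2.0, 1.0, points, labels)
```

Only the noisy point receives a nonzero slack, so C controls how strongly that single violation is traded off against the margin term.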


Resource-Constrained Data Mining: SVM Overview

Standard approach: convert to the dual problem

Primal:
minimize $\|w\|^2 + C \sum_i \xi_i$
such that $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$

Dual:
$\min_{0 \le \alpha_i \le C} : W = \frac{1}{2} \sum_{i,j} \alpha_i Q_{ij} \alpha_j - \sum_i \alpha_i + b \sum_i y_i \alpha_i$

where $Q_{ij} = y_i y_j \Phi(x_i) \cdot \Phi(x_j) = y_i y_j K(x_i, x_j)$, K is the kernel function

Gaussian kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / A)$

$\alpha_i$ are Lagrange multipliers

Optimization becomes a Quadratic Programming problem (minimizing a convex function with linear constraints)

The optimal solution can be found in $O(N^3)$ time and $O(N^2)$ space

SVM predictor:
$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b$

To predict the class of example x, we should compare it with all training examples with $\alpha_i > 0$
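
The predictor above can be sketched directly in Python. This is an illustrative sketch only: the multipliers `alpha` and the bias `b` are assumed to come from an already solved dual problem, and the function names are hypothetical.

```python
import math

def gaussian_kernel(xi, xj, A=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / A)."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(xi, xj))
    return math.exp(-sq_dist / A)

def svm_predict(x, train_x, train_y, alpha, b, A=1.0):
    """f(x) = sum_i y_i * alpha_i * K(x_i, x) + b; sign(f(x)) is the class."""
    f = b
    for xi, yi, ai in zip(train_x, train_y, alpha):
        if ai > 0:  # examples with alpha_i = 0 do not contribute
            f += yi * ai * gaussian_kernel(xi, x, A)
    return f
```

The loop makes the prediction cost visible: it is linear in the number of training examples with $\alpha_i > 0$, which is why bounding that number bounds both memory and prediction time.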


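Resource-Constrained Data Mining: SVM Overview

$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b$

• $\alpha_i = 0$: reserve vectors
• $0 < \alpha_i < C$: support vectors
• $\alpha_i = C$: error vectors

[Figure: training examples relative to the lines $f(x) = -1$ and $f(x) = +1$.]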
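
The three categories can be told apart from the multiplier alone. The sketch below assumes the usual convention of the incremental-decremental SVM literature (multiplier 0 for reserve vectors, strictly between 0 and C for support vectors, C for error vectors); the function name and the tolerance `tol` are hypothetical.

```python
def vector_type(alpha_i, C, tol=1e-12):
    """Classify a training example by its Lagrange multiplier alpha_i."""
    if alpha_i <= tol:
        return "reserve"   # alpha_i = 0: does not affect the predictor
    if alpha_i >= C - tol:
        return "error"     # alpha_i = C: violates the margin
    return "support"       # 0 < alpha_i < C: lies on the margin
```

A tolerance is used because numerical solvers rarely return multipliers that are exactly 0 or C.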


Resource-Constrained Data Mining: Incremental-Decremental SVM

• The standard SVM solution is “batch”, meaning that all training data should be available for learning
• The alternative is an “online” SVM that can be updated when new training data are available

Incremental-Decremental SVM (IDSVM) [Cauwenberghs, Poggio, 2000]
• For each new example, the update takes
  • $O(N_s^2)$ time, where $N_s$ is the number of support vectors ($0 < \alpha_i < C$)
  • $O(N_s N)$ memory; considering $N_s = O(N)$, memory is $O(N^2)$
• Total cost for online training on N examples is
  • $O(N^3)$ time
  • $O(N^2)$ memory
  • The same as for batch mode


Resource-Constrained Data Mining: Memory-Constrained IDSVM

Idea:
• Modify IDSVM by upper-bounding the number of support vectors

How: Twin Vector Machine (TVM)
• Define budget B and a set of pivot vectors $q_1, \dots, q_B$
• Quantize each example to its nearest pivot: $Q(x) = q_k$, $k = \arg\min_{j=1:B} \|x - q_j\|$
• $D = \{(x_i, y_i), i = 1 \dots N\}$ becomes $Q(D) = \{(Q(x_i), y_i), i = 1 \dots N\}$
• Training SVM on Q(D) is equivalent to training SVM on TV
  • $TV = \{TV_j, j = 1 \dots B\}$ (Twin Vector Set)
  • $TV_j = \{(q_j, +1, n_j^+), (q_j, -1, n_j^-)\}$ (Twin Vector)
• Time: $O(N^3) \to O(B^3)$ (constant); memory: $O(N^2) \to O(B^2)$ (constant)

SVM on D:
minimize $\|w\|^2 + C \sum_i \xi_i$
such that $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1 \dots N$

SVM on TV:
minimize $\|w\|^2 + C \sum_j (n_j^+ \xi_j^+ + n_j^- \xi_j^-)$
such that $f(q_j) \ge 1 - \xi_j^+$, $-f(q_j) \ge 1 - \xi_j^-$, $\xi_j^+, \xi_j^- \ge 0$, $j = 1 \dots B$
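
The quantization step can be sketched as follows: each example is mapped to its nearest pivot, and each pivot keeps a count of positive and negative examples assigned to it. This is a minimal sketch assuming fixed, given pivots and Euclidean distance; the names `nearest_pivot` and `build_twin_vectors` are hypothetical.

```python
def nearest_pivot(x, pivots):
    """Index k = arg min_j ||x - q_j|| (squared distance gives the same arg min)."""
    return min(
        range(len(pivots)),
        key=lambda j: sum((a - c) ** 2 for a, c in zip(x, pivots[j])),
    )

def build_twin_vectors(data, pivots):
    """Summarize data as one (q_j, n_plus, n_minus) triple per pivot."""
    counts = [[0, 0] for _ in pivots]  # [n_plus, n_minus] for each pivot
    for x, y in data:
        k = nearest_pivot(x, pivots)
        counts[k][0 if y == 1 else 1] += 1
    return [(q, n_plus, n_minus) for q, (n_plus, n_minus) in zip(pivots, counts)]
```

However many examples arrive, the summary holds exactly B triples, which is what makes the training cost depend on B rather than on N.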


Resource-Constrained Data Mining: Online TVM

Online-TVM
Input: Data stream $D = \{(x_i, y_i), i = 1 \dots N\}$, budget B, kernel function K, slack parameter C
Output: TVM with parameters $\alpha_1^+, \alpha_1^-, \dots, \alpha_B^+, \alpha_B^-$, and b

1. Initialize TVM = 0, TV = ∅
2. for i = 1 to N
3.   if Beneficial($x_i$)
4.     Update-TV
5.     Update-TVM
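
The control flow of Online-TVM can be sketched as a loop that delegates to three pluggable steps. This sketch assumes that both Update-TV and Update-TVM run only when the example is judged beneficial; the callables `beneficial`, `update_tv`, and `update_tvm` and their signatures are hypothetical placeholders, not the authors' interfaces.

```python
def online_tvm(stream, B, beneficial, update_tv, update_tvm):
    """Single pass over the stream, keeping at most B twin vectors."""
    tv = []       # twin vector set, initially empty
    model = None  # stands for "TVM = 0" before any update
    for x, y in stream:
        if beneficial(model, tv, x, B):
            tv = update_tv(tv, x, y, B)    # insert, replace, or merge
            model = update_tvm(model, tv)  # re-solve on the updated set
    return model, tv
```

Passing the steps in as functions keeps the loop independent of how the model is retrained, so different merging heuristics can be swapped in without touching the loop.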


Resource-Constrained Data Mining: Online TVM

Beneficial
1. if size(TV) < B or $|f(x_i)| \le m_1$
2.   return 1
3. else
4.   return 0

[Figure: levels of $f(x)$ at -1, 0, and +1, with a buffer on each side bounded by $m_1$.]
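
The Beneficial test is a one-line filter. The sketch below assumes the comparison is $|f(x_i)| \le m_1$, meaning that an example is kept when the budget is not yet full or when it falls inside the buffer around the decision boundary; the argument names are hypothetical.

```python
def beneficial(f_x, tv_size, B, m1):
    """Return 1 if the example should update the twin vector set, else 0."""
    if tv_size < B or abs(f_x) <= m1:
        return 1
    return 0
```

Examples far from the boundary (large $|f(x)|$) are skipped once the budget is full, so most of the stream costs only one prediction.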


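Resource-Constrained Data Mining: Online TVM

Update-TV
  s = size(TV)
  $TV_{B+1} = \{(x_i, y_i, 1), (q_i, -y_i, 0)\}$
  if s < B
    $TV_{s+1} = TV_{B+1}$
  elseif $\max_{i=1:B} |f(q_i)| > m_2$
    $k = \arg\max_{i=1:B} |f(q_i)|$
    $TV_k = TV_{B+1}$
  else
    find best pair $TV_i$, $TV_j$ to merge
    use (**) to calculate $q_{new}$
    $TV_i = \{(q_{new}, +1, s_i^+ + s_j^+), (q_{new}, -1, s_i^- + s_j^-)\}$
    $TV_j = TV_{B+1}$

(**) $q_{new} = \dfrac{(s_i^+ + s_i^-) q_i + (s_j^+ + s_j^-) q_j}{s_i^+ + s_i^- + s_j^+ + s_j^-}$

[Figure: levels of $f(x)$ at -1, 0, and +1, with a buffer on each side bounded by $m_2$.]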
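
The merge step in (**) is a count-weighted average of two pivots. The sketch below represents a twin vector as a tuple `(q, s_plus, s_minus)`; this representation and the function name `merge_twin_vectors` are assumptions made for illustration, and choosing which pair to merge is left out.

```python
def merge_twin_vectors(tv_i, tv_j):
    """Merge two twin vectors; the new pivot is the count-weighted mean (**)."""
    q_i, sp_i, sm_i = tv_i
    q_j, sp_j, sm_j = tv_j
    w_i = sp_i + sm_i  # total count represented by q_i
    w_j = sp_j + sm_j  # total count represented by q_j
    total = w_i + w_j
    q_new = tuple((w_i * a + w_j * c) / total for a, c in zip(q_i, q_j))
    # Per-class counts add up, so no example is lost from the summary
    return (q_new, sp_i + sp_j, sm_i + sm_j)

merged = merge_twin_vectors(((0.0, 0.0), 2, 1), ((4.0, 0.0), 1, 0))
```

In this example the first pivot represents three examples and the second one, so the merged pivot lands a quarter of the way from the first to the second.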


Resource-Constrained Data Mining: Online TVM

Merging Heuristics:
• Nearest versus Weighted
• Global versus One-Sided
• Rejection merging

[Figure: two panels, GlobalMerge and OneSideMerge, each with levels labeled -1, 0, and +1.]


Resource-Constrained Data Mining: Results

[Figure: three panels titled 100, 400, and 10000, each plotted over roughly -1.5 to 1.5 on both axes with its own color scale. Budget B = 100.]


Resource-Constrained Data Mining:

Results

[Figure, left panel: Checkerboard (noisy). Accuracy versus length of data stream (log scale, $10^2$ to $10^4$) for TVM, IDSVM, LIBSVM, and Random Sampling.]

[Figure, right panel: Checkerboard (noisy). CPU time (in seconds) versus length of data stream (0 to 10000) for TVM and IDSVM.]

Budget B = 100


Resource-Constrained Data Mining:

Results

[Figure, left panel: Checkerboard (noisy). Accuracy versus length of data stream (log scale, $10^1$ to $10^4$) for TVM with budgets 50, 100, and 200.]

[Figure, right panel: Checkerboard (noisy). CPU time (in seconds) versus length of data stream (0 to 10000) for TVM with budgets 50, 100, and 200.]


Resource-Constrained Data Mining:

Results

Budget B = 100

[Figure, left panel: Checkerboard (noisy). Accuracy versus length of data stream (0 to 10000), with buffer versus without buffer.]

[Figure, right panel: Adult. Accuracy versus length of data stream (log scale, $10^2$ to $10^5$) for OneSideMerge and GlobalMerge.]


Resource-Constrained Data Mining:

Results

[Figure: six panels, one per data set (Adult, Banana, Checkerboard, Gauss, IJCNN, Pendigits), each showing accuracy versus length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling.]


Resource-Constrained Data Mining:

Conclusions

• Memory-Constrained SVM is successful
  • Significantly higher accuracy than baseline
  • Close to the optimal approach
• Merging heuristics are very important

Future work
• Further improvements
• Forgetting
• Probabilistic merging
• Use data compression
• Non-IID streams


Thank You!

More information: http://www.ist.temple.edu/~vucetic/

Collaboration/assistantship contact: Slobodan Vucetic CIS Department, IST Center, Temple University [email protected]

