Rare Category Detection

Date post: 31-Dec-2015
Page 1: Rare Category Detection

Rare Category Detection

Jingrui He

Machine Learning Department

Carnegie Mellon University

Joint work with Jaime Carbonell

Page 2: Rare Category Detection

11/17/2008 Machine Learning Lunch 2

What Is Rare Category Detection?

- Start de novo
- Very skewed classes
  - Majority classes
  - Minority classes
- Labeling oracle
- Goal: discover minority classes with a few label requests

Page 3: Rare Category Detection

Comparison with Outlier Detection

- Rare classes
  - A group of points
  - Clustered
  - Non-separable from the majority classes
- Outliers
  - A single point
  - Scattered
  - Separable

Page 4: Rare Category Detection

Comparison with Active Learning

- Rare category detection
  - Initial condition: NO labeled examples
  - Goal: discover the minority classes with the fewest label requests
- Active learning
  - Initial condition: labeled examples from each class
  - Goal: improve the performance of the current classifier with the fewest label requests

Page 5: Rare Category Detection

Applications

Network intrusion detection

Astronomy

Fraud detection

Spam image detection

Page 6: Rare Category Detection

The Big Picture

[Diagram with boxes: Raw Data (Spatial, Relational, Temporal), Feature Extraction, Unbalanced Unlabeled Data Set, Rare Category Detection, Learning in Unbalanced Settings, Classifier]

Page 7: Rare Category Detection

Outline

- Problem definition
- Related work
- Rare category detection for spatial data
  - Prior-dependent rare category detection
  - Prior-free rare category detection
- Conclusion

Page 8: Rare Category Detection

Related Work

- Pelleg & Moore 2004
  - Mixture model
  - Different selection criteria
- Fine & Mansour 2006
  - Generic consistency algorithm
  - Upper bounds and lower bounds
- Papadimitriou et al. 2003
  - LOCI algorithm for groups of outliers

Separable or near-separable

[Figure: three scatter plots; only the axis ticks survive in the transcript]

Page 9: Rare Category Detection

Outline

- Problem definition
- Related work
- Rare category detection for spatial data
  - Prior-dependent rare category detection
  - Prior-free rare category detection
- Conclusion

Page 10: Rare Category Detection

Notations

- Unlabeled examples: $S = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$
- $m$ classes: $y_i \in \{1, \ldots, m\}$
- $m-1$ rare classes with priors $p_2, \ldots, p_m$
- One majority class with prior $p_1$, where $p_1 \gg p_c$, $2 \le c \le m$
- Goal: find at least ONE example from each rare class by requesting a few labels

Page 11: Rare Category Detection

Assumptions

- The distribution of the majority class is sufficiently smooth
- Examples from the minority classes form compact clusters in the feature space

[Figure: one-dimensional density plot; x-axis from -6 to 6, y-axis from 0 to 0.25]

Page 12: Rare Category Detection

Overview of the Algorithms

- Nearest-neighbor-based methods
  - Methodology: local density differential sampling
  - Intuition: select examples according to the change in local density

Page 13: Rare Category Detection

Two Classes: NNDB

1. Calculate the class-specific radius $r$
2. $\forall x_i \in S$: $NN(x_i, r) = \{x : \|x - x_i\| \le r\}$, $n_i = |NN(x_i, r)|$
3. $s_i = \max_{x_j \in NN(x_i, tr)} (n_i - n_j)$
4. Query $x = \arg\max_{x_i \in S} s_i$
5. Is $x$ from the rare class? If no, increase $t$ by 1 and return to step 3
6. If yes, output $x$
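The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nndb`, the `oracle` callable (returns the true label of example `i`), the Euclidean metric, the brute-force distance matrix, the rounding of $K$, and the exclusion of already-queried examples are all choices made for this sketch.

```python
import numpy as np

def nndb(X, prior, oracle, rare_label=1):
    """Two-class NNDB sketch. Returns (index of the first rare example, queried indices)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    K = max(1, int(round(n * prior)))            # assumed rounding of K = n * p
    # Brute-force pairwise Euclidean distances (fine for small n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: radius = smallest distance from any point to its K-th nearest neighbor.
    r = np.sort(D, axis=1)[:, K].min()           # column 0 is the point itself
    # Step 2: neighbor counts n_i within radius r (each point counts itself).
    counts = (D <= r).sum(axis=1)
    diff = counts[:, None] - counts[None, :]     # diff[i, j] = n_i - n_j
    selected = np.zeros(n, dtype=bool)
    queried = []
    for t in range(1, n + 1):
        # Step 3: s_i = max of (n_i - n_j) over x_j within distance t * r of x_i.
        scores = np.where(D <= t * r, diff, -np.inf).max(axis=1)
        scores[selected] = -np.inf               # assumption: never re-query a point
        # Step 4: query the highest-scoring example.
        i = int(np.argmax(scores))
        selected[i] = True
        queried.append(i)
        # Steps 5-6: stop once the oracle confirms a rare example.
        if oracle(i) == rare_label:
            return i, queried
    return None, queried
```

In use, `oracle` would wrap a human labeler; in experiments it can simply look up held-out labels, for example `oracle=lambda i: y[i]`.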

Page 14: Rare Category Detection

NNDB: Calculate Class-Specific Radius

- Number of examples from the minority class: $K = n p_2$
- $\forall x_i \in S$, calculate the distance $r_i^K$ between $x_i$ and its $K$-th nearest neighbor
- The class-specific radius: $r = \min_{i=1}^{n} r_i^K$
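A small worked example of the radius computation, as a sketch only: `class_specific_radius` is a hypothetical helper name, the metric is assumed Euclidean, and $K$ is rounded to the nearest integer.

```python
import numpy as np

def class_specific_radius(X, prior):
    """Return (r, K): r is the smallest K-th-nearest-neighbor distance over all points."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    K = max(1, int(round(n * prior)))            # expected size of the rare class
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kth = np.sort(D, axis=1)[:, K]               # column 0 is the point itself
    return kth.min(), K

# Ten 1-D points: a tight group {0, 1, 2} and a sparse tail spaced 10 apart.
X = np.array([0, 1, 2, 10, 20, 30, 40, 50, 60, 70], dtype=float)[:, None]
r, K = class_specific_radius(X, prior=0.2)       # K = 2, r = 1.0 (set by the point at 1)
```

The radius is driven by the densest region of the data, which under the compact-cluster assumption is where the rare class is expected to sit.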

Page 15: Rare Category Detection

NNDB: Calculate Nearest Neighbors

$NN(x_i, r) = \{x : \|x - x_i\| \le r\}$, $n_i = |NN(x_i, r)|$

[Figure: two scatter plots, axes spanning roughly 120 to 220; the radius $r$ is marked]

Page 16: Rare Category Detection

NNDB: Calculate the Scores

$s_i = \max_{x_j \in NN(x_i, tr)} (n_i - n_j)$

Query $x = \arg\max_{x_i \in S} s_i$

[Figure: scatter plot, axes spanning roughly 120 to 220; the radius $tr$ is marked]
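To see why this score singles out a dense pocket, here is a toy one-dimensional sketch. The helper name `nndb_scores` and the data are made up for illustration; the radius `r` and step `t` are passed in directly rather than computed from a prior.

```python
import numpy as np

def nndb_scores(X, r, t):
    """Return (n_i, s_i) with s_i = max over x_j within t*r of x_i of (n_i - n_j)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    counts = (D <= r).sum(axis=1)                # n_i, each point counts itself
    diff = counts[:, None] - counts[None, :]     # diff[i, j] = n_i - n_j
    scores = np.where(D <= t * r, diff, -np.inf).max(axis=1)
    return counts, scores

# Sparse background spaced 10 apart, with a tight pocket at 50, 51, 52.
X = np.array([0, 10, 20, 30, 40, 50, 51, 52, 60, 70, 80, 90], dtype=float)[:, None]
counts, scores = nndb_scores(X, r=2.0, t=5)
# counts -> [1, 1, 1, 1, 1, 3, 3, 3, 1, 1, 1, 1]
# scores -> [0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0]
```

Only the three pocket points see a sparser neighbor within $tr = 10$, so they alone get a positive score. With `t=1` the neighborhood does not reach the background and every score is 0, which is why the loop widens the neighborhood by increasing $t$.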

Page 17: Rare Category Detection

NNDB: Pick the Next Candidate

Increase $t$ by 1

$s_i = \max_{x_j \in NN(x_i, (t+1)r)} (n_i - n_j)$

Query $x = \arg\max_{x_i \in S} s_i$

[Figure: scatter plot, axes spanning roughly 120 to 220; the radius $(t+1)r$ is marked]

Page 18: Rare Category Detection

Why NNDB Works Theoretically

Theorem 1 [He & Carbonell 2007]: under certain conditions, with high probability, after a few iteration steps, NNDB queries at least one example whose probability of coming from the minority class is at least 1/3

Intuitively: the score $s_i$ measures the change in local density

[Figure: scatter plot, axes spanning roughly 120 to 220]

Page 19: Rare Category Detection

Multiple Classes: ALICE

- $m-1$ rare classes with priors $p_2, \ldots, p_m$
- One majority class with prior $p_1$, where $p_1 \gg p_c$, $2 \le c \le m$

1. For each rare class $c$, $2 \le c \le m$:
2. Have we found examples from class $c$? If yes, set $c = c + 1$ and return to step 1
3. If no, run NNDB with prior $p_c$
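The outer loop can be sketched as follows. This is an illustration under assumptions, not the authors' code: `alice`, the `priors` dictionary (rare-class label to prior), and the `oracle` callable are hypothetical names, and the inner loop is a brute-force NNDB with Euclidean distances that never re-queries a point.

```python
import numpy as np

def alice(X, priors, oracle):
    """ALICE sketch. priors maps each rare-class label to its prior.
    Returns ({label: index of first example seen}, list of queried indices)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sorted_D = np.sort(D, axis=1)
    found, queried = {}, []
    selected = np.zeros(n, dtype=bool)
    for c, p_c in priors.items():                # step 1: loop over rare classes
        if c in found:                           # step 2: class c already discovered
            continue
        # Step 3: run NNDB with prior p_c.
        K = max(1, int(round(n * p_c)))
        r = sorted_D[:, K].min()                 # class-specific radius for class c
        counts = (D <= r).sum(axis=1)
        diff = counts[:, None] - counts[None, :]
        t = 1
        while c not in found and not selected.all():
            scores = np.where(D <= t * r, diff, -np.inf).max(axis=1)
            scores[selected] = -np.inf
            i = int(np.argmax(scores))
            selected[i] = True
            queried.append(i)
            # A query may reveal a class other than c; record it either way.
            found.setdefault(int(oracle(i)), i)
            t += 1
    return found, queried
```

Recording every label the oracle returns is what lets step 2 skip a class that an earlier inner loop happened to uncover.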

Page 20: Rare Category Detection

Why ALICE Works Theoretically

Theorem 2 [He & Carbonell 2008]: under certain conditions, with high probability, in each outer loop of ALICE, after a few iteration steps in NNDB, ALICE queries at least one example whose probability of coming from one minority class is at least 1/3

Page 21: Rare Category Detection

Implementation Issues

- ALICE
  - Problem: repeatedly sampling from the same rare class
- MALICE
  - Solution: relevance feedback
  - Class-specific radius

Page 22: Rare Category Detection

Results on Synthetic Data Sets

[Figure: scatter plot of a synthetic data set; x-axis from -3 to 4, y-axis from -1 to 5]

Page 23: Rare Category Detection

Summary of Real Data Sets

- Abalone
  - 4177 examples
  - 7-dimensional features
  - 20 classes
  - Largest class: 16.50%
  - Smallest class: 0.34%
- Shuttle
  - 4515 examples
  - 9-dimensional features
  - 7 classes
  - Largest class: 75.53%
  - Smallest class: 0.13%

Page 24: Rare Category Detection

Results on Real Data Sets

[Figure: two panels (Abalone, Shuttle), each comparing MALICE, Interleave, and random sampling]

Page 25: Rare Category Detection

Imprecise priors

[Figure: two panels (Abalone, Shuttle). x-axis: Number of Selected Examples; y-axis: Classes Discovered. Legend: -5%, -10%, -20%, 0, +5%, +10%, +20%]

Page 26: Rare Category Detection

Outline

- Problem definition
- Related work
- Rare category detection for spatial data
  - Prior-dependent rare category detection
  - Prior-free rare category detection
- Conclusion

Page 27: Rare Category Detection

Overview of the Algorithm

- Density-based method
  - Methodology: specially designed exponential families
  - Intuition: select examples according to the change in local density
  - Difference from NNDB (ALICE): NO prior information needed

Page 28: Rare Category Detection

Specially Designed Exponential Families [Efron & Tibshirani 1996]

- Favorable compromise between parametric and nonparametric density estimation
- Estimated density:

$$g(x) = g_0(x)\exp\left(\beta_0 + \beta_1^T t(x)\right)$$

- $g_0(x)$: carrier density
- $\beta_0$: normalizing parameter
- $\beta_1$: $p \times 1$ parameter vector
- $t(x)$: $p \times 1$ vector of sufficient statistics

Page 29: Rare Category Detection

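SEDER Algorithm

- Carrier density: kernel density estimator
- Sufficient statistics: $t(x) = [(x^1)^2, \ldots, (x^d)^2]^T$
- To decouple the estimation of different parameters:
  - Decompose $\beta_0 = \sum_{j=1}^{d} \beta_0^j$
  - Relax the constraint such that

$$\int_{x^j} \frac{1}{\sqrt{2\pi}\,\sigma^j}\exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right)\exp\left(\beta_{0i}^j + \beta_1^j (x^j)^2\right)dx^j = 1$$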
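The relaxed constraint has a closed-form solution, obtained by completing the square. The derivation below is a sketch for a single dimension (the index $j$ is dropped), where $\sigma$ is the kernel bandwidth and the shorthand $b$ is defined by $1/b = 1 - 2\sigma^2\beta_1$ and assumed positive so that the integral converges.

```latex
% Assumption for this sketch: 1/b = 1 - 2\sigma^2\beta_1 with b > 0.
\begin{align*}
-\frac{(x - x_i)^2}{2\sigma^2} + \beta_1 x^2
  &= -\frac{(x - b\,x_i)^2}{2\sigma^2 b} + \frac{(b - 1)\,x_i^2}{2\sigma^2}, \\
\int \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(x - x_i)^2}{2\sigma^2} + \beta_1 x^2\right) dx
  &= \sqrt{b}\,\exp\!\left(\frac{(b - 1)\,x_i^2}{2\sigma^2}\right), \\
\exp(\beta_{0i})
  &= \frac{1}{\sqrt{b}}\,\exp\!\left(-\frac{(b - 1)\,x_i^2}{2\sigma^2}\right).
\end{align*}
```

Substituting $\exp(\beta_{0i})$ back, each kernel term becomes a normalized Gaussian in $x$ with mean $b\,x_i$ and variance $\sigma^2 b$.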

Page 30: Rare Category Detection

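Parameter Estimation

Theorem 3 [To appear]: the maximum likelihood estimates $\hat\beta_1^j$ and $\hat\beta_{0i}^j$ of $\beta_1^j$ and $\beta_{0i}^j$ satisfy the following conditions: $\forall j \in \{1, \ldots, d\}$,

$$\sum_{k=1}^{n} (x_k^j)^2 = \sum_{k=1}^{n} \frac{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right)\exp\left(\hat\beta_{0i}^j\right) E_i^j\left[(x^j)^2\right]}{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right)\exp\left(\hat\beta_{0i}^j\right)}$$

where

$$E_i^j\left[(x^j)^2\right] = \int_{x^j} (x^j)^2 \frac{1}{\sqrt{2\pi}\,\sigma^j}\exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right)\exp\left(\hat\beta_{0i}^j + \hat\beta_1^j (x^j)^2\right)dx^j$$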

Page 31: Rare Category Detection

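Parameter Estimation cont.

Let $\beta_1^j = \frac{1}{2(\sigma^j)^2}\left(1 - \frac{1}{b^j}\right)$, where $b^j$ is a positive parameter. Then $\forall j \in \{1, \ldots, d\}$:

$$\hat b^j = \frac{-B + \sqrt{B^2 + 4AC}}{2A}$$

where

$$A = \frac{1}{n}\sum_{k=1}^{n} \frac{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right)(x_i^j)^2}{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right)}, \qquad B = (\sigma^j)^2, \qquad C = \frac{1}{n}\sum_{k=1}^{n} (x_k^j)^2$$

In most cases: $\hat b^j$ [relation symbol lost in the transcript] $1$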
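The closed form for $\hat b^j$ is the positive root of the quadratic $A b^2 + B b - C = 0$, so it is cheap to compute per feature. The sketch below assumes the reading of $A$, $B$, and $C$ as a kernel-weighted average of squares, the squared bandwidth, and the mean square of the feature; `estimate_b` is a hypothetical helper name, and the bandwidth `sigma` is taken as given because the slides do not say how it is chosen.

```python
import numpy as np

def estimate_b(x, sigma):
    """Closed-form b-hat for one feature. x: 1-D array of values; sigma: kernel bandwidth."""
    x = np.asarray(x, dtype=float)
    # W[k, i] = exp(-(x_k - x_i)^2 / (2 sigma^2)): unnormalized Gaussian kernel weights.
    W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    A = np.mean((W @ x ** 2) / W.sum(axis=1))    # kernel-weighted average of x_i^2
    B = sigma ** 2
    C = np.mean(x ** 2)
    b = (-B + np.sqrt(B ** 2 + 4 * A * C)) / (2 * A)
    return b, (A, B, C)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
b_hat, (A, B, C) = estimate_b(x, sigma=0.3)
```

Taking the positive root keeps $\hat b^j > 0$ whenever $A > 0$ and $C > 0$, matching the requirement that $b^j$ be a positive parameter.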

Page 32: Rare Category Detection

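Scoring Function

- The estimated density:

$$\tilde g(x) = \frac{1}{n}\sum_{i=1}^{n}\prod_{j=1}^{d} \frac{1}{\sqrt{2\pi b^j}\,\sigma^j}\exp\left(-\frac{(x^j - b^j x_i^j)^2}{2(\sigma^j)^2 b^j}\right)$$

- Scoring function: norm of the gradient

$$s_k = \sqrt{\sum_{l=1}^{d}\left(\sum_{i=1}^{n} \frac{D_i(x_k)\,(x_k^l - b^l x_i^l)}{(\sigma^l)^2 b^l}\right)^2}$$

where

$$D_i(x) = \frac{1}{n}\prod_{j=1}^{d} \frac{1}{\sqrt{2\pi b^j}\,\sigma^j}\exp\left(-\frac{(x^j - b^j x_i^j)^2}{2(\sigma^j)^2 b^j}\right)$$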
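A vectorized sketch of the density and its gradient-norm score. The function names `seder_density` and `seder_scores` are made up for this example, and the per-feature bandwidths `sigma` and parameters `b` are assumed to be supplied as arrays of length $d$; nothing here reproduces the authors' code.

```python
import numpy as np

def seder_density(x, X, sigma, b):
    """Estimated density at one point x (shape (d,)) given the data X (shape (n, d))."""
    var = sigma ** 2 * b                         # per-feature variance (sigma^j)^2 b^j
    diff = x - b * X                             # diff[i, j] = x^j - b^j x_i^j
    log_terms = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=1)
    return np.mean(np.exp(log_terms))            # (1/n) * sum over i

def seder_scores(X, sigma, b):
    """s_k = norm of the gradient of the estimated density at each data point x_k."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    var = sigma ** 2 * b
    diff = X[:, None, :] - b * X[None, :, :]     # diff[k, i, l] = x_k^l - b^l x_i^l
    log_D = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2) - np.log(n)
    D = np.exp(log_D)                            # D[k, i] = D_i(x_k)
    grad = -np.einsum('ki,kil->kl', D, diff / var)
    return np.linalg.norm(grad, axis=1)
```

Examples would then be queried in decreasing order of score, since a large gradient norm marks a sharp change in local density.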

Page 33: Rare Category Detection

Results on Synthetic Data Sets

Page 34: Rare Category Detection

Summary of Real Data Sets

Data Set      n     d    m    Largest Class   Smallest Class   Skew
Ecoli         336   7    6    42.56%          2.68%            Moderately skewed
Glass         214   9    6    35.51%          4.21%            Moderately skewed
Page Blocks   5473  10   5    89.77%          0.51%            Extremely skewed
Abalone       4177  7    20   16.50%          0.34%            Extremely skewed
Shuttle       4515  9    7    75.53%          0.13%            Extremely skewed

Page 35: Rare Category Detection

Moderately Skewed Data Sets

[Figure: two panels (Ecoli, Glass); the MALICE curve is labeled in each]

Page 36: Rare Category Detection

Extremely Skewed Data Sets

[Figure: three panels (Page Blocks, Abalone, Shuttle); the MALICE curve is labeled in each]

Page 37: Rare Category Detection

Conclusion

- Rare category detection
  - Open challenge
  - Lack of effective methods
- Nearest-neighbor-based methods
  - Prior-dependent
  - Local density differential sampling
- Density-based method
  - Prior-free
  - Specially designed exponential families

Page 38: Rare Category Detection

Thank You!

