NAACL HLT 2010 d-Confidence

Date post: 26-May-2015
Upload: nunoescudeiro
Page 1: NAACL HLT 2010 d-Confidence

D-Confidence: an active learning strategy which efficiently identifies small classes

Learning from Incomplete Specifications

Nuno Filipe Escudeiro, Alípio Mário Jorge

Page 2: NAACL HLT 2010 d-Confidence

NAACL HLT, June 6, 2010

Outline

1. Motivations

2. D-Confidence

3. Evaluation

4. Conclusions

5. Future Work

Page 3: NAACL HLT 2010 d-Confidence

• Fraud detection

• Medical data, disease detection

• Web page classification

• Mail categorization

• …

Motivations | D-Confidence | Evaluation | Conclusions | Future Work

Automatic resource organization

• Large corpora

• Unlabeled text documents

• Labeling is expensive

Need to identify exemplary cases for all labels to learn… fast (with few labels)

Page 4: NAACL HLT 2010 d-Confidence

Collecting and annotating exemplary cases

– Critical

– Costly

Labeling effort related to:

– Number of labels to learn

– Class distribution in the working set

– Sample representativeness

Page 5: NAACL HLT 2010 d-Confidence

Learning settings

– Supervised: high labeling effort

– Unsupervised: low expressiveness

– Semi-supervised: unable to deal with incomplete specifications

– Active learning: judicious selection of cases to label

• Minimize error

• Availability of pre-labeled examples of all classes

Page 6: NAACL HLT 2010 d-Confidence

Active Learning

Accuracy at low cost

from a complete specification

D-Confidence

Accuracy and Representativeness at low cost

from incomplete specification

Page 7: NAACL HLT 2010 d-Confidence

D-Confidence

– Active learning strategy selecting queries with:

• Low confidence

– exploitation / accuracy

• High distance to known classes

– exploration / representativeness

Page 8: NAACL HLT 2010 d-Confidence

Intuition

Page 9: NAACL HLT 2010 d-Confidence

Combines low-confidence with high-distance to produce a bias towards cases from unknown classes located in unexplored regions in case space

$$\mathrm{dConf}(u) = \max_k \frac{\mathrm{conf}(c_k \mid u)}{\mathrm{median}\left(\mathrm{dist}(u, x_{lab,k})\right)}$$
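The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `d_confidence` and `select_query`, the use of Euclidean distance, and the layout of the `conf` matrix (one column per known class, in sorted label order) are all choices made here for the example. The query is taken to be the unlabeled case with the lowest score, since low confidence and high distance both push the ratio down.

```python
import numpy as np

def d_confidence(conf, X_unlabeled, X_labeled, y_labeled):
    """Score each unlabeled case as max_k conf(c_k | u) / median(dist(u, x_lab_k)).

    conf: array (n_unlabeled, n_known_classes); column j is the classifier's
          confidence for the j-th class in np.unique(y_labeled) (assumed layout).
    """
    classes = np.unique(y_labeled)
    ratios = np.empty((X_unlabeled.shape[0], classes.size))
    for j, c in enumerate(classes):
        members = X_labeled[y_labeled == c]
        # Euclidean distance (an assumption) from every unlabeled case
        # to every labeled case of class c.
        diffs = X_unlabeled[:, None, :] - members[None, :, :]
        dist = np.linalg.norm(diffs, axis=2)
        median_dist = np.median(dist, axis=1)
        # Guard against division by zero when a case coincides with labeled ones.
        ratios[:, j] = conf[:, j] / np.maximum(median_dist, 1e-12)
    return ratios.max(axis=1)

def select_query(conf, X_unlabeled, X_labeled, y_labeled):
    """Return the index of the unlabeled case with the lowest d-Confidence."""
    return int(np.argmin(d_confidence(conf, X_unlabeled, X_labeled, y_labeled)))
```

With this scoring, a case that lies far from every labeled class gets a low score even when the classifier is confident about it, which is the bias towards unexplored regions described above.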

Page 10: NAACL HLT 2010 d-Confidence

Effect on (SVM) confidence

[Plot: confidence (vertical axis, 0 to 1) against signed distance to the dividing hyperplane (horizontal axis, -5 to 6)]

Page 11: NAACL HLT 2010 d-Confidence

D-Confidence

– Repository (UCI) datasets

– Text corpora

Page 12: NAACL HLT 2010 d-Confidence

Class distribution

Dataset       #    1    2    3    4    5    6    7    8    9   10   11
Iris        150   50   50   50
Cleveland   298  161   53   36   35   13
Vowels      330   30   30   30   30   30   30   30   30   30   30   30
SatImg      500  125   48   96   46   67  118
Poker       500  270  170   34   12    4    3    3    2    1    1

1st hit, by class

Dataset    ActiveLearn     1    2    3    4    5    6    7    8    9   10   11
iris       Conf            1    7    3
iris       dConf           1    3    1
cleveland  Conf            3    7    8   19   40
cleveland  dConf           3   15    8    5    8
vowels     Conf            3   10   14   31   12   27   29   15   31   18   24
vowels     dConf           2   12   19   16   24   26   23    2   26    3   23
satimg     Conf           12   28   34   23   32    5
satimg     dConf           9    1    4   10    3   10
poker      Conf            1    3   20   43  113  112  147  223  279  277
poker      dConf           3    2    5    9   45   97   98   68  100   65
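The slide does not define "1st hit". A plausible reading, assumed here, is the number of queries issued before the first case of each class gets labeled. Under that assumption, the measure could be computed from the sequence of labels returned by the oracle as sketched below; the function name `first_hits` is hypothetical.

```python
def first_hits(query_labels):
    """Map each class to the 1-based position of the first query that revealed it.

    query_labels: labels returned by the oracle, in the order the queries were made.
    """
    hits = {}
    for position, label in enumerate(query_labels, start=1):
        # setdefault keeps the earliest position seen for each class.
        hits.setdefault(label, position)
    return hits

# Example: class "c" is first seen on the fifth query.
print(first_hits(["a", "a", "b", "a", "c"]))
```

Lower values mean a class was discovered with less labeling effort, which is how the Conf and dConf rows in the table are compared.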

Page 13: NAACL HLT 2010 d-Confidence

D-Confidence

– Repository (UCI) datasets

– Text corpora

Page 14: NAACL HLT 2010 d-Confidence

Text corpora

20 Newsgroups

• 500 cases, 20 classes

• most frequent class 35

• least frequent class 20

Reuters-21578

• 1000 cases, 52 classes

• most frequent class 435

• least frequent class 2

• 42 out of 52 classes with frequency below 10

Page 15: NAACL HLT 2010 d-Confidence

[Two plots; legend: Confidence, FarthestFirst, dConfidence]

Page 16: NAACL HLT 2010 d-Confidence

– D-Confidence identifies classes faster (lower cost)

– This gain is bigger for minority classes

– D-Confidence performs better on imbalanced data

– Error may increase

• Exploration / exploitation

• Representativeness / accuracy

Page 17: NAACL HLT 2010 d-Confidence

– Semi-supervised D-Confidence

– Retrieve cases when representativeness assumption fails

– Scalability

Page 18: NAACL HLT 2010 d-Confidence

Thank you!

Nuno Filipe Escudeiro, Alípio Mário Jorge

Page 19: NAACL HLT 2010 d-Confidence

D-Confidence

– Simulated datasets

– Repository (UCI) datasets

– Text corpora

Page 20: NAACL HLT 2010 d-Confidence

Levels (refer to training set properties)

Factor        1 (+)                           0 (-)
Colinearity   colinear centroids              non-colinear centroids
Balancing     imbalanced class distribution   balanced class distribution
Cohesion      isomorphic classes              polymorphic classes
Overlapping   overlapping                     separable

Response

ErrorGain = gen.error(dConfidence) – gen.error(Confidence)

Simulated datasets

Page 21: NAACL HLT 2010 d-Confidence

Colinear Imbalanced Isomorphic Overlapping

1 (+) 1 (+) 1 (+) 1 (+)

1 (+) 1 (+) 1 (+) 0 (-)

1 (+) 1 (+) 0 (-) 1 (+)

1 (+) 1 (+) 0 (-) 0 (-)

1 (+) 0 (-) 1 (+) 1 (+)

1 (+) 0 (-) 1 (+) 0 (-)

1 (+) 0 (-) 0 (-) 1 (+)

1 (+) 0 (-) 0 (-) 0 (-)

0 (-) 1 (+) 1 (+) 1 (+)

0 (-) 1 (+) 1 (+) 0 (-)

0 (-) 1 (+) 0 (-) 1 (+)

0 (-) 1 (+) 0 (-) 0 (-)

0 (-) 0 (-) 1 (+) 1 (+)

0 (-) 0 (-) 1 (+) 0 (-)

0 (-) 0 (-) 0 (-) 1 (+)

0 (-) 0 (-) 0 (-) 0 (-)
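The 16 rows above cover every combination of the four two-level factors, i.e. a full 2^4 factorial design. As a minimal sketch, the same enumeration can be generated with the standard library; the names `FACTORS` and `full_factorial` are illustrative and not taken from the slides.

```python
from itertools import product

# Column names as they appear on the slide.
FACTORS = ("Colinear", "Imbalanced", "Isomorphic", "Overlapping")

def full_factorial(levels=(1, 0)):
    """Enumerate all level combinations, with the last factor varying fastest."""
    return [dict(zip(FACTORS, combo))
            for combo in product(levels, repeat=len(FACTORS))]

design = full_factorial()
for run in design:
    print(" ".join(f"{run[name]} ({'+' if run[name] else '-'})" for name in FACTORS))
```

Each printed row corresponds to one simulated dataset configuration, in the same order as the listing on the slide.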

Page 22: NAACL HLT 2010 d-Confidence

Error

Colinearity   Imbalanced   Isomorphic   Overlapping
4.241         -3.835       -15.459      1.296

Page 23: NAACL HLT 2010 d-Confidence

Finding cases from all classes

Page 24: NAACL HLT 2010 d-Confidence

Meta-Learning

Colinearity

– correlation coefficient, r, among cluster centroids

– colinear when |r| ~ 1

Balancing

– variance of nk

– balanced when var(nk) ~ 0

Cohesion

– #classes divided by #clusters

– cohesive when ~ 1

– representativeness fails (or highly overlapping clusters) when > 1

Overlapping

– inter-cluster inertia divided by intra-cluster inertia

– separable when >> 1
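Three of these meta-features (balancing, cohesion and overlapping) can be sketched directly from a clustering of the data. The definitions below are assumptions made for illustration: nk is taken as the cluster size, intra-cluster inertia as the sum of squared distances to each cluster's centroid, and inter-cluster inertia as the size-weighted squared distance of the centroids to the overall mean. The function name `meta_features` is hypothetical. Colinearity is left out because the slide does not say how r is computed over the centroids.

```python
import numpy as np

def meta_features(X, clusters, n_classes):
    """Compute balancing, cohesion and overlapping indicators for a clustering.

    X: array (n_cases, n_features); clusters: cluster id per case;
    n_classes: number of classes in the problem.
    """
    ids, counts = np.unique(clusters, return_counts=True)
    centroids = np.array([X[clusters == k].mean(axis=0) for k in ids])
    grand_mean = X.mean(axis=0)
    # Intra-cluster inertia: squared distances of cases to their own centroid.
    intra = sum(((X[clusters == k] - c) ** 2).sum() for k, c in zip(ids, centroids))
    # Inter-cluster inertia: size-weighted squared distances of centroids to the mean.
    inter = (counts * ((centroids - grand_mean) ** 2).sum(axis=1)).sum()
    return {
        "balancing": float(np.var(counts)),       # ~ 0 when balanced
        "cohesion": n_classes / ids.size,         # ~ 1 when cohesive
        "overlapping": float(inter / intra) if intra > 0 else float("inf"),  # >> 1 when separable
    }

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(meta_features(X, np.array([0, 0, 1, 1]), n_classes=2))
```

In the toy example, two tight clusters sit far apart, so the inertia ratio is well above 1 and the clusters would count as separable under the thresholds listed on the slide.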

Page 25: NAACL HLT 2010 d-Confidence
