NAACL HLT 2010 d-Confidence

D-Confidence: an active learning strategy which efficiently identifies small classes

Learning from Incomplete Specifications

Nuno Filipe Escudeiro [email protected] Alípio Mário Jorge [email protected]

NAACL HLT, June 6, 2010

Outline

1. Motivations

2. D-Confidence

3. Evaluation

4. Conclusions

5. Future Work

• Fraud detection

• Medical data, disease detection

• Web page classification

• Mail categorization

• …


Automatic resource organization

• Large corpora

• Unlabeled text documents

• Labeling is expensive

Need to identify exemplary cases for all labels to learn… fast (with few labels)



Collecting and annotating exemplary cases

– Critical

– Costly

Labeling effort related to:

– Number of labels to learn

– Class distribution in the working set

– Sample representativeness



Learning settings

– Supervised: high labeling effort

– Unsupervised: low expressiveness

– Semi-supervised: unable to deal with incomplete specifications

– Active learning: judicious selection of cases to label

• Minimize error

• Availability of pre-labeled examples on all classes




Active Learning

Accuracy at low cost

from a complete specification

D-Confidence

Accuracy and Representativeness at low cost

from incomplete specification



D-Confidence

– Active learning strategy selecting queries with:

• Low confidence

– exploitation / accuracy

• High distance to known classes

– exploration / representativeness



Intuition



Combines low-confidence with high-distance to produce a bias towards cases from unknown classes located in unexplored regions in case space

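\[
\mathrm{dConf}(u) = \max_k \left( \frac{\mathrm{conf}(c_k \mid u)}{\operatorname{median}\big(\mathrm{dist}(u, x_{lab_k})\big)} \right)
\]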
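A minimal sketch of how the score could be computed and used to pick the next query, assuming dConf(u) is the maximum over known classes k of conf(c_k | u) divided by the median distance from u to the labeled cases of class k, and that the query is the case with the lowest score (low confidence, high distance). The function names, the Euclidean distance, the toy data and the `toy_conf` stand-in for a classifier are all assumptions made for illustration; they are not taken from the slides.

```python
from math import dist, exp  # math.dist needs Python 3.8+
from statistics import median


def d_confidence(u, conf, labeled_by_class):
    """Max over known classes k of conf(c_k | u) / median(dist(u, xlab_k))."""
    scores = []
    for k, cases in labeled_by_class.items():
        med = median(dist(u, x) for x in cases)
        # A zero median means u coincides with labeled cases of class k.
        scores.append(conf[k] / med if med > 0 else float("inf"))
    return max(scores)


def next_query(unlabeled, conf_fn, labeled_by_class):
    """Index of the unlabeled case with the lowest dConf."""
    return min(
        range(len(unlabeled)),
        key=lambda i: d_confidence(
            unlabeled[i], conf_fn(unlabeled[i]), labeled_by_class
        ),
    )


# Toy data (hypothetical): two known classes in the plane.
labeled = {"a": [(0.0, 0.0), (0.0, 1.0)], "b": [(4.0, 0.0), (4.0, 1.0)]}
pool = [(2.0, 0.5), (0.0, 0.5), (2.0, 10.0)]


def toy_conf(u):
    """Stand-in for a classifier posterior: logistic in the x coordinate."""
    p_b = 1.0 / (1.0 + exp(-(u[0] - 2.0)))
    return {"a": 1.0 - p_b, "b": p_b}


print(next_query(pool, toy_conf, labeled))
```

In this toy pool the first and third cases are equally uncertain (confidence 0.5 for both classes), but the third lies far from every labeled case, so its score is lower and it is the one selected.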



Effect on (SVM) confidence

[Figure: confidence (vertical axis, 0 to 1) plotted against signed distance to dividing hyperplane (horizontal axis, -5 to 6)]



D-Confidence

– Repository (UCI) datasets

– Text corpora



Class distribution

Dataset      #    1    2    3    4    5    6    7    8    9   10   11
Iris       150   50   50   50
Cleveland  298  161   53   36   35   13
Vowels     330   30   30   30   30   30   30   30   30   30   30   30
SatImg     500  125   48   96   46   67  118
Poker      500  270  170   34   12    4    3    3    2    1    1

1st hit, by class

Dataset    Active Learn    1    2    3    4    5    6    7    8    9   10   11
iris       Conf            1    7    3
iris       dConf           1    3    1
cleveland  Conf            3    7    8   19   40
cleveland  dConf           3   15    8    5    8
vowels     Conf            3   10   14   31   12   27   29   15   31   18   24
vowels     dConf           2   12   19   16   24   26   23    2   26    3   23
satimg     Conf           12   28   34   23   32    5
satimg     dConf           9    1    4   10    3   10
poker      Conf            1    3   20   43  113  112  147  223  279  277
poker      dConf           3    2    5    9   45   97   98   68  100   65
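The slides do not define "1st hit". A plausible reading, assumed here, is the number of queries issued before the first case of a given class is labeled. Under that assumption the measure can be sketched as follows; the function name and the label sequence are hypothetical.

```python
def first_hits(query_labels):
    """Map each class to the 1-based position of its first occurrence
    in the sequence of labels returned by successive queries."""
    hits = {}
    for position, label in enumerate(query_labels, start=1):
        hits.setdefault(label, position)  # keep only the first occurrence
    return hits


# Hypothetical labels revealed by six successive queries.
labels_in_query_order = ["a", "a", "b", "a", "c", "b"]
hits = first_hits(labels_in_query_order)
print(hits)
```

A lower value means the strategy found the class with less labeling effort.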



D-Confidence


– Text corpora




Text corpora

20 Newsgroups

• 500 cases, 20 classes

• most frequent class 35

• least frequent class 20

Reuters-21578

• 1000 cases, 52 classes

• most frequent class 435

• least frequent class 2

• 42 out of 52 classes with frequency below 10



[Figures: results on the text corpora; legend: Confidence, Farthest First, dConfidence]
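Farthest First is named as a baseline but not described in the slides. The sketch below assumes the usual farthest-first rule, where the next query is the unlabeled case whose nearest labeled case is farthest away; it ignores classifier confidence entirely. The function name, the Euclidean distance and the toy points are assumptions.

```python
from math import dist  # Python 3.8+


def farthest_first(unlabeled, labeled):
    """Index of the unlabeled case that maximizes the distance
    to its nearest labeled case."""
    return max(
        range(len(unlabeled)),
        key=lambda i: min(dist(unlabeled[i], x) for x in labeled),
    )


# Hypothetical points in the plane.
labeled_cases = [(0.0, 0.0), (4.0, 0.0)]
pool = [(2.0, 0.0), (0.0, 1.0), (2.0, 5.0)]
print(farthest_first(pool, labeled_cases))
```

This is pure exploration, whereas a confidence-based strategy is pure exploitation; d-Confidence sits between the two.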


– D-Confidence identifies classes faster (lower cost)

– The gain is larger for minority classes

– D-Confidence performs better on imbalanced data

– Error may increase

• Exploration / exploitation

• Representativeness / accuracy




– Semi-supervised D-Confidence

– Retrieve cases when representativeness assumption fails

– Scalability

Thank you!



D-Confidence

– Simulated datasets


– Text corpora




Levels (refer to training set properties)

Factor        1 (+)                           0 (-)
Colinearity   colinear centroids              non-colinear centroids
Balancing     imbalanced class distribution   balanced class distribution
Cohesion      isomorphic classes              polymorphic classes
Overlapping   overlapping                     separable

Response

ErrorGain = gen.error(dConfidence) – gen.error(Confidence)

Simulated datasets



Colinear Imbalanced Isomorphic Overlapping

1 (+) 1 (+) 1 (+) 1 (+)

1 (+) 1 (+) 1 (+) 0 (-)

1 (+) 1 (+) 0 (-) 1 (+)

1 (+) 1 (+) 0 (-) 0 (-)

1 (+) 0 (-) 1 (+) 1 (+)

1 (+) 0 (-) 1 (+) 0 (-)

1 (+) 0 (-) 0 (-) 1 (+)

1 (+) 0 (-) 0 (-) 0 (-)

0 (-) 1 (+) 1 (+) 1 (+)

0 (-) 1 (+) 1 (+) 0 (-)

0 (-) 1 (+) 0 (-) 1 (+)

0 (-) 1 (+) 0 (-) 0 (-)

0 (-) 0 (-) 1 (+) 1 (+)

0 (-) 0 (-) 1 (+) 0 (-)

0 (-) 0 (-) 0 (-) 1 (+)

0 (-) 0 (-) 0 (-) 0 (-)
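The sixteen rows above form a full two-level factorial design over the four factors. As an illustration, the same rows can be enumerated in the same order; the variable names and the row formatting are assumptions.

```python
from itertools import product

factors = ["Colinear", "Imbalanced", "Isomorphic", "Overlapping"]

# Every combination of the two levels, 1 (+) and 0 (-), for each factor.
design = list(product((1, 0), repeat=len(factors)))


def format_level(level):
    """Render a level the way the slide does: '1 (+)' or '0 (-)'."""
    return "1 (+)" if level == 1 else "0 (-)"


for run in design:
    print(" ".join(format_level(level) for level in run))
```

Each row corresponds to one simulated training-set configuration on which ErrorGain is measured.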



         Colinearity   Imbalanced   Isomorphic   Overlapping
Error    4.241         -3.835       -15.459      1.296



Finding cases from all classes


Meta-Learning

Colinearity

– correlation coefficient, r, among cluster centroids

– colinear when |r| ~ 1

Balancing

– variance of nk

– balanced when var(nk) ~ 0

Cohesion

– #classes divided by #clusters

– cohesive when ~ 1

– representativeness fails (or highly overlapping clusters) when > 1

Overlapping

– inter-cluster inertia divided by intra-cluster inertia

– separable when >> 1
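Three of these measures can be sketched directly from their one-line definitions. The code below makes several assumptions the slides do not state: nk is taken as the size of cluster k, the variance is the population variance, inter-cluster inertia is the size-weighted squared distance of each centroid to the global centroid, and intra-cluster inertia is the summed squared distance of each case to its own centroid. Colinearity is left out because the slide does not say how r is computed over more than two centroids. All names and data are hypothetical.

```python
from math import dist  # Python 3.8+
from statistics import pvariance


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def balancing(clusters):
    """Variance of the cluster sizes n_k; close to 0 means balanced."""
    return pvariance([len(c) for c in clusters])


def cohesion(n_classes, n_clusters):
    """#classes divided by #clusters; close to 1 means cohesive."""
    return n_classes / n_clusters


def overlapping(clusters):
    """Inter-cluster inertia over intra-cluster inertia; >> 1 means separable."""
    g = centroid([p for c in clusters for p in c])
    inter = sum(len(c) * dist(centroid(c), g) ** 2 for c in clusters)
    intra = sum(dist(p, centroid(c)) ** 2 for c in clusters for p in c)
    return inter / intra


# Hypothetical clustering: two tight, well separated clusters of equal size.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(balancing(clusters), cohesion(2, len(clusters)), overlapping(clusters))
```

For this toy clustering the sizes are equal, classes and clusters match one to one, and the inertia ratio is well above 1, so the data would be read as balanced, cohesive and separable.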



