
Semi-supervised Learning on Partially Labeled Imbalanced Data

May 16, 2010

Jianjun Xie and Tao Xiong

What Problem We Are Facing

- Six data sets extracted from six different domains
  - The domain names were withheld during the contest
- All are binary classification problems
- All are imbalanced data sets
  - The percentage of positive labels varies from 7.2% to 25.2%
  - This information was withheld during the competition
  - They were significantly different from the development sets
- They all start with one known label

Datasets Summary

Final contest datasets

| Dataset | Domain | Feature number | Train number | Positive Label % |
|---|---|---|---|---|
| A | Handwriting Recognition | 92 | 17,535 | 7.23 |
| B | Marketing | 250 | 25,000 | 9.16 |
| C | Chemo-informatics | 851 | 25,720 | 8.15 |
| D | Text Processing | 12,000 | 10,000 | 25.19 |
| E | Embryology | 154 | 32,252 | 9.03 |
| F | Ecology | 12 | 67,628 | 7.68 |

Stochastic Semi-supervised Learning

Condition:
- The label distribution is highly imbalanced; positive labels are rare
- Known labels are few
- Unlabeled data are abundant

Approach to A, C, and D:
- Randomly pick one record from the unlabeled data pool as a “negative”
- Use the given positive seed and the picked “negative” seed as the initial cluster centers for k-means clustering
- Label as positive the cluster where the positive seed resides
- Repeat the above process n times
- Take the normalized cluster membership count of each data point as the first set of prediction scores

Our approach when the number of labels is < 200
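
The repeated-seeding procedure for A, C, and D can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`two_means`, `stochastic_kmeans_scores`), the parameters (`n_rounds`, `n_iter`, `seed`), and the toy data are all hypothetical, and a small hand-written 2-cluster k-means stands in for whatever clustering code was actually used.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def two_means(X, center_a, center_b, n_iter=10):
    """Lloyd's k-means with k=2, started from the two given centers.
    Returns a list of cluster ids (0 or 1), one per point."""
    centers = [list(center_a), list(center_b)]
    assign = [0] * len(X)
    for _ in range(n_iter):
        assign = [0 if sq_dist(x, centers[0]) <= sq_dist(x, centers[1]) else 1
                  for x in X]
        for k in (0, 1):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = centroid(members)
    return assign

def stochastic_kmeans_scores(X, pos_idx, n_rounds=20, seed=0):
    """Fraction of rounds in which each point shares a cluster with the
    positive seed, using a fresh random "negative" seed every round."""
    rng = random.Random(seed)
    pool = [i for i in range(len(X)) if i != pos_idx]
    counts = [0] * len(X)
    for _ in range(n_rounds):
        neg_idx = rng.choice(pool)           # random "negative" seed
        assign = two_means(X, X[pos_idx], X[neg_idx])
        pos_cluster = assign[pos_idx]        # cluster where the positive seed resides
        for i, a in enumerate(assign):
            if a == pos_cluster:
                counts[i] += 1
    return [c / n_rounds for c in counts]

# Toy data (made up): three positives near (5, 5), twenty negatives near the origin.
negatives = [[0.1 * i, 0.1 * j] for i in range(5) for j in range(4)]
positives = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
X = positives + negatives
scores = stochastic_kmeans_scores(X, pos_idx=0)
```

On this well-separated toy data the three positives score 1.0 and the negatives 0.0. On real data the random "negative" seed is sometimes a hidden positive, which is presumably why the slides average over many rounds rather than trusting one clustering.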

Stochastic Semi-supervised Learning -- continued

Approach to A, C, and D:
- When more labels are known after a query, use both the known labels and randomly picked “negative” seeds as the initial cluster centers
- Label the clusters using the known positive seeds
- Discard clusters whose membership is not clear
- Store the cluster membership of each data point
- Use the normalized positive cluster membership counts as the prediction score

Our approach when the number of labels is < 200

Stochastic Semi-supervised Learning -- continued

Approach to B, E, and F:
- Randomly pick 20 unlabeled data points as “negative” labels for each known positive label
- Build an over-fit logistic regression model using the above dataset
- Repeat the above random picking and model building process n times
- The final score is the average of the n models

Our approach when the number of labels is < 200
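
A rough sketch of the resample-and-average scheme for B, E, and F, in plain Python. It is illustrative only: the helper names (`fit_logreg`, `bagged_scores`), the learning rate, the epoch count, and the toy data are hypothetical, and "over-fit" is interpreted here simply as logistic regression trained without any regularization, which is an assumption rather than something the slides state.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss, with no regularization."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - t
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def bagged_scores(X, pos_idx, n_models=5, neg_per_pos=20, seed=0):
    """Average the predictions of n_models models, each trained on the known
    positives plus a fresh random draw of unlabeled points treated as negatives."""
    rng = random.Random(seed)
    known = set(pos_idx)
    pool = [i for i in range(len(X)) if i not in known]
    scores = [0.0] * len(X)
    for _ in range(n_models):
        k = min(len(pool), neg_per_pos * len(pos_idx))
        neg_idx = rng.sample(pool, k)        # 20 "negatives" per known positive
        train_X = [X[i] for i in pos_idx] + [X[i] for i in neg_idx]
        train_y = [1.0] * len(pos_idx) + [0.0] * k
        model = fit_logreg(train_X, train_y)
        for i, x in enumerate(X):
            scores[i] += predict(model, x) / n_models
    return scores

# Toy data (made up): one known positive at (3, 3), 25 unlabeled points near the origin.
grid = [-0.1, -0.05, 0.0, 0.05, 0.1]
X = [[3.0, 3.0]] + [[a, b] for a in grid for b in grid]
scores = bagged_scores(X, pos_idx=[0])
```

Each individual model sees a different random set of "negatives", so averaging smooths out the cases where a draw happens to include unlabeled positives.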

Supervised Learning Using Gradient Boosting Decision Tree (TreeNet)

Querying Strategy

- One critical part of active learning is the query strategy
- Popular approaches:
  - Uncertainty sampling
  - Expected model change
  - Query by committee
- What we tried:
  - Uncertainty sampling + density-based selective sampling
  - Random sampling (for large label purchases)
  - Certainty sampling (to try to get more positive labels)
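
Three of the strategies named above can be illustrated with a few lines of Python. This is a simplified sketch under stated assumptions: prediction scores are taken to lie in [0, 1] with 0.5 as the most uncertain value, the function names and the example scores are hypothetical, and the density-based selective component is left out because the slides do not describe it.

```python
import random

def unlabeled(scores, labeled):
    """Indices that have not been queried yet."""
    return [i for i in range(len(scores)) if i not in labeled]

def uncertainty_query(scores, labeled, k):
    """Pick the k unlabeled points whose scores are closest to 0.5."""
    cand = unlabeled(scores, labeled)
    return sorted(cand, key=lambda i: abs(scores[i] - 0.5))[:k]

def certainty_query(scores, labeled, k):
    """Pick the k unlabeled points with the highest scores, i.e. the ones
    most likely to be positive."""
    cand = unlabeled(scores, labeled)
    return sorted(cand, key=lambda i: -scores[i])[:k]

def random_query(scores, labeled, k, seed=0):
    """Pick k unlabeled points uniformly at random."""
    cand = unlabeled(scores, labeled)
    return random.Random(seed).sample(cand, min(k, len(cand)))

# Made-up scores; index 0 is the already-labeled positive seed.
example_scores = [0.95, 0.10, 0.52, 0.45, 0.80, 0.30]
already_labeled = {0}
uncertain = uncertainty_query(example_scores, already_labeled, 2)
confident = certainty_query(example_scores, already_labeled, 2)
randomly = random_query(example_scores, already_labeled, 2)
```

With these example scores, uncertainty sampling selects indices 2 and 3 (scores 0.52 and 0.45), while certainty sampling selects indices 4 and 2 (scores 0.80 and 0.52).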

Dataset A: Handwriting Recognition

Global score = 0.623, rank 2nd.

| Sequence | Num. Samples | Num. Queried Samples | AUC | Sampling Strategy |
|---|---|---|---|---|
| 1 | 232 | 1 | 0.67 | Uncertainty/Selective |
| 2 | 1959 | 233 | 0.82 | Uncertainty/Selective |
| 3 | 4286 | 2192 | 0.92 | Random |
| 4 | 11057 | 6478 | 0.94 | Get All |
| 5 | 0 | 17535 | 0.93 | |

Dataset B: Marketing

Global score = 0.375, rank 2nd.

Dataset C: Chemo-informatics

Global score = 0.334, rank 4th. Passive learning.

Dataset D: Text Processing

Global score = 0.331, rank 18th.

Dataset E: Embryology

Global score = 0.533, rank 3rd.

| Sequence | Num. Samples | Num. Queried Samples | AUC | Sampling Strategy |
|---|---|---|---|---|
| 1 | 2 | 1 | 0.75 | Certainty |
| 2 | 3 | 3 | 0.66 | Uncertainty/Selective |
| 3 | 3 | 6 | 0.67 | Uncertainty/Selective |
| 4 | 32243 | 9 | 0.72 | Get All |
| 5 | 0 | 32252 | 0.86 | |

Dataset E: Embryology

Performance gets worse with more labels

The newly queried labels over-corrected the existing model

This phenomenon was common in this contest

Global score = 0.533, rank 3rd.

Dataset F: Ecology

Global score = 0.77, rank 4th.

| Sequence | Num. Samples | Num. Queried Samples | AUC | Sampling Strategy |
|---|---|---|---|---|
| 1 | 2 | 1 | 0.76 | Uncertainty/Selective |
| 2 | 7 | 3 | 0.73 | Uncertainty/Selective |
| 3 | 542 | 10 | 0.77 | Uncertainty/Selective |
| 4 | 5175 | 552 | 0.95 | Random |
| 5 | 61901 | 5727 | 0.98 | Get All |
| 6 | 0 | 67628 | 0.99 | |

Dataset F: Ecology

Performance gets worse after the first 2 additional labels

Most of the time, too many small queries do more harm than good to the global score

Summary of Results

Overall rank: 3rd.

| Dataset | Positive label % | AUC | ALC | Num. Queries | Rank | Winner AUC | Winner ALC |
|---|---|---|---|---|---|---|---|
| A | 7.23 | 0.925 | 0.623 | 4 | 2 | 0.862 | 0.629 |
| B | 9.16 | 0.767 | 0.375 | 2 | 2 | 0.733 | 0.376 |
| C | 8.15 | 0.814 | 0.334 | 1 | 4 | 0.799 | 0.427 |
| D | 25.19 | 0.890 | 0.331 | 3 | 18 | 0.964 | 0.745 |
| E | 9.03 | 0.865 | 0.533 | 4 | 3 | 0.894 | 0.627 |
| F | 7.68 | 0.988 | 0.771 | 5 | 4 | 0.999 | 0.802 |

Discussions

How can we consistently get better performance with only a few labels across different datasets?

How can we consistently improve model performance as the number of labels in a given dataset increases?

Does the log2 scaling give too much weight to the first few queries? What if every dataset started with a few more labels?
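
To make the log2 question concrete, the sketch below recomputes a global score from the Dataset F learning curve. It rests on several assumptions that the slides do not spell out: that the global score is the area under the AUC curve plotted against log2 of the number of labels (ALC), that the area is integrated with the trapezoid rule, that it is rescaled so a constant AUC of 0.5 gives 0 and a constant AUC of 1.0 gives 1, and that the "Num. Queried Samples" column is the cumulative number of labels known when each AUC was measured. The function names are hypothetical.

```python
import math

def normalized_alc(n_labels, aucs):
    """Area under the AUC-vs-log2(labels) curve, rescaled so that a constant
    AUC of 0.5 maps to 0 and a constant AUC of 1.0 maps to 1."""
    xs = [math.log2(n) for n in n_labels]
    area = 0.0
    for i in range(len(xs) - 1):
        width = xs[i + 1] - xs[i]
        area += width * (aucs[i] + aucs[i + 1]) / 2.0   # trapezoid rule
    span = xs[-1] - xs[0]
    a_rand, a_max = 0.5 * span, 1.0 * span
    return (area - a_rand) / (a_max - a_rand)

def log2_share(n_labels, upto):
    """Fraction of the log2 x-axis between the first point and `upto` labels."""
    lo, hi = math.log2(n_labels[0]), math.log2(n_labels[-1])
    return (math.log2(upto) - lo) / (hi - lo)

# Dataset F: cumulative labels and AUC at each step, read from the table.
labels_f = [1, 3, 10, 552, 5727, 67628]
aucs_f = [0.76, 0.73, 0.77, 0.95, 0.98, 0.99]

score_f = normalized_alc(labels_f, aucs_f)
share_first_10 = log2_share(labels_f, 10)
print(round(score_f, 3), round(share_first_10, 3))
```

Under these assumptions the Dataset F curve gives roughly 0.77, in line with the reported global score. The stretch from 1 to 10 labels covers about a fifth of the log2 axis even though it accounts for only 9 of the 67,628 labels, which is the weighting effect the question above is pointing at.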

