ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias
Wei Fan, IBM T.J. Watson
Ian Davidson, SUNY Albany
Where Does Sample Selection Bias Come From?
Universe of examples: joint probability distribution P(x,y) = P(y|x) P(x)
Data mining (DM) models this universe.
A sampling process produces the training data.
Question: Is the training data a good sample of the universe?
[Diagram: training data -> algorithm -> model, which maps x to y]
Universe of Examples
Two classes: red and green
red: f2 > f1; green: f2 <= f1
Unbiased & Biased Samples
Rather unbiased sample: evenly distributed
Biased sample: less likely to sample points close to the decision boundary
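The two kinds of sample can be sketched in a few lines of Python. This is only an illustration: the unit-square universe, the rejection rule, and the names `sample_universe`, `sample_biased`, `scale` and `near_boundary_fraction` are assumptions made here, not details taken from the slides.

```python
import random

def label(f1, f2):
    # Class rule from the slide: red if f2 > f1, otherwise green.
    return "red" if f2 > f1 else "green"

def sample_universe(n, rng):
    # Rather unbiased sample: points drawn uniformly from the unit square
    # (the unit square itself is an assumption for this sketch).
    return [(rng.random(), rng.random()) for _ in range(n)]

def sample_biased(n, rng, scale=0.3):
    # Biased sample via rejection: a point is kept with probability
    # min(1, |f2 - f1| / scale), so points close to the boundary f2 = f1
    # are rarely kept. The rule and `scale` are illustrative assumptions.
    kept = []
    while len(kept) < n:
        f1, f2 = rng.random(), rng.random()
        if rng.random() < min(1.0, abs(f2 - f1) / scale):
            kept.append((f1, f2))
    return kept

def near_boundary_fraction(points, width=0.05):
    # Share of points lying within `width` of the decision boundary.
    return sum(abs(f2 - f1) < width for f1, f2 in points) / len(points)

rng = random.Random(0)
unbiased = sample_universe(2000, rng)
biased = sample_biased(2000, rng)
print(near_boundary_fraction(unbiased), near_boundary_fraction(biased))
```

The printed fractions show how much thinner the biased sample is near the boundary, which is the region where a learner most needs examples.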
Single Decision Tree
Trained from unbiased sample: error = 2.9%
Trained from biased sample: error = 7.9%
Random Decision Tree
Trained from unbiased sample: error = 3.1%
Trained from biased sample: error = 4.1%
What can we observe?
Sample selection bias does affect modeling.
Some techniques are more sensitive to bias than others.
Models’ accuracy does get affected.
One important question: how do we choose the best classification algorithm, given a potentially biased dataset?
Ubiquitous Problem
Fundamental assumption: the training data is an unbiased sample from the universe of examples.
Catalogue: purchase history is normally based only on each merchant’s own data. However, it may not be representative of the population that could potentially purchase from the merchant.
Drug testing
Fraud detection
Other examples: see Zadrozny’04 and Smith and Elkan’04.
Effect of Bias on Model Construction
Inductive model P(y|x,M): has a non-trivial dependency on the constructed model M.
Recall that P(y|x) is the true conditional probability, “independent” of any modeling technique.
In general, P(y|x,M) != P(y|x).
If the model M is the “correct model”, sample selection bias doesn’t affect learning (Fan, Davidson, Zadrozny, and Yu’05). Otherwise, it does.
Key issues:
For real-world problems, we normally do not know the relationship between P(y|x,M) and P(y|x).
We have no exact idea about where the bias comes from.
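A short derivation may help show why the true conditional P(y|x) can survive biased sampling even though the sample itself is skewed. It rests on an assumption the slides do not state explicitly: a selection indicator s (s = 1 means the example was sampled) that depends on the feature vector x only, the setting usually called feature bias in the cited literature.

```latex
\begin{align*}
P(y \mid x, s=1) &= \frac{P(s=1 \mid x, y)\, P(y \mid x)}{P(s=1 \mid x)} \\
                 &= \frac{P(s=1 \mid x)\, P(y \mid x)}{P(s=1 \mid x)} = P(y \mid x), \\
P(x \mid s=1)    &= \frac{P(s=1 \mid x)\, P(x)}{P(s=1)} \neq P(x) \quad \text{in general.}
\end{align*}
```

Under that assumption only P(x) is distorted. A learner whose P(y|x,M) does not match P(y|x) can still be hurt, because it fits the regions the sample over-represents.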
Re-Capping
Our focus: how do we choose the best classification algorithm, given a potentially biased dataset?
No information on exactly how the data is biased.
No information on whether the learners are affected by the bias.
No information on the true model, P(y|x).
Failure of Traditional Methods
Given sample selection bias, cross-validation based methods are a bad indicator of which methods are the most accurate.
Results come next.
ReverseTesting
Basic idea: use the feature vectors x of the testing data to help order different models, even when their true labels y are not known.
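Basic Procedure
1. Train: apply algorithms A and B to the training data to build models MA and MB.
2. Test: use MA and MB to label the test data, giving two labeled test datasets, DA and DB.
3. Train: apply A and B to each labeled test dataset, giving four models: MAA and MBA (A and B trained on DA), MAB and MBB (A and B trained on DB).
4. Estimate the performance of MA and MB based on the order of MAA, MAB, MBA and MBB.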
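The procedure can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: a "learner" is assumed to be any function that takes (X, y) and returns a prediction function, the two toy learners (`learn_1nn`, `learn_majority`) and the tiny dataset are made up for the example, and the naming convention (first letter is the algorithm, second letter is whose labels were used) is read off the selection rule in the slides.

```python
def accuracy(model, X, y):
    # Fraction of examples whose predicted label matches the given label.
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def reverse_testing_scores(learn_a, learn_b, X_train, y_train, X_test):
    # Train each algorithm on the labeled training data (models MA, MB).
    m_a = learn_a(X_train, y_train)
    m_b = learn_b(X_train, y_train)
    # Label the unlabeled test feature vectors with each model (DA, DB).
    d_a = [m_a(x) for x in X_test]
    d_b = [m_b(x) for x in X_test]
    # Retrain each algorithm on each labeled test set and score the
    # resulting model on the original labeled training data.
    scores = {}
    for alg_name, learn in (("A", learn_a), ("B", learn_b)):
        for data_name, labels in (("A", d_a), ("B", d_b)):
            model = learn(X_test, labels)
            scores["M" + alg_name + data_name] = accuracy(model, X_train, y_train)
    return scores

def learn_1nn(X, y):
    # Toy learner (assumption): 1-nearest-neighbour with squared distance.
    pairs = list(zip(X, y))
    def predict(x):
        nearest = min(pairs, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return nearest[1]
    return predict

def learn_majority(X, y):
    # Toy learner (assumption): always predicts the most frequent label.
    most_common = max(sorted(set(y)), key=y.count)
    return lambda x: most_common

X_train = [(0.1, 0.9), (0.2, 0.7), (0.9, 0.1), (0.8, 0.3), (0.3, 0.6)]
y_train = ["red", "red", "green", "green", "red"]
X_test = [(0.15, 0.8), (0.85, 0.2), (0.4, 0.7), (0.7, 0.2), (0.6, 0.1)]

scores = reverse_testing_scores(learn_1nn, learn_majority, X_train, y_train, X_test)
print(scores)
```

Note that the true labels of `X_test` are never used. Only its feature vectors enter the procedure, which is the point of the method.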
Rule
If “A’s labeled test data” can construct “more accurate models” for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate.
If MAA > MAB and MBA > MBB, then choose A.
Similarly, if MAA < MAB and MBA < MBB, then choose B.
Otherwise, undecided.
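The rule translates directly into a small function. The sketch below assumes the four accuracies are passed in a dictionary keyed "MAA", "MAB", "MBA" and "MBB"; the function name and the dictionary layout are illustrative choices, not part of the slides.

```python
def choose_algorithm(scores):
    # scores maps "MAA", "MAB", "MBA", "MBB" to accuracies measured on the
    # labeled training data (dictionary layout is an assumption).
    maa, mab = scores["MAA"], scores["MAB"]
    mba, mbb = scores["MBA"], scores["MBB"]
    if maa > mab and mba > mbb:
        # A's labeled test data yields better models for both algorithms.
        return "A"
    if maa < mab and mba < mbb:
        # B's labeled test data yields better models for both algorithms.
        return "B"
    # Mixed evidence or ties: the rule makes no decision.
    return "undecided"

example = {"MAA": 0.92, "MAB": 0.85, "MBA": 0.88, "MBB": 0.80}
print(choose_algorithm(example))
```

Ties fall through to "undecided" because the rule uses strict inequalities.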
Heuristics of ReverseTesting
Assume that A is more accurate than B.
Use both A’s and B’s labeled data to train two models.
Using A’s data is likely to train a more accurate model than using B’s data.
Result Summary
[Chart: 10-fold CV, leave-one-out and ReverseTesting compared; series: Total, Wrong; axis from 0 to 50]
Why won’t CV work?
[Figure label: sparse region]
CV under-estimates in sparse regions.
1. Examples in sparse regions are under-represented in CV’s averaged results.
• Consider those examples near the decision boundary.
• A model that performs badly in these under-sampled regions is not accurately estimated in cross-validation.
2. CV could also create “biased folds” in these “sparse” regions.
• Their estimate on the biased region itself could also be unreliable.
3. CV gives no information on how a model behaves on “feature vectors” not represented in the training data.
[Figure: decision boundary of one fold in 10-fold CV. Panels: 1-fold; full training data]
Desiderata in ReverseTesting
Do not reduce the size of “sparse regions” as 10-fold CV does.
Do not use the “training model” or something close to the training model.
Utilize “feature vectors” not present in the training dataset.
C4.5 Decision Boundary
[Figure panels: training data; C4.5 labeled data; RDT labeled data. Annotation: C4.5 can never learn such a model from the training data.]
RDT Decision Boundary
[Figure panels: C4.5 labeled data; RDT labeled data]
Model Comparison
“Feature vectors in testing data” change the “decision boundary”.
The model constructed by algorithm A from A’s own labeled data != the original “training model”.
A’s “inductive bias” is represented in B’s space.
Use the changed boundary to put more emphasis on these sparse regions for both A and B re-trained on the two labeled test datasets.
Summary
Sample selection bias is a ubiquitous problem for DM and ML in practice.
For most applications and modeling techniques, sample selection bias does affect accuracy.
Given sample selection bias, CV-based methods are bad at estimating order.
ReverseTesting can do a much better job.
Future work: not only order models but also estimate their accuracy.