Sample Selection Bias – Covariate Shift: Problems, Solutions, and Applications
Wei Fan, IBM T.J.Watson Research
Masashi Sugiyama, Tokyo Institute of Technology
Updated PPT is available:
http://www.weifan.info/tutorial.htm
Overview of Sample Selection Bias Problem
A Toy Example
Two classes: red and green
– red: f2 > f1
– green: f2 <= f1
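As a minimal sketch of this toy setting (the selection probability below is a hypothetical choice, not the one used in the tutorial), feature-biased sampling can be simulated by making the inclusion probability depend on x only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on the unit square: red if f2 > f1, green otherwise
X = rng.uniform(0, 1, size=(2000, 2))
y = (X[:, 1] > X[:, 0]).astype(int)  # 1 = red, 0 = green

# Feature bias: P(s=1|x,y) = P(s=1|x); here points with small f1
# are far more likely to be selected (an illustrative choice)
p_select = np.exp(-5 * X[:, 0])
s = rng.uniform(size=len(X)) < p_select

X_biased, y_biased = X[s], y[s]
print(len(X_biased), "of", len(X), "points selected")
```

Because selection depends only on x, the labeling rule inside the biased sample is unchanged, which is exactly the covariate-shift situation discussed below.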
Unbiased and Biased Samples
Not-so-biased sampling vs. biased sampling
Effect on Learning
Accuracy drops when training on the biased sample, e.g., 97.1% (unbiased) vs. 92.1% (biased), 96.9% vs. 95.9%, and 96.4% vs. 92.7% for three different learners.
• Some techniques are more sensitive to bias than others.
• One important question: how to reduce the effect of sample selection bias?
Ubiquitous
• Loan approval
• Drug screening
• Weather forecasting
• Ad campaigns
• Fraud detection
• User profiling
• Biomedical informatics
• Intrusion detection
• Insurance
• etc.
1. Normally, banks only have data on their own customers.
2. “Late payment, default” models are computed using their own data.
3. New customers may not completely follow the same distribution.
Face Recognition
• Sample selection bias:
– Training samples are taken inside a research lab, where there are few women.
– Test samples: in the real world, the men-women ratio is almost 50-50.
The Yale Face Database B
Brain-Computer Interface (BCI)
• Control computers by EEG signals:
– Input: EEG signals
– Output: Left or Right
Figure provided by Fraunhofer FIRST, Berlin, Germany
Training
• Imagine left/right-hand movement following the letter on the screen
Movie provided by Fraunhofer FIRST, Berlin, Germany
Testing: Playing Games
• “Brain-Pong”
Movie provided by Fraunhofer FIRST, Berlin, Germany
Non-Stationarity in EEG Features
Bandpower differences betweentraining and test phases
• Different mental conditions (attention, sleepiness etc.) between training and test phases may change the EEG signals.
Features extracted from brain activityduring training and test phases
Figures provided by Fraunhofer FIRST, Berlin, Germany
Robot Controlby Reinforcement Learning
• Let the robot learn how to move autonomously without explicit supervision.
Khepera Robot
Rewards
• Give the robot rewards:
– Go forward: positive reward
– Hit wall: negative reward
• Goal: Learn the control policy that maximizes future rewards
Robot moves autonomously = goes forward without hitting the wall
Example
• After learning:
Policy Iteration and Covariate Shift
• Policy iteration:
• Updating the policy corresponds to changing the input distribution!
Evaluate control policy → Improve control policy → (repeat)
Different Types of Sample Selection Bias
Bias as Distribution
• Think of “sampling an example (x,y) into the training data” as an event denoted by random variable s:
– s=1: example (x,y) is sampled into the training data
– s=0: example (x,y) is not sampled
• Think of bias as a conditional probability of “s=1” dependent on x and y
• P(s=1|x,y) : the probability for (x,y) to be sampled into the training data, conditional on the example’s feature vector x and class label y.
Categorization (Zadrozny’04, Fan et al.’05, Fan and Davidson’07)
– No Sample Selection Bias• P(s=1|x,y) = P(s=1)
– Feature Bias/Covariate Shift• P(s=1|x,y) = P(s=1|x)
– Class Bias• P(s=1|x,y) = P(s=1|y)
– Complete Bias• No more reduction
• Alternatively, consider a D of such size that the universe of examples can be sampled “exhaustively”.
Bias for a Training Set
• How is P(s=1|x,y) computed?
• Practically, for a given training set D:
– P(s=1|x,y) = 1 if (x,y) is sampled into D
– P(s=1|x,y) = 0 otherwise
Are Realistic Datasets Biased?
• Most datasets are biased.
• Unlikely to sample each and every feature vector.
• For most problems, it is at least feature bias.– P(s=1|x,y) = P(s=1|x)
Effect on Learning
• Learning algorithms estimate the “true conditional probability”:
– True probability P(y|x), such as P(fraud|x)
– Estimated probability P(y|x,M), where M is the built model
• Conditional probability in the biased data:
– P(y|x,s=1)
• Key Issue:– P(y|x,s=1) = P(y|x) ?
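A one-line check with Bayes’ rule (using the definitions above) shows why feature bias is the benign case for this key issue:

```latex
P(y \mid x, s{=}1)
 = \frac{P(s{=}1 \mid x, y)\, P(y \mid x)}{P(s{=}1 \mid x)}
 \overset{\text{feature bias}}{=}
   \frac{P(s{=}1 \mid x)\, P(y \mid x)}{P(s{=}1 \mid x)}
 = P(y \mid x)
```

So under pure feature bias / covariate shift the conditional is preserved wherever P(s=1|x) > 0, while under class bias or complete bias it generally is not.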
Bias Resolutions
Heckman’s Two-Step Approach
• Estimate one’s donation amount if one does donate.
• An accurate estimate cannot be obtained by a regression using only data from donors.
• First step: a probit model estimates the probability of donating.
• Second step: a regression model estimates the donation amount, correcting for the expected error of the selection step under a Gaussian assumption.
Covariate Shift or Feature Bias
• However, there is no chance for generalization if training and test samples have nothing in common.
• Covariate shift:
– Input distribution changes
– Functional relation remains unchanged
Example of Covariate Shift
• (Weak) extrapolation: predict output values outside the training region (training samples on one side, test samples on the other).
Covariate Shift Adaptation
• To illustrate the effect of covariate shift, let us focus on linear extrapolation.
[Figure: training samples, test samples, true function, and learned function]
Generalization Error = Bias + Variance (expectation taken over the noise)
Model Specification
• A model is said to be correctly specified if it can represent the true function exactly.
• In practice, our model may not be correct.
• Therefore, we need a theory for misspecified models!
Ordinary Least-Squares (OLS)
• If the model is correct: OLS minimizes the bias asymptotically.
• If the model is misspecified: OLS does not minimize the bias even asymptotically.
We want to reduce bias!
Law of Large Numbers
• The sample average converges to the population mean:
$\frac{1}{n}\sum_{i=1}^{n} f(x_i) \longrightarrow \mathbb{E}_{x \sim p}[f(x)]$
• We want to estimate the expectation over test input points using only training input points.
Key Trick: Importance-Weighted Average
• Importance: the ratio of the test and training input densities, $w(x) = \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)}$
• Importance-weighted average:
$\frac{1}{n_{\mathrm{tr}}}\sum_{i=1}^{n_{\mathrm{tr}}} w(x_i^{\mathrm{tr}})\, f(x_i^{\mathrm{tr}}) \longrightarrow \mathbb{E}_{x \sim p_{\mathrm{te}}}[f(x)]$
(cf. importance sampling)
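A small numerical sketch of this trick, with both densities assumed known (1-D Gaussians chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Training and test input densities: N(0,1) and N(1,1)
mu_tr, mu_te, sigma = 0.0, 1.0, 1.0

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

x_tr = rng.normal(mu_tr, sigma, size=100_000)
f = lambda x: x ** 2  # any test function

# Plain average estimates the *training* expectation E_tr[f] = 1
plain = f(x_tr).mean()

# Importance-weighted average estimates the *test* expectation
# E_te[f] = mu_te^2 + sigma^2 = 2
w = gauss(x_tr, mu_te, sigma) / gauss(x_tr, mu_tr, sigma)
iw = (w * f(x_tr)).mean()

print(plain, iw)  # ≈ 1 and ≈ 2
```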
Importance-Weighted LS
• Even for misspecified models, IWLS minimizes the bias asymptotically:
$\hat{\theta}_{\mathrm{IWLS}} = \arg\min_{\theta} \sum_{i=1}^{n_{\mathrm{tr}}} w(x_i^{\mathrm{tr}}) \bigl(\hat{f}_{\theta}(x_i^{\mathrm{tr}}) - y_i^{\mathrm{tr}}\bigr)^2$
• We need to estimate the importance in practice.
(The training input density is assumed strictly positive.)
(Shimodaira, JSPI2000)
Use of Unlabeled Samples: Importance Estimation
• Assumption: we have training inputs $\{x_i^{\mathrm{tr}}\}$ and test inputs $\{x_j^{\mathrm{te}}\}$.
• Naive approach: estimate $p_{\mathrm{tr}}(x)$ and $p_{\mathrm{te}}(x)$ separately, and take the ratio of the density estimates.
• This does not work well since density estimation is hard in high dimensions.
Vapnik’s Principle
When solving a problem of interest, one should not solve a more general (and more difficult) problem as an intermediate step.
• Directly estimating the ratio is easier than estimating the two densities!
Knowing the densities ⇒ knowing the ratio, but not vice versa.
(e.g., support vector machines follow this principle)
Modeling the Importance Function
• Use a linear importance model: $\hat{w}(x) = \sum_{\ell} \alpha_{\ell}\, \varphi_{\ell}(x)$ with basis functions $\varphi_{\ell}$.
• The test density is then approximated by $\hat{p}_{\mathrm{te}}(x) = \hat{w}(x)\, p_{\mathrm{tr}}(x)$.
• Idea: learn $\{\alpha_{\ell}\}$ so that $\hat{p}_{\mathrm{te}}(x)$ well approximates $p_{\mathrm{te}}(x)$.
Kullback-Leibler Divergence
• $\mathrm{KL}(p_{\mathrm{te}} \,\|\, \hat{p}_{\mathrm{te}}) = \underbrace{\int p_{\mathrm{te}}(x) \log \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)}\, dx}_{\text{constant}} \;-\; \underbrace{\int p_{\mathrm{te}}(x) \log \hat{w}(x)\, dx}_{\text{relevant}}$
Learning the Importance Function
• Thus, maximize $\frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \log \hat{w}(x_j^{\mathrm{te}})$ (objective function).
• Since $\hat{p}_{\mathrm{te}}$ is a density: $\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \hat{w}(x_i^{\mathrm{tr}}) = 1$ and $\hat{w}(x) \ge 0$ (constraints).
KLIEP (Kullback-Leibler Importance Estimation Procedure)
• Convexity: unique global solution is available
• Sparse solution: prediction is fast!
(Sugiyama et al., NIPS2007)
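A bare-bones sketch of the KLIEP optimization (projected gradient ascent with Gaussian kernels centered on test points; step size, widths, and data are illustrative assumptions, not the original implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

x_tr = rng.normal(0.0, 1.0, size=(200, 1))
x_te = rng.normal(0.5, 1.0, size=(200, 1))

def kernel(a, b, width=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

centers = x_te[:50]                 # kernel centers on test points, as in KLIEP
A = kernel(x_te, centers)           # (n_te, b): basis values at test points
B = kernel(x_tr, centers).mean(0)   # (b,): for the normalization constraint

alpha = np.ones(len(centers))

# Maximize mean(log(A @ alpha)) s.t. alpha >= 0 and B @ alpha = 1
for _ in range(2000):
    grad = A.T @ (1.0 / (A @ alpha)) / len(A)
    alpha = np.maximum(alpha + 0.01 * grad, 0.0)  # gradient step + positivity
    alpha /= B @ alpha                             # enforce normalization

w_tr = kernel(x_tr, centers) @ alpha  # estimated importance at training points
print(w_tr.mean())  # = 1 by the constraint
```

Since the test distribution is shifted to the right, the learned weights should grow with x, up-weighting training points that look like test points.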
Examples
Experiments: Setup
• Input distributions: standard Gaussians with
– Training: mean (0,0,…,0)
– Test: mean (1,0,…,0)
• Kernel density estimation (KDE):
– Separately estimate the training and test input densities.
– The Gaussian kernel width is chosen by likelihood cross-validation.
• KLIEP:
– The Gaussian kernel width is chosen by likelihood cross-validation.
• KDE: Error increases as dim grows
• KLIEP: Error remains small for large dim
Experimental Results
[Figure: normalized MSE of the importance estimate vs. input dimensionality, for KDE and KLIEP]
Ensemble Methods (Fan and Davidson’07)
• Posterior weighting: average the estimated class probabilities weighted by the model posterior (integration over the model space).
• Averaging removes model uncertainty.
How to Use Them
• Estimate the “joint probability” P(x,y) instead of just the conditional probability:
– P(x,y) = P(y|x)P(x)
– This makes no difference with a single model, but it does with multiple models.
Examples of How This Works
• P1(+|x) = 0.8 and P2(+|x) = 0.4
• P1(-|x) = 0.2 and P2(-|x) = 0.6
• With model averaging:
– P(+|x) = (0.8 + 0.4) / 2 = 0.6
– P(-|x) = (0.2 + 0.6) / 2 = 0.4
– Prediction will be +
• But suppose there are two P(x) models, giving probabilities 0.05 and 0.4 for x.
• Then– P(+,x) = 0.05 * 0.8 + 0.4 * 0.4 = 0.2– P(-,x) = 0.05 * 0.2 + 0.4 * 0.6 = 0.25
• Recall with model averaging: – P(+|x) = 0.6 and P(-|x)=0.4– Prediction is +
• But now the prediction will be – instead of +.
• Key idea: unlabeled examples can be used as “weights” to re-weight the models.
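The arithmetic of the example above, spelled out:

```python
# Two conditional models and two P(x) models, as in the example
p1_pos, p2_pos = 0.8, 0.4   # P1(+|x), P2(+|x)
px1, px2 = 0.05, 0.4        # P(x) under model 1 and model 2

# Plain model averaging of conditionals
avg_pos = (p1_pos + p2_pos) / 2              # 0.6 -> predict +
avg_neg = ((1 - p1_pos) + (1 - p2_pos)) / 2  # 0.4

# Joint-probability weighting: P(y,x) = sum_i P(x | model i) * P_i(y|x)
joint_pos = px1 * p1_pos + px2 * p2_pos              # 0.05*0.8 + 0.4*0.4 = 0.2
joint_neg = px1 * (1 - p1_pos) + px2 * (1 - p2_pos)  # 0.05*0.2 + 0.4*0.6 = 0.25

print(avg_pos > avg_neg)      # True: averaging predicts +
print(joint_pos > joint_neg)  # False: joint weighting predicts -
```

The P(x) factors, which can be estimated from unlabeled (test) examples, flip the decision.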
Structure Discovery (Ren et al.’08)
Original dataset → structure discovery → structural re-balancing → corrected dataset
Active Learning
• The quality of the learned function depends on the training input location.
• Goal: optimize the training input location.
[Figure: good vs. poor input locations; target vs. learned functions]
Challenges
• The generalization error is unknown and needs to be estimated.
• In experiment design, we do not have the training output values yet.
• Thus we cannot use, e.g., cross-validation, which requires output values.
• Only the training input positions can be used in generalization error estimation!
Agnostic Setup
• The model is not correct in practice.
• Then OLS is not consistent.
• Standard “experiment design” method does not work!
(Fedorov 1972; Cohn et al., JAIR1996)
Bias Reduction by Importance-Weighted LS (IWLS)
• The use of IWLS mitigates the problem of inconsistency in the agnostic setup.
• The importance is known in the active learning setup, since the training input distribution is designed by us!
(Wiens JSPI2001; Kanamori & Shimodaira JSPI2003; Sugiyama JMLR2006)
Model Selection and Testing
Model Selection
• The choice of model is crucial:
• We want to determine the model so that the generalization error is minimized.
[Figure: polynomial fits of order 1, 2, and 3]
Generalization Error Estimation
• Generalization error is not accessible since the target function is unknown.
• Instead, we use a generalization error estimate.
Cross-Validation
• Divide the training samples into k groups.
• Train a learning machine with k−1 groups.
• Validate the trained machine on the remaining group.
• Repeat this for all combinations and output the mean validation error.
• CV is almost unbiased without covariate shift.
• But it is heavily biased under covariate shift!
Importance-Weighted CV (IWCV)
• When testing the classifier in the CV process, we also importance-weight the validation error.
• IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
(Zadrozny ICML2004; Sugiyama et al., JMLR2007)
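A sketch of IWCV for choosing a polynomial order under covariate shift (the task, densities, and fold count are illustrative assumptions; the importance is taken as known):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical covariate-shift regression task
x_tr = rng.normal(1.0, 0.5, size=300)
y_tr = np.sin(x_tr) + 0.1 * rng.standard_normal(300)

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_test(x) / p_train(x)
w = gauss(x_tr, 2.0, 0.5) / gauss(x_tr, 1.0, 0.5)

def iwcv_error(deg, k=5):
    """k-fold CV where each held-out squared error is importance-weighted."""
    idx = rng.permutation(len(x_tr))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x_tr[train], y_tr[train], deg)
        resid = np.polyval(coef, x_tr[fold]) - y_tr[fold]
        errs.append(np.mean(w[fold] * resid ** 2))  # importance-weighted error
    return float(np.mean(errs))

scores = {deg: iwcv_error(deg) for deg in (1, 2, 3)}
best = min(scores, key=scores.get)
print(scores, "chosen order:", best)
```

Replacing `w` with all-ones recovers ordinary CV, which scores models by their fit in the training region rather than the test region.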
Example of IWCV
• IWCV gives better estimates of generalization error.
• Model selection by IWCV outperforms CV!
Reserve Testing (Fan and Davidson’06)
[Diagram: algorithms A and B are trained on the labeled training data, producing models MA and MB. MA and MB then label the test data, producing DA and DB. Both algorithms are trained again on DA and DB, producing MAA, MAB, MBA, and MBB, which are evaluated on the labeled training data.]
Estimate the performance of MA and MB based on the order of MAA, MAB, MBA, and MBB.
Rule
• If “A’s labeled test data” can construct more accurate models for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate:
– If MAA > MAB and MBA > MBB, then choose A.
• Similarly:
– If MAA < MAB and MBA < MBB, then choose B.
• Otherwise, undecided.
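The decision rule itself is a few lines; here is a literal transcription (the accuracy values would come from the reverse-trained models described above):

```python
def reserve_test_choice(m_aa, m_ab, m_ba, m_bb):
    """Pick an algorithm by the reserve-testing rule.

    Each argument is the accuracy, on the labeled training data, of a model
    trained on machine-labeled test data (MAA, MAB, MBA, MBB as in the slides).
    """
    if m_aa > m_ab and m_ba > m_bb:
        return "A"
    if m_aa < m_ab and m_ba < m_bb:
        return "B"
    return "undecided"

print(reserve_test_choice(0.9, 0.8, 0.7, 0.6))  # A
```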
Why Won’t CV Work?
Sparse Region
Examples
Ozone Day Prediction (Zhang et al’06)
– Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ)
1. Rather skewed and relatively sparse distribution
– 2500+ examples over 7 years (1998-2004)
– 72 continuous features with missing values
– Large instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number
– 2% and 5% true positive ozone days for the 1-hour and 8-hour peaks, respectively
Challenges as a Data Mining Problem
3. A large number of irrelevant features
– Only about 10 of the 72 features are verified to be relevant.
– There is no information on the relevancy of the other 62 features.
– For a stochastic problem with irrelevant features Xir, where X = (Xr, Xir), P(Y|X) = P(Y|Xr) only if the data is exhaustive.
– Irrelevant features may introduce overfitting and change the probability distribution represented in the data:
• P(Y = “ozone day” | Xr, Xir) → 1
• P(Y = “normal day” | Xr, Xir) → 0
4. “Feature sample selection bias”
– Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future.
– Given these, there are 2 closely-related challenges:
1. How to train an accurate model
2. How to effectively use the model to predict the future under a different and yet unknown distribution
[Figure: training vs. testing distributions; clusters 1-3 of positive and negative examples shift between the two phases]
Reliable probability estimation under irrelevant features
– Recall that due to irrelevant features:
• P(Y = “ozone day” | Xr, Xir) → 1
• P(Y = “normal day” | Xr, Xir) → 0
– Construct multiple models and average their predictions.
• P(“ozone” | xr): true probability
• P(“ozone” | Xr, Xir, θ): probability estimated by model θ
• MSE_SingleModel: difference between “true” and “estimated”
• MSE_Average: difference between “true” and the “average of many models”
• Formally show that MSEAverage ≤ MSESingleModel
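The inequality can be justified in one line via Jensen’s inequality (convexity of squaring), with p the true probability and p̂_k the K single-model estimates; strictly speaking, the averaged model beats the *average* single-model MSE:

```latex
\mathrm{MSE}_{\mathrm{Average}}
 = \mathbb{E}\!\left[\Bigl(\tfrac{1}{K}\textstyle\sum_{k=1}^{K}(p - \hat{p}_k)\Bigr)^{\!2}\right]
 \;\le\; \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\!\left[(p - \hat{p}_k)^2\right]
 = \overline{\mathrm{MSE}}_{\mathrm{SingleModel}}
```

with equality only when all models make identical errors, so diverse models help most.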
• Prediction with feature sample selection bias
[Diagram: the whole training set is run through 10-fold cross-validation; each fold’s held-out estimated probability values are concatenated into a “probability-true label” file, from which the decision threshold VE is chosen via a precision-recall plot (curves for models Ma and Mb).]
Example of the “probability-true label” file:

Date      P(y = “ozone day” | x, θ)   Label
7/1/98    0.1316                      Normal
7/2/98    0.6245                      Ozone
7/3/98    0.5944                      Ozone
…

– A CV-based procedure for decision threshold selection
Addressing Data Mining Challenges
• Prediction with feature sample selection bias
– Future prediction based on the selected decision threshold: train θ on the whole training set; for a future day, if P(Y = “ozone day” | X, θ) ≥ VE, predict “ozone day”.
Results
KDD/Netflix CUP’07 Task 1 (Liu and Kou ’07)
Task 1: Who rated what in 2006
Given a list of 100,000 user-movie pairs, predict for each pair the probability that the user rated the movie in 2006.
Result: they were the close runner-up, No. 3 out of 39 teams.
Challenges:
• Huge amount of data: how to sample the data so that any learning algorithm can be applied is critical.
• Complex affecting factors: decreasing interest in old movies, and the growing tendency of Netflix users to watch (and review) more movies.
[Figure: NETFLIX data generation process. A rating matrix of 17K movies grows as movies and users arrive over 1998-2005; Task 1 asks who rated what in 2006, with no new user or movie arrival; the qualifier dataset contains 3M pairs.]
Task 1: Effective Sampling Strategies
• Sample the movie-user pairs for “existing” users and “existing” movies, using 2004-2005 data as the training set and 4Q 2005 as the development set.
– The probability of picking a movie was proportional to the number of ratings that movie received; the same strategy was used for users.
[Diagram: per-movie and per-user sampling probabilities (e.g., Movie5: .0011, Movie3: .001, Movie4: .0007; User7: .0007, User6: .00012, User8: .00003) are combined to draw history samples such as (Movie5, User7), (Movie3, User7), (Movie4, User8) from the raw rating log (lines of the form user,rating,date).]
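The sampling strategy above, picking movies with probability proportional to their rating counts, can be sketched as follows (the counts are hypothetical; the same idea applies to users):

```python
import random

# Hypothetical rating counts per movie
rating_counts = {"Movie3": 1200, "Movie4": 800, "Movie5": 1500}

movies = list(rating_counts)
weights = [rating_counts[m] for m in movies]

random.seed(0)
# Draw with replacement, probability proportional to rating count
sample = random.choices(movies, weights=weights, k=10)
print(sample)
```

Heavily rated movies dominate the sample, matching the distribution of pairs the model will actually be asked about.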
• Learning algorithms:
– Single classifiers: logistic regression, ridge regression, decision trees, support vector machines
– Naive ensemble: combining sub-classifiers built on different types of features with pre-set weights
– Ensemble classifiers: combining sub-classifiers with weights learned from the development set
Brain-Computer Interface (BCI)
• Control computers by brain signals:
– Input: EEG signals
– Output: Left or Right
BCI Results
• When KL is large, covariate shift adaptation tends to improve accuracy.
• When KL is small, no difference.
Subject  Trial  No adaptation  With adaptation  KL
1        1      9.3 %          10.0 %           0.76
1        2      8.8 %          8.8 %            1.11
1        3      4.3 %          4.3 %            0.69
2        1      40.0 %         40.0 %           0.97
2        2      39.3 %         38.7 %           1.05
2        3      25.5 %         25.5 %           0.43
3        1      36.9 %         34.4 %           2.63
3        2      21.3 %         19.3 %           2.88
3        3      22.5 %         17.5 %           1.25
4        1      21.3 %         21.3 %           9.23
4        2      2.4 %          2.4 %            5.58
4        3      6.4 %          6.4 %            1.83
5        1      21.3 %         21.3 %           0.79
5        2      15.3 %         14.0 %           2.01

(KL: KL divergence from the training to the test input distribution)
Robot Control byReinforcement Learning
• Swing-up inverted pendulum:
– Swing up the pole by controlling the cart.
– Reward:
Results
[Movies: covariate shift adaptation (proposed method) vs. existing methods (a) and (b)]
Wafer Alignment in Semiconductor Exposure Apparatus
• Recent silicon wafers have a layered structure.
• Circuit patterns are exposed multiple times.
• Exact alignment of wafers is very important.
Markers on Wafer
• Wafer alignment process:
– Measure the locations of markers printed on the wafer.
– Shift and rotate the wafer to minimize the gap.
• For speed, reducing the number of markers to measure is very important.
Active learning problem!
Non-linear Alignment Model
• When the gap consists only of shift and rotation, a linear model is exact.
• However, non-linear factors exist, e.g.,– Warp– Biased characteristic of measurement apparatus– Different temperature conditions
• Exactly modeling non-linear factors is very difficult in practice!
Agnostic setup!
Experimental Results
• IWLS-based active learning works very well!
20 markers (out of 38) are chosen by experiment design methods; the gaps of all markers are then predicted. This is repeated for 220 different wafers. The table shows the mean (standard deviation) of the gap prediction error. Red: significantly better by the 5% Wilcoxon test. Blue: worse than the baseline passive method.

Mean squared error of wafer position estimation:
IWLS-based    OLS-based     “Outer” heuristic   Passive
2.27 (1.08)   2.37 (1.15)   2.36 (1.15)         2.32 (1.11)

(Sugiyama & Nakajima ECML-PKDD2008)
Conclusions
Book on Dataset Shift
• Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, Cambridge, 2008.