Cross Validation & Ensembling
Shan-Hung Wu <[email protected]>
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning
Outline

1. Cross Validation
   - How Many Folds?
2. Ensemble Methods
   - Voting
   - Bagging
   - Boosting
   - Why AdaBoost Works?
Cross Validation

So far, we have used the holdout method for:
- Hyperparameter tuning: validation set
- Performance reporting: testing set
What if we get an "unfortunate" split?

K-fold cross validation:
1. Split the data set $X$ evenly into $K$ subsets $X^{(i)}$ (called folds)
2. For $i = 1, \cdots, K$, train $f_N^{\setminus(i)}$ using all data but the $i$-th fold ($X \setminus X^{(i)}$)
3. Report the cross-validation error $C_{CV}$ by averaging the test errors $C[f_N^{\setminus(i)}]$ measured on the held-out folds $X^{(i)}$
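To make the procedure above concrete, here is a minimal K-fold CV sketch in Python (not part of the original slides); the iris dataset and the SVC base model are placeholder choices.

```python
# Minimal K-fold cross-validation sketch; dataset and classifier are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = SVC(C=1.0).fit(X[train_idx], y[train_idx])                # train on X \ X^(i)
    fold_errors.append(1.0 - model.score(X[test_idx], y[test_idx]))   # test error on X^(i)

C_CV = np.mean(fold_errors)   # cross-validation error: average of the K fold errors
print(f"5-fold CV error: {C_CV:.3f}")
```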
Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting.
E.g., 5×2 nested CV:
1. Inner loop (2 folds): select the hyperparameters giving the lowest $C_{CV}$
   - Can be wrapped by grid search
2. Train the final model using both the training and validation sets with the selected hyperparameters
3. Outer loop (5 folds): report $C_{CV}$ as the test error
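A sketch of 5×2 nested CV with scikit-learn (my own illustration, not the slides' code): the inner 2-fold grid search selects hyperparameters and refits on the whole outer training split; the outer 5-fold loop reports the test score. The breast-cancer dataset and the SVC hyperparameter grid are placeholders.

```python
# 5x2 nested CV sketch: inner grid search picks hyperparameters, outer loop reports test error.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner = KFold(n_splits=2, shuffle=True, random_state=0)   # hyperparameter selection
outer = KFold(n_splits=5, shuffle=True, random_state=0)   # performance reporting

# Inner CV wrapped by grid search; refit=True retrains on the whole inner data
# (training + validation folds) with the selected hyperparameters.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner, refit=True)

# Outer CV: each outer training split runs its own grid search; the held-out
# outer fold supplies the test score.
scores = cross_val_score(search, X, y, cv=outer)
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```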
How Many Folds K? I

The cross-validation error $C_{CV}$ is an average of the $C[f_N^{\setminus(i)}]$'s.
Regard each $C[f_N^{\setminus(i)}]$ as an estimator of the expected generalization error $E_X(C[f_N])$.
$C_{CV}$ is an estimator too, and we have
$$\mathrm{MSE}(C_{CV}) = E_X\big[(C_{CV} - E_X(C[f_N]))^2\big] = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$$
Point Estimation Revisited: Mean Square Error

Let $\hat{\theta}_n$ be an estimator of a quantity $\theta$ related to a random variable $x$, computed from $n$ i.i.d. samples of $x$.
Mean square error of $\hat{\theta}_n$:
$$\mathrm{MSE}(\hat{\theta}_n) = E_X\big[(\hat{\theta}_n - \theta)^2\big]$$
It can be decomposed into the bias and variance:
$$\begin{aligned}
E_X\big[(\hat{\theta}_n - \theta)^2\big] &= E\big[(\hat{\theta}_n - E[\hat{\theta}_n] + E[\hat{\theta}_n] - \theta)^2\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2 + (E[\hat{\theta}_n] - \theta)^2 + 2(\hat{\theta}_n - E[\hat{\theta}_n])(E[\hat{\theta}_n] - \theta)\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + E\big[(E[\hat{\theta}_n] - \theta)^2\big] + 2\,E\big[\hat{\theta}_n - E[\hat{\theta}_n]\big]\,(E[\hat{\theta}_n] - \theta) \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + \big(E[\hat{\theta}_n] - \theta\big)^2 + 2 \cdot 0 \cdot (E[\hat{\theta}_n] - \theta) \\
&= \mathrm{Var}_X(\hat{\theta}_n) + \mathrm{bias}(\hat{\theta}_n)^2
\end{aligned}$$
The MSE of an unbiased estimator is its variance.
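A small numerical check of the decomposition above (illustrative only, with made-up constants): estimate the mean $\theta$ of a Gaussian by the sample mean of $n$ draws and verify that MSE ≈ Var + bias² across many simulated trials.

```python
# Monte Carlo check of MSE(theta_hat_n) = Var(theta_hat_n) + bias(theta_hat_n)^2
# for the sample mean of n Gaussian draws (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.0, 10, 200_000

estimates = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)  # one theta_hat_n per trial

mse  = np.mean((estimates - theta) ** 2)
var  = np.var(estimates)
bias = np.mean(estimates) - theta
print(mse, var + bias ** 2)   # the two numbers agree up to Monte Carlo noise
```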
Example: 5-Fold vs. 10-Fold CV

$$\mathrm{MSE}(C_{CV}) = E_X\big[(C_{CV} - E_X(C[f_N]))^2\big] = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$$
Consider polynomial regression where the data are generated by $y = \sin(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Let $C[\cdot]$ be the MSE of a function's predictions with respect to the true labels.
In the figure:
- $E_X(C[f_N])$: red line
- $\mathrm{bias}(C_{CV})$: gaps between the red line and the other solid lines ($E_X[C_{CV}]$)
- $\mathrm{Var}_X(C_{CV})$: shaded areas
Decomposing Bias and Variance

$C_{CV}$ is an estimator of the expected generalization error $E_X(C[f_N])$:
$\mathrm{MSE}(C_{CV}) = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$, where
$$\begin{aligned}
\mathrm{bias}(C_{CV}) &= E_X(C_{CV}) - E_X(C[f_N]) = E\Big(\sum_i \tfrac{1}{K}\, C[f_N^{\setminus(i)}]\Big) - E(C[f_N]) \\
&= \tfrac{1}{K}\sum_i E\big(C[f_N^{\setminus(i)}]\big) - E(C[f_N]) \\
&= E\big(C[f_N^{\setminus(s)}]\big) - E(C[f_N]),\ \forall s \\
&= \mathrm{bias}\big(C[f_N^{\setminus(s)}]\big),\ \forall s
\end{aligned}$$
$$\begin{aligned}
\mathrm{Var}_X(C_{CV}) &= \mathrm{Var}\Big(\sum_i \tfrac{1}{K}\, C[f_N^{\setminus(i)}]\Big) = \tfrac{1}{K^2}\,\mathrm{Var}\Big(\sum_i C[f_N^{\setminus(i)}]\Big) \\
&= \tfrac{1}{K^2}\Big(\sum_i \mathrm{Var}\big(C[f_N^{\setminus(i)}]\big) + 2\sum_{i,j:\, j>i} \mathrm{Cov}_X\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big)\Big) \\
&= \tfrac{1}{K}\,\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j:\, j>i} \mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big),\ \forall s
\end{aligned}$$
How Many Folds K? II

$\mathrm{MSE}(C_{CV}) = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$, where
$$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{\setminus(s)}]\big),\ \forall s$$
$$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\,\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j:\, j>i} \mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big),\ \forall s$$
By learning theory, we can reduce both $\mathrm{bias}(C_{CV})$ and $\mathrm{Var}(C_{CV})$ by:
- Choosing the right model complexity, avoiding both underfitting and overfitting
- Collecting more training examples ($N$)
Furthermore, we can reduce $\mathrm{Var}(C_{CV})$ by making $f_N^{\setminus(i)}$ and $f_N^{\setminus(j)}$ uncorrelated.
How Many Folds K? III

$$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{\setminus(s)}]\big),\ \forall s$$
$$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\,\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j:\, j>i} \mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big),\ \forall s$$
With a large $K$, $C_{CV}$ tends to have:
- Low $\mathrm{bias}\big(C[f_N^{\setminus(s)}]\big)$ and $\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big)$, as each $f_N^{\setminus(s)}$ is trained on more examples
- High $\mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big)$, as the training sets $X \setminus X^{(i)}$ and $X \setminus X^{(j)}$ are more similar, so $C[f_N^{\setminus(i)}]$ and $C[f_N^{\setminus(j)}]$ are more positively correlated
How Many Folds K? IV

$$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{\setminus(s)}]\big),\ \forall s$$
$$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\,\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j:\, j>i} \mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big),\ \forall s$$
Conversely, with a small $K$, the cross-validation error tends to have high $\mathrm{bias}\big(C[f_N^{\setminus(s)}]\big)$ and $\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big)$ but low $\mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big)$.
In practice, we usually set $K = 5$ or $10$.
Leave-One-Out CV

$$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{\setminus(s)}]\big),\ \forall s$$
$$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\,\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j:\, j>i} \mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big),\ \forall s$$
For a very small dataset:
- $\mathrm{MSE}(C_{CV})$ is dominated by $\mathrm{bias}\big(C[f_N^{\setminus(s)}]\big)$ and $\mathrm{Var}\big(C[f_N^{\setminus(s)}]\big)$, not by $\mathrm{Cov}\big(C[f_N^{\setminus(i)}], C[f_N^{\setminus(j)}]\big)$
- We can choose $K = N$, which is called leave-one-out CV
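A leave-one-out CV sketch with scikit-learn (placeholder dataset and model): with $K = N$ every fold holds out a single example, so this is only practical for small datasets.

```python
# Leave-one-out CV (K = N); each of the N folds holds out exactly one example.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("LOO CV error:", 1.0 - scores.mean())
```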
Ensemble Methods

No free lunch theorem: there is no single ML algorithm that always outperforms the others in all domains/tasks.
Can we combine multiple base-learners to improve
- applicability across different domains, and/or
- generalization performance on a specific task?
These are the goals of ensemble learning.
How do we "combine" multiple base-learners?
Voting

Voting: linearly combine the predictions of the base-learners for each $x$:
$$y_k = \sum_j w_j\, y_k^{(j)}, \quad \text{where } w_j \ge 0,\ \sum_j w_j = 1$$
If all learners are given equal weight $w_j = 1/L$, we have the plurality vote (the multi-class version of the majority vote).

Voting Rule   | Formula
Sum           | $y_k = \frac{1}{L}\sum_{j=1}^{L} y_k^{(j)}$
Weighted sum  | $y_k = \sum_j w_j\, y_k^{(j)}$, $w_j \ge 0$, $\sum_j w_j = 1$
Median        | $y_k = \mathrm{median}_j\, y_k^{(j)}$
Minimum       | $y_k = \min_j y_k^{(j)}$
Maximum       | $y_k = \max_j y_k^{(j)}$
Product       | $y_k = \prod_j y_k^{(j)}$
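The voting rules in the table can be written directly in NumPy; the per-class scores and weights below are made-up values used only to show the shapes involved.

```python
# Voting rules applied to the per-class outputs y_k^(j) of L = 3 base-learners
# on a single example x (values are made up for illustration).
import numpy as np

preds = np.array([[0.2, 0.8],      # preds[j, k] = y_k^(j)
                  [0.4, 0.6],
                  [0.9, 0.1]])
w = np.array([0.5, 0.3, 0.2])      # w_j >= 0, sum_j w_j = 1

y_sum      = preds.mean(axis=0)    # sum rule with equal weights 1/L
y_weighted = w @ preds             # weighted sum
y_median   = np.median(preds, axis=0)
y_min      = preds.min(axis=0)
y_max      = preds.max(axis=0)
y_product  = preds.prod(axis=0)

print("plurality vote picks class", int(np.argmax(y_sum)))
```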
Why Voting Works? I

Assume that each $y^{(j)}$ has expected value $E_X\big(y^{(j)}\,|\,x\big)$ and variance $\mathrm{Var}_X\big(y^{(j)}\,|\,x\big)$.
When $w_j = 1/L$, we have:
$$E_X(y\,|\,x) = E\Big(\sum_j \tfrac{1}{L}\, y^{(j)}\,\Big|\,x\Big) = \tfrac{1}{L}\sum_j E\big(y^{(j)}\,|\,x\big) = E\big(y^{(j)}\,|\,x\big)$$
$$\begin{aligned}
\mathrm{Var}_X(y\,|\,x) &= \mathrm{Var}\Big(\sum_j \tfrac{1}{L}\, y^{(j)}\,\Big|\,x\Big) = \tfrac{1}{L^2}\,\mathrm{Var}\Big(\sum_j y^{(j)}\,\Big|\,x\Big) \\
&= \tfrac{1}{L}\,\mathrm{Var}\big(y^{(j)}\,|\,x\big) + \tfrac{2}{L^2}\sum_{i,j:\, i<j}\mathrm{Cov}\big(y^{(i)}, y^{(j)}\,|\,x\big)
\end{aligned}$$
The expected value doesn't change, so the bias doesn't change.
Why Voting Works? II

$$\mathrm{Var}_X(y\,|\,x) = \tfrac{1}{L}\,\mathrm{Var}\big(y^{(j)}\,|\,x\big) + \tfrac{2}{L^2}\sum_{i,j:\, i<j}\mathrm{Cov}\big(y^{(i)}, y^{(j)}\,|\,x\big)$$
If $y^{(i)}$ and $y^{(j)}$ are uncorrelated, the variance can be reduced.
Unfortunately, the $y^{(j)}$'s may not be i.i.d. in practice.
If the voters are positively correlated, the variance increases.
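A quick simulation of the variance formula above (illustrative, with made-up numbers): averaging $L$ voters shrinks the variance by roughly $1/L$ when the voters are uncorrelated, but much less when they are positively correlated.

```python
# Averaging L predictors: variance ~ 1/L when uncorrelated, ~ rho + (1-rho)/L when
# pairwise correlation is rho (here rho = 0.5). Values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

uncorr = rng.normal(size=(trials, L))                 # Cov(y_i, y_j) = 0
shared = rng.normal(size=(trials, 1))                 # common component shared by all voters
corr   = np.sqrt(0.5) * shared + np.sqrt(0.5) * rng.normal(size=(trials, L))

print("single voter variance:         ", uncorr[:, 0].var())
print("average of uncorrelated voters:", uncorr.mean(axis=1).var())   # about 1/L
print("average of correlated voters:  ", corr.mean(axis=1).var())     # about 0.5 + 0.5/L
```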
Bagging

Bagging (short for bootstrap aggregating) is a voting method, but the base-learners are made different deliberately.
How? Train them on slightly different training sets:
1. Generate $L$ slightly different samples from the given sample by bootstrapping: given $X$ of size $N$, draw $N$ points randomly from $X$ with replacement to get $X^{(j)}$
   - Some instances may be drawn more than once, and some not at all
2. Train a base-learner on each $X^{(j)}$
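A bagging sketch in Python (my own illustration, not the slides' code): draw $L$ bootstrap samples, fit one base-learner per sample, and combine them with a plurality vote. The decision-tree base-learner and the iris data are placeholder choices.

```python
# Bagging: L bootstrap samples of size N (drawn with replacement), one tree per
# sample, then a plurality vote over the L predictions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
L, N = 25, len(X)

learners = []
for _ in range(L):
    idx = rng.integers(0, N, size=N)                    # bootstrap sample X^(j)
    learners.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.stack([clf.predict(X) for clf in learners])  # shape (L, N)
y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)  # plurality vote
print("ensemble training accuracy:", (y_hat == y).mean())
```

scikit-learn's BaggingClassifier packages the same idea.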
Boosting

In bagging, generating "uncorrelated" base-learners is left to chance and to the instability of the learning method.
In boosting, we generate complementary base-learners.
How? Train the next learner on the mistakes of the previous learners.
For simplicity, consider binary classification here: $d^{(j)}(x) \in \{1, -1\}$.
The original boosting algorithm combines three weak learners to generate a strong learner:
- A weak learner has error probability less than 1/2 (better than random guessing)
- A strong learner has arbitrarily small error probability
Original Boosting Algorithm

Training
1. Given a large training set, randomly divide it into three parts $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$
2. Use $X^{(1)}$ to train the first learner $d^{(1)}$, then feed $X^{(2)}$ to $d^{(1)}$
3. Use all the points of $X^{(2)}$ misclassified by $d^{(1)}$ to train $d^{(2)}$. Then feed $X^{(3)}$ to $d^{(1)}$ and $d^{(2)}$
4. Use the points on which $d^{(1)}$ and $d^{(2)}$ disagree to train $d^{(3)}$

Testing
1. Feed a point to $d^{(1)}$ and $d^{(2)}$ first. If their outputs agree, use that as the final prediction
2. Otherwise, take the output of $d^{(3)}$
Example

Assuming $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ are the same:

Disadvantage: requires a large training set to afford the three-way split.
AdaBoost

AdaBoost uses the same training set over and over again.
How do we make some points "larger" (more important)? Modify the probabilities of drawing the instances as a function of the error.
Notation:
- $\Pr^{(i,j)}$: probability that example $(x^{(i)}, y^{(i)})$ is drawn to train the $j$-th base-learner $d^{(j)}$
- $\epsilon^{(j)} = \sum_i \Pr^{(i,j)}\, \mathbb{1}\big(y^{(i)} \ne d^{(j)}(x^{(i)})\big)$: error rate of $d^{(j)}$ on its training set
Algorithm

Training
1. Initialize $\Pr^{(i,1)} = \frac{1}{N}$ for all $i$
2. For $j = 1, 2, \ldots$:
   1. Randomly draw $N$ examples from $X$ with probabilities $\Pr^{(i,j)}$ and use them to train $d^{(j)}$
   2. Stop adding new base-learners if $\epsilon^{(j)} \ge \frac{1}{2}$
   3. Define $\alpha_j = \frac{1}{2}\log\Big(\frac{1-\epsilon^{(j)}}{\epsilon^{(j)}}\Big) > 0$ and set $\Pr^{(i,j+1)} = \Pr^{(i,j)} \cdot \exp\big(-\alpha_j\, y^{(i)}\, d^{(j)}(x^{(i)})\big)$ for all $i$
   4. Normalize $\Pr^{(i,j+1)}$, $\forall i$, by multiplying by $\big(\sum_i \Pr^{(i,j+1)}\big)^{-1}$

Testing
1. Given $x$, calculate $y^{(j)}$ for all $j$
2. Make the final prediction $y$ by voting: $y = \sum_j \alpha_j\, d^{(j)}(x)$
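A sketch of the training loop above (the resampling variant described on the slides), with depth-1 decision trees as the weak learners $d^{(j)}$; the synthetic dataset is a placeholder, and labels are mapped to $\{-1, +1\}$ as assumed earlier.

```python
# AdaBoost (resampling variant): draw N examples with probabilities Pr^(i,j),
# fit a weak learner, compute its weighted error, and re-weight the examples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
N, L = len(X), 20
rng = np.random.default_rng(0)

p = np.full(N, 1.0 / N)                          # Pr^(i,1) = 1/N
learners, alphas = [], []
for j in range(L):
    idx = rng.choice(N, size=N, p=p)             # draw N examples with prob. Pr^(i,j)
    d = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx], y[idx])
    eps = np.sum(p * (d.predict(X) != y))        # epsilon^(j): weighted error rate
    if eps <= 0 or eps >= 0.5:                   # stop adding new base-learners
        break
    alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_j > 0
    p *= np.exp(-alpha * y * d.predict(X))       # up-weight the mistakes of d^(j)
    p /= p.sum()                                 # normalize Pr^(i,j+1)
    learners.append(d)
    alphas.append(alpha)

f = sum(a * d.predict(X) for a, d in zip(alphas, learners))  # f(x) = sum_j alpha_j d^(j)(x)
print("ensemble training accuracy:", (np.sign(f) == y).mean())
```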
Example

$d^{(j+1)}$ complements $d^{(j)}$ and $d^{(j-1)}$ by focusing on the predictions they disagree on.
The voting weights $\big(\alpha_j = \frac{1}{2}\log\frac{1-\epsilon^{(j)}}{\epsilon^{(j)}}\big)$ in the final prediction are proportional to each base-learner's accuracy.
Why AdaBoost Works

Why does AdaBoost improve performance?
By increasing model complexity? Not exactly.
Empirical studies show that AdaBoost reduces overfitting as $L$ grows, even when there is no training error.
AdaBoost increases the margin [1, 2].
Margin as Confidence of Predictions

Recall that in SVC, a larger margin improves generalizability
- Due to higher-confidence predictions over the training examples
We can define a margin for AdaBoost similarly.
In binary classification, define the margin of the prediction for an example $(x^{(i)}, y^{(i)}) \in X$ as:
$$\mathrm{margin}(x^{(i)}, y^{(i)}) = y^{(i)} f(x^{(i)}) = \sum_{j:\, y^{(i)} = d^{(j)}(x^{(i)})} \alpha_j \;-\; \sum_{j:\, y^{(i)} \ne d^{(j)}(x^{(i)})} \alpha_j$$
Margin Distribution

Margin distribution over $\theta$:
$$\Pr_X\big(y^{(i)} f(x^{(i)}) \le \theta\big) \approx \frac{\big|\{(x^{(i)}, y^{(i)}) : y^{(i)} f(x^{(i)}) \le \theta\}\big|}{|X|}$$
A complementary learner:
- Clarifies low-confidence areas
- Increases the margin of the points in these areas
References

[1] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
[2] Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. On the margin explanation of boosting algorithms. In COLT, pages 479-490, 2008.