SOA Predictive Analytics Symposium
September 25, 2020
Ensemble Learning
Sometimes one model can’t do it all…
Ensemble Learning
Machine Learning algorithms are getting better all the time…
But each model is constrained to learn only some part of the structure of your data (e.g. linear structure, or the nested dependencies inherent in trees)
Combining the strengths of different models can produce superior predictions
What is an Ensemble?
• Combine predictions from multiple models generated from training data
• May be of the same class (e.g. trees) or of different classes
• Empirical studies show that predictions from combinations of models often perform better – the “wisdom of crowds”
One Model Is Rarely Uniformly Superior
Model Supervision vs Model Ensembles
The traditional approach is to hone a single model using the data, e.g. by adding predictors to a regression model
Complexity can be controlled by regularization and train/test validation
Model ensembles approach this differently – instead of a single “master” model, we combine multiple weaker models
The hope is that each model accurately captures some aspect of the data’s structure
Why does ensemble learning work?
• Statistical. Training data is small relative to the size of the space we want to search; averaging reduces the risk of choosing the wrong classifier
• Computational. Individual models can get stuck in local optima; averaging can reach a better overall prediction
• Representational. Individual learners may not span a large enough space to find the real solution; combining them enables search in an expanded space
How to Build an Ensemble?
An ensemble of classifiers should be accurate (low bias) and diverse (so the average has lower variance)
• “Accurate” means better than random guessing – not a high bar
• “Diverse” can mean that predictions are uncorrelated, or that the class of possible models spans the space well
Ensemble Method:
Step 1: Develop a population of base learners (usually “weak”); trees are often used
Step 2: Combine them to form a prediction (a minimal sketch follows)
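A minimal sketch of this two-step recipe, assuming scikit-learn; the dataset and learner choices are illustrative, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: a population of (weak-ish) base learners
base_learners = [
    ("stump", DecisionTreeClassifier(max_depth=1)),
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("logit", LogisticRegression(max_iter=1000)),
]

# Step 2: combine their predictions (majority vote here)
ensemble = VotingClassifier(estimators=base_learners, voting="hard")
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```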
Ensemble Learning: Sequential vs Parallel
Sequential ensemble: base learners are generated sequentially (e.g. AdaBoost)
• The basic motivation of sequential methods is to exploit the dependence between the base learners
• Overall performance may be improved by giving previously mislabeled examples higher weight
Parallel ensemble: base learners are generated simultaneously, without knowledge of each other (e.g. Random Forest)
• The basic motivation of parallel methods is to exploit independence between the base learners, since the error can then be reduced by averaging
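To make the contrast concrete, a hedged sketch comparing one method of each kind in scikit-learn (dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Sequential: each tree is fit to data reweighted by the previous round
seq = AdaBoostClassifier(n_estimators=100, random_state=0)
# Parallel: trees are grown independently on bootstrap samples
par = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("AdaBoost", seq), ("RandomForest", par)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```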
Bias Variance Tradeoff
MSE, the expected squared error of the model, decomposes into squared bias plus variance (plus irreducible error)
• Overfitting = High Variance
• Underfitting = High Bias
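Written out as the standard identity (the slide lists the pieces but not the formula itself), for a model $\hat{f}$ estimating $f$ with noise variance $\sigma^2$:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\operatorname{Var}\!\left(\hat{f}(x)\right)}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```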
Bias Variance Tradeoff
e.g. a decision tree with too few nodes: the piecewise-constant approximation is too crude (“bias”)
If we grow the tree too large – continuing until each node has only one observation – we get “variance”: future data sets are very unlikely to have the same exact characteristics.
Ensembles for classification vs regression
Once we have an ensemble of models, what do we do with it?
Classification: Majority vote
Regression: Averaging of predictions
Intelligent selection of predictors is better than averaging every predictor we can cook up (see the sketch below):
• Base the selection on accuracy and diversity
• Remove the weakest learners and average the rest
• Not uniformly accurate: models will represent different aspects of the target function better
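A minimal sketch of the two combination rules in plain NumPy; the stacked prediction arrays are illustrative placeholders:

```python
import numpy as np

# Each row: one base learner's predictions for the same five cases
class_preds = np.array([[0, 1, 1, 0, 1],
                        [1, 1, 0, 0, 1],
                        [0, 1, 1, 1, 1]])
reg_preds = np.array([[1.0, 2.0], [1.4, 1.8], [0.9, 2.3]])

# Classification: majority vote across learners (axis 0)
majority = (class_preds.mean(axis=0) > 0.5).astype(int)

# Regression: simple averaging of predictions
average = reg_preds.mean(axis=0)
print(majority, average)
```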
Ensemble Learning – Bagging
Bootstrap Aggregation (Bagging): sample uniformly and with replacement from the training set
The size of each sample can be small or large
Build a predictor on each sample and average the results
Works well for unstable learning algorithms, e.g. decision trees and neural networks
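A hedged sketch of bagging by hand with scikit-learn trees (scikit-learn’s BaggingRegressor wraps the same idea); names and settings are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(50):
    # Sample uniformly, with replacement, from the training set
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Average the per-tree predictions to form the bagged estimate
bagged_pred = np.mean([t.predict(X) for t in trees], axis=0)
```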
Ensemble Learning – Random Forests
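Random forests combine bagging with a random subset of candidate features considered at each split; a minimal scikit-learn sketch (settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# max_features controls how many candidate features each split considers
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```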
Ensemble Learning – Extremely Randomized Trees
Randomness goes one step further: the splitting thresholds are randomized.
Instead of searching for the most discriminative threshold, thresholds are drawn at random for each candidate feature, and the best of these becomes the splitting rule.
This reduces the variance of the model, at the expense of a slight increase in bias.
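A brief sketch, assuming scikit-learn’s ExtraTreesClassifier as an implementation of this idea (dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Extra-Trees draws split thresholds at random instead of optimizing them
for model in (RandomForestClassifier(random_state=0),
              ExtraTreesClassifier(random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```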
Ensemble Learning – Boosting
• A family of learning algorithms that convert weak learners into strong ones
• Data points are initially equally weighted
• Apply a weak learner (e.g. a shallow decision tree) and calculate the error
• Increase the weight of misclassified points, decrease the weight of correctly classified points
• Iterate, and finally predict using a weighted average of the most recently constructed N models, weighted by overall accuracy (see the sketch below)
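A teaching sketch of the reweighting loop just described, in the style of AdaBoost for labels recoded to {-1, +1}; an illustration under those assumptions, not a production implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                      # recode labels to {-1, +1}
w = np.full(len(X), 1.0 / len(X))  # data points initially equally weighted

stumps, alphas = [], []
for _ in range(25):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                 # weighted error of this learner
    err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against zero/one error
    alpha = 0.5 * np.log((1 - err) / err)    # learner weight: higher if accurate
    w *= np.exp(-alpha * y * pred)           # up-weight misclassified points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: accuracy-weighted vote of the weak learners
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print((np.sign(F) == y).mean())
```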
[Figure: boosting illustration. By Sirakorn - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=85888769]
Ensemble of Neural Networks – “Committee of Networks”
• A collection of networks with the same configuration but different initial random weights is trained on the same dataset
• Each model is then used to make a prediction, and the actual prediction is calculated as the average of the individual predictions
• The number of models in the ensemble is often kept small, both because of the computational expense of training models and because of the diminishing returns in performance from adding more ensemble members
• Ensembles may be as small as three, five, or ten trained models
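A minimal sketch of such a committee, assuming scikit-learn’s MLPClassifier; the architecture and seeds are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Small committee: identical configuration, different random initial weights
committee = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                           random_state=seed).fit(X, y)
             for seed in range(5)]

# Average the predicted class probabilities across members
avg_proba = np.mean([m.predict_proba(X) for m in committee], axis=0)
committee_pred = avg_proba.argmax(axis=1)
```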
Bayesian Ensemble Learning
Bayesian Additive Regression Trees (BART)
BART uses a prior and likelihood to construct a sequence of trees via a Markov Chain Monte Carlo (MCMC) method
Prior distributions are defined for tree growth, the distribution in the terminal nodes, and the residual error
Four choices at each step: grow, prune, swap terminal nodes, or switch a parent/child node
Empirical studies show high predictive accuracy
Injecting randomness to improve estimates and avoid the “local minima” pitfall (see the sketch below)
• Initial weights for a neural network or for boosting can be set randomly… different classifiers can be produced with different seeds
• A decision tree can select its splitting node from a random sample of variables or cut points
• Apply random weights to the training set
• Random selection of train/test sets
• Randomly chosen subset of predictors in a regression model
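A small sketch of the first idea: the same learner with different seeds yields diverse classifiers whose predictions are imperfectly correlated (learner and settings are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# subsample < 1 injects randomness, so the seed changes the fitted model
preds = [GradientBoostingClassifier(subsample=0.5, random_state=s)
         .fit(X, y).predict_proba(X)[:, 1] for s in range(3)]
print(np.corrcoef(preds))  # off-diagonal entries < 1 indicate diversity
```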
Evaluation of Ensemble Model
• Harder to inspect these models and check them for reasonability
• Risk of overfitting the training data (so many parameters!)
• Train/test/validation samples and/or cross-validation become more important
Summary Statistics for Ensemble Models
• Importance scores: which variables are most relevant, and the relative influence or contribution of each variable
• Shapley values: a game-theory application that identifies each member’s contribution to a prediction
• Interaction statistics: which variables interact with which others, and the strength and degree of those interactions
• Partial dependence plots: the nature of the dependence of the response on influential inputs, e.g. whether the response increases monotonically with a predictor
Variable Importance
In a regression model, a coefficient’s magnitude relative to its standard error can indicate variable importance
Average improvement in purity, or reduction in error, from a decision tree split (e.g. Gini or entropy)
Randomly permute the values of a feature and see by how much the estimation error of the model increases (see the sketch below)
• We don’t want to delete the feature, since then we would need to reconstruct the model
Metrics can be valuable on either training or test data
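A hedged sketch of the permutation approach, assuming scikit-learn’s permutation_importance; the dataset and model are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
# Permute each feature and measure the score drop, without refitting
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
print(result.importances_mean)
```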
Shapley Statistic
From game theory: it estimates the reward earned by players in a cooperative game (meaning, proportional to their contribution)
The reward can be construed as the reduction in mean squared error relative to a naïve model
Calculates the marginal contribution of each predictor over all coalitions excluding that predictor, relative to the total number of predictors (see the sketch below)
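A sketch using the third-party shap package (an assumption: the slides name no library) to attribute a tree ensemble’s predictions:

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeExplainer computes exact Shapley values efficiently for tree models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one attribution per feature per row
print(shap_values.shape)
```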
Variable Interaction
Variable importance calculations will be misleading if there are significant interactions
Partial dependence plots marginalize a variable’s impact by looking at the change in prediction while holding all other variables constant at their averages
Friedman’s H-Statistic identifies whether higher-order interactions exist (do features j and k interact in any way?)
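A hedged sketch of the partial dependence plots described above, assuming scikit-learn ≥ 1.0; the feature indices, including the pairwise tuple for a two-way view, are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One-way dependence for features 0 and 1, plus their two-way interaction
PartialDependenceDisplay.from_estimator(model, X, [0, 1, (0, 1)])
```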
Actuarial Applications
• Pricing Estimation – ensemble learning can be used both to create new estimates and to indicate the optimal way to combine them
• Actuaries are already familiar with a simple ensemble of experience and manual rates
• From too few estimators to too many
• Partial and semi-partial correlation