Page 1: Introduction to Machine Learning applied to genomic selection (acteon.webs.upv.es/docs/Curso Gianola 2010/ML.pdf)

Introduction to Machine Learning applied to genomic selection

O. González-Recio

Dpto. Mejora Genética Animal, INIA, Madrid

O. González-Recio (INIA) Machine Learning UPV Valencia, 20-24 Sept. 2010 1 / 51

Page 2:

Outline

1 Concepts

2 Learning System Design (Description; Types of designs)

3 Ensemble methods (Overview; Bagging; Boosting; Random Forest; Examples)

4 Regularization (Bias-variance trade-off; Model complexity in ensembles)

5 Remarks

Page 4:

Concepts Concepts

MACHINE LEARNING

What is “Learning”?
Making useful changes in our minds. -Marvin Minsky-

Denotes changes in the system that enable the system to make the same task more effectively the next time. -Herbert Simon-

Machine Learning
Multidisciplinary field: bio-informatics, statistics, genomics, data mining, astronomy, www, ...

Avoids rigid parametric models that may be far away from our observations.

Page 6:

Concepts Concepts

MACHINE LEARNING

Machine Learning in genomic selection
Massive amount of information.

Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

ML is able to extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.

Supervised Learning: we have a target output (phenotypes).

Page 7:

Concepts Concepts

MACHINE LEARNING

Massive Genomic Information
What does information consume in an information-rich world? It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.

-Herbert Simon; Nobel Prize in Economics-

Overview
Develop algorithms to extract knowledge from some set of data in an effective and efficient fashion, to predict yet-to-be-observed data following certain rules.

Page 9:

Concepts Concepts

INTRO

What is “Learning”?
Given: a collection of examples (data) E (phenotypes and covariates).

Produce: an equation or description (T) that covers all or most examples, and predicts (P) the value, class or category of a yet-to-be-observed example.

The algorithm ‘learns’ relationships and associations between already observed examples to predict phenotypes when their covariates are observed.

Page 10:

Concepts Concepts

MOTIVATION

Definition
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Page 11:

Concepts Concepts

INTRO

Machine Learning is a piece in the process to acquire new knowledge.

Workflow in Data Mining tasks

From Inza et al. (2010)

Page 12:

Concepts Concepts

OUTLINE OF THE COURSE

In this course
Basic concepts in Machine Learning.

Design of a learning system.

Regularization and bias-variance trade-off.

Ensemble methods:

Boosting
Random Forest

Page 14:

Learning System Design Description

Why is it important?

Vital to implement effective learning.

What should be considered
Ask what we want to answer.

What scenario is expected.

Design the learning and validation sets accordingly.

Page 15:

Learning System Design Description

Learning system in genomic selection

Genome-wide association studies
Goal: find genetic variants associated with a given trait.

What is the phenotype distribution in our population?

Prediction of genetic merit in future generations is less important.

Diseases: case-control, case-case-control designs.

Genomic selection
Goal: predict the genomic merit of individuals w/o phenotype.

We expect DNA recombinations in subsequent generations.

Re-phenotyping every x generations.

Overlapping or discrete generations.

Select training and testing sets according to the characteristics of our population.

Page 17:

Learning System Design Types of designs

Learning design
Same learning and validation set

Page 18:

Learning System Design Types of designs

Learning design
k-fold cross-validation

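The k-fold design sketched on this slide can be written down in a few lines. This is only an illustrative sketch: `k_fold_splits` is a hypothetical helper name, and assigning records to folds by a shuffled round-robin is just one reasonable choice.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Shuffle the n record indices, cut them into k folds, and yield
    (training, validation) index lists with each fold used once for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]           # round-robin over the shuffle
    for i, val in enumerate(folds):
        train = [j for m in range(k) if m != i for j in folds[m]]
        yield train, val
```

Each record lands in the validation set exactly once across the k splits, so every phenotype is predicted by a model that never saw it during training.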

Page 19:

Learning System Design Types of designs

Learning design
Training and testing sets

Page 21:

Ensemble methods Ensemble methods

Introduction

Wide variety of competing methods
Bayes alphabet, Bayesian LASSO, Ridge regression, Logistic regression, Neural networks, ...

The comparative accuracy depends strongly on the trait, the problem addressed or the genetic architecture.

A priori, we don’t know which method is better for a new problem.

Page 22:

Ensemble methods Ensemble methods

Introduction

Ensembles
Ensembles are combinations of different methods (usually simple models).

They have very good predictive ability because they exploit the complementarity and additivity of model performances.

Ensembles have better predictive ability than the methods separately.

They have known statistical properties (no “black boxes”).

“In a multitude of counselors there is safety”

Page 23:

Ensemble methods Ensemble methods

Introduction

Ensembles
y = c0 + c1 f1(y, X) + c2 f2(y, X) + ... + cM fM(y, X) + e

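A minimal sketch of evaluating the composite predictor above, assuming the base-model outputs f_m and the coefficients c_m have already been estimated; the function name and data layout are illustrative, not part of the slide.

```python
def combine(base_preds, coeffs, c0=0.0):
    """Composite prediction for each record i:
    y_hat_i = c0 + sum_m c_m * f_m_i, matching the ensemble equation above.
    base_preds[m][i] is the prediction of base model m for record i."""
    n = len(base_preds[0])
    return [c0 + sum(c * preds[i] for c, preds in zip(coeffs, base_preds))
            for i in range(n)]
```

With equal coefficients this reduces to simple averaging; voting and estimated weights (next slide) only change how `coeffs` is chosen.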

Page 24:

Ensemble methods Ensemble methods

Building Ensembles: Two steps

1. Developing a population of varied models

Also called base learners.

May be “weak” models: slightly better than a random guess.

Same/different method.

Feature Subset Selection (FSS).

May capture non-linearities and interactions.

Partition of the input space.

2. Combining them to form a composite predictor

Voting.

Estimated weight.

Averaging.

Page 26:

Ensemble methods Ensemble methods

Examples

Most common ensembles
Model averaging (e.g. Bayesian model averaging).

Bagging.

Boosting.

Random Forest.

Can be worse
Most ensembling uses variations of one kind of model, but complex and heterogeneous ensembles may be imagined.

Boosting and Random Forest
High-dimensional heuristic search algorithms to detect signal covariates.

Do not model any particular gene action or genetic architecture.

Do not provide a simple estimate of effect size.

Page 28:

Ensemble methods Bagging

Bagging

Bootstrap aggregating

Bootstrap the data and average the results:

ŷ = (1/M) Σ_{m=1}^{M} f_m(Ψ_m), with Ψ_m being a bootstrapped sample of the N records of (y, X).

f_m(·) is the model of choice applied to the bootstrapped data.

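A minimal sketch of the bootstrap-aggregating formula above, with a deliberately trivial base learner (the sample mean of the phenotypes) standing in for the model of choice f_m; the function names are illustrative.

```python
import random

def fit_mean(sample):
    """Trivial base learner f_m: predict the mean phenotype of the sample."""
    ys = [y for y, _ in sample]
    return sum(ys) / len(ys)

def bagging_predict(records, M=50, seed=1):
    """Bootstrap aggregating: fit f_m on M bootstrapped samples Psi_m of the
    N records (y, x) and average: y_hat = (1/M) * sum_m f_m(Psi_m)."""
    rng = random.Random(seed)
    fits = []
    for _ in range(M):
        boot = [rng.choice(records) for _ in records]  # Psi_m: N draws with replacement
        fits.append(fit_mean(boot))
    return sum(fits) / M
```

Swapping `fit_mean` for a regression model (and averaging its predictions per record) gives bagging proper; the averaging step is unchanged.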

Page 29:

Ensemble methods Bagging

Bagging

Bootstrap aggregating

e ~ N(0, σ²_e) i.i.d.

Averaging residuals ē_i = (1/M) Σ_{m=1}^{M} (y_i − ŷ_im), we expect ē to approach zero by a factor of M.

Unfortunately, the residuals are not independent during the process, and a limit is usually reached.

Page 31:

Ensemble methods Boosting

Boosting

Properties
Based on AdaBoost (Freund and Schapire, 1996).

May be applied to both continuous and categorical traits.

Bühlmann and Yu (2003) proposed a version for high-dimensional problems.

Covariate selection.

Small-step gradient descent.

Page 32:

Ensemble methods Boosting

Boosting

In genomic selection
Apply each base learner on the residuals of the previous one.

Implement feature selection at each step.

Apply a small weight to each learner and train a new learner on the residuals.

It does not require specifying the inheritance model (additivity, epistasis, dominance, ...).

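The residual-fitting loop described on this slide can be sketched as a simple L2-boosting variant, using single-covariate least-squares fits as base learners. This is an illustrative reading of the slide, not the authors' implementation; the names and the step count are assumptions.

```python
def l2_boost(X, y, n_steps=100, shrink=0.1):
    """At each step: fit every covariate to the current residuals by least
    squares, keep the one with the smallest residual sum of squares (the
    feature-selection step), and add it with a small weight (shrink)."""
    n, p = len(y), len(X[0])
    pred = [0.0] * n
    coefs = [0.0] * p
    for _ in range(n_steps):
        res = [yi - pi for yi, pi in zip(y, pred)]   # residuals of the ensemble so far
        best = None
        for j in range(p):
            xj = [row[j] for row in X]
            sxx = sum(v * v for v in xj)
            b = sum(v * r for v, r in zip(xj, res)) / sxx if sxx else 0.0
            loss = sum((r - b * v) ** 2 for r, v in zip(res, xj))
            if best is None or loss < best[0]:
                best = (loss, j, b)
        _, j, b = best
        coefs[j] += shrink * b                        # small step on the selected covariate
        pred = [pi + shrink * b * row[j] for pi, row in zip(pred, X)]
    return coefs, pred
```

The small weight `shrink` is the regularization: each learner corrects only a fraction of the remaining residual, so the ensemble approaches the data gradually.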

Page 34:

Ensemble methods Random Forest

Random Forest

Properties
Based on classification and regression trees (CART).

Analyze discrete or continuous traits.

Implements feature selection.

Exploits randomization.

Massively non-parametric.

Page 35:

Ensemble methods Random Forest

Random Forest

Advantages in genomic selection
It does not require specifying the inheritance model (additivity, epistasis, dominance, ...).

It is able to capture complex interactions in the data.

Implements bagging (Breiman, 1996).

Reduces prediction error by a factor of the number of trees.

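A toy sketch of the two sources of randomization Random Forest exploits (bootstrapping the records and picking the splitting feature at random), using one-split regression trees as deliberately weak base learners. A real Random Forest grows full trees and samples a feature subset at every split; the names here are illustrative.

```python
import random

def fit_one_split_tree(sample, feat):
    """One-split regression tree: threshold the chosen feature at its sample
    median and predict the mean phenotype on each side of the split."""
    xs = sorted(x[feat] for x, _ in sample)
    thr = xs[len(xs) // 2]
    left = [y for x, y in sample if x[feat] <= thr]
    right = [y for x, y in sample if x[feat] > thr]
    lmean = sum(left) / len(left) if left else 0.0
    rmean = sum(right) / len(right) if right else lmean
    return lambda x: lmean if x[feat] <= thr else rmean

def random_forest(records, M=30, seed=2):
    """Bagging plus random feature choice per tree; predictions are averaged
    over the M trees, exactly as in plain bagging."""
    rng = random.Random(seed)
    p = len(records[0][0])
    trees = []
    for _ in range(M):
        boot = [rng.choice(records) for _ in records]   # bootstrap the records
        trees.append(fit_one_split_tree(boot, rng.randrange(p)))
    return lambda x: sum(t(x) for t in trees) / M
```

The random feature choice decorrelates the trees, which is what lets the averaging step keep reducing variance beyond what bagging alone achieves.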

Page 37:

Ensemble methods Examples

Examples
L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010)

Gonzalez-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa

Prediction accuracy for productive lifetime in a testing set in dairy cattle (3304 training / 1398 testing; 32,611 SNPs)

Method          Pearson correlation   MSE    bias
Boosting_OLS    0.65                  1.08   0.08
Bayes A         0.63                  2.81   1.26
Bayesian LASSO  0.66                  1.10   0.10

Page 38:

Ensemble methods Examples

Examples
L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010)

Gonzalez-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa

Prediction accuracy for progeny average feed conversion rate in a testing set in broilers (333 training / 61 testing; 3481 SNPs)

Method          Pearson correlation   MSE     bias
Boosting_NPR    0.37                  0.006   -0.018
Boosting_OLS    0.33                  0.006   -0.011
Bayes A         0.27                  0.007   -0.016
Bayesian LASSO  0.26                  0.007   -0.010

Page 39:

Ensemble methods Examples

Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review)

Gonzalez-Recio O. and S. Forni

Prediction accuracy (cor(y, ŷ)) for Scrotal Hernia incidence from three lines of PIC

Line                    TBA    BTL    RanFor   L2B    LhB
Line A (923 purebred)   0.13   0.22   0.26     0.17   0.09
Line B (919 purebred)   0.34   0.32   0.38     0.12   0.32
Line C (700 crossbred)  0.24   0.15   0.23     0.24   0.15

Page 40:

Ensemble methods Examples

Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review)

Gonzalez-Recio O. and S. Forni

Area under the ROC curve for Scrotal Hernia incidence from three lines of PIC

Line                    TBA    BTL    RanFor   L2B    LhB
Line A (923 purebred)   0.64   0.65   0.67     0.55   0.60
Line B (919 purebred)   0.70   0.69   0.73     0.60   0.72
Line C (700 crossbred)  0.62   0.62   0.67     0.67   0.66

Page 41:

Ensemble methods Examples

Examples
Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review)

Prediction accuracy for Scrotal Hernia incidence from a nucleus line of PIC

Page 43:

Regularization Bias-Variance trade off

Background

Regularization
Analysis of high-throughput genotyping data: the large p, small n problem.

Models without regularization or feature subset selection (FSS) are prone to overfitting and decreased predictive ability.

Including all covariates increases the complexity of the model.

Follow Occam’s razor: “entities must not be multiplied beyond necessity”, or “when the accuracy of two hypotheses is similar, prefer the simpler one”.

Generalization is hurt by complexity.

All new assumptions introduce possibilities for error; so keep it simple.

Page 44:

Regularization Bias-Variance trade off

Model complexity

Bias-variance trade-off
Low complexity: high bias, low variance.

Large complexity: low bias, high variance.

Optimum in between.

[Figure: Variance, Bias^2 and MSE as functions of model complexity]

Page 45:

Regularization Bias-Variance trade off

Model complexity

Bias-variance trade-off
Low complexity: high bias, low variance.

Large complexity: low bias, high variance.

Optimum in between.

Page 46:

Regularization Bias-Variance trade off

“Regularization” in shrinkage models

Penalization term or prior assumptions

Ridge Regression: penalize Σ_{s=1}^{p} β_s².

Bayes B (C, D, ...): set the SNP variance/coefficient to zero with probability π; the remaining SNP variances are assigned an inverted chi-squared prior distribution.

Bayes A: assume an inverted chi-squared prior distribution for the SNP variance.

LASSO: penalize λ Σ_{s=1}^{p} |β_s|.

Bayesian LASSO: double-exponential prior distribution (controlled by λ) on the SNP coefficients.

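For a single centered covariate, the ridge solution has a closed form that makes the shrinkage explicit. This one-dimensional sketch is illustrative only; the function name is an assumption.

```python
def ridge_1d(x, y, lam):
    """Closed-form ridge coefficient for one centered covariate, no intercept:
    beta_hat = sum(x_i * y_i) / (sum(x_i^2) + lambda).
    lam = 0 recovers ordinary least squares; lam > 0 shrinks beta toward zero."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)
```

The denominator is the only place λ enters: the larger the penalty, the closer the coefficient is pulled toward zero, which is the bias paid for lower variance.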

Page 48:

Regularization Model complexity in ensembles

Complexity of ensembles

Use simple models.

Use many models.

Interpretation of many models, even simple ones, may be much harder than with a single model.

Ensembles are competitive in accuracy, though at a probable loss of interpretability.

Too-complex ensembles may lead to overfitting.

Page 49:

Regularization Model complexity in ensembles

Complexity of ensembles

Are ensembles truly complex?
They appear so, but do they act so?

Controlling complexity in ensembles is not as simple as merely counting coefficients or assuming prior distributions.

Many ensembles do not show overfitting (Bagging, Random Forest).

Control the complexity of the ensemble using cross-validation (more complicated ways exist):

Tune the number of base learners constructed.
Use more or less complex “base learners”.

In general, ensembles are rather robust to overfitting.

Page 50:

Regularization Model complexity in ensembles

Complexity of ensembles

Mean Squared Error in the training set (2 different base learners).

Page 51:

Regularization Model complexity in ensembles

Complexity of ensembles

Mean Squared Error in the testing set (2 different base learners).

Page 53:

Remarks

Remarks

Machine Learning
New data/concepts are frequently generated in molecular biology/genomics, and ML can efficiently adapt to this fast-evolving nature.

ML is able to deal with missing and noisy data from many scenarios.

ML is able to deal with the huge volumes of data generated by novel high-throughput devices, extracting hidden relationships not noticeable to experts.

ML can adjust its internal structure to the data, producing accurate estimates.

ML uses algorithms that learn from the data (combinations of artificial intelligence and statistics).

It needs careful data preprocessing and design of the learning system.

Page 54:

Remarks

Remarks

Ensembles
Ensembles are combinations of several base learners, improving accuracy substantially.

Ensembles may seem complex, but they do not act so.

They perform extremely well in a variety of possibly complex domains.

They have desirable statistical properties.

They scale well computationally.

We will learn how to implement ensembles in a genomic selection context.

Page 55:

Remarks

To take home

The inherent complexity of genetic/biological systems has unknown properties/rules that may not be parametrized.

Learn from experiences, interpret from knowledge.

If worried about shrinkage, use boosting.

If you still believe in a state of nature, use Random Forest.
