Introduction to Machine Learning applied to genomic selection


O. González-Recio

Dpto. Mejora Genética Animal, INIA, Madrid

UPV Valencia, 20-24 Sept. 2010

Outline

1. Concepts

2. Learning System Design: Description; Types of designs

3. Ensemble methods: Overview; Bagging; Boosting; Random Forest; Examples

4. Regularization: Bias-variance trade-off; Model complexity in ensembles

5. Remarks


Concepts

MACHINE LEARNING

What is “learning”? “Making useful changes in our minds.” -Marvin Minsky-

“Learning denotes changes in the system that enable the system to carry out the same task more effectively the next time.” -Herbert Simon-

Machine Learning is a multidisciplinary field: bioinformatics, statistics, genomics, data mining, astronomy, the web, ...

It avoids rigid parametric models that may be far away from our observations.


Machine Learning in genomic selection: a massive amount of information.

Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

ML is able to extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.

Supervised Learning: we have a target output (phenotypes).


Massive genomic information: “What does information consume in an information-rich world? It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

-Herbert Simon, Nobel Prize in Economics-

Overview: develop algorithms that extract knowledge from a set of data in an effective and efficient fashion, in order to predict yet-to-be-observed data following certain rules.


INTRO

What is “learning”? Given: a collection of examples (data) E (phenotypes and covariates).

Produce: an equation or description (T) that covers all or most examples, and predicts (P) the value, class or category of a yet-to-be-observed example.

The algorithm ‘learns’ relationships and associations between already observed examples in order to predict phenotypes when their covariates are observed.


MOTIVATION

Definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. In genomic selection, for instance, T is predicting genetic merit, P the predictive accuracy in a testing set, and E the training records (genotypes and phenotypes).


Machine Learning is one piece in the process of acquiring new knowledge.

[Figure: workflow in data mining tasks, from Inza et al. (2010).]


OUTLINE OF THE COURSE

In this course: basic concepts in Machine Learning.

Design of a learning system.

Regularization and bias-variance trade-off.

Ensemble methods: Boosting; Random Forest.


Learning System Design: Description

Why is it important?

It is vital for implementing effective learning.

What should be considered: What question do we want to answer?

What scenario is expected?

Design the learning and validation sets accordingly.


Learning system in genomic selection

Genome-wide association studies. Goal: find genetic variants associated with a given trait.

What is the phenotype distribution in our population?

Prediction of genetic merit in future generations is less important.

Diseases: case-control and case-case-control designs.

Genomic selection. Goal: predict the genomic merit of individuals without phenotypes.

We expect DNA recombination in subsequent generations.

Re-phenotyping every x generations.

Overlapping or discrete generations.

Select training and testing sets according to the characteristics of our population.


Learning System Design: Types of designs

Learning design: same learning and validation set.


Learning design: k-fold cross-validation.


Learning design: training and testing sets.
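To make these three designs concrete, here is a minimal sketch in Python (an illustration only, not code from the course), assuming NumPy and scikit-learn are available; geno and pheno are hypothetical placeholders for a SNP matrix and a phenotype vector, simulated below.

```python
# Sketch of the three learning designs, assuming NumPy and scikit-learn are available.
# `geno` (n x p SNP matrix) and `pheno` (n phenotypes) are hypothetical simulated placeholders.
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(500, 1000)).astype(float)            # simulated SNP genotypes (0/1/2)
pheno = geno[:, :20] @ rng.normal(size=20) + rng.normal(size=500)    # simulated phenotype

model = Ridge(alpha=10.0)

# 1) Same learning and validation set (measures fit, not predictive ability)
model.fit(geno, pheno)
r_same = np.corrcoef(pheno, model.predict(geno))[0, 1]

# 2) k-fold cross-validation: every record is predicted once by a model that did not train on it
cv_pred = np.empty_like(pheno)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(geno):
    model.fit(geno[train_idx], pheno[train_idx])
    cv_pred[test_idx] = model.predict(geno[test_idx])
r_cv = np.corrcoef(pheno, cv_pred)[0, 1]

# 3) Separate training and testing sets (e.g., older vs. younger animals in genomic selection)
X_tr, X_te, y_tr, y_te = train_test_split(geno, pheno, test_size=0.3, random_state=0)
model.fit(X_tr, y_tr)
r_split = np.corrcoef(y_te, model.predict(X_te))[0, 1]

print(r_same, r_cv, r_split)
```

Design (1) typically gives an over-optimistic accuracy; designs (2) and (3) estimate predictive ability on records the model has not seen.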


Ensemble methods: Overview

Introduction

A wide variety of competing methods: the Bayesian alphabet, Bayesian LASSO, ridge regression, logistic regression, neural networks, ...

Their comparative accuracy depends strongly on the trait, the problem addressed and the genetic architecture.

A priori, we do not know which method will be better for a new problem.


Ensembles: ensembles are combinations of different methods (usually simple models).

They have very good predictive ability because they exploit the complementarity and additivity of the models' performances.

Ensembles have better predictive ability than the individual methods used separately.

They have known statistical properties (they are not “black boxes”).

“In a multitude of counselors there is safety.”


Ensembles: y = c_0 + c_1 f_1(y, X) + c_2 f_2(y, X) + ... + c_M f_M(y, X) + e
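As an illustration of this equation, the sketch below (a stacking-style example under assumptions, not necessarily the combination scheme used in the course) builds a few base learners f_m and estimates the weights c_m by regressing y on their predictions; it assumes NumPy and scikit-learn, and the data are simulated placeholders.

```python
# Stacking-style sketch of the ensemble equation: estimate the weights c_m by
# regressing y on the base learners' predictions. An illustration, not the
# course's own method; assumes NumPy and scikit-learn; data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
geno = rng.integers(0, 3, size=(300, 400)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=300)

# Step 1: a small population of varied base learners f_1 ... f_M
base_learners = [Ridge(alpha=10.0), Lasso(alpha=0.1, max_iter=10000), DecisionTreeRegressor(max_depth=4)]
F = np.column_stack([bl.fit(geno, pheno).predict(geno) for bl in base_learners])

# Step 2: combine them; regressing y on F gives c_0 (intercept) and c_1 ... c_M (weights)
combiner = LinearRegression().fit(F, pheno)
print("c_0:", combiner.intercept_, "c_1..c_M:", combiner.coef_)
```

In practice the weights should be estimated from out-of-sample (e.g. cross-validated) predictions of the base learners; fitting them in-sample, as in this toy sketch, tends to favour the most overfitted base learner.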


Building Ensembles: Two steps

1. Developing a population of varied models

Also called base learners.

May be “weak” models: only slightly better than a random guess.

Same/different method.

Feature subset selection (FSS).

May capture non-linearities and interactions.

Partition of the input space.

2. Combining them to form a composite predictor

Voting.

Estimated weight.

Averaging.


Examples

Most common ensembles: model averaging (e.g., Bayesian model averaging).

Bagging.

Boosting.

Random Forest.

It can be worse: most ensembling uses variations of one kind of model, but complex and heterogeneous ensembles may be imagined.

Boosting and Random Forest: high-dimensional heuristic search algorithms to detect signal covariates.

Do not model any particular gene action or genetic architecture.

Do not provide a simple estimate of effect size.


Ensemble methods: Bagging

Bagging

Bootstrap aggregating

Bootstrap the data and average the results: ŷ = (1/M) Σ_{m=1}^{M} f_m(Ψ_m), where Ψ_m is a bootstrapped sample of the N records of (y, X) and f_m(·) is the model of choice applied to the bootstrapped data.
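A minimal bagging sketch, assuming NumPy and scikit-learn and simulated placeholder data (geno, pheno); the base learner here is a shallow regression tree, which is one common but not the only possible choice of f_m.

```python
# Minimal bagging sketch (an illustration, not the course's own code); assumes
# NumPy and scikit-learn; geno/pheno are hypothetical simulated placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(300, 500)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=300)

M = 50                                # number of bootstrapped base learners
preds = np.zeros(len(pheno))
for m in range(M):
    idx = rng.integers(0, len(pheno), size=len(pheno))   # bootstrap sample Psi_m (with replacement)
    f_m = DecisionTreeRegressor(max_depth=3).fit(geno[idx], pheno[idx])
    preds += f_m.predict(geno)                           # accumulate f_m predictions
preds /= M                                               # bagged prediction: average over the M models
```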


Assume e ~ N(0, σ²_e), i.i.d. Averaging the residuals, e_i = (1/M) Σ_{m=1}^{M} (y_i − ŷ_im), we expect e to approach zero by a factor of M. Unfortunately, the residuals are not independent during the process, and a limit is usually reached.


Ensemble methods: Boosting

Boosting

Properties: based on AdaBoost (Freund and Schapire, 1996).

May be applied to both continuous and categorical traits.

Bühlmann and Yu (2003) proposed a version for high-dimensional problems.

Covariate selection; small-step gradient descent.


In genomic selection: apply each base learner to the residuals of the previous one.

Implement feature selection at each step.

Apply a small weight to each learner and train a new learner on the residuals (sketched below).

It does not require specifying an inheritance model (additivity, epistasis, dominance, ...).
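A minimal sketch of this kind of L2-boosting with componentwise (single-SNP) feature selection, assuming NumPy and simulated placeholder data; it illustrates the idea of residual fitting with a small step, not the exact algorithm of the references above.

```python
# Minimal L2-boosting sketch with per-iteration feature (SNP) selection; an
# illustration under assumptions, not the exact algorithm used in the course.
import numpy as np

rng = np.random.default_rng(2)
geno = rng.integers(0, 3, size=(300, 500)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=300)

nu, n_iter = 0.1, 200              # small step (shrinkage weight) and number of boosting iterations
pred = np.full(len(pheno), pheno.mean())   # start from the mean
resid = pheno - pred                       # boost on the residuals
Xc = geno - geno.mean(axis=0)              # centred SNP covariates

for _ in range(n_iter):
    # feature selection: single-SNP OLS slopes, pick the SNP that best reduces the residuals
    beta = Xc.T @ resid / ((Xc ** 2).sum(axis=0) + 1e-12)
    sse = ((resid[:, None] - Xc * beta) ** 2).sum(axis=0)
    j = int(np.argmin(sse))
    # apply a small weight (nu) to the selected base learner and update the residuals
    pred += nu * Xc[:, j] * beta[j]
    resid = pheno - pred
```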


Ensemble methods: Random Forest

Random Forest

Properties: based on classification and regression trees (CART).

Analyze discrete or continuous traits.

Implements feature selection.

Exploits randomization.

Massively non-parametric.


Advantages in genomic selection: it does not require specifying an inheritance model (additivity, epistasis, dominance, ...).

It is able to capture complex interactions in the data.

It implements bagging (Breiman, 1996).

It reduces prediction error by a factor related to the number of trees.
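A minimal Random Forest sketch for a genomic-selection-style prediction task, assuming scikit-learn is available; the data, the hyperparameter values and the predictive-correlation accuracy measure are illustrative assumptions.

```python
# Minimal Random Forest sketch on simulated SNP data; assumes NumPy and scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
geno = rng.integers(0, 3, size=(400, 800)).astype(float)
pheno = geno[:, :15] @ rng.normal(size=15) + rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(geno, pheno, test_size=0.25, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,        # many bootstrapped trees (bagging)
    max_features="sqrt",     # random subset of SNPs tried at each split (randomization)
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=0,
).fit(X_tr, y_tr)

accuracy = np.corrcoef(y_te, rf.predict(X_te))[0, 1]   # predictive correlation in the testing set
print(accuracy)
```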


Ensemble methods: Examples

Examples. L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010).

Gonzalez-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa

Prediction accuracy for productive lifetime in a testing set in dairy cattle (3304 training / 1398 testing; 32,611 SNPs):

Method            Pearson correlation   MSE    Bias
Boosting_OLS      0.65                  1.08   0.08
Bayes A           0.63                  2.81   1.26
Bayesian LASSO    0.66                  1.10   0.10


Prediction accuracy for progeny average feed conversion rate in a testing set in broilers (333 training / 61 testing; 3481 SNPs):

Method            Pearson correlation   MSE     Bias
Boosting_NPR      0.37                  0.006   -0.018
Boosting_OLS      0.33                  0.006   -0.011
Bayes A           0.27                  0.007   -0.016
Bayesian LASSO    0.26                  0.007   -0.010


Examples. Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review).

Gonzalez-Recio O. and S. Forni

Prediction accuracy (cor(y, ŷ)) for scrotal hernia incidence in three lines of PIC:

Line                     TBA    BTL    RanFor   L2B    LhB
Line A (923 purebred)    0.13   0.22   0.26     0.17   0.09
Line B (919 purebred)    0.34   0.32   0.38     0.12   0.32
Line C (700 crossbred)   0.24   0.15   0.23     0.24   0.15


Area under the ROC curve for scrotal hernia incidence in three lines of PIC:

Line                     TBA    BTL    RanFor   L2B    LhB
Line A (923 purebred)    0.64   0.65   0.67     0.55   0.60
Line B (919 purebred)    0.70   0.69   0.73     0.60   0.72
Line C (700 crossbred)   0.62   0.62   0.67     0.67   0.66


[Figure: prediction accuracy for scrotal hernia incidence in a nucleus line of PIC.]


Regularization: Bias-variance trade-off

Background

Regularization: the analysis of high-throughput genotyping data is a large p, small n problem.

Models without regularization or feature subset selection (FSS) are prone to overfitting, which decreases predictive ability.

Including all covariates increases the complexity of the model.

Follow Occam's razor: “entities must not be multiplied beyond necessity”, or “when the accuracy of two hypotheses is similar, prefer the simpler one”.

Generalization is hurt by complexity.

Every new assumption introduces a possibility for error, so keep it simple.


Model complexity

Bias-variance trade-off: low complexity gives high bias and low variance.

High complexity gives low bias and high variance.

The optimum lies in between.

[Figure: bias-variance trade-off; variance, squared bias and MSE as a function of model complexity.]
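The trade-off can be illustrated with a small Monte Carlo sketch (an assumption-laden toy example, not the slide's figure): polynomial degree plays the role of model complexity, and bias² and variance are estimated over replicate training sets.

```python
# Small Monte Carlo sketch of the bias-variance trade-off; polynomial degree acts
# as "model complexity". An illustration under assumptions, not the slide's figure.
import numpy as np

rng = np.random.default_rng(4)
x_grid = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_grid)          # true signal evaluated on a fixed grid

def fit_predict(degree):
    """Fit a polynomial of the given degree to one noisy training set, predict on the grid."""
    x = rng.uniform(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)
    coef = np.polyfit(x, y, degree)
    return np.polyval(coef, x_grid)

for degree in (1, 3, 9):                      # increasing model complexity
    preds = np.array([fit_predict(degree) for _ in range(200)])   # 200 replicate training sets
    bias2 = ((preds.mean(axis=0) - f_true) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree}: bias^2={bias2:.3f}  variance={variance:.3f}  sum={bias2 + variance:.3f}")
```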



“Regularization” in shrinkage models

Penalization term or prior assumptions

Ridge regression: penalize Σ_{s=1}^{p} β_s².

Bayes B (C, D, ...): set the SNP variance/coefficient to zero with probability π; the remaining SNP variances are assigned an inverted chi-squared prior distribution.

Bayes A: assume an inverted chi-squared prior distribution for each SNP variance.

LASSO: penalize λ Σ_{s=1}^{p} |β_s|.

Bayesian LASSO: a double exponential prior distribution (controlled by λ) on the SNP coefficients.
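A small sketch contrasting the L2 (ridge) and L1 (LASSO) penalties on simulated large p, small n SNP data, assuming scikit-learn; the penalty weights chosen here are arbitrary illustrative values, not recommendations.

```python
# Sketch contrasting L2 (ridge) and L1 (LASSO) penalization on a simulated SNP data set;
# assumes NumPy and scikit-learn; penalty weights (alpha) are arbitrary choices.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
geno = rng.integers(0, 3, size=(300, 1000)).astype(float)     # large p, small n
beta_true = np.zeros(1000)
beta_true[:20] = rng.normal(size=20)                          # only 20 SNPs have an effect
pheno = geno @ beta_true + rng.normal(size=300)

ridge = Ridge(alpha=50.0).fit(geno, pheno)                    # penalizes the sum of squared coefficients
lasso = Lasso(alpha=0.1, max_iter=10000).fit(geno, pheno)     # penalizes the sum of absolute coefficients

# Ridge shrinks all coefficients but keeps them non-zero; LASSO sets many to exactly zero.
print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))
print("non-zero LASSO coefficients:", np.sum(lasso.coef_ != 0))
```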


Regularization: Model complexity in ensembles

Complexity of ensembles

Use simple models.

Use many models.

Interpreting many models, even simple ones, may be much harder than interpreting a single model.

Ensembles are competitive in accuracy, though at a probable loss of interpretability.

Overly complex ensembles may lead to overfitting.


Are ensembles truly complex? They appear so, but do they act so?

Controlling complexity in ensembles is not as simple as merely counting coefficients or assuming prior distributions.

Many ensembles do not show overfitting (bagging, Random Forest).

Control the complexity of the ensemble using cross-validation (more sophisticated ways exist; see the sketch below).

Tune the number of models in the ensemble. Use more or less complex “base learners”.

In general, ensembles are rather robust to overfitting.
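One simple way to control ensemble complexity with held-out data is sketched below: the number of boosting iterations is chosen where the testing MSE is smallest. It assumes scikit-learn's gradient boosting as a stand-in for the boosting algorithms discussed here, with simulated placeholder data.

```python
# Sketch of controlling ensemble complexity by choosing the number of boosting
# iterations on a held-out set; assumes NumPy and scikit-learn; data are simulated.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
geno = rng.integers(0, 3, size=(400, 600)).astype(float)
pheno = geno[:, :15] @ rng.normal(size=15) + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(geno, pheno, test_size=0.3, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=2, random_state=0
).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., 500 base learners,
# so the testing MSE can be tracked as the ensemble grows.
test_mse = [np.mean((y_te - pred) ** 2) for pred in gbm.staged_predict(X_te)]
best_m = int(np.argmin(test_mse)) + 1
print("best number of base learners:", best_m)
```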


[Figure: mean squared error in the training set (two different base learners).]


[Figure: mean squared error in the testing set (two different base learners).]


Remarks


Machine Learning: new data and concepts are frequently generated in molecular biology and genomics, and ML can efficiently adapt to this fast-evolving nature.

ML is able to deal with missing and noisy data from many scenarios.

ML is able to deal with the huge volumes of data generated by novel high-throughput devices, extracting hidden relationships not noticeable to experts.

ML can adjust its internal structure to the data, producing accurate estimates.

ML uses algorithms that learn from the data (combinations of artificial intelligence and statistics).

Careful data preprocessing and design of the learning system are needed.


Ensembles: ensembles are combinations of several base learners, improving accuracy substantially.

Ensembles may seem complex, but they do not act so.

They perform extremely well in a variety of possibly complex domains.

They have desirable statistical properties.

They scale well computationally.

We will learn how to implement ensembles in a genomic selection context.


To take home

The inherent complexity of genetic and biological systems involves unknown properties and rules that may not be parameterizable.

Learn from experiences, interpret from knowledge.

If worried about shrinkage, use boosting.

If you still believe in the state of nature, use Random Forest.
