Introduction to Machine Learning applied to genomic selection


O. González-Recio

Dpto. Mejora Genética Animal, INIA, Madrid

UPV Valencia, 20-24 Sept. 2010

Outline

1. Concepts

2. Learning System Design: Description; Types of designs

3. Ensemble methods: Overview; Bagging; Boosting; Random Forest; Examples

4. Regularization: Bias-variance trade-off; Model complexity in ensembles

5. Remarks


Concepts

MACHINE LEARNING

What is “learning”? “Making useful changes in our minds.” -Marvin Minsky-

“Learning denotes changes in the system that enable the system to carry out the same task more effectively the next time.” -Herbert Simon-

Machine Learning is a multidisciplinary field: bioinformatics, statistics, genomics, data mining, astronomy, the web, ...

It avoids rigid parametric models that may be far away from our observations.


Machine Learning in genomic selection: a massive amount of information.

Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

ML is able to extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.

Supervised Learning: we have a target output (phenotypes).


Massive genomic information: “What does information consume in an information-rich world? It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

-Herbert Simon, Nobel Prize in Economics-

Overview: develop algorithms that extract knowledge from a set of data in an effective and efficient fashion, in order to predict yet-to-be-observed data following certain rules.


INTRO

What is “learning”? Given: a collection of examples (data) E (phenotypes and covariates).

Produce: an equation or description (T) that covers all or most examples, and predicts (P) the value, class or category of a yet-to-be-observed example.

The algorithm ‘learns’ relationships and associations between already observed examples in order to predict phenotypes when their covariates are observed.


MOTIVATION

Definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. In genomic selection, for instance, T is predicting genetic merit, P the predictive accuracy in a testing set, and E the training records (genotypes and phenotypes).


Machine Learning is one piece in the process of acquiring new knowledge.

[Figure: workflow in data mining tasks, from Inza et al. (2010).]


OUTLINE OF THE COURSE

In this course: basic concepts in Machine Learning.

Design of a learning system.

Regularization and bias-variance trade-off.

Ensemble methods: Boosting; Random Forest.


Learning System Design: Description

Why is it important?

It is vital for implementing effective learning.

What should be considered: What question do we want to answer?

What scenario is expected?

Design the learning and validation sets accordingly.


Learning system in genomic selection

Genome-wide association studies. Goal: find genetic variants associated with a given trait.

What is the phenotype distribution in our population?

Prediction of genetic merit in future generations is less important.

Diseases: case-control and case-case-control designs.

Genomic selection. Goal: predict the genomic merit of individuals without phenotypes.

We expect DNA recombination in subsequent generations.

Re-phenotyping every x generations.

Overlapping or discrete generations.

Select training and testing sets according to the characteristics of our population.


Learning System Design: Types of designs

Learning design: same learning and validation set.


Learning design: k-fold cross-validation.


Learning design: training and testing sets.
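To make these three designs concrete, here is a minimal sketch in Python (an illustration only, not code from the course), assuming NumPy and scikit-learn are available; geno and pheno are hypothetical placeholders for a SNP matrix and a phenotype vector, simulated below.

```python
# Sketch of the three learning designs, assuming NumPy and scikit-learn are available.
# `geno` (n x p SNP matrix) and `pheno` (n phenotypes) are hypothetical simulated placeholders.
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(500, 1000)).astype(float)            # simulated SNP genotypes (0/1/2)
pheno = geno[:, :20] @ rng.normal(size=20) + rng.normal(size=500)    # simulated phenotype

model = Ridge(alpha=10.0)

# 1) Same learning and validation set (measures fit, not predictive ability)
model.fit(geno, pheno)
r_same = np.corrcoef(pheno, model.predict(geno))[0, 1]

# 2) k-fold cross-validation: every record is predicted once by a model that did not train on it
cv_pred = np.empty_like(pheno)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(geno):
    model.fit(geno[train_idx], pheno[train_idx])
    cv_pred[test_idx] = model.predict(geno[test_idx])
r_cv = np.corrcoef(pheno, cv_pred)[0, 1]

# 3) Separate training and testing sets (e.g., older vs. younger animals in genomic selection)
X_tr, X_te, y_tr, y_te = train_test_split(geno, pheno, test_size=0.3, random_state=0)
model.fit(X_tr, y_tr)
r_split = np.corrcoef(y_te, model.predict(X_te))[0, 1]

print(r_same, r_cv, r_split)
```

Design (1) typically gives an over-optimistic accuracy; designs (2) and (3) estimate predictive ability on records the model has not seen.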


Ensemble methods: Overview

Introduction

A wide variety of competing methods: the Bayesian alphabet, Bayesian LASSO, ridge regression, logistic regression, neural networks, ...

Their comparative accuracy depends strongly on the trait, the problem addressed and the genetic architecture.

A priori, we do not know which method will be better for a new problem.


Ensembles: ensembles are combinations of different methods (usually simple models).

They have very good predictive ability because they exploit the complementarity and additivity of the models' performances.

Ensembles have better predictive ability than the individual methods used separately.

They have known statistical properties (they are not “black boxes”).

“In a multitude of counselors there is safety.”


Ensembles: y = c_0 + c_1 f_1(y, X) + c_2 f_2(y, X) + ... + c_M f_M(y, X) + e
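As an illustration of this equation, the sketch below (a stacking-style example under assumptions, not necessarily the combination scheme used in the course) builds a few base learners f_m and estimates the weights c_m by regressing y on their predictions; it assumes NumPy and scikit-learn, and the data are simulated placeholders.

```python
# Stacking-style sketch of the ensemble equation: estimate the weights c_m by
# regressing y on the base learners' predictions. An illustration, not the
# course's own method; assumes NumPy and scikit-learn; data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
geno = rng.integers(0, 3, size=(300, 400)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=300)

# Step 1: a small population of varied base learners f_1 ... f_M
base_learners = [Ridge(alpha=10.0), Lasso(alpha=0.1, max_iter=10000), DecisionTreeRegressor(max_depth=4)]
F = np.column_stack([bl.fit(geno, pheno).predict(geno) for bl in base_learners])

# Step 2: combine them; regressing y on F gives c_0 (intercept) and c_1 ... c_M (weights)
combiner = LinearRegression().fit(F, pheno)
print("c_0:", combiner.intercept_, "c_1..c_M:", combiner.coef_)
```

In practice the weights should be estimated from out-of-sample (e.g. cross-validated) predictions of the base learners; fitting them in-sample, as in this toy sketch, tends to favour the most overfitted base learner.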


Building Ensembles: Two steps

1. Developing a population of varied models

Also called base learners.

May be “weak” models: only slightly better than a random guess.

Same/different method.

Feature subset selection (FSS).

May capture non-linearities and interactions.

Partition of the input space.

2. Combining them to form a composite predictor

Voting.

Estimated weight.

Averaging.


Examples

Most common ensembles: model averaging (e.g., Bayesian model averaging).

Bagging.

Boosting.

Random Forest.

It can be worse: most ensembling uses variations of one kind of model, but complex and heterogeneous ensembles may be imagined.

Boosting and Random Forest: high-dimensional heuristic search algorithms to detect signal covariates.

Do not model any particular gene action or genetic architecture.

Do not provide a simple estimate of effect size.


Ensemble methods: Bagging

Bagging

Bootstrap aggregating

Bootstrap the data and average the results: ŷ = (1/M) Σ_{m=1}^{M} f_m(Ψ_m), where Ψ_m is a bootstrapped sample of the N records of (y, X) and f_m(·) is the model of choice applied to the bootstrapped data.
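A minimal bagging sketch, assuming NumPy and scikit-learn and simulated placeholder data (geno, pheno); the base learner here is a shallow regression tree, which is one common but not the only possible choice of f_m.

```python
# Minimal bagging sketch (an illustration, not the course's own code); assumes
# NumPy and scikit-learn; geno/pheno are hypothetical simulated placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(300, 500)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=300)

M = 50                                # number of bootstrapped base learners
preds = np.zeros(len(pheno))
for m in range(M):
    idx = rng.integers(0, len(pheno), size=len(pheno))   # bootstrap sample Psi_m (with replacement)
    f_m = DecisionTreeRegressor(max_depth=3).fit(geno[idx], pheno[idx])
    preds += f_m.predict(geno)                           # accumulate f_m predictions
preds /= M                                               # bagged prediction: average over the M models
```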


Assume e ~ N(0, σ²_e), i.i.d. Averaging the residuals, e_i = (1/M) Σ_{m=1}^{M} (y_i − ŷ_im), we expect e to approach zero by a factor of M. Unfortunately, the residuals are not independent during the process, and a limit is usually reached.


Ensemble methods: Boosting

Boosting

Properties: based on AdaBoost (Freund and Schapire, 1996).

May be applied to both continuous and categorical traits.

Bühlmann and Yu (2003) proposed a version for high-dimensional problems.

Covariate selection; small-step gradient descent.


In genomic selection: apply each base learner to the residuals of the previous one.

Implement feature selection at each step.

Apply a small weight to each learner and train a new learner on the residuals (sketched below).

It does not require specifying an inheritance model (additivity, epistasis, dominance, ...).
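A minimal sketch of this kind of L2-boosting with componentwise (single-SNP) feature selection, assuming NumPy and simulated placeholder data; it illustrates the idea of residual fitting with a small step, not the exact algorithm of the references above.

```python
# Minimal L2-boosting sketch with per-iteration feature (SNP) selection; an
# illustration under assumptions, not the exact algorithm used in the course.
import numpy as np

rng = np.random.default_rng(2)
geno = rng.integers(0, 3, size=(300, 500)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=300)

nu, n_iter = 0.1, 200              # small step (shrinkage weight) and number of boosting iterations
pred = np.full(len(pheno), pheno.mean())   # start from the mean
resid = pheno - pred                       # boost on the residuals
Xc = geno - geno.mean(axis=0)              # centred SNP covariates

for _ in range(n_iter):
    # feature selection: single-SNP OLS slopes, pick the SNP that best reduces the residuals
    beta = Xc.T @ resid / ((Xc ** 2).sum(axis=0) + 1e-12)
    sse = ((resid[:, None] - Xc * beta) ** 2).sum(axis=0)
    j = int(np.argmin(sse))
    # apply a small weight (nu) to the selected base learner and update the residuals
    pred += nu * Xc[:, j] * beta[j]
    resid = pheno - pred
```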


Ensemble methods: Random Forest

Random Forest

Properties: based on classification and regression trees (CART).

Analyze discrete or continuous traits.

Implements feature selection.

Exploits randomization.

Massively non-parametric.


Advantages in genomic selection: it does not require specifying an inheritance model (additivity, epistasis, dominance, ...).

It is able to capture complex interactions in the data.

It implements bagging (Breiman, 1996).

It reduces prediction error by a factor related to the number of trees.
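A minimal Random Forest sketch for a genomic-selection-style prediction task, assuming scikit-learn is available; the data, the hyperparameter values and the predictive-correlation accuracy measure are illustrative assumptions.

```python
# Minimal Random Forest sketch on simulated SNP data; assumes NumPy and scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
geno = rng.integers(0, 3, size=(400, 800)).astype(float)
pheno = geno[:, :15] @ rng.normal(size=15) + rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(geno, pheno, test_size=0.25, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,        # many bootstrapped trees (bagging)
    max_features="sqrt",     # random subset of SNPs tried at each split (randomization)
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=0,
).fit(X_tr, y_tr)

accuracy = np.corrcoef(y_te, rf.predict(X_te))[0, 1]   # predictive correlation in the testing set
print(accuracy)
```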


Ensemble methods: Examples

Examples. L2-Boosting algorithm applied to high-dimensional problems in genomic selection (Genetics Research, 2010).

Gonzalez-Recio O., K.A. Weigel, D. Gianola, H. Naya and G.J.M. Rosa

Prediction accuracy for productive lifetime in a testing set in dairy cattle (3304 training / 1398 testing; 32,611 SNPs):

Method            Pearson correlation   MSE    Bias
Boosting_OLS      0.65                  1.08   0.08
Bayes A           0.63                  2.81   1.26
Bayesian LASSO    0.66                  1.10   0.10


Prediction accuracy for progeny average feed conversion rate in a testing set in broilers (333 training / 61 testing; 3481 SNPs):

Method            Pearson correlation   MSE     Bias
Boosting_NPR      0.37                  0.006   -0.018
Boosting_OLS      0.33                  0.006   -0.011
Bayes A           0.27                  0.007   -0.016
Bayesian LASSO    0.26                  0.007   -0.010


Examples. Analysis of discrete traits in a genomic selection context using Bayesian regressions and Machine Learning (in review).

Gonzalez-Recio O. and S. Forni

Prediction accuracy (cor(y, ŷ)) for scrotal hernia incidence in three lines of PIC:

Line                     TBA    BTL    RanFor   L2B    LhB
Line A (923 purebred)    0.13   0.22   0.26     0.17   0.09
Line B (919 purebred)    0.34   0.32   0.38     0.12   0.32
Line C (700 crossbred)   0.24   0.15   0.23     0.24   0.15


Area under the ROC curve for scrotal hernia incidence in three lines of PIC:

Line                     TBA    BTL    RanFor   L2B    LhB
Line A (923 purebred)    0.64   0.65   0.67     0.55   0.60
Line B (919 purebred)    0.70   0.69   0.73     0.60   0.72
Line C (700 crossbred)   0.62   0.62   0.67     0.67   0.66


[Figure: prediction accuracy for scrotal hernia incidence in a nucleus line of PIC.]


Regularization: Bias-variance trade-off

Background

Regularization: the analysis of high-throughput genotyping data is a large p, small n problem.

Models without regularization or feature subset selection (FSS) are prone to overfitting, which decreases predictive ability.

Including all covariates increases the complexity of the model.

Follow Occam's razor: “entities must not be multiplied beyond necessity”, or “when the accuracy of two hypotheses is similar, prefer the simpler one”.

Generalization is hurt by complexity.

Every new assumption introduces a possibility for error, so keep it simple.


Model complexity

Bias-variance trade-off: low complexity gives high bias and low variance.

High complexity gives low bias and high variance.

The optimum lies in between.

[Figure: bias-variance trade-off; variance, squared bias and MSE as a function of model complexity.]
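The trade-off can be illustrated with a small Monte Carlo sketch (an assumption-laden toy example, not the slide's figure): polynomial degree plays the role of model complexity, and bias² and variance are estimated over replicate training sets.

```python
# Small Monte Carlo sketch of the bias-variance trade-off; polynomial degree acts
# as "model complexity". An illustration under assumptions, not the slide's figure.
import numpy as np

rng = np.random.default_rng(4)
x_grid = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_grid)          # true signal evaluated on a fixed grid

def fit_predict(degree):
    """Fit a polynomial of the given degree to one noisy training set, predict on the grid."""
    x = rng.uniform(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)
    coef = np.polyfit(x, y, degree)
    return np.polyval(coef, x_grid)

for degree in (1, 3, 9):                      # increasing model complexity
    preds = np.array([fit_predict(degree) for _ in range(200)])   # 200 replicate training sets
    bias2 = ((preds.mean(axis=0) - f_true) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree}: bias^2={bias2:.3f}  variance={variance:.3f}  sum={bias2 + variance:.3f}")
```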



“Regularization” in shrinkage models

Penalization term or prior assumptions

Ridge regression: penalize Σ_{s=1}^{p} β_s².

Bayes B (C, D, ...): set the SNP variance/coefficient to zero with probability π; the remaining SNP variances are assigned an inverted chi-squared prior distribution.

Bayes A: assume an inverted chi-squared prior distribution for each SNP variance.

LASSO: penalize λ Σ_{s=1}^{p} |β_s|.

Bayesian LASSO: a double exponential prior distribution (controlled by λ) on the SNP coefficients.
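A small sketch contrasting the L2 (ridge) and L1 (LASSO) penalties on simulated large p, small n SNP data, assuming scikit-learn; the penalty weights chosen here are arbitrary illustrative values, not recommendations.

```python
# Sketch contrasting L2 (ridge) and L1 (LASSO) penalization on a simulated SNP data set;
# assumes NumPy and scikit-learn; penalty weights (alpha) are arbitrary choices.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
geno = rng.integers(0, 3, size=(300, 1000)).astype(float)     # large p, small n
beta_true = np.zeros(1000)
beta_true[:20] = rng.normal(size=20)                          # only 20 SNPs have an effect
pheno = geno @ beta_true + rng.normal(size=300)

ridge = Ridge(alpha=50.0).fit(geno, pheno)                    # penalizes the sum of squared coefficients
lasso = Lasso(alpha=0.1, max_iter=10000).fit(geno, pheno)     # penalizes the sum of absolute coefficients

# Ridge shrinks all coefficients but keeps them non-zero; LASSO sets many to exactly zero.
print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))
print("non-zero LASSO coefficients:", np.sum(lasso.coef_ != 0))
```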


Regularization: Model complexity in ensembles

Complexity of ensembles

Use simple models.

Use many models.

Interpreting many models, even simple ones, may be much harder than interpreting a single model.

Ensembles are competitive in accuracy, though at a probable loss of interpretability.

Overly complex ensembles may lead to overfitting.


Are ensembles truly complex? They appear so, but do they act so?

Controlling complexity in ensembles is not as simple as merely counting coefficients or assuming prior distributions.

Many ensembles do not show overfitting (bagging, Random Forest).

Control the complexity of the ensemble using cross-validation (more sophisticated ways exist; see the sketch below).

Tune the number of models in the ensemble. Use more or less complex “base learners”.

In general, ensembles are rather robust to overfitting.
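One simple way to control ensemble complexity with held-out data is sketched below: the number of boosting iterations is chosen where the testing MSE is smallest. It assumes scikit-learn's gradient boosting as a stand-in for the boosting algorithms discussed here, with simulated placeholder data.

```python
# Sketch of controlling ensemble complexity by choosing the number of boosting
# iterations on a held-out set; assumes NumPy and scikit-learn; data are simulated.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
geno = rng.integers(0, 3, size=(400, 600)).astype(float)
pheno = geno[:, :15] @ rng.normal(size=15) + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(geno, pheno, test_size=0.3, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=2, random_state=0
).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., 500 base learners,
# so the testing MSE can be tracked as the ensemble grows.
test_mse = [np.mean((y_te - pred) ** 2) for pred in gbm.staged_predict(X_te)]
best_m = int(np.argmin(test_mse)) + 1
print("best number of base learners:", best_m)
```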


[Figure: mean squared error in the training set (two different base learners).]


[Figure: mean squared error in the testing set (two different base learners).]


Remarks


Machine Learning: new data and concepts are frequently generated in molecular biology and genomics, and ML can efficiently adapt to this fast-evolving nature.

ML is able to deal with missing and noisy data from many scenarios.

ML is able to deal with the huge volumes of data generated by novel high-throughput devices, extracting hidden relationships not noticeable to experts.

ML can adjust its internal structure to the data, producing accurate estimates.

ML uses algorithms that learn from the data (combinations of artificial intelligence and statistics).

Careful data preprocessing and design of the learning system are needed.


Ensembles: ensembles are combinations of several base learners, improving accuracy substantially.

Ensembles may seem complex, but they do not act so.

They perform extremely well in a variety of possibly complex domains.

They have desirable statistical properties.

They scale well computationally.

We will learn how to implement ensembles in a genomic selection context.


To take home

The inherent complexity of genetic and biological systems involves unknown properties and rules that may not be parameterizable.

Learn from experiences, interpret from knowledge.

If worried about shrinkage, use boosting.

If you still believe in the state of nature, use Random Forest.
