
Bioinformatics, statistics and multiple testing · 2015-02-24 · Bioinformatics, statistics and...

Date post: 01-Jun-2020
Bioinformatics, statistics and multiple testing Alan Moses ML4bio ML4bio With slides from Quaid Morris
Page 1: Bioinformatics, statistics and multiple testing · 2015-02-24 · Bioinformatics, statistics and multiple testing Alan Moses ML4bio With slides from Quaid ... No biological Data available

Bioinformatics, statistics and multiple testing

Alan Moses
ML4bio

With slides from Quaid Morris

Outline for Today

• Bioinformatics
  – GO and other annotations
  – The annoying thing about bioinformatics

• Review of hypothesis testing
  – Parametric vs. non-parametric tests
  – Exact tests
  – Multivariate hypothesis testing

• Multiple hypothesis testing
  – Bonferroni, FDR
  – Application to gene set enrichment analysis

ConDens kinase substrate prediction

• Andy Lai was an MSc student in my lab who developed a cool new way to predict kinase substrates based on amino acid sequence alignments.

• He predicted new lists of substrates for some kinases, and wanted to show that the predictions were good, without doing any experiments.

• Gene Set Enrichment Analysis is the answer

List of predicted Cbk1 targets in yeast: BOI1, SEC3, MPT5, SSD1, DSF2, FIR1, YNL058C, KIN1, YGR117C, KIN2, IRC8, YJL016W, ACE2, RGA2

List of predicted Cbk1 targets in drosophila: CG8617, Oatp30B, CG9467, ecpan

Where Do Gene Lists Come From?

• Molecular profiling, e.g. mRNA, protein
  – Identification → Gene list
  – Quantification → Gene list + values
  – Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen, e.g. of knock-out library
• Association studies (Genome-wide)
  – Single nucleotide polymorphisms (SNPs)
  – Copy number variants (CNVs)

Quaid Morris

What is the Gene Ontology (GO)?

www.geneontology.org

• Set of biological phrases (terms) which are applied to genes:
  – protein kinase
  – apoptosis
  – membrane

• Dictionary: term definitions

• Ontology: A formal system for describing knowledge

Jane Lomax @ EBI

GO Structure

• Terms are related within a hierarchy
  – is-a
  – part-of

• Describes multiple levels of detail of gene function

• Terms can have more than one parent or child

What GO Covers?

• GO terms divided into three aspects:
  – cellular component
  – molecular function
  – biological process (important pathway source)

Example terms: glucose-6-phosphate isomerase activity; cell division

Terms

• Where do GO terms come from?
  – GO terms are added by editors at EBI and gene annotation database groups
  – Terms added by request
  – Experts help with major development
  – 32029 terms, >99% with definitions.
    • 19639 biological_process
    • 2859 cellular_component
    • 9531 molecular_function
    • As of July 15, 2010

Annotations

• Genes are linked, or associated, with GO terms by trained curators at genome databases
  – Known as ‘gene associations’ or GO annotations
  – Multiple annotations per gene

• Some GO annotations created automatically (without human review)

Annotation Sources

• Manual annotation
  – Curated by scientists
    • High quality
    • Small number (time-consuming to create)
  – Reviewed computational analysis

• Electronic annotation
  – Annotation derived without human validation
    • Computational predictions (accuracy varies)
    • Lower ‘quality’ than manual codes

• Key point: be aware of annotation origin

Evidence Types

• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern

• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement

• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available

• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis

• IEA: Inferred from Electronic Annotation

See http://www.geneontology.org


Wide & Variable Species Coverage

Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.

Accessing GO: QuickGO

http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi


Biomart 0.7

Quaid Morris

Ensembl BioMart

• Convenient access to gene list annotation

Select genome
Select filters
Select attributes to download

Quaid Morris

Sources of Gene Attributes

• Ensembl BioMart (eukaryotes)
  – http://www.ensembl.org
• Entrez Gene (general)
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
  – E.g. SGD: http://www.yeastgenome.org/
• Also available through R

Quaid Morris

Why is it all such a mess?

• Naming of molecules was done by whoever found it first.
  o Proteins and Genes do not always have consistent names.
  o More important genes that were studied by many groups have many names. Competing research groups may purposefully omit the name(s) used by other groups

• Database identifiers (IDs) are unique, stable names or numbers that help track database records, but…
  o Each database will typically use its own internal IDs and naming conventions
  o The more important a gene/protein is, the more databases will have information for it, so it will have many IDs
  o Databases are frequently updated, so we always have to keep track of the database version that was used

• Records for: Gene, DNA, RNA, Protein
  o Important to recognize the correct record type
  o Different data sources pertain to different data types (e.g., Pfam only has proteins)
  o The relationship between Genes, DNA, RNA and Proteins is not 1 to 1

Common Identifiers

Gene
  Ensembl: ENSG00000139618
  Entrez Gene: 675
  Unigene: Hs.34012

RNA transcript
  GenBank: BC026160.1
  RefSeq: NM_000059
  Ensembl: ENST00000380152

Protein
  Ensembl: ENSP00000369497
  RefSeq: NP_000050.2
  UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
  IPI: IPI00412408.1
  EMBL: AF309413
  PDB: 1MIU

Species-specific
  HUGO HGNC: BRCA2
  MGI: MGI:109337
  RGD: 2219
  ZFIN: ZDB-GENE-060510-3
  FlyBase: CG9097
  WormBase: WBGene00002299 or ZK1067.1
  SGD: S000002187 or YDL029W

Annotations
  InterPro: IPR015252
  OMIM: 600185
  Pfam: PF09104
  Gene Ontology: GO:0000724
  SNPs: rs28897757

Experimental Platform
  Affymetrix: 208368_3p_s_at
  Agilent: A_23_P99452
  CodeLink: GE60169
  Illumina: GI_4502450-S

Red = Recommended

Quaid Morris

ID Mapping Services

• Synergizer
  – http://llama.med.harvard.edu/synergizer/translate/

• Ensembl BioMart
  – http://www.ensembl.org

• PICR (proteins only)
  – http://www.ebi.ac.uk/Tools/picr/

• R language annotation databases
  – http://www.bioconductor.org

Quaid Morris

ID Mapping Challenges

• Avoid errors: map IDs correctly

• Gene name ambiguity – not a good ID
  – e.g. FLJ92943, LFS1, TRP53, p53
  – Better to use the standard gene symbol: TP53

• Excel error-introduction
  – OCT4 is changed to October-4

• Problems reaching 100% coverage
  – E.g. due to version issues
  – Use multiple sources to increase coverage

Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80

Quaid Morris
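As a rough illustration of the alias problem, the sketch below normalizes the TP53 aliases named on the slide to the standard symbol and reports whatever cannot be mapped. The `ALIAS_TO_SYMBOL` table and the `map_gene_names` helper are hypothetical; in practice the table would come from one of the ID mapping services listed earlier.

```python
# Hypothetical alias table built from the aliases named on the slide.
# A real table would be downloaded from an ID mapping service.
ALIAS_TO_SYMBOL = {
    "FLJ92943": "TP53",
    "LFS1": "TP53",
    "TRP53": "TP53",
    "P53": "TP53",
    "TP53": "TP53",
}


def map_gene_names(names, alias_table=ALIAS_TO_SYMBOL):
    """Return (mapped, unmapped): input name -> standard symbol, and leftovers."""
    mapped, unmapped = {}, []
    for name in names:
        key = name.strip().upper()  # gene names are compared case-insensitively here
        if key in alias_table:
            mapped[name] = alias_table[key]
        else:
            unmapped.append(name)
    return mapped, unmapped


# "October-4" stands in for an OCT4 entry already damaged by Excel.
mapped, unmapped = map_gene_names(["p53", "LFS1", "October-4"])
coverage = len(mapped) / (len(mapped) + len(unmapped))
```

Tracking `unmapped` and `coverage` explicitly makes it easier to notice when a mapping step silently drops genes.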

Summary so far

• GO (and other functional annotations) are a great way to tell us about the functions of a list of genes
• In order to use these, we need to compare our gene list to what’s in the GO database…
  – Genes and their products and attributes have many identifiers (IDs)
  – Bioinformatics often means converting or mapping IDs from one type to another
  – ID mapping services are available
  – Use standard, commonly used IDs to reduce ID mapping challenges

Outline for Today

• Bioinformatics
  – GO and other annotations
  – The annoying thing about bioinformatics

• Review of hypothesis testing
  – Parametric vs. non-parametric tests
  – Exact tests
  – Multivariate hypothesis testing

• Multiple hypothesis testing
  – Bonferroni, FDR
  – Application to gene set enrichment analysis

What is a P-value?

• A) The probability that the null hypothesis is true
• B) Probability of a test statistic under the null distribution
• C) Probability of an incorrect rejection of the null hypothesis
• D) Some subset of the above

Modified from Quaid Morris

What is a P-value?

• A) The probability that the null hypothesis is true
• B) Probability of a test statistic under the null distribution
• C) Probability of an incorrect rejection of the null hypothesis
• D) Some subset of the above

None of these!

Modified from Quaid Morris

What is a P-value?

• Probability of observing something as extreme or more under the null hypothesis

What is this thing?

• Usually it’s a “test statistic” but it can be any summary of the data…

• Always a sum or integral over the “tail” or “tails” of a distribution.

Hypothesis testing

• Random variables:
  – H: H0 (null hypothesis) or H1 (alternative hypothesis)
  – Data: X1, X2, … XN (independent and identically distributed – IID)
  – t is a test statistic, t = f(X)
  – t* observed value of test statistic

• Parameters:
  – α: significance level
  – Reject H0 if P-value < α

• P-value is:
  – Pr[ t is “as or more extreme” than t* | H0 is true ]

Modified from Quaid Morris

P-value versus false rejections

• P-value is:
  – Pr[ t is “as or more extreme” than t* | H0 is true ]

• False rejection probability:
  – Pr[ H0 is true | H0 is rejected ]
  – aka “False discovery rate”

Modified from Quaid Morris

P-value facts

• Note that: Pr[P-value < p | H0 is true] = p

• So under the null distribution, P-value is a random variable that is uniformly distributed between 0 and 1.

• Given different tests with P-values p1, p2, …, pN you can combine them into a single P-value: “Fisher’s method”

• Fisher figured out that the test statistic $X^2 = -2 \sum_i \ln p_i$ is chi-square distributed with 2N degrees of freedom if the p’s are independent and uniform on [0, 1]

• Sometimes called “meta analysis” because you can combine the results of many analyses this way

Modified from Quaid Morris
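Fisher's method can be sketched in a few lines. The function name `fisher_combined_pvalue` is hypothetical, and the sketch assumes the P-values are independent and strictly positive. It uses the fact that a chi-square tail probability with an even number of degrees of freedom (2N) has a closed form, so no statistics library is needed.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values; returns (X2 statistic, combined P-value)."""
    n = len(pvalues)
    x2 = -2.0 * sum(math.log(p) for p in pvalues)
    # Upper tail of chi-square with 2n degrees of freedom:
    #   exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    half = x2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return x2, math.exp(-half) * total


# Two tests that are each only borderline at 0.05.
x2, combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single P-value the method returns that P-value unchanged, which is a handy sanity check.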

E.g., 2-sample tests

• We make several observations under two situations, and we want to find out whether there is a statistical difference.

• Which genes have differential expression in the different tumor types?

[Figure: gene expression matrix, genes by samples]

Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8418-23.

2 sample test

Gene expression levels for a single gene

[Figure: expression values for the “Normal Breast-like” and “Basal subtype” samples, and the distribution of gene expression levels (x-axis: gene expression; y-axis: probability density)]

Question: How likely is it that the difference between the two samples is due to chance?

Modified from Quaid Morris

2 sample t-test

Summarize the data with the so-called “t-statistic”

Normal Breast-like: N1 = 10, Mean: m1 = 5.6, Std: s1 = 1.6
Basal subtype: N2 = 13, Mean: m2 = 0.3, Std: s2 = 1.0

$$ t = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} $$

[Figure: distribution of gene expression levels (x-axis: gene expression; y-axis: probability density)]

H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal

Modified from Quaid Morris
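The t-statistic can be evaluated directly from the summary numbers on the slide. This is a minimal sketch: `t_statistic` is a hypothetical helper, and it assumes the Basal subtype values (mean 0.3, std 1.0) are the second sample's m2 and s2.

```python
import math


def t_statistic(m1, s1, n1, m2, s2, n2):
    """Difference of means divided by its estimated standard error."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Normal Breast-like: N1=10, m1=5.6, s1=1.6; Basal subtype: N2=13, m2=0.3, s2=1.0
t = t_statistic(5.6, 1.6, 10, 0.3, 1.0, 13)
```

The P-value is then the two-tailed area of the T-distribution beyond |t|, which needs a statistics library or a table.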

2 sample t-test

P-value = shaded area × 2

[Figure: distribution of gene expression levels, and the T-distribution with the tail beyond the observed T-statistic shaded (x-axis: T-statistic; y-axis: probability density)]

$$ t = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} $$

H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal

Distribution of this statistic is known under the null hypothesis

Modified from Quaid Morris

Examples of inappropriate distributions for T-tests

T-test assumes data are (approximately) normally distributed
T-test detects differences between means, not necessarily between distributions

Values are positive and have increasing density near zero, e.g. sequence counts

Distributions with outliers, or “heavy-tailed” distributions

Bimodal “two-bumped” distributions.

[Figure: three example probability density plots, one per case above (x-axis: gene expression; y-axis: probability density)]

Solutions: “non-parametric two-sample tests”
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)

Quaid Morris

Enrichment analysis with two-sample, not paired Wilcoxon Rank Sum

aka Mann-Whitney U test or simply “WMW”

1) Rank gene scores, calculate RB, sum of ranks of black values

N1 black values: 3.2, 1.7, 6.5, 4.5, 0.1
N2 red values: 2.1, 5.6, -1.1, -2.5, -0.5

Ranked scores (ranks 1 to 10): 6.5, 5.6, 4.5, 3.2, 2.1, 1.7, 0.1, -0.5, -1.1, -2.5

RB = 21

[Figure: probability density of gene expression for the black and red values]

H0: Probability that a red rank is greater than a black rank is 0.5
H1: Red ranks are greater than black ranks

Quaid Morris

Wilcoxon-Mann-Whitney (WMW) test
aka Mann-Whitney U-test, Wilcoxon rank-sum test

2) Calculate Z-score:

$$ Z = \frac{R_B - \dfrac{N_1 (N_1 + N_2 + 1)}{2}}{\sigma_U} = -1.4 $$

with RB = 21, where $N_1 (N_1 + N_2 + 1)/2$ is the term labelled “mean rank” on the slide.

3) Calculate P-value:

P-value = shaded area × 2

[Figure: normal distribution of Z with the tail beyond Z = -1.4 shaded (y-axis: probability density)]

H0: Probability that a random sample from the distribution of red scores is greater than one from black is 0.5
H1: Otherwise

Quaid Morris

WMW test details

• Described method is only applicable for large N1 and N2 and when there are no tied scores

• WMW test is robust to (a few) outliers

$$ \sigma_U = \sqrt{\frac{N_1 N_2 (N_1 + N_2 + 1)}{12}} $$

Quaid Morris
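The three WMW steps (rank, Z-score, P-value) can be sketched as follows. The helper names are hypothetical. The split of the ten legible scores into black and red groups is an assumption, chosen because it reproduces the slide's RB = 21; rank 1 is taken to be the largest score, and no ties are assumed.

```python
import math


def rank_sum(group, other):
    """Sum of the ranks of `group` in the pooled data (rank 1 = largest, no ties)."""
    pooled = sorted(group + other, reverse=True)
    return sum(pooled.index(v) + 1 for v in group)


def wmw_z(r_b, n1, n2):
    """Normal approximation: (rank sum - its expected value) / sigma_U."""
    expected = n1 * (n1 + n2 + 1) / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (r_b - expected) / sigma_u


def two_sided_normal_p(z):
    """Two-tailed area of the standard normal beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed group assignment (see lead-in); it gives R_B = 21 as on the slide.
black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
r_b = rank_sum(black, red)
z = wmw_z(r_b, len(black), len(red))  # close to the slide's -1.4
p = two_sided_normal_p(z)
```

With only five values per group the normal approximation is rough, which is the caveat the slide raises about small N1 and N2.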

Kolmogorov-Smirnov (K-S) test for difference of distributions

[Figure: probability density of gene expression for red and black, next to the empirical (cumulative) distribution (x-axis: gene expression; y-axis: cumulative probability, 0 to 1.0)]

1) Calculate cumulative distributions of red and black

Quaid Morris


Kolmogorov-Smirnov (K-S) test

[Figure: probability density of gene expression for red and black, next to the empirical (cumulative) distribution with the largest gap marked “Distance = 0.4” (x-axis: gene expression; y-axis: cumulative probability, 0 to 1.0)]

Test statistic: Maximum vertical difference between the two cumulative distributions

Distribution of test statistic is known regardless of the underlying distributions

Modified from Quaid Morris
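The K-S test statistic itself is easy to compute by hand. The sketch below (with a hypothetical `ks_statistic` helper) evaluates both empirical cumulative distributions at every observed value and keeps the largest gap; it returns only the statistic, not the P-value.

```python
def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of samples a and b."""
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in list(a) + list(b):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

This quadratic-time version favours readability; sorting both samples first would make it faster for large data sets.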

WMW and K-S test caveats

• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.
• K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects whether samples from one tend to be higher than those from the other (or vice versa)
• Technical issue: Tied scores and/or small # of observations can be a problem for some implementations of the WMW or K-S test

Quaid Morris

Central limit theorem

• If you have a moderately large sample, you can do statistical tests that don’t depend on assumptions about the distribution of the data

[Figure: probability density of gene expression for the black and red data]

E.g., the black data mean is almost certainly greater than the red mean, but there are a lot of tied ‘0’ values that might mess up K-S and WMW tests.

Central Limit Theorem: The distribution of your estimate of the mean is Gaussian (assuming your sample is big enough, i.i.d., and that the variance is finite).

Under the null hypothesis, average red = average black and is N(μ, σ²), where μ is the mean and σ² is the variance.

What is the distribution of my data?

• Because of the central limit theorem and permutation tests, you don’t usually have to worry about it

• A good way to check is using a “qq-plot”.
  – This compares the “theoretical quantiles” of a particular distribution to the quantiles in your data.
  – If they don’t disagree too badly, you can usually be safe assuming your data are consistent with that distribution

• With large genomics data sets, you will have enough power to reject the hypothesis that your data “truly” come from any distribution
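The points of a normal qq-plot can be produced with the standard library alone. In this sketch `normal_qq_points` is a hypothetical helper, the reference distribution is a normal fitted to the sample mean and standard deviation, and the plotting positions (i + 0.5)/n are one common convention, assumed here.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Pair each sorted observation with the matching quantile of a fitted normal."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(data)))


points = normal_qq_points([2.0, 4.0, 6.0, 8.0, 10.0])
```

Plotting the pairs against the line y = x shows how far the data quantiles drift from the theoretical ones.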

Permutation tests

• Often, the null distribution of the test statistic is unclear or not analytical.
• In these cases, you can generate an empirical distribution by sampling from the null distribution and then evaluating your test statistic against this distribution.
• In many genomic applications it is often possible to get a sample from the null distribution by randomizing (i.e. permuting) the association between genes and corresponding data.

Quaid Morris
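A minimal permutation test might look like the sketch below. The choice of test statistic (absolute difference of group means), the number of permutations, and the add-one correction are all assumptions made for illustration; shuffling group labels is only valid when the null hypothesis really makes the labels exchangeable.

```python
import random


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Empirical two-sided P-value for a difference in means under label shuffling."""
    rng = random.Random(seed)

    def mean_diff(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break the association between values and labels
        if mean_diff(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible P-value of 0.
    return (hits + 1) / (n_perm + 1)


p_separated = permutation_test([5, 6, 7, 8, 9, 10], [0, 1, 2, 1, 0, 2])
```

The smallest P-value this can report is 1/(n_perm + 1), so more permutations are needed when very small P-values matter.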

When permuting, you have to think deep thoughts about what your null hypothesis really is.

Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker. A gene ontology inferred from molecular networks. Nature Biotechnology 31, 38–45 (2013)

Exact tests

• Sometimes, the probability of an observation as extreme or more can be calculated directly under the H0
• In this case there is no “test statistic”
• E.g., “binomial test”, “Fisher’s Exact Test” and “hypergeometric test” (use for Gene Set Enrichment Analysis)
• These tests are feasible now because computers calculate these probabilities

E.g., Binomial test

• You did a poll where you get “yes” or “no” answers each time, and you have some prior belief about the frequency of “yes” or “no” under the null hypothesis. E.g., if people don’t care, then p should be 50%

P-value = Pr(73 or more “yes” | 102 total, p=50%)

$$ \text{P-value} = \sum_{X=73}^{102} \binom{102}{X} (0.5)^X (1 - 0.5)^{102 - X}, \quad \text{where } \binom{n}{k} = \frac{n!}{(n-k)!\,k!} $$
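The binomial tail sum for the poll example can be evaluated exactly. The helper name `binomial_tail` is hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
import math


def binomial_tail(k_min, n, p):
    """Pr(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, x) * p ** x * (1 - p) ** (n - x)
        for x in range(k_min, n + 1)
    )


# 73 or more "yes" answers out of 102 when the null frequency is 50%.
p_value = binomial_tail(73, 102, 0.5)
```

This is a one-tailed P-value; a two-tailed version would also add the probability of 29 or fewer "yes" answers.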

E.g., Fisher’s Exact test

• You developed a prediction method where you got a 2 x 2 table as the result

                      predicted positive   predicted negative
  observed negative           14                  178
  observed positive            7                   31

I won’t bother you with the formula, but the probability of the “configuration” of the 2 x 2 table can be calculated exactly. It doesn’t make any assumption about the distribution of positives and negatives

P-value = Pr(a “configuration” as extreme or more | no association)

To calculate this, you need to sum up a lot of possible tables

According to R, in this case P-value = 0.05666
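The "sum up a lot of possible tables" step can be sketched as follows. The function name is hypothetical, the cell layout (14 and 178 in one row, 7 and 31 in the other) is read from the garbled slide table and is therefore an assumption, and "as extreme or more" is taken to mean tables whose probability is no larger than the observed table's, which is one common two-sided convention.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability of a table with x in the top-left cell, margins held fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Assumed layout of the slide's table; the result should land close to
# the 0.05666 that the slide reports from R.
p_value = fisher_exact_two_sided(14, 178, 7, 31)
```

The small tolerance on the comparison guards against floating-point ties between equally probable tables.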

Page 49: Bioinformatics, statistics and multiple testing · 2015-02-24 · Bioinformatics, statistics and multiple testing Alan Moses ML4bio With slides from Quaid ... No biological Data available

The hypergeometric test

Gene list: RRP6, MRD1, RRP7, RRP43, RRP42

H0: List is a random sample from population
H1: More black genes than expected

Background population: 500 black genes, 4500 red genes


The hypergeometric function

Probability a random sample of k genes contains q black genes when the background population contains m black genes out of n total genes:

$$\Pr(q) = \frac{\binom{m}{q}\binom{n-m}{k-q}}{\binom{n}{k}}$$

• $\binom{m}{q}$: # ways to choose q out of m genes
• $\binom{n-m}{k-q}$: # ways to choose k−q out of n−m genes
• $\binom{n}{k}$: # ways to choose k out of n genes

$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ is called “n choose k”; for details see http://www.khanacademy.org/video/combinations


The hypergeometric test

Gene list: RRP6, MRD1, RRP7, RRP43, RRP42

Null distribution:

$$\text{P-value} = \frac{\binom{500}{4}\binom{4500}{1}}{\binom{5000}{5}} + \frac{\binom{500}{5}\binom{4500}{0}}{\binom{5000}{5}} = 4.6 \times 10^{-4}$$

Background population: 500 black genes, 4500 red genes
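The P-value above is the upper tail of the hypergeometric distribution. A small sketch in Python follows, assuming that 4 of the 5 listed genes are black (which is what the first term of the sum implies); the function names are illustrative only.

```python
from math import comb


def hypergeom_pmf(q: int, k: int, m: int, n: int) -> float:
    """Pr(exactly q black genes in a random sample of k) with m black out of n."""
    return comb(m, q) * comb(n - m, k - q) / comb(n, k)


def hypergeom_enrichment_p(q: int, k: int, m: int, n: int) -> float:
    """Pr(q or more black genes in the sample): the over-enrichment P-value."""
    return sum(hypergeom_pmf(i, k, m, n) for i in range(q, min(k, m) + 1))


# 5 genes in the list, 4 assumed black; background of 500 black + 4500 red genes.
p_value = hypergeom_enrichment_p(q=4, k=5, m=500, n=5000)
print(f"P-value = {p_value:.3g}")  # close to the 4.6 x 10^-4 quoted above
```

The two terms of the slide's sum correspond to `i = 4` and `i = 5` in the loop.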


Important details

• One way to test for under-enrichment of “black”: test for over-enrichment of “red”
• Same as a “One-tailed Fisher’s Exact Test”
• Need to choose “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs. black and circle vs. square), we need to apply the hypergeometric test separately for each type. ***multivariate hypothesis testing***
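The first bullet can be made concrete with a short identity. As a sketch, using the symbols of the hypergeometric function (a sample of k genes, m black genes out of n in the background) and substituting j = k − i for the number of red genes in the sample:

```latex
$$\Pr(\text{black} \le q)
  = \sum_{i=0}^{q} \frac{\binom{m}{i}\binom{n-m}{k-i}}{\binom{n}{k}}
  = \sum_{j=k-q}^{k} \frac{\binom{n-m}{j}\binom{m}{k-j}}{\binom{n}{k}}
  = \Pr(\text{red} \ge k-q)$$
```

So "q or fewer black genes" and "k − q or more red genes" are the same event, and the over-enrichment test applied to the red genes gives the under-enrichment P-value for the black ones.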


Multivariate hypothesis tests

P-value is the “probability of observing something as extreme or more under the null hypothesis”

• Basic problem is the “or more”

We would have to do the sum in all dimensions.

Instead there are two major strategies for multivariate hypothesis testing:
1. Likelihood ratio test – summarizes the multivariate hypothesis with a single test statistic, and then does the sum in a single dimension
2. Test each dimension independently – very conservative because it ignores the potential correlation between dimensions.

When we want to know which dimensions are causing the rejection of the null hypothesis, we typically use #2


Gene set enrichment analysis

• Which (if any) annotations are enriched in our gene list?

• Test each annotation independently using the hypergeometric test

• Need to correct P-values because there are so many annotations tested…
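The "test each annotation independently" step might look like the sketch below. It is an illustration under stated assumptions: the annotation names (`hypothetical_term_A`, `hypothetical_term_B`), the filler genes and the helper names are all invented for the example, and the returned P-values are uncorrected.

```python
from math import comb


def enrichment_p(q: int, k: int, m: int, n: int) -> float:
    """Pr(q or more annotated genes in a random list of k), hypergeometric null."""
    return sum(
        comb(m, i) * comb(n - m, k - i) for i in range(q, min(k, m) + 1)
    ) / comb(n, k)


def enrichment_scan(gene_list, annotations, background):
    """Return {annotation name: uncorrected P-value}, one test per annotation."""
    universe = set(background)
    genes = set(gene_list) & universe
    results = {}
    for term, members in annotations.items():
        annotated = set(members) & universe  # restrict to the background population
        overlap = len(genes & annotated)
        results[term] = enrichment_p(overlap, len(genes), len(annotated), len(universe))
    return results


# Toy data: the five genes from the slides plus 45 made-up filler genes.
gene_list = ["RRP6", "MRD1", "RRP7", "RRP43", "RRP42"]
fillers = [f"FILLER{i}" for i in range(45)]
background = gene_list + fillers
annotations = {
    "hypothetical_term_A": ["RRP6", "RRP7", "RRP43", "RRP42", "FILLER0"],
    "hypothetical_term_B": ["MRD1"] + fillers[1:21],
}
results = enrichment_scan(gene_list, annotations, background)
for term, p in sorted(results.items(), key=lambda item: item[1]):
    print(f"{term}: P = {p:.3g}")
```

Intersecting both the gene list and each annotation with the background mirrors the earlier advice to choose the background population appropriately.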


Outline for Today

• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis


Multiple test correction: Bonferroni and False Discovery Rate


Mark Gerstein P-value paradox

– His lab publishes about 30 research papers/year. E.g., published 33 papers in 2011 (>300 in the last 10 years)

– At P-value=0.05, how many significant results/year are expected from his lab under the null hypothesis?


How to win the P-value lottery, part 1

Random draws

… 7,834 draws later …

Expect a random draw with observed enrichment once every 1 / P-value draws

Background population: 500 black genes, 4500 red genes


How to win the P-value lottery, part 2

Keep the gene list the same, evaluate different annotations

Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42
Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42


ORA tests need correction

From the Gene Ontology website:
Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function

Buying 1 or 2 or even 10 lottery tickets, you still have a small chance of winning. However, if you buy 25,000 tickets, your chances of winning start to improve.
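The lottery analogy can be put in numbers. The sketch below rests on two simplifying assumptions that the slides do not state: every test is run at the same threshold (0.05 here), and the tests are independent with the null hypothesis true for all of them. The function names are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of tests with P-value <= alpha when every null is true."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Pr(one or more P-values <= alpha), assuming independent tests under the null."""
    return 1.0 - (1.0 - alpha) ** num_tests


# 1, 2 or 10 "tickets" versus one ticket per Gene Ontology term.
for num_tests in (1, 2, 10, 25206):
    print(
        num_tests,
        expected_false_positives(num_tests, 0.05),
        prob_at_least_one_false_positive(num_tests, 0.05),
    )
```

Real annotations overlap heavily, so the independence assumption is only a rough guide, but the trend is the point: with tens of thousands of tests, some "significant" enrichments are expected by chance alone.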


Simple P-value correction: Bonferroni

If M = # of annotations tested:

Corrected P-value = M x original P-value

Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”
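As a minimal sketch, the correction is one multiplication. Capping the result at 1 is an addition made here so the output still reads as a probability; the slide's formula has no cap. The function name is hypothetical.

```python
def bonferroni_correct(p_value: float, num_tests: int) -> float:
    """Bonferroni-corrected P-value: M x original P-value, capped at 1 (the cap is an assumption)."""
    return min(1.0, num_tests * p_value)


# The earlier hypergeometric example (P = 4.6e-4), if all 25206 GO terms were tested.
corrected_all_terms = bonferroni_correct(4.6e-4, 25206)

# The same P-value if only 50 annotations had been tested.
corrected_fifty_terms = bonferroni_correct(4.6e-4, 50)

print(corrected_all_terms, corrected_fifty_terms)
```

The comparison shows why the number of annotations tested matters: the same raw P-value survives correction over 50 tests but not over the full ontology.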


Bonferroni correction caveats

• Bonferroni correction is very stringent and can “wash away” real enrichments.

• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.


False discovery rate (FDR)

• FDR is the expected proportion of the observed enrichments due to random chance.

• Compare to Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.

• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.

• FDR threshold is often called the “q-value”


Controlling FDR using the Benjamini-Hochberg procedure I

• Say you want to bound the FDR at α; you need to calculate the corresponding P-value threshold t

• First, calculate the P-values for all the tests, and then sort them so that p_1 is the smallest (i.e. most significant) P-value, and p_m is the least significant.

Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B 57, 289–300


Controlling FDR using the Benjamini-Hochberg procedure II

• t = p_r where r is the max value for which:

$$p_r \le \frac{r\,\alpha}{m}$$

where α is the FDR threshold, r is the rank, and m is the # of tests.

Caveat: Assumes independent or positively correlated tests.
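The two steps of the procedure (sort the P-values, then find the largest rank r with p_r ≤ rα/m) can be sketched as follows. The function name and the example P-values are made up for illustration; returning `None` when no rank qualifies is a choice made here, not something the slides specify.

```python
def benjamini_hochberg(p_values, alpha):
    """Return (threshold t, list of booleans marking which tests pass at FDR alpha)."""
    m = len(p_values)
    ranked = sorted(p_values)  # ranked[0] is p_1, the most significant
    r_max = 0
    for r, p in enumerate(ranked, start=1):
        if p <= r * alpha / m:
            r_max = r  # keep the largest rank that satisfies the bound
    if r_max == 0:
        return None, [False] * m  # no test passes
    t = ranked[r_max - 1]
    return t, [p <= t for p in p_values]


# Made-up P-values for ten tests, FDR bounded at alpha = 0.05.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
threshold, passed = benjamini_hochberg(example, 0.05)
print(threshold, sum(passed))
```

Note that the loop keeps going after a rank fails the bound: a later, larger rank may still satisfy it, in which case every smaller P-value passes as well.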



Reducing multiple test correction stringency

• The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be

• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.


Summary

• Multiple test correction

– Bonferroni: stringent, controls probability of at least one false positive

– FDR: more forgiving, controls expected proportion of false positives; typically use Benjamini-Hochberg

