
Peter J. Bickel Department of Statistics University of California at Berkeley, USA

Date post: 04-Jan-2016
Description:
Refined Non Parametric Methods for Genomic Inference. Peter J. Bickel, Department of Statistics, University of California at Berkeley, USA. Joint work with Nancy R. Zhang (Stanford), James B. Brown (UCB) and Haiyan Huang (UCB).
Transcript
Page 1: Peter J. Bickel Department of Statistics University of California at Berkeley, USA

Refined Non Parametric Methods for Genomic Inference

Peter J. Bickel

Department of Statistics, University of California at Berkeley, USA

Joint work with Nancy R. Zhang (Stanford), James B. Brown (UCB) and Haiyan Huang (UCB)

Page 2

Motivating Questions

Page 3

Association of functional annotations in Human Genome

[Diagram: a double-stranded region, 5'→3' and 3'→5', marked with Transcription Start Sites (TSSs) and GENCODE exons]

The ENCODE Consortium found that many Transcription Start Sites are anti-sense to GENCODE exons

They also found vastly more TSSs than previously supposed

Is the association between TSSs and exons in the anti-sense direction real, or experimental noise in TSS identification?

Page 4

Association of experimental annotations across whole chromosomes

Do two factors tend to bind together more closely or more often than other pairs of factors? Does a factor’s binding site relative to TSSs tend to change across genomic regions?

Page 5

The statistical relation of Transcription Start Sites and protein binding sites

Normalized ChIP-chip signals around GENCODE TSSs in ENCODE regions

Most peak over the TSS and are clearly significant

Does the upstream bump in CTCF constitute good evidence of enhancer binding activity?

[Figure from ENCODE Consortium Paper: Nature, June 14th, 2007. Vertical axis: normalized signal intensity. Annotation: "Enhancer activity?"]

Page 6

What is a non-parametric model for the Genome and why is it needed?

Page 7

Feature Overlap: the question

A mathematical question arises:

Do these features overlap more, or less than “expected at random”?

[Diagram: a 5'→3' region marked with transcription fragments and conserved sequence]

Page 8

Our formulation

Defining “expectation” and “at random”:

The genome is highly structured

Analysis of feature inter-dependence must account for superficial structure

“Expected at random” becomes:

Overlap between two feature sets bearing structure, under no biological constraints

Page 9

Naïve Method

Treating bases as being independent with the same distribution (ordinary bootstrap)

Hypothesis: Feature markings are independent

Specific Object Test based on % Feature Overlap – (% Feature1)(% Feature2) and standard statistics

Why naïve? Bases are NOT independent

Better method: keeping one type of feature fixed and simulating moving the start sites of the other feature uniformly (feature bootstrap)

Why still a problem?

Even if feature occurrences are independent functionally, there can be clumping caused by the complex underlying genome sequence structure (i.e. inhomogeneity, local sequence dependence)
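The two procedures on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: features are represented as 0/1 lists of equal length, "% overlap" is taken as the fraction of positions marked in both tracks, and the helper names (`overlap_fraction`, `naive_statistic`, `intervals`, `shuffle_starts`, `feature_bootstrap`) are hypothetical, not taken from the talk.

```python
import random

def overlap_fraction(a, b):
    """Fraction of positions marked 1 in both 0/1 tracks."""
    return sum(x & y for x, y in zip(a, b)) / len(a)

def naive_statistic(a, b):
    """% Feature Overlap minus (% Feature1)(% Feature2), as fractions."""
    n = len(a)
    return overlap_fraction(a, b) - (sum(a) / n) * (sum(b) / n)

def intervals(track):
    """Return (start, length) for each maximal run of 1s in the track."""
    runs, start = [], None
    for i, v in enumerate(list(track) + [0]):  # trailing 0 closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def shuffle_starts(track, rng):
    """Move every run to a uniformly random start (assumption: runs may merge)."""
    n = len(track)
    out = [0] * n
    for _, length in intervals(track):
        s = rng.randrange(n - length + 1)
        for i in range(s, s + length):
            out[i] = 1
    return out

def feature_bootstrap(a, b, reps=200, seed=0):
    """Keep track a fixed, randomize the start sites of b, record the overlap."""
    rng = random.Random(seed)
    return [overlap_fraction(a, shuffle_starts(b, rng)) for _ in range(reps)]
```

Comparing the observed `overlap_fraction(a, b)` with the values returned by `feature_bootstrap(a, b)` gives the start-site randomization test; as the slide notes, neither procedure accounts for clumping caused by the underlying sequence structure.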

Page 10

A non-parametric model

Requirements:

a) It should roughly reflect known statistics of the genome

b) It should encompass the methods listed

c) It should be possible to do inference, tests, set confidence bounds meaningfully

Page 11

Segmented Stationary Model

Let $X_i$ = base at position $i$, $i = 1,\dots,n$, with

$$(X_1,\dots,X_n) = (X_{11},\dots,X_{1 n_1},\;\dots,\;X_{r1},\dots,X_{r n_r}), \qquad n_1 + \dots + n_r = n,$$

such that for each $k = 1,\dots,r$, the segment $\{X_{kj} : 1 \le j \le n_k\}$ is:

Stationary (homogeneity within blocks)

Mixing (bases at distant positions are nearly independent)

$r \ll n$

[Diagram: the sequence divided into consecutive segments of lengths $n_1, n_2, \dots, n_{r-1}, n_r$]
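To make the segmented stationary model concrete, the following sketch simulates a binary sequence made of a few concatenated segments, each a stationary two-state Markov chain with its own transition probabilities. The Markov choice, the function names and the parameters `p01` and `p10` are assumptions for illustration; the model itself places no parametric form on the segments.

```python
import random

def markov_segment(length, p01, p10, rng):
    """Stationary binary Markov chain with P(0->1) = p01 and P(1->0) = p10."""
    pi1 = p01 / (p01 + p10)            # stationary probability of state 1
    x = [1 if rng.random() < pi1 else 0]
    for _ in range(length - 1):
        if x[-1] == 0:
            x.append(1 if rng.random() < p01 else 0)
        else:
            x.append(0 if rng.random() < p10 else 1)
    return x

def segmented_sequence(specs, seed=0):
    """Concatenate segments; specs is a list of (n_k, p01, p10) triples.

    Returns the sequence and the segment boundaries [0, n_1, n_1 + n_2, ..., n].
    """
    rng = random.Random(seed)
    seq, bounds = [], [0]
    for length, p01, p10 in specs:
        seq.extend(markov_segment(length, p01, p10, rng))
        bounds.append(len(seq))
    return seq, bounds
```

For example, `segmented_sequence([(3000, 0.1, 0.4), (2000, 0.4, 0.1)])` gives a sequence with r = 2 segments whose means differ (about 0.2 and 0.8), while each segment is homogeneous and has short-range dependence only.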

Page 12

Empirical Interpretations

Within a segment:

For $k$ small compared to minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment

Knowledge of the first k-mer does not help in predicting a distant k-mer

Remark: If this model holds, it also applies to derived local features, e.g. $\{I_1,\dots,I_n\}$ where $I_k = 1$ if position $k$ belongs to a binding site for a given factor

Page 13

The other models mentioned are special cases with r = 1

Independent identically distributed (bootstrap)

Stationary Markov

Uniform displacement of start sites (Homogeneous Poisson Process)

Page 14

Is the Effect Serious?

Ordinary bootstrap: base-by-base sampling randomly from the observed sequence for two features separately

Feature randomization: keep one type of feature fixed and randomize the start positions of the other

Example Statistic: Overlap between two features in a binary sequence of 10K bases (region statistic in the ENCODE studies) Feature 1: occurrence of motif 111000; Feature 2: more than six 1’s in 10 consecutive bases

True distribution: Mean=5.23 SD=0.53

Ordinary Bootstrap: Mean=4.83 SD=0.26

Feature Randomization: Mean=6.19 SD=0.81

Block Bootstrap: Mean=4.81 SD=0.55
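The example statistic can be computed along the following lines. The slide does not say how occurrences are turned into marked bases or how the overlap is normalized, so this sketch assumes that every base covered by an occurrence of the motif, or by a 10-base window with at least seven 1's, is marked, and that overlap is reported as a percentage of bases. The function names are hypothetical.

```python
def motif_track(x, motif=(1, 1, 1, 0, 0, 0)):
    """Mark every base covered by an occurrence of the motif (Feature 1)."""
    n, m = len(x), len(motif)
    track = [0] * n
    for i in range(n - m + 1):
        if tuple(x[i:i + m]) == tuple(motif):
            for k in range(i, i + m):
                track[k] = 1
    return track

def dense_track(x, width=10, min_ones=7):
    """Mark every base covered by a window holding more than six 1's (Feature 2)."""
    n = len(x)
    track = [0] * n
    for i in range(n - width + 1):
        if sum(x[i:i + width]) >= min_ones:
            for k in range(i, i + width):
                track[k] = 1
    return track

def percent_overlap(a, b):
    """Percentage of bases marked in both feature tracks."""
    return 100.0 * sum(p & q for p, q in zip(a, b)) / len(a)
```

Applying `percent_overlap(motif_track(x), dense_track(x))` to many independently simulated sequences `x` would give a "true distribution" to compare against the resampling schemes; the numbers on this slide come from the authors' own simulation, which this sketch does not reproduce.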

Page 15

Evidence for Segmented Stationarity

DNA sequence is known to be inhomogeneous

However, it has been segmented into homogeneous domains based on:

Base composition (e.g. finding isochores)

CpG density

Density of higher order features (e.g. ORFs, palindromes, TFBS)

Our model aims to capture these “domain-specific” effects, while avoiding parametric assumptions within domains

Figure from Li, 2001:

References: Elton (1974, J. Theoretical Bio.), Braun and Müller (1998, Statistical Science), Li et al. (1998, Genome Res.), Liu and Lawrence (1999, Bioinformatics)

Page 16

Inference with our model

Use $X_1,\dots,X_n$ for basic data, but $X_k$ could be base identity, feature identity, or a vector of feature identities obeying the segmented stationarity assumption.

Page 17

Using our model for inference

Many genomic statistics are functions of one or more sums of the form:

$$S = \sum_{i=1}^{n} g(U_i)$$

e.g. $g(X_k)$ is 1 or 0 depending on the presence or absence of a feature or features

When the summands are small compared to $S$: Gaussian case

Example: Region overlap for common features, or rare features over large regions

Under segmented stationarity, these distributions can be estimated from the data

Page 18

Distributions of feature overlaps

The Block Bootstrap

Can’t observe independent occurrences of ENCODE regions, but if our hypothesis of segmented stationarity holds then the distribution of sum statistics and their functions can be approximated as follows

Page 19

Block Bootstrap for r = 1

Algorithm 4.1:

a) Given $L \ll n$, choose a number $N$ uniformly at random from $\{1,\dots,n-L\}$

b) Given the statistic $T_n(X_1,\dots,X_n)$, under the assumption that $X_1,\dots,X_n$ is stationary, compute $T^*_{L1} = T_L(X_{N+1},\dots,X_{N+L})$

c) Repeat $B$ times to obtain $T^*_{L1},\dots,T^*_{LB}$

d) Estimate the distribution of $\sqrt{n}\,T_n$ (suitably centered) by the empirical distribution of

$$\sqrt{L}\,\bigl(T^*_{Lj} - T_n(X_1,\dots,X_n)\bigr), \qquad 1 \le j \le B$$

By Theorem 4.2.1 of Politis, Romano and Wolf (1999), this empirical distribution is asymptotically Gaussian with mean 0
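A minimal sketch of the four steps for r = 1 follows. It assumes that N is drawn uniformly from {1, ..., n−L}, that the block is X_{N+1}, ..., X_{N+L}, and that each draw is recentered at the full-sample statistic and scaled by √L; the function names are hypothetical and `stat` can be any function of a list.

```python
import math
import random

def mean(values):
    """Example statistic: the sample mean of a block."""
    return sum(values) / len(values)

def block_bootstrap(x, stat, L, B=500, seed=0):
    """Return B values of sqrt(L) * (T_L(block) - T_n(x)) from random blocks."""
    rng = random.Random(seed)
    n = len(x)
    t_n = stat(x)                     # statistic on the full sequence
    draws = []
    for _ in range(B):
        N = rng.randint(1, n - L)     # step a): N uniform on {1, ..., n - L}
        block = x[N:N + L]            # step b): X_{N+1}, ..., X_{N+L} (0-based slice)
        draws.append(math.sqrt(L) * (stat(block) - t_n))
    return draws                      # steps c) and d): empirical distribution
```

The returned list plays the role of the empirical distribution in step d); its quantiles or its histogram can then be compared with a Gaussian, as on the following slides.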

Page 20

Block Bootstrap Animation, r = 1

Observed Sequence (X): Statistic: S = f(X)

Draw a block of length L from the original sequence; this is the block-bootstrapped sequence.

Calculate the statistic on the block-bootstrapped sequence. Repeat this procedure identically B times.

[Animation: blocks $X^*_1, X^*_2, \dots, X^*_B$ drawn from the observed sequence, giving statistics $S^*_1 = f(X^*_1),\ S^*_2 = f(X^*_2),\ \dots,\ S^*_B = f(X^*_B)$]

Page 21

Observing the distributions

Block bootstrap distribution of the Region Overlap Statistic

Shown here with the PDF of the normal distribution with the same mean and variance

QQplot of BB distribution vs. standard normal

The histogram of

Is approximately the same as density of

Page 22

What if r > 1

The estimated distribution is always heavier tailed, leading to conservative p values

But it can be enormously so if the segment means of the statistic differ substantially

Less so but still meaningful if the means agree but variances differ

Page 23

Simulation Study

For simplicity, we concatenate 2 homogeneous regions generated as above

Page 24

Simulation Results and comparison to a naïve method

True distribution

Uniform Start Site Shuffling

Block Bootstrap without Segmentation

Block Bootstrap with True Segmentation

Page 25

Solutions

1) Segment using biological knowledge

Essentially done in ENCODE: poor segmentation occasionally led to non-Gaussian distributions (excessively conservative)

2) Segment using a particular linear statistic which we expect to identify homogeneous segments

Page 26

Block Bootstrap with Segmentation

Draw a block from each sub-segment and concatenate to form a block bootstrap sample

Page 27

Block Bootstrap given Segmentation

1. Draw subsample of length L, with blocks of lengths $f_1 L$, $f_2 L$, $f_3 L$

2. Compute statistic on subsample: $T(X^*)$

3. Do this B times: $T(X^*_1),\dots,T(X^*_B)$
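The three steps can be sketched as below, assuming the segmentation is given as a list of boundaries `[0, t_1, ..., n]` and that the block drawn from segment k has length f_k·L with f_k the fraction of the sequence in that segment (rounded, and at least one base). The names `segmented_block_sample` and `segmented_block_bootstrap` are hypothetical.

```python
import random

def segmented_block_sample(x, bounds, L, rng):
    """Draw one block per segment and concatenate them (step 1)."""
    n = len(x)
    sample = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg_len = hi - lo
        b = max(1, round(L * seg_len / n))      # block length f_k * L
        start = rng.randrange(lo, hi - b + 1)   # block stays inside the segment
        sample.extend(x[start:start + b])
    return sample

def segmented_block_bootstrap(x, bounds, stat, L, B=500, seed=0):
    """Compute the statistic on B concatenated subsamples (steps 2 and 3)."""
    rng = random.Random(seed)
    return [stat(segmented_block_sample(x, bounds, L, rng)) for _ in range(B)]
```

Because each subsample keeps the segments in their original proportions, differences between segment means no longer inflate the spread of the bootstrap distribution, which is the effect the segmentation is meant to remove.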

Page 28

Simulation Results, with segmentation

True distribution

Uniform Start Site Shuffling

Block Bootstrap without Segmentation

Block Bootstrap with True Segmentation

Block Bootstrap with Estimated Segmentation

Page 29

Dyadic Segmentation

Define,

$$M(j) = \left[\frac{j}{n}\left(1-\frac{j}{n}\right)\right]^{1/2}\,\Bigl|\operatorname{Ave}\{X_i : 1 \le i \le j\} - \operatorname{Ave}\{X_i : j+1 \le i \le n\}\Bigr|$$

Find $j_{\max}$ maximizing $M(j)$, creating intervals $I_{\text{left}}$ and $I_{\text{right}}$

If the length of both intervals falls below a stopping criterion, stop

Else, repeat the process for $I_{\text{left}}$ and/or $I_{\text{right}}$, whichever are longer than the stopping criterion, with redefined $M(j)$
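The recursion on this slide can be sketched as follows. The sketch assumes that M(j) is the absolute difference between the mean of the first j values and the mean of the rest, weighted by the square root of (j/n)(1 − j/n), and that the only stopping rule is a minimum interval length `min_len`. Both are readings of the slide rather than the authors' code, and the function names are hypothetical.

```python
import math

def best_cut(x, lo, hi):
    """Return (j, M(j)) for the split of x[lo:hi] after j elements that maximizes M."""
    n = hi - lo
    total = sum(x[lo:hi])
    best_j, best_m, left = None, -1.0, 0.0
    for j in range(1, n):
        left += x[lo + j - 1]                          # running sum of the first j values
        diff = abs(left / j - (total - left) / (n - j))
        m = math.sqrt((j / n) * (1 - j / n)) * diff
        if m > best_m:
            best_j, best_m = j, m
    return best_j, best_m

def dyadic_segmentation(x, min_len):
    """Greedy recursive splitting; intervals no longer than min_len are left alone."""
    bounds = [0, len(x)]
    stack = [(0, len(x))]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= max(min_len, 1):
            continue                                   # stopping criterion reached
        j, _ = best_cut(x, lo, hi)
        cut = lo + j
        bounds.append(cut)
        stack.extend([(lo, cut), (cut, hi)])           # M(j) is redefined on each part
    return sorted(bounds)
```

With only a length-based stopping rule, every interval longer than `min_len` is split, even when its two halves have the same mean; a threshold on the size of the change is what prevents such spurious cuts.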

Page 30

Dyadic Segmentation

change in mean of the statistic

Statistic as a function of position

First cut maximizes the difference between the means in the new segments

All subsequent cuts are greedy, making maximal splits

The mean is recomputed in each segment, so long as the segment is longer than a set threshold

When no new cuts exist, the segmentation is complete

Page 31

True distribution

Uniform Start Site Shuffling

Block Bootstrap without Segmentation

Block Bootstrap with True Segmentation

Block Bootstrap with Estimated Segmentation

Page 32

Confidence Bounds: r > 1

Given a statistic, e.g. basepair % overlap:

Find such that:

as small as possible

“Average basepair overlap over all potential genomes for the region considered”

Page 33

Use Algorithm 4.1

For each segment pick a random block of length proportional to segment length

Concatenate to get a block of length L

Compute % bp overlap for block

Repeat many times

Use 100(1-α) percentiles of this for

Page 34

Testing Association

Question: How do we estimate the null distribution given only data for which we believe the null is false?

Page 35

Testing Association (bp overlap)

Observed sequence, with Feature 1 and Feature 2 marked.

Sample two blocks of equal length. In block $i$ ($i = 1, 2$), $X_i$ marks Feature 1 and $Y_i$ marks Feature 2.

Align Feature 1 of the first block with Feature 2 of the second block, and vice versa.

Calculate overlap in the blocks after swapping = $(X_2)(Y_1) + (X_1)(Y_2)$

Statistic is: $(X_2)(Y_1) + (X_1)(Y_2)$, properly normalized and set to mean 0. Under the null hypothesis of independence, this should be Gaussian.

Page 36

Test Statistic

H: Features not associated in each segment (so-called “dummy overlap”)

Then has a Gaussian distribution.

We form the test statistic:

where:

Length of segment i/n

% of basepairs in segment i identified as Feature 1

% of basepairs in segment i identified as Feature 2

Page 37

Null Distribution

Choose pairs of blocks at random

Compute false (“dummy”) overlap H

Compute I = % Feature 1 and J = % Feature 2

Block bootstrapped Null: H – IJ

If r > 1, pairs of blocks are chosen in each region; H and IJ are weighted sums across regions.

The Null is mean zero, and has the correct variance
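For r = 1, the steps above might look like the following sketch. The normalization is an assumption: H is taken as the swapped overlap divided by the total length 2L of the two blocks, and I and J as the fractions of Feature 1 and Feature 2 over both blocks. Block starts are drawn independently and may overlap, and the function name is hypothetical.

```python
import random

def dummy_overlap_null(f1, f2, L, B=500, seed=0):
    """Return B values of H - I*J computed from random pairs of blocks."""
    rng = random.Random(seed)
    n = len(f1)
    null = []
    for _ in range(B):
        s1 = rng.randrange(n - L + 1)            # start of block 1
        s2 = rng.randrange(n - L + 1)            # start of block 2
        x1, y1 = f1[s1:s1 + L], f2[s1:s1 + L]    # Feature 1 / Feature 2 in block 1
        x2, y2 = f1[s2:s2 + L], f2[s2:s2 + L]    # Feature 1 / Feature 2 in block 2
        swapped = (sum(a & b for a, b in zip(x1, y2))
                   + sum(a & b for a, b in zip(x2, y1)))
        H = swapped / (2 * L)                    # false ("dummy") overlap
        I = (sum(x1) + sum(x2)) / (2 * L)        # fraction of Feature 1
        J = (sum(y1) + sum(y2)) / (2 * L)        # fraction of Feature 2
        null.append(H - I * J)
    return null
```

The observed overlap minus the product of the observed feature fractions can then be compared with the spread of the returned values. For r > 1 the same computation would be done within each region and combined as a weighted sum, which this sketch leaves out.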

Page 38

Example from ENCODE data

ENm001: ENCODE Consortium annotated over 2500 feature-instances exclusive of UTRs and CDSs

Question: “Do these (largely) non-coding features exhibit more overlap with constrained sequences than expected at random?”

To answer, we used the block bootstrap to obtain null distribution

When null is Gaussian, it has the correct variance

When not, it is overly conservative

Segmentation can reduce conservativeness, and detect significance that would otherwise be missed

Page 39

[Figure: two Normal Q-Q plots (Theoretical Quantiles against Sample Quantiles). Panels: "No Segmentation" and "Estimated Segmentation". Annotated p-values: 0.001 and 0.1]

Page 40

There are two L’s

$L_s$: the minimum segment length during segmentation

To be discussed

$L$: the length of blocks during subsampling

Chosen on grounds of stability

Page 41

A philosophical question: The Issue of Scale

Relevant probability assessments depend on segmentation

Segmentation depends on scale

Things which seem surprising on small scales may not be at larger ones

E.g. differences in GC content

My view: It’s only determinable biologically

Page 42

Some Future Directions

KS type tests

Beyond overlap, KS-type tests can compare the distributions of features, e.g. “Does the pattern of constrained sequence in coding regions differ from that in non-coding regions?”

Maxima

Aggregative plots can summarize one feature in the neighborhood of another, e.g. “Does binding data (such as ChIP-chip) show that a given regulatory factor tends to bind near TSSs?”

Other types of association

Does wavelet analysis offer significant support for the large scale association of replication timing and conservation?

Many others arising from ENCODE, modENCODE, and elsewhere

Other types of segmentation

Dyadic segmentation is analytically convenient, but other segmentations may be useful

Page 43

Acknowledgements

The ENCODE Consortium

The MSA and Transcription and Regulation Groups

Especially: Elliot Margulies, Tom Gingeras and Ewan Birney

Supported by NIGMS and NHGRI

Page 44
Page 45

Association of functional annotations in Human Genome

| Category | Transcript survey method | Number of TSS clusters | P value | Singleton clusters (%) |
|---|---|---|---|---|
| Known | GENCODE 5' ends | 1,730 | 10^-70 | 25 (74 overall) |
| Novel | GENCODE sense exons | 1,437 | 10^-39 | 64 |
|  | GENCODE antisense exons | 521 | 10^-8 | 65 |
|  | Unbiased transcription survey | 639 | 10^-63 | 71 |
|  | CpG island | 164 | 10^-90 | 60 |
| Unsupported | None | 2,666 | - | 83.4 |

Table from ENCODE Consortium Paper: Nature, June 14th, 2007

Page 46

Dyadic Segmentation

Algorithm 4.8

For a minimum region length $L_s$ and threshold $b$, initialize $t = (t_0, t_1)$ with $t_0 = 0$, $t_1 = n$.

1. For $i = 1,\dots,|t|-1$, let $M^{(i)}(j)$ and $V^{(i)}(j)$ be respectively the processes (4.7) and (4.8) computed on the subsequence $X_{t_{i-1}+1},\dots,X_{t_i}$. Let $t'_i = \arg\max_j M^{(i)}(j)$, and $m_i = \min(t'_i - t_{i-1},\ t_i - t'_i)$. Let:

$$B_i = \begin{cases} M^{(i)}(t'_i), & m_i \ge L_s; \\ 0, & \text{otherwise.} \end{cases}$$

2. Let $V_i = V^{(i)}(t'_i)$. Let $J_i$ be the indicator that $B_i$, scaled using the segment length $t_i - t_{i-1}$ and normalized by $V_i$, exceeds the threshold $b$. If $\sum_i J_i = 0$, stop and return $t$.

3. Let $i^* = \arg\max_i B_i$, and $t_{\text{new}} = t'_{i^*}$

4. Let $t = t \cup t_{\text{new}}$, reordered so that $t_i$ is monotonically increasing in $i$.
