Two bioinformatics applications of dynamic Bayesian networks William Stafford Noble Department of...

Post on 27-Mar-2015


Two bioinformatics applications of dynamic Bayesian networks

William Stafford Noble

Department of Genome Sciences

Department of Computer Science and Engineering

University of Washington

Outline

• Segmenting genomic data
  – Background: DNA, chromatin and DNase I
  – Simple solution
  – Wavelets
  – Hierarchical model

• Matching peptides to mass spectra
  – Background: tandem mass spectrometry
  – Modeling peptide fragmentation

The human genome in vivo

[Figure: genomic DNA packaged into chromatin within the nucleus. Labels: genes, gene 'domains', DNase I hypersensitive site, trans-factor complex, chromatin fiber, nucleus.]

Measuring chromatin accessibility

A simple hidden Markov model

• Each state contains a single Gaussian.
• The model has six parameters (two transitions, two means, two standard deviations).
• The parameters are initialized randomly and trained in an unsupervised fashion via expectation-maximization.
• EM is re-started 100 times, and we select the parameters that yield the highest likelihood.
• The original data set is then segmented using either Viterbi or posterior decoding.
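The decoding step can be illustrated with a small sketch. The code below is not the implementation behind these slides; it is a minimal Viterbi decoder for a two-state HMM with one Gaussian per state. The function names, the uniform initial state distribution, and all parameter values are assumptions made for illustration, and the EM training with 100 restarts is not shown.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi_two_state(obs, stay, means, sds):
    """Most likely state path for a two-state Gaussian HMM.

    stay[k] is the probability of remaining in state k, so the model has six
    parameters: two transitions, two means, two standard deviations.
    The initial state distribution is assumed uniform (an assumption).
    """
    log_trans = [
        [math.log(stay[0]), math.log(1 - stay[0])],
        [math.log(1 - stay[1]), math.log(stay[1])],
    ]
    score = [math.log(0.5) + gaussian_logpdf(obs[0], means[k], sds[k]) for k in (0, 1)]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for k in (0, 1):
            # best predecessor state for landing in state k
            cands = [score[j] + log_trans[j][k] for j in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + gaussian_logpdf(x, means[k], sds[k]))
        score = new_score
        back.append(pointers)
    # trace back from the best final state
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy signal with hypothetical parameters: state 0 = closed, state 1 = open.
signal = [0.1, -0.2, 0.0, 3.1, 2.9, 3.3, 0.2, -0.1]
path = viterbi_two_state(signal, stay=[0.9, 0.9], means=[0.0, 3.0], sds=[0.5, 0.5])
print(path)
```

Posterior decoding would replace the max over predecessors with a sum (forward-backward) and pick the most probable state at each position independently.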

[Figure: a 1.5-megabase region with segments labeled open chromatin and closed chromatin.]

A problem, and two solutions

• Problem: We are interested in phenomena occurring at multiple scales.

• Solution #1: Perform a wavelet smooth prior to HMM analysis.

• Solution #2: Build a more complex probability model.
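Solution #1 can be sketched as follows. The slides do not say which wavelet was used; the Haar basis is assumed here because smoothing with it reduces to block averaging, which keeps the example short. The function name and the toy signal are hypothetical.

```python
def haar_smooth(signal, levels):
    """Haar wavelet smooth: keep the level-`levels` approximation, zero the details.

    For the Haar basis this equals replacing each block of 2**levels
    consecutive samples with the block mean. Assumes len(signal) is a
    multiple of 2**levels.
    """
    block = 2 ** levels
    if len(signal) % block != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    smoothed = []
    for start in range(0, len(signal), block):
        mean = sum(signal[start:start + block]) / block
        smoothed.extend([mean] * block)
    return smoothed

raw = [1.0, 3.0, 2.0, 2.0, 8.0, 10.0, 9.0, 9.0]
coarse = haar_smooth(raw, levels=2)   # scale of 4 samples
fine = haar_smooth(raw, levels=1)     # scale of 2 samples
```

Running the HMM on `coarse` and on `fine` would give segmentations at two different scales, which is the motivation stated in the problem bullet.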

Change point model

• Four-state model:
  – major DNase hypersensitive site (DHS),
  – minor DHS,
  – intermediate sensitivity region, and
  – insensitive region.

• Continuous mixture of Gaussians at each state.

• Gamma distribution of lengths within each region.
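To make the generative picture concrete, here is a rough sampler for a model of this shape. It is a simplification, not the model from the slides: each state emits from a single Gaussian rather than a continuous mixture, the next state is chosen uniformly among the other three, and every numeric parameter is hypothetical.

```python
import random

STATES = ["insensitive", "intermediate", "minor DHS", "major DHS"]

def sample_segments(total_length, shape, scale, means, sd, rng):
    """Draw a labeled signal whose segment lengths are gamma distributed."""
    labels, values = [], []
    state = rng.randrange(len(STATES))
    while len(labels) < total_length:
        # gamma-distributed segment length, rounded to at least one position
        length = max(1, round(rng.gammavariate(shape[state], scale[state])))
        length = min(length, total_length - len(labels))
        labels.extend([state] * length)
        values.extend(rng.gauss(means[state], sd) for _ in range(length))
        # at a change point, move to a different state (uniform choice, an assumption)
        state = rng.choice([s for s in range(len(STATES)) if s != state])
    return labels, values

rng = random.Random(0)
labels, values = sample_segments(
    total_length=200,
    shape=[5.0, 3.0, 2.0, 2.0],      # hypothetical gamma shapes per state
    scale=[10.0, 5.0, 2.0, 2.0],     # hypothetical gamma scales per state
    means=[0.0, 1.0, 2.5, 4.0],      # hypothetical emission means per state
    sd=0.5,
    rng=rng,
)
```

The gamma length distribution is what distinguishes this from a plain HMM, whose state durations are implicitly geometric.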

Spanning the gaps

Beginning in State 1 (Insensitive)

Spanning the gaps

Beginning in State 4 (Major DHS)

Selecting the number of states

Improved fit to the data

Each panel is a QQ plot of the difference between the observed residuals and the theoretical Gaussian.

[Panels: Insensitive, Intermediate sensitivity, Minor DHS, Major DHS]
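The quantity plotted in each panel can be computed roughly as follows. This is a sketch under assumptions: the Gaussian is fitted by sample mean and standard deviation, the plotting positions are (i + 0.5) / n, and the residual values are made up.

```python
from statistics import NormalDist, mean, stdev

def qq_differences(residuals):
    """For each sorted residual, return (theoretical quantile, observed - theoretical)."""
    n = len(residuals)
    fitted = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    # plotting positions (i + 0.5) / n avoid the undefined 0 and 1 quantiles
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return [(t, o - t) for t, o in zip(theoretical, ordered)]

toy_residuals = [0.3, -1.2, 0.8, -0.4, 1.5, -0.1, 0.2]
points = qq_differences(toy_residuals)
```

If the residuals were exactly Gaussian, the second coordinate would stay close to zero across the whole range.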

Capturing different scales

Enrichment of biologically relevant features

Future directions

• Many types of genomic data
  – Phylogenetic conservation scores
  – Various histone modifications
  – Replication timing, etc.

• Perform segmentations in multiple dimensions simultaneously.

• Assign statistical significance to observed segments.

Shotgun proteomics

[Diagram labels: Training PSMs, Probability model, Trained model, Test PSMs, Evaluation]

PSM = peptide-spectrum match

Peptide sequence influences peak height

Bayesian network

• We model peptide fragmentation using a Bayesian network.

• Nodes represent random variables, and edges represent conditional dependencies.

• Each node stores a conditional probability table (CPT) giving Pr(node|parents).

Is b-ion observed? → b-ion intensity

Pr(b-ion intensity | Is b-ion observed?)

                     intensity > 50%   intensity < 50%
no b-ion observed    0.00              1.00
b-ion observed       0.25              0.75
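The two-node network on this slide is small enough to query by hand. The sketch below assumes that the 1.00 and 0.75 entries of the table belong to the "intensity < 50%" column (otherwise an unobserved ion would always have high intensity). The prior `P_OBSERVED` is hypothetical; the slides give no value for it.

```python
# Pr(intensity | is b-ion observed?), read from the slide under the
# column assignment assumed above.
CPT_INTENSITY = {
    False: {"low": 1.00, "high": 0.00},   # no b-ion observed
    True:  {"low": 0.75, "high": 0.25},   # b-ion observed
}
P_OBSERVED = 0.6  # hypothetical prior Pr(b-ion observed), not from the slides

def joint(observed, intensity):
    """Pr(observed, intensity) = Pr(observed) * Pr(intensity | observed)."""
    prior = P_OBSERVED if observed else 1 - P_OBSERVED
    return prior * CPT_INTENSITY[observed][intensity]

def marginal_intensity(intensity):
    """Pr(intensity), summing the joint over the parent node."""
    return sum(joint(obs, intensity) for obs in (True, False))

def posterior_observed(intensity):
    """Pr(b-ion observed | intensity) via Bayes' rule."""
    return joint(True, intensity) / marginal_intensity(intensity)

p_high = marginal_intensity("high")
p_obs_given_high = posterior_observed("high")
p_obs_given_low = posterior_observed("low")
```

Each row of a CPT sums to one, which is a quick sanity check when filling in tables for larger networks.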

Ion series modeled in a Markov chain

[Figure: a chain of five node pairs, each consisting of "Is b-ion observed?" and "b-ion intensity".]

~ PepHMM (Han et al., 2005).

A more realistic model

[Figure: network nodes: Is b-ion observed?, b-ion intensity, N-term AA, C-term AA, Is ion detectable?, Fractional m/z, Is proton mobile?]

Ion series modeled in a Markov chain

$$\mathrm{LOR}_b = \log \frac{\Pr(\text{b-ions}, \text{peptide} \mid \text{model})}{\Pr(\text{b-ions}, \text{peptide} \mid \text{null model})}$$
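A log-odds ratio of this kind can be sketched for a Markov chain over b-ion observations. This is a deliberate simplification: it scores only the observed/not-observed sequence and ignores intensities and peptide sequence, and both parameter sets below are hypothetical.

```python
import math

def chain_log_prob(observed, p_first, p_next):
    """Log-probability of a b-ion series under a first-order Markov chain.

    p_first is Pr(first ion observed); p_next[prev] is
    Pr(ion observed | whether the previous ion was observed).
    """
    log_p = math.log(p_first if observed[0] else 1 - p_first)
    for prev, cur in zip(observed, observed[1:]):
        p = p_next[prev]
        log_p += math.log(p if cur else 1 - p)
    return log_p

def log_odds_ratio(observed, model, null):
    """log Pr(series | model) - log Pr(series | null model)."""
    return chain_log_prob(observed, *model) - chain_log_prob(observed, *null)

model = (0.7, {True: 0.8, False: 0.4})   # hypothetical trained model
null = (0.5, {True: 0.5, False: 0.5})    # hypothetical null: independent coin flips
series = [True, True, False, True, True]
lor_b = log_odds_ratio(series, model, null)
```

A positive value means the series is better explained by the model than by the null. Computing one such ratio per ion type would give a vector per peptide-spectrum match.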

Vectors of log-odds ratios

[Panels: correct peptide-spectrum matches, incorrect peptide-spectrum matches]

Binary classifier
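The slides do not say which binary classifier was used. As a stand-in, the sketch below trains a perceptron on toy log-odds vectors; the data, labels, and function names are all made up for illustration.

```python
def train_perceptron(vectors, labels, epochs=20):
    """Fit a linear decision rule; labels are +1 (correct PSM) or -1 (incorrect)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:
                # misclassified (or on the boundary): move the boundary toward x
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias

def predict(weights, bias, x):
    """Return +1 for a predicted correct PSM, -1 otherwise."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Toy two-dimensional log-odds vectors (hypothetical values).
toy_vectors = [(2.0, 1.5), (1.0, 2.5), (3.0, 0.5), (-1.0, -2.0), (-2.5, -0.5), (-0.5, -1.5)]
toy_labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_perceptron(toy_vectors, toy_labels)
predictions = [predict(weights, bias, x) for x in toy_vectors]
```

Any classifier that accepts fixed-length real vectors could be substituted here.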

Model Evaluation: Accuracy

Model       Redundant TP/FP    Unique TP/FP
Bayes Net   285/300, 95%       137/144, 95.1%
SEQUEST     288/300, 96%       136/144, 94.4%
InsPecT     274/300, 91.3%     131/144, 90.9%
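The percentages can be recomputed from the counts. The sketch treats each entry as a count out of a total, which is what the reported percentages imply; the dictionary layout and names are illustrative.

```python
results = {
    "Bayes Net": {"redundant": (285, 300), "unique": (137, 144)},
    "SEQUEST":   {"redundant": (288, 300), "unique": (136, 144)},
    "InsPecT":   {"redundant": (274, 300), "unique": (131, 144)},
}

def percent(count, total):
    """Fraction of the total, expressed as a percentage."""
    return 100.0 * count / total

for name, sets in results.items():
    row = ", ".join(f"{kind} {c}/{t} = {percent(c, t):.2f}%" for kind, (c, t) in sets.items())
    print(f"{name}: {row}")
```

Printing two decimals makes small differences from the one-decimal figures on the slide visible, for example in the InsPecT unique set.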


An incorrect identification

SEQUEST: LRPGAELLEGAHVGNFVEMK

Bayes net: HQDETQDALNALDLLTNEK

Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2

This peptide does not appear in E. coli, the organism from which this protein sample was derived.

Co-eluting peptides

SEQUEST: AFPEAVLFIHPLDAK

Bayes net: DVFVHFSALQGNQFK

Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2

Future directions

• Build a single Bayesian network that includes all ion types.

• Produce more descriptive outputs from the Bayesian network for input to the classifier.

• Add more biophysical details to the model: chromatography retention time, a better mass-to-charge estimate, etc.

• Generate a better (larger, more accurate) gold standard data set.

Acknowledgments

• DNase I hypersensitivity
  – John Stamatoyannopoulos
  – Pete Sabo
  – Scott Kuehn
  – many others in the Stam lab

• Wavelet analysis: Bob Thurman

• Change point model
  – Charles Lawrence
  – Heng Lian
  – William Thompson

• Mass spectrometry
  – Aaron Klammer
  – Jeff Bilmes
  – Sheila Reynolds
  – Michael MacCoss