Two bioinformatics applications of dynamic Bayesian networks
William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
Outline
• Segmenting genomic data
– Background: DNA, chromatin and DNase I
– Simple solution
– Wavelets
– Hierarchical model
• Matching peptides to mass spectra
– Background: tandem mass spectrometry
– Modeling peptide fragmentation
The human genome in vivo
[Figure: genomic DNA packaged into chromatin within the nucleus. Labels: genes, gene 'domains', DNase I hypersensitive site, trans-factor complex, chromatin fiber.]
Measuring chromatin accessibility
A simple hidden Markov model
• Each state contains a single Gaussian.
• The model has six parameters (two transitions, two means, two standard deviations).
• The parameters are initialized randomly and trained in an unsupervised fashion via expectation-maximization.
• EM is re-started 100 times, and we select the parameters that yield the highest likelihood.
• The original data set is then segmented using either Viterbi or posterior decoding.
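The procedure above can be sketched in a few dozen lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in the talk: the data are simulated, the initial state distribution is fixed at uniform (so that exactly six parameters are trained), only 5 restarts of 30 EM iterations are run to keep the example fast, and all function names are made up for this sketch.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Emission density of every observation under each state, shape (T, 2).
    z = (x[:, None] - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))

def forward_backward(x, A, mu, sigma, pi):
    # Scaled forward-backward pass; returns state posteriors, expected
    # transition counts, and the log-likelihood of the data.
    B = gaussian_pdf(x, mu, sigma) + 1e-300
    T = len(x)
    alpha, beta, c = np.zeros((T, 2)), np.ones((T, 2)), np.zeros(T)
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    xi = (xi / c[1:, None, None]).sum(axis=0)
    return gamma, xi, np.log(c).sum()

def fit_em(x, rng, pi, n_iter=30):
    # One EM run from a random start: 2 transitions, 2 means, 2 std devs.
    stay = rng.uniform(0.5, 0.99, size=2)
    A = np.array([[stay[0], 1 - stay[0]], [1 - stay[1], stay[1]]])
    mu = rng.choice(x, size=2, replace=False)
    sigma = np.full(2, x.std())
    for _ in range(n_iter):
        gamma, xi, _ = forward_backward(x, A, mu, sigma, pi)
        xi = xi + 1e-12
        A = xi / xi.sum(axis=1, keepdims=True)
        w = gamma.sum(axis=0) + 1e-12
        mu = (gamma * x[:, None]).sum(axis=0) / w
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / w
        sigma = np.sqrt(var) + 1e-2  # floor avoids collapsing onto one point
    loglik = forward_backward(x, A, mu, sigma, pi)[2]
    return loglik, A, mu, sigma

def viterbi(x, A, mu, sigma, pi):
    # Most probable state path, computed in log space.
    logB = np.log(gaussian_pdf(x, mu, sigma) + 1e-300)
    logA = np.log(A + 1e-300)
    T = len(x)
    delta = np.log(pi) + logB[0]
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA  # scores[i, j]: state i -> state j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Simulated signal (an assumption of this sketch): alternating "closed"
# and "open" stretches with means 0 and 5 and unit standard deviation.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1, 0, 1, 0], 120)
x = rng.normal(np.where(truth == 1, 5.0, 0.0), 1.0)
pi = np.array([0.5, 0.5])

fits = [fit_em(x, rng, pi) for _ in range(5)]  # random restarts
loglik, A, mu, sigma = max(fits, key=lambda fit: fit[0])
viterbi_path = viterbi(x, A, mu, sigma, pi)
posterior_path = forward_backward(x, A, mu, sigma, pi)[0].argmax(axis=1)
```

Viterbi decoding returns the single most probable state path, whereas posterior decoding picks the most probable state at each position independently. State labels are arbitrary after unsupervised training, so the decoded labels may be swapped relative to the simulated ones.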
[Figure: segmentation of a 1.5-megabase region into open chromatin and closed chromatin.]
A problem, and two solutions
• Problem: We are interested in phenomena occurring at multiple scales.
• Solution #1: Perform a wavelet smooth prior to HMM analysis.
• Solution #2: Build a more complex probability model.
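As a rough illustration of Solution #1, the sketch below smooths a signal at a chosen scale before it would be passed to the HMM. The slides do not say which wavelet was used; the Haar wavelet is assumed here because its approximation at level k reduces to averaging over blocks of 2^k samples. The function name `haar_smooth` and the decision to drop a trailing partial block are choices made for this example only.

```python
import numpy as np

def haar_smooth(signal, level):
    """Haar approximation at the given level: keep the scaling coefficients
    and discard all detail coefficients, which is equivalent to replacing
    each block of 2**level samples by its mean."""
    x = np.asarray(signal, dtype=float)
    block = 2 ** level
    n = len(x) - len(x) % block  # assumption: drop a trailing partial block
    means = x[:n].reshape(-1, block).mean(axis=1)
    return np.repeat(means, block)

raw = [1, 3, 5, 7, 2, 4, 6, 8]
fine = haar_smooth(raw, 1)    # averages over pairs of samples
coarse = haar_smooth(raw, 2)  # averages over blocks of four samples
```

Running the HMM on the output at several levels gives one segmentation per scale, which is the motivation for the wavelet approach.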
Change point model
• Four-state model:
– major DNase hypersensitive site (DHS),
– minor DHS,
– intermediate sensitivity region, and
– insensitive region.
• Continuous mixture of Gaussians at each state.
• Gamma distribution of lengths within each region.
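To make the generative assumptions concrete, the sketch below simulates data from a simplified version of such a model: four states, segment lengths drawn from a gamma distribution, and Gaussian signal within each segment. It is not the model from the talk. Each state emits from a single Gaussian rather than a continuous mixture, the next state is chosen uniformly among the other three, and every numeric value (state means, gamma shape and scale) is hypothetical.

```python
import numpy as np

STATES = ["insensitive", "intermediate", "minor DHS", "major DHS"]
STATE_MEANS = np.array([0.0, 1.0, 2.5, 5.0])  # hypothetical signal levels

def simulate_segments(n_segments, rng, shape=2.0, scale=50.0):
    """Draw a piecewise signal: each segment has a gamma-distributed length
    and Gaussian observations centered on its state's mean."""
    labels, signal = [], []
    state = int(rng.integers(len(STATES)))
    for _ in range(n_segments):
        length = max(1, int(round(rng.gamma(shape, scale))))
        labels.append(np.full(length, state))
        signal.append(rng.normal(STATE_MEANS[state], 1.0, size=length))
        # Move to a different state so that every boundary is a change point.
        others = [s for s in range(len(STATES)) if s != state]
        state = int(rng.choice(others))
    return np.concatenate(labels), np.concatenate(signal)

rng = np.random.default_rng(1)
labels, signal = simulate_segments(20, rng)
change_points = np.flatnonzero(np.diff(labels)) + 1
```

Unlike a plain HMM, whose segment lengths are implicitly geometric, an explicit length distribution lets the model favor segments of a characteristic size.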
Spanning the gaps
Beginning in State 1 (Insensitive)
Spanning the gaps
Beginning in State 4 (Major DHS)
Selecting the number of states
Improved fit to the data
Each panel is a QQ plot comparing the observed residuals with the theoretical Gaussian.
[Figure panels: insensitive, intermediate sensitivity, minor DHS, major DHS.]
Capturing different scales
Enrichment of biologically relevant features
Future directions
• Many types of genomic data
– Phylogenetic conservation scores
– Various histone modifications
– Replication timing, etc.
• Perform segmentations in multiple dimensions simultaneously.
• Assign statistical significance to observed segments.
Shotgun proteomics
[Diagram labels: training PSMs, probability model, trained model, test PSMs, evaluation.]
PSM = peptide-spectrum match
Peptide sequence influences peak height
Bayesian network
• We model peptide fragmentation using a Bayesian network.
• Nodes represent random variables, and edges represent conditional dependencies.
• Each node stores a conditional probability table (CPT) giving Pr(node|parents).
[Figure: two nodes, "Is b-ion observed?" and "b-ion intensity", with an example CPT:]
1.00   0.00   no b-ion observed
0.75   0.25   b-ion observed
intensity > 50%   intensity < 50%
Ion series modeled in a Markov chain
[Figure: five repeated pairs of nodes, each pair consisting of "Is b-ion observed?" and "b-ion intensity".]
~ PepHMM (Han et al., 2005).
A more realistic model
[Figure nodes: Is b-ion observed?, b-ion intensity, N-term AA, C-term AA, Is ion detectable?, Fractional m/z, Is proton mobile?]
Ion series modeled in a Markov chain
$$\mathrm{LOR}_b = \log \frac{\Pr(\text{b-ions}, \text{peptide} \mid \text{model})}{\Pr(\text{b-ions}, \text{peptide} \mid \text{null model})}$$
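A log-odds ratio of this kind can be illustrated with a deliberately reduced model. In the sketch below, the only variables are the binary "Is b-ion observed?" nodes, linked in a first-order Markov chain along the peptide; the peptide-dependent nodes and the intensity nodes are omitted. All probabilities are hypothetical, the null model is assumed to treat every ion as an independent coin flip, and the function names are invented for this example.

```python
import math

def chain_log_prob(observed, p_first, cpt):
    """Log-probability of a 0/1 sequence under a first-order Markov chain.
    p_first[v] = Pr(first node = v); cpt[prev][cur] = Pr(cur | prev)."""
    logp = math.log(p_first[observed[0]])
    for prev, cur in zip(observed, observed[1:]):
        logp += math.log(cpt[prev][cur])
    return logp

def log_odds_ratio(observed, model, null):
    """log Pr(observed | model) - log Pr(observed | null model)."""
    return chain_log_prob(observed, *model) - chain_log_prob(observed, *null)

# Hypothetical parameters: under the model, an observed b-ion makes the next
# b-ion more likely to be observed; under the null, ions are independent.
model = ([0.4, 0.6], [[0.7, 0.3], [0.2, 0.8]])
null = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])

all_observed = [1, 1, 1, 1, 1]  # a complete b-ion ladder
with_gap = [1, 1, 1, 0, 1]      # one missing fragment

lor_complete = log_odds_ratio(all_observed, model, null)
lor_gap = log_odds_ratio(with_gap, model, null)
```

With these made-up numbers, the complete ladder scores a positive log-odds ratio and the ladder with a gap scores a negative one, which is the kind of contrast a downstream classifier can exploit.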
Vectors of log-odds ratios
[Figure panels: correct peptide-spectrum matches; incorrect peptide-spectrum matches.]
Binary classifier
Model Evaluation: Accuracy
Model       Redundant TP/FP    Unique TP/FP
Bayes Net   285/300, 95%       137/144, 95.1%
SEQUEST     288/300, 96%       136/144, 94.4%
InsPecT     274/300, 91.3%     131/144, 90.9%
An incorrect identification
SEQUEST: LRPGAELLEGAHVGNFVEMK
Bayes net: HQDETQDALNALDLLTNEK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2
This peptide does not appear in E. coli, the organism from which this protein sample was derived.
Co-eluting peptides
SEQUEST: AFPEAVLFIHPLDAK
Bayes net: DVFVHFSALQGNQFK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2
Future directions
• Build a single Bayesian network that includes all ion types.
• Produce more descriptive outputs from the Bayesian network for input to the classifier.
• Add more biophysical details to the model: chromatography retention time, a better mass-to-charge estimate, etc.
• Generate a better (larger, more accurate) gold standard data set.
Acknowledgments
• DNase I hypersensitivity
– John Stamatoyannopoulos
– Pete Sabo
– Scott Kuehn
– many others in the Stam lab
• Wavelet analysis: Bob Thurman
• Change point model
– Charles Lawrence
– Heng Lian
– William Thompson
• Mass spectrometry
– Aaron Klammer
– Jeff Bilmes
– Sheila Reynolds
– Michael MacCoss