Date posted: 11-May-2015
Programming, Logic and Intelligent Systems plis.ruc.dk CBIT Roskilde University Denmark
Logic-Statistic Models with Constraints for Biological Sequence Analysis
Christian Theil Have, <[email protected]>
Motivation and outline
● Short motivation and introduction to biological sequence analysis
● Different ways of integrating constraints with probabilistic models
● Combining models with constraints
Biological sequence analysis
The basic problems:
• Alignment of biological sequences
• Phylogeny
➔ Gene prediction
● RNA secondary structure prediction
● Protein structure prediction
● Protein function prediction
We focus on gene prediction for now...
Biological sequence analysis
• Gene prediction: Predict genes and non-genes in a DNA sequence
  ● DNA is composed of nucleotides: A, T, G, C
  ● Genes are sequences of triplets of nucleotides, called codons
  ● Genes can occur in both strands in three different frames
  ● Specific start codons signal a possible beginning of a gene
  ● Specific stop codons definitively signal the end of a gene
AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  ● There are three possible genes in this sample (in this frame on this strand).
  ● In general, a DNA sequence has an exponential number of different possible gene compositions.
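The codon, start codon and stop codon rules can be sketched as a scan over a single reading frame. This is only an illustration, not a gene finder. It assumes ATG as the only start codon and TAA, TAG, TGA as stop codons, which the slides do not specify, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons
START_CODONS = {"ATG"}               # assumption: only the canonical start codon

def codons(dna, frame=0):
    """Split dna into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]

def orfs_in_frame(dna, frame=0):
    """Return (start, stop) codon indices of candidate genes in one frame.

    Every start codon seen since the previous stop codon gives one candidate
    that ends at the next stop codon.
    """
    found, open_starts = [], []
    for pos, codon in enumerate(codons(dna, frame)):
        if codon in START_CODONS:
            open_starts.append(pos)
        elif codon in STOP_CODONS:
            found.extend((start, pos) for start in open_starts)
            open_starts = []
    return found

# The sample sequence from the slide, written as codons and then joined.
sample = "".join("""
AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC
ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT
AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
""".split())

candidates = orfs_in_frame(sample, frame=0)
print(candidates)
```

With ATG as the only start codon the scan reports a single candidate in this frame. The count of three on the slide depends on which start codons are admitted, and the extracted text does not list them.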
Biological sequence analysis, tools of the trade
● Statistical models (in order of expressive power)
  ● Hidden Markov Models
  ● Probabilistic Context Free Grammars
  ● Probabilistic Context Sensitive Grammars
  ● Stochastic Definite Clause Grammars
● All these can be modeled in PRISM
  ● Probabilistic extension of Prolog
● Problems:
  ● Computational complexity of inference
  ● Extremely large sequences
  ● Use of more expressive models infeasible
● Essential: Enforce the right independence assumptions
  ● Limit the number of conditional probabilities
Gene finding with Hidden Markov Models
• Hidden Markov Models (HMMs) are commonly used for gene prediction
• A Hidden Markov Model is a quadruple <S, A, T, E>
  • S is a set of states
  • A is a set of emission symbols
  • T is a set of transition probabilities
  • E is a set of emission probabilities
• An observation is a sequence of emissions
• Transition and emission probabilities can be derived from sample observations through parameter estimation
• Decoding finds the most probable sequence of states corresponding to an observation
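As a rough illustration of parameter estimation, the sketch below derives T and E by relative-frequency counting. It assumes labelled training data in which the state of every position is known. With unlabelled observations an EM procedure such as Baum-Welch would be needed instead. The state names N and C and the training sequences are made up for the example.

```python
from collections import Counter, defaultdict

def estimate(labelled_sequences):
    """Relative-frequency estimates of T and E from (state, symbol) sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in labelled_sequences:
        for state, symbol in seq:
            emit[state][symbol] += 1
        # Count each transition between consecutive positions.
        for (prev, _), (nxt, _) in zip(seq, seq[1:]):
            trans[prev][nxt] += 1

    def normalise(table):
        return {s: {k: n / sum(row.values()) for k, n in row.items()}
                for s, row in table.items()}

    return normalise(trans), normalise(emit)

# Hypothetical training data: (state, emitted nucleotide) pairs.
training = [
    [("N", "A"), ("N", "T"), ("C", "G"), ("C", "G"), ("N", "A")],
    [("N", "T"), ("C", "C"), ("C", "G"), ("C", "C"), ("N", "T")],
]
T, E = estimate(training)
print(T)
print(E)
```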
Gene finding with Hidden Markov Models
Example: Toy HMM for gene finding.
[Figure: toy HMM for gene finding]
Decoding: The Viterbi algorithm
Finding the most probable path for a given sequence:
argmax P(state sequence | observation)
Method:
• Incrementally keep track of the most probable path to a given state
• Dynamic programming (tabling in Prolog/PRISM)
[Figure: states plotted against time steps (observation)]
Time complexity O(|states|^2 * |observation|) in general; O(|states| * |observation|) when each state has a bounded number of incoming transitions
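A minimal Viterbi sketch in Python follows (the slides use tabling in PRISM, which is not shown here). The two-state model with states N and C and all its probabilities are assumptions made for the example. Since P(observation) is fixed, maximising P(state sequence | observation) is the same as maximising the joint probability, which is what the code does.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable state path for obs, computed in log space."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log probability, path) of the best path ending in state s.
    # Whole paths are stored for brevity; back-pointers would be cheaper.
    best = {s: (lg(start[s]) + lg(emit[s].get(obs[0], 0)), [s]) for s in states}
    for symbol in obs[1:]:
        step = {}
        for s in states:
            # Best predecessor of s: this inner max is the |states|^2 factor.
            score, path = max(
                (best[p][0] + lg(trans[p].get(s, 0)), best[p][1]) for p in states
            )
            step[s] = (score + lg(emit[s].get(symbol, 0)), path + [s])
        best = step
    score, path = max(best.values())
    return path, score

# Hypothetical toy model: N = non-coding, C = coding.
states = ("N", "C")
start = {"N": 0.9, "C": 0.1}
trans = {"N": {"N": 0.7, "C": 0.3}, "C": {"N": 0.3, "C": 0.7}}
emit = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
path, score = viterbi("AAGG", states, start, trans, emit)
print(path, math.exp(score))
```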
Predicting is decoding
Decoding of an HMM may be considered as an optimization problem:
● We have a set of variables T0 .. Tn, one for each time step
● A set of constraints, C, on these variables
● Goal: Optimize P(state sequence | observation), subject to C
[Figure: variables T0, T1, T2, T3, ..., Tn along the time steps (observation), plotted against states]
➔ Accomplished with the Viterbi algorithm using DP, in O(|states|^2 * |observation|) in general and O(|states| * |observation|) when each state has a bounded number of incoming transitions
A state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S has an emission corresponding to the emission at step i of the observation
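The domain rule can be sketched as a forward pass over the observation. The rule only covers the step from Ti-1 to Ti, so the base case for T0 (states with non-zero start probability that can emit the first symbol) is an assumption, as is the small model used here.

```python
def domains(obs, states, start, trans, emit):
    """Domain of each variable T_i: states reachable at step i that can emit obs[i]."""
    # Assumed base case: states allowed to start and able to emit obs[0].
    current = {s for s in states
               if start.get(s, 0) > 0 and emit[s].get(obs[0], 0) > 0}
    result = [current]
    for symbol in obs[1:]:
        current = {
            s for s in states
            if emit[s].get(symbol, 0) > 0
            and any(trans[p].get(s, 0) > 0 for p in current)
        }
        result.append(current)
    return result

# Hypothetical model: N emits only A and T, C emits only G, C and A.
states = ("N", "C")
start = {"N": 1.0}
trans = {"N": {"N": 0.8, "C": 0.2}, "C": {"N": 0.2, "C": 0.8}}
emit = {"N": {"A": 0.5, "T": 0.5}, "C": {"G": 0.4, "C": 0.4, "A": 0.2}}

result = domains("ATGA", states, start, trans, emit)
print(result)
```

An empty domain at any step would mean that the model cannot produce the observation at all.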
Constraints as model structure
● The structure of the HMM consists of
  ● states
  ● allowed transitions between these states
  ● possible emissions from these states
● The structure of the HMM defines a regular language
  ● Can model only regular languages, but...
  ● Not all regular languages can be modeled equally compactly
  ● Some regular languages require an exponential number of states
Consider a fully connected automaton with only N states, and the constraint:
All-different: No state visited more than once
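To get a feel for the blow-up, the sketch below counts the states of one straightforward way to build all-different into the automaton itself, where every expanded state remembers the set of states visited so far. It assumes any state may be the start state and is meant as an illustration of the growth, not as a proof of a lower bound.

```python
def expanded_states(n):
    """Count states of an automaton that enforces all-different by construction.

    Each expanded state pairs the current state with the set of states visited
    so far, since a plain automaton has no other way to remember its history.
    """
    frontier = {(s, frozenset([s])) for s in range(n)}
    seen = set(frontier)
    while frontier:
        nxt = set()
        for current, visited in frontier:
            for target in range(n):        # fully connected automaton
                if target not in visited:  # the all-different constraint
                    state = (target, visited | {target})
                    if state not in seen:
                        seen.add(state)
                        nxt.add(state)
        frontier = nxt
    return len(seen)

for n in (3, 4, 10):
    print(n, expanded_states(n))
```

For this construction the count is N * 2^(N-1), so 10 original states already expand to 5120.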
Side-constraints
[Figure: a statistical model combined with side-constraints]
Side-constraints:
● Constraints which are not embedded in the model.
● Delimit allowed derivations.
Advantages
✔ Convenient method of expression
✔ Can express non-regular languages
✔ Does not affect the number of states
Problems
✗ Models with constraints can fail
  ✗ Probability mass disappears
✗ Complicates model inference
  ✗ ERF & Baum-Welch derive wrong distributions
✗ Decoding must adhere to constraints
  ✗ Constraint solving techniques needed
  ✗ NP-complete in the general case
Possible solutions
Parameter learning:
● Training with fgEM / failure-adjusted maximization
  ● Requires failure estimates
● Apply soft constraints: they do not fail
Inference:
● Incremental constraint solving
● Local constraints
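The "probability mass disappears" problem can be made concrete by brute force on a tiny model. The sketch enumerates every state path of a uniform three-state chain (an assumed toy model), keeps only the paths that satisfy all-different, and renormalises by the success probability. Enumeration is exponential and is used here only to show the effect. It is not the fgEM procedure.

```python
from itertools import product

def path_probability(path, start, trans):
    """Probability of a state path under a first-order Markov chain."""
    p = start[path[0]]
    for a, b in zip(path, path[1:]):
        p *= trans[a][b]
    return p

def constrained_distribution(states, start, trans, length, constraint):
    """Return (success mass, renormalised distribution over allowed paths)."""
    mass = {}
    for path in product(states, repeat=length):
        if constraint(path):
            mass[path] = path_probability(path, start, trans)
    success = sum(mass.values())
    return success, {path: p / success for path, p in mass.items()}

def all_different(path):
    return len(set(path)) == len(path)

# Hypothetical uniform model over three states.
states = ("a", "b", "c")
uniform = {s: 1 / 3 for s in states}
start = uniform
trans = {s: uniform for s in states}

success, dist = constrained_distribution(states, start, trans, 3, all_different)
print(success)  # mass of derivations that do not fail
```

Only 6 of the 27 equally likely paths survive, so the unnormalised model loses 7/9 of its probability mass to failure.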
Side-constraints
Example: Fixing known genes
[Figure: state lattice with states S, C, N and E over a DNA sequence that contains a known gene]
● Difficult/expensive to model with model structure
  ● HMM needs to do position counting => many states required!
● Easy to model with side-constraints
  ● Local constraint: Affects only a limited-size sequential set of variables
  ● Decoding possible in linear time complexity
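Fixing known positions can be sketched as Viterbi with restricted domains: at a fixed position only the required state is considered, so the cost per step does not grow. The model, the observation and the `fixed` mapping (position to required state) are assumptions made for this example.

```python
import math

def constrained_viterbi(obs, states, start, trans, emit, fixed):
    """Viterbi where fixed maps a position to the state required there."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    def allowed(i):
        # Local constraint: a fixed position has a single-state domain.
        return [fixed[i]] if i in fixed else list(states)

    best = {s: (lg(start[s]) + lg(emit[s].get(obs[0], 0)), [s])
            for s in allowed(0)}
    for i in range(1, len(obs)):
        step = {}
        for s in allowed(i):
            score, path = max(
                (best[p][0] + lg(trans[p].get(s, 0)), best[p][1]) for p in best
            )
            step[s] = (score + lg(emit[s].get(obs[i], 0)), path + [s])
        best = step
    score, path = max(best.values())
    return path, score

# Hypothetical toy model: N = non-coding, C = coding.
states = ("N", "C")
start = {"N": 0.9, "C": 0.1}
trans = {"N": {"N": 0.7, "C": 0.3}, "C": {"N": 0.3, "C": 0.7}}
emit = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

free_path, _ = constrained_viterbi("AAGG", states, start, trans, emit, fixed={})
fixed_path, fixed_score = constrained_viterbi(
    "AAGG", states, start, trans, emit, fixed={1: "C"}
)
print(free_path, fixed_path)
```

Requiring position 1 to be coding changes the decoded path while the rest of the path is still chosen to maximise probability.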
Combining models
[Figure: gene predictor A and gene predictor B produce the gene sets A and B]
Combine the predictions of several models to form more accurate predictions.
Obvious approaches:
● Union
  ● Many false positives
  ● Conflicts
● Intersection / majority voting
  ● Lowest common denominator
  ● Throws away the most interesting predictions
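The obvious approaches can be sketched with plain set operations. The sketch assumes each prediction is a (start, end) position pair on the same strand and that two overlapping predictions count as a conflict. The gene coordinates are invented for the example.

```python
from collections import Counter

def union(*predictions):
    return set().union(*predictions)

def intersection(*predictions):
    return set.intersection(*map(set, predictions))

def majority(*predictions):
    """Genes predicted by more than half of the predictors."""
    votes = Counter(gene for pred in predictions for gene in set(pred))
    return {gene for gene, n in votes.items() if n > len(predictions) / 2}

def conflicts(genes):
    """Pairs of predicted genes whose position ranges overlap."""
    ordered = sorted(genes)
    return [(g, h) for i, g in enumerate(ordered)
            for h in ordered[i + 1:] if h[0] <= g[1]]

# Hypothetical predictions from three gene predictors.
a_genes = {(10, 100), (150, 300), (400, 520)}
b_genes = {(10, 100), (160, 300), (600, 720)}
c_genes = {(10, 100), (150, 300)}

print(len(union(a_genes, b_genes)))         # every prediction is kept
print(conflicts(union(a_genes, b_genes)))   # overlapping predictions
print(intersection(a_genes, b_genes))       # only the agreed gene survives
print(majority(a_genes, b_genes, c_genes))
```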
Combining models with constraints
We need to know the strengths of individual models to define better constraints...
Combining models with constraints
Issues to consider:
● Ability to combine both blackbox and whitebox models
● The nature of the combination constraints
  ● Uncertainty
  ● Lack of knowledge: what are the right constraints...
  ● Induction
Some possible ways to represent combination constraints being considered:
● Hard constraints
  ● Inability to handle uncertainty
● Factorial Hidden Markov Models
  ● Probability distribution defines how much to listen to each model
  ● Throws away information: What model contributed what?
  ● Expensive to train
● Bayesian networks
  ● Model probabilistic constraints
  ● We can model sequences with Dynamic Bayesian Networks
● Soft constraints
  ● Possibly good complement to probabilistic inference
● Co-training
  ● Use the models to train each other
Outlook
● Formulating biosequence problems in terms of constraints
● Integrating these constraints in probabilistic models
● Tradeoffs between constraint representations
  ● Finding the right balance...
● Combining models with constraints
● Inference and parameter estimation in mixed models