ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Constraints for Biological...

Date post: 11-May-2015
Upload: christian-have
Transcript
Page 1: ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Constraints for Biological Sequence Analysis

Programming, Logic and Intelligent Systems plis.ruc.dk CBIT Roskilde University Denmark

Logic-Statistic Models with Constraints for Biological Sequence Analysis

Christian Theil Have, <[email protected]>

Page 2

Motivation and outline

● Short motivation and introduction to biological sequence analysis
● Different ways of integrating constraints with probabilistic models
● Combining models with constraints

Page 3

Biological sequence analysis

The basic problems:

● Alignment of biological sequences
● Phylogeny
● Gene prediction
● RNA secondary structure prediction
● Protein structure prediction
● Protein function prediction

Page 4

Biological sequence analysis

The basic problems:

● Alignment of biological sequences
● Phylogeny
➔ Gene prediction
● RNA secondary structure prediction
● Protein structure prediction
● Protein function prediction

We focus on gene prediction for now...

Page 5

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C

AATATAGGCATAGCGCACAGACAGATAAAAATTACA GAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGAAATATAGGCATAGCGCACAGACAGATA

Page 6

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA

Page 7

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons
● Genes can occur in both strands in three different frames

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA

Page 8

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons
● Genes can occur in both strands in three different frames
● Specific start codons signal a possible beginning of a gene

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA

Page 9

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons
● Genes can occur in both strands in three different frames
● Specific start codons signal a possible beginning of a gene
● Specific stop codons definitively signal the end of a gene

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA

Page 10

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons
● Genes can occur in both strands in three different frames
● Specific start codons signal a possible beginning of a gene
● Specific stop codons definitively signal the end of a gene

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA

● There are three possible genes in this sample (in this frame on this strand).

Page 14

Biological sequence analysis

● Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons
● Genes can occur in both strands in three different frames
● Specific start codons signal a possible beginning of a gene
● Specific stop codons definitively signal the end of a gene

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA

● There are three possible genes in this sample (in this frame on this strand).
● In general, DNA sequences have an exponential number of different gene compositions.
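
The rules on this slide (codons, three frames on two strands, start and stop codons) can be sketched in Python. This is only an illustration, not the author's gene finder. The slides do not name the codons, so the sets below (ATG as start; TAA, TAG, TGA as stop) are an assumption, and all function names are hypothetical.

```python
# Assumed codon sets: the slides do not list them.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return dna.translate(COMPLEMENT)[::-1]


def candidate_genes(dna: str, frame: int) -> list[tuple[int, int]]:
    """List (start, end) index pairs of candidate genes in one reading frame.

    Every start codon followed in-frame by a stop codon yields one candidate;
    `end` is the index just past the stop codon.
    """
    candidates = []
    open_starts = []  # start codons seen since the last stop codon
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if codon in START_CODONS:
            open_starts.append(pos)
        elif codon in STOP_CODONS:
            candidates.extend((s, pos + 3) for s in open_starts)
            open_starts = []
    return candidates


def all_candidates(dna: str) -> dict[tuple[str, int], list[tuple[int, int]]]:
    """Scan both strands in all three frames."""
    strands = {"+": dna, "-": reverse_complement(dna)}
    return {(name, frame): candidate_genes(seq, frame)
            for name, seq in strands.items() for frame in range(3)}
```

A gene prediction then has to pick a consistent subset of such candidates, which is where the number of possible compositions grows quickly.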

Page 15

Biological sequence analysis, tools of the trade

● Statistical models (in order of expressive power)
  ● Hidden Markov Models
  ● Probabilistic Context Free Grammars
  ● Probabilistic Context Sensitive Grammars
  ● Stochastic Definite Clause Grammars
● All these can be modeled in PRISM
  ● Probabilistic extension of Prolog
● Problems:
  ● Computational complexity of inference
  ● Extremely large sequences
  ● Use of more expressive models infeasible
● Essential: Enforce the right independence assumptions
  ● Limit the number of conditional probabilities

Page 16

Gene-finding with Hidden Markov Models

• Hidden Markov Models (HMMs) are commonly used for gene prediction
• A Hidden Markov Model is a quadruple <S, A, T, E>
  • S is a set of states
  • A is a set of emission symbols
  • T is a set of transition probabilities
  • E is a set of emission probabilities
• An observation is a sequence of emissions
• Transition and emission probabilities can be derived from sample observations through parameter estimation
• Decoding finds the most probable sequence of states corresponding to an observation
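
As a rough illustration, the quadruple <S, A, T, E> could be held in a small Python structure like the one below. The class, its field names and the toy numbers are hypothetical. The `start` distribution is an extra assumption that the slide's definition does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HMM:
    """An HMM as the quadruple <S, A, T, E>, plus an assumed start distribution."""
    states: tuple[str, ...]                    # S
    alphabet: tuple[str, ...]                  # A
    transitions: dict[str, dict[str, float]]   # T[s][s2] = P(s2 | s)
    emissions: dict[str, dict[str, float]]     # E[s][a]  = P(a | s)
    start: dict[str, float]                    # P(first state); not in the slide

    def check(self) -> None:
        """Raise ValueError unless every distribution sums to 1."""
        rows = [self.start, *self.transitions.values(), *self.emissions.values()]
        for row in rows:
            if abs(sum(row.values()) - 1.0) > 1e-9:
                raise ValueError(f"distribution does not sum to 1: {row}")


# Hypothetical toy parameters: 'c' = coding, 'n' = non-coding.
toy = HMM(
    states=("c", "n"),
    alphabet=("A", "C", "G", "T"),
    transitions={"c": {"c": 0.9, "n": 0.1}, "n": {"c": 0.2, "n": 0.8}},
    emissions={"c": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
               "n": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}},
    start={"c": 0.5, "n": 0.5},
)
toy.check()
```

Parameter estimation would fill `transitions` and `emissions` from sample observations; here they are simply written by hand.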

Page 17

Gene-finding with Hidden Markov Models

Example: Toy HMM for gene-finding.

Page 18

Decoding: The Viterbi algorithm

Finding the most probable path for a given sequence:

argmax P(state sequence | observation)

Method:
• Incrementally keep track of the most probable path to a given state
• Dynamic programming (tabling in Prolog/PRISM)

[Figure: trellis with time steps (observation) on one axis and states on the other]

Time complexity: O(|states| * |observation|) when each state has a bounded number of predecessors; O(|states|^2 * |observation|) for a fully connected model
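
The method can be sketched in plain Python with a dictionary standing in for the dynamic programming table. This is an illustration, not the PRISM implementation; the function name and argument layout are assumptions. For every time step it tries each predecessor of each state, so for a fully connected model the cost per step is quadratic in the number of states, and with sparse transitions it is proportional to the number of allowed transitions.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return (log-probability, path) of the most probable state sequence.

    start[s], trans[s][s2] and emit[s][symbol] are probabilities; missing
    entries count as probability 0.
    """
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    # best[s] = (log-probability, path) of the best path ending in state s
    best = {s: (log(start.get(s, 0)) + log(emit[s].get(obs[0], 0)), [s])
            for s in states}
    for symbol in obs[1:]:
        step = {}
        for s in states:
            # Extend the best predecessor; earlier columns are never revisited.
            score, path = max(
                (best[p][0] + log(trans[p].get(s, 0)), best[p][1])
                for p in states)
            step[s] = (score + log(emit[s].get(symbol, 0)), path + [s])
        best = step
    return max(best.values())
```

Paths are copied at every step to keep the sketch short; a table of back-pointers would avoid that extra cost.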

Page 19

Predicting is decoding

Decoding of an HMM may be considered as an optimization problem:
● We have a set of variables T0 .. Tn, one for each time step
● A set of constraints, C, on these variables
● Goal: Optimize P(state sequence | observation), subject to C

[Figure: trellis of states against time steps (observation), with one variable per time step: T0, T1, T2, T3, ..., Tn]

➔ Accomplished with the Viterbi algorithm using DP, in O(|states| * |observation|) for sparsely connected models (O(|states|^2 * |observation|) when fully connected)

A state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S has an emission corresponding to the emission in the observation at step i
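
The domain rule can be read as a forward propagation over the variables T0 .. Tn. The sketch below is one possible reading, with hypothetical names. The slide does not say how the domain of T0 is obtained, so a given set of start states is assumed.

```python
def domains(obs, start_states, transitions, emissions):
    """Compute the domain of each variable T0..Tn by forward propagation.

    transitions[s] is the set of states reachable from s in one step;
    emissions[s] is the set of symbols that state s can emit.
    """
    result = []
    previous = None
    for symbol in obs:
        if previous is None:
            candidates = set(start_states)  # assumption: T0 starts here
        else:
            # States with a transition from some state in the previous domain.
            candidates = {s for p in previous for s in transitions.get(p, ())}
        # Keep only the states that can emit the observed symbol.
        previous = {s for s in candidates if symbol in emissions.get(s, ())}
        result.append(previous)
    return result
```

An empty domain at any step means no state sequence can produce the observation, so decoding would fail.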

Page 20

Constraints as model structure

● The structure of the HMM consists of
  ● states
  ● allowed transitions between these states
  ● possible emissions from these states
● The structure of the HMM defines a regular language
  ● Can (only) model regular languages, but..
  ● Not all regular languages can be modeled equally compactly
  ● Some regular languages require an exponential number of states

Consider a fully connected automaton with only N states:

All-different: No state visited more than once
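
One way to see the blow-up: to enforce all-different inside the model structure, each state has to remember which states were already visited. The sketch below counts the states of that straightforward encoding for a fully connected automaton. The slide does not give this construction, so it is an assumption, and the function name is hypothetical.

```python
def expanded_state_count(n: int) -> int:
    """Count the states needed to enforce all-different inside the automaton.

    The original automaton is fully connected over states 0..n-1. Each
    expanded state pairs the current state with the set of visited states.
    """
    frontier = {(s, frozenset([s])) for s in range(n)}
    seen = set(frontier)
    while frontier:
        next_frontier = set()
        for current, visited in frontier:
            for target in range(n):
                if target not in visited:  # all-different: never revisit
                    node = (target, visited | {target})
                    if node not in seen:
                        seen.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return len(seen)
```

For this encoding the count is N * 2^(N-1): 12 expanded states for N = 3 and 5120 for N = 10.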

Page 21

Side-constraints

[Figure: Statistical Model, Side-Constraints]

Side-constraints:
● Constraints which are not embedded in the model.
● Delimit allowed derivations.

Page 22

Side-constraints

Advantages
✔ Convenient method of expression
✔ Can express non-regular languages
✔ Does not affect the number of states

[Figure: Statistical Model, Side-Constraints]

Side-constraints:
● Constraints which are not embedded in the model.
● Delimit allowed derivations.

Page 23

Side-constraints

Advantages
✔ Convenient method of expression
✔ Can express non-regular languages
✔ Does not affect the number of states

Problems
✗ Models with constraints can fail
  ✗ Probability mass disappears
✗ Complicates model inference
  ✗ ERF & Baum-Welch derive wrong distributions
✗ Decoding must adhere to constraints
  ✗ Constraint solving techniques needed
  ✗ NP-complete in the general case

[Figure: Statistical Model, Side-Constraints]

Side-constraints:
● Constraints which are not embedded in the model.
● Delimit allowed derivations.
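
The disappearing probability mass can be made concrete with a toy example. The sketch below enumerates all state paths of a small Markov chain (emissions are left out for brevity) and sums the probability of those that satisfy a side-constraint. The chain, its numbers and the function names are hypothetical, and brute-force enumeration is only workable for tiny examples.

```python
from itertools import product


def path_probability(path, start, trans):
    """Probability of a state path under a first-order Markov chain."""
    p = start[path[0]]
    for a, b in zip(path, path[1:]):
        p *= trans[a][b]
    return p


def success_mass(length, start, trans, constraint):
    """Total probability of the paths that satisfy the side-constraint."""
    return sum(path_probability(path, start, trans)
               for path in product(start, repeat=length)
               if constraint(path))


def all_different(path):
    """Side-constraint: no state visited more than once."""
    return len(set(path)) == len(path)


# Hypothetical two-state chain; the numbers are illustrative only.
start = {"c": 0.5, "n": 0.5}
trans = {"c": {"c": 0.5, "n": 0.5}, "n": {"c": 0.5, "n": 0.5}}

unconstrained = success_mass(2, start, trans, lambda path: True)  # 1.0
constrained = success_mass(2, start, trans, all_different)        # 0.5
failure = 1.0 - constrained  # probability mass lost to failed derivations
```

Half of the mass belongs to derivations that the constraint rejects, so estimates that ignore failure are computed against a distribution that no longer sums to one.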

Page 24

Side-constraints

Advantages
✔ Convenient method of expression
✔ Can express non-regular languages
✔ Does not affect the number of states

Problems
✗ Models with constraints can fail
  ✗ Probability mass disappears
✗ Complicates model inference
  ✗ ERF & Baum-Welch derive wrong distributions
✗ Decoding must adhere to constraints
  ✗ Constraint solving techniques needed
  ✗ NP-complete in the general case

Possible solutions

Parameter learning:
● Training with fgEM / Failure-adjusted maximization
  ● Requires failure estimates
● Apply soft constraints: these do not fail

Inference:
● Incremental constraint-solving
● Local constraints

[Figure: Statistical Model, Side-Constraints]

Side-constraints:
● Constraints which are not embedded in the model.
● Delimit allowed derivations.

Page 25

Example: Fixing known genes

[Figure: trellis of states labelled C and N over the DNA, with the states S, C and E marked along the known gene]

● Difficult/expensive to model with model structure
  ● HMM needs to do position counting => many states required!
● Easy to model with side-constraints
  ● Local constraint: Affects only a limited-size sequential set of variables
  ● Decoding possible in linear time complexity
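
A minimal sketch of decoding under such a local constraint, assuming the known gene is given as a mapping from positions to required states. The `fixed` argument and the function name are hypothetical. The constraint only shrinks the set of states considered at each position, so the work per time step never exceeds that of plain Viterbi and decoding stays linear in the length of the observation.

```python
import math


def constrained_viterbi(obs, states, start, trans, emit, fixed):
    """Viterbi decoding where fixed[i] (if present) pins time step i to a state."""
    def log(p):
        return math.log(p) if p > 0 else -math.inf

    def allowed(i):
        # Domain of variable Ti under the side-constraint.
        return [fixed[i]] if i in fixed else list(states)

    best = {s: (log(start.get(s, 0)) + log(emit[s].get(obs[0], 0)), [s])
            for s in allowed(0)}
    for i in range(1, len(obs)):
        step = {}
        for s in allowed(i):
            # Only states that survived the previous step are predecessors.
            score, path = max((best[p][0] + log(trans[p].get(s, 0)), best[p][1])
                              for p in best)
            step[s] = (score + log(emit[s].get(obs[i], 0)), path + [s])
        best = step
    return max(best.values())
```

With an empty `fixed` mapping this reduces to ordinary Viterbi decoding.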

Page 26

Combining models

[Figure: Gene predictor A and Gene predictor B, producing the sets A Genes and B Genes]

Combine the predictions of several models to form more accurate predictions.

Obvious approaches:
● Union
  ● Many false positives
  ● Conflicts
● Intersection / majority voting
  ● Lowest common denominator
  ● Throws away the most interesting predictions
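
The obvious approaches are easy to state over sets of predicted genes. The sketch below assumes each prediction is a (start, end) coordinate pair and that two predictions agree only when they are identical. Both choices, the function names and the sample data are hypothetical.

```python
from collections import Counter


def union(predictions):
    """Every gene predicted by at least one model."""
    return set().union(*predictions)


def intersection(predictions):
    """Only the genes predicted by every model (needs at least one model)."""
    return set.intersection(*map(set, predictions))


def majority_vote(predictions):
    """Genes predicted by more than half of the models."""
    votes = Counter(gene for pred in predictions for gene in set(pred))
    return {gene for gene, n in votes.items() if 2 * n > len(predictions)}


# Hypothetical predictions from three gene predictors.
a = {(10, 100), (200, 350), (400, 520)}
b = {(10, 100), (210, 350)}
c = {(10, 100), (200, 350)}

print(union([a, b, c]))          # four genes, two of them overlapping
print(intersection([a, b, c]))   # {(10, 100)}
print(majority_vote([a, b, c]))  # (10, 100) and (200, 350)
```

The union keeps both (200, 350) and (210, 350), which is the kind of conflict the slide mentions, while the intersection drops everything the models disagree on.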

Page 27

Combining models with constraints

[Figure: Gene predictor A and Gene predictor B, producing the sets A Genes and B Genes]

Combine the predictions of several models to form more accurate predictions.

Obvious approaches:
● Union
  ● Many false positives
  ● Conflicts
● Intersection
  ● Lowest common denominator
  ● Throws away the most interesting predictions

We need to know the strengths of individual models to define better constraints...

Page 28

Combining models with constraints

Issues to consider:
● Ability to combine both blackbox and whitebox models
● The nature of the combination constraints
  ● Uncertainty
  ● Lack of knowledge: what are the right constraints..
  ● Induction

Some possible ways to represent combination constraints being considered:
● Hard constraints
  ● Inability to handle uncertainty
● Factorial Hidden Markov Models
  ● Probability distribution defines how much to listen to each model
  ● Throws away information: What model contributed what?
  ● Expensive to train
● Bayesian networks
  ● Model probabilistic constraints
  ● We can model sequences with Dynamic Bayesian Networks
● Soft-Constraints
  ● Possibly good complement to probabilistic inference
● Co-training
  ● Use the models to train each other

Page 29

Outlook

● Formulating biosequence problems in terms of constraints
● Integrating these constraints in probabilistic models
● Tradeoffs between constraint representations
  ● Finding the right balance...
● Combining models with constraints
● Inference and parameter estimation in mixed models

