Dynamic Programming (cont’d)
CS 466
Saurabh Sinha
Affine Gap Penalties
• In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:
ATA__GGCATGATCGC   (this is more likely)
ATA_G_GCATGATCGC   (this is less likely)
Normal scoring would give the same score for both alignments.
Accounting for Gaps
• Gap: a contiguous sequence of spaces in one of the rows
• Score for a gap of length x is -(ρ + σx), where
– ρ > 0 is the gap opening penalty, charged once for introducing a gap
– σ is the gap extension penalty, charged for each space in the gap
• ρ is large relative to σ, because extending an existing gap should not be penalized too heavily
Affine gap penalty in DP
• When computing s_{i,j}, we need to look at s_{i,j-1}, s_{i,j-2}, s_{i,j-3}, … and s_{i-1,j}, s_{i-2,j}, …
• Each cell needs O(n) time for its update
• O(n^2) cells
• Therefore, an O(n^3) algorithm
• We can still do this in O(n^2) time
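To see where the O(n) cost per cell comes from, here is a sketch of the direct recurrence, written with the gap score -(ρ + σx) defined earlier. The index k (the length of the gap ending at cell (i,j)) is notation introduced here for illustration; δ is the usual match/mismatch score.

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
\max_{1 \le k \le j} \; s_{i,j-k} - (\rho + \sigma k) \\
\max_{1 \le k \le i} \; s_{i-k,j} - (\rho + \sigma k)
\end{cases}
\]

Each of the two inner maxima ranges over up to n earlier cells, which is what makes the naive algorithm cubic.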
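Affine Gap Penalty Recurrences

Here s↓ is the best score of an alignment ending in a gap in w, s→ is the best score of an alignment ending in a gap in v, and s is the best score overall.

\[
s^{\downarrow}_{i,j} = \max
\begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion): from middle}
\end{cases}
\]

\[
s^{\rightarrow}_{i,j} = \max
\begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion): from middle}
\end{cases}
\]

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion: from top} \\
s^{\rightarrow}_{i,j} & \text{end insertion: from bottom}
\end{cases}
\]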
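The three-level recurrence can be sketched in code as follows. This is a minimal illustration, not the course's reference implementation: the function name, the table names (`lower`, `upper`, `mid`), and the default scores (match +1, mismatch -1, ρ = 4, σ = 1) are assumptions chosen for the example. It returns only the score and does no traceback.

```python
def affine_gap_score(v, w, match=1, mismatch=-1, rho=4, sigma=1):
    """Global alignment score with gap score -(rho + sigma * x), in O(nm) time."""
    neg_inf = float("-inf")
    n, m = len(v), len(w)
    # lower[i][j]: best score of an alignment ending in a gap in w (deletion)
    # upper[i][j]: best score of an alignment ending in a gap in v (insertion)
    # mid[i][j]:   best score overall for the i-prefix of v and j-prefix of w
    lower = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    upper = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    mid = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    mid[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                # continue an open deletion, or start a new one from the middle level
                lower[i][j] = max(lower[i - 1][j] - sigma,
                                  mid[i - 1][j] - (rho + sigma))
            if j > 0:
                # continue an open insertion, or start a new one from the middle level
                upper[i][j] = max(upper[i][j - 1] - sigma,
                                  mid[i][j - 1] - (rho + sigma))
            diag = neg_inf
            if i > 0 and j > 0:
                delta = match if v[i - 1] == w[j - 1] else mismatch
                diag = mid[i - 1][j - 1] + delta
            mid[i][j] = max(diag, lower[i][j], upper[i][j])
    return mid[n][m]
```

With these assumed scores, aligning "AAAA" against "AA" is cheapest with one gap of length 2 (penalty 6) rather than two gaps of length 1 (penalty 10), which is the behaviour affine penalties are meant to produce.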
Reading assignment
Section 6.10 (J & P)
Multiple Alignment
Gene Prediction
• Gene: A sequence of nucleotides coding for protein
• Gene Prediction Problem: Determine the beginning and end positions of genes in a genome
Gene Prediction: Computational Challenge
The Genetic Code
SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm
Codons
• In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations
• Systematically deleted nucleotides from DNA
– Single and double deletions dramatically altered the protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein
Great Discovery Provoking Wrong Assumption
• In 1964, Charles Yanofsky and Sydney Brenner proved collinearity in the order of codons with respect to amino acids in proteins
• As a result, it was incorrectly assumed that the triplets encoding amino acid sequences form contiguous strips of information.
Exons and Introns
• In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)
• This makes computational gene prediction in eukaryotes even more difficult
• Prokaryotes don’t have introns: genes in prokaryotes are continuous
Splicing
[Figure: a gene with exon1, intron1, exon2, intron2, exon3, followed through transcription, splicing, and translation. exon = coding, intron = non-coding. Credit: Batzoglou]
Gene prediction
• More difficult in eukaryotes than in prokaryotes (due to introns).
• In the human genome, only ~3% of the DNA sequence is genes
• A lot of “junk” DNA lies between genes, and even inside genes (between exons).
• Gene prediction must deal with this.
Gene prediction: broadly speaking
• Statistical approaches: look for features that appear frequently in genes and infrequently elsewhere
• Similarity-based approaches: a newly sequenced gene may be similar to a known gene.
– Even this is not so simple: the exon structures may differ between otherwise similar genes
Statistical approaches
Splicing Signals
Exons are interspersed with introns; each intron typically begins with GT (donor site) and ends with AG (acceptor site)
Splice site detection
Donor site (5’ → 3’): nucleotide frequency (%) at each position

Position    -8    …    -2    -1     0     1     2    …    17
A           26    …    60     9     0     1    54    …    21
C           26    …    15     5     0     1     2    …    27
G           25    …    12    78    99     0    41    …    27
T           23    …    13     8     1    98     3    …    25
From lectures by Serafim Batzoglou (Stanford)
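As an illustration of how such a profile can be used to scan for candidate donor sites, here is a minimal sketch that scores 5-nucleotide windows with a log-odds score. It uses only the five contiguous columns (-2 to 2) given in the table above. The pseudocount, the uniform background of 0.25, the threshold, and the function names are all assumptions made for this example, and the input is assumed to contain only A, C, G, T.

```python
import math

# Donor-site frequencies (%) for positions -2, -1, 0, 1, 2, copied from the table.
DONOR_PROFILE = {
    "A": [60, 9, 0, 1, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 99, 0, 41],
    "T": [13, 8, 1, 98, 3],
}

def donor_score(window, pseudo=0.5, background=0.25):
    """Log-odds score (in bits) of a 5-mer against the donor profile."""
    score = 0.0
    for pos, base in enumerate(window):
        # The pseudocount (an assumption) keeps zero table entries from giving log(0).
        p = (DONOR_PROFILE[base][pos] + pseudo) / (100 + 4 * pseudo)
        score += math.log2(p / background)
    return score

def candidate_donor_sites(seq, threshold=4.0):
    """Return (index of profile position 0, score) for windows at or above threshold."""
    hits = []
    for start in range(len(seq) - 4):
        s = donor_score(seq[start:start + 5])
        if s >= threshold:
            hits.append((start + 2, round(s, 2)))
    return hits
```

For example, `candidate_donor_sites("CCAGGTACC")` reports a single hit whose position 0 is the G of the GT dinucleotide. On real genomic sequence, a short profile like this matches many non-sites as well.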
Consensus splice sites
Donor: 7.9 bits
Acceptor: 9.4 bits
Splicing and gene prediction
• Using splice sites (profiles) to predict genes?
• Limited scope, too many false predictions
Open Reading Frames (ORFs)
• Let us consider gene prediction in prokaryotes (no introns)
• Detect potential coding regions by looking at ORFs
– A region of length n comprises n/3 codons
– Stop codons break the genome into segments between consecutive stop codons
– The subsegments of these that start from the start codon (ATG) are ORFs
[Figure: a genomic sequence with an open reading frame marked from ATG to TGA]
ORFs
• 6 reading frames in any given sequence
– 6 ways to map the DNA sequence to a codon sequence (+1, +2, +3, -1, -2, -3)
– 3 on either strand
• Look at all 6 reading frames for ORFs
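A six-frame ORF scan along these lines might look like the sketch below. The function names, the `min_codons` parameter, and the output tuple layout are hypothetical. The stop codons TAA, TAG and TGA are the standard ones. Coordinates reported for the "-" strand refer to the reverse-complemented sequence, and the input is assumed to be uppercase A, C, G, T.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame):
    """Yield (start, end) of ATG...stop ORFs in one forward frame (0, 1 or 2)."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon after the last stop
        elif start is not None and codon in STOPS:
            yield start, i + 3             # end is exclusive and includes the stop
            start = None

def find_orfs(seq, min_codons=0):
    """Return (strand, frame, start, end, orf) over all six reading frames."""
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for a, b in orfs_in_frame(s, frame):
                if (b - a) // 3 >= min_codons:
                    found.append((strand, frame + 1, a, b, s[a:b]))
    return found
```

Raising `min_codons` implements a simple length-threshold filter on the reported ORFs.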
Long vs. Short ORFs
• A long open reading frame may be a gene
– At random, we should expect one stop codon every 64/3 ≈ 21 codons, since 3 of the 64 codons are stop codons
– However, genes are usually much longer than this
• A basic approach is to scan for ORFs whose length exceeds a certain threshold
– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
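The 64/3 estimate can be made slightly more explicit. The calculation below assumes, purely for illustration, that codons in random sequence are independent and that all 64 are equally likely. The threshold of 100 codons is an arbitrary example value.

\[
\Pr(\text{stop codon}) = \frac{3}{64},
\qquad
\mathbb{E}[\text{codons until a stop}] = \frac{64}{3} \approx 21.3
\]

\[
\Pr(\text{no stop codon in } k \text{ consecutive codons}) = \left(\frac{61}{64}\right)^{k},
\qquad
\left(\frac{61}{64}\right)^{100} \approx 0.008
\]

Under these assumptions, a stop-free run of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as evidence of a gene.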
Codon usage
• In a given sequence (e.g., an ORF), compute the frequency distribution of codons (a 64-element array): the codon usage array
• The codon usage array for coding sequences is different from that for non-coding sequences
• If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene
Codon usage
• Codons coding for “Arg” in human:
– CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3%
– In a coding sequence, codon CGC is about 12 times more likely than codon AGG (38% vs. 3%)
– An ORF preferring CGC over AGG is likely to be a gene
Codon Usage in Human Genome
Codon usage
• One way to test if an ORF is a gene is to compute
– Pr(ORF sequence under a coding sequence model)
– Pr(ORF sequence under a non-coding model)
– Ratio of the two.
• These methods work best in prokaryotes
• The exon-intron trouble is not handled yet
• Hidden Markov models use codon usage ideas and splice site ideas, all in one
– We’ll see more of this in the second half of the course
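The ratio test can be sketched as a log-likelihood ratio over codons. This is one simple way to instantiate the two models, not necessarily the one intended in the lecture: each model is assumed to be a table of 64 codon frequencies, codons are treated as independent, and a pseudocount avoids zero probabilities. All function names and the pseudocount are assumptions, and sequences are assumed to contain only A, C, G, T.

```python
import math
from collections import Counter

def codons(seq):
    """Split a sequence into consecutive non-overlapping triplets (frame 0)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def codon_usage(seqs, pseudo=1.0):
    """Estimate a 64-entry codon usage array from a list of training sequences."""
    counts = Counter(c for s in seqs for c in codons(s))
    all_codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    total = sum(counts[c] for c in all_codons) + 64 * pseudo
    return {c: (counts[c] + pseudo) / total for c in all_codons}

def log_likelihood_ratio(orf, coding, noncoding):
    """log2 of Pr(orf | coding model) / Pr(orf | non-coding model)."""
    return sum(math.log2(coding[c] / noncoding[c]) for c in codons(orf))
```

A positive value means the ORF's codons look more like the coding training set than the non-coding one. Working in logs avoids numerical underflow on long ORFs.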
Promoter Structure in Prokaryotes (E. coli)
Transcription starts at offset 0.
• Pribnow Box (-10)
• Gilbert Box (-30)
• Ribosomal Binding Site (+10)
Ribosomal Binding Site
Statistical approaches: summary
• Splicing sites
• Codon usage
• Promoter motifs, such as the -10 element and the -30 element
• Ribosome binding site
Similarity based approaches
• Some genomes may be very well-studied, with many genes having been experimentally verified.
• Closely related organisms may have similar genes
• Unknown genes in one species may be compared to genes in some closely related species
The basic approach
• Given a protein sequence and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence
• First cut: find fragments in the genomic sequence that match portions of the protein sequence (local alignment)
• Then find the “optimal” subset of non-overlapping fragments
Exon chaining
• Each of the fragments of the genomic sequence that somewhat matches the protein (locally) is a putative exon
• The “goodness” of the match is the “weight” assigned to this putative exon
• Thus, we have a set of weighted intervals (l, r, w): a fragment from l to r, with weight w representing how well it matches (a portion of) the protein
Exon Chaining Problem
• Input: A set of weighted intervals (l, r, w)
• Output: A maximum-weight chain of non-overlapping intervals from this set
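Exon Chaining Problem: Graph Representation
• This problem can be solved with dynamic programming in O(n) time, once the interval endpoints are sorted.
• Vertices: the 2n interval endpoints, in left-to-right order
• An edge of weight w_i from every l_i to r_i
• An edge of weight 0 between every two successive vertices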
Assumptions
• No two intervals have a common boundary point. So the (l_i, r_i) define 2n distinct points if there are n intervals
Exon Chaining Algorithm
ExonChaining(G, n)    // graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
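For readers who want to run the algorithm, here is a minimal Python sketch of the same dynamic program. It takes the intervals directly as (l, r, w) tuples rather than a prebuilt graph, which is a choice made for this example. The function name and the `position` helper dictionary are hypothetical. The sort adds an O(n log n) step before the linear-time pass, and all endpoints are assumed distinct, as stated in the assumptions.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals (l, r, w)."""
    # Collect the 2n endpoints and order them left to right.
    points = []
    for idx, (l, r, w) in enumerate(intervals):
        points.append((l, 0, idx))   # 0 marks a left end
        points.append((r, 1, idx))   # 1 marks a right end
    points.sort()

    position = {}                    # interval index -> vertex number of its left end
    s = [0] * (len(points) + 1)      # s[0] = 0 is the boundary value
    for i, (_, is_right, idx) in enumerate(points, start=1):
        if is_right:
            j = position[idx]        # vertex of the matching left end
            w = intervals[idx][2]
            s[i] = max(s[j] + w, s[i - 1])   # take this interval, or skip it
        else:
            position[idx] = i
            s[i] = s[i - 1]
    return s[-1]
```

For the intervals (1, 4, 5), (2, 3, 3), (5, 8, 6), (6, 7, 1), the best chain is (1, 4) followed by (5, 8), with total weight 11.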
Not very helpful
• A chain is a set of non-overlapping exons in order (left to right)
• But the matching protein portions may not be in the same order!
Spliced Alignment
• Begins by selecting either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem).
• This set is further filtered in such a way that it attempts to retain all true exons, along with some false ones.
• Then find the chain of exons such that the sequence similarity to the target protein sequence is maximized
Spliced Alignment Problem: Formulation
• Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B.
• Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized
Γ*: the concatenation of all exons from chain Γ
The DAG
• Vertices: One vertex for each block in B
• Directed edge connecting non-overlapping blocks
• Label of vertex = string of the block it represents
• A path through the DAG spells out the string obtained by concatenating that particular chain of blocks
• Weight of a path is the score of the optimal alignment between the string it spells out and the target sequence
Dynamic programming
• Genomic sequence G = g_1 g_2 … g_n
• Target sequence T = t_1 t_2 … t_m
• As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T
• Problem is, there are many i-prefixes possible (since multiple blocks may include position i)
Idea
• Find the optimal alignment score of the i-prefix of G and the j-prefix of T, assuming that this alignment uses a particular block B at position i
• Call this score S(i, j, B)
• Compute it for every block B that includes i
Recurrence
If i is not the starting vertex of block B:
\[
S(i, j, B) = \max
\begin{cases}
S(i-1, j, B) - \text{indel penalty} \\
S(i, j-1, B) - \text{indel penalty} \\
S(i-1, j-1, B) + \delta(g_i, t_j)
\end{cases}
\]

If i is the starting vertex of block B:
\[
S(i, j, B) = \max
\begin{cases}
S(i, j-1, B) - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding } B} S(\mathrm{end}(B'), j, B') - \text{indel penalty} \\
\max_{\text{all blocks } B' \text{ preceding } B} S(\mathrm{end}(B'), j-1, B') + \delta(g_i, t_j)
\end{cases}
\]
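The recurrence can be sketched in code as below. Several details are assumptions made for this illustration and are not stated in the slides: blocks are encoded as half-open (start, end) slices of G, δ is a simple match/mismatch score, the indel penalty is linear, and a block with no predecessor is aligned starting from the empty chain (score -indel · j against the j-prefix of T). The function name is hypothetical, and only the score is returned, not the chain.

```python
def spliced_alignment_score(G, T, blocks, match=1, mismatch=-1, indel=1):
    """Best global alignment score between T and a chain of non-overlapping blocks."""
    m = len(T)
    last_row = {}   # block -> [S(end(B), j, B) for j = 0..m]
    # Sorting by end guarantees every preceding block B' is processed before B.
    for block in sorted(set(blocks), key=lambda b: b[1]):
        start, end = block
        # Row "just before the block": the empty chain (an assumption), or the
        # best row among blocks B' that end at or before this block's start.
        prev = [-indel * j for j in range(m + 1)]
        for other, row in last_row.items():
            if other[1] <= start:
                prev = [max(a, b) for a, b in zip(prev, row)]
        # Ordinary global-alignment rows for each position i inside the block.
        for i in range(start, end):
            cur = [prev[0] - indel] + [0] * m
            for j in range(1, m + 1):
                delta = match if G[i] == T[j - 1] else mismatch
                cur[j] = max(prev[j] - indel,        # g_i aligned to a space
                             cur[j - 1] - indel,     # t_j aligned to a space
                             prev[j - 1] + delta)    # g_i aligned to t_j
            prev = cur
        last_row[block] = prev
    # Best chain ending in any block; the empty chain is the fallback.
    return max((row[m] for row in last_row.values()), default=-indel * m)
```

For example, with G = "AAACCCGGGTTT", T = "AAAGGG" and blocks (0, 3), (3, 6), (6, 9), (9, 12), the chain of blocks (0, 3) and (6, 9) spells "AAAGGG" and scores 6. Taking the elementwise maximum over predecessor rows once per block, rather than once per cell, is what keeps the block-start case cheap.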