Sequence Analysis, Fall ‘18
Are you ready for this course?
•Computer Science prerequisites: Can you program? Can you write pseudocode? Can you download, install, and run apps?
•Biology prerequisites: Do you know the central dogma of biology? Have you memorized the amino acids and their properties? Do you know how evolution works?
You need this:
Zvelebil & Baum = Z&B
How Biologists Think
• Spatio-temporo-chemical phenomena
• Experimental methods/design
• Macromolecular structure
• Evolution
How Computer Scientists Think
• Convert desired input/output to a set of Data Structures and Instructions.
• Query a database.
• Simulate.
• Optimize.
http://shaunskeen.com/
How a computational biologist thinks
Biology side of brain:
• Understands biological systems.
• Designs experiments to produce new data.
• Compares predictions to experimental results.
CS side of brain:
• Converts biological systems into data structures.
• Converts hypotheses into instructions.
• Writes database queries.
[Arrows between the two sides: modeling, validating]
The Scientific Process -- DIPA
                     Biology               Computer Science
• Data generation    experiments           filling databases
• Interpretation     informal models       formal models
• Prediction         outcomes              simulations
• Action             experimental design   obtain more data
An example of thinking like a computational biologist: The Evolution of Chromosomes ...
Syntenic groups: genes that stay together, generally for survival
Trp operon genes in bacteria
Large-scale evolutionary changes in chromosomes...
• Inversions: a syntenic group appears on the opposite strand, different location.
• Duplications: “...the most important evolutionary force since the emergence of the universal common ancestor.” -- Susumu Ohno
• Transpositions: a syntenic group appears on a different chromosome.
•Mouse and human chromosomes have the same content, but gene locations are scrambled!
NOTE: no transpositions from/to X, Y chromosomes!
...have led to chromosomes that are mosaics of other species' chromosomes
Let's look at meiosis, Prophase I
http://www.phschool.com/science/biology_place/labbench/lab3/concepts2.html
How?
Chiasmata
http://ib.bioninja.com.au/higher-level/topic-10-genetics-and-evolu/101-meiosis/chiasmata.html
where crossovers occur
Normal prophase 1
non-sister chromatids synapse, sometimes cross over
alleles swapped, gene order conserved
[Figure: a tetrad forms between homologous chromosomes A B C D and a b c d; crossover points are marked x]
...then, tetrads line up (metaphase I), separate (anaphase I)
Abnormal prophase 1
illegitimate synapses form, sometimes cross over
alleles swapped, genes conserved, order reversed
tetrad forms, something goes wrong... it is backwards.
[Figure: the tetrad for chromosomes A B C D and a b c d, with an illegitimate synapse]
...then, tetrads line up (metaphase I), separate (anaphase I)
Normal prophase 1: A B C D + a b c d => A b c D + a B C d
Abnormal prophase 1: A B C D + a b c d => A c b D + a C B d
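The two outcomes above can be reproduced with a toy model. This is only a sketch: chromatids are plain lists of allele names, and the function name `crossover` and its `inverted` flag are made up for illustration, not taken from the course.

```python
def crossover(chrom1, chrom2, start, end, inverted=False):
    """Swap the segment [start, end) between two chromatids.

    With inverted=True the swapped segments arrive in reverse order,
    which models the abnormal (illegitimate synapse) case.
    """
    seg1 = chrom1[start:end]
    seg2 = chrom2[start:end]
    if inverted:
        seg1, seg2 = seg1[::-1], seg2[::-1]
    new1 = chrom1[:start] + seg2 + chrom1[end:]
    new2 = chrom2[:start] + seg1 + chrom2[end:]
    return new1, new2


maternal = list("ABCD")
paternal = list("abcd")

# Normal prophase 1: alleles swapped, gene order conserved.
print(crossover(maternal, paternal, 1, 3))
# Abnormal prophase 1: alleles swapped, order of the segment reversed.
print(crossover(maternal, paternal, 1, 3, inverted=True))
```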
..or maybe it happens this way.
legitimate synapses form, but due to a loop, they get too close, crossovers are swapped.
[Figure: chromosome A B C D forms a loop against a b c d; crossover points are marked x]
same result
Accumulation of inversions over (how much?) evolutionary time
Mouse chromosomes colored according to homology to human chromosomes.
[Figure panels: rug rat, mouse, rat]
Can we map our ancestry using inversions?
[Figure panels: rug rat, mouse, rat]
How many inversions: mouse-to-human, rat-to-human, rat-to-mouse?
How many inversions have happened? How do we count them? The answer tells us the evolutionary distance.
How many inversions is this?
The way a computational biologist thinks of the problem of counting inversions.
The Pancake Flipping Problem
A sloppy cook at a pancake diner makes pancakes of all different sizes and stacks them haphazardly.
A meticulous waiter likes the pancakes to be stacked with the largest on the bottom and the smallest on top. On the way to the table, using only one hand with a spatula, he flips the pancakes until they are arranged by size, largest on bottom, smallest on top.
•What is the algorithm for flipping?
•What is the algorithm for finding the fewest flips?
In class exercise:
Given the arrangement below, flip the pancakes until they are in order. How many flips? (You can order the numbers instead of the pancakes.)
642315 ---> ... ---> 123456
In class exercise: work in pairs
•Write detailed instructions on how to stack six pancakes in order by flipping. The instructions should not depend on the starting order.
•Generate a random pancake order. Apply your algorithm. 125436 ---> ... ---> ... ---> 123456
•Pick a volunteer: Write pseudocode on the board.
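One possible answer to the first question (an algorithm for flipping, not for the fewest flips) is sketched below. It assumes the first number is the top pancake and that the spatula always goes under the top k pancakes. The names `flip` and `pancake_sort` are illustrative.

```python
def flip(stack, k):
    """Flip the top k pancakes (index 0 is the top of the stack)."""
    return stack[:k][::-1] + stack[k:]


def pancake_sort(stack):
    """Sort with top-of-stack flips only; returns (sorted stack, flip sizes).

    Strategy: bring the largest unsorted pancake to the top, then flip it
    down to the bottom of the unsorted part. Not guaranteed to be minimal.
    """
    stack = list(stack)
    flips = []
    for size in range(len(stack), 1, -1):
        biggest = stack.index(max(stack[:size]))
        if biggest == size - 1:
            continue  # already in place
        if biggest != 0:
            stack = flip(stack, biggest + 1)  # bring it to the top
            flips.append(biggest + 1)
        stack = flip(stack, size)  # send it to the bottom of the unsorted part
        flips.append(size)
    return stack, flips


ordered, flips = pancake_sort([6, 4, 2, 3, 1, 5])
print(ordered, "using flips of sizes", flips)
```

This uses at most two flips per pancake, so it answers "how" but not "how few".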
5’ TATTAGCCCGTGACAAACTAAGCCTATG 3’
3’ ATAATCGGGCACTGTTTGATTCGGATAC 5’
(middle segment flipped)
5’ TATTAGCCTAGTTTGTCACGAGCCTATG 3’
3’ ATAATCGGATCAAACAGTGCTCGGATAC 5’
Flipping changes chain direction
If we write reverse complements as negative numbers, 1 2 3 4 becomes 1 -3 -2 4.
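A minimal sketch of the same idea in code, on both representations. The function names are illustrative, and the split of the DNA string into blocks of 8, 12 and 8 bases is an assumption read off the example sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def flip_blocks(blocks, start, end):
    """Flip blocks[start:end]: reverse their order and negate each sign."""
    flipped = [-b for b in reversed(blocks[start:end])]
    return blocks[:start] + flipped + blocks[end:]


# Signed-integer view: flipping blocks 2 and 3.
print(flip_blocks([1, 2, 3, 4], 1, 3))  # [1, -3, -2, 4]

# Sequence view: flipping the middle segment of the top strand.
top = "TATTAGCCCGTGACAAACTAAGCCTATG"
flipped = top[:8] + reverse_complement(top[8:20]) + top[20:]
print(flipped)
```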
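Evolution by reversals.
ancestral mammal: 1 2 3 4 5 6 7 8 9
One lineage:
-4 -3 -2 -1 5 6 7 8 9
-4 -3 -6 -5 1 2 7 8 9
-4 -3 5 6 1 2 7 8 9
-4 -3 5 6 -9 -8 -7 -2 -1
Other lineage:
1 2 3 4 -9 -8 -7 -6 -5
1 2 3 7 8 9 -4 -6 -5
1 2 3 7 8 9 6 4 -5
-8 -7 -3 -2 -1 9 6 4 -5
[Figure: tree from the ancestral mammal to human and mouse]
“-” indicates reverse complement.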
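As a quick check, the sketch below replays both lineages and confirms that each step differs from the previous one by exactly one reversal. The helper names are made up for illustration. The grouping into two lineages is read from the order of the lines on the slide.

```python
def reversal(perm, i, j):
    """Reverse perm[i:j] and negate every element in it."""
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]


def one_reversal_apart(p, q):
    """True if q can be obtained from p by a single reversal."""
    n = len(p)
    return any(reversal(p, i, j) == q
               for i in range(n) for j in range(i + 1, n + 1))


ancestor = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lineage_1 = [
    [-4, -3, -2, -1, 5, 6, 7, 8, 9],
    [-4, -3, -6, -5, 1, 2, 7, 8, 9],
    [-4, -3, 5, 6, 1, 2, 7, 8, 9],
    [-4, -3, 5, 6, -9, -8, -7, -2, -1],
]
lineage_2 = [
    [1, 2, 3, 4, -9, -8, -7, -6, -5],
    [1, 2, 3, 7, 8, 9, -4, -6, -5],
    [1, 2, 3, 7, 8, 9, 6, 4, -5],
    [-8, -7, -3, -2, -1, 9, 6, 4, -5],
]

for lineage in (lineage_1, lineage_2):
    steps = [ancestor] + lineage
    print([one_reversal_apart(a, b) for a, b in zip(steps, steps[1:])])
```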
Pancakes-burned-on-one-side Problem
A sloppy cook at a pancake diner makes pancakes of all different sizes and stacks them haphazardly. He’s also a bad cook. All of the pancakes are burned on one side.
The waiter likes the pancakes to be stacked with the largest on the bottom and the smallest on top and with all of the burned sides down!
He has two spatulas and can flip any number of adjacent pancakes at a time.
•How does he arrange the pancakes in the fewest flips?
In class exercise, part 2:
•Write instructions for the Two-Spatula-Pancakes-burned-on-one-side Problem. When flipped, pancakes change order and sign. Flip any segment.
•Generate a random pancake arrangement. Apply your algorithm. 1 -2 5 4 -3 6 ---> ... ---> ... ---> 123456
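One simple set of instructions is sketched below: a greedy, left-to-right procedure that always terminates but is not claimed to use the fewest flips. The function names are hypothetical.

```python
def reversal(perm, i, j):
    """Flip the segment perm[i:j]: reverse its order and negate its signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]


def greedy_sort_by_reversals(perm):
    """Put 1, 2, 3, ... in place one at a time; returns every intermediate stack."""
    perm = list(perm)
    steps = []
    for i in range(len(perm)):
        target = i + 1
        if perm[i] == target:
            continue
        # Find the pancake of size `target`, whichever side is burned.
        j = next(k for k in range(i, len(perm)) if abs(perm[k]) == target)
        if j != i:
            perm = reversal(perm, i, j + 1)  # move it into position i
            steps.append(list(perm))
        if perm[i] == -target:
            perm = reversal(perm, i, i + 1)  # turn it burned side down
            steps.append(list(perm))
    return steps


for step in greedy_sort_by_reversals([1, -2, 5, 4, -3, 6]):
    print(step)
```

Each position costs at most two flips, so an arrangement of n pancakes needs at most 2n flips with this approach.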
"Dot plot"
sequence 1: 1 2 3 4 5 6 7 8 9
sequence 2: 2 9 5 4 6 8 7 1 3
Place a dot where sequences are identical.
A data structure for comparing sequences
Data structure!
Simpler dot plot: line segments
1 2 3 4 5 6 7 8 9
9 5 4 6 7 8 1 2 3
If dots are bases, lines are sequences. If dots are sequences, lines are syntenic groups.
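A rough sketch of how dots can be collapsed into line segments: consecutive numbers that rise by one form a direct line, and consecutive numbers that fall by one form a reverse line. The function name and the (low, high, strand) tuple format are assumptions made for this example.

```python
def syntenic_blocks(order):
    """Group an ordering of 1..n into maximal runs.

    Returns a list of (low, high, strand) tuples, where strand is "+" for a
    rising run (direct diagonal) and "-" for a falling run (reverse diagonal).
    """
    blocks = []
    i = 0
    while i < len(order):
        j = i
        step = 0
        if i + 1 < len(order) and order[i + 1] - order[i] in (1, -1):
            step = order[i + 1] - order[i]
            j = i + 1
            while j + 1 < len(order) and order[j + 1] - order[j] == step:
                j += 1
        strand = "-" if step == -1 else "+"
        blocks.append((min(order[i], order[j]), max(order[i], order[j]), strand))
        i = j + 1
    return blocks


print(syntenic_blocks([9, 5, 4, 6, 7, 8, 1, 2, 3]))
```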
+/- integers are direct/rc strands
Z&B p. 158
1 2 3 4 5 6 7 8 9
-4 -3 5 6 -9 -8 -7 -2 -1
Each number represents a line segment
Each negative number represents its reverse complement
reverse complement alignment
direct alignment
P. Pevzner et al, Trends in Genetics. Volume 20, Issue 12, December 2004, Pages 631–639
Decompose graph into cycles
Connect breakpoints
Breakpoint graph decomposition algorithm for counting minimal genomic reversals in the mammalian X chromosome: human versus mouse.
1 2 3 4 5 6 7 8 9
1 2 -7 -6 -5 -4 -3 8 9
1 2 -7 -8 3 4 5 6 9
1 -3 8 7 -2 4 5 6 9
Breakpoint graph decomposition
Number of inversions = number of cycles
Breakpoint graph decomposition 1
1 2 3 4 5 6 7 8 9
1 2 -7 -6 -5 -4 -3 8 9
1 2 6 7 -5 -4 -3 8 9
1 2 6 7 -5 4 3 8 9
Number of inversions = N/2 - 1, where N = # of vertices
Breakpoint graph decomposition 2
�37
How many inversions?
P. Pevzner et al, Trends in Genetics. Volume 20, Issue 12, December 2004, Pages 631–639
Breakpoint graph decomposition
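For readers who want to try the counting on small examples, here is a sketch of one common textbook formulation. It is an assumption that this matches how the graph is drawn on the slides. Each signed block x becomes two endpoints. Black edges join neighbouring endpoints in the observed order, and gray edges join neighbouring endpoints in the target order 1..n. With c alternating cycles, n + 1 - c is a lower bound on the number of reversals, and it is often (not always) the exact count. All names are illustrative.

```python
def breakpoint_graph_cycles(perm):
    """Count alternating cycles in the breakpoint graph of a signed permutation."""
    n = len(perm)
    # Expand +x to (2x-1, 2x) and -x to (2x, 2x-1); frame with 0 and 2n+1.
    seq = [0]
    for x in perm:
        seq += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    seq.append(2 * n + 1)

    # Black edges: endpoints that are neighbours in the observed order.
    black = {}
    for i in range(0, len(seq), 2):
        a, b = seq[i], seq[i + 1]
        black[a], black[b] = b, a

    def gray(v):
        # Gray edges: endpoints 2k and 2k+1 are neighbours in the target order.
        return v + 1 if v % 2 == 0 else v - 1

    seen = set()
    cycles = 0
    for start in seq:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:  # walk black edge, then gray edge, until closed
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray(w)
    return cycles


def reversal_lower_bound(perm):
    """n + 1 - c: a lower bound on the number of reversals to reach 1..n."""
    return len(perm) + 1 - breakpoint_graph_cycles(perm)


for p in ([1, 2, 3, 4, 5, 6, 7, 8, 9],
          [1, 2, -7, -6, -5, -4, -3, 8, 9],
          [1, -3, 8, 7, -2, 4, 5, 6, 9]):
    print(p, "needs at least", reversal_lower_bound(p), "reversals")
```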
How do I find the alignment matrix given the sequences?
A similarity matrix ... in its most basic form:
•Identical characters get score=1.
•Non-identical characters get score=0.
[Figure: 9 x 9 similarity matrix of ACTGAACCT against ACTGAACCT, with 1 where the characters are identical and 0 elsewhere]
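A minimal sketch of this scoring rule as a data structure, using a list of lists. The function name is illustrative.

```python
def similarity_matrix(seq_a, seq_b):
    """S[i][j] = 1 if seq_a[i] == seq_b[j], else 0."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


seq = "ACTGAACCT"
matrix = similarity_matrix(seq, seq)

print("  " + " ".join(seq))
for base, row in zip(seq, matrix):
    print(base + " " + " ".join(str(v) for v in row))
```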
An alignment matrix.
1 = aligned (associated), 0 = not aligned
•Boolean matrix.
•Max one “1” per row.
•Max one “1” per column.
[Figure: alignment matrix for ACTGAACCT against ACTGAACCT]
An alignment matrix.
Alignments may be “sequential” or "non-sequential". If the alignment is sequential, these rules apply:
•If A(i,j)==1, then A(m,n)=0 for all (m<i && n>j)
•If A(i,j)==1, then A(m,n)=0 for all (m>i && n<j)
i.e. SW and NE of “1” must be zero.
[Figure: alignment matrix for ACTGAACCT against ACTGAACCT]
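The rules for an alignment matrix can be checked mechanically. The sketch below is one way to do it, assuming the matrix is a list of rows holding 0s and 1s. The function name and its `sequential` flag are made up for this example.

```python
def is_valid_alignment(matrix, sequential=True):
    """Check the alignment-matrix rules on a 0/1 matrix.

    - at most one 1 per row and per column
    - if sequential: no 1 lies to the NE or SW of another 1
    """
    ones = [(i, j) for i, row in enumerate(matrix)
            for j, value in enumerate(row) if value == 1]
    rows = [i for i, _ in ones]
    cols = [j for _, j in ones]
    if len(set(rows)) != len(rows) or len(set(cols)) != len(cols):
        return False
    if sequential:
        for i, j in ones:
            for m, n in ones:
                if (m < i and n > j) or (m > i and n < j):
                    return False
    return True


diagonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
crossed = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(is_valid_alignment(diagonal))
print(is_valid_alignment(crossed))
print(is_valid_alignment(crossed, sequential=False))
```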
A Dot Plot is a similarity matrix
Each position in the matrix D[i,j] is either
dot, if A[i] == B[j]
blank, otherwise.
[Figure: dot plot of AAGACGTTTA against GACGTACT]
Dot plot diagonals are alignments
To find short, unbroken alignments, we set window size and stringency and mark all dots that pass with a line.
[Figure: dot plot of AAGACGTTTA against GACGTACT]
Show all diagonals with at least 4 out of 5 matches.
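A sketch of the windowed filter: slide a window of fixed length along every diagonal and keep the windows with enough matches. The function name `windowed_hits` and its parameter names are illustrative, not UGENE's.

```python
def windowed_hits(seq_a, seq_b, window=5, stringency=4):
    """Return (i, j, matches) for every diagonal window that passes.

    The window starts at seq_a[i] and seq_b[j] and runs for `window`
    positions; it passes if at least `stringency` positions match.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j, matches))
    return hits


for i, j, matches in windowed_hits("AAGACGTTTA", "GACGTACT"):
    print(f"window at ({i}, {j}): {matches} of 5 match")
```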
"window size" is the length of a diagonal, "stringency" is minimum number of matches in the window.
Dot matrix with reverse complement
[Figure: dot plot of AAGACGTTTA against GACGTACT, with direct and reverse complement dots]
Base matches its complement
Reverse diagonal means inverse alignment
Now we have two types of dots
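A small text-mode sketch of a dot matrix with both types of dots. The symbols ("x" for an identical base, "o" for a complementary base, "." for neither) and the function name are choices made for this example only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dot_plot(seq_a, seq_b):
    """One row per base of seq_a, one column per base of seq_b."""
    rows = []
    for a in seq_a:
        row = ""
        for b in seq_b:
            if a == b:
                row += "x"  # direct match
            elif COMPLEMENT[a] == b:
                row += "o"  # base matches its complement
            else:
                row += "."
        rows.append(row)
    return rows


seq_a, seq_b = "AAGACGTTTA", "GACGTACT"
print("  " + seq_b)
for base, row in zip(seq_a, dot_plot(seq_a, seq_b)):
    print(base + " " + row)
```

A run of "o" along a reverse diagonal (up and to the right) marks a stretch where one sequence aligns with the reverse complement of the other.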
Install UGENE
• ugene.net/
• Download UGENE manual.
• Follow instructions for installing UGENE on your system.
Take-home exercise 1: UGENE
Turn in Thurs. Sep 6
• Open FASTA/human_T1.fa (or download from "UGENE files" link).
• Right-click in sequence window. Select Analyze...Build Dotplot... Set x=y=human_T1. Direct repeats and inverted. Set minimum length = 50, 100% identity. Click OK.
• Navigate the dotplot window by zooming and scrolling.
• Find locus of the longest repeat in chromosome coordinates.
• Annotate it as a repeat unit, call it “longest repeat”.
• Next class (or by the end of class), turn in a small strip of paper with your name, the location of the longest direct or reverse complement repeat in the sequence, and the first 5 bases of the 5' end.
We're done. Review.
• How is genome rearrangement like flipping a stack of burnt pancakes?
• What is a similarity matrix?
• What is an alignment matrix?
• What is the significance of a row of dots in a dotplot?
• What is the first step of algorithm development? (a) define data structures? (b) define loop structures? (c) google stackoverflow?
See you next time.