
Post on 17-Jul-2020


Sequence Analysis Fall ‘18

Are you ready for this course?

• Computer Science prerequisites: Can you program? Can you write pseudocode? Can you download, install, and run apps?

• Biology prerequisites: Do you know the central dogma of biology? Have you memorized the amino acids and their properties? Do you know how evolution works?

You need this:

Zvelebil & Baum = Z&B

How Biologists Think

• Spatio-temporo-chemical phenomena
• Experimental methods/design
• Macromolecular structure
• Evolution

How Computer Scientists Think

• Convert desired input/output to a set of Data Structures and Instructions.
• Query a database.
• Simulate.
• Optimize.

http://shaunskeen.com/

How a computational biologist thinks

Biology side of brain:
• Understands biological systems.
• Designs experiments to produce new data.
• Compares predictions to experimental results.

CS side of brain:
• Converts biological systems into data structures.
• Converts hypotheses into instructions.
• Writes database queries.

(Figure: the two sides of the brain are linked by "modeling" and "validating".)

The Scientific Process -- DIPA

Step: Biology / Computer Science
• Data generation: experiments / filling databases
• Interpretation: informal models / formal models
• Prediction: outcomes / simulations
• Action: experimental design / obtain more data

An example of thinking like a computational biologist: The Evolution of Chromosomes ...

Syntenic groups: genes that stay together, generally for survival

Trp operon genes in bacteria

Large scale evolutionary changes in chromosomes...

• Inversions: A syntenic group appears on the opposite strand, different location.
• Duplications: “...the most important evolutionary force since the emergence of the universal common ancestor.” -- Susumu Ohno
• Transpositions: A syntenic group appears on a different chromosome.

...have led to chromosomes that are mosaics of other species' chromosomes

• Mouse and human chromosomes have the same content, but gene locations are scrambled!

NOTE: no transpositions from/to X, Y chromosomes!

Let's look at meiosis, Prophase I

http://www.phschool.com/science/biology_place/labbench/lab3/concepts2.html

How?

Chiasmata: where crossovers occur

http://ib.bioninja.com.au/higher-level/topic-10-genetics-and-evolu/101-meiosis/chiasmata.html

Normal prophase 1

• Tetrad forms.
• Non-sister chromatids synapse, sometimes cross over.
• Alleles swapped, gene order conserved.
• ...then, tetrads line up (metaphase I), separate (anaphase I).

(Figure: chromatids A B C D and a b c d in a tetrad; x marks the crossovers.)

Abnormal prophase 1

• Tetrad forms, something goes wrong.... it is backwards.
• Illegitimate synapses form, sometimes cross over.
• Alleles swapped, genes conserved, order reversed.
• ...then, tetrads line up (metaphase I), separate (anaphase I).

(Figure: chromatids A B C D and a b c d in a tetrad, paired backwards.)

Normal prophase 1: A B C D + a b c d => A b c D + a B C d

Abnormal prophase 1: A B C D + a b c d => A c b D + a C B d
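The two outcomes can be mimicked with lists. This is only a sketch: the function name `crossover` and its index arguments are illustrative assumptions, not part of the slides. It swaps one segment between two chromatids and, for the abnormal case, also reverses the swapped segments.

```python
def crossover(top, bottom, start, end, inverted=False):
    """Swap the segment [start, end) between two chromatids.

    With inverted=True the exchanged segments are also reversed,
    mimicking a crossover across a backwards synapse.
    """
    seg_top, seg_bottom = top[start:end], bottom[start:end]
    if inverted:
        seg_top, seg_bottom = seg_top[::-1], seg_bottom[::-1]
    new_top = top[:start] + seg_bottom + top[end:]
    new_bottom = bottom[:start] + seg_top + bottom[end:]
    return new_top, new_bottom


upper = list("ABCD")
lower = list("abcd")
normal = crossover(upper, lower, 1, 3)
abnormal = crossover(upper, lower, 1, 3, inverted=True)
print(" ".join(normal[0]), "+", " ".join(normal[1]))      # A b c D + a B C d
print(" ".join(abnormal[0]), "+", " ".join(abnormal[1]))  # A c b D + a C B d
```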

..or maybe it happens this way.

Legitimate synapses form, but due to a loop, they get too close, crossovers are swapped. Same result.

(Figure: chromatid A B C D with a loop, paired with a b c d; x marks the crossovers.)

Accumulation of inversions over (how much?) evolutionary time

Mouse chromosomes colored according to homology to human chromosomes.

(Figure labels: rug rat, mouse, rat.)

Can we map our ancestry using inversions?


How many inversions: mouse-to-human, rat-to-human, rat-to-mouse?

How many inversions have happened? How do we count them? The answer tells us the evolutionary distance.

How many inversions is this?

The way a computational biologist thinks of the problem of counting inversions.

The Pancake Flipping Problem

A sloppy cook at a pancake diner makes pancakes of all different sizes and stacks them haphazardly.

A meticulous waiter likes the pancakes to be stacked with the largest on the bottom and the smallest on top. On the way to the table, using only one hand with a spatula, he flips the pancakes until they are arranged by size, largest on bottom, smallest on top.

•What is the algorithm for flipping?

•What is the algorithm for finding the fewest flips?

In class exercise:

Given the arrangement below, flip the pancakes until they are in order. How many flips? (You can order the numbers instead of the pancakes.)

642315 123456
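One possible flipping strategy, as a sketch only: repeatedly bring the largest unsorted pancake to the top, then flip it down into place. The names `flip` and `pancake_sort` are hypothetical, index 0 is taken to be the top of the stack, and sizes are assumed to be distinct. This strategy always finishes, but it is not guaranteed to use the fewest flips.

```python
def flip(stack, k):
    """Flip the top k pancakes (index 0 is the top of the stack)."""
    return stack[:k][::-1] + stack[k:]


def pancake_sort(stack):
    """Sort smallest-on-top using only top flips; return (stack, flips)."""
    stack = list(stack)
    flips = []
    for size in range(len(stack), 1, -1):
        biggest = stack.index(max(stack[:size]))
        if biggest == size - 1:
            continue  # largest unsorted pancake is already in place
        if biggest != 0:
            stack = flip(stack, biggest + 1)  # bring it to the top
            flips.append(biggest + 1)
        stack = flip(stack, size)  # send it down to position size - 1
        flips.append(size)
    return stack, flips


ordered, flips = pancake_sort([6, 4, 2, 3, 1, 5])
print(ordered, "after flips of size", flips)
```

Each entry in `flips` is the number of pancakes lifted by the spatula, so the procedure can be replayed by hand on the exercise.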

In class exercise: work in pairs

•Write detailed instructions on how to stack six pancakes in order by flipping. The instructions should not depend on the starting order.

•Generate a random pancake order. Apply your algorithm. 125436 ---> ... ---> ... ---> 123456

•Pick a volunteer: Write pseudocode on the board.

5' TATTAGCCCGTGACAAACTAAGCCTATG 3'
3' ATAATCGGGCACTGTTTGATTCGGATAC 5'

5' TATTAGCCTAGTTTGTCACGAGCCTATG 3'
3' ATAATCGGATCAAACAGTGCTCGGATAC 5'

Flipping changes chain direction

If we make reverse complement to be negative numbers:

1 2 3 4

becomes

1 -3 -2 4
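The same flip can be written for a DNA string and for signed block numbers. A minimal sketch, assuming uppercase A/C/G/T only; the helper names `reverse_complement`, `invert_dna`, and `reversal` are illustrative, not from the slides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def invert_dna(dna, start, end):
    """Replace dna[start:end] by its reverse complement."""
    return dna[:start] + reverse_complement(dna[start:end]) + dna[end:]


def reversal(blocks, start, end):
    """Reverse blocks[start:end] and flip the sign of each block."""
    flipped = [-b for b in reversed(blocks[start:end])]
    return blocks[:start] + flipped + blocks[end:]


print(invert_dna("TATTAGCCCGTGACAAACTAAGCCTATG", 8, 20))
# TATTAGCCTAGTTTGTCACGAGCCTATG
print(reversal([1, 2, 3, 4], 1, 3))
# [1, -3, -2, 4]
```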

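Evolution by reversals.

ancestral mammal:
1 2 3 4 5 6 7 8 9

One lineage:
-4 -3 -2 -1 5 6 7 8 9
-4 -3 -6 -5 1 2 7 8 9
-4 -3 5 6 1 2 7 8 9
-4 -3 5 6 -9 -8 -7 -2 -1

Other lineage:
1 2 3 4 -9 -8 -7 -6 -5
1 2 3 7 8 9 -4 -6 -5
1 2 3 7 8 9 6 4 -5
-8 -7 -3 -2 -1 9 6 4 -5

The two lineages end at human and mouse.

“-” indicates reverse complement.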

Pancakes-burned-on-one-side Problem

A sloppy cook at a pancake diner makes pancakes of all different sizes and stacks them haphazardly. He’s also a bad cook. All of the pancakes are burned on one side.

The waiter likes the pancakes to be stacked with the largest on the bottom and the smallest on top and with all of the burned sides down!

He has two spatulas and can flip any number of adjacent pancakes at a time.

•How does he arrange the pancakes in the fewest flips?

In class exercise, part 2:

•Write instructions for the Two-Spatula-Pancakes-burned-on-one-side Problem. When flipped, pancakes change order and sign. Flip any segment.

•Generate a random pancake arrangement. Apply your algorithm 1 -2 5 4 -3 6 ---> ... ---> ... ---> 123456
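For six pancakes the fewest flips can be found by brute force. The sketch below is an assumption-laden illustration, not the intended classroom answer: it runs a breadth-first search over all signed arrangements, and the names `reversal` and `fewest_reversals` are hypothetical. The input is assumed to be a signed permutation of 1..n, and the search is only practical for small n.

```python
from collections import deque


def reversal(blocks, i, j):
    """Flip blocks[i:j]: reverse the order and change every sign."""
    return blocks[:i] + tuple(-b for b in reversed(blocks[i:j])) + blocks[j:]


def fewest_reversals(start):
    """Return a shortest list of arrangements from start to 1..n."""
    start = tuple(start)
    goal = tuple(range(1, len(start) + 1))
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:  # walk back to the start
                path.append(current)
                current = parent[current]
            return path[::-1]
        n = len(current)
        for i in range(n):
            for j in range(i + 1, n + 1):
                nxt = reversal(current, i, j)
                if nxt not in parent:
                    parent[nxt] = current
                    queue.append(nxt)
    return None  # not reached for a valid signed permutation


for arrangement in fewest_reversals([1, -2, 5, 4, -3, 6]):
    print(arrangement)
```

The number of flips is one less than the number of printed arrangements.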

"Dot plot"

sequence 1: 1 2 3 4 5 6 7 8 9
sequence 2: 2 9 5 4 6 8 7 1 3

Place a dot where sequences are identical.

A data structure for comparing sequences

Data structure!

Simpler dot plot: line segments

sequence 1: 1 2 3 4 5 6 7 8 9
sequence 2: 9 5 4 6 7 8 1 2 3

If dots are bases, lines are sequences. If dots are sequences, lines are syntenic groups.
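Finding the line segments can be sketched as grouping the second sequence into runs that step by +1 or -1. The function name `line_segments` is hypothetical, and the input is assumed to be a non-empty list of block numbers.

```python
def line_segments(order):
    """Split block numbers into maximal runs that step by +1
    (a direct line in the dot plot) or by -1 (a reversed line)."""
    segments = []
    run, direction = [order[0]], 0
    for value in order[1:]:
        step = value - run[-1]
        if step in (1, -1) and direction in (0, step):
            run.append(value)
            direction = step
        else:
            segments.append(run)
            run, direction = [value], 0
    segments.append(run)
    return segments


print(line_segments([9, 5, 4, 6, 7, 8, 1, 2, 3]))
# [[9], [5, 4], [6, 7, 8], [1, 2, 3]]
```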

+/- integers are direct/rc strands

Z&B p. 158

1 2 3 4 5 6 7 8 9
-4 -3 5 6 -9 -8 -7 -2 -1

Each number represents a line segment.
Each negative number represents its reverse complement.

(Figure labels: direct alignment, reverse complement alignment.)

P. Pevzner et al, Trends in Genetics. Volume 20, Issue 12, December 2004, Pages 631–639

Connect breakpoints
Decompose graph into cycles

Breakpoint graph decomposition algorithm for counting minimal genomic reversals in the mammalian X chromosome: human versus mouse.


Breakpoint graph decomposition 1

1 2 3 4 5 6 7 8 9
1 2 -7 -6 -5 -4 -3 8 9
1 2 -7 -8 3 4 5 6 9
1 -3 8 7 -2 4 5 6 9

Number of inversions = number of cycles

Breakpoint graph decomposition 2

1 2 3 4 5 6 7 8 9
1 2 -7 -6 -5 -4 -3 8 9
1 2 6 7 -5 -4 -3 8 9
1 2 6 7 -5 4 3 8 9

Number of inversions = N/2 - 1, where N = number of vertices in the cycle
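Cycle counting can be sketched in code. The construction below is an assumption: the slides do not spell it out, so this uses the usual textbook breakpoint graph for signed permutations, in which every block becomes two vertices and the count n + 1 - (number of cycles) is a lower bound on the number of inversions. The function name `cycle_lower_bound` is hypothetical.

```python
def cycle_lower_bound(blocks):
    """Lower bound n + 1 - c on the number of inversions, where c is
    the number of cycles in the breakpoint graph of a signed permutation."""
    n = len(blocks)
    # Each signed block x becomes two vertices, 2|x|-1 and 2|x|;
    # a negative block lists them in the opposite order.
    vertices = [0]
    for x in blocks:
        a, b = 2 * abs(x) - 1, 2 * abs(x)
        vertices += [a, b] if x > 0 else [b, a]
    vertices.append(2 * n + 1)

    # One kind of edge joins neighbours in the observed order.
    observed = {}
    for i in range(0, len(vertices), 2):
        u, v = vertices[i], vertices[i + 1]
        observed[u], observed[v] = v, u

    # The other kind joins neighbours in the sorted order: 2k with 2k + 1,
    # which is the same as v with v ^ 1.
    seen, cycles = set(), 0
    for start in vertices:
        if start in seen:
            continue
        cycles += 1
        current = start
        while current not in seen:
            seen.add(current)
            partner = observed[current]
            seen.add(partner)
            current = partner ^ 1
    return n + 1 - cycles


print(cycle_lower_bound([1, -3, 8, 7, -2, 4, 5, 6, 9]))  # 3
```

On the last line of the first example the graph has three cycles with four vertices each, so the bound of 3 agrees with both counting rules: one inversion per cycle, and N/2 - 1 = 1 for each cycle.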


How many inversions?

P. Pevzner et al, Trends in Genetics. Volume 20, Issue 12, December 2004, Pages 631–639

Breakpoint graph decomposition

How do I find the alignment matrix given the sequences?

A similarity matrix ... in its most basic form:

• Identical characters get score=1.
• Non-identical gets score=0.

(Figure: similarity matrix comparing the sequence A C T G A A C C T with itself; each cell is 1 where the row and column characters are identical, 0 otherwise.)
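The scoring rule translates directly into a nested list. A minimal sketch; the function name `similarity_matrix` is illustrative, and the sequence is the one on the axes of the figure.

```python
def similarity_matrix(seq_a, seq_b):
    """S[i][j] is 1 if seq_a[i] and seq_b[j] are identical, else 0."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


sequence = "ACTGAACCT"
matrix = similarity_matrix(sequence, sequence)
print("  " + " ".join(sequence))
for base, row in zip(sequence, matrix):
    print(base, " ".join(str(cell) for cell in row))
```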

An alignment matrix.

1=aligned (associated), 0=not aligned

• Boolean matrix.
• Max one “1” per row.
• Max one “1” per column.

(Figure: alignment matrix for the sequence A C T G A A C C T against itself.)

An alignment matrix.

Alignments may be “sequential” or “non-sequential”. If the alignment is sequential:

• If A(i,j)==1, then A(m,n)=0 for all (m<i && n>j)
• If A(i,j)==1, then A(m,n)=0 for all (m>i && n<j)

i.e. SW and NE of “1” must be zero.

(Figure: alignment matrix for the sequence A C T G A A C C T against itself.)
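The rules for an alignment matrix can be checked mechanically. A sketch with hypothetical function names; `is_sequential` assumes its input already passes `is_alignment_matrix`.

```python
def is_alignment_matrix(matrix):
    """Boolean entries, at most one 1 per row and per column."""
    boolean = all(cell in (0, 1) for row in matrix for cell in row)
    rows_ok = all(sum(row) <= 1 for row in matrix)
    cols_ok = all(sum(col) <= 1 for col in zip(*matrix))
    return boolean and rows_ok and cols_ok


def is_sequential(matrix):
    """True if no 1 lies SW or NE of another 1, i.e. the column
    index of the 1s increases as the row index increases."""
    ones = sorted(
        (i, j)
        for i, row in enumerate(matrix)
        for j, cell in enumerate(row)
        if cell == 1
    )
    return all(j1 < j2 for (_, j1), (_, j2) in zip(ones, ones[1:]))


diagonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
crossed = [[0, 1], [1, 0]]
print(is_alignment_matrix(diagonal), is_sequential(diagonal))  # True True
print(is_alignment_matrix(crossed), is_sequential(crossed))    # True False
```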

A Dot Plot is a similarity matrix

Each position in the matrix D[i,j] is either
• dot, if A[i] == B[j]
• blank, otherwise.

AAGACGTTTA GACGTACT
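A text-only dot plot of the two sequences, as a sketch. Taking A = AAGACGTTTA and B = GACGTACT is an assumption about which sequence is which, and blanks are printed as "." so the grid stays readable. The function names are illustrative.

```python
def dot_matrix(seq_a, seq_b):
    """D[i][j] is True (a dot) when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def show(seq_a, seq_b):
    """Print the dot plot with seq_a down the side and seq_b on top."""
    print("  " + " ".join(seq_b))
    for base, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        print(base, " ".join("*" if dot else "." for dot in row))


show("AAGACGTTTA", "GACGTACT")
```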

Dot plot diagonals are alignments

To find short, unbroken alignments, we set window size and stringency and mark all dots that pass with a line.

AAGACGTTTA GACGTACT

Show all diagonals with at least 4 out of 5 matches.
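A sketch of the filter, assuming the window is slid along every diagonal and kept when at least `stringency` of its `window` positions match. The function name `window_hits` and the (i, j, matches) output format are illustrative choices.

```python
def window_hits(seq_a, seq_b, window=5, stringency=4):
    """Return (i, j, matches) for every diagonal window starting at
    seq_a[i], seq_b[j] that has at least `stringency` matches."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j, matches))
    return hits


for i, j, matches in window_hits("AAGACGTTTA", "GACGTACT"):
    print(f"A[{i}:{i + 5}] vs B[{j}:{j + 5}]: {matches} of 5 match")
```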

"window size" is the length of a diagonal, "stringency" is minimum number of matches in the window.

Dot matrix with reverse complement

AAGACGTTTA GACGTACT

Base matches its complement

Reverse diagonal means inverse alignment

Now we have two types of dots
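The second type of dot can be sketched as follows, assuming uppercase A/C/G/T only. The names `dot_type` and `inverted_hits` are hypothetical. A reverse diagonal of a given length means a stretch of one sequence equals the reverse complement of a stretch of the other.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dot_type(a, b):
    """Classify a cell: identical bases, complementary bases, or no dot."""
    if a == b:
        return "direct"
    if COMPLEMENT[a] == b:
        return "complement"
    return None


def inverted_hits(seq_a, seq_b, length):
    """Return (i, j) where seq_a[i:i+length] is the reverse
    complement of seq_b[j:j+length] (an unbroken reverse diagonal)."""
    hits = []
    for i in range(len(seq_a) - length + 1):
        for j in range(len(seq_b) - length + 1):
            if all(
                COMPLEMENT[seq_a[i + k]] == seq_b[j + length - 1 - k]
                for k in range(length)
            ):
                hits.append((i, j))
    return hits


print(inverted_hits("AAGACGTTTA", "GACGTACT", 4))
# [(3, 1)]
```

The single hit is ACGT, which is its own reverse complement, so it shows up on both a direct and a reverse diagonal.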

Install UGENE

• ugene.net/
• Download UGENE manual.
• Follow instructions for installing UGENE on your system.

Take-home exercise 1: UGENE
Turn in Thurs. Sep 6

• Open FASTA/human_T1.fa (or download from "UGENE files" link).
• Right-click in sequence window. Select Analyze...Build Dotplot... Set x=y=human_T1. Direct repeats and inverted. Set minimum length = 50, 100% identity. Click OK.
• Navigate the dotplot window by zooming and scrolling.
• Find locus of the longest repeat in chromosome coordinates.
• Annotate it as a repeat unit, call it “longest repeat”.
• Next class (or by the end of class), turn in a small strip of paper with your name, the location of the longest direct or reverse complement repeat in the sequence, and the first 5 bases of the 5' end.

We're done. Review.

• How is genome rearrangement like flipping a stack of burnt pancakes?
• What is a similarity matrix?
• What is an alignment matrix?
• What is the significance of a row of dots in a dotplot?
• What is the first step of algorithm development? (a) define data structures? (b) define loop structures? (c) google stackoverflow?

See you next time.