MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu

Post on 22-Feb-2016

http://cs173.stanford.edu [BejeranoWinter12/13] 1

CS173

Lecture 11: Repeats II, Mutations

Announcements

• TA HW1 Comments

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAA
TAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATAAAG

Transcription

Transcription Regulation

Chromatin / Proteins

DNA / Proteins

Extracellular signals

Repeats

Sequences that repeat many times in the genome

• Cumulatively take up a whopping half of the genome
• Come in two major, very different, flavors

I

II

I. Interspersed Repeats

Get a copy out of the genome, and into a new location.

II. Simple Repeats

• Every possible motif of mono-, di-, tri- and tetranucleotide repeats is vastly overrepresented in the human genome.

• These are called microsatellites. Longer repeating units are called minisatellites. The really long ones are called satellites.

• Highly polymorphic in the human population.
• Highly heterozygous in a single individual.
• As a result, microsatellites are used in paternity testing, forensics, and the inference of demographic processes.

• There is no clear definition of how many repetitions make a simple repeat, nor how imperfect the different copies can be.

• Highly variable between species: e.g., using the same search criteria the mouse & rat genomes have 2-3 times more microsatellites than the human genome. They’re also longer in mouse & rat.

AAAAAAAAACACACACACCAACAACAA
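Since there is no agreed definition of a simple repeat, any detector has to pick its own thresholds. The following is a minimal Python sketch that finds perfect tandem repeats only; the function name and the `max_unit` and `min_copies` defaults are assumptions made for illustration, and imperfect copies are not handled.

```python
import re

def find_simple_repeats(seq, max_unit=4, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    max_unit and min_copies are arbitrary thresholds chosen for this sketch.
    """
    # A lazy unit of 1..max_unit bases, followed by at least
    # (min_copies - 1) further exact copies of that unit.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# The example string from the slide: a mono-, a di- and a trinucleotide repeat.
example = "AAAAAAAAACACACACACCAACAACAA"
for start, unit, copies in find_simple_repeats(example):
    print(start, unit, copies)
```

Raising `min_copies` or allowing mismatches changes which loci are reported, which is one reason microsatellite counts are only comparable between genomes when the same search criteria are used.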

DNA Replication

Simple Repeats Create Funky DNA structures

These Bumps Give The DNA Polymerase Hiccups

Expandable Repeats and Disease

Restriction Enzymes

• Restriction enzymes recognize and make a cut within specific DNA sequences, known as restriction sites.
• This is usually a 4-6 base pair palindromic sequence.
• Naturally found in different types of bacteria.
• Bacteria use restriction enzymes to protect themselves from foreign DNA.
• Many have been isolated and sold for use in lab work.
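For DNA, "palindromic" means that a site reads the same 5'→3' on both strands, i.e. it equals its own reverse complement. A minimal Python sketch of that check follows; the function names are illustrative, and the EcoRI site GAATTC and BamHI site GGATCC are standard examples that do not come from the slides.

```python
# Map each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def is_palindromic_site(site):
    """True if the site equals its own reverse complement."""
    return site == reverse_complement(site)

print(is_palindromic_site("GAATTC"))   # EcoRI recognition site
print(is_palindromic_site("GATTACA"))  # an arbitrary non-palindromic string
```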

blunt end

sticky end

DNA Fingerprint Basics

DNA fragments of different sizes will be produced by a restriction enzyme that cuts at the points shown by the arrows.
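The idea that cut positions determine fragment sizes can be sketched in a few lines of Python. This is a single-strand toy model under stated assumptions: the `digest` function and its `cut_offset` parameter are hypothetical names, the input sequence is made up, and EcoRI (which cuts G^AATTC, so the offset is 1) is used only as a familiar example.

```python
def digest(seq, site, cut_offset):
    """Cut seq at every occurrence of site.

    cut_offset is the position inside the site where the strand is cut.
    Returns the resulting fragments in order along the sequence.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

fragments = digest("AAGAATTCCCCCGAATTCT", "GAATTC", cut_offset=1)
# Fragment lengths, largest first: the size pattern a gel would resolve.
print(sorted((len(f) for f in fragments), reverse=True))
```

Two individuals who differ in the number of repeat copies between two cut sites would yield fragments of different lengths, which is what makes the pattern usable as a fingerprint.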

DNA fragments are then separated based on size using gel electrophoresis.

DNA fingerprinting can be used in paternity testing or murder cases.

There are Tracks for it

Interspersed vs. Simple Repeats

From an evolutionary point of view transposons and simple repeats are very different.

Different instances of the same transposon share common ancestry (but not necessarily a direct common progenitor).

Different instances of the same simple repeat most often do not.

Genome Content, Genome Function DONE

• Transcripts
  – Protein coding genes
  – Non-coding RNAs
• Gene regulatory elements
  – Promoters
  – Enhancers
  – Repressors
  – Insulators
• Epigenomics
  – Nucleosomes, open chromatin
  – Histone modifications
• Repeats
  – Interspersed repeats / mobile elements
  – Simple repeats

Categories are NOT mutually exclusive

• We already discussed repeat instances that became
  – Coding exons
  – Enhancers
• There are known genomic loci that
  – Code for protein coding exons and act as enhancers.
  – Ditto for non-coding RNA + enhancer.
• There are bi-directional exons
  – Coding in both directions
  – Coding and anti-sense
  – Both non-coding

[Figure: species shown: human, chimp, macaque, mouse, rat, dog, cow, opossum, platypus, chicken, tetra, fugu, zfish]

Comparative Genomics

“Nothing in Biology Makes Sense Except in the Light of Evolution” Theodosius Dobzhansky

The genome is constantly replicated

Every cell holds 2 copies of all its DNA = its genome.
The human body is made of ~10^13 cells.
All originate from a single cell through repeated cell divisions.

[Figure: an egg cell becomes a chicken through cell division. Genome = all DNA; DNA strings = chromosomes. Chicken (DNA) ≈ 10^13 copies of egg (DNA).]

Evolution = Mutation + Selection

Mistakes can happen during DNA replication. Mistakes are oblivious to DNA segment function. But then selection kicks in.

[Figure: the sequence ...ACGTACGACTGACTAGCATCGACTACGA... is copied from egg to chicken, with mutations (TT, CAT) marked. In junk DNA, “anything goes”; in functional DNA, many changes are not tolerated.]

This has bad implications – disease, and good implications – adaptation.

Mutation

Chromosomal (i.e., big) Mutations

• Five types exist:
  – Deletion
  – Inversion
  – Duplication
  – Translocation
  – Nondisjunction

Deletion
• Due to breakage
• A piece of a chromosome is lost

Inversion
• Chromosome segment breaks off
• Segment flips around backwards
• Segment reattaches

Duplication
• Occurs when a genomic region is repeated

Whole Genome Duplication at the Base of the Vertebrate Tree

Xen.Laevis WGD

Translocation
• Involves two chromosomes that aren’t homologous
• Part of one chromosome is transferred to another chromosome

Nondisjunction
• Failure of chromosomes to separate during meiosis
• Causes gamete to have too many or too few chromosomes
• Disorders:
  – Down Syndrome – three copies of chromosome 21
  – Turner Syndrome – single X chromosome
  – Klinefelter’s Syndrome – XXY chromosomes

Genomic (i.e., small) Mutations

• Six types exist:
  – Substitution (e.g., G→T)
  – Deletion
  – Insertion
  – Inversion
  – Duplication
  – Translocation

Example: Human-Chimp Genomic Differences

[Figure: bar chart of the number of events per category: nucleotide substitutions; indels < 10 Kb; microinversions < 100 Kb; deletions/duplications; microinversions > 100 Kb; pericentric inversions; fusion.]

Inferring Genomic Mutations From Alignments of Genomes

A Gene tree evolves with respect to a Species tree

[Figure: species tree and gene tree. Legend: speciation, duplication, loss.]

By “Gene” we mean any piece of DNA.

Terminology

Orthologs: Genes related via speciation (e.g. C, M, H3)
Paralogs: Genes related through duplication (e.g. H1, H2, H3)
Homologs: Genes that share a common origin (e.g. C, M, H1, H2, H3)
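One common way to apply these definitions is to label each internal node of a gene tree as a speciation or a duplication, and classify a pair of genes by the event at their last common ancestor. The Python sketch below does this on a made-up tree; the topology of `GENE_TREE` is an assumption chosen to agree with the examples above (the figure itself is not recoverable from the transcript), and the function names are illustrative.

```python
# Hypothetical gene tree: internal nodes are (event, left, right), leaves are names.
GENE_TREE = ("speciation", "C",
             ("speciation", "M",
              ("duplication", "H3",
               ("duplication", "H1", "H2"))))

def leaves(node):
    """Set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def lca_event(node, a, b):
    """Event label at the last common ancestor of two distinct leaves a and b."""
    _, left, right = node
    for child in (left, right):
        if {a, b} <= leaves(child):
            return lca_event(child, a, b)
    return node[0]

def relationship(a, b):
    """'orthologs' if a and b diverged at a speciation, else 'paralogs'."""
    return "orthologs" if lca_event(GENE_TREE, a, b) == "speciation" else "paralogs"

print(relationship("C", "M"), relationship("H1", "H2"))
```

Every pair in the tree is homologous by construction, since all leaves descend from the single gene at the root.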

[Figure: species tree and gene tree descending from a single ancestral gene. Legend: speciation, duplication, loss.]

Typical Molecular Distances

If they were only evolving neutrally:
• To which is H1 closer in sequence, H2 or H3?
• To which H is M closest?
• And C?
(Selection may change distances)


Gene trees and even species trees are figments of our (scientific) imagination

Species trees and gene trees can be wrong.
All we really have are extant observations, and fossils.

[Figure: species tree and gene tree descending from a single ancestral gene, with parts marked as observed or inferred. Legend: speciation, duplication, loss.]

Gene Families

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition

Given two strings x = x1x2...xM, y = y1y2...yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Scoring Function

• Sequence edits: AGGCCTC

  Mutations: AGGACTC
  Insertions: AGGGCCTC
  Deletions: AGG.CTC

Scoring Function:
  Match: +m
  Mismatch: -s
  Gap: -d

Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
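The score F can be computed directly from the two rows of an alignment. A minimal Python sketch, assuming gaps are written as "-" and that each gap column costs d (no separate gap-open penalty); the function name and the default values m = s = d = 1 are illustrative choices, not values given in the lecture.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """F = (#matches)*m - (#mismatches)*s - (#gaps)*d for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # a letter lined up with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d

# The substitution and insertion examples from the slide, written as alignments.
print(score_alignment("AGGCCTC", "AGGACTC"))
print(score_alignment("AGG-CCTC", "AGGGCCTC", d=2))
```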

Alternative definition: minimal edit distance

“Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”

Cost of edit operations needs to be biologically inspired (e.g., DEL length).

Solve via Dynamic Programming
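A minimal Python sketch of the dynamic program for edit distance follows. It assumes unit cost for every insertion, deletion and substitution, which is the textbook simplification rather than the biologically inspired cost model the slide calls for; the function name is illustrative.

```python
def edit_distance(x, y):
    """Minimum number of unit-cost edits turning x into y (row-by-row DP)."""
    # prev[j] holds the distance between the first i-1 letters of x
    # and the first j letters of y.
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from x
                cur[j - 1] + 1,          # insert b into x
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]

print(edit_distance("AGGCCTC", "AGGACTC"))
```

Only two rows of the DP matrix are kept, so memory is proportional to the length of y while the running time stays proportional to the product of the two lengths.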

Are two sequences homologous?

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Given an (optimal) alignment between two genome regions, you can ask: what is the probability that they are (not) related by homology?

Note that (when known) the answer is a function of the molecular distance between the two (e.g., between two species).

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

[Figure: DP matrix]

Chaining Alignments

Chaining highlights homologous regions between genomes (it bridges the gulf between syntenic blocks and base-by-base alignments).

Local alignments tend to break at transposon insertions, inversions, duplications, etc.

Global alignments tend to force non-homologous bases to align.

Chaining is a rigorous way of joining together local alignments into larger structures.

[Figure: dot plots and DP matrix]

“Raw” (B)lastz track (no longer displayed)

Protease Regulatory Subunit 3

Alignment = homologous regions

Chains & Nets: How they’re built

• 1: Blastz one genome to another
  – Local alignment algorithm
  – Finds short blocks of similarity

Hg18: AAAAAACCCCCAAAAA
Mm8:  AAAAAAGGGGG

Hg18.1-6   + AAAAAA
Mm8.1-6    + AAAAAA

Hg18.7-11  + CCCCC
Mm8.1-5    - CCCCC

Hg18.12-16 + AAAAA
Mm8.1-5    + AAAAA

Chains & Nets: How they’re built

• 2: “Chain” alignment blocks together
  – Links blocks that preserve order and orientation
  – Not single coverage in either species

Hg18: AAAAAACCCCCAAAAA
Mm8:  AAAAAAGGGGGAAAAA

[Figure: Mm8 chains drawn against Hg18: AAAAAACCCCCAAAAA. Chain blocks: Mm8.1-6 +, Mm8.7-11 -, Mm8.12-16 +, Mm8.12-15 +, Mm8.1-5 +]

Another Chain Example

[Figure: ancestral, human and mouse sequences built from blocks A, B, C, D, E, plus a copy B’. Panels: “In Human Browser” (implicit human sequence, mouse chains) and “In Mouse Browser” (implicit mouse sequence, human chains).]

The Use of an Outgroup

[Figure: outgroup, human and mouse sequences built from blocks A, B, C, D, E, plus a copy B’. Panels: “In Human Browser” (implicit human sequence, mouse chains) and “In Mouse Browser” (implicit mouse sequence, human chains).]

Chains join together related local alignments

Protease Regulatory Subunit 3

likely ortholog

likely paralogs

shared domain?
Chains

• A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
• Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
• Not just orthologs, but paralogs too, can result in good chains. But that's useful!
• Chains should be symmetrical, e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
• Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
• Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs). [Angie Hinrichs, UCSC wiki]
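The first two properties in the list above (gapless blocks, no overlaps, coordinates that never go backwards) can be checked mechanically. A minimal Python sketch, assuming each block is a tuple `(t_start, t_end, q_start, q_end)` with half-open coordinates on the same strand; this representation and the function name are assumptions for illustration, not the UCSC chain file format.

```python
def is_valid_chain(blocks):
    """Check the ordering rules for a list of blocks given in chain order."""
    for t_start, t_end, q_start, q_end in blocks:
        # A gapless block covers the same number of bases in target and query.
        if t_end <= t_start or (t_end - t_start) != (q_end - q_start):
            return False
    for prev, cur in zip(blocks, blocks[1:]):
        # The next block may start where the previous one ended (flat),
        # or later (a gap), but never earlier in target or in query.
        if cur[0] < prev[1] or cur[2] < prev[3]:
            return False
    return True

print(is_valid_chain([(0, 6, 0, 6), (11, 16, 11, 16)]))
print(is_valid_chain([(0, 6, 10, 16), (6, 11, 0, 5)]))
```

The second call fails because the query coordinate jumps backwards, which is the situation that forces a rearranged region into a separate chain.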

Before and After Chaining

Chaining Algorithm

Input: blocks of gapless alignments from blastz.

Dynamic program based on the recurrence relationship:

$$\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$$

Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
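The recurrence can be sketched in Python with a plain quadratic loop over predecessor blocks. This is not the KD-tree version described above: it runs in O(N²) and is only meant to show the recurrence at work. The `Block` type, the linear `gap_cost` function and its `per_base` parameter are assumptions made for illustration, and a block with no usable predecessor simply starts a new chain with score match(B_i).

```python
from typing import NamedTuple

class Block(NamedTuple):
    t_start: int
    t_end: int    # exclusive
    q_start: int
    q_end: int    # exclusive
    match: int    # score of the gapless block itself

def gap_cost(prev, cur, per_base=1):
    """Hypothetical linear penalty on the target and query gaps between blocks."""
    return per_base * ((cur.t_start - prev.t_end) + (cur.q_start - prev.q_end))

def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain of blocks."""
    if not blocks:
        return 0, []
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    score = [b.match for b in blocks]   # base case: the block alone
    back = [-1] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # Only blocks that end before cur starts, in both genomes, may precede it.
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                cand = score[j] + cur.match - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    i = max(range(len(blocks)), key=score.__getitem__)
    best, chain = score[i], []
    while i != -1:                      # follow back-pointers to recover the chain
        chain.append(blocks[i])
        i = back[i]
    return best, chain[::-1]

blocks = [Block(0, 6, 0, 6, 60), Block(6, 11, 20, 25, 50), Block(11, 16, 11, 16, 50)]
print(best_chain(blocks))
```

In this toy input the middle block cannot be followed by the last one because its query coordinates lie further along, so the best chain skips it and pays the gap cost instead.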