1
Transforming Men into Mice:
Are there Fragile Regions in Human Genome?
2
From Biology to Computing:
……
Problem Formulation
6
What are the similarity blocks and how to find them?
What is the architecture of the ancestral genome?
What is the evolutionary scenario for transforming one genome into the other?
Unknown ancestor, ~80 million years ago
Mouse (X chrom.)
Human (X chrom.)
Genome rearrangements
7
Transforming mice into men (X chromosome)
8
Genome Rearrangements: Evolutionary “Earthquakes”
What is the evolutionary scenario for transforming one genome into the other?
What is the organization of the ancestral genome?
Are there any rearrangement hotspots in mammalian genomes?
10
Susumu Ohno: Two Hypotheses
Ohno, 1970, 1973
Whole Genome Duplication Hypothesis: Big leaps in evolution would have been impossible without whole genome duplications.
Random Breakage Hypothesis: Genomic architectures are shaped by rearrangements that occur randomly (there are no fragile regions).
11
Whole Genome Duplication Hypothesis Finally Confirmed After Years of Controversy
The Whole Genome Duplication hypothesis was first met with skepticism and was only recently confirmed.
“There was a whole-genome duplication.” Wolfe, Nature, 1997
“There was no whole-genome duplication.” Dujon, FEBS, 2000
“Duplications occurred independently” Langkjaer, JMB, 2000
“Continuous duplications” Dujon, Yeast 2003
“Multiple duplications” Friedman, Gen. Res, 2003
“Spontaneous duplications” Koszul, EMBO, 2004
Kellis, Birren & Lander, Nature, 2004: “Our analysis resolves the long-standing controversy on the ancestry of the yeast genome”
12
Random Breakage Hypothesis Meets a Different Fate
The random breakage hypothesis was embraced by biologists and has become the de facto theory of chromosome evolution.
Nadeau & Taylor, PNAS 1984
First estimate of the number of synteny blocks between human and mouse
First convincing arguments in favor of the Random Breakage Model (RBM)
RBM was reiterated in hundreds of papers
13
Pevzner & Tesler, PNAS 2003
Rejected RBM and proposed the Fragile Breakage Model
Postulated the existence of rearrangement hotspots and vast breakpoint reuse
14
Are the Rearrangement Hotspots Real?
The Fragile Breakage Model did not live long.
In 2004 David Sankoff presented convincing arguments against the Fragile Breakage Model (Sankoff & Trinh, 2004):
“… we have shown that breakpoint re-use of the same magnitude as found in Pevzner and Tesler, 2003 may very well be artifacts in a context where NO re-use actually occurred.”
15
Random Breakage Theory re-re-re-visited
Ohno, 1970, Nadeau & Taylor, 1984 introduced RBM
Pevzner & Tesler, 2003 argued against RBM
16
Sankoff & Trinh, 2004 argued against Pevzner & Tesler, 2003 arguments against RBM
17
Today I will argue against Sankoff & Trinh, 2004 arguments against Pevzner & Tesler, 2003 arguments against RBM
22
History of Chromosome X
Rat Consortium, Nature, 2004
23
Human-Mouse-Rat Phylogeny
26
Tumor Genomes
Tumor cells often exhibit chromosomal aberrations:
27
Tumor Genomes
Thousands of individual rearrangements are known for different tumors.
Rearrangements may disrupt genes and alter gene regulation.
Example: a translocation in leukemia yields the “Philadelphia” chromosome:
[Figure: Chr 9 (promoter, ABL gene) and Chr 22 (promoter, BCR gene) before the translocation; BCR gene, promoter, and c-abl oncogene after it]
28
Breast Cancer Tumor Genome
MCF7 is a human breast cancer cell line. Cytogenetic analysis suggests a complex architecture:
What is the detailed architecture of MCF7 tumor genome?
What sequence of rearrangements produced MCF7?
30
Reversals (also called inversions)
Classically, blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1, …, 10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.
Clinical: occurs in many cancers.
Evolution: occurred about once or twice every million years on the evolutionary path between human and mouse.
[Figure: blocks 1 to 10 drawn along a chromosome]
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
31
1, 2, 3, -8, -7, -6, -5, -4, 9, 10
32
The inversion introduced two breakpoints (disruptions in gene order).
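The reversal on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming a genome is stored as a Python list of signed integers; the function name `reverse_segment` and the 0-based inclusive indices are illustrative choices, not part of the original slides.

```python
def reverse_segment(perm, i, j):
    """Return a copy of perm with positions i..j (0-based, inclusive)
    written in reverse order and with every sign flipped."""
    flipped = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + flipped + perm[j + 1:]


identity = list(range(1, 11))               # blocks 1, ..., 10
inverted = reverse_segment(identity, 3, 7)  # reverse blocks 4..8
print(inverted)
```

Applying the same reversal a second time restores the original order, which is why a reversal scenario can be read in either direction.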
33
Sorting by reversals
Step 0: 2 -4 -3 5 -8 -7 -6 1
Step 1: 2 3 4 5 -8 -7 -6 1
Step 2: 2 3 4 5 6 7 8 1
Step 3: 2 3 4 5 6 7 8 -1
Step 4: -8 -7 -6 -5 -4 -3 -2 -1
Step 5: 1 2 3 4 5 6 7 8
34
Sorting by reversals: Most parsimonious scenarios
Step 0: 2 -4 -3 5 -8 -7 -6 1
Step 1: 2 3 4 5 -8 -7 -6 1
Step 2: -5 -4 -3 -2 -8 -7 -6 1
Step 3: -5 -4 -3 -2 -1 6 7 8
Step 4: 1 2 3 4 5 6 7 8
The reversal distance is the minimum number of reversals required to transform one gene order into another.
Here, the distance is 4.
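One way to sanity-check a scenario like the one above is to confirm that every consecutive pair of steps differs by exactly one reversal. The sketch below does only that; it does not prove that 4 is the minimum. The helper names `find_reversal` and `check_scenario` are hypothetical and introduced purely for illustration.

```python
def reverse_segment(perm, i, j):
    """Reverse positions i..j (0-based, inclusive) and flip their signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


def find_reversal(before, after):
    """Return the first (i, j) such that reversing i..j turns before into after,
    or None if no single reversal does."""
    n = len(before)
    for i in range(n):
        for j in range(i, n):
            if reverse_segment(before, i, j) == after:
                return (i, j)
    return None


def check_scenario(steps):
    """Return the reversal used between each pair of consecutive steps."""
    return [find_reversal(a, b) for a, b in zip(steps, steps[1:])]


four_step = [
    [2, -4, -3, 5, -8, -7, -6, 1],
    [2, 3, 4, 5, -8, -7, -6, 1],
    [-5, -4, -3, -2, -8, -7, -6, 1],
    [-5, -4, -3, -2, -1, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 7, 8],
]
reversals = check_scenario(four_step)
print(reversals)
```

A `None` entry in the result would mean that the corresponding step is not a single reversal.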
35
Sorting by Reversals: Breakpoint distance
Step 0: 2 -4 -3 5 -8 -7 -6 1
Step 1: 2 3 4 5 -8 -7 -6 1
Step 2: -5 -4 -3 -2 -8 -7 -6 1
Step 3: -5 -4 -3 -2 -1 6 7 8
Step 4: 1 2 3 4 5 6 7 8
36
Sorting by Reversal = breakpoint elimination
37
How many breakpoints can be eliminated by a single reversal?
38
reversal distance >= # breakpoints / 2 = 6/2 = 3
39
This formula vastly underestimates the reversal distance by assuming that breakpoints are never re-used.
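The breakpoint count behind this lower bound can be computed directly. The sketch below assumes the usual convention of framing the signed permutation with 0 on the left and n+1 on the right (as in 0 2 -4 -3 5 -8 -7 -6 1 9) and counting a breakpoint wherever two neighbors x, y do not satisfy y - x = 1; the function names are illustrative.

```python
import math


def count_breakpoints(perm):
    """Count neighbor pairs (x, y) of the framed signed permutation with y - x != 1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if y - x != 1)


def breakpoint_lower_bound(perm):
    """Lower bound on the reversal distance: one reversal removes at most 2 breakpoints."""
    return math.ceil(count_breakpoints(perm) / 2)


start = [2, -4, -3, 5, -8, -7, -6, 1]
print(count_breakpoints(start), breakpoint_lower_bound(start))
```

For the example permutation the bound is 3, while the scenario shown on the slides needs 4 reversals, so the bound is not tight even here.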
42
Breakpoint graph
Reversal Distance Theorem (slightly imprecise version):
reversal distance = number of blocks - number of cycles
0 2 -4 -3 5 -8 -7 -6 1 9
45
Rearrangements in Multi-Chromosomal Genomes
are not limited to reversals…
translocations:
46
fusions and fissions of chromosomes
Reversals on Circular Genomes
reversal
P=(+a-b-c+d) Q=(+a-b-d+c)
[Figure: circular genomes P and Q drawn on blocks a, b, c, d, before and after the reversal]
A reversal replaces two black edges with two other black edges
Not a Reversal
P=(+a-b-c+d) Q=??????
This operation also replaces two black edges with two other black edges. But it is not a reversal.
Fissions
P=(+a-b-c+d) Q=(+a-b)(-c+d)
Fissions split a single chromosome into two; they also replace two black edges with two other black edges.
2-Breaks
A 2-Break replaces any pair of black edges with another pair.
P=(+a-b-c+d) Q=(+a-b-d+c)
2-Break Distance Problem
Given two genomes, find the shortest sequence of 2-Breaks transforming one genome into another.
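To make the 2-break operation concrete, the sketch below encodes a genome as its set of black edges and applies a 2-break to it. The encoding is an assumption made for illustration: every block x has a tail end `(x, "t")` and a head end `(x, "h")`, chromosomes are circular, and a black edge joins the end where one block is left to the end where the next block is entered. The names `black_edges` and `two_break` are hypothetical.

```python
def black_edges(chromosomes):
    """Black edges of a genome given as a list of circular chromosomes,
    each a list of signed block names such as ["+a", "-b"]."""
    edges = set()
    for chrom in chromosomes:
        ends = []
        for block in chrom:
            sign, name = block[0], block[1:]
            entry, leave = ("t", "h") if sign == "+" else ("h", "t")
            ends.append(((name, entry), (name, leave)))
        for k in range(len(ends)):
            leaving = ends[k][1]
            entering = ends[(k + 1) % len(ends)][0]
            edges.add(frozenset((leaving, entering)))
    return edges


def two_break(edges, old1, old2, new1, new2):
    """Replace black edges old1, old2 by new1, new2 on the same four block ends."""
    old1, old2, new1, new2 = map(frozenset, (old1, old2, new1, new2))
    same_ends = (old1 | old2) == (new1 | new2) and len(old1 | old2) == 4
    if old1 not in edges or old2 not in edges or not same_ends:
        raise ValueError("not a valid 2-break")
    if len(new1) != 2 or len(new2) != 2:
        raise ValueError("each new edge must join two distinct ends")
    return (edges - {old1, old2}) | {new1, new2}


P = black_edges([["+a", "-b", "-c", "+d"]])
bt, ch, dh, at = ("b", "t"), ("c", "h"), ("d", "h"), ("a", "t")

# Reconnecting the same two black edges one way gives the reversal ...
after_reversal = two_break(P, (bt, ch), (dh, at), (bt, dh), (ch, at))
# ... and reconnecting them the other way gives the fission.
after_fission = two_break(P, (bt, ch), (dh, at), (bt, at), (ch, dh))

print(after_reversal == black_edges([["+a", "-b", "-d", "+c"]]))
print(after_fission == black_edges([["+a", "-b"], ["-c", "+d"]]))
```

The same pair of black edges can be rejoined in two ways, which is how a single operation covers both the reversal and the fission from the earlier slides.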
Two Genomes as Black-Red and Green-Red Cycles
P=(+a-b-c+d)
Q=(+a+c+b-d)
[Figure: genome P drawn as a black-red cycle and genome Q as a green-red cycle on blocks a, b, c, d]
Common Red Edges
Superimposing...
[Figure: P and Q superimposed on their common red edges a, b, c, d]
Breakpoint Graph
BG(P,Q)
Breakpoint Graph: Red, Black, and Green Matchings
The breakpoint graph is formed by red, black, and green edges.
Black-Red Cycles (red genome)
Black and red edges form genome P.
Green-Red Cycles (green genome)
Green and red edges form genome Q.
Black-Green Cycles (breakpoint graph)
Black and green edges form black-green cycles.
cycle(P,Q): the number of black-green cycles in the breakpoint graph of genomes P and Q
Breakpoint Graph of Two Identical Genomes
The identity (trivial) breakpoint graph is the breakpoint graph of two identical genomes.
Q=(+a-b-c+d)
BG(Q,Q)
Identity Breakpoint Graph Consists of Trivial Black-Green Cycles
The identity breakpoint graph consists of trivial cycles, each formed by one green and one black edge.
# trivial cycles = # genes
Genome Rearrangements Affect Black-Green Cycles
cycle(P,Q) = 2 cycles
cycle(Q,Q) = 4 trivial cycles
Transforming genome P into genome Q corresponds to transforming the black-green cycles in BG(P,Q) into the trivial cycles in BG(Q,Q).
Rearrangements Change Breakpoint Graphs and cycle(P,Q)
BG(P,Q): cycle(P,Q) = 2
BG(P',Q): cycle(P',Q) = 3
BG(Q,Q): cycle(Q,Q) = 4 = #genes (trivial cycles)
Sorting by 2-Breaks
P = Q0 → Q1 → ... → Qd = Q (each arrow is a 2-break)
BG(P,Q) → BG(Q1,Q) → ... → BG(Q,Q)
cycle(P,Q) cycles → ... → cycle(Q,Q) = #genes cycles
The number of black-green cycles increases by #genes - cycle(P,Q).
How much can each 2-break contribute to this increase?
A 2-Break:
adds 2 new black edges and thus creates at most 2 new cycles (containing the two new black edges)
removes 2 black edges and thus destroys at least 1 old cycle (containing the two old edges)
change in the number of cycles ≤ 2 - 1 = 1.
Each 2-Break Increases #Cycles by at Most 1
Any 2-Break increases the number of cycles by at most one.
There Exist 2-Breaks Increasing #Cycles by 1
Any non-trivial cycle can be split into two cycles with a 2-break.
Every sorting by 2-breaks must increase #cycles by #genes - cycle(P,Q).
2-Break Distance
2-Break distance between genomes P and Q:
#genes - cycle(P,Q)
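The formula #genes - cycle(P,Q) can be evaluated mechanically by counting black-green cycles. The following is a sketch under stated assumptions: genomes consist of circular chromosomes over the same blocks, every block x has ends `(x, "t")` and `(x, "h")`, and the function names (`adjacencies`, `count_cycles`, `two_break_distance`) are illustrative rather than taken from any tool mentioned in the slides.

```python
def adjacencies(chromosomes):
    """Adjacency edges of a genome given as circular chromosomes of signed blocks."""
    edges = set()
    for chrom in chromosomes:
        ends = []
        for block in chrom:
            sign, name = block[0], block[1:]
            entry, leave = ("t", "h") if sign == "+" else ("h", "t")
            ends.append(((name, entry), (name, leave)))
        for k in range(len(ends)):
            edges.add(frozenset((ends[k][1], ends[(k + 1) % len(ends)][0])))
    return edges


def count_cycles(p_edges, q_edges):
    """Number of alternating black-green cycles in the breakpoint graph."""
    black, green = {}, {}
    for table, edges in ((black, p_edges), (green, q_edges)):
        for edge in edges:
            u, v = tuple(edge)
            table[u], table[v] = v, u
    seen, cycles = set(), 0
    for start in black:
        if start in seen:
            continue
        cycles += 1
        node = start
        while node not in seen:          # walk black edge, then green edge
            seen.add(node)
            mate = black[node]
            seen.add(mate)
            node = green[mate]
    return cycles


def two_break_distance(p, q):
    """#genes - cycle(P,Q) for two circular genomes over the same blocks."""
    p_edges, q_edges = adjacencies(p), adjacencies(q)
    return len(p_edges) - count_cycles(p_edges, q_edges)


P = [["+a", "-b", "-c", "+d"]]
Q = [["+a", "+c", "+b", "-d"]]
print(count_cycles(adjacencies(P), adjacencies(Q)), two_break_distance(P, Q))
```

For the genomes P and Q used on the slides this reports 2 cycles and a distance of 2, matching cycle(P,Q) = 2 with 4 genes.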
79
Human-mouse breakpoint graph
Human and mouse genomes can be viewed as strings in the alphabet of 280 synteny blocks (at least 0.5 million nucleotides in length)
The breakpoint graph on these blocks has 35 cycles
2-Break distance between HUMAN and MOUSE:
#genes - cycle(HUMAN,MOUSE) = 280 - 35 = 245
109
GRIMM Web Server: Multichromosomal rearrangements
114
What are the similarity blocks and how to find them?
Unknown ancestor, ~80 million years ago
Mouse (X chrom.)
Human (X chrom.)
Genome rearrangements
115
Finding Synteny Blocks
25,839 anchors. Anchors enlarged for visibility; apparent density may be an illusion.
First, separate noise from synteny blocks
116
GRIMM-Synteny on X chromosome (a)
Macro/Micro-rearrangements
25,839 anchors. Anchors enlarged for visibility; apparent density may be an illusion.
First, separate noise from synteny blocks, and then separate microrearrangements (inside synteny blocks) from macrorearrangements (of whole blocks)
117
GRIMM-Synteny: Blowup of a synteny block
A single synteny block with 1114 anchors and 85 micro-rearrangements.
118
GRIMM-Synteny on X chromosome (a)
From anchors to synteny blocks
119
Synteny Block Generation
GRIMM-Synteny(Genome, w, ∆)
w: gap size
∆: minimum synteny block size
Represent Genome in 2-D and form a graph whose vertex set is the set of genes (anchors) in 2-D
Connect two vertices by an edge if the 2-D distance between them is < w. The connected components in the resulting graph define synteny blocks
Delete small synteny blocks (length < ∆)
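The three steps above can be sketched as a small clustering routine. This is a simplified illustration, not the GRIMM-Synteny implementation: it assumes anchors are given as (x, y) coordinate pairs, uses the Manhattan distance as the "2-D distance", measures block length as the span along the first genome, and uses a quadratic pairwise scan. All names are hypothetical.

```python
def grimm_synteny_sketch(anchors, w, min_size):
    """Cluster anchors whose 2-D distance is < w and drop short clusters."""
    parent = list(range(len(anchors)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect two anchors by an edge if they are closer than the gap size w.
    for i, (x1, y1) in enumerate(anchors):
        for j in range(i + 1, len(anchors)):
            x2, y2 = anchors[j]
            if abs(x1 - x2) + abs(y1 - y2) < w:   # Manhattan distance (assumption)
                parent[find(i)] = find(j)

    # Connected components define candidate synteny blocks.
    components = {}
    for i, point in enumerate(anchors):
        components.setdefault(find(i), []).append(point)

    # Delete blocks whose span along the first genome is below min_size.
    blocks = []
    for points in components.values():
        span = max(p[0] for p in points) - min(p[0] for p in points)
        if span >= min_size:
            blocks.append(sorted(points))
    return sorted(blocks)


anchors = [(0, 0), (5, 5), (10, 10), (100, 200), (105, 195), (110, 190), (500, 50)]
blocks = grimm_synteny_sketch(anchors, w=20, min_size=8)
print(blocks)
```

In this toy input the isolated anchor at (500, 50) is discarded as noise, and the second block runs in the opposite direction in the second genome, which is how an inverted block appears in a dot plot.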
120
Yet Another Synteny Block Generation
ST-Synteny(Genome, w, ∆)
w: gap size
∆: minimum synteny block size
Define each gene (anchor) in Genome as a separate block and iteratively amalgamate the resulting blocks
Amalgamate two adjacent blocks if they contain two genes that are separated by less than w genes in the other genome.
Delete any short block containing < ∆ elements
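The amalgamation procedure is described only loosely here, so the sketch below is one possible reading rather than Sankoff and Trinh's actual code. It assumes genes are numbered 0..n-1 in the order of the first genome, that `pos2[g]` is the position of gene g in the other genome, and that "separated by less than w genes" means fewer than w genes lie between the two genes in the other genome. The function name is hypothetical.

```python
def st_synteny_sketch(pos2, w, min_size):
    """Iteratively merge adjacent blocks, then drop blocks with < min_size genes."""
    blocks = [[g] for g in range(len(pos2))]   # every gene starts as its own block
    merged = True
    while merged:
        merged = False
        for k in range(len(blocks) - 1):
            left, right = blocks[k], blocks[k + 1]
            # Merge if some gene pair across the two blocks is close in the other genome.
            if any(abs(pos2[a] - pos2[b]) - 1 < w for a in left for b in right):
                blocks[k:k + 2] = [left + right]
                merged = True
                break
    return [block for block in blocks if len(block) >= min_size]


pos2 = [0, 1, 2, 7, 6, 5, 3]
print(st_synteny_sketch(pos2, w=1, min_size=2))
print(st_synteny_sketch(pos2, w=2, min_size=2))
```

Under this reading, raising w from 1 to 2 collapses the whole toy genome into a single block, which gives a feel for how sensitive the output can be to the gap parameter.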
121
Two Algorithms: Which One is “Better”?
GRIMM-Synteny(Genome, w, ∆)
Represent Genome in 2-D and form a graph whose vertex set is the set of genes in 2-D
Connect two vertices by an edge if the 2-D distance between them is < w. The connected components in the resulting graph define synteny blocks
Delete small synteny blocks (length < ∆)
ST-Synteny(Genome, w, ∆)
Define each gene in Genome as a separate block and iteratively amalgamate the resulting blocks
Amalgamate two adjacent blocks if they contain two genes that are separated by less than w genes in the other genome.
Delete any short block containing < ∆ elements
122
GRIMM-Synteny on X chromosome (b)
From anchors to synteny blocks
124
GRIMM-Synteny on X chromosome (d)
From anchors to synteny blocks
11 synteny blocks.
176 micro-rearrangements within these blocks.
125
GRIMM-Synteny on X chromosome (e)
From anchors to synteny blocks
130
GRIMM-Synteny: Human-mouse breakpoint graph
131
Evidence for fragile regions (rearrangement hotspots) in mammalian evolution
132
GRIMM on X chromosome
GRIMM determines that the minimum number of rearrangements is 7 (naked eye gives 6).
There are numerous 7-step scenarios.
The true scenario may have more than 7 steps.
133
GRIMM on X chromosome: breakpoint re-uses
134
GRIMM on ALL chromosomes
GRIMM determines that the minimum number of rearrangements is 245 (naked eye gives 130).
There are numerous 245-step scenarios.
The true scenario may have more than 245 steps.
136
Are There any Rearrangement Hotspots in Human Genome?
Theorem. Yes
Proof:
• Every rearrangement creates up to 2 breakpoints
• If there were no breakpoint re-use, then after k rearrangements we would have 2k breakpoints
• Human-mouse comparison reveals 2k ≈ 260 breakpoints
• If there were no breakpoint re-use, how many rearrangements happened on the human-mouse evolutionary path? #rearrangements = #breakpoints/2 = 260/2 = 130
Proof continues:
• The Hannenhalli-Pevzner theorem implies that there were at least 245 rearrangements on the human-mouse evolutionary path
• Is 245 larger than 130?
• Yes, 245 >> 130
• There was vast breakpoint re-use, which is an argument against the random breakage model (according to scan statistics).
149
Human/mouse comparison reveals:
the size of breakpoint regions (regions between consecutive synteny blocks) is small, accounting for ~5% of the genome
breakpoint re-use is very high, approx. 1.9 uses per breakpoint region on average
Mouse genome paper (Nature, 2002): The analysis suggests that chromosomal breaks may have a tendency to reoccur in certain regions.
High breakpoint re-use provides evidence against the Random Breakage Model
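The figure of roughly 1.9 uses per breakpoint region can be recovered from the numbers quoted in the proof. The sketch below assumes that every rearrangement uses two breakpoint regions, so that the average number of uses per region is 2 × #rearrangements / #breakpoint regions; this formula is stated here as an assumption because it reproduces the quoted value, and the function name is illustrative.

```python
def breakpoint_reuse_rate(num_rearrangements, num_breakpoint_regions):
    """Average number of times a breakpoint region is used,
    assuming each rearrangement uses two breakpoint regions."""
    return 2 * num_rearrangements / num_breakpoint_regions


breakpoint_regions = 260      # human-mouse breakpoint regions (from the slides)
rearrangements = 245          # lower bound on the number of rearrangements

no_reuse_estimate = breakpoint_regions // 2
rate = breakpoint_reuse_rate(rearrangements, breakpoint_regions)
print(no_reuse_estimate, round(rate, 2))
```

A rate of exactly 1.0 would correspond to no re-use at all, which is the case with 130 rearrangements and 260 breakpoint regions.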
155
Random Breakage Theory re-re-visited
Sankoff and Trinh, 2004 refuted this conclusion and suggested that RBM is correct.
158
Random Breakage Theory re-re-visited
If you are not criticized, you may not be doing much
Donald Rumsfeld
But how can one criticize a THEOREM???
160
Are There any Rearrangement Hotspots in Human Genome?
Theorem. Yes!
Proof:
• ………
• ………
• Is 245 larger than 130?
• Yes, 245 >> 130
• ………
Sankoff did not question the validity of the proof; he questioned the validity of the numbers.
The computed number of breakpoint regions (260) and the rearrangement distance (245) are parameter-dependent and may be wrong.
162
Sankoff-Trinh Argument
Sankoff & Trinh designed a simulation where a series of random rearrangements created the appearance of rearrangement hotspots
How can it be???
S&T emphasized the importance of synteny block generation and parameter choice
S&T argued that the breakpoint re-use we observed is caused by artifacts of parameter-dependent synteny block generation and micro-rearrangements
165
Walking in Sankoff-Trinh Shoes
Sankoff and Trinh used a simple synteny block generation algorithm (ST-Synteny) and claimed that it is similar to GRIMM-Synteny
ST-Synteny indeed appears to be similar to GRIMM-Synteny
We reproduced Sankoff-Trinh’s simulation and their ST-Synteny algorithm
168
Sankoff-Trinh Synteny Block Generation
ST-Synteny(Genome, w, ∆)
w: gap size
∆: minimum synteny block size
Define each gene (anchor) in Genome as a separate block and iteratively amalgamate the resulting blocks
Amalgamate two adjacent blocks if they contain two genes that are separated by less than w genes in the other genome.
Delete any short block containing < ∆ elements
170
GRIMM-Synteny Block Generation
GRIMM-Synteny(Genome, w, ∆)
w: gap size
∆: minimum synteny block size
Represent Genome in 2-D and form a graph whose vertex set is the set of genes in 2-D
Connect two vertices by an edge if the 2-D distance between them is < w. The connected components in the resulting graph define synteny blocks
Delete small synteny blocks (length < ∆)
171
Comparing Two Algorithms
GRIMM-Synteny(Genome, w, ∆)
Represent Genome in 2-D and form a graph whose vertex set is the set of genes in 2-D
Connect two vertices by an edge if the 2-D distance between them is < w. The connected components in the resulting graph define synteny blocks
Delete small synteny blocks (length < ∆)
ST-Synteny(Genome, w, ∆)
Define each gene in Genome as a separate block and iteratively amalgamate the resulting blocks
Amalgamate two adjacent blocks if they contain two genes that are separated by less than w genes in the other genome.
Delete any short block containing < ∆ elements
172
The algorithms look very similar but do they produce similar results?
178
ST-Synteny Flaw I
Permutation: -3 2 -1 -5 4
[Figure: synteny blocks produced by GRIMM-Synteny and by ST-Synteny for this permutation; axes: hypothetical genome 1, hypothetical genome 2]
192
ST-Synteny Flaw II
GRIMM-Synteny vs. ST-Synteny
Number of blocks: 44, 10
Total block length (Mb): 95, 140
Breakpoint regions (%): 38, 9
Breakpoint re-use: 1.97, 1.64
Changing parameters does not help
ST-Synteny
193
ST-Synteny Flaw III
ST-Synteny is not even symmetric, i.e., the number of synteny blocks between human and mouse may differ from the number of synteny blocks between mouse and human
194
ST-Synteny Results in Much Higher Breakpoint Reuse than GRIMM-Synteny
An artifact of ST-Synteny rather than any argument against the Fragile Breakage Model
201
Random Breakage Model re-re-re-visited
Sankoff & Trinh, 2004 emphasized the importance of accurate synteny block generation but fell victim to their own flawed ST-Synteny algorithm
ST-Synteny was never applied to real data
Peng et al., 2006 (PLOS Computational Biology): If Sankoff & Trinh fixed their ST-Synteny algorithm, they would confirm rather than reject Pevzner-Tesler’s Fragile Breakage Model
Sankoff, 2006 (PLOS Computational Biology): Not only did we foist a hastily conceived and incorrectly
executed simulation on an overworked RECOMB conference program committee, but worse—nostra maxima culpa—we obliged a team of high-powered researchers to clean up after us!
“nostra maxima culpa” = It’s all our fault (Latin)
Kikuta et al., Genome Res. 2007: “... the Nadeau and Taylor hypothesis is not possible for the explanation of synteny in rat.”
All Recent Studies Support FBM
205
Where are the rearrangement hotspots located?
We demonstrated the existence of rearrangement hotspots but did not answer the question where they are.
We presented the preliminary answer to the question in Murphy et al., Science, 2005 (joint work with “Mammalian Genomic Architectures” consortium).
Many groups are currently trying to identify all fragile regions in mammalian genomes (Alekseyev and Pevzner, Genome Biology, 2010)
Turnover Fragile Breakage Model
Recent studies reveal evidence for the “birth and death” of the fragile regions, implying that they move to different locations in different lineages.
This discovery resulted in the Turnover Fragile Breakage Model (TFBM), which accounts for the “birth and death” of the fragile regions and sheds light on a possible relationship between rearrangements and Matching Segmental Duplications.
TFBM points to locations of the currently fragile regions in the human genome.
Tests vs. Models
Why did biologists believe in RBM for 20 years? Because RBM implies the exponential distribution of the sizes of the blocks observed in real genomes.
A flaw in this logic: RBM is not the only model that complies with the “exponential distribution” test.
Why was RBM refuted? Because RBM does not comply with the “breakpoint reuse” test: RBM implies low reuse but real genomes reveal high reuse.
FBM complies with both the “exponential distribution” and “breakpoint reuse” tests.
But is there a test that both RBM and FBM fail?
Model   Exponential distribution   Breakpoint reuse
RBM     YES                        NO
FBM     YES                        YES
Tests vs. Models
RBM and FBM fail the Multispecies Breakpoint Reuse (MBR) test.
Model   Exponential distribution   Breakpoint reuse   MBR
RBM     YES                        NO                 NO
FBM     YES                        YES                NO
Tests vs. Models
TFBM passes all three tests.
Model   Exponential distribution   Breakpoint reuse   MBR
RBM     YES                        NO                 NO
FBM     YES                        YES                NO
TFBM    YES                        YES                YES
Implications of TFBM
Where are the (currently) Fragile Regions in the Human genome?
Prediction Power of TFBM
Can we determine currently active regions in the human genome H from comparison with other mammalian genomes?
RBM provides no clue.
FBM suggests considering the breakpoints between H and any other genome.
TFBM suggests considering the closest genome, such as the macaque-human ancestor QH. Breakpoints in G(QH,H) are likely to be reused in future rearrangements of H.
Validation of Predictions for the Macaque-Human Ancestor
Prediction of fragile regions on (QH,H) based on the mouse, rat, and dog genomes:
Using mouse genome M as a proxy: accuracy 34 / 552 ≈ 6%
Using mouse-rat-dog ancestor genome MRD: accuracy 18 / 162 ≈ 11%
Using macaque genome Q: accuracy 10 / 68 ≈ 16% (using synteny blocks larger than 500K)
Putative Active Fragile Regions in the Human Genome
Unsolved Mystery: What Causes Fragility?
Zhao and Bourque, Genome Res. 2009, suggested that fragility is promoted by Matching Segmental Duplications (MSDs), pairs of long similar regions located within breakpoint regions flanking a rearrangement.
TFBM is consistent with this hypothesis since the similarity between MSDs deteriorates with time, implying that MSDs are also subject to a “birth and death” process.
222
Reconstructing Genomic Architecture of Tumor Genomes
1) Pieces of tumor genome: clones (100-250kb).
2) Sequence ends of clones (500bp).
3) Map end sequences to human genome.
Each clone corresponds to a pair of end sequences (ES pair) (x,y).
[Figure: a clone from the tumor DNA with its end sequences x and y mapped to the human DNA]
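Once ES pairs are mapped, a natural first pass is to separate pairs that look like an unrearranged clone from pairs that do not. The sketch below rests on an assumption that is not stated on the slides: a clone from an unrearranged region should have its two ends mapped 100-250kb apart, the clone size range given above. It also assumes both ends map to a single coordinate axis and ignores orientation; the function name and sample coordinates are hypothetical.

```python
def classify_es_pairs(es_pairs, min_len=100_000, max_len=250_000):
    """Split ES pairs (x, y) into those whose mapped distance fits the clone
    size range and those that may span a rearrangement breakpoint."""
    consistent, suspicious = [], []
    for x, y in es_pairs:
        if min_len <= abs(y - x) <= max_len:
            consistent.append((x, y))
        else:
            suspicious.append((x, y))
    return consistent, suspicious


# Hypothetical mapped positions (in base pairs) on the human genome.
pairs = [
    (1_000_000, 1_150_000),   # ends 150kb apart
    (2_000_000, 2_240_000),   # ends 240kb apart
    (3_000_000, 9_500_000),   # ends 6.5Mb apart
]
consistent, suspicious = classify_es_pairs(pairs)
print(consistent, suspicious)
```

Only the pairs in the second list carry information about rearrangements, so they are the points of interest in a 2D plot of ES pairs.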
223
Tumor Genome Reconstruction Puzzle
Human genome (known)
Tumor genome (unknown)
Unknown sequence of rearrangements
Map ES pairs to human genome.
Location of ES pairs in human genome (known)
Reconstruct tumor genome
[Figure: human genome blocks A to E and the rearranged tumor genome (including inverted blocks -C and -D), with mapped clone end positions x1 to x5 and y1 to y5]
224
Tumor Genome Reconstruction
[Figure: tumor genome blocks (including inverted blocks -C and -D) plotted against human genome blocks A to E, with ES pairs (x1,y1), (x2,y2), (x3,y3), (x4,y4)]
2D Representation of ESP Data
• Each point is an ES pair.
• Can we reconstruct the tumor genome from the positions of the ES pairs?
[Figure: ESP plot with the human genome, blocks A to E, on both axes; points (x1,y1) to (x4,y4)]
Reconstructed Tumor Genome
[Figure: reconstructed tumor genome built from blocks A, B, -C, -D, E]
237
Real data are noisy and incomplete!
238
Breast Cancer MCF7 Cell Line
Human chromosomes, MCF7 chromosomes
5 inversions
15 translocations
Raphael et al. 2003.
241
Complex Tumor Genomes
247
Sequencing Tumor Clones Confirms Complex Mosaic Structure
Volik et al., Decoding the fine-scale structure of breast cancer genome and transcriptome: Implications for Tumor Genome Project, Genome Res., 2006
248
Hampton et al., A sequence-level map of chromosome breakpoints yields insights into the evolution of cancer genome. Genome Res, 2008 (157 breakpoints found using next generation sequencing)