Identification of large-scale genomic rearrangements
between closely related organisms
Bob Mau1,2, Aaron Darling1,3, Fred Blattner4,5, Nicole Perna1,5
Departments of Animal Health and Biomedical Sciences1, Oncology2, Computer Science3, Laboratory of Genetics4, Genome Center5
University of Wisconsin – Madison
The Amazing Variety of Diseases caused by E. coli strains
in Bacterial Pathogenesis: A Molecular Approach
“… is due to the fact that different strains have acquired different sets of virulence genes. Most strains of E. coli are avirulent because they lack these virulence genes. E. coli is an excellent example of the maxim that it is the set of virulence genes carried by an organism that makes it a pathogen, not its species or genus designation.”
Categories of Bacterial Genome Evolution
• Local
    Single Base Mutations
    Indels (small insertions and deletions)
• Global (Large-scale) Rearrangements
    Inversions, translocations, inverted translocations
• Gene Gain and Loss
    Horizontal or Lateral Transfer: Transformation, Transduction, and Conjugation
    Phage Integration
    Mobile Elements: Transposons and Insertion Sequences
    Gene Duplication (mediated by mobile elements)
From the two E. coli genomes sequenced at the Blattner lab, we’ve identified:
• ~3900 genes common to both K-12 and O157:H7
• 528 genes unique to K-12
• 1387 genes unique to O157:H7
• 40% of these genes are of unknown function.
The primary reasons for these wholesale differences are lateral transfer, phage integration, and one whopper of a duplication.
Strategy of Global Alignment of Two Highly Related Genomes:
[Diagram: genomes K and O, partially sorted suffix arrays]
STEP 1
Quickly find all 16-mer matches between genomes:
(K1,O1), …, (Ki,Oi), …, (Kn,On)
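Step 1 can be illustrated with a minimal sketch. Note the swap: the slides use partially sorted suffix arrays, whereas this sketch uses a plain dictionary index, which is simpler to read but far less space efficient. The function name `kmer_matches` is hypothetical, and only forward-strand matches are reported.

```python
from collections import defaultdict


def kmer_matches(k_seq, o_seq, k=16):
    """Return sorted (i, j) pairs with k_seq[i:i+k] == o_seq[j:j+k].

    Illustrative only: a dictionary index stands in for the suffix
    arrays named on the slide, and reverse-complement matches are ignored.
    """
    # Index every k-mer of the first genome by its start position.
    index = defaultdict(list)
    for i in range(len(k_seq) - k + 1):
        index[k_seq[i:i + k]].append(i)

    # Look up every k-mer of the second genome in that index.
    pairs = []
    for j in range(len(o_seq) - k + 1):
        for i in index.get(o_seq[j:j + k], ()):
            pairs.append((i, j))
    pairs.sort()
    return pairs


if __name__ == "__main__":
    # Toy sequences with k=4 so the output stays readable.
    print(kmer_matches("ACGTACGT", "TTACGTAA", k=4))
```

Each returned pair corresponds to one (Ki,Oi) entry in the list above.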
STEP 2
Collapse consecutive pairs to form a collection of maximal exact matches (MEMs).
Use a longest increasing subsequence (LIS) algorithm to select a maximal collinear subset of the matches.
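A rough sketch of Step 2, under stated assumptions: seeds are (i, j) start positions of k-mer matches, a MEM is stored as a hypothetical (start_in_K, start_in_O, length) tuple, and "collinear" is taken to mean strictly increasing start positions in both genomes. The LIS here maximizes the number of MEMs kept, not their total length, and it does not check for overlaps; the slides do not say which criterion the authors used.

```python
from bisect import bisect_left


def collapse_seeds(pairs, k=16):
    """Merge diagonal runs (i, j), (i+1, j+1), ... into (i, j, length) MEMs."""
    seeds = set(pairs)
    mems = []
    for i, j in sorted(seeds):
        if (i - 1, j - 1) in seeds:
            continue  # this seed continues a run that started earlier
        extra = 0
        while (i + extra + 1, j + extra + 1) in seeds:
            extra += 1
        mems.append((i, j, k + extra))
    return mems


def collinear_subset(mems):
    """Largest subset of MEMs whose starts increase in both genomes (LIS)."""
    # Sort by first-genome start; ties are broken by descending second-genome
    # start so two MEMs with the same first start are never both selected.
    ordered = sorted(mems, key=lambda m: (m[0], -m[1]))
    tails = []      # tails[r]: smallest second-genome start ending a chain of length r+1
    tail_idx = []   # index into `ordered` of the MEM stored in tails[r]
    prev = [None] * len(ordered)
    for idx, (_, j, _) in enumerate(ordered):
        r = bisect_left(tails, j)
        if r == len(tails):
            tails.append(j)
            tail_idx.append(idx)
        else:
            tails[r] = j
            tail_idx[r] = idx
        prev[idx] = tail_idx[r - 1] if r > 0 else None

    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        chain.append(ordered[idx])
        idx = prev[idx]
    return chain[::-1]
```

For example, `collinear_subset([(0, 0, 20), (30, 100, 18), (50, 40, 25), (90, 70, 16)])` drops the MEM at (30, 100), which is out of order with respect to its neighbors.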
STEP 3
Extend across intervening regions via anchored alignments from individual MEM endpoints
[Diagram labels: Unique Insert, Substitution]
K-12 vs O157:H7 MEM Stats
• 43,235 total MEMs (24 bp)
• 31,640 form the maximal collinear subset
• The largest exact match is 2,632 bases
• 62 MEMs exceed 1,000 bp
• Over 11,000 exceed 100 bp
• 18,212 single base differences (SNPs)
• Resulted in a segmentation of O157:H7 into 357 intervals of backbone or unique insert
A Three-way Genomic Comparison: Parkhill et al., Nature
E. coli K-12 MG1655
S. Typhi CT18
S. Typhimurium LT2
The “Traditional” Way to View MEMs
{(a0,b0),(a1,b1),…, (aK,bK)} for K+1 genomes
For the reference genome G0, a0 < b0 by convention.
For the non-reference genomes, ak < bk means the match has the same orientation as G0; ak > bk means the match occurs on the opposite strand (reverse complement).
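The convention above can be made concrete with a small sketch. It assumes a MEM is stored as a list of (a, b) coordinate pairs with the reference genome first, and that the match length is |b - a|; whether the end coordinate is inclusive is not stated on the slide, so that detail is an assumption. The function name `describe_mem` is hypothetical.

```python
def describe_mem(intervals):
    """Return (length, strands) for a MEM given as [(a0, b0), (a1, b1), ...].

    strands[k-1] is "+" when genome k matches in the same orientation as
    the reference G0 (ak < bk) and "-" when it matches on the opposite
    strand (ak > bk).
    """
    a0, b0 = intervals[0]
    if not a0 < b0:
        raise ValueError("reference interval must satisfy a0 < b0")
    length = b0 - a0

    strands = []
    for a, b in intervals[1:]:
        if abs(b - a) != length:
            raise ValueError("an exact match has the same length in every genome")
        strands.append("+" if a < b else "-")
    return length, strands


if __name__ == "__main__":
    # Genome 1 matches in the forward orientation, genome 2 on the reverse strand.
    print(describe_mem([(100, 150), (220, 270), (900, 850)]))
```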
A novel approach, wherein:
• Extensibility: works just as well for N as it does for 2 genomes, provided there is sufficient sequence similarity.
• Automatically identifies inversions, translocations, and inverted translocations
• Determines a maximal collinear subset within each locally collinear region, without recourse to an LIS step
• Very space efficient and very fast
25 January 2001
Nature 409, 529-533 (2001) © Macmillan Publishers Ltd.
Genome sequence of enterohaemorrhagic Escherichia coli O157:H7
NICOLE T. PERNA, GUY PLUNKETT III, VALERIE BURLAND, BOB MAU, JEREMY D. GLASNER, DEBRA J. ROSE, GEORGE F. MAYHEW, PETER S. EVANS, JASON GREGOR, HEATHER A. KIRKPATRICK, GYÖRGY PÓSFAI, JEREMIAH HACKETT, SARA KLINK, ADAM BOUTIN, YING SHAO, LESLIE MILLER, ERIK J. GROTBECK, N. WAYNE DAVIS, ALEX LIM, EILEEN T. DIMALANTA, KONSTANTINOS D. POTAMOUSIS, JENNIFER APODACA, THOMAS S. ANANTHARAMAN, JIEYI LIN, GALEX YEN, DAVID C. SCHWARTZ, RODNEY A. WELCH & FREDERICK R. BLATTNER
The bacterium Escherichia coli O157:H7 is a worldwide threat to public health and has been implicated in many outbreaks of haemorrhagic colitis, some of which included fatalities caused by haemolytic uraemic syndrome. Close to 75,000 cases of O157:H7 infection are now estimated to occur annually in the United States. The severity of disease, the lack of effective treatment and the potential for large-scale outbreaks from contaminated food supplies have propelled intensive research on the pathogenesis and detection of E. coli O157:H7 (ref. 4). Here we have sequenced the genome of E. coli O157:H7 to identify candidate genes responsible for pathogenesis, to develop better methods of strain detection and to advance our understanding of the evolution of E. coli, through comparison with the genome of the non-pathogenic laboratory strain E. coli K-12 (ref. 5). We find that lateral gene transfer is far more extensive than previously anticipated. In fact, 1,387 new genes encoded in strain-specific clusters of diverse sizes were found in O157:H7. These include candidate virulence factors, alternative metabolic capacities, several prophages and other new functions, all of which could be targets for surveillance.
Nature © Macmillan Publishers Ltd 2001 Registered No. 785998 England.
Multiple Oriented Offset
For each non-reference genome, determine the polarity with respect to G0:
$p_k = \pm 1$
As well as the offset:
$d_k = a_0 - p_k \cdot a_k$
The Multiple Oriented Offset is the N-vector:
$\mathrm{Moo} = \{(d_1, p_1), \ldots, (d_N, p_N)\}$
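A small sketch of the Multiple Oriented Offset follows. It rests on two assumptions that the slide does not spell out: polarity is +1 when ak < bk and -1 otherwise, and the offset is d_k = a_0 - p_k * a_k (the order of the two terms is a guess from the extracted formula). The function name `moo` is hypothetical.

```python
def moo(intervals):
    """Multiple Oriented Offset of a MEM given as [(a0, b0), (a1, b1), ...].

    Returns a tuple of (d_k, p_k) pairs, one per non-reference genome.
    Assumed convention: p_k = +1 if ak < bk else -1, d_k = a0 - p_k * ak.
    """
    a0, _ = intervals[0]
    result = []
    for a_k, b_k in intervals[1:]:
        p_k = 1 if a_k < b_k else -1
        d_k = a0 - p_k * a_k
        result.append((d_k, p_k))
    return tuple(result)


if __name__ == "__main__":
    # Two MEMs from the same collinear region: genome 1 is forward,
    # genome 2 is reverse-complemented relative to the reference.
    first = [(100, 150), (220, 270), (900, 850)]
    second = [(400, 430), (520, 550), (600, 570)]
    print(moo(first))
    print(moo(second))
```

Under these assumptions, two MEMs that lie on the same diagonal in every genome get identical Moo values, which is what makes the Moo usable as a grouping key. For a reverse-strand match the quantity a_0 + a_k stays constant along the diagonal, which the polarity term accounts for.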
Canonical MEM Equivalence Classes
By appending the interval in reference genome coordinates: (a0, b0) to the Moo, the MEM is completely specified.
We aggregate MEMs by their generalized offset,
inducing a partition on the set of MEMs. This defines a CMemEC:
$\{\mathrm{Moo}, \{(a_0^1, b_0^1), (a_0^2, b_0^2), \ldots, (a_0^M, b_0^M)\}\}$
Generalized offset:
$\mathrm{G.O.} = \sum_{k=1}^{K} |d_k| = \sum_{k=1}^{K} |a_0 - p_k \cdot a_k|$
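The aggregation into equivalence classes can be sketched as a dictionary keyed by the Moo, with the reference-genome intervals as the values. This is an illustration under assumptions, not the authors' implementation: the offset convention d_k = a_0 - p_k * a_k, the helper names `moo`, `generalized_offset` and `group_into_classes`, and the choice to key on the full Moo vector are all hypothetical.

```python
from collections import defaultdict


def moo(intervals):
    """(d_k, p_k) for each non-reference genome; assumes d_k = a0 - p_k * ak."""
    a0, _ = intervals[0]
    return tuple(
        (a0 - (1 if a < b else -1) * a, 1 if a < b else -1)
        for a, b in intervals[1:]
    )


def generalized_offset(intervals):
    """Sum of |d_k| over the non-reference genomes."""
    return sum(abs(d) for d, _ in moo(intervals))


def group_into_classes(mems):
    """Map each Moo to the sorted reference intervals (a0, b0) that share it."""
    classes = defaultdict(list)
    for intervals in mems:
        classes[moo(intervals)].append(intervals[0])
    return {key: sorted(refs) for key, refs in classes.items()}


if __name__ == "__main__":
    mems = [
        [(100, 150), (220, 270)],   # forward match
        [(400, 430), (520, 550)],   # same diagonal as the first
        [(700, 760), (900, 840)],   # reverse-strand match
    ]
    for key, refs in group_into_classes(mems).items():
        print(key, refs)
```

Each dictionary entry has the shape of the class notation above: one Moo paired with the list of reference intervals that carry it.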
In this example, it’s abundantly clear from the plot that there are two large rearrangements, one around the origin and the other about the terminus of replication.
We could probably get by with modest extensions of existing methods (MUMmer or our earlier algorithm) to account for the large amounts of laterally transferred lineage-specific sequence.
But, hey, biology ain’t easy ...
Figure 1: Simplest Block and Strip Diagram
[Diagram: reference strip G0 with blocks 1 2 3 4 5 6 7, and strips G1 to G4 (Strip 1 to Strip 4). Signed block orders shown: 1 2 3 4 5 6 7; 1 -7 5 6 4 3 2; -3 -2 -1 -7 5 6 4; -7 4 5 6 -3 -2 -1; 1 -3 -2 4 5 6 7. Axis marks: Cut pt., Terminus, Origin.]
Figure 2: Example with Variable Block Lengths
[Diagram: reference G0 and Genomes 1 to 5 (G1 to G5). Signed block orders shown: 1 2 -3 4 -6 -5 7; 1 2 3 4 6 -5 7; 1 -3 -2 4 5 6 7; 1 2 -3 4 5 -6 7; 1 2 3 4 5 6 7.]
Figure 1: Large-scale Genomic Rearrangements
[Diagram: Genomes 1 to 5 shown with a species tree rooted at the MRCA. Axis marks: Zero Pt., Terminus, Origin.]
Figure 3: Segmentation Graph S(G0)
Look at the picture and
Sorted Merge Lists of Six Enterobacterial strains
Six SMLs of bimers, one for each genome. A bimer is the lexicographically lesser of an n-mer (we use n=23) and its reverse complement, together with an orientation flag.
Species               Serotype      Strains
Escherichia coli      K-12          MG1655, W3110
Escherichia coli      O157:H7       EDL933, Sakai
Salmonella enterica   Typhi         CT18
Salmonella enterica   Typhimurium   LT2
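The bimer definition translates almost directly into code. The sketch below is illustrative only: the names `bimer` and `build_sml` are hypothetical, the orientation flag is represented as a boolean (True when the n-mer as read is the lesser form), and a plain sorted Python list stands in for whatever compact structure the authors used. Only the letters A, C, G and T are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def bimer(nmer):
    """Return (lesser_form, is_forward) for an n-mer.

    lesser_form is the lexicographically lesser of the n-mer and its
    reverse complement; is_forward records which of the two it was.
    """
    rc = reverse_complement(nmer)
    if nmer <= rc:
        return nmer, True
    return rc, False


def build_sml(seq, n=23):
    """Sorted list of (bimer, position, is_forward) for every n-mer in seq."""
    entries = []
    for pos in range(len(seq) - n + 1):
        word, is_forward = bimer(seq[pos:pos + n])
        entries.append((word, pos, is_forward))
    entries.sort()
    return entries


if __name__ == "__main__":
    # n=2 only to keep the toy output short; the slides use n=23.
    for entry in build_sml("ACGT", n=2):
        print(entry)
```

Because both strands of a match reduce to the same bimer, matches in either orientation end up adjacent after sorting. With an odd n such as 23, no n-mer can equal its own reverse complement, so the orientation flag is never ambiguous.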
A Transformation of CO92 to KIM by Inversions Near the Origin
CO92: 0 12 3 10 7 1 5 4 2 11 6 9 8 0
      C20 C21 C22 C22.5 C23 C24 C25 C1 C2 C3 C4 C5 C6 C7
KIM:  0 1 2 3 4 5 6 7 8 9 10 11 12 0
      K5 K4 K3 K2 K1 K25 K24 K23 K22 K21 K20.5 K20 K19 K18
0 8 9 6 11 2 4 5 1 7 10 3 12 0
0 1 5 4 2 11 6 9 8 7 10 3 12 0
0 1 11 2 4 5 6 9 8 7 10 3 12 0
0 1 3 10 7 8 9 6 5 4 2 11 12 0
0 1 3 2 4 5 6 9 8 7 10 11 12 0
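The rows listed above can be checked mechanically: each one differs from the previous row by a single inversion of a contiguous run of blocks. The sketch below treats the rows as unsigned block orders, since the listing shows no signs, and uses the hypothetical helpers `invert` and `find_inversion`. It only replays the rows given in the slides; it does not search for a shortest inversion scenario.

```python
def invert(order, i, j):
    """Return a copy of order with the segment order[i..j] (inclusive) reversed."""
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]


def find_inversion(before, after):
    """Return (i, j) if one inversion turns before into after, else None."""
    if before == after or len(before) != len(after):
        return None
    i = 0
    while before[i] == after[i]:
        i += 1
    j = len(before) - 1
    while before[j] == after[j]:
        j -= 1
    return (i, j) if invert(before, i, j) == after else None


# Block orders copied from the slide, starting with the CO92 row.
ROWS = [
    [0, 12, 3, 10, 7, 1, 5, 4, 2, 11, 6, 9, 8, 0],
    [0, 8, 9, 6, 11, 2, 4, 5, 1, 7, 10, 3, 12, 0],
    [0, 1, 5, 4, 2, 11, 6, 9, 8, 7, 10, 3, 12, 0],
    [0, 1, 11, 2, 4, 5, 6, 9, 8, 7, 10, 3, 12, 0],
    [0, 1, 3, 10, 7, 8, 9, 6, 5, 4, 2, 11, 12, 0],
    [0, 1, 3, 2, 4, 5, 6, 9, 8, 7, 10, 11, 12, 0],
]


def replay(rows):
    """List the (i, j) inversion that leads from each row to the next."""
    return [find_inversion(a, b) for a, b in zip(rows, rows[1:])]


if __name__ == "__main__":
    for step, span in enumerate(replay(ROWS), start=1):
        print(f"step {step}: invert positions {span}")
```

Every inversion keeps the two flanking 0 entries fixed, consistent with the rearrangements being anchored near the origin.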