
Identification of large-scale genomic rearrangements between closely related organisms

Bob Mau1,2, Aaron Darling1,3, Fred Blattner4,5, Nicole Perna1,5

Departments of Animal Health and Biomedical Sciences1, Oncology2, Computer Science3, Laboratory of Genetics4, Genome Center5

University of Wisconsin – Madison

The Amazing Variety of Diseases caused by E. coli strains

in Bacterial Pathogenesis: A Molecular Approach

“… is due to the fact that different strains have acquired different sets of virulence genes. Most strains of E. coli are avirulent because they lack these virulence genes. E. coli is an excellent example of the maxim that it is the set of virulence genes carried by an organism that makes it a pathogen, not its species or genus designation.”

Categories of Bacterial Genome Evolution

• Local
    - Single base mutations
    - Indels (small insertions and deletions)

• Global (large-scale) rearrangements
    - Inversions, translocations, inverted translocations

• Gene gain and loss
    - Horizontal or lateral transfer: transformation, transduction, and conjugation
    - Phage integration
    - Mobile elements: transposons and insertion sequences
    - Gene duplication (mediated by mobile elements)

From the two E. coli genomes sequenced at the Blattner lab, we’ve identified:

• ~3900 genes common to both K-12 and O157:H7

• 528 genes unique to K-12

• 1387 genes unique to O157:H7

• 40% of these genes are of unknown function.

The primary reasons for these wholesale differences are lateral transfer, phage integration, and one whopper of a duplication.

Strategy of Global Alignment of Two Highly Related Genomes:

Input: genomes K and O, with partially sorted suffix arrays

STEP 1

Quickly find all 16-mer matches between genomes:

(K1,O1), …, (Ki,Oi), …, (Kn,On)
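A minimal sketch of the seed-finding step, for illustration only. The slide uses partially sorted suffix arrays; this sketch swaps in a plain hash index, which gives the same set of exact k-mer pairs on small inputs but not the same speed or memory profile. The function name `seed_matches` is hypothetical, and only the forward strand is considered.

```python
from collections import defaultdict


def seed_matches(k_seq, o_seq, k=16):
    """Return sorted (i, j) pairs such that k_seq[i:i+k] == o_seq[j:j+k].

    Assumption: a dictionary index stands in for the suffix arrays
    named on the slide. Forward-strand matches only.
    """
    index = defaultdict(list)
    for i in range(len(k_seq) - k + 1):
        index[k_seq[i:i + k]].append(i)

    pairs = []
    for j in range(len(o_seq) - k + 1):
        for i in index.get(o_seq[j:j + k], ()):
            pairs.append((i, j))
    return sorted(pairs)


# Toy example with k=4 instead of 16 so the matches are easy to read.
print(seed_matches("ACGTACGGA", "TTACGGAT", k=4))
# → [(3, 1), (4, 2), (5, 3)]
```

The three pairs lie on one diagonal with consecutive start positions, which is the pattern that the next step collapses into a single longer match.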

STEP 2

Collapse consecutive pairs to form a collection of maximal exact matches (MEMs).

Use a longest increasing subsequence (LIS) algorithm to construct a maximal collinear subset of the matches.
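The two operations in this step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: seeds are `(i, j)` start pairs of length `k`, a MEM is stored as `(start_in_K, start_in_O, length)`, and the LIS maximizes the number of MEMs without weighting by length or checking overlaps. The names `collapse_to_mems` and `collinear_subset` are hypothetical.

```python
from bisect import bisect_left


def collapse_to_mems(pairs, k):
    """Merge seeds on the same diagonal whose starts are consecutive."""
    mems = []
    # Group by diagonal (j - i), then walk each diagonal left to right.
    for i, j in sorted(pairs, key=lambda p: (p[1] - p[0], p[0])):
        if mems:
            pi, pj, plen = mems[-1]
            same_diagonal = (pj - pi) == (j - i)
            next_seed = i == pi + plen - k + 1
            if same_diagonal and next_seed:
                mems[-1] = (pi, pj, plen + 1)
                continue
        mems.append((i, j, k))
    return sorted(mems)


def collinear_subset(mems):
    """Longest chain of MEMs strictly increasing in both genomes."""
    # Ties in the first coordinate are ordered by descending second
    # coordinate so that two MEMs with the same start cannot both be kept.
    mems = sorted(mems, key=lambda m: (m[0], -m[1]))
    tails, tails_idx = [], []          # patience-sorting state
    prev = [None] * len(mems)
    for idx, (_, j, _) in enumerate(mems):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tails_idx.append(idx)
        else:
            tails[pos] = j
            tails_idx[pos] = idx
        prev[idx] = tails_idx[pos - 1] if pos > 0 else None

    chain = []
    idx = tails_idx[-1] if tails_idx else None
    while idx is not None:
        chain.append(mems[idx])
        idx = prev[idx]
    return chain[::-1]


print(collapse_to_mems([(3, 1), (4, 2), (5, 3)], k=4))
# → [(3, 1, 6)]
print(collinear_subset([(0, 10, 5), (6, 0, 5), (12, 6, 5), (20, 14, 5)]))
# → [(6, 0, 5), (12, 6, 5), (20, 14, 5)]
```

In the second call the MEM at `(0, 10)` is dropped because keeping it would break the increasing order in the second genome.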

STEP 3

Extend across intervening regions via anchored alignments from individual MEM endpoints

[Diagram labels: Unique Insert, Substitution]

K-12 vs O157:H7 MEM Stats

• 43,235 total MEMs (24 bps)

• 31,640 form maximal collinear subset

• The largest exact match is 2,632 bases

• 62 MEMs exceed 1000 bps

• Over 11,000 exceed 100 bps

• 18,212 single base differences (SNPs)

• Resulted in a segmentation of O157:H7 into 357 intervals of backbone or unique insert.

A Three-way Genomic Comparison: Parkhill et al., Nature

E. coli K-12 MG1655

S. Typhi CT18

S. Typhimurium LT2

The “Traditional” WAY to view MEMs

$\{(a_0,b_0),(a_1,b_1),\ldots,(a_K,b_K)\}$ for K+1 genomes

For the reference genome G0, $a_0 < b_0$ by convention.

For the non-reference genomes, $a_k < b_k$ means the match has the same orientation as G0; $a_k > b_k$ means the match occurs on the opposite strand (reverse complement).

A novel approach, wherein:

• Extensibility: works just as well for N as it does for 2 genomes, provided there is sufficient sequence similarity.

• Automatically identifies inversions, translocations, and inverted translocations

• Determines a maximal collinear subset within each locally collinear region, without recourse to an LIS step

• Very space efficient and very fast

25 January 2001

Nature 409, 529-533 (2001) © Macmillan Publishers Ltd.

Genome sequence of enterohaemorrhagic Escherichia coli O157:H7

NICOLE T. PERNA, GUY PLUNKETT III, VALERIE BURLAND, BOB MAU, JEREMY D. GLASNER, DEBRA J. ROSE, GEORGE F. MAYHEW, PETER S. EVANS, JASON GREGOR, HEATHER A. KIRKPATRICK, GYÖRGY PÓSFAI, JEREMIAH HACKETT, SARA KLINK, ADAM BOUTIN, YING SHAO, LESLIE MILLER, ERIK J. GROTBECK, N. WAYNE DAVIS, ALEX LIM, EILEEN T. DIMALANTA, KONSTANTINOS D. POTAMOUSIS, JENNIFER APODACA, THOMAS S. ANANTHARAMAN, JIEYI LIN, GALEX YEN, DAVID C. SCHWARTZ, RODNEY A. WELCH & FREDERICK R. BLATTNER

The bacterium Escherichia coli O157:H7 is a worldwide threat to public health and has been implicated in many outbreaks of haemorrhagic colitis, some of which included fatalities caused by haemolytic uraemic syndrome. Close to 75,000 cases of O157:H7 infection are now estimated to occur annually in the United States. The severity of disease, the lack of effective treatment and the potential for large-scale outbreaks from contaminated food supplies have propelled intensive research on the pathogenesis and detection of E. coli O157:H7 (ref. 4). Here we have sequenced the genome of E. coli O157:H7 to identify candidate genes responsible for pathogenesis, to develop better methods of strain detection and to advance our understanding of the evolution of E. coli, through comparison with the genome of the non-pathogenic laboratory strain E. coli K-12 (ref. 5). We find that lateral gene transfer is far more extensive than previously anticipated. In fact, 1,387 new genes encoded in strain-specific clusters of diverse sizes were found in O157:H7. These include candidate virulence factors, alternative metabolic capacities, several prophages and other new functions, all of which could be targets for surveillance.

Nature © Macmillan Publishers Ltd 2001 Registered No. 785998 England.

Multiple Oriented Offset

For each non-reference genome, determine the polarity with respect to G0:

$p_i = \pm 1$

as well as the offset:

$d_k = a_0 - p_k \, a_k$

The Multiple Oriented Offset is the N-vector:

$\mathrm{Moo} = \{(d_1, p_1), \ldots, (d_N, p_N)\}$
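A small sketch of how a Multiple Oriented Offset could be computed from one MEM. It assumes the MEM is given as a list of `(a, b)` coordinate pairs with the reference genome first, that polarity is +1 when a < b and -1 otherwise, and that the offset is the reference start minus polarity times the start in the other genome. The sign convention of the offset is an assumption here, since the formula on the slide did not survive extraction cleanly. The function name is hypothetical.

```python
def multiple_oriented_offset(mem):
    """Return ((d_1, p_1), ..., (d_N, p_N)) for one MEM.

    mem: [(a0, b0), (a1, b1), ..., (aN, bN)], reference genome first.
    Assumed convention: d_k = a0 - p_k * a_k.
    """
    (a0, b0), others = mem[0], mem[1:]
    if not a0 < b0:
        raise ValueError("reference interval must satisfy a0 < b0")

    moo = []
    for a_k, b_k in others:
        p_k = 1 if a_k < b_k else -1   # -1: match is on the reverse strand
        d_k = a0 - p_k * a_k
        moo.append((d_k, p_k))
    return tuple(moo)


# Genome 1 matches in the forward direction, genome 2 on the reverse strand.
m1 = [(100, 150), (300, 350), (980, 930)]
m2 = [(200, 230), (400, 430), (880, 850)]
print(multiple_oriented_offset(m1))   # → ((-200, 1), (1080, -1))
print(multiple_oriented_offset(m2))   # → ((-200, 1), (1080, -1))
```

Under this convention a forward match keeps a_0 - a_k constant along its diagonal and a reverse match keeps a_0 + a_k constant along its antidiagonal, so the two MEMs in the example share the same Moo.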

Canonical MEM Equivalence Classes

By appending the interval in reference genome coordinates: (a0, b0) to the Moo, the MEM is completely specified.

We aggregate MEMs by their generalized offset,

$GO = \sum_{k=1}^{K} |d_k| = \sum_{k=1}^{K} |a_0 - p_k \, a_k|$

inducing a partition on the set of MEMs. This defines a CMemEC:

$\{\mathrm{Moo}, \{(a_0^1, b_0^1), (a_0^2, b_0^2), \ldots, (a_0^M, b_0^M)\}\}$
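The aggregation into equivalence classes might look like the sketch below. It is an illustration under assumptions: MEMs are lists of `(a, b)` pairs with the reference first, the offset convention is reference start minus polarity times the other start, and each class is keyed directly by the full Moo while the generalized offset is computed as the sum of absolute offsets. All function names are hypothetical.

```python
from collections import defaultdict


def moo_of(mem):
    """Moo of one MEM, assuming d_k = a0 - p_k * a_k."""
    a0 = mem[0][0]
    out = []
    for a_k, b_k in mem[1:]:
        p_k = 1 if a_k < b_k else -1
        out.append((a0 - p_k * a_k, p_k))
    return tuple(out)


def generalized_offset(moo):
    """Sum of absolute offsets over the non-reference genomes."""
    return sum(abs(d) for d, _ in moo)


def canonical_classes(mems):
    """Map each Moo to the sorted list of its reference intervals."""
    classes = defaultdict(list)
    for mem in mems:
        classes[moo_of(mem)].append(mem[0])
    return {moo: sorted(intervals) for moo, intervals in classes.items()}


mems = [
    [(200, 230), (400, 430)],   # forward, offset -200
    [(100, 150), (300, 350)],   # forward, offset -200
    [(500, 540), (900, 860)],   # reverse strand, offset 1400
]
for moo, intervals in canonical_classes(mems).items():
    print(moo, generalized_offset(moo), intervals)
```

Each printed row is one class: the shared Moo, its generalized offset, and the reference-coordinate intervals of the member MEMs, which together are enough to recover every MEM in the class.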

In this example, it’s abundantly clear from the plot that there are two large rearrangements, one around the origin and the other about the terminus of replication.

We could probably get by with modest extensions of existing methods (MUMmer or our earlier algorithm) to account for the large amounts of laterally transferred lineage-specific sequence.


But, hey, biology ain’t easy ...

Figure 1: Simplest Block and Strip Diagram

[Figure: reference strip G0 with blocks 1 2 3 4 5 6 7, and strips G1 to G4 drawn as signed block orders. Signed block orders shown: 1 -7 5 6 4 3 2; -3 -2 -1 -7 5 6 4; -7 4 5 6 -3 -2 -1; 1 -3 -2 4 5 6 7.]

[Figure: reference G0 and genomes G1 to G5 drawn as signed block orders, with cut point, terminus, and origin marked. G4: 1 2 -3 4 5 -6 7. Other signed block orders shown: 1 2 3 4 5 6 7; 1 2 -3 4 -6 -5 7; 1 2 3 4 6 -5 7; 1 -3 -2 4 5 6 7.]

Figure 2: Example with Variable Block Lengths

Figure 1: Large-scale Genomic Rearrangements

[Figure: Genomes 1 to 5 with zero point, terminus, and origin marked, shown alongside a species tree rooted at the MRCA.]

Figure 3: Segmentation Graph S(G0)

Look at the Picture and

Sorted Merge Lists of Six Enterobacterial strains

Strain    Group          Species
MG1655    K-12           Escherichia coli
W3110     K-12           Escherichia coli
EDL933    O157:H7        Escherichia coli
Sakai     O157:H7        Escherichia coli
CT18      Typhi          Salmonella enterica
LT2       Typhimurium    Salmonella enterica

Six SMLs of bimers, one for each genome. A bimer is the lexicographically lesser of an n-mer (we use n=23) and its reverse complement, together with an orientation flag.
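The bimer definition can be sketched in a few lines. This is a minimal illustration, assuming uppercase A/C/G/T input, a boolean orientation flag (True when the forward n-mer is the lesser one), and a plain in-memory sort; the slide does not specify the storage layout of the lists, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def bimer(nmer):
    """Return (canonical n-mer, forward flag) for one n-mer."""
    rc = reverse_complement(nmer)
    return (nmer, True) if nmer <= rc else (rc, False)


def sorted_bimer_list(seq, n=23):
    """Sorted list of (canonical n-mer, forward flag, position)."""
    entries = []
    for pos in range(len(seq) - n + 1):
        canonical, forward = bimer(seq[pos:pos + n])
        entries.append((canonical, forward, pos))
    return sorted(entries)


# Toy example with n=3 so the entries are easy to check by hand.
for entry in sorted_bimer_list("GTTAAC", n=3):
    print(entry)
# → ('AAC', False, 0)
# → ('AAC', True, 3)
# → ('TAA', False, 1)
# → ('TAA', True, 2)
```

Because an n-mer and its reverse complement map to the same canonical string, matches on either strand end up adjacent in the sorted list, and the flag records which strand each one came from.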

A Transformation of CO92 to KIM by Inversions Near the Origin

CO92: 0 12 3 10 7 1 5 4 2 11 6 9 8 0

      C20 C21 C22 C22.5 C23 C24 C25 C1 C2 C3 C4 C5 C6 C7

KIM:  0 1 2 3 4 5 6 7 8 9 10 11 12 0

      K5 K4 K3 K2 K1 K25 K24 K23 K22 K21 K20.5 K20 K19 K18


0 8 9 6 11 2 4 5 1 7 10 3 12 0

0 1 5 4 2 11 6 9 8 7 10 3 12 0

0 1 11 2 4 5 6 9 8 7 10 3 12 0

0 1 3 10 7 8 9 6 5 4 2 11 12 0

0 1 3 2 4 5 6 9 8 7 10 11 12 0

0 1 3 2 4 5 6 9 8 7 10 11 12 0
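The rows above can be reproduced by applying segment reversals one at a time. The sketch below is illustrative: the inversion endpoints are not given on the slide and were inferred here by comparing consecutive rows, and block signs are omitted because the slide lists unsigned orders. The names `invert` and `apply_inversions` are hypothetical.

```python
def invert(order, i, j):
    """Return a copy of order with positions i..j (inclusive) reversed."""
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]


def apply_inversions(order, inversions):
    """Return the list of block orders after each successive inversion."""
    history = [order]
    for i, j in inversions:
        order = invert(order, i, j)
        history.append(order)
    return history


# Starting order as listed for CO92; the flanking zeros stay fixed.
co92 = [0, 12, 3, 10, 7, 1, 5, 4, 2, 11, 6, 9, 8, 0]

# Inferred endpoints (0-based positions), one pair per row transition.
inversions = [(1, 12), (1, 8), (2, 5), (2, 11), (3, 10)]

for row in apply_inversions(co92, inversions):
    print(" ".join(str(block) for block in row))
# Last row printed: 0 1 3 2 4 5 6 9 8 7 10 11 12 0
```

Every intermediate row printed by the loop matches the corresponding row listed above, ending at the last order shown there rather than at the fully sorted order.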