Date post: 13-May-2019
Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. [email protected]
Page 1:

Comparative genomics: Overview & Tools + MUMmer algorithm

Urmila Kulkarni-Kale

Bioinformatics Centre

University of Pune, Pune 411 007.

[email protected]

Page 2:

Jan 21, 2010 © UKK, Bioinformatics Centre,

University of Pune.

2

Genome sequence: Fact file

• 1995: The first complete genome sequence of a free-living organism, Haemophilus influenzae Rd, was published

• Biological systems are dynamic and evolving

• The fourth dimension: Time

• A genome sequence is a snapshot of evolution

• The correlation between phenotypic properties and genomic regions is not straightforward, because phenotypic properties result from many-to-many interactions

Page 3:

Genomes: the current status

• Published complete genomes: 403

» Archaeal: 81

» Bacterial: 1226

» Eukaryal: 169

• Ongoing:

» Archaeal: 107

» Prokaryotic: 3478

» Eukaryotic: 1209

• Viral: >4500

• Metagenomics: 203

Source: GOLD database, as of Jan 21, 2010

Page 4:

Genome databases

• Genomes at NCBI, EBI, TIGR

Page 5:

H. influenzae Complete Genome

Page 6:

Function information clock of E. coli

Generated in March 2004

Page 7:

Comparison of the coding regions

• Begins with the gene identification algorithm: infer which portions of the genomic sequence actively code for genes.

• There are four basic approaches.
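A minimal sketch of one simple way to flag candidate coding regions, assuming that an open reading frame (ATG up to the first in-frame stop codon) on the forward strand is a usable first approximation. The slide does not name the four approaches, so this ORF scan, the function name find_orfs and the min_codons threshold are illustrative assumptions, not the methods the slide refers to.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts every codon, including the start and the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the frame and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGCC"  # hypothetical toy sequence
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

Real gene finders also scan the reverse strand and use statistical models of coding sequence; this sketch only shows the basic idea of reading frames.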

Page 8:

Knowledge of full genome sequence: solutions or new questions…?

• Still struggling with the gene counters…

• Correct # of genes…?

Page 9:

Genome analyses

• Variation in

– Genome size (E. coli: 4.6 Mbp; M. pneumoniae: 0.81 Mbp; B. subtilis: 4.20 Mbp)

– GC content (B. burgdorferi: 29%; M. tuberculosis: 68%)

– Codon usage

– Amino acid composition (G, A, P, R: GC rich; I, F, Y, M, D: AT rich)

– Genome organisation

• Single circular chromosomes

• Linear chromosome + extra chromosomal elements
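Genome size and GC content of the kind quoted on this slide can be computed for any sequence with a few lines of Python. This is a sketch: the function name gc_content and the toy sequence are illustrative, and ambiguous bases such as N are simply left out of the denominator.

```python
def gc_content(seq):
    """Return the percentage of G and C among the A, C, G, T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # empty sequence, or no unambiguous bases at all
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / acgt


if __name__ == "__main__":
    toy = "ACTGATTACGTGAACTGGATCCA"  # toy sequence, not a real genome
    print("size (bp):", len(toy))
    print("GC content (%):", round(gc_content(toy), 1))
```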

Page 10:

CG: Comparisons between genomes

• The strains of the same species

• The closely related species

• The distantly related species

– List of Orthologs

– Evolution of individual genes

– Evolution of organisms

Page 11:

Page 12:

CG helps to ask some interesting questions

• Identifying similarities and differences between genomes may help us understand:

– How two organisms evolved

– Why certain bacteria cause disease while others do not

– How to identify and prioritize drug targets

Page 13:

CG: Unit of comparison

• Unit of comparison: Gene/Genome

– Number

– Content (sequence)

– Location (map position)

– Gene Order

– Gene cluster: genes that are part of a known metabolic pathway are often found together as a group

– Synteny: colinearity of gene order

– Syntenic group (or syntenic cluster): a conserved group of genes in the same order in two genomes

– Translocation: movement of a genomic segment from one position to another
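The idea of a syntenic group can be sketched in code as a run of genes that are adjacent and in the same order in both genomes. The sketch below assumes each genome is given as a list of ortholog names in map order, with each name occurring once per genome; the function name syntenic_blocks, the min_len parameter and the gene names are hypothetical. It only detects runs in the same orientation, so inverted blocks are not reported.

```python
def syntenic_blocks(order_a, order_b, min_len=2):
    """Return maximal runs of genes that are consecutive in both gene orders."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks = []
    current = []
    for gene in order_a:
        extends = (
            current
            and gene in pos_b
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
            continue
        # The run is broken: store it if long enough, then start over.
        if len(current) >= min_len:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    genome_a = ["g1", "g2", "g3", "g4", "g5", "x", "y"]
    genome_b = ["y", "g1", "g2", "g3", "q", "g4", "g5"]
    print(syntenic_blocks(genome_a, genome_b))
```

In this toy case the inserted gene q in the second genome splits one colinear region into two blocks, and y counts as translocated because it has no conserved neighbour.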

Page 14:

Structure of tryptophan operon

• Numbers: Gene number

• Arrows: Direction of transcription

• //: Dispersion of operon by 50 genes

[Figure: structure of the tryptophan operon across genomes. Labels: domain fusion; trpD and trpG; trpF and trpC; trpB and trpA; genetically linked; separate genes. Source: Dandekar et al., 1998]

Page 15:

Important observations with regard to Gene Order

• Gene order is highly conserved in closely related species but is changed by rearrangements

• With greater evolutionary distance, there is no correspondence between the gene order of orthologous genes

• Groups of genes with similar biochemical function tend to remain localized

– Genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes

Page 16:

Synteny

• Refers to regions of two genomes that show considerable similarity in terms of

– sequence and

– conservation of the order of genes

• Such regions are likely to be related by common descent.

Page 17:

COGs: Phylogenetic classification of proteins encoded in complete genomes

Page 18:

Genome analyses @ NCBI

Pairwise genome comparison of protein homologs (symmetrical best hits)

http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi
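Symmetrical (reciprocal) best hits can be sketched as follows: protein a in genome A and protein b in genome B are paired only if b is the best hit of a and a is the best hit of b. The input format (dictionaries mapping (query, subject) to a similarity score), the function names and the protein identifiers are assumptions made for illustration; this is not the code behind the NCBI page.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores, e.g. bit scores from two all-against-all searches.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
              ("a2", "b2"): 120, ("a3", "b1"): 90}
    b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 118, ("b2", "a1"): 40}
    print(symmetrical_best_hits(a_vs_b, b_vs_a))
```

Here a3 is left unpaired: its best hit is b1, but the best hit of b1 is a1.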

Page 19:

Integr8: CG site at EBI

http://www.ebi.ac.uk/integr8

Page 20:

Comparative Genomics Tools

• BLAST2

• MUMmer

• PipMaker

• AVID/VISTA

• Comparisons and analyses at both the nucleic acid and protein levels

Page 21:

BLAST2

• Available at NCBI

• Input: GI or FASTA sequence (range can be specified)

• Output:

– Graphical

– Alignment of 2 genomes

Page 22:

Genome Alignment Algorithm: MUMmer

• Developed by

– Dr. Steven Salzberg’s group at TIGR

– NAR (1999) 27:2369-2376

– NAR (2002) 30:2478-2483

• Availability

– Free

– TIGR site

Page 23:

Features of MUMmer

• The algorithm assumes that sequences are closely related

• Can quickly compare millions of bases

• Outputs:

– Base to base alignment

– Highlights the exact matches and differences in the genomes

– Locates

• SNPs

• Large inserts

• Significant repeats

• Tandem repeats and reversals

Page 24:

Definitions are drawn from biology

• SNP: a single-base mutation flanked by two matching regions

• Mutation regions: regions of DNA where the two sequences have diverged by more than one SNP

• Large inserts: regions inserted into one of the genomes

– For example through sequence reversals or lateral gene transfer

• Repeats: a form of duplication that has occurred in either genome.

• Tandem repeats: regions of repeated DNA in immediate succession, but with a different copy number in different genomes.

– The copy number can be fractional, e.g. a repeat can occur 2.5 times

Page 25:

Techniques used in the MUMmer algorithm

• Compute suffix trees for every genome

• Longest Increasing Subsequence (LIS)

• Alignment using the Smith & Waterman algorithm

Integration of these techniques for genome alignment
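Of the three techniques, Smith & Waterman local alignment is the easiest to show in a few lines. The sketch below computes only the best local alignment score with a linear gap penalty; the scoring values (match=2, mismatch=-1, gap=-2) and the function name are assumptions chosen for illustration, not the parameters MUMmer uses.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

The table has len(a) x len(b) cells, which is why this step is applied only to the short gaps left between MUMs rather than to whole genomes.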

Page 26:

MUMmer: Steps in the alignment process

1. Read two genomes

2. Find the Maximal Unique Matches (MUMs) between the genomes

3. Sort and order the MUMs using LIS

4. Close the gaps in the alignment, identifying SNPs, mutation regions, repeats and tandem repeats

5. Output the alignment

• MUMs

• Regions that do not match exactly

Page 27:

MUMmer steps

• Locating MUMs

• Sorting MUMs

• Closure with gaps

G1: ACTGATTACGTGAACTGGATCCA

G2: ACTCTAGGTGAAGTGATCCA

Page 28:

Genome1: ACTGATTACGTGAACTGGATCCA

Genome2: ACTCTAGGTGAAGTGATCCA

ACTGATTACGTGAACTGGATCCA

ACTC--TAGGTGAAGT-GATCCA

Page 29:

What is a MUM?

• A MUM (maximal unique match) is a subsequence that occurs exactly once in each genome and is NOT contained in any longer such match

• The characters immediately flanking a MUM on either side are always mismatches

• Principle: if a long matching sequence occurs exactly once in each genome, it is almost certain to be part of the global alignment

GenA: tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta

GenB: gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag

Similar to

BLAST & FASTA!!
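The definition of a MUM can be checked directly on the GenA/GenB example with a brute-force search. This is only a sketch for short sequences: it takes time roughly quadratic in sequence length, whereas MUMmer finds MUMs with a suffix tree. The function names and the min_len parameter are assumptions made for illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    a, b = a.upper(), b.upper()
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal here
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1  # extend to the right as far as possible
            if k < min_len:
                continue
            match = a[i:i + k]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    gen_a = "tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta"
    gen_b = "gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag"
    for pos_a, pos_b, length in find_mums(gen_a, gen_b, min_len=20):
        print(pos_a, pos_b, length, gen_a[pos_a:pos_a + length])
```

With min_len=20 the search reports the upper-case block of the example, whose flanking characters differ on both sides.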

Page 30:

Sorting & ordering MUMs

• MUMs are sorted according to their position in Genome A

• The order of the matching MUMs in Genome B is then considered

• The LIS algorithm locates the longest set of MUMs that occur in ascending order in both genomes

[Figure: MUMs numbered along Genome A and Genome B. Labels: MUM5: transposition; MUM3: random match; inexact repeat]

Leads to global MUM-alignment
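The ordering step can be sketched with a standard longest increasing subsequence routine. The sketch assumes each MUM is a (pos_a, pos_b, length) tuple and uses the plain O(n log n) LIS on the Genome B positions; it does not weight MUMs by length or handle overlaps, which a production implementation may do. Function names and the example coordinates are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []      # tails[k]: smallest last value of a run of length k + 1
    tails_idx = []  # index in values of that last value
    prev = [-1] * len(values)  # back-pointers used to rebuild the run
    for i, value in enumerate(values):
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[k] = value
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(i)
        i = prev[i]
    return result[::-1]


def order_mums(mums):
    """Keep the largest set of MUMs that ascends in both genomes."""
    by_a = sorted(mums)  # sort by position in Genome A
    keep = longest_increasing_subsequence([m[1] for m in by_a])
    return [by_a[i] for i in keep]


if __name__ == "__main__":
    # Seven toy MUMs; the third and fifth are out of order in Genome B.
    mums = [(10, 10, 5), (20, 20, 5), (30, 90, 5), (40, 40, 5),
            (50, 15, 5), (60, 60, 5), (70, 70, 5)]
    print(order_mums(mums))
```

The MUMs dropped by this step are the candidates for random matches, repeats or transpositions.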

Page 31:

MUMmer Results

• 2 strains of M. tuberculosis

– H37Rv & CDC1551

– Genome size: 4Mb

– Time: 55 s

• Generating suffix tree: 5 s

• Sorting MUMs: 45 s

• S&W alignment: 5 s

Page 32:

Alignment of M. tuberculosis strains

CDC1551 (top) & H37Rv (bottom)

• Single green lines indicate SNPs

• Blue lines indicate insertions

Page 33:

Comparison of 2 Mycoplasma genomes: cousins that are distantly related

• M. genitalium: 580 074 nt

• M. pneumoniae: 816 394 nt (+236 320)

• Analysis of proteins tells us that all M. genitalium proteins are present in M. pneumoniae

• Alignment was carried out using

– FASTA (each genome divided into 1000 bp segments; all-against-all searches)

– Fixed-length patterns (25-mers)

– MUMmer (MUM length = 25)
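The fixed-length pattern approach can be sketched by indexing every k-mer of one genome and looking up every k-mer of the other; each hit is one dot in a dot plot. The function name kmer_matches is an assumption, k defaults to the 25 used on the slide, and the toy sequences are not Mycoplasma data.

```python
from collections import defaultdict


def kmer_matches(a, b, k=25):
    """Return sorted (i, j) pairs where a[i:i+k] equals b[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)  # every start position of each k-mer in a
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return sorted(hits)


if __name__ == "__main__":
    print(kmer_matches("ACGTACGT", "TACG", k=4))
```

Unlike a MUM, a fixed-length match does not have to be unique or maximal, so repeats produce many hits and one long match is reported as a run of overlapping k-mer hits.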

Page 34:

Comparison of 2 Mycoplasma genomes

[Figure: alignment plots of the two genomes. Panels: using FASTA; fixed-length patterns (25-mers); MUMmer]

Page 35:

Post-sequencing challenges

• Genome sequencing is just the beginning of appreciating biocomplexity

• Sequence-based function assignment approaches fail as sequence similarity drops …

• Structure-based function prediction approaches are limited by the availability of structures and of associations between structural motifs and functional descriptors

• As a result, in any genome:

– Genes with unknown function: ~60%

– Genes with known function: ~40%

