De Novo Repeat Classification and Fragment Assembly

Post on 12-Jan-2016

Pusan National University, Interdisciplinary Program of Bioinformatics

De Novo Repeat Classification and Fragment Assembly

First-year master's student: Kim Woo-yeon (김우연)

PROGRAMS Related to Repeats

Repeat Annotation (library-based)
  RepeatMasker (A.F.A. Smit and P. Green, unpubl.)
  MaskerAid (Bedell et al. 2000)
  No de novo compilation

Repeat Analysis
  RepeatMatch (Delcher et al. 1999)
  REPuter (Kurtz et al. 2000, 2001)
  RECON, RepeatFinder, LTR_STRUC
  No compact overview or summary of the repeat family

Genome Research. Received January 27, 2004; accepted in revised form June 29, 2004.

CONTENTS

Introduction
Concepts
Methods
  De Bruijn Graphs & A-Bruijn Graphs
  RepeatGluer Algorithm
  Constructing A-Bruijn Graphs Without the Similarity Matrix
  Fragment Assembly
  FragmentGluer Algorithm
Results and Discussion

INTRODUCTION

“The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack” – Bao and Eddy (2002)

One of the difficulties in repeat classification is that many repeats represent mosaics of sub-repeats – Bailey et al. 2002

Aims
  Propose a new approach to repeat classification
  Present the FragmentGluer assembler

CONCEPTS

Genomic dot-plot

Genomic dot-plot of an imaginary sequence

An imaginary evolutionary process

Gluing repeated regions leads to the repeat graph of the final genome

The idea of our approach

By gluing points together, repeats transform into the A-Bruijn graph

Mosaic repeat organization

BAC from human Chromosome Y
Repeat pairs by REPuter & sub-repeats by our division
Repeat multigraph
Repeat graph
RepeatFinder vs RECON vs REPuter

METHODS

De Bruijn Graphs & A-Bruijn Graphs

De Bruijn Graph: ACTGCTGCC

[Figure: de Bruijn graph of ACTGCTGCC with vertices ACT, CTG, TGC, GCT, GCC]
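The graph on this slide can be reproduced with a minimal sketch. It assumes, as the vertex labels suggest, that vertices are the distinct 3-mers of the sequence and that an edge joins the k-mers found at consecutive positions. The function name `de_bruijn_kmer_graph` is hypothetical and is not taken from the paper.

```python
from collections import defaultdict


def de_bruijn_kmer_graph(sequence, k):
    """Map each k-mer to the list of k-mers that follow it in `sequence`.

    Assumption: vertices are k-mers and edges link k-mers that start at
    consecutive positions; repeated edges are kept as duplicates.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    edges = defaultdict(list)
    for left, right in zip(kmers, kmers[1:]):
        edges[left].append(right)
    return dict(edges)


graph = de_bruijn_kmer_graph("ACTGCTGCC", 3)
for kmer, successors in graph.items():
    print(kmer, "->", successors)
```

The repeated substring CTGC makes CTG and TGC appear twice in the sequence, so both occurrences collapse onto the same vertices and the path forms a loop through GCT before it exits to GCC.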

De Bruijn Graphs & A-Bruijn Graphs

A-Bruijn Graph: … AT … ACT … ACAT …

Whirls & Bulges

Gaps & mismatches are allowed
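The gluing idea can be sketched with a union-find structure. This is an illustration only: it assumes the similarity information arrives as a list of aligned position pairs (i, j), glues each pair into one vertex, and then adds an edge for every pair of consecutive positions in S. The names `glue_positions` and `aligned_pairs` are hypothetical, and whirl and bulge handling is not covered.

```python
def glue_positions(n, aligned_pairs):
    """Glue aligned positions of a length-n sequence into shared vertices.

    Returns (vertex_of, edges): vertex_of[i] is the representative of the
    glued class of position i, and edges lists (vertex_of[i], vertex_of[i+1])
    for every pair of consecutive positions.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the class representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in aligned_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Attach the larger root to the smaller one.
            parent[max(ri, rj)] = min(ri, rj)

    vertex_of = [find(i) for i in range(n)]
    edges = [(vertex_of[i], vertex_of[i + 1]) for i in range(n - 1)]
    return vertex_of, edges


# Hypothetical example: positions 0-2 repeat at positions 5-7.
vertex_of, edges = glue_positions(8, [(0, 5), (1, 6), (2, 7)])
print(vertex_of)
print(edges)
```

In this toy input the two copies of the repeat collapse onto the same three vertices, so the edges inside the repeat appear twice, which is how edge multiplicity reflects copy number.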

RepeatGluer Algorithm

Construct the A-Bruijn graph
Eliminate whirls
Remove bulges
Erosion: remove all leaves
Straighten zigzag paths
Form the consensus sequence
Output repeat families
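The erosion step can be illustrated with a small sketch. The slides say only that erosion removes all leaves, so the details here are assumptions: the graph is treated as undirected, a leaf is a vertex with exactly one neighbor, and the number of passes is a hypothetical `rounds` parameter.

```python
def erode(adjacency, rounds=1):
    """Return a copy of an undirected graph after removing leaves.

    adjacency maps each vertex to the set of its neighbors. Each round
    removes every vertex that has exactly one neighbor at the start of
    that round (assumed definition of a leaf).
    """
    graph = {v: set(ns) for v, ns in adjacency.items()}
    for _ in range(rounds):
        leaves = [v for v, ns in graph.items() if len(ns) == 1]
        if not leaves:
            break
        for leaf in leaves:
            # Drop the leaf and detach it from any surviving neighbor.
            for neighbor in graph.pop(leaf):
                if neighbor in graph:
                    graph[neighbor].discard(leaf)
    return graph


# Hypothetical example: a triangle 1-2-3 with a two-edge tip 3-4-5.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
after_one = erode(toy, rounds=1)
after_two = erode(toy, rounds=2)
print(sorted(after_one))
print(sorted(after_two))
```

One round trims only the outermost vertex of the tip, and a second round removes the rest, leaving the cycle untouched. A real implementation would also have to avoid trimming the legitimate ends of the sequence, which this sketch does not attempt.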

Constructing A-Bruijn Graphs Without the Similarity Matrix

Construction of the A-Bruijn graph assumes that S and A are given
S and { S1, …, St } can also be used to construct the A-Bruijn graph of S

A set for every pair of consecutive positions in S
Matrix |Si| x |Sj|

A snapshot of a “small” area of matrix A

S: a genomic sequence
n: the length of S
A: an n x n matrix
{ S1, …, St }: a set of substrings
|Si|: the length of the string Si

Fragment Assembly

Assemblers
  Phrap (Green 1994)
  Celera assembler (Myers et al. 2000)
  EULER assembler (Pevzner et al. 2001)
  http://nbcr.sdsc.edu/euler
  ARACHNE, Phusion, CAP, TIGR

Building an accurate assembler
  EULER + Phrap: EULER+
  Combines EULER's accuracy in analyzing repeats with Phrap's ability to handle low-coverage regions, low-quality reads, and read ends
  Less memory than the original EULER
  FragmentGluer algorithm

FragmentGluer Algorithm

1. Construct the A-Bruijn graph of S
2. Eliminate whirls by splitting the composed vertices
3. Remove bulges
4. Erosion procedure by removing all leaves
5. Straighten zigzag paths
6. Thread each read
7. Define the consensus sequence
8. Output repeat families
9. Transform mate-pairs into mate-paths after step 6
10. Assemble the resulting contigs into scaffolds by the EULER Scaffolding algorithm

RESULTS AND DISCUSSION

Benchmarking

Results of a study of 518 human chromosome 20 clones.

                          Phrap     ARACHNE   EULER+
Av. # contigs/clone       6.8       13.8      6.2
Av. coverage              99.30%    98.60%    98.80%
# misassembled contigs    37        17        7
# missing repeats         5         9         4

EULER+ produced the fewest misassembled contigs (7, versus 17 for ARACHNE and 37 for Phrap). EULER+ also had the fewest missing repeat copies (4), ahead of Phrap (5) and ARACHNE (9). Average coverage over the 518 clones was 99.3% for Phrap, 98.8% for EULER+, and 98.6% for ARACHNE. The average number of contigs per clone was lowest for EULER+ (6.2), followed by Phrap (6.8) and ARACHNE (13.8).

Further research

The consensus sequence analysis of FragmentGluer
Detecting de novo HERVs as the consensus sequence of FragmentGluer