
Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009.


OUTLINE
-- Assembly process overview
-- Assembly algorithms
-- Repeats
-- Scaffolding
-- Phred/Phrap/Consed
-- Assembly pipelines

Assembly process overview

A Genome Sequencing Project

Building a Library
-- Break DNA into random fragments (8-10x coverage)

SHOTGUNs
-- Whole Genome Shotgun
-- BAC-by-BAC Shotgun
Size of inserts:
-- BAC insert: ~150 kb
-- Fosmid insert: ~30 kb
-- Normal insert: ~3 kb

Clone and scaffold
(a) Clone inserts are sequenced from both ends, yielding mated sequence reads.
(b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly.
Computer 35 (7):47-54

Building a Library
-- Break DNA into random fragments (~10x coverage)
-- Amplify the fragments in a vector
-- Sequence bases at each end

Assembling the fragments
-- Break DNA into random fragments
-- Sequence the ends of the fragments
-- Assemble the sequenced ends

Forward-reverse constraints
-- The sequenced ends face towards each other
-- The distance between the two sequenced ends is known

Building Scaffolds

Assembly Gaps
-- Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
-- Physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

Finishing the Project

Unifying View of Assembly

Assembly Algorithms

Assembly Methods
-- Overlap-layout-consensus
-- Greedy (Phrap, CAP3, TIGR...)
-- Graph-based (Euler)

Phrap/CAP3: Greedy
-- Build a rough map of fragment overlaps
-- Pick the highest-scoring overlap
-- Merge the two fragments
-- Repeat until no more merges can be done
!!! IDEAL CASE !!!

Real World Problems
-- Sequencing errors
-- Chimeras
-- Repeats
-- Contaminants
-- Polymorphism
-- Orientation

Error Correction

Overlap between two sequences

All pairs alignment
-- Trying all pairs means considering ~n^2 pairs
-- Smarter solution: only about n x coverage (e.g. 8) pairs are possible
-- Build a table of the k-mers contained in the sequences (single pass through the genome)
-- Generate the pairs from the k-mer table (single pass through the k-mer table)

Repeats

Repeat sequence
The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly.
Computer 35 (7):47-54

Interspersed repeats
-- Short interspersed element (SINE), e.g. Alu

>PUCsplice.for.end
acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc
gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
aattgttatccgctcacaattccacacaacatacgagccggaagcataaa
>PUCsplice.rev.begin
tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc
ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
>PUCsplice.rev.end
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc
ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

Cross_match
cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /net/share/sequence_pipeline/vector.fasta

Phrap -- Phragment Assembly Program (or Phil's Revised Assembly Program)
Phrap is a program for assembling shotgun DNA sequence data.
Key Features:
a. Uses the entire read content; no need for trimming.
b. Combines user-supplied (e.g. Repbase) and internally computed data for better accuracy of assembly in the presence of repeats.
c. The contig sequence is a mosaic of the highest-quality parts of the reads; it's not a consensus!
d. Provides extensive information about the assembly in the phrap.out, *.ace and *.screen.contigs.qual files.
e. Handles very large datasets; hundreds of thousands of reads are easily manipulated.
f. The generated output files contain important data and enable visualization by other programs.

Banded Search

K-mers
>GL1234.b1
gattaagttgggtaacgccagggttttcccagtcac
gattaagttgggta
 attaagttgggtaa
  ttaagttgggtaac
   taagttgggtaacg
...

Phrap output files
-- *.contigs: fasta file containing the contigs
   -- Contigs with more than one read
   -- Singletons (single reads with a match to some other contig that couldn't be merged consistently into it)
-- *.singlets: fasta file of the singlet reads
   -- Reads with no match to any other read
-- *.ace: allows viewing the assembly using Consed
-- *.view: required for viewing the assembly using Phrapview

Phrap parameters
phrap -new_ace CLONE.fasta.screen > outfile

OPTION        DEFAULT      FUNCTION
-penalty      -2           => stringency
-gap_init     penalty-2
-gap_ext      penalty-1
-minmatch*    14           => time, matches
-bandwidth    14           => time, stringency
-minscore     30           => stringency
* Highly sensitive! Bigger genomes, bigger value.

Phrap parameters

OPTION               DEFAULT   FUNCTION
-forcelevel          0~10      => stringency
-repeat_stringency             stringency
-force_high*                   => stringency
-revise_greedy**               misassembly
-shatter_greedy**              contig length
* Ignore edited high-quality discrepancies
** Break assembly at weak joins

Phrap parameters

OPTION               DEFAULT   FUNCTION
-max_subclone_size   5000      F.-R. (forward-reverse) check
-default_qual        15
-preassemble*
-group_delim*        _
* Used together

Consed
Genome Research 8, 1998
A program for viewing and editing assemblies produced by Phrap.
Key Features:
a. Assembly viewer: allows visualization of contigs, the assembly (aligned reads), quality values of reads and the final sequence.
b. Trace file viewer: single and multiple trace files can be visualized, allowing comparison of a given sequence across several reads.
c. Navigation: identifies and lists regions that are below a given quality threshold, contain high-quality discrepancies, have single-strand coverage, etc.
d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

Phred/Phrap/Consed Pipeline
Directories: Chromat_dir, Phd_dir, Edit_dir

Comparison of shotgun sequence data from the Wolbachia genome project
Computer 35 (7):47-54

Software
CAP3 (for ESTs) and Phrap (for large genomes):
-- Similar algorithm
-- Insufficient documentation and support
-- Always have to write scripts to parse outputs
-- NO PERFECT PROGRAM!!!

Questions??
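
Worked sketch for the "All pairs alignment" slide: the k-mer table idea can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not how Phrap or CAP3 implement it. The function names `kmers` and `candidate_pairs`, the dict-of-reads input, and the read names r1 to r3 are all hypothetical; k=14 simply mirrors the 14-mers shown on the K-mers slide.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def candidate_pairs(reads, k=14):
    """Return the set of read-name pairs that share at least one k-mer.

    reads: dict mapping a read name to its sequence (hypothetical format).
    """
    # Pass 1: index every k-mer by the reads that contain it.
    table = defaultdict(set)
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            table[kmer].add(name)

    # Pass 2: reads listed under the same k-mer become a candidate pair.
    pairs = set()
    for names in table.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


# Toy reads (assumed data): r2 starts with the last 27 bases of r1,
# r3 is unrelated vector-like sequence.
reads = {
    "r1": "gattaagttgggtaacgccagggttttcccagtcac",
    "r2": "gggtaacgccagggttttcccagtcacgacgttgta",
    "r3": "ttatgcttccggctcgtatgttgtgtggaattgtga",
}
print(sorted(candidate_pairs(reads, k=14)))
```

Only the pairs that come out of the table need a full alignment, which is what keeps the work close to n x coverage instead of n^2. Note that this sketch ignores reverse-complement matches and sequencing errors.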

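
Worked sketch for the "Phrap/CAP3: Greedy" slide: the four steps (find overlaps, pick the best, merge, repeat) can be written out directly. This is a toy version of the ideal case only. It assumes error-free reads on one strand and scores an overlap by its exact length, which is an assumption made for illustration and not the alignment score the real programs use. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest such match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no pair overlaps by min_len."""
    contigs = list(fragments)
    while True:
        # Steps 1 and 2: scan all ordered pairs and keep the largest overlap.
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        # Step 4: stop when no more merges can be done.
        if best_len == 0:
            return contigs
        # Step 3: merge the two fragments, writing the shared bases once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping pieces of the first 20 bases of PUCsplice.for.end.
pieces = ["acggccagtg", "cagtgccaag", "ccaagcttgc"]
print(greedy_assemble(pieces))
```

With repeats, sequencing errors, or chimeric reads, the "largest overlap first" rule can join fragments that do not belong together, which is the point of the "Real World Problems" slide.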

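
Worked sketch for the "Forward-reverse constraints" slide: once two mated reads are placed on the same contig, the two conditions (reads face each other, distance is known) can be checked mechanically. The `(start, end, strand)` tuple layout and the name `mates_consistent` are hypothetical choices for this sketch. The default upper bound of 5000 echoes the -max_subclone_size default listed on the Phrap parameters slide, but this is not Phrap's own check.

```python
def mates_consistent(read_a, read_b, max_size=5000, min_size=0):
    """Check the forward-reverse constraint for two mated reads on one contig.

    Each read is a (start, end, strand) tuple with start < end and strand
    "+" or "-" (a hypothetical representation chosen for this sketch).
    """
    if read_a[2] == read_b[2]:
        return False  # mates must lie on opposite strands
    plus, minus = (read_a, read_b) if read_a[2] == "+" else (read_b, read_a)
    if plus[0] > minus[0]:
        return False  # the reads point away from each other
    span = minus[1] - plus[0]  # outer distance covered by the clone insert
    return min_size <= span <= max_size


# A ~3 kb insert whose ends face each other satisfies the constraint.
print(mates_consistent((100, 600, "+"), (2800, 3300, "-")))
# The same placement with swapped strands points outward and fails.
print(mates_consistent((100, 600, "-"), (2800, 3300, "+")))
```

Mate pairs that fail such a check are one signal that a contig may contain a misassembly, for example a collapsed repeat.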