Genome sequence assembly
concepts and methods
Shih-Jon Wang
May 13, 2008
OUTLINE
• Assembly process overview
• Assembly algorithms
• Repeats
• Scaffolding
• Phred/Phrap/Consed
• Assembly pipelines
Assembly process overview
A Genome Sequencing Project
Building a Library
• Break DNA into random fragments (8-10x)
Shotgun strategies
• Whole-genome shotgun
• BAC-by-BAC shotgun
• Size of inserts:
  -- BAC insert: ~150 kb
  -- Fosmid insert: ~30 kb
  -- Normal insert: ~3 kb
Clone and scaffold
(a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses the linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly.
Computer 35 (7):47-54
Building a Library
• Break DNA into random fragments (~10x)
  -- Amplify the fragments in a vector
  -- Sequence 800-1000 bases at each end
Assembling the fragments
• Break DNA into random fragments
• Sequence the ends of the fragments
• Assemble the sequenced ends
Forward-reverse constraints
• The sequenced ends face towards each other
• The distance between the two sequenced ends is known
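The two forward-reverse constraints can be checked mechanically once mated reads have been placed in a layout. The following is a minimal sketch, not taken from any particular assembler: the `Placement` record, the `mates_consistent` function and the insert-size bounds are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    start: int     # leftmost coordinate of the read in the layout
    end: int       # rightmost coordinate of the read (exclusive)
    forward: bool  # True if the read lies on the forward strand


def mates_consistent(a: Placement, b: Placement,
                     min_insert: int, max_insert: int) -> bool:
    """Return True if two mated reads satisfy the forward-reverse constraints."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Constraint 1: the reads face each other (left read forward, right read reverse).
    if not (left.forward and not right.forward):
        return False
    # Constraint 2: the outer distance must match the known insert size range.
    span = right.end - left.start
    return min_insert <= span <= max_insert
```

For a ~3 kb insert library one might call `mates_consistent(p1, p2, 2000, 4000)`; the tolerance around the nominal insert size is an assumption that depends on the library.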
Building Scaffolds
Assembly Gaps
-- Sequencing gap: the order and orientation of the contigs are known, and at least one clone spans the gap
-- Physical gap: no information is available about adjacent contigs, nor about the DNA spanning the gap
Finishing the Project
Unifying View of Assembly
Assembly Algorithms
Assembly Methods
• Overlap-layout-consensus
– greedy (Phrap, CAP3, TIGR...)
– graph-based (Euler)
Phrap/CAP3
Greedy
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
!!! IDEAL CASE !!!
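The greedy loop can be sketched in a few lines for the ideal case. This is a toy illustration only, not how Phrap or CAP3 are implemented: it assumes error-free reads on a single strand, scores an overlap by its exact suffix-prefix length, and ignores reads contained inside other reads. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        # Build the overlap "map" by scoring every ordered pair.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no more merges can be done
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

Calling `greedy_assemble(["ATGCGT", "GCGTAC", "GTACGG"])` merges the three reads into a single contig. The real-world problems listed on the next slide are exactly what this sketch does not handle.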
Real World Problems
• Sequencing errors
• Chimera
• Repeats
• Contaminants
• Polymorphism
• Orientation
Error Correction
Overlap b/w two sequences
All pairs alignment
• Try all pairs: must consider ~n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from the k-mer table (single pass through the k-mer table)
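The two passes described above can be sketched as follows. This is a simplified illustration under stated assumptions: reads are given as a name-to-sequence dictionary, only the forward strand is indexed, and any shared k-mer makes a pair a candidate. The function name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads: dict, k: int) -> set:
    """Return pairs of read names that share at least one k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer.
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(name)
    # Pass 2: every pair of reads sharing a k-mer is an overlap candidate.
    pairs = set()
    for names in table.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs
```

Only the candidate pairs then need a full alignment, which is what avoids the ~n^2 comparisons. A k-mer that occurs in very many reads (typically a repeat) still produces many pairs, so a practical version would cap or skip such k-mers.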
Repeats
Repeat sequence
The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly.
Computer 35 (7):47-54
Repeat sequences
■ Classified by repeat frequency
Interspersed repeats
  Short interspersed element (SINE), e.g. Alu, <300 bp
  Long interspersed element (LINE), ca. 5 kb
Tandem repeats
  Satellite DNA
  Minisatellites and variable number of tandem repeats (VNTRs)
  Microsatellites: mono-, di-, tri-, tetra-nucleotide
■ Classified by repeat orientation
  Direct repeats
  Inverted repeats
Repeat detection
Pre-assembly: find fragments that belong to repeats
• statistically (Reps)
• repeat database (RepeatMasker)
Statistical repeat detection
• Significant deviations from average coverage flagged as repeats.
  - frequent k-mers are ignored
  - "arrival" rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)
• Problem 1: assumption of uniform distribution of fragments leads to false positives
  - non-random libraries
  - poor clonability regions
• Problem 2: repeats with low copy number are missed
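The arrival-rate arithmetic can be made concrete with a short sketch. The expected spacing between read starts is the read length divided by the coverage (800 / 8 = 100 bp in the example above). The z-score test below is an assumption added for illustration: it treats the number of reads in a contig as approximately Poisson, which is exactly the uniformity assumption named in Problem 1. The function names and the cutoff of 3 are hypothetical.

```python
import math


def expected_read_spacing(read_len: int, coverage: float) -> float:
    """Average distance between consecutive read starts under uniform sampling."""
    return read_len / coverage


def looks_like_repeat(contig_len: int, n_reads: int, read_len: int,
                      coverage: float, z_cutoff: float = 3.0) -> bool:
    """Flag a contig whose read count is far above the expected count."""
    expected = contig_len / expected_read_spacing(read_len, coverage)
    # Poisson approximation: the standard deviation is sqrt(expected).
    z = (n_reads - expected) / math.sqrt(expected)
    return z > z_cutoff
```

A 10 kb contig is expected to hold about 100 reads at these settings; a contig of that length holding 200 reads would be flagged, while a two-copy repeat in a short contig may well fall under the cutoff, which is Problem 2.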
Scaffolding
Sequencing hierarchy
• Random sequencing: unrelated reads, ~700 base pairs
• Assembly: unrelated contigs, 5K-10K base pairs
• Scaffolding: unrelated scaffolds, 30K-50K base pairs
• Finishing/gap closure: completed genomes, millions to billions of base pairs
Definition
Scaffolder output
• order and orientation of contigs
• size of gaps between contigs
• linking evidence: mate-pairs spanning gaps
Clone-mates
Linking information
Hierarchical scaffolding
Ambiguous scaffold
Phred/Phrap/Consed Analysis
What is Phred/Phrap/Consed ?
Phred/Phrap/Consed is a package distributed worldwide for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
How to deal with the enormous amount of reads generated by high-throughput DNA sequencers?
Phred Genome Research 8: 175-194
Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
Phred
c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base.
d. Creates output files – base calls and quality values are written to output files.
File Directories
• chromat_dir/
• edit_dir/
• phd_dir/
Trace File High quality region – no ambiguities (Ns)
Trace File Medium quality region – some ambiguities (Ns)
Trace File Poor quality region – low confidence
Phred value formula
$q = -10 \log_{10}(p)$
where
q: quality value
p: estimated error probability for a base call
Examples:
q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
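The conversion in both directions is a one-liner. The sketch below simply restates the formula; the function names are chosen for illustration and are not part of Phred.

```python
import math


def error_prob_to_quality(p: float) -> float:
    """Phred quality value for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def quality_to_error_prob(q: float) -> float:
    """Estimated error probability for a Phred quality value q."""
    return 10 ** (-q / 10)
```

For instance, `quality_to_error_prob(30)` gives an error probability of about one in a thousand, halfway (on the log scale) between the two examples above.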
Base Calling
• phred -id . -p -pd ../phd_dir
• phred -view pf84c05.s1
The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
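Each line between BEGIN_DNA and END_DNA holds a base call, its quality value and its peak position in the trace. A minimal parser for that block might look like the sketch below; it assumes one call per line with whitespace-separated fields, ignores the comment header entirely, and the function name `parse_phd_dna` is hypothetical rather than part of the Phred package.

```python
def parse_phd_dna(text: str):
    """Return a list of (base, quality, peak_position) from the BEGIN_DNA block."""
    calls = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            break
        elif in_dna and line:
            # Fields: base call, quality value, peak position in the trace.
            base, qual, pos = line.split()[:3]
            calls.append((base, int(qual), int(pos)))
    return calls
```

From the returned list it is straightforward to, say, count the calls with quality of at least 20, which is a common way to summarize how much of a read is usable.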
phd2fasta
• phd2fasta program
  – converts .phd files to sequences in multi-FASTA format
  – writes a .qual file (quality scores) for each trace file
  – phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual
• Output:
  – fasta.seq contains FASTA sequences
  – fasta.seq.qual contains quality scores
Vector Sequence Cleaning (1)
• DNA sequence cleaning: quality trimming and vector removal with Lucy
• Lucy steps:
  – Read input seq#, sequence info, and quality info
  – Chop off splice sites
  – Remove vector insert
  – Produce output sequences for fragment assembly
Vector Sequence Cleaning (2)
• Restriction on file names: they can't contain any symbol, e.g. "–", ".", "_"
• Lucy major parameters to set up:
  -vector vector_completeSeq splice_site_file
  (splice_site_file: 2 splice-site sequences, before and after the insertion point on the vector)
• Lucy output:
  – identified locations of the good/clean region
  – trimmed sequence without vector, linker, Ns (<3% Ns)
splice_site_file
• ~150 bases, 50 bases overlap around the splice site
>PUCsplice.for.begin
gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacgacggccagtgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
>PUCsplice.for.end
acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtgaaattgttatccgctcacaattccacacaacatacgagccggaagcataaa
>PUCsplice.rev.begin
tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
>PUCsplice.rev.end
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc
Cross_match
• cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /net/share/sequence_pipeline/vector.fasta
Phrap: Phragment Assembly Program (or Phil's Revised Assembly Program)
• Phrap is a program for assembling shotgun DNA sequence data
• Key Features:
• a. Uses the entire read content: no need for trimming.
• b. Uses user-supplied (e.g. Repbase) and internally computed data: better accuracy of assembly in the presence of repeats.
• c. Contig sequence is a mosaic of the highest quality parts of the reads: it's not a consensus!
Phrap
• d. Provides extensive information about the assembly, contained in the phrap.out, *.ace and *.screen.contigs.qual files.
• e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
• f. Generates output files that contain important data and enable visualization by other programs.
Banded Search
K-mers
• >GL1234.b1
gattaagttgggtaacgccagggttttcccagtcac…
gattaagttgggta
attaagttgggtaa
ttaagttgggtaac
taagttgggtaacg
...
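The sliding window shown above takes only a line of code. This sketch uses k = 14 to match the slide's example; the function name is hypothetical.

```python
def kmers(seq: str, k: int = 14):
    """Return every length-k substring of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

A read of length n yields n - k + 1 k-mers, so `kmers("gattaagttgggtaacg")` returns the four words listed above.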
Phrap output files
• *.contigs: fasta file containing the contigs
  – Contigs with more than one read
  – Singletons (single reads with a match to some other contig but that couldn't be merged consistently with it)
• *.singlets: fasta file of the singlet reads
  – Reads with no match to any other read
• *.ace: allows for viewing the assembly using Consed
• *.view: required for viewing the assembly using Phrapview
Phrap parameters
• phrap -new_ace CLONE.fasta.screen >outfile

OPTIONS       DEFAULT        FUNCTION
-penalty      -2             ↑ => ↑ stringency
-gap_init     penalty - 2
-gap_ext      penalty - 1
-minmatch*    14             ↑ => ↓ time, ↓ matches
-bandwidth    14             ↓ => ↓ time, ↓ stringency
-minscore     30             ↑ => ↑ stringency
* highly sensitive! Bigger genomes need a bigger value
Phrap parameters
OPTIONS               DEFAULT           FUNCTION
-forcelevel           0 (range 0-10)    ↓ => ↑ stringency
-repeat_stringency    0.95 (0<x<1)      ↑ => ↑ stringency
-force_high*                            ↑ => ↑ stringency
-revise_greedy**                        ↓ misassembly
-shatter_greedy**                       ↓ contig length
* Ignore edited high-quality discrepancies
** Break assembly at weak joins
Phrap parameters
OPTIONS               DEFAULT    FUNCTION
-max_subclone_size    5000       forward-reverse check
-default_qual         15
-preassemble*
-group_delim*         _
* used together
Consed Genome Research 8: 195-202, 1998
Consed
A program for viewing and editing assemblies produced by Phrap
Key Features:
a. Assembly viewer: allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.
b. Trace file viewer: single and multiple trace files can be visualized, allowing for comparison of a given sequence in several reads.
Consed
Key Features (continued):
c. Navigation: identifies and lists regions which are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.
d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
Phred/Phrap/Consed Pipeline
Directories: Chromat_dir, Phd_dir, Edit_dir
• Input: chromatogram files
• Quality (confidence) value assignment: Phred
  – phd files: *.phd
• Conversion, phd to fasta: phd2fasta.pl
  – nucleotide sequences: seqs_fasta
  – quality values: seqs_fasta.screen.qual
• Vector screening and masking: Cross_Match (local alignment program) x vector.seq
  – screened/masked file: seqs_fasta.screen
• Assembly: Phrap
  – assembled contigs: seqs_fasta.screen.contigs
  – assembly file: seqs_fasta.screen.ace#
• Assembly viewing/editing: Consed
Comparison of CAP3 and Phrap on shotgun sequence data from the Wolbachia genome project
Computer 35 (7):47-54
CAP3 3XA 6189 57 443
PHRAP 3XA 6396 54 529
CAP3 3XB 12,368 44 71
PHRAP 3XB 13,116 47 228
CAP3 3XC 10,709 49 227
PHRAP 3XC 11,406 45 332
CAP3 3XD 11,408 43 115
PHRAP 3XD 11,350 49 240
CAP3 5XA 10,582 42 249
PHRAP 5XA 18,268 31 252
CAP3 5XB 26,034 17 100
PHRAP 5XB 33,693 18 115
CAP3 5XC 20,939 29 172
PHRAP 5XC 20,912 27 261
CAP3 5XD 14,219 35 46
PHRAP 5XD 14,696 33 129
CAP3 8XA 71,025 12 83
PHRAP 8XA 71,395 8 80
CAP3 8XB 53,127 8 59
PHRAP 8XB 53,078 7 36
CAP3 8XC 52,134 8 4
PHRAP 8XC 76,922 6 6
CAP3 8XD 72,690 7 35
PHRAP 8XD 102,523 6 60
CAP3 10XA 91,380 4 28
PHRAP 10XA 91,329 3 11
CAP3 10XB 167,655 1 5
PHRAP 10XB 138,551 2 7
CAP3 10XC 106,631 5 44
PHRAP 10XC 77,747 4 12
CAP3 10XD 79,900 4 2
PHRAP 10XD 79,978 3 2
Software
• CAP3 (for EST): http://genome.cs.mtu.edu/cap/cap3.html
• Phrap (for large genome): http://www.phrap.org
  -- Similar algorithms
  -- Insufficient documentation and support
  -- Always have to write scripts to parse outputs
  -- NO PERFECT PROGRAM!!!