Vijayachitra Modhukur BIIT
Next generation sequencing (NGS)
Bioinformatics course 11/13/12 1
Sequencing
Microarrays vs NGS
• Sequences do not need to be known in advance
• Highly quantitative
• Lower noise levels; does not suffer from cross-hybridization
• NGS provides increased sensitivity to detect rare sequences in complex genomic samples
• Accurate single-nucleotide resolution permits the discrimination between highly related sequences
• The lowered cost of NGS makes comprehensive mapping of multiple features possible
Paul J. Hurd et al.
Outline of NGS
Why sequencing?
• Genome architecture
• Disease diagnosis
• Variability studies
• Comparative genomics
• Gene regulation
• Drug design
• and many more...
Different generations (computers and sequencing)
First Generation – Sanger sequencing
• http://www.youtube.com/watch?v=aPN8LP4YxPo&feature=related
Application – Human Genome Project 1990-2003
Human Genome Project key findings
1. There are approximately 23,000 genes in human beings, the same range as in mice and roundworms. Understanding how these genes express themselves will provide clues to how diseases are caused.
2. The human genome has significantly more segmental duplications (nearly identical, repeated sections of DNA) than other mammalian genomes. These sections may underlie the creation of new primate-specific genes.
3. At the time when the draft sequence was published, fewer than 7% of protein families appeared to be vertebrate specific.
http://en.wikipedia.org/wiki/Human_Genome_Project/
Second generation sequencing
http://sciblogs.co.nz/code-for-life/2012/03/22/the-world-in-dna-sequencers/
ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796
Breakthrough NGS technology
NGS platforms
Leading Platforms
            | 454        | Solexa/Illumina   | SOLiD (ABI)
Bp per run  | 400 Mb     | 2-3 Gb            | 3-6 Gb
Read length | 250-400 bp | 35-50 (70-100) bp | 35-50 bp
Run time    | 10 hr      | 2.5 days          | 5 days
Download    | 20 min     | 27 hr (44 min)    | ~1 day
Analysis    | 2-5 hr     | 2 days            | 2-3 days
Files       | 20-50 Gb   | 1 T               | 1 T
With 3730s, ~60 Mb per year
Specifications as of summer 2008
Massive amount of sequenced data
Sequencing projects
Application
Human Genome
Human genome
http://www.mdpi.com/journal/genes/special_issues/nextgen-sequencing/
1,000 genome project
1,000 genome project
• Small inter-individual differences in regulatory regions found in all human populations
• Association of genetic variation with disease
• Discover novel genetic variants such as SNPs, CNVs, etc.
• Improvement of the human reference sequence
• Key result: each person carries 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders.
Analysis
data to analysis
cpu/memory intensive
NGS pipeline
Name | Description
BLAT | BLAST-Like Alignment Tool. Can handle one mismatch in initial alignment step.
Bowtie | Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
BWA | Uses a Burrows-Wheeler transform to create an index of the genome. It is a bit slower than Bowtie but allows indels in alignment.
ELAND | Implemented by Illumina. Includes ungapped alignment with a finite read length.
GMAP and GSNAP | Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.
MAQ | Ungapped alignment that takes into account quality scores for each base.
MOSAIK | Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.
RazerS | No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.
SHRiMP | Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.
SLIDER | Slider is an application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.
SOAP | Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12 letter hash table. SOAP2 is now much faster than the first version.
SOCS | For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.
SSAHA | Fast for a small number of variants.
Taipan | De novo assembler for Illumina reads.
based on http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
Quality scores
• Each base from a sequencer comes with a quality score
• Base-calling error probabilities
• Phred quality score
• Q = -10 log10 P, where P is the probability that the base call is wrong
• A higher quality score indicates a smaller probability of error
http://www.illumina.com/truseq/quality_101/quality_scores.ilmn
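The Phred relation between a quality score Q and the base-calling error probability P, Q = -10 log10 P, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the ASCII offset of 33 used to decode a quality string is an assumption (it matches the Sanger-style FASTQ encoding; some older Illumina files used an offset of 64).

```python
import math


def phred_from_error(p):
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Turn an ASCII-encoded quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger-style encoding).
    """
    return [ord(c) - offset for c in qual_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {error_from_phred(q):.4f}")
    print(decode_quality("I5!"))
```

Under these assumptions, Q20 corresponds to 1 error in 100 base calls and Q30 to 1 error in 1,000.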
File formats
FASTQ
Raw data
http://en.wikipedia.org/wiki/FASTQ_format
FASTQ to FASTA
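A minimal sketch of the FASTQ to FASTA conversion, assuming the common layout of exactly four lines per record (header starting with `@`, sequence, a `+` separator line, quality string). The function names are hypothetical, and FASTQ files with sequences wrapped over several lines are not handled.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record; blank lines are ignored.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in record {i // 4 + 1}")
        yield header[1:], seq, qual


def fastq_to_fasta(lines):
    """Drop the quality strings and return the records as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in read_fastq(lines))


if __name__ == "__main__":
    example = [
        "@read1\n", "GATTACA\n", "+\n", "IIIIIII\n",
        "@read2\n", "ACGT\n", "+\n", "5555\n",
    ]
    print(fastq_to_fasta(example), end="")
```

The conversion is lossy: FASTA keeps only the read name and the bases, so the per-base quality scores cannot be recovered from the output.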
SAM/BAM format
Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010
SAM/BAM Format
• Proliferation of alignment formats over the years: CIGAR, PSL, GFF, XML, etc.
• SAM (Sequence Alignment/Map) format
  • Single unified format for storing read alignments to a reference genome
• BAM (Binary Alignment/Map) format
  • Binary equivalent of SAM
  • Developed for fast processing/indexing
• Advantages
  • Can store alignments from most aligners
  • Supports multiple sequencing technologies
  • Supports indexing for quick retrieval/viewing
  • Compact size (e.g. 112 Gbp Illumina = 116 Gbytes disk space)
  • Reads can be grouped into logical groups, e.g. lanes, libraries, individuals/genotypes
  • Supports second-best base call/quality for hard-to-call bases
  • Possibility of storing raw sequencing data in BAM as a replacement for SRF and FASTQ
SAM format
Each bit in SAM format
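Assuming this slide refers to the bitwise FLAG column, the sketch below splits one SAM alignment line into its eleven mandatory tab-separated fields and lists which FLAG bits are set. The field names and bit values follow the SAM specification; the short labels in `FLAG_BITS`, the function names and the example line are hypothetical and chosen only for illustration.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into the mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]
    return record


def decode_flag(flag):
    """Return the labels of all bits that are set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    # Made-up example line, not taken from a real data set.
    line = "r001\t99\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*\tNM:i:0"
    rec = parse_sam_line(line)
    print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["CIGAR"])
    print(decode_flag(rec["FLAG"]))
```

A FLAG of 99 is 1 + 2 + 32 + 64, so the read is paired, in a proper pair, has its mate on the reverse strand, and is the first read of the pair.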
Sequence alignment
• Reference alignment
• De novo alignment
Spaced seed vs BWT
Burrows-Wheeler transform
• Original: WBWBWB#
• Transformed: WWW#BBB
• Compressed (run-length encoded): 3W#3B
Burrows-Wheeler transform
The BWT algorithm must use a character that marks the end of the data, such as the # symbol. Then the BWT algorithm works in three steps. First, it rotates text through all possible combinations, as shown in the Rotate column of Table 4-1. Second, it sorts each line alphabetically, as shown in the Sort column of Table 4-1. Third, it outputs the final column of the sorted list, which groups identical characters together in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.
Table 4-1 Rotating and Sorting Data
Rotate    Sort      Output
WBWBWB#   BWBWB#W   W
#WBWBWB   BWB#WBW   W
B#WBWBW   B#WBWBW   W
WB#WBWB   WBWBWB#   #
BWB#WBW   WBWB#WB   B
WBWB#WB   WB#WBWB   B
BWBWB#W   #WBWBWB   B
At this point, the BWT algorithm hasn’t compressed any data but merely rearranged the data to group identical characters together; the BWT algorithm has rearranged the data to make the run-length encoding algorithm more efficient. Run-length encoding can now convert the WWW#BBB string into 3W#3B, thus compressing the overall data.
After compressing data, you’ll eventually need to uncompress that same data. Uncompressing this data (3W#3B) creates the original BWT output of WWW#BBB, which contains all the characters of the original, uncompressed data but not in the right order. To retrieve the original order of the uncompressed data, the BWT algorithm repetitively goes through two steps, as shown in Figure 4-1.
The BWT algorithm works in reverse by adding the original BWT output (WWW#BBB) and then sorting the lines repetitively a number of times equal to the length of the string. So retrieving the original data from a 7-character string takes seven adding and sorting steps.
After the final add and sort step, the BWT algorithm looks for the only line that has the end of data character (#) as the last character, which identifies the original, uncompressed data. The BWT algorithm is both simple to understand and implement, which makes it easy to use for speeding up ordinary run-length encoding.
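The rotate, sort and output steps, the run-length encoding, and the add-and-sort inversion described above can be sketched as follows. This is a naive illustration and not how aligners such as Bowtie or BWA build their indexes. The function names are hypothetical, and one assumption is made explicit: to reproduce the WWW#BBB result, the end marker # is sorted after the letters, whereas plain ASCII ordering would sort it first and give a different (equally valid) output.

```python
from itertools import groupby

END = "#"  # end-of-data marker used in the text


def _sort_key(s):
    # Assumption: '#' sorts after every other character, which is the
    # ordering needed to reproduce the WWW#BBB example.
    return [(c == END, c) for c in s]


def bwt(text):
    """Rotate, sort, and output the last column."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return "".join(row[-1] for row in sorted(rotations, key=_sort_key))


def inverse_bwt(last_column):
    """Repeat 'add the BWT output as a new first column, then sort' n times."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted((c + row for c, row in zip(last_column, table)),
                       key=_sort_key)
    # The original text is the only row that ends with the marker.
    return next(row for row in table if row.endswith(END))


def run_length_encode(s):
    """Encode runs as <count><char>; single characters are left as-is."""
    out = []
    for ch, group in groupby(s):
        n = len(list(group))
        out.append(f"{n}{ch}" if n > 1 else ch)
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("WBWBWB#")
    print(transformed)
    print(run_length_encode(transformed))
    print(inverse_bwt(transformed))
```

The sketch builds every rotation in memory, so it is only suitable for short strings; it is meant to mirror the table, not to scale to a genome.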
Sequence assembly: solving a jigsaw puzzle
Sequence assembly: repeating patterns
Greedy Assemblers
• Greedily joins together the reads that are most similar to each other.
• Examples: Phrap, CAP3, TIGR Assembler
© 2009 SIB LF June 4, 2010
Greedy
• Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.
• An example is shown below, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
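The greedy strategy can be sketched in a few lines, under strong simplifying assumptions: error-free reads, exact suffix-prefix overlaps only, a single strand, and a hypothetical minimum overlap `min_len`. Real greedy assemblers such as Phrap or CAP3 use alignment-based overlaps and quality values; the function names and toy reads here are illustrative only.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Toy reads, made up for illustration.
    print(greedy_assemble(["TACCTGA", "ATGGCGT", "GCGTACC"]))
```

Because each merge is chosen from the best overlap available at that moment, two reads that end in copies of the same repeat can be joined even when they come from different parts of the genome, which is the weakness noted above.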
Overlap-layout-consensus
• Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap. The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure below). This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.
• An assembler following this paradigm starts with an overlap stage, during which all overlaps between the reads are computed and the overlap graph is built. In a layout stage, the graph is simplified by removing redundant information. Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome. In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.
Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines).
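The overlap and layout stages can be illustrated on a toy input. The sketch below builds an overlap graph (nodes are reads, edges are exact suffix-prefix overlaps) and finds a Hamiltonian path by brute force over all read orderings. It assumes distinct, error-free reads and a hypothetical minimum overlap `min_len`. The brute-force search is factorial in the number of reads and is shown only to make the graph formulation concrete; it is not what production OLC assemblers do.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Directed graph as {(a, b): overlap length}; reads must be distinct."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    graph[(a, b)] = n
    return graph


def hamiltonian_path(reads, graph):
    """Return an ordering that visits every read once along graph edges."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None


def spell_path(path, graph):
    """Merge the reads along the path, collapsing each overlap."""
    seq = path[0]
    for a, b in zip(path, path[1:]):
        seq += b[graph[(a, b)]:]
    return seq


if __name__ == "__main__":
    reads = ["TACCTGA", "ATGGCGT", "GCGTACC"]  # toy reads
    graph = overlap_graph(reads)
    path = hamiltonian_path(reads, graph)
    print(path)
    print(spell_path(path, graph))
```

With repeats, the graph gains extra edges (the false overlaps in the figure), and more than one Hamiltonian path may exist, only one of which corresponds to the true genome.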
Overlap layout consensus
Barbara Hutter, Assembly
● Based on all pairwise comparisons
● Construction of an overlap graph
  • nodes = reads (sequences)
  • edges = connections between overlapping reads
● Layout: look for paths in the overlap graph which are segments of the genome to assemble (contigs)
  • goal: find Hamiltonian path = a path that contains all nodes exactly once
● Consensus: following the Hamiltonian path, combine the overlapping sequences in the nodes into the sequence of the genome
  • in case of different nucleotides: majority vote considering base qualities
● Programs using the OLC:
  • Arachne, Celera Assembler (CABOG), newbler, Minimus, Edena, CAP, PCAP
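One simple reading of "majority vote considering base qualities" is a quality-weighted vote: for each alignment column, sum the Phred scores of the reads supporting each nucleotide and pick the nucleotide with the largest total. The sketch below assumes that interpretation; the listed assemblers may weight the evidence differently, and the function names and example columns are hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the highest summed quality in one column.

    column is a list of (base, phred_quality) pairs, one per read
    covering this position.
    """
    weight = defaultdict(int)
    for base, quality in column:
        weight[base] += quality
    return max(weight, key=weight.get)


def consensus(columns):
    """Consensus sequence for a list of alignment columns."""
    return "".join(consensus_base(col) for col in columns)


if __name__ == "__main__":
    columns = [
        [("A", 30), ("A", 20), ("G", 40)],  # A wins: 50 vs 40
        [("C", 10), ("T", 35), ("C", 12)],  # T wins: 35 vs 22
    ]
    print(consensus(columns))
```

The second column shows why qualities matter: a plain majority vote would choose C (two reads), while the weighted vote prefers the single high-quality T.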
Overlap-Layout-Consensus
http://gepard.bioinformatik.uni-saarland.de/teaching/ws-2011-12/special-topic-lecture-bioinformatics-next
De Bruijn graph: Velvet
Online resources
• NCBI-SRA
• NCBI-GEO
• The European Nucleotide Archive (ENA)
• ArrayExpress
Visualization tools
NATURE METHODS SUPPLEMENT | VOL.7 NO.3s | MARCH 2010 | S3
sequence similarity. A user can interactively explore the sequence relationships between different contigs and view the results of search operations such as ‘find repeats’. Consed’s assembly view can display the output of a sequence comparison utility called ‘cross_match’, using arcs to connect regions with sequence similarity between user-selected contigs. Different colors distinguish features such as directed repeats from inverted repeats. One advantage of viewing sequence similarity in ‘assembly view’ is that it can be integrated with a read coverage plot (Fig. 1a), which can reveal regions of unexpectedly high coverage often indicative of similar sequences that were erroneously collapsed by the assembler into one. The user can click to examine the sequence similarity at the base level, and click again to examine the underlying reads. There are also standalone tools with related functionality; for example, Miropeats15, widely used for early genome sequencing projects, is a UNIX C-shell script that generates static images using arc representations to indicate different types of repeats.
Next-generation sequence viewers. As sequencing throughput increases and costs decrease, individual genome sequencing has become feasible and has led to initiatives such as the 1,000 Genomes project (http://www.1000genomes.org/). These data provide an unprecedented opportunity to characterize the landscape of human genotypes, and a new generation of computational methods has emerged as a result16. In some cases, visual inspection can facilitate the evaluation and interpretation of read alignment techniques and variation detection outputs.
Assembly visualization tools possess most of the necessary functionality, but they were built with Sanger data in mind and initially strained under the substantially higher read volume of NGS technologies. Several of these tools are being retrofitted to tackle larger data sets, including Consed and the updated Gap5, but a new wave of tools is also being designed with this purpose in mind: for example, EagleView17, MapView18 and IGV (Table 1). Unlike finishing software, these tools are primarily data viewers and do not provide direct editing functionality. Because of their emphasis on browsing, many provide more flexible zooming capabilities and enable a user to freely zoom out to higher-level views. The commercially available CLC Genomics Workbench (CLC bio) is particularly user friendly and includes its own read alignment programs, which can be launched through a GUI.
In the resequencing context, mate pairs provide valuable information about structural variation, such as insertions, deletions and inversions. As discussed in the previous section, mate pairs can also indicate misassemblies, and users performing variation detection on draft assemblies should be aware of these issues. LookSeq19 and Gap5 use the vertical-axis position to indicate insertion size. This places inconsistent mate pairs at the extremes of the plot and visually separates large insert sizes, which are consistent with deletions, from small insert sizes, which suggest insertion events. When analyzing structural variations, it is important to consider gene annotations—for example, whether a single nucleotide variation leads to a synonymous or nonsynonymous amino acid change. For this reason, several of these visualization
Table 1 | Tools for visualizing sequencing data
Name | Cost | OS | Description | URL
Stand-alone tools
ABySS-Explorer25 | Free | Win, Mac, Linux | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed3* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene14 | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView17 | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap12,13 | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye6 | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView18 | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview8 | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/
Web-based tools
LookSeq19 | Free | | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer7 | Free | | Graphical interface to contig and trace data in NCBI’s Assembly Archive | http://tinyurl.com/assmbrowser/
Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. “Assembly finishing package” enables interactive sequence editing and/or integration with tools for automated assembly improvement.
*Our recommendation
Dr. Ece Gamsiz
Next lectures
• RNA sequencing: method, applications, advantages over microarrays
• ChIP sequencing
• Epigenomics, DNA methylation, histone modification
• ...