SHRiMP: The SHort Read Mapping Package
Michael Brudno
Department of Computer Science
University of Toronto
11/09/08
Handling NGS Data
• NGS: at least 3 distinct read types:
– Illumina/Solexa, 454: letter-space
– AB SOLiD: color-space (di-base sequencing)
– 2-pass SMS (Helicos): 2 reads of the same location, higher error rates
• Need new algorithms
– SOLiD: Biologists want letters, not colors
– 2-pass: How to best handle two reads?
SHRiMP Overview
Isolate similarity in stages:
1. Spaced Seed Filtering
2. Vectorized Smith-Waterman
3. Full Alignment
– Specialized for SOLiD, 2-pass, Letter-space
4. Compute p-values (and other statistics)
} Common
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
TGAGCGTTC
|||
TGAATAGGA
A C G T
A 0 1 2 3
C 1 0 3 2
G 2 3 0 1
T 3 2 1 0
AB SOLiD: Dibase Sequencing
AB SOLiD reads look like this:
T012233102
[Figure: color-code diagram linking the bases A, C, G and T, with edges labeled by the colors 0-3]
TGAGCGTTC
T012033102
TGAATAGGA
HMM!!! hmm???
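The color matrix on this slide behaves like XOR on 2-bit base codes (A=0, C=1, G=2, T=3). A minimal sketch of encoding and decoding under that assumption follows; the function names `encode` and `decode` are illustrative and not part of SHRiMP.

```python
# Assumption: codes A=0, C=1, G=2, T=3, so that color(x, y) = code[x] XOR code[y]
# reproduces the 4x4 color matrix shown on the slide.
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def encode(letters: str) -> str:
    """Letter-space to color-space; the first letter is kept as the primer base."""
    colors = [str(CODE[a] ^ CODE[b]) for a, b in zip(letters, letters[1:])]
    return letters[0] + "".join(colors)


def decode(read: str) -> str:
    """Color-space read (primer base + colors) to letters, primer base included."""
    letters = [read[0]]
    for color in read[1:]:
        letters.append(BASES[CODE[letters[-1]] ^ int(color)])
    return "".join(letters)


print(decode("T012233102"))  # TTGAGCGTTC: primer T followed by TGAGCGTTC
```

Because each letter depends on all earlier colors, a single wrong color changes every letter decoded after it.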
G: TTGAGTTATGGAT 012210331023
R: TTGACTTATGGAT 012120331023
SNPs
TGAGTT 12210
TGACTT 12120
TGAATT 12030
TGATTT 12300
AB SOLiD: Color space is complex!
INDELS
TGAGTTA 122103
TGA-TTA 12-303
TGAGTTTA 1221003
TGAGTATA 1221333
It’s bloody complicated!
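In the SNP examples above, a single-letter change always alters two adjacent colors, and the XOR of that color pair stays the same. A minimal sketch of a check built on that observation follows; the function name and return labels are hypothetical, and read ends and indels are ignored.

```python
def classify(ref_colors: str, read_colors: str) -> str:
    """Compare two equal-length color strings (no primer base).

    Returns "identical", "color_error" (one isolated changed color),
    "snp" (two adjacent changed colors whose XOR is preserved) or "other".
    """
    if len(ref_colors) != len(read_colors):
        raise ValueError("sketch handles equal-length color strings only")
    diff = [i for i, (x, y) in enumerate(zip(ref_colors, read_colors)) if x != y]
    if not diff:
        return "identical"
    if len(diff) == 1:
        return "color_error"
    if len(diff) == 2 and diff[1] == diff[0] + 1:
        i, j = diff
        ref_xor = int(ref_colors[i]) ^ int(ref_colors[j])
        read_xor = int(read_colors[i]) ^ int(read_colors[j])
        if ref_xor == read_xor:
            return "snp"
    return "other"


for variant in ("12120", "12030", "12300"):
    print(variant, classify("12210", variant))
```

An isolated color change is the signature of a sequencing error, which is why the two reads 012233102 and 012033102 are treated differently from a SNP.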
AB SOLiD: Translations
• Look at: 012233102
• Recall: 012033102
• 4 translations for every color sequence

[Figure: color-code diagram, as on the di-base sequencing slide]

Colors:          0 1 2 0 3 3 1 0 2
Translation A:  A A C T T A T G G A
Translation C:  C C A G G C G T T C
Translation G:  G G T C C G C A A G
Translation T:  T T G A A T A C C T

TGAGCGTTC
|||
TGAATAGGA

TGAGCGTTC
|||||||||
TGAGCGTTC
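The four translations can be generated by decoding the same colors from each possible starting base. A minimal sketch, assuming the XOR view of the color matrix (A=0, C=1, G=2, T=3); the function name is illustrative.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def translations(colors: str) -> dict:
    """Decode one color string from each of the four possible start bases."""
    result = {}
    for start in BASES:
        letters = [start]
        for color in colors:
            letters.append(BASES[CODE[letters[-1]] ^ int(color)])
        result[start] = "".join(letters)
    return result


for start, letters in translations("012033102").items():
    print(start, letters)
```

For 012033102 this gives the four rows of the table above. Translation C ends in GCGTTC, the tail of the intended sequence TGAGCGTTC, so the letters after the color error can be read from a different translation.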
AB SOLiD: Modified Smith-Waterman
• 4 S-W matrices, one per translation
• Errors transition into other matrix
• ‘Crossover’ penalty charged for errors
[Figure: Smith-Waterman matrices for Translation A and Translation C, each aligned against the genome]
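The crossover idea can be sketched in a simplified form. The code below is not SHRiMP's Smith-Waterman: it is a gapless dynamic program over the four translations, with hypothetical scores (match +4, mismatch -3, crossover -3) and a hypothetical function name. It only shows how one penalty lets the alignment switch translation at a color error.

```python
BASES = "ACGT"  # index = 2-bit code, so XOR on indices follows the color matrix


def correct_colors(read, ref, match=4, mismatch=-3, crossover=-3):
    """Gapless sketch: best score and corrected letters for a color read."""
    primer, colors = read[0], [int(c) for c in read[1:]]
    if len(colors) != len(ref):
        raise ValueError("gapless sketch: read and reference must have equal length")
    # prefix[i] = XOR of colors[0..i]; letter i under translation s is s XOR prefix[i]
    prefix, acc = [], 0
    for color in colors:
        acc ^= color
        prefix.append(acc)
    # score[s] = best score so far when currently reading translation s;
    # starting away from the primer's translation costs one crossover
    score = [0 if base == primer else crossover for base in BASES]
    back = []
    for i, p in enumerate(prefix):
        new_score, pointers = [], []
        for s in range(4):
            def entering(t, s=s):
                # staying in the same translation is free, switching is penalized
                return score[t] + (0 if t == s else crossover)
            prev = max(range(4), key=entering)
            hit = BASES[s ^ p] == ref[i]
            new_score.append(entering(prev) + (match if hit else mismatch))
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # trace back through the chosen translations to recover the letters
    s = max(range(4), key=lambda t: score[t])
    best = score[s]
    letters = []
    for i in range(len(prefix) - 1, -1, -1):
        letters.append(BASES[s ^ prefix[i]])
        s = back[i][s]
    return best, "".join(reversed(letters))


print(correct_colors("T012033102", "TGAGCGTTC"))  # (33, 'TGAGCGTTC')
```

With these scores the read with one color error aligns as nine matches and one crossover (9 x 4 - 3 = 33), and the letters read off the path are the intended sequence.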
AB SOLiD: Obligatory Comparison
• SHRiMP and AB Mapper (1.6)
– SHRiMP seed weight 8 (1111001111)
– AB 35_2, 35_3 schemas
• 10,000 35bp reads
– C. savignyi (173Mb), very high polymorphism
• Considering single top hits only

            SHRiMP   AB 35_2   AB 35_3
% mapped    19.83    6.67      10.94
Runtime     13m04    1h24      2h25
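The seed 1111001111 has weight 8: eight positions must match and the two middle positions are ignored. A minimal sketch of spaced-seed filtering with that pattern follows; the index layout and function names are assumptions for illustration, not SHRiMP's implementation.

```python
from collections import Counter, defaultdict

SEED = "1111001111"  # weight-8 pattern quoted on the slide


def spaced_keys(seq, seed=SEED):
    """Yield (position, key); the key keeps only the bases under '1' positions."""
    keep = [i for i, bit in enumerate(seed) if bit == "1"]
    for pos in range(len(seq) - len(seed) + 1):
        yield pos, "".join(seq[pos + i] for i in keep)


def build_index(genome, seed=SEED):
    """Map each spaced key to the genome positions where it occurs."""
    index = defaultdict(list)
    for pos, key in spaced_keys(genome, seed):
        index[key].append(pos)
    return index


def candidate_offsets(read, index, seed=SEED):
    """Count seed hits per diagonal (genome position minus read position)."""
    votes = Counter()
    for read_pos, key in spaced_keys(read, seed):
        for genome_pos in index.get(key, ()):
            votes[genome_pos - read_pos] += 1
    return votes


genome = "TTGACCAGTAGGCTAACGT"       # toy sequence, made up for this example
read = "ACCAATAGGC"                  # genome[3:13] with one mismatch at a '0' position
print(candidate_offsets(read, build_index(genome)))
```

A mismatch under a "0" position does not change the key, so the read still produces a hit at its true location.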
AB SOLiD: Resultant Alignments
• SHRiMP emits letter-space alignments
– Clear to biologists
– Color-space need not be scary!
G: 798  GAACCCCTTACAACTGAACCCCTTAC 823
        ||X||||||||||||||||||| |||
T:      GAaCCCCTTACAACTGAACCCC-TAC
R:   1 T1211000203110121201000-231 25
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
2-pass SMS Reads
• SMS reads have high error rates
– “Dark bases” (skipped letters)
– Multiple passes are possible
– Ameliorate errors over passes
• Good chance of missing base in one read
• Acceptable chance of getting it in at least one
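A rough calculation illustrates the benefit of a second pass. It assumes that bases are skipped independently in each pass, and it borrows the 7% deletion rate used for the synthetic reads in the results slide as an example value; neither assumption is stated for real SMS data.

```python
def p_base_seen(p_skip: float, passes: int = 2) -> float:
    """Probability that a base is read in at least one pass (independence assumed)."""
    return 1.0 - p_skip ** passes


print(round(p_base_seen(0.07, passes=1), 4))  # 0.93
print(round(p_base_seen(0.07, passes=2), 4))  # 0.9951
```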
Mapping 2-pass Reads
Reads
Original
C-GACTTTACTGACTTA
CTGA-T---
Reference Genome
?
CTG-ACT
CAGCA-T
DP matrix axes: C T G A C T (columns), C A G C A T (rows)
Match = +4, Mismatch = -3, Gap = -2
S=9
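The score S=9 can be reproduced with a standard global alignment (Needleman-Wunsch) of the two reads under the scores on the slide. A minimal sketch; the function name is illustrative.

```python
def global_alignment_matrix(a, b, match=4, mismatch=-3, gap=-2):
    """Fill the global alignment DP matrix for a (rows) against b (columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap            # a[:i] aligned against nothing
    for j in range(1, cols):
        dp[0][j] = j * gap            # b[:j] aligned against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp


dp = global_alignment_matrix("CAGCAT", "CTGACT")
print(dp[-1][-1])  # 9
```

The alignment CTG-ACT / CAGCA-T has four matches, one mismatch and two gaps: 4 x 4 - 3 - 2 x 2 = 9.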
SMS 2-pass: SHRiMP with 2 reads
CTGCACT
DP matrix axes: C T G A C T (columns), C A G C A T (rows)
Match = +4, Mismatch = -3, Gap = -2
CTGAC-T
CAG-CAT
SMS 2-pass: SHRiMP with 2 reads
CTG-ACT
CAGCA-T
S=9
CTGCACT
CTGACAT
DP matrix axes: C T G A C T (columns), C A G C A T (rows)
Match = +4, Mismatch = -3, Gap = -2
C-TG-ACT
CA-GCA-T

CT-GAC-T
C-AG-CAT
S=8
SMS 2-pass: SHRiMP with 2 reads
CTGAC-T
CAG-CAT

CTG-ACT
CAGCA-T
S=9
CTGCACT
CTGACAT
CATGCACT
CTAGACAT
C-TGAC-T
CA-G-CAT

CT-GAC-T
C-AG-CAT
CATGCACT
CTAGACAT
DP matrix axes: C T G A C T (columns), C A G C A T (rows)
Match = +4, Mismatch = -3, Gap = -2
SMS 2-pass: Near-optimal Alignments
• Compute a DP matrix
• Sum it up with the DP matrix computed in reverse

DP matrix (columns: C T G A C T; rows: C A G C A T):
   0  -2  -4  -6  -8 -10 -12
  -2   4   2   0  -2  -4  -6
  -4   2   1  -1   4   2   0
  -6   0  -1   5   3   1  -1
  -8  -2  -3   3   2   7   5
 -10  -4  -5   1   7   5   4
 -12  -6   0  -1   5   4   9
+
DP matrix computed in reverse:
   9   3   5   6   0  -6 -12
   3   5   6   8   2  -4 -10
   4   6   8   2   4  -2  -8
   2   1   3   4   6   0  -6
  -4  -2   0   6   1   2  -4
  -6  -4  -2   0   2   4  -2
 -12 -10  -8  -6  -4  -2   0
Match = +4, Mismatch = -3, Gap = -2
SMS 2-pass: Near-optimal Alignments
• Compute a DP matrix
• Sum it up with the DP matrix computed in reverse
• Leave only near optimal alignments

Sum of the two matrices:
   9   1   1   0  -8 -16 -24
   1   9   8   7   0  -8 -16
   0   8   9   1   7   0  -8
  -4   0   1   9   9   1  -7
 -12  -4  -3   9   3   9   1
 -16  -8  -7   1   9   9   2
 -24 -16  -8  -7   1   2   9
=
Near-optimal cells kept:
   9   .   .   .   .   .   .
   .   9   8   .   .   .   .
   .   8   9   .   .   .   .
   .   .   .   9   9   .   .
   .   .   .   9   .   9   .
   .   .   .   .   9   9   .
   .   .   .   .   .   .   9
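The forward-plus-reverse step can be sketched as follows. Each cell of the summed matrix holds the score of the best alignment passing through that cell, so thresholding it leaves the cells of near-optimal alignments. The function names and the `slack` parameter are assumptions for illustration.

```python
def forward_matrix(a, b, match=4, mismatch=-3, gap=-2):
    """Global alignment DP matrix: cell (i, j) scores a[:i] against b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp


def reverse_matrix(a, b, **scores):
    """Cell (i, j) scores the suffixes a[i:] against b[j:]."""
    rev = forward_matrix(a[::-1], b[::-1], **scores)
    n, m = len(a), len(b)
    return [[rev[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def near_optimal_cells(a, b, slack=1, **scores):
    """Return the optimal score and the cells on alignments within `slack` of it."""
    fwd, rev = forward_matrix(a, b, **scores), reverse_matrix(a, b, **scores)
    best = fwd[-1][-1]
    cells = {
        (i, j)
        for i in range(len(a) + 1)
        for j in range(len(b) + 1)
        if fwd[i][j] + rev[i][j] >= best - slack
    }
    return best, cells


best, cells = near_optimal_cells("CAGCAT", "CTGACT")
print(best, len(cells))
```

With `slack=0` only cells on optimal alignments remain; a larger slack admits alignments such as the S=8 examples shown earlier.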
Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003)
[Figure: directed graph of alignment columns; node labels: AT, -T, A-, CC, A-, -T, GG, CC, A-, -A, AA, -C, C-]
• Build a DAG representing the (near) optimal alignments of the two reads
• Generate seeds (short paths) from the DAG
• Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW.
• Do full alignment for top hits
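Seed generation from the DAG can be sketched as enumerating every k-letter path. The graph below is hypothetical: one letter per node, with two branches spelling CTGCACT and CTGACAT, the two merged strings shown on the earlier 2-pass slide. It uses contiguous k-mers for simplicity, and the function name is illustrative.

```python
def dag_kmers(letters, edges, k):
    """Collect every distinct k-letter string spelled by a path in the DAG.

    letters: dict node -> letter; edges: dict node -> list of successor nodes.
    """
    seeds = set()

    def extend(node, prefix):
        prefix += letters[node]
        if len(prefix) == k:
            seeds.add(prefix)
            return
        for successor in edges.get(node, ()):
            extend(successor, prefix)

    for node in letters:
        extend(node, "")
    return seeds


# Hypothetical graph: shared prefix C-T-G, two branches, shared final T.
letters = {0: "C", 1: "T", 2: "G", 3: "C", 4: "A", 5: "C", 6: "A", 7: "C", 8: "A", 9: "T"}
edges = {0: [1], 1: [2], 2: [3, 6], 3: [4], 4: [5], 5: [9], 6: [7], 7: [8], 8: [9]}
print(sorted(dag_kmers(letters, edges, k=4)))
```

Each resulting seed can then be looked up during the k-mer scan of the genome.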
SMS 2-pass: SHRiMP with 2-pass data
[Figure: directed graph of alignment columns; node labels: AT, -T, A-, CC, A-, -T, GG, CC, A-, -A, AA, -C, C-, TT]
Type          Separate   Profile   WSG
No hits %     0.13       4.91      4.31
Multiple %    26.45      9.34      9.13
Uniq cor %    63.00      74.90     75.84
Runtime       9m         11m       12m
SMS 2-pass: Results (in brief)
• 10,000 synthetic reads (~25-65 bp)
– 7% deletion, 1% insertion, 1% substitution rate
• Mapped to Human chromosome 1
– Spaced seed weight 8: 111101111
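Synthetic 2-pass reads with these error rates could be generated along the following lines. The per-base error model, the function names and the seeding are assumptions for illustration; the slides do not say how the synthetic reads were actually produced.

```python
import random


def simulate_pass(original, rng, p_del=0.07, p_ins=0.01, p_sub=0.01):
    """One noisy pass over the original sequence (hypothetical per-base model)."""
    out = []
    for base in original:
        if rng.random() < p_ins:
            out.append(rng.choice("ACGT"))       # spurious inserted base
        r = rng.random()
        if r < p_del:
            continue                             # "dark base": skipped letter
        if r < p_del + p_sub:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def two_pass_read(original, seed=0, **rates):
    """Two independent noisy passes over the same original sequence."""
    rng = random.Random(seed)
    return simulate_pass(original, rng, **rates), simulate_pass(original, rng, **rates)


first, second = two_pass_read("CTGACTTACGGATCCATGACTTAGC", seed=1)
print(first)
print(second)
```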
• Fast mapping of short reads to a genome
– Handles:
• color-space (SOLiD) reads
• 2-pass (SMS) reads
• insertions and deletions
– Easy to parallelize
• Computation of p-values & other statistics for hits
SHRiMP Summary
• Faster Mapping (biggest complaint)
• Matepair data support
• Transcriptome Data
• Suggestions?
SHRiMP TODO List
Acknowledgements
SHRiMP is brought to you by:
University of Toronto
– Steve Rumble
– Vlad Yanovsky
– Adrian Dalca
– Marc Fiume
Stanford University
– Phil Lacroute
– Arend Sidow
http://compbio.cs.toronto.edu/shrimp