Review: Methodologies for SV detection
Fritz Sedlazeck
Nov 16, 2018
My group / interests
Detection of variants
• Sniffles: Sedlazeck et al. (2018)
• SURVIVOR: Jeffares et al. (2017)
• BOD-Score: Sedlazeck et al. (2013)
Mapping / assembly of reads
• NextGenMap-LR: Sedlazeck et al. (2018)
• FalconUnzip: Chin et al. (2016)
• NextGenMap: Sedlazeck et al. (2013)
Benchmarking
• SVgenotyper: Chander et al. (in prep.)
• Teaser: Smolka et al. (2015)
• Sequencing: Jünemann et al. (2013)
Applications
Model organisms:
- Cancer (SKBR3) (Nattestad et al. 2018)
- miRNA editing (Vesely et al. 2012)
Non-model organisms:
- Cottus transposons (Dennenmoser et al. 2017)
- Clunio (Kaiser et al. 2016)
- Seabass (Vij et al. 2016)
- Pineapple (Ming et al. 2015)
Early 2000s dogma: SNPs account for most human genetic variation
https://hapmap.ncbi.nlm.nih.gov
Segmental duplications (a.k.a. low copy repeats)
~5% of the human genome is duplicated! (Bailey et al., 2002)
Self dot plot: 10 megabases of Chr 15 (dot = 1 kb exact match)
Variation in genome structure: so-called "structural variation" (SV)
[Figure: schematic of segments A-D in the reference and after a duplication, inversion, deletion, insertion, and translocation; events are labeled as CNV or SV]
SV is a superset of copy number variation (CNV). Not all structural changes affect
copy number (e.g., inversions)!
Our understanding of structural variation is driven by technology
1940s - 1980s: Cytogenetics / karyotyping
1990s: CGH / FISH / SKY / COBRA
2000s: Genomic microarrays (BAC-aCGH / oligo-aCGH)
Today: High-throughput DNA sequencing
Why are structural variations relevant / important?
• They are common and affect a large fraction of the genome
• They are a major driver of genome evolution
[Figure panels: Genomic disorders; Evolution]
• Genetic basis of traits
[Figure panels: Impact on regulation; Impact on phenotypes]
[Figure: number of affected regulatory elements (scale 0 to 2,000) by regulatory state and cell line. Regulatory states: CTCF binding site, enhancer, open chromatin region, promoter, promoter flanking region, and TF binding site, each split into ACTIVE, INACTIVE, POISED, and REPRESSED (plus NA where present). Cell lines and tissues run from A549 and Aorta through GM12878, HeLa_S3, HepG2, and HUVEC to Thymus.]
Outline
1. CNV analysis
2. SV analysis
   1. Assembly based
   2. Short reads
   3. Long reads
3. Review plan
Humans differ by roughly 3,000 deletions (>= 500 bp)
Humans differ by a few hundred duplications
Copy-number Profiles
Ginkgo: http://qb.cshl.edu/ginkgo
Interactive single-cell CNV analysis & clustering
• Easy-to-use web interface, parameterized for binning, segmentation, clustering, etc.
• Per-cell through project-wide analysis in any species
Compare MDA, DOP-PCR, and MALBAC
• DOP-PCR shows superior resolution and consistency
Available for collaboration
• Analyzing CNVs with respect to different clinical outcomes
• Extending clustering methods, prototyping scRNA
Interactive analysis and assessment of single-cell copy-number variations. Garvin T, Aboukhalil R, Kendall J, Baslan T, Atwal GS, Hicks J, Wigler M, Schatz MC (2015) Nature Methods doi:10.1038/nmeth.3578
Data are noisy
Potential for biases at every step
• WGA (whole-genome amplification): non-uniform amplification
• Library preparation: low complexity, read duplications, barcoding
• Sequencing: GC artifacts, short reads
• Computation: mappability, GC correction, segmentation, tree building
Coverage is too sparse and noisy for SNP analysis -> requires special processing
CNV analysis
1. Binning
• Divide the genome into "bins" with ~50-100 reads/bin
• Map the reads and count reads per bin
• Use uniquely mappable bases to establish bins
• Example read counts per bin (from the slide figure): 5 4 5 10 11 5 2 5
2. Normalization
• Also correct for mappability, GC content, amplification biases
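The binning and normalization steps can be sketched in a few lines of Python. This is a minimal illustration, not Ginkgo's implementation: the function names are hypothetical, the bin edges are assumed to be precomputed (for example from uniquely mappable bases), and normalization here only divides by the mean count, without the mappability, GC, or amplification corrections mentioned above.

```python
from bisect import bisect_right
from statistics import mean

def count_reads_per_bin(read_starts, bin_edges):
    """Count reads whose start position falls in each half-open bin
    [bin_edges[k], bin_edges[k + 1]). Bins may have variable width."""
    counts = [0] * (len(bin_edges) - 1)
    for pos in read_starts:
        k = bisect_right(bin_edges, pos) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts

def normalize(counts):
    """Divide each bin count by the mean count, so 1.0 means 'average coverage'."""
    m = mean(counts)
    return [c / m for c in counts]

# Toy reads and bin edges (hypothetical values).
toy_counts = count_reads_per_bin([1, 5, 12, 19, 20], [0, 10, 20, 30])

# The per-bin counts shown on the slide.
slide_counts = [5, 4, 5, 10, 11, 5, 2, 5]
normalized = normalize(slide_counts)
```

In the slide's example, the bins with 10 and 11 reads end up well above 1.0 after normalization, which is the kind of signal the segmentation step looks for.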
3. Segmentation
• Circular Binary Segmentation (CBS)
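To give a feel for the core step of CBS, the sketch below searches for the slice of bins whose mean differs most from the mean of all remaining bins. It is a strongly simplified illustration: it performs a single split, assumes equal variance across bins, and omits the permutation-based significance test and the recursion that a full CBS implementation uses. The function name `best_arc` is hypothetical.

```python
from math import sqrt

def best_arc(values):
    """Return (i, j, stat) for the slice values[i:j] whose mean differs most
    from the mean of the remaining bins (one CBS-style split, no significance test)."""
    n = len(values)
    total = sum(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    best = (0, n, 0.0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            n_in = j - i
            n_out = n - n_in
            if n_out == 0:
                continue  # the whole profile has no complement to compare with
            inside = prefix[j] - prefix[i]
            mean_in = inside / n_in
            mean_out = (total - inside) / n_out
            # Two-sample statistic with the (assumed common) variance left out.
            stat = abs(mean_in - mean_out) / sqrt(1 / n_in + 1 / n_out)
            if stat > best[2]:
                best = (i, j, stat)
    return best

i, j, stat = best_arc([5, 4, 5, 10, 11, 5, 2, 5])
```

On the slide's example counts, the selected slice covers the two bins with 10 and 11 reads.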
4. Estimating copy number
$$\mathrm{CN} = \operatorname{argmin} \Big\{ \sum_{i,j} \big( \hat{Y}_{i,j} - Y_{i,j} \big)^2 \Big\}$$
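One way to read the objective above is as a search over candidate copy-number multipliers: scale the normalized, segmented profile by each candidate, round to integers, and keep the candidate with the smallest sum of squared differences. The sketch below follows that reading. It assumes the hatted term is the integer-rounded version of the scaled profile, and the function name, candidate grid, and example ratios are all hypothetical.

```python
def estimate_copy_number_multiplier(segment_ratios, candidates=None):
    """Return (multiplier, error) minimizing the sum of squared differences
    between the scaled profile and its integer-rounded version."""
    if candidates is None:
        # Candidate multipliers 1.5, 1.55, ..., 6.0 (an assumed search grid).
        candidates = [k / 20 for k in range(30, 121)]
    best_cn, best_err = None, float("inf")
    for cn in candidates:
        scaled = [r * cn for r in segment_ratios]
        err = sum((round(y) - y) ** 2 for y in scaled)
        if err < best_err:  # strict '<' keeps the smallest multiplier on ties
            best_cn, best_err = cn, err
    return best_cn, best_err

# Mean-normalized segment ratios for a hypothetical diploid-like profile.
cn, err = estimate_copy_number_multiplier([0.5, 1.0, 1.5, 1.0])
```

Note that multiples of the true multiplier (here 4.0 or 6.0) fit equally well; the sketch resolves such ties in favor of the smallest candidate, which is a design choice rather than something stated on the slide.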
Using Nanopore MinION: CNV karyotyping
Nanopore sequencing for CNV detection
[Figure: SKBR3 cell line CNV analysis across chromosomes 1-22, X, Y]
SID97277: partial chromosomal deletions
• MinION data: ~60k reads
• MiSeq data
• 5q deletion indicates poor prognosis
• Chr 11 abnormalities indicate poor prognosis
• SID97277 karyotype
SID97279: trisomy 6, 15, 22 and deletions in chr 11
• MinION data: ~73k reads
• MiSeq data
• Trisomy 6 correlated with intermediate prognosis
• Abnormalities on 11 indicate poor prognosis
CNV detection summary
• Advantages
  - Less coverage is required -> applications such as single-cell sequencing
• Disadvantages
  - Resolution of events: usually in the multi-kbp range
  - Only deletions and duplications
  - Coverage biases in short reads
Assembly based
1. De novo assembly
2. Genomic alignment (WGA)
3. Detangle the genomic alignment to identify variants
Ingredients for a good assembly
Current challenges in de novo plant genome sequencing and assembly. Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology 13:243
Coverage
High coverage is required
– Oversample the genome to ensure every base is sequenced with long overlaps between reads
– Biased coverage will also fragment assembly
[Figure: Lander-Waterman expected contig length (bp, log scale from 100 to 1M) vs. read coverage (0 to 40) for read lengths of 30, 52, 100, 250, 710, and 1000 bp, with dog and panda assembly mean and N50 values marked]
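The relationship between coverage and expected contig length can be sketched with the Lander-Waterman formula. With read length L, coverage c, and sigma = 1 - T/L for a minimum detectable overlap T, the expected contig length is L * ((e^(c * sigma) - 1) / c + (1 - sigma)). The function name and the 20 bp minimum overlap used below are assumptions for illustration, and the model ignores repeats, sequencing errors, and coverage bias.

```python
from math import exp

def expected_contig_length(read_len, coverage, min_overlap):
    """Lander-Waterman expected contig length in bp (idealized, repeat-free genome)."""
    sigma = 1.0 - min_overlap / read_len
    return read_len * ((exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma))

# Expected contig length for 100 bp reads at increasing coverage,
# assuming a 20 bp minimum overlap (hypothetical value).
for c in (5, 10, 20):
    print(c, round(expected_contig_length(100, c, 20)))
```

The exponential term is why the curves in the figure climb so steeply with coverage, and why real assemblies (which hit repeats) fall far below the theoretical curve.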
Read Length
Reads & mates must be longer than the repeats
– Short reads will have false overlaps forming hairball assembly graphs
– With long enough reads, assemble entire chromosomes into contigs
Quality
Errors obscure overlaps
– Reads are assembled by finding k-mers shared in pairs of reads
– High error rate requires very short seeds, increasing complexity and forming assembly hairballs
Goal of WGA
• For two genomes, A and B, find a mapping from each position in A to its corresponding position in B
CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA
CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA
Not so fast...
• Genome A may have insertions, deletions, translocations, inversions, duplications or SNPs with respect to B (sometimes all of the above)
CCGGTAGGATATTAAACGGGGTGAGGAGCGTTGGCATAGCA
CCGCTAGGCTATTAAAACCCCGGAGGAG....GGCTGAGCA
WGA visualization
• How can we visualize whole genome alignments?
• With an alignment dot plot
  • N x M matrix
  • Let i = position in genome A
  • Let j = position in genome B
  • Fill cell (i,j) if Ai shows similarity to Bj
• A perfect alignment between A and B would completely fill the positive diagonal
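The dot plot construction described above can be sketched directly. In this minimal version, "similarity" is taken to mean an exact k-mer match (an assumption; real tools such as MUMmer use maximal matches and filtering), and the function names are hypothetical.

```python
def dot_plot(a, b, k=1):
    """Return an N x M matrix of booleans; cell (i, j) is True when the
    k-mer starting at a[i] equals the k-mer starting at b[j]."""
    n = len(a) - k + 1
    m = len(b) - k + 1
    return [[a[i:i + k] == b[j:j + k] for j in range(m)] for i in range(n)]

def render(matrix):
    """Draw the matrix as text: '*' for a filled cell, '.' for an empty one."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)

plot = dot_plot("ACCT", "ACCT")
print(render(plot))
```

Comparing a sequence with itself fills the main diagonal; the two extra off-diagonal dots come from the repeated C, a small-scale version of the repeat-induced noise seen in real genome comparisons. Increasing `k` suppresses such chance matches.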
[Figure: dot plot of two short sequences, followed by dot plots of genome A vs. genome B showing a translocation, an inversion, and an insertion]
• Different structural variation types / misassemblies will be apparent by their pattern of breakpoints
• Most breakpoints will be at or near repeats
• Things quickly get complicated in real genomes
http://mummer.sf.net/manual/AlignmentTypes.pdf
Assembly-based detection summary
• Advantages
  - Enables the detection of every event
  - Good quality for insertions
• Disadvantages
  - Genomic alignment is challenging
  - Heterozygous events are likely missed
How to detect structural variations
Sequence alignment "signals" for structural variation
1. Align DNA sequences from sample to human reference genome
2. Look for evidence of structural differences
[Figure: reference (Ref.) vs. experimental sample (Exp.) for four signals, ordered from low to high resolution]
(a) Depth of coverage
(b) Paired-end mapping
(c) Split-read mapping
(d) De novo assembly
Looking for "discordant" paired-end fragments
Paired-end sequencing
Ref
Sample
paired-ends map farther away than expected
2000 bp
Slide in collaboration with Ira Hall
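The idea of flagging discordant pairs by their mapped distance can be sketched as follows. This is a minimal illustration with hypothetical names and library parameters (expected span 500 bp, standard deviation 50 bp); it looks only at distance and ignores read orientation and pairs mapping to different chromosomes, which real callers also use.

```python
def classify_pair(span, expected, sd, n_sd=3.0):
    """Label a read pair by how its mapped span compares with the library's
    expected fragment size (threshold: n_sd standard deviations)."""
    if span > expected + n_sd * sd:
        return "too far apart"   # consistent with a deletion in the sample
    if span < expected - n_sd * sd:
        return "too close"       # consistent with an insertion in the sample
    return "concordant"

# Mapped spans in bp; the 2000 bp pair mirrors the example on the slide.
spans = [480, 510, 495, 2000, 505]
labels = [classify_pair(s, expected=500, sd=50) for s in spans]
```

A single discordant pair is weak evidence; callers typically require several pairs supporting the same breakpoint before reporting an SV.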
A probabilistic framework for SV discovery
Layer et al., 2014 (Ryan Layer)
LUMPY integrates paired-end mapping, split-read mapping, and depth of coverage for better SV discovery accuracy
Problem #1: Often many false positives
- Short reads + heuristic alignment + rep. genome = systematic alignment artifacts (false calls)
- Chimeras and duplicate molecules
- Ref. genome errors (e.g., gaps, mis-assemblies)
- ALL SV mapping studies use strict filters for above
Problem #2: The false negative rate is also typically high
- Most current datasets have low to moderate physical coverage due to small insert size (~10-20X)
- Breakpoints are enriched in repetitive genomic regions that pose problems for sensitive read alignment
- FILTERING!
- The false negative rate is usually hard to measure, but is thought to be extremely high for most paired-end mapping studies (>30%)
- When searching for spontaneous mutations in a family or a tumor/normal comparison, a false negative call in one sample can be a false positive somatic or de novo call in another.
How to filter / choose the SV caller?
• Each method applies its own heuristics.
Method     # Sim. SV   Avg. FDR   Avg. sensitivity
DELLY      33-198      0.13       0.75
LUMPY      33-198      0.06       0.62
Pindel     33-198      0.04       0.55
SURVIVOR   33-198      0.01       0.70
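For reference, the two metrics in the table can be computed from true positive (TP), false positive (FP), and false negative (FN) counts as sketched below. The counts in the example are hypothetical and are not taken from the benchmark behind the table.

```python
def fdr(tp, fp):
    """False discovery rate: fraction of reported calls that are wrong."""
    called = tp + fp
    return fp / called if called else 0.0

def sensitivity(tp, fn):
    """Sensitivity (recall): fraction of true SVs that were reported."""
    truth = tp + fn
    return tp / truth if truth else 0.0

# Hypothetical run: 100 simulated SVs, 70 recovered, 1 spurious call.
example_fdr = fdr(tp=70, fp=1)
example_sens = sensitivity(tp=70, fn=30)
```

The two metrics trade off against each other: stricter filtering lowers the FDR but usually costs sensitivity, which is the pattern visible across the callers in the table.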
PacBio / ONT sequencers
Advantage:
• Long reads
Disadvantages:
• Throughput / yield
• Costs
• High error rates
Long read technologies
• (+) SVs in repetitive regions
• (+) Span SVs
• (+) Uniform coverage
• (+) Can identify more complex SVs
• (-) Higher seq. error rate
• (-) Hard to align
Mapping challenges
[Figure: alignments from BWA-MEM vs. NGMLR]
NGMLR + Sniffles
• NGMLR
  - Convex gap cost model to better distinguish seq. error vs. signal
  - Novel method for split-read alignment
• Sniffles
  - Includes multiple statistical models to distinguish noise vs. signal
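The intuition behind a convex gap cost can be sketched by comparing it with the usual affine model. In the sketch below the cost of each additional gap base shrinks as the gap grows, so one long gap (a plausible SV) is penalized far less than many scattered short gaps (typical sequencing error). The logarithmic form and all parameter values are assumptions chosen for illustration; they are not NGMLR's actual scoring function.

```python
from math import log

def affine_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Affine model: every additional gap base costs the same."""
    return gap_open + gap_extend * length

def convex_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Illustrative convex model: each additional gap base costs less than
    the previous one (logarithmic growth, an assumed functional form)."""
    return gap_open + gap_extend * log(1 + length)

# One 500 bp gap vs. 500 separate 1 bp gaps under each model.
one_long_affine = affine_gap_cost(500)
one_long_convex = convex_gap_cost(500)
many_short_convex = 500 * convex_gap_cost(1)
```

Under the affine model a 500 bp deletion is so expensive that an aligner may prefer to scatter mismatches and small indels instead; under the convex model the single long gap becomes the cheaper explanation.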
1.3 Long read mapping
[Figure: read-mapping evaluation (y-axis 0 to 100) for BLASR, BWA, GraphMap, and NGMLR across indels, duplications, translocations, and inversions of sizes 100, 250, 500, 1k, 5k, 10k, and 50k. Categories: precise, indicated, wrong, alignment stopped prior, not aligned]
More complex types
2.4 Long read SV calling
[Figure: SV-calling evaluation (y-axis 0 to 100) for SURVIVOR, PBHoney, Sniffles + BWA, and Sniffles + NGM-LR across indels, duplications, translocations, and inversions of sizes 100, 250, 500, 1k, 5k, 10k, and 50k. Categories: precise, indicated, not found, additional events]
2.4 Long read SV calling
[Figure: SV-calling evaluation (y-axis 0 to 100) for SURVIVOR, PBHoney, Sniffles + BWA, and Sniffles + NGM-LR across six SV types (Dup, Indel, Inv, Tra, InvDel, InvDup) of sizes 100, 250, 500, 1k, 5k, 10k, and 50k. Categories: precise, indicated, not found, additional events]
INVDEL: inversion flanked by deletions
• Haemophilia A
INVDUP: inverted tandem duplication
• Pelizaeus-Merzbacher disease
• MECP2
• VIPR2
3.2 NA12878
• Healthy female
• Gold standard in genomics
• Sequenced with many technologies independently:
  - Illumina, PacBio, Oxford Nanopore
3.2 NA12878: Deletion calling
Tech.             Cov.   Avg. len   SVs      DEL      DUP   INV   INS      TRA
PacBio            55x    4,334      22,877   9,933    162   611   12,052   119
Oxford Nanopore   28x    6,432      32,409   27,147   87    323   4,809    43
Illumina          50x    2x101      7,275    3,744    731   553   0        2,247
3.2 Oxford Nanopore deletions
[Figure: panels for Illumina, PacBio, and Oxford Nanopore data]
3.2 NA12878: Deletion calling
Tech.                      Cov.   Avg. len   SVs      DEL      DUP   INV   INS      TRA
PacBio                     55x    4,334      22,877   9,933    162   611   12,052   119
Oxford Nanopore            28x    6,432      32,409   27,147   87    323   4,809    43
Oxford Nanopore @ Baylor   34x    4,982      12,596   7,102    169   113   5,166    46
Illumina                   50x    2x101      7,275    3,744    731   553   0        2,247
3.2 NA12878: check 2,247 vs. 119 TRA
[Figure: Illumina, PacBio, and ONT data at a called translocation; labeled examples include truncated reads and an insertion in a rep. region]
Overlap          Illumina TRA (%)
Translocations   7.74
Insertions       53.05
Deletions        12.06
Duplications     0.57
Nested           0.31
High coverage    1.87
Low complexity   9.79
Explained        85.40
NA12878: check 2,247 TRA
[Figure: ONT, PacBio, and Illumina data for examples labeled inversion, translocation, truncated reads, and insertion in a rep. region]
SKBR-3 using PacBio
(Davidson et al., 2000)
• Often used for pre-clinical research on Her2-targeting therapeutics such as Herceptin (Trastuzumab) and resistance to these therapies
• Most commonly used Her2-amplified breast cancer cell line
• 80 chromosomes instead of 46
[Figure labels: Her2, GSDMB, TATDN1, 8 Mb, RARA, PKIA]
Inversion was only found by Sniffles
[Figure: Her2 rearrangement involving Chr 17 and Chr 8]
1. Healthy chromosomes 17 & 8
2. Translocation into chromosome 8
3. Translocation within chromosome 8
4. Complex variant and inverted duplication within chromosome 8
5. Translocation within chromosome 8
Medical approach: Using Nanopore MinION
GBA mutations in Parkinson and Gaucher
Review on SV methodologies
• Which methods exist per methodology?
  - Assembly vs. short read mapping vs. long read mapping
• What are the advantages / disadvantages per methodology?
  - Accuracy
  - Costs
  - Limitations, remaining challenges, complex alleles, polyploidy, etc.
• Where is the field at?
  - Diploid assemblies
  - Phasing of SVs + SNPs
• We have an outline and a journal that is interested in working with us.
Thank you
• SV calling is the SNP calling of 2008
  - Reads are typically shorter than the allele.
  - A lot of noise in the data