Review: Methodologies for SVs detection
Fritz Sedlazeck

Nov 16, 2018

My group / interests

Detection of variants
- Sniffles: Sedlazeck et al. (2018)
- SURVIVOR: Jeffares et al. (2017)
- BOD-Score: Sedlazeck et al. (2013)

Mapping / assembly of reads
- NextGenMap-LR: Sedlazeck et al. (2018)
- Falcon Unzip: Chin et al. (2016)
- NextGenMap: Sedlazeck et al. (2013)

Benchmarking
- SVgenotyper: Chander et al. (in prep.)
- Teaser: Smolka et al. (2015)
- Sequencing: Jünemann et al. (2013)

Applications
Model organisms:
- Cancer (SKBR3) (Nattestad et al. 2018)
- miRNA editing (Vesely et al. 2012)

Non-model organisms:
- Cottus transposons (Dennenmoser et al. 2017)
- Clunio (Kaiser et al. 2016)
- Seabass (Vij et al. 2016)
- Pineapple (Ming et al. 2015)

[Figure 1: "moonlight"]

Early 2000s dogma: SNPs account for most human genetic variation

https://hapmap.ncbi.nlm.nih.gov

Segmental duplications (a.k.a. low copy repeats)

Bailey et al., 2002: ~5% of the human genome is duplicated!

Self Dotplot: 10 megabases of Chr 15(dot = 1 kb exact match)

Variation in genome structure: so-called "structural variation" (SV)

[Figure: block diagrams built from segments A, B, C and D showing the reference and a duplication, an inversion, a deletion, an insertion (new segment X) and a translocation (segments Q and R); each event is labelled CNV or SV.]

SV is a superset of copy number variation (CNV). Not all structural changes affect copy number (e.g., inversions)!

Our understanding of structural variation is driven by technology

1940s - 1980s: Cytogenetics / Karyotyping

1990s: CGH / FISH / SKY / COBRA

2000s: Genomic microarrays (BAC-aCGH / oligo-aCGH)

Today: High throughput DNA sequencing

Why are structural variations relevant/important?

• They are common and affect a large fraction of the genome

• They are a major driver of genome evolution

[Figure panels: Genomic disorders; Evolution]

Why are structural variations relevant/important?

• Genetic basis of traits

[Figure panels: Impact on regulation; Impact on phenotypes]

[Figure: heatmap of the number of affected regulatory elements (scale 0 to 2000) by cell line (A549, Aorta, GM12878 and many others) and regulatory state (CTCF_binding_site, enhancer, open_chromatin_region, promoter, promoter_flanking_region and TF_binding_site, each split into ACTIVE, INACTIVE, POISED and REPRESSED, with NA for some).]

Outline

1. CNV analysis

2. SVs analysis
   1. Assembly based
   2. Short reads
   3. Long reads

3. Review plan

Humans differ by roughly 3,000 deletions (>= 500 bp)

Humans differ by a few hundred duplications

Copy-number Profiles

Ginkgo http://qb.cshl.edu/ginkgo

Interactive Single Cell CNV analysis & clustering
• Easy-to-use web interface, parameterized for binning, segmentation, clustering, etc.
• Per cell through project-wide analysis in any species

Compare MDA, DOP-PCR, and MALBAC
• DOP-PCR shows superior resolution and consistency

Available for collaboration
• Analyzing CNVs with respect to different clinical outcomes
• Extending clustering methods, prototyping scRNA

Interactive analysis and assessment of single-cell copy-number variations. Garvin T, Aboukhalil R, Kendall J, Baslan T, Atwal GS, Hicks J, Wigler M, Schatz MC (2015) Nature Methods doi:10.1038/nmeth.3578

Data are noisy

Potential for biases at every step
• WGA (whole-genome amplification): Non-uniform amplification
• Library preparation: Low complexity, read duplications, barcoding
• Sequencing: GC artifacts, short reads
• Computation: mappability, GC correction, segmentation, tree building

Coverage is too sparse and noisy for SNP analysis -> Requires special processing

CNV analysis
§ Divide the genome into "bins" with ~50 - 100 reads/bin
§ Map the reads and count reads per bin

Use uniquely mappable bases to establish bins

1. Binning

[Figure: reads counted per bin; example counts: 5 4 5 10 11 5 2 5]
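The binning step can be sketched in a few lines of Python. This is a minimal illustration, not Ginkgo's implementation: `make_bins` and `count_reads_per_bin` are hypothetical helper names, the mappability track is assumed to be one boolean per base, and reads are represented only by their start positions.

```python
import bisect

def make_bins(mappable, bases_per_bin):
    """Return bin edges so that each bin holds `bases_per_bin` uniquely
    mappable bases (assumption: `mappable` is one bool per position)."""
    edges = [0]
    seen = 0
    for pos, ok in enumerate(mappable):
        if ok:
            seen += 1
            if seen == bases_per_bin:
                edges.append(pos + 1)
                seen = 0
    if edges[-1] != len(mappable):
        edges.append(len(mappable))  # trailing, possibly short, bin
    return edges

def count_reads_per_bin(read_starts, edges):
    """Count reads whose start falls in [edges[k], edges[k + 1])."""
    counts = [0] * (len(edges) - 1)
    for pos in read_starts:
        k = bisect.bisect_right(edges, pos) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts

# Toy genome: 4 mappable bases, 4 unmappable, 4 mappable.
mask = [True] * 4 + [False] * 4 + [True] * 4
edges = make_bins(mask, bases_per_bin=4)
counts = count_reads_per_bin([0, 1, 5, 9, 11, 12], edges)
```

Because bins are defined by mappable bases rather than by fixed width, a bin that spans an unmappable stretch becomes physically longer, so its expected read count stays comparable to the other bins.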



2. Normalization

Also correct for mappability, GC content, amplification biases
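A crude sketch of the normalization step, assuming per-bin read counts and per-bin GC fractions are already available. The function name and the stratified-median correction are illustrative assumptions; published pipelines typically fit a smooth curve (for example LOWESS) to the count-versus-GC relationship instead.

```python
from statistics import mean, median

def normalize_counts(counts, gc_fraction, n_strata=4):
    """Scale counts to a mean of 1, then divide each bin by the median
    ratio of all bins in the same GC stratum."""
    avg = mean(counts)
    ratios = [c / avg for c in counts]
    # Assign each bin to a GC stratum: [0, 1/n), [1/n, 2/n), ...
    strata = [min(int(g * n_strata), n_strata - 1) for g in gc_fraction]
    stratum_median = {
        s: median(r for r, t in zip(ratios, strata) if t == s)
        for s in set(strata)
    }
    return [
        r / stratum_median[s] if stratum_median[s] > 0 else r
        for r, s in zip(ratios, strata)
    ]

# Two GC-poor bins with low counts and two GC-rich bins with high counts:
# after correction the GC-driven difference disappears.
normalized = normalize_counts([10, 10, 20, 20], [0.3, 0.3, 0.6, 0.6])
```

In this toy input the whole difference between bins is explained by GC content, so every normalized value comes out at 1; a real copy-number change would survive because it is not shared by all bins of a stratum.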

3. Segmentation

Circular Binary Segmentation (CBS)

[Figure: candidate breakpoints i and j along the binned profile]
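The idea behind segmentation can be illustrated with plain recursive binary segmentation. This is a simplified stand-in, not CBS itself: CBS tests pairs of breakpoints (i, j) on a circularized profile and assesses significance by permutation, whereas the sketch below tries single split points and uses a fixed, hypothetical threshold.

```python
from statistics import mean

def best_split(values):
    """Return (score, k) for the split values[:k] | values[k:] with the
    largest size-weighted difference in means; k is None if no split helps."""
    n = len(values)
    best = (0.0, None)
    for k in range(1, n):
        left, right = values[:k], values[k:]
        score = abs(mean(left) - mean(right)) * (k * (n - k) / n) ** 0.5
        if score > best[0]:
            best = (score, k)
    return best

def segment(values, threshold, offset=0):
    """Recursively split until no split scores at least `threshold`.
    Returns half-open (start, end) index pairs."""
    score, k = best_split(values)
    if k is None or score < threshold:
        return [(offset, offset + len(values))]
    return (segment(values[:k], threshold, offset)
            + segment(values[k:], threshold, offset + k))

segments = segment([1, 1, 1, 1, 2, 2, 2, 2], threshold=0.5)
```

Each returned segment is then summarized by a single value (for example the median of its bins) before copy number is estimated.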

4. Estimating Copy Number

$$ \mathrm{CN} = \operatorname{argmin} \Big\{ \sum_{i,j} \big( \hat{Y}_{i,j} - Y_{i,j} \big)^2 \Big\} $$
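One way to read this minimisation is as a search over candidate copy-number multipliers: scale the normalized segment values by each candidate, round to integers, and keep the multiplier with the smallest sum of squared differences. The sketch below assumes that reading; the function name and the candidate grid are hypothetical, and the roles of the two Y terms are an interpretation of the slide rather than a confirmed definition.

```python
def estimate_copy_number(segment_ratios, multipliers):
    """Pick the multiplier whose scaled profile lies closest to integers.
    Returns (best multiplier, integer copy number per segment)."""
    best_m, best_err = None, float("inf")
    for m in multipliers:
        scaled = [r * m for r in segment_ratios]
        # Sum of squared distances between scaled values and their rounding.
        err = sum((y - round(y)) ** 2 for y in scaled)
        if err < best_err:
            best_m, best_err = m, err
    return best_m, [round(r * best_m) for r in segment_ratios]

multiplier, copy_numbers = estimate_copy_number(
    [0.5, 1.0, 1.5], multipliers=[1.5, 2.0, 2.5, 3.0]
)
```

Note that integer multiples of a good multiplier fit equally well, so the candidate range has to be bounded by prior knowledge of the expected ploidy.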

Using Nanopore MinION: CNV karyotyping.

Nanopore sequencing for CNV detection

[Figure: copy-number profile across chromosomes 1 to 22, X and Y]

SKBR3 cell line CNV Analysis

SID97277 - partial chromosomal deletions

[Figure: SID97277 karyotype from MinION data (~60k reads) and MiSeq data]

• 5q deletion indicates poor prognosis
• Chr 11 abnormalities indicate poor prognosis

SID97279 - trisomy 6, 15, 22 and deletions in chr 11

[Figure: MinION data (~73k reads) and MiSeq data]

• Trisomy 6 correlated with intermediate prognosis
• Abnormalities on 11 indicate poor prognosis

CNV detection summary

• Advantages
  • Less coverage is required
  • -> Applications such as single cell sequencing

• Disadvantages
  • Resolution of events: usually in the multi-kbp range
  • Only deletions and duplications
  • Coverage biases in short reads

Assembly based

1. De novo assembly
2. Genomic alignment (WGA)
3. Detangle the genomic alignment to identify variants.

Ingredients for a good assembly

Current challenges in de novo plant genome sequencing and assembly. Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology 13:243

Coverage

High coverage is required
– Oversample the genome to ensure every base is sequenced with long overlaps between reads
– Biased coverage will also fragment assembly

[Figure: Lander-Waterman expected contig length (bp, log scale from 100 to 1M) versus read coverage (0 to 40) for read lengths of 30, 52, 100, 250, 710 and 1000 bp, with the mean and N50 of the dog and panda assemblies marked.]
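The curve family in the Lander-Waterman plot can be reproduced from the closed-form expectation in the Lander-Waterman model, which assumes reads of equal length placed uniformly at random on a repeat-free genome. In the sketch below, `min_overlap` (the overlap needed to join two reads) and the function name are assumptions chosen for illustration.

```python
import math

def expected_contig_length(read_len, coverage, min_overlap=0):
    """Lander-Waterman expected contig length in bp:
    L * ((exp(c * sigma) - 1) / c + (1 - sigma)), with sigma = 1 - T / L."""
    sigma = 1.0 - min_overlap / read_len
    return read_len * (
        (math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma)
    )

# Expected contig length for several read lengths at 10x and 20x coverage.
table = {
    read_len: [expected_contig_length(read_len, c, min_overlap=20)
               for c in (10, 20)]
    for read_len in (30, 100, 1000)
}
```

The model predicts exponential growth of contig length with coverage; real assemblies such as the dog and panda genomes fall far below the curve because repeats and coverage bias break the model's assumptions.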

Read Length

Reads & mates must be longer than the repeats
– Short reads will have false overlaps forming hairball assembly graphs
– With long enough reads, assemble entire chromosomes into contigs

Quality

Errors obscure overlaps
– Reads are assembled by finding kmers shared in pair of reads
– High error rate requires very short seeds, increasing complexity and forming assembly hairballs
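A short worked calculation shows why error rate limits seed length. Assuming errors are independent and uniformly distributed (a simplification), the chance that a k-mer is free of errors is (1 - e)^k, and a seed must be error-free in both reads to be shared. The error rates used here are illustrative values, not measurements from the slides.

```python
def error_free_kmer_probability(k, error_rate):
    """Probability that a k-mer contains no sequencing error,
    assuming independent per-base errors."""
    return (1.0 - error_rate) ** k

def shared_seed_probability(k, error_rate):
    """Probability that the same k-mer is error-free in two reads."""
    return error_free_kmer_probability(k, error_rate) ** 2

for rate in (0.01, 0.15):
    for k in (10, 20, 30):
        p = shared_seed_probability(k, rate)
        print(f"error rate {rate:.2f}, k={k}: shared seed probability {p:.4f}")
```

At a 1% error rate most 20-mers survive, while at 15% only a few percent do, which pushes assemblers towards shorter seeds that match many more places by chance.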

Goal of WGA

• For two genomes, A and B, find a mapping from each position in A to its corresponding position in B

CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA

CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA

Not so fast...

• Genome A may have insertions, deletions, translocations, inversions, duplications or SNPs with respect to B (sometimes all of the above)

CCGGTAGGATATTAAACGGGGTGAGGAGCGTTGGCATAGCA

CCGCTAGGCTATTAAAACCCCGGAGGAG....GGCTGAGCA

WGA visualization

• How can we visualize whole genome alignments?

• With an alignment dot plot
  • N x M matrix
    • Let i = position in genome A
    • Let j = position in genome B
    • Fill cell (i,j) if Ai shows similarity to Bj

• A perfect alignment between A and B would completely fill the positive diagonal
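The N x M matrix described above can be built directly for short sequences. In this minimal sketch "similarity" is taken to mean an exact match of k consecutive bases, which is an assumption; whole-genome tools find matches with indexes rather than by filling every cell.

```python
def dot_plot(a, b, k=1):
    """Return an N x M boolean matrix; cell (i, j) is True when the
    k-mer starting at a[i] equals the k-mer starting at b[j]."""
    n, m = len(a) - k + 1, len(b) - k + 1
    return [[a[i:i + k] == b[j:j + k] for j in range(m)] for i in range(n)]

def render(matrix):
    """Draw the matrix as text: '*' for a filled cell, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )

print(render(dot_plot("ACCT", "ACCT")))
```

Identical sequences fill the main diagonal; the off-diagonal dots come from the repeated C, a small-scale version of the repeat noise seen in real genomes. Raising k removes most of these chance matches.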

[Figure: a small example dot plot, followed by dot plots of genome A against genome B showing a translocation, an inversion and an insertion.]

• Different structural variation types / misassemblies will be apparent by their pattern of breakpoints

• Most breakpoints will be at or near repeats

• Things quickly get complicated in real genomes

http://mummer.sf.net/manual/AlignmentTypes.pdf

Assembly based detection summary

• Advantages
  • Enables the detection of every event
  • Good quality for insertions

• Disadvantages
  • Genomic alignment is challenging.
  • Heterozygous events are likely missed.

How to detect Structural Variations

Sequence alignment "signals" for structural variation

1. Align DNA sequences from sample to human reference genome

2. Look for evidence of structural differences

[Figure: reference (Ref.) versus experimental sample (Exp.) for four signals]

(a) Depth of coverage

(b) Paired-end mapping

(c) Split-read mapping

(d) De novo assembly

Resolution: low -> high

Looking for "discordant" paired-end fragments

Paired-end sequencing

[Figure: reference (Ref) versus sample; the paired ends map farther away than expected (2000 bp).]

Slide in collaboration with Ira Hall
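Flagging discordant pairs by insert size can be sketched as an outlier test against the library's typical fragment length. The median and MAD cutoff below is an illustrative choice with a hypothetical threshold; real callers also use read orientation and pairs that map to different chromosomes.

```python
from statistics import median

def discordant_pairs(insert_sizes, n_mads=5.0):
    """Return indices of read pairs whose mapped insert size deviates from
    the library median by more than `n_mads` median absolute deviations."""
    med = median(insert_sizes)
    mad = median(abs(x - med) for x in insert_sizes) or 1.0
    return [
        i for i, x in enumerate(insert_sizes)
        if abs(x - med) > n_mads * mad
    ]

# Most pairs map ~500 bp apart; one pair maps 2000 bp apart, which is the
# pattern expected when the sample carries a deletion between the mates.
flagged = discordant_pairs([480, 500, 510, 495, 505, 2000, 490])
```

A single discordant pair is weak evidence; callers usually require several pairs that agree on the same breakpoint region.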

A probabilistic framework for SV discovery

Layer et al., 2014

Ryan Layer

LUMPY integrates paired-end mapping, split-read mapping, and depth of coverage for better SV discovery accuracy

Problem #1: Often many false positives

- Short reads + heuristic alignment + repetitive genome = systematic alignment artifacts (false calls)

- Chimeras and duplicate molecules

- Ref. genome errors (e.g., gaps, mis-assemblies)

- ALL SV mapping studies use strict filters for above

Problem #2: The false negative rate is also typically high

- Most current datasets have low to moderate physical coverage due to small insert size (~10-20X)

- Breakpoints are enriched in repetitive genomic regions that pose problems for sensitive read alignment

- FILTERING!

- The false negative rate is usually hard to measure, but is thought to be extremely high for most paired-end mapping studies (>30%)

- When searching for spontaneous mutations in a family or a tumor/normal comparison, a false negative call in one sample can be a false positive somatic or de novo call in another.

How to filter / choose the SV caller?
• Each method applies its own heuristics.

Method     # Sim. SV   avg FDR   avg Sensitivity
DELLY      33-198      0.13      0.75
LUMPY      33-198      0.06      0.62
Pindel     33-198      0.04      0.55
SURVIVOR   33-198      0.01      0.70
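The low FDR in the SURVIVOR row comes from combining callers rather than trusting any single heuristic. The sketch below shows the general idea of a consensus call set under stated assumptions: calls are `(chrom, start, end, svtype)` tuples, two calls match when type and chromosome agree and both breakpoints lie within `max_dist`, and a call is kept when at least `min_support` callers report it. It is a greedy illustration with hypothetical names and thresholds, not SURVIVOR's actual merging algorithm.

```python
def consensus_calls(calls_by_caller, max_dist=1000, min_support=2):
    """Merge SV calls from several callers and keep those reported by at
    least `min_support` callers."""
    merged = []  # each entry: {"call": tuple, "callers": set of names}
    for caller, calls in calls_by_caller.items():
        for chrom, start, end, svtype in calls:
            for entry in merged:
                c, s, e, t = entry["call"]
                if ((c, t) == (chrom, svtype)
                        and abs(s - start) <= max_dist
                        and abs(e - end) <= max_dist):
                    entry["callers"].add(caller)
                    break
            else:
                merged.append({"call": (chrom, start, end, svtype),
                               "callers": {caller}})
    return [m["call"] for m in merged if len(m["callers"]) >= min_support]

calls = {
    "caller_a": [("chr1", 1000, 2000, "DEL"), ("chr2", 500, 900, "DUP")],
    "caller_b": [("chr1", 1100, 2050, "DEL")],
    "caller_c": [("chr1", 1000, 2000, "INV")],
}
kept = consensus_calls(calls)
```

Requiring agreement removes caller-specific artifacts, at the cost of losing true events that only one method can detect, which is consistent with the trade-off between FDR and sensitivity in the table.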

PacBio / ONT sequencer

Advantage:
• Long reads

Disadvantage:
• Throughput / yield
• Costs
• High error rates

Long Read Technologies

• (+) SVs in repetitive regions
• (+) Span SVs
• (+) Uniform coverage
• (+) Can identify more complex SVs

• (-) Higher seq. error rate
• (-) Hard to align

Mapping challenges

[Figure: the same reads aligned with BWA-MEM and with NGMLR]

NGMLR + Sniffles

• NGMLR
  • Convex gap cost model to better distinguish seq. error vs. signal
  • Novel method for split read alignment.

• Sniffles
  • Includes multiple statistical models to distinguish noise vs. signal
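The effect of a convex gap cost can be illustrated by comparing it with the usual affine cost. In the sketch below each additional gap base costs less than the previous one; the logarithmic form and all parameter values are assumptions chosen for illustration and are not NGMLR's actual scoring function.

```python
import math

def affine_gap_cost(length, gap_open=5.0, gap_extend=2.0):
    """Affine cost: every extra gap base costs the same amount."""
    return gap_open + gap_extend * length

def convex_gap_cost(length, gap_open=5.0, gap_extend=2.0):
    """Illustrative convex cost: the marginal cost of extending a gap
    shrinks as the gap grows."""
    return gap_open + gap_extend * math.log(1 + length)

# One 100 bp gap (as left by a true deletion) versus one hundred scattered
# 1 bp gaps (as left by sequencing errors).
for name, cost in (("affine", affine_gap_cost), ("convex", convex_gap_cost)):
    one_long = cost(100)
    many_short = 100 * cost(1)
    print(f"{name}: one 100 bp gap = {one_long:.1f}, "
          f"100 x 1 bp gaps = {many_short:.1f}")
```

Under the convex model a single long gap becomes much cheaper relative to many short ones, so the aligner can keep sequencing-error indels small and scattered while still opening one long gap across a true SV.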

1.3 Long read mapping

[Figure: bar charts (y-axis 0 to 100) per SV size class (100, 250, 500, 1k, 5k, 10k, 50k) for indels, duplications, translocations and inversions, one row each for BLASR, BWA, GraphMap and NGMLR. Legend: precise, indicated, wrong, alignment stopped prior, not aligned.]

More complex types

2.4 Long read SV calling

[Figure: bar charts (y-axis 0 to 100) per SV size class (100, 250, 500, 1k, 5k, 10k, 50k) for indels, duplications, translocations and inversions, one row each for SURVIVOR, PBHoney, Sniffles + BWA and Sniffles + NGM-LR. Legend: precise, indicated, not found, additional events.]

2.4 Long read SV calling

[Figure: bar charts (y-axis 0 to 100) per SV size class (100, 250, 500, 1k, 5k, 10k, 50k) for Dup, Indel, Inv, Tra, InvDel and InvDup, one row each for SURVIVOR, PBHoney, Sniffles + BWA and Sniffles + NGM-LR. Legend: precise, indicated, not found, additional events.]

INVDEL: Inversion flanked by deletions:
• Haemophilia A

INVDUP: Inverted tandem duplication:
• Pelizaeus-Merzbacher disease
• MECP2
• VIPR2

3.2 NA12878

• Healthy female

• Gold standard in genomics

• Sequenced with many technologies independently:
  • Illumina, PacBio, Oxford Nanopore

3.2 NA12878: Deletion calling

Tech.             Cov.   Avg len   SVs      DEL      DUP   INV   INS      TRA
PacBio            55x    4,334     22,877   9,933    162   611   12,052   119
Oxford Nanopore   28x    6,432     32,409   27,147   87    323   4,809    43
Illumina          50x    2x101     7,275    3,744    731   553   0        2,247




3.2 Oxford Nanopore deletions

[Figure: deletion calls compared across Illumina, PacBio and Oxford Nanopore]



Tech.                      Cov.   Avg len   SVs      DEL      DUP   INV   INS      TRA
PacBio                     55x    4,334     22,877   9,933    162   611   12,052   119
Oxford Nanopore            28x    6,432     32,409   27,147   87    323   4,809    43
Oxford Nanopore @ Baylor   34x    4,982     12,596   7,102    169   113   5,166    46
Illumina                   50x    2x101     7,275    3,744    731   553   0        2,247




3.2 NA12878: check 2,247 vs 119 TRA

[Figure: read alignments in Illumina, PacBio and ONT data. Labels: translocation; truncated reads; insertion in repetitive region.]

Overlap            Illumina TRA (%)
Translocations     7.74
Insertions         53.05
Deletions          12.06
Duplications       0.57
Nested             0.31
High coverage      1.87
Low complexity     9.79
Explained          85.40

NA12878: check 2,247 TRA

[Figure: read alignments in ONT, PacBio and Illumina data. Labels: inversion; translocation; truncated reads.]


SKBR-3 using PacBio

(Davidson et al., 2000)

Often used for pre-clinical research on Her2-targeting therapeutics such as Herceptin (Trastuzumab) and resistance to these therapies.

Most commonly used Her2-amplified breast cancer cell line

80 chromosomes instead of 46

[Figure: genes Her2, GSDMB, TATDN1, RARA and PKIA; 8 Mb]

Inversion was only found by Sniffles

[Figure: Her2 on Chr 17 and Chr 8]

1. Healthy chromosome 17 & 8
2. Translocation into chromosome 8
3. Translocation within chromosome 8
4. Complex variant and inverted duplication within chromosome 8
5. Translocation within chromosome 8

Medical approach: Using Nanopore MinION

GBA mutations in Parkinson and Gaucher

Review on SV methodologies

• Which methods exist per methodology?
  • Assembly vs. short read mapping vs. long read mapping

• What are the advantages/disadvantages per methodology?
  • Accuracy
  • Costs
  • Limitations, remaining challenges, complex alleles, polyploidy, etc.

• Where is the field at?
  • Diploid assemblies
  • Phasing of SVs + SNPs

• We have an outline and a journal that is interested in working with us.

Thank you

• SV calling is SNP calling of 2008
  • Reads are typically shorter than the allele.
  • Lot of noise in the data
