Date post: 05-Jan-2016
Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
Page 1

Next-generation sequencing: from basics to future diagnostics
PART II: NGS analysis to find variant

Sangwoo Kim, Ph.D.
Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine

Page 2

Overview

• PART I: NGS technologies and standard workflow
  – Next generation sequencing
    • History and technology
  – Data and its meaning; process workflow
  – Discussion
• PART II: NGS analysis to find variants
  – NGS analysis to find variants
    • Single nucleotide variants (SNVs)
    • Copy number variations (CNVs)
    • Structural variations (SVs)
• PART III: NGS application to diagnostics
  – NGS in genomic medicine
  – Potential application to forensic science

Page 3

FROM PREVIOUS SESSION

Conventional variant calling
Variant calling in minor subgroups

Page 4

Next-generation sequencing

Massively Parallel Sequencing (a.k.a. Next-generation sequencing) via spatially separated, clonally amplified DNA templates or single DNA molecules

Metzker et al, Nat Rev Genet, 2010

Platforms shown:
• Illumina HiSeq2500
• 5500 SOLiD system
• Ion Torrent PGM

Page 5

The Human Genome Project (1990~2003)

• Began in 1990
• Consortium of the U.S., U.K., France, Australia, Japan, etc.
• "Rough draft" in 2000
• "Complete genome" published in 2003
• 13 years, $3 billion

Page 6

FASTQ format (NGS raw data)

A format for NGS reads (FASTA sequence + quality)

[Figure: an example FASTQ file with labels marking one read, its sequence line, and its quality line]
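To make the "sequence + quality" idea concrete, here is a minimal sketch of reading FASTQ records. It assumes the common four-lines-per-read layout and Phred+33 quality encoding; the function name `parse_fastq` and the example read are made up for illustration and are not from the slides.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from FASTQ text lines.

    Assumes 4 lines per read and Phred+33 quality encoding.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # One quality character per base; subtract the assumed offset of 33.
        phred = [ord(c) - 33 for c in qual]
        yield header[1:], seq, phred


# Hypothetical one-read example.
example = """@read1
GATTCAAA
+
IIIIHHH#
""".splitlines()

records = list(parse_fastq(example))
for read_id, seq, phred in records:
    print(read_id, seq, phred)
```

Each base gets its own quality score, which is what later steps use to weigh how much an observed base can be trusted.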

Page 7

[Figure: NGS analysis workflow (Kim S and Paik S, in preparation). Panels: A. Data Generation; B. Variant Finding; C. Variant Analysis; D. Validation and functional assessment. Labels in the figure: disease and control samples, sequencing, raw reads (FASTQ files), quality control, short read alignment (BAM files), germ-line mutation, somatic mutation, copy number variation (CNV), structural variation (SV), xenogeneic sequence, recurrence analysis, variant impact prediction, mutation filtration/selection, tumor heterogeneity inference, variant confirmation, pathway analysis, functional study.]

Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows an inspection of all genomic areas and is applicable to CNV and SV analysis. Whole exome sequencing (WES) interrogates only coding regions (1~2% of the genome) at lower cost and throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only limited regions are to be tested (as in a diagnostic kit), a set of targeted genes is amplified and sequenced (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.

Page 8

DATA PREPROCESSING
Short Read Alignment

Page 9

Mapping back to genome

TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA

Where is this sequence in the human genome?

Do this as fast as possible!

Page 10

Brute-force way

Find "GATTCAAA" in the human genome

The reference genome (chr1, start). This is very long (3 billion):
T G A C G T G T G A T T C A A A A A A G C

Your query, compared at every position of the reference in turn:
G A T T C A A A
  G A T T C A A A
    G A T T C A A A
      G A T T C A A A
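The brute-force idea above can be sketched in a few lines: slide the query along the reference and compare at every offset. This is only an illustration on the 21-base toy reference from the slide; the function name `naive_find` is an assumption of this sketch.

```python
def naive_find(reference, query):
    """Return every 0-based start position where query matches reference exactly."""
    hits = []
    # Try every possible alignment offset, left to right.
    for start in range(len(reference) - len(query) + 1):
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits


reference = "TGACGTGTGATTCAAAAAAGC"  # toy reference from the slide
hits = naive_find(reference, "GATTCAAA")
print(hits)
```

The work grows with (reference length) x (query length) for each read, which is why this approach does not scale to a 3-billion-base genome and over a billion reads.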

Page 11

How fast should it be?

method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3 x 10^9                3.6 x 10^18              1 x 10^11 yrs
naïve matching       2400                    1.2 x 10^9               7,608 yrs
improved algorithm   3                       3.6 x 10^8               10 yrs
minimum required     0.01                    1.2 x 10^7               11.5 days
desired              0.001                   1.2 x 10^6               1.2 days

based on 200 bp read length, 80x single-end WGS
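As a rough sketch of where numbers of this scale come from: the stated setting (200 bp reads, 80x coverage, a genome of about 3 billion bases) implies about 1.2 billion reads, and total time is per-read time multiplied by that count. The helper names below are assumptions of this sketch; the slide's other rows may rest on rounding or assumptions not stated on the slide.

```python
GENOME_SIZE = 3_000_000_000  # approximate human genome length in bases


def reads_needed(coverage, read_length, genome_size=GENOME_SIZE):
    """Number of single-end reads needed to reach the given average coverage."""
    return genome_size * coverage // read_length


def total_seconds(sec_per_read, n_reads):
    """Total alignment time if every read takes sec_per_read seconds."""
    return sec_per_read * n_reads


n_reads = reads_needed(coverage=80, read_length=200)
eyeballing = total_seconds(3e9, n_reads)
print(n_reads, eyeballing, eyeballing / (365 * 24 * 3600), "years")
```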

Page 12

Searching with index

• Assume you're searching "genome" in an English dictionary
  – You don't search every line on every page
  – You first find the page range of "g" in the dictionary
  – in the above range (of 'g'), you find the page range of "ge" in the dictionary
  – in the above range (of 'ge'), you find the page range of "gen" in the dictionary
  – ...
  – until you find "genome"
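The dictionary analogy can be sketched as repeated range narrowing over a sorted list. This is only an illustration of the analogy, not of how a genome index is built; the word list is made up, and the `"\uffff"` sentinel assumes no word contains that character.

```python
from bisect import bisect_left, bisect_right


def narrow_by_prefix(sorted_words, word):
    """Return (prefix, lo, hi) for each growing prefix of word.

    sorted_words[lo:hi] holds exactly the entries starting with that prefix.
    """
    lo, hi = 0, len(sorted_words)
    steps = []
    for k in range(1, len(word) + 1):
        prefix = word[:k]
        # Only search inside the range found for the previous, shorter prefix.
        lo = bisect_left(sorted_words, prefix, lo, hi)
        hi = bisect_right(sorted_words, prefix + "\uffff", lo, hi)
        steps.append((prefix, lo, hi))
    return steps


# Hypothetical mini dictionary.
words = sorted(["apple", "gel", "gene", "genome", "genus", "go", "zebra"])
steps = narrow_by_prefix(words, "genome")
for prefix, lo, hi in steps:
    print(prefix, words[lo:hi])
```

Each step looks only inside the previous range, so the search never scans the whole dictionary.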

Page 13

Searching with index

How can we build an index for a genome?

Page 14

Burrows-Wheeler Transform

Page 15

Burrows-Wheeler Transformation

Input string: BANANA
Append the end marker $ (lexicographically smallest): BANANA$

All rotations of BANANA$ (row, rotation):
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

After sorting (sorted row, original row, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

Last column of the sorted rotations: ANNB$AA

BWT("BANANA$") = "ANNB$AA"
1. BWT just changes the order of the characters in the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string back
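The construction above can be reproduced with a short sketch: build all rotations, sort them, and read off the last column. The inverse shown here is the simple textbook method of repeatedly prepending and sorting, chosen for clarity rather than speed; the function names are assumptions of this sketch.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, which must end with the marker '$'."""
    assert text.endswith("$"), "append the end marker first"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed):
    """Recover the original string from its BWT (naive table-rebuilding method)."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the BWT column to every row, then sort the rows again.
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The original string is the row that ends with the end marker.
    return next(row for row in table if row.endswith("$"))


transformed = bwt("BANANA$")
print(transformed)
print(inverse_bwt(transformed))
```

Real aligners do not sort explicit rotations of a 3-billion-base string; they rely on suffix-array construction algorithms, which this sketch does not attempt.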

Page 25

LF Search

Sorted rotations of BANANA$ (sorted row, original row, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

Question: Find "NAN" from BANANA

Page 27

LF Search

Question: Find "NAN" from BANANA

The query is matched from its last character backward, so the first step looks for "N".

The range of strings that start with "N" can be calculated from:
• the number of symbols that are lexicographically less than 'N'
  – to determine the start point
  – = 5
• the number of 'N'
  – to determine the end point
  – = 2

So the sorted rows 5 to 6 (NA$BANA and NANA$BA) start with "N": start = 5, end = 5 + 2 - 1 = 6.

Page 30

LF Search

Question: Find "NAN" from BANANA

Next, extend the match one character to the left: "AN".

A first attempt uses the same rule as for "N":
• the number of symbols that are lexicographically less than 'A'
  – to determine the start point
  – = 1
• the number of 'A'
  – to determine the end point
  – = 3

This gives rows 1 to 3. This is a range for 'A', not 'AN'!!

The range of strings that start with "AN" can instead be calculated by also counting 'A' in the last column (ANNB$AA) relative to the current "N" range (rows 5 to 6):
• the number of symbols that are lexicographically less than 'A' + the number of 'A' before the start point
  – to determine the start point
  – count of 'A' before the start point = 1 (this 'A' belongs to a string "Ax" that is not "AN" and is less than "AN", here "A$")
  – = 1 + 1 = 2
• the number of 'A' before the end point (counting through the end row)
  – to determine the end point
  – = 3, so end = 1 + 3 - 1 = 3

So the sorted rows 2 to 3 (ANA$BAN and ANANA$B) start with "AN".

Page 35

LF Search

Question: Find "NAN" from BANANA

The range of strings that start with "NAN" can be calculated from the current "AN" range (rows 2 to 3):
• the number of symbols that are lexicographically less than 'N' + the number of 'N' in the last column before the start point
  – to determine the start point
  – = 5 + 1 = 6
• the number of 'N' in the last column before the end point (counting through the end row)
  – to determine the end point
  – = 2, so end = 5 + 2 - 1 = 6

So only the sorted row 6 (NANA$BA) starts with "NAN".

Page 36

LF Search

Question: Find "NAN" from BANANA

The final range is the single sorted row 6 (start = end), NANA$BA.

Sorted row 6 is row 2 of the original (unsorted) rotation table
= the number of rotations applied to the original string
= "NAN" exists at the 3rd position of "BANANA"

BANANA
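The walkthrough above can be sketched as a backward search over the BWT. This is a minimal illustration, not a production FM-index: occurrence counts are recomputed by scanning the last column, the suffix array is built by sorting rotations, ranges are half-open (end exclusive), and all names are assumptions of this sketch.

```python
def build_index(text):
    """Return (suffix_array, last_column, C) for text ending with '$'."""
    assert text.endswith("$")
    n = len(text)
    # Row order of the sorted rotation table, stored as original rotation numbers.
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in sa)
    # C[c] = number of symbols in text that are lexicographically less than c.
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total
        total += text.count(ch)
    return sa, last, C


def backward_search(pattern, sa, last, C):
    """Return sorted 0-based positions of pattern, matching right to left."""
    lo, hi = 0, len(last)  # half-open range of sorted rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + last[:lo].count(ch)  # count of ch before the start point
        hi = C[ch] + last[:hi].count(ch)  # count of ch before the end point
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, last, C = build_index("BANANA$")
print(last, C)
print(backward_search("NAN", sa, last, C))
```

For "NAN" the row range shrinks from rows 5-6 ("N") to rows 2-3 ("AN") to row 6 ("NAN"), matching the slides, and the stored original row number 2 gives the match position (0-based 2, the 3rd character).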

Page 37

Genome Informatics I (2015 Spring)

Genome query

imported from Mike Schatz's slide
http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf


Page 44

Inexact matching

Reference: T G A C G T G T G A T T C A A A A A A G C
Query:     G A T T G A A A

When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• if another mismatch occurs, branch out again
• so the allowed edit distance is critical to alignment speed
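A small sketch of why the allowed number of mismatches matters: it counts how many distinct DNA strings lie within k substitutions of a read, which bounds how many branches a naive "try the other three bases" search could explore. It considers substitutions only (no insertions or deletions), and the function name is an assumption of this sketch.

```python
from math import comb


def candidates_within(read_length, max_mismatches):
    """Number of DNA strings within max_mismatches substitutions of a read."""
    # Choose which i positions differ, then one of 3 other bases at each.
    return sum(comb(read_length, i) * 3 ** i for i in range(max_mismatches + 1))


for k in range(4):
    print(k, candidates_within(8, k), candidates_within(100, k))
```

For the 8-base query on the slide there are 25 candidates with one mismatch and 277 with two; for a 100-base read the counts grow far faster, which is why aligners bound and prune this search.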

Page 45

Goal achieved

method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3 x 10^9                3.6 x 10^18              1 x 10^11 yrs
naïve matching       2400                    1.2 x 10^9               7,608 yrs
improved algorithm   3                       3.6 x 10^8               10 yrs
minimum required     0.01                    1.2 x 10^7               11.5 days
desired              0.001                   1.2 x 10^6               1.2 days

Page 46

VARIANT CALLING – SNV CALLING
SNV calling

Page 47

Detailed View

[Figure: reads aligned to a genome region]

one read = one DNA fragment aligned to a specific genomic region
= observation of our sample in this region (1 time)

Page 48

Detailed View

A certain genomic position (in bp):

A   reference allele

A   observation of our sample at this position from read 1
A   observation of our sample at this position from read 2
A
A
C
A
A
A
A
C   observation of our sample at this position from read 10

Page 50

Why multiple observations?

• Observations contain errors
  – errors from machine
    • basecall error
  – errors from mapping
    • mapping error
  – errors from others
    • library prep error
• With accuracy of 99%...
  – 1% error over the whole region
  – leads to
    • ~30 million false SNPs for whole genome
    • ~500k false SNPs for whole exome
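The arithmetic behind these figures, and the reason repeated observations help, can be sketched as follows. The exome size of 50 Mb is an assumption chosen to match the slide's ~500k figure, errors are assumed independent between reads, and the function names are made up for this sketch.

```python
from math import comb


def expected_false_calls(n_bases, error_rate):
    """Expected number of wrong bases if each position is observed once."""
    return n_bases * error_rate


def prob_at_least(k, n, p):
    """P(at least k of n independent observations are errors)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


GENOME = 3_000_000_000  # approximate human genome size in bases
EXOME = 50_000_000      # assumed exome size in bases

genome_false = expected_false_calls(GENOME, 0.01)
exome_false = expected_false_calls(EXOME, 0.01)
half_wrong = prob_at_least(5, 10, 0.01)
print(genome_false, exome_false, half_wrong)
```

With 10 reads covering a position, the chance that half or more of them are wrong at the same site is tiny, which is why calls are made from many reads rather than one.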

Page 51

Human diploid genome

[Figure: read pileups at one site for three genotypes. Homozygous reference: reads show G, with an occasional A from sequencing error / mapping error. Heterozygous alternative: a mix of G and A reads. Homozygous alternative: reads show A. A further pileup is labeled somatic mutations.]

Page 52

Allele fraction distribution (binomial)

Normal approximation of B(100, 0.5), with $\mu = np = 50$ and $\sigma = \sqrt{np(1-p)} = 5$:

$\Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.9973$
$\Pr(35 \le x \le 65) \approx 0.9973$
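As a sanity check on the normal approximation, the exact binomial probability of the same interval can be summed directly. This sketch assumes 100 reads at a heterozygous site with each allele equally likely; the variable names are illustrative.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 100, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))
lo, hi = int(mu - 3 * sigma), int(mu + 3 * sigma)

# Exact probability that the alternative-allele read count falls in [lo, hi].
exact = sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))
print(mu, sigma, lo, hi, exact)
```

The exact value comes out slightly above the 0.9973 given by the three-sigma rule because the discrete interval includes both endpoints.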

Page 53

Allele fraction distribution (binomial)

[Figure: read pileups of G and A observations for the three genotype cases, illustrating the allele fractions]

Page 54

Inferring mutations

Observation of donor genome (reads at one site): G A G A G G G G G A A A G A G A
Reference allele: G (allele 'A' below is the reference allele, 'B' the alternative allele)

Probability of observing "G" at the site of "G"

• True genotype = "AA" and no sequencing error
• True genotype = "AB" and
  – read was generated from the 'A' allele and no sequencing error, or
  – read was generated from the 'B' allele, with a sequencing error, and 'A' was generated by chance
• True genotype = "BB" and sequencing error

Page 55

Inferring mutations

Observation of donor genome (reads at one site): G A G A G G G G G A A A G A G A
Reference allele: G

Probability of observing "A" at the site of "G"

• True genotype = "AA" and sequencing error, P(e)
• True genotype = "AB" and
  – read was generated from the 'A' allele, with a sequencing error, and 'B' was generated by chance, or
  – read was generated from the 'B' allele and no sequencing error
• True genotype = "BB" and no sequencing error

Page 56

Genotype determination

• L(g=AA|D): likelihood that the genotype is wild-type given the observation!
• L(g=AB|D), L(g=BB|D): likelihood that the genotype is mutant given the observation!
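The three likelihoods can be sketched for the observations on the earlier slides (GAGAGGGGGAAAGAGA, reference G, alternative A). This is a simplified model, not the method of any particular tool: it assumes a single error rate e for every read, that an error produces each of the other three bases with probability e/3, that both alleles are sampled equally, and that reads are independent.

```python
from math import log


def read_prob(base, genotype, ref, alt, e):
    """P(observed base | genotype) under the simplified error model."""
    def from_allele(allele_base):
        # Correct read with probability 1 - e, otherwise one of 3 wrong bases.
        return 1 - e if base == allele_base else e / 3

    alleles = {"AA": (ref, ref), "AB": (ref, alt), "BB": (alt, alt)}[genotype]
    # Each read comes from either allele with probability 1/2.
    return 0.5 * from_allele(alleles[0]) + 0.5 * from_allele(alleles[1])


def genotype_log_likelihoods(observed, ref, alt, e=0.01):
    """Log L(g|D) for g in AA, AB, BB."""
    return {
        g: sum(log(read_prob(b, g, ref, alt, e)) for b in observed)
        for g in ("AA", "AB", "BB")
    }


observed = "GAGAGGGGGAAAGAGA"
ll = genotype_log_likelihoods(observed, ref="G", alt="A")
best = max(ll, key=ll.get)
print(ll, best)
```

With 9 G and 7 A reads the heterozygous genotype AB has by far the highest likelihood, since either homozygous genotype would require many sequencing errors.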

Page 57

Tools

Page 58

SOMATIC MUTATIONS

Page 59

Germline vs. Somatic mutation

[Figure: a sample from a non-disease site and a sample from a disease site, each compared against the reference sequence (e.g. hg19)]

Tools:
• UnifiedGenotyper
• VarScan2
• SomaticSniper
• …

Page 60

Easy way to find somatic mutations

Call the genotype of each sample separately and compare them:
• sample from non-disease site: GN = AA (genotype of the normal sample)
• sample from disease site: GT = AB (genotype of the tumor sample)

Page 61

Joint Probabilities

Page 62

Joint Probabilities

• P(GT=AB|GN=AA) ≠ P(GT=AB|GN=AB) ≠ P(GT=AB|GN=BB)

Tumor genotype is dependent on normal genotype!!!

G: Joint Genotype Matrix

Page 63

WHEN SAMPLE IS NOT PURE

Page 64

Heterogeneous Sample

[Figure: normal cells (G G) and tumor cells (G A) are mixed in one sample; the pooled reads at the site are mostly G with a few A]

Page 65

Causes of low-frequency variants

• Sample contamination (e.g. stromal cells)
• Tumor heterogeneity
• Extreme environments
• Somatic mosaicism

Page 69

Heterogeneous Sample

[Figure: read pileup at one site with 15 reads, 13 G and 2 A]

"2/15: No mutation. Two 'A's are from sequencing errors..."

VS

"2/15: Heterozygous somatic mutation!! The sample is certainly heterogeneous!"

"How do we know this?"
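One way to put a number on the first explanation is a binomial tail: how likely are two or more erroneous reads out of 15 if only sequencing error is at work? This sketch assumes a 1% per-read error rate (the 99% accuracy figure used earlier) and independent reads, and it ignores that an error would also have to produce the same base 'A' each time; the function name is illustrative.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(at least k of n independent reads are errors) for per-read error rate p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_errors_only = prob_at_least(2, 15, 0.01)
print(p_errors_only)
```

The probability is roughly 1%, so errors alone are an unlikely but not impossible explanation at a single site; deciding between the two readings needs an estimate of how much of the sample is actually tumor.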

Page 71

Estimating Cellularity

• It is "easy" only if we already know where to look (sites where the disease genotype is AB or BB)

But how do we know the genotype? (even without knowing α?)

1. Use SNP array - ONCOSNP (Yau et al, Genome Biol, 2009), Absolute (Carter et al, Nature Biotech, 2012)
2. SNP Calling - Snyder et al, PNAS, 2010, PurityEst (Su et al, Bioinformatics, 2012)
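A minimal sketch of why known genotypes make the estimate "easy". It assumes that α denotes the fraction of normal (control) cells mixed into the disease sample, that the site is diploid with no copy number change, that the normal genotype is AA, and that reads are sampled in proportion to cell fractions. None of these assumptions are stated on this slide, and the function name is made up.

```python
def estimate_alpha_from_baf(baf, disease_genotype="AB"):
    """Estimate the normal-cell fraction alpha from a B-allele frequency.

    Expected BAF is (1 - alpha) / 2 for a somatic AB site and (1 - alpha)
    for a somatic BB site, when the normal genotype is AA.
    """
    b_fraction_in_tumor = {"AB": 0.5, "BB": 1.0}[disease_genotype]
    alpha = 1 - baf / b_fraction_in_tumor
    if not 0 <= alpha <= 1:
        raise ValueError("BAF is inconsistent with the assumed genotype")
    return alpha


alpha_2_of_15 = estimate_alpha_from_baf(2 / 15, "AB")
print(alpha_2_of_15)
```

Under these assumptions, 2 'A' reads out of 15 at a truly heterozygous somatic site would correspond to a sample that is about 73% normal cells. A single site is noisy, so in practice many sites would be combined.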

Page 72

Accurate inference in Virmid

Estimate global within-individual contamination for accurate detection of somatic mutations

Page 73

Bias 1 - Loss of Reads (Virmid)

[Figure: reads r1 and r2 aligned to the reference (ref); r1 carries the A allele at site g1 and r2 carries the B allele at site g2]

$x_a = p(\text{a read that passes } g_1 \text{ being unmapped}) = p(r_1 \text{ has } d+1 \text{ or more variants in the remaining sites})$

$x_b = p(\text{a read that passes } g_2 \text{ being unmapped}) = p(r_2 \text{ has } d \text{ or more variants in the remaining sites})$

$x_a = 1 - \sum_{i=0}^{d} \binom{l-1}{i} p^i (1-p)^{l-1-i}$

$x_b = 1 - \sum_{i=0}^{d-1} \binom{l-1}{i} p^i (1-p)^{l-1-i}$

where $d$ = maximum edit distance, $l$ = read length, and $p$ = frequency of variation
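The two expressions can be evaluated directly. The sketch below follows the formulas as written on the slide; the parameter values in the example (read length 100, variation frequency 0.001, maximum edit distance 2) are illustrative assumptions, not values from the source.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(at most k variants among n sites); zero when k is negative."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def loss_of_reads(l, p, d):
    """Return (x_a, x_b): unmapping probabilities for A-allele and B-allele reads."""
    x_a = 1 - prob_at_most(d, l - 1, p)      # needs d+1 or more other variants
    x_b = 1 - prob_at_most(d - 1, l - 1, p)  # already has one variant, needs d more
    return x_a, x_b


x_a, x_b = loss_of_reads(l=100, p=0.001, d=2)
print(x_a, x_b)
```

Because a read carrying the B allele has already used one mismatch, x_b is always at least x_a, so B-allele reads are lost more often.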

Page 74

Bias 2 - Loss of variants (Virmid)

[Figure: reads from normal (fraction α) and reads from disease (fraction 1-α), with the B-allele marked; annotations: overestimate BAF, underestimate α]

Page 75

[Figure: estimated α, with regions marked underestimated α and overestimated α]

Page 76

Calling low-fraction somatic mutations in Virmid

Kim S et al, Genome Biology 2013

Page 77

Low-frequency mutations in disease

Identification of de novo somatic mutations in AKT-MTOR-PIK3CA in hemimegalencephaly

Lee J et al, Nature Genetics, 2012

Page 78

Low-frequency mutations in disease

Identification of MTOR driver mutations in focal cortical dysplasia

Lim J et al, Nature Medicine 2015

Page 79

COPY NUMBER VARIATION (CNV)

Page 80

Copy Number Variation

Changes in copy number of a large DNA segment
- usually in terms of genes
- e.g. HER2 amplification

Types of CNVs
- Copy number gain (CN > 2):
  - Increase of copy number due to genomic rearrangements such as insertion/duplication
- Copy number loss (CN < 2):
  - Decrease of copy number due to genomic rearrangements such as deletion

Copy number aberration (CNA)
- refers to CNV particularly when the events are associated with a disease phenotype

Page 81: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Comparative Genome Hybridization (CGH)

500 kb-1500 kb fragments for optimal hybridization

Page 82: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Array CGH

Page 83: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Resolution

Page 84: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Benefits of NGS-based CNV detection

• High resolution (< 50 bp) in size
• Data reuse (multi-purpose)
– One NGS (whole-genome) sequencing run can be used for SNV, CNV, and SV detection
• Can be improved with additional NGS information
– Discordant reads in paired-end sequencing

Page 85: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Inferring CNVs from NGS

• Principle:
– Samples with a copy number gain (or loss) will generate more (or fewer) reads in the region

Figure labels: gene; 3 copies (gain), 2 copies (normal), 1 copy (loss)
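The principle can be sketched in a few lines: if a region yields a known number of reads at the normal two copies, the observed read count scales roughly in proportion to the copy number. This is only an illustration; the function name `estimate_copy_number` and the read counts are hypothetical, and the sketch ignores noise and bias.

```python
def estimate_copy_number(observed_reads, expected_reads_at_normal, normal_copies=2):
    """Scale the normal copy number by the observed/expected read ratio.

    Assumes read count is proportional to copy number (no bias, no noise).
    """
    if expected_reads_at_normal <= 0:
        raise ValueError("expected_reads_at_normal must be positive")
    return normal_copies * observed_reads / expected_reads_at_normal


# Hypothetical region that yields 100 reads at the normal 2 copies.
for observed in (150, 100, 50):
    print(observed, "reads ->", estimate_copy_number(observed, 100), "copies")
```

In real data the ratio is rarely this clean, which is why the later slides treat read depth as a noisy signal.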

Page 86: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.


The signal

Figure labels: 3 copies (gain), 2 copies (normal), 1 copy (loss), mapped to reference

Page 87: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

The signal

Figure labels: 3 copies (gain), 2 copies (normal), 1 copy (loss), mapped to reference

Catching these needs a systematic approach!

Page 88: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Catching the signal

• Problems
– Read depth is not uniform even without copy number changes
• GC bias
• Mapping bias in repeat regions
• Natural variance (Poisson distribution)

Poisson distribution:
- The probability of a given number of events occurring in a fixed interval of time and/or space.

Example:
- You receive 120 phone calls a day. What is the best way to describe the number of phone calls in an hour?
- Similarly, you generated 100,000,000 NGS reads from a whole genome. How many reads are generated within chr1:12781718-12782228?
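Both questions can be worked through numerically. The sketch below assumes calls and reads arrive uniformly at random, assumes a genome size of about 3.1 Gb (not stated on the slide), and counts a read as belonging to the region if it starts there. The helper `poisson_pmf` is written here for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Phone calls: 120 per day spread over 24 hours gives the hourly rate.
calls_per_hour = 120 / 24
print("mean calls per hour:", calls_per_hour)
print("P(exactly 5 calls in an hour):", round(poisson_pmf(5, calls_per_hour), 4))

# NGS reads: expected number of reads starting inside the region.
GENOME_SIZE = 3.1e9      # assumption: approximate human genome size in bp
TOTAL_READS = 100_000_000
region_length = 12782228 - 12781718 + 1   # inclusive coordinates
expected_reads = TOTAL_READS * region_length / GENOME_SIZE
print("region length (bp):", region_length)
print("expected reads in region:", round(expected_reads, 2))
```

Under these assumptions the read count in the region is modelled as Poisson with mean `expected_reads`, in the same way the hourly call count is Poisson with mean `calls_per_hour`.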

 

Page 89: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Significantly deviated read-depth

• Null hypothesis (H0):
– The copy number of a given region is unchanged
– We assume the read depth follows a Poisson distribution
• Alternative hypothesis (Ha):
– The copy number of a given region is changed
• If H0 is true:
– The read depth (calculated from the number of reads) within a specific genomic region does not deviate significantly from the Poisson distribution
• If the read depth deviates too much to be explained by natural variance (Poisson distribution):
– The copy number has changed
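The test can be sketched as a two-sided Poisson p-value for the read count in one region. This is a minimal illustration under the slide's Poisson assumption, not the procedure of any particular tool; real read depth is usually overdispersed, and the expected count of 16 reads and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_cdf(k, lam):
    """P(X <= k); returns 0.0 for negative k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def two_sided_p_value(observed, expected):
    """Twice the smaller tail probability, capped at 1."""
    lower_tail = poisson_cdf(observed, expected)            # P(X <= observed)
    upper_tail = 1.0 - poisson_cdf(observed - 1, expected)  # P(X >= observed)
    return min(1.0, 2.0 * min(lower_tail, upper_tail))


expected = 16.0  # hypothetical expected read count under H0
for observed in (16, 40, 2):
    p = two_sided_p_value(observed, expected)
    verdict = "reject H0" if p < 0.001 else "keep H0"
    print(observed, "reads: p =", p, "->", verdict)
```

A genome-wide scan tests many regions, so the significance threshold would also need a multiple-testing correction.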

Page 90: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Practically, we should consider

• Bias correction from sequence context (GC bias, etc.)
• Event detection method
– Decide whether a significant rise (or drop) in read depth looks like an event
• mean-shift technique (CNVnator, Abyzov et al 2011)
• event-wise testing (Yoon et al, 2009)
• paired-end signal (CNVer, Medvedev et al 2010)
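As an illustration of the bias-correction step, one common scheme rescales each window's read count by the ratio of the overall median count to the median count of windows with similar GC content. The sketch below is an assumption about how such a correction could look, not the exact method of the tools cited above; `gc_correct`, the bin width, and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(counts, gc_fractions, gc_step=0.05):
    """Rescale window read counts so every GC bin shares the overall median.

    counts       -- read count per genomic window
    gc_fractions -- GC fraction (0..1) of each window, same order as counts
    gc_step      -- width of the GC bins used to group similar windows
    """
    overall_median = median(counts)
    bin_keys = [round(gc / gc_step) for gc in gc_fractions]

    counts_by_bin = defaultdict(list)
    for key, count in zip(bin_keys, counts):
        counts_by_bin[key].append(count)

    corrected = []
    for key, count in zip(bin_keys, counts):
        bin_median = median(counts_by_bin[key])
        # Windows in a bin with zero median carry no usable signal.
        corrected.append(count * overall_median / bin_median if bin_median > 0 else 0.0)
    return corrected


# Hypothetical windows: GC-rich windows are under-covered by half.
counts = [100, 100, 50, 50]
gc_fractions = [0.40, 0.40, 0.60, 0.60]
print(gc_correct(counts, gc_fractions))
```

After correction, a remaining rise or drop in depth is more likely to reflect copy number than sequence context.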

Page 91: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

CNVnator

Page 92: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

STRUCTURAL VARIATION (SV)

Page 93: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Beyond the SNVs

Page 94: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Beyond the SNVs

Page 95: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Beyond the SNVs

TFE3-KHSRP Translocation in Renal Cell Carcinoma

Page 96: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Structural Variations (SVs)

• Genomic rearrangements that affect >50bp of sequence

Alkan et al, Nat. Rev. Genetics 12, 363-376, 2011

Page 97: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

List of structural variations

Page 98: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.


List of structural variations

Page 99: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Paired-end sequencing

Page 100: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.


Paired end reads for SV finding

Donor

Reference

Donor

Reference

Page 101: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for SV detection

• Read depth
o Assume a random distribution in mapping depth
o Significantly higher depth for duplicated regions
o Significantly reduced depth for deleted regions
• Read pair
o Assess the span and orientation of paired-end reads
• Split read
o Define breakpoints of SVs using the split-sequence-read signature (broken alignment)
• Assembly
o Assemble and reconstruct the whole genome of the sample DNA
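The read-pair method can be sketched as a small classifier over the span and orientation of one mapped pair. The rules below follow common conventions for forward-reverse paired-end libraries and are an assumption, not the logic of a specific caller; `classify_pair`, the 3-SD cutoff, and the fixed read length are hypothetical.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  mean_insert, sd_insert, read_length=100, k=3.0):
    """Label one read pair from its mapped span and orientation.

    pos1/pos2 are leftmost mapped positions; strands are '+' or '-'.
    The span is approximated as the distance between leftmost positions
    plus one read length (assumes equal-length, ungapped reads).
    """
    if chrom1 != chrom2:
        return "translocation candidate"
    if strand1 == strand2:
        return "inversion candidate"

    # Order the two reads by position to check which one points inward.
    (left_pos, left_strand), (right_pos, _) = sorted([(pos1, strand1), (pos2, strand2)])
    if left_strand == "-":
        # Reads point away from each other (everted pair).
        return "tandem duplication candidate"

    span = right_pos - left_pos + read_length
    if span > mean_insert + k * sd_insert:
        return "deletion candidate"
    if span < mean_insert - k * sd_insert:
        return "insertion candidate"
    return "concordant"


# Hypothetical library: mean insert size 400 bp, SD 30 bp.
print(classify_pair("chr1", 1000, "+", "chr1", 1300, "-", 400, 30))
print(classify_pair("chr1", 1000, "+", "chr1", 5000, "-", 400, 30))
print(classify_pair("chr1", 1000, "+", "chr2", 1300, "-", 400, 30))
```

A single discordant pair is weak evidence; callers typically require a cluster of pairs supporting the same event before reporting it.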

Page 109: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for Deletion Detection

Page 110: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for Deletion Detection

Page 111: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for Deletion Detection

Page 112: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for Deletion Detection

Page 113: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for Deletion Detection

Page 114: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Methods for Deletion Detection

Page 115: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 1. Judgment of discordance

Page 116: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 1. Judgment of discordance

Page 117: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 2. Size of insertion

Page 118: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 2. Large indels

Novel Sequence Insertion

Page 119: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 2. Large indels

Existing Sequence Insertion

Page 120: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 3. Nonspecific Mappings

Page 121: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

Problem 3. Nonspecific Mappings

Page 122: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

DISCUSSION


Page 123: Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variant Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical.

THANK YOU
