Post on 23-May-2020

A technical and methodological introduction to NGS (data) analysis.

biomina

Geert Vandeweyer 2015-04-24

Outline

- Next Generation Sequencing:
  - Technological principles
  - Applications
- NGS Data Description
- Example Application: NGS based Variant Calling
  - From sequence to variant: Analysis flow
  - From variant to knowledge: Interpretation flow
- Final Remarks: Reducing computational complexity

The digital code of DNA, Leroy Hood and David Galas Nature 421, 444-448, 23 January 2003

NGS Principles: Sanger

NGS Applications: DNA-Seq

Whole Genome Sequencing
• Novel organisms, de novo reference genome
• Structural variant detection

“Selective” Sequencing
• Whole exome sequencing
• Gene panel resequencing
  - Candidate genes for disease
  - All genes in pathway
  - ...
  => PCR, Capture, MIPs, ...
• ChIP-Seq
• 16S metagenomics

NGS Applications: RNA-Seq

Whole Transcriptome Sequencing
• Gene/Transcript variant identification
• Gene Expression
  - Unbiased detection
  - Highly quantitative

“Selective” Sequencing
• Ribosome Profiling

NGS Data Description

What kind of data are we working with?
- Sanger Sequencing:
  - 1 amplicon / reaction
  - 1 sequence / amplicon (or 2)
  - Visual inspection for overlapping peaks
- Next-Generation Sequencing:
  - Massively parallel sequencing
    - small panel: few hundred targets
    - exome panel: > 200.000 targets
  - Multiple amplicons / target
    - optimal design: > 40 unique fragments covering every nucleotide in targets.
  => Amount of data: > 8.000.000 sequences / sample

Data format: FASTQ

- FASTA:
  >Sequence_Name
  AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
  GCTCTAGATCGATCTATACCGAT

- Add Quality (fasta-Q => FASTQ):
  @Read_Name
  AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
  +
  BCEECEEFFECGECGECFGFF@?<<=??<>53@##
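The four-line record layout above (name, sequence, "+", quality) can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines; the record type `FastqRecord`, the function `read_fastq` and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read input, not taken from a real run.
example = io.StringIO("@Read_1\nACGT\n+\nIIII\n@Read_2\nGGCA\n+\nBBB#\n")
records = list(read_fastq(example))
```

Multi-line sequences, which the FASTQ format allows in principle, are deliberately not handled here.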

Data format: FASTQ quality scores

  @Read_Name
  AACTACATACTGATAGTATATCTC
  +
  BCEECEECGECGECFGFF@?<<

Standard == Sanger format: Quality = phred + 33, ASCII-encoded

Phred score: correlates with the risk of error (probability that the base call is wrong)
“High Quality”: P(error) < 0.001, i.e. Q > 30

=> Example: Value == B
   ASCII-decode: 66
   Phred: 66 - 33 = 33
   Chance of error = 1/10^3.3
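The decoding in the example can be checked with a short Python sketch. It assumes the Sanger offset of 33 described above; the function names `phred_scores` and `error_probability` are chosen here for illustration.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(phred: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)


# Quality string from the FASTQ example above.
scores = phred_scores("BCEECEECGECGECFGFF@?<<")
first_base_error = error_probability(scores[0])  # 'B' -> Phred 33 -> 1/10^3.3
```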

Data format: FASTQ << WARNING >>

  @Read_Name
  AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
  +
  BCEECEEFFECGECGECFGFF@?<<=??<>53@##

=> Standard == Sanger format, but other scales exist!
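One of those other scales is the phred + 64 encoding used by older Illumina pipelines. A common heuristic for telling the two apart, sketched below, relies on the fact that characters with an ASCII code below 59 (';') can only occur with the + 33 offset. The function name `guess_phred_offset` is an assumption for illustration, and the guess of 64 can be wrong for + 33 data that happens to contain only very high qualities.

```python
def guess_phred_offset(quality_strings) -> int:
    """Heuristically guess the quality offset (33 or 64) from observed characters."""
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    # Codes below 59 cannot be produced by a +64 encoding.
    if lowest < 59:
        return 33
    # Otherwise +64 is the more likely encoding, but this is only a guess.
    return 64


sanger_offset = guess_phred_offset(["BCEECEEFFECGECGECFGFF@?<<=??<>53@##"])
old_illumina_offset = guess_phred_offset(["hhhhgggf"])  # hypothetical +64 string
```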

NGS Based Variant Calling
From sequence to variant: Analysis flow

Pre-Processing
- Adapter Trimming: Remove artificial sequences
- Quality-Trimming: Remove low quality sequences to improve specificity
- Generate QC Reports: Visual inspection of main quality parameters

Sequence-To-Variant
- Read Mapping: Place reads on the reference genome (BWA)
- Optimize Mapping: Remove duplicate reads (picard), recalibrate base quality scores (GATK), realign around indels (GATK)
- Call and Annotate Variants: Call variants (GATK) and annotate using ANNOVAR and snpEff (VariantDB)

Pre-Processing: Adapter Trimming

[Figure: read layout with Sequence Read 1, Sequence Read 2 and Sequence Barcode]

Scan all reads for the presence of artificial sequence and remove it from the reads.
Note: Adapters are present when length(targeted fragment) < read_length
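As a rough illustration of the scan-and-remove step, the sketch below trims an adapter from the 3' end of a read using exact string matching. Dedicated trimming tools also tolerate sequencing errors in the adapter, which this sketch does not; the function name `trim_adapter`, the `min_overlap` parameter and the example sequences are assumptions for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an adapter (and everything after it) from the 3' end of a read."""
    # Case 1: the complete adapter occurs somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends in the first part of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, assumed for this sketch
full = trim_adapter("ACGTACGTAC" + ADAPTER + "TTT", ADAPTER)
partial = trim_adapter("ACGTACGTACAGATC", ADAPTER)
untouched = trim_adapter("ACGTACGTAC", ADAPTER)
```

The `min_overlap` threshold trades sensitivity against the risk of trimming genuine bases that match the adapter start by chance.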

Pre-Processing: Quality-Trimming

Low quality leads to high error rates (cf. Phred score)
=> Due to chemical degradation, 3’ ends have a lower quality
=> We want a limit of 1 error in 1000 positions
=> Trim everything on 3’ end with quality < 30
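The trimming rule can be sketched as follows: walk back from the 3' end and drop bases until one reaches the threshold. This assumes phred + 33 encoding, and the function name `trim_3prime` is made up here. Real trimmers often use sliding windows or running sums instead of this strict per-base rule.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_phred: int = 30, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_phred."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    keep = len(sequence)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_phred:
        keep -= 1
    return sequence[:keep], quality[:keep]


# Hypothetical read: 'I' = Q40, '?' = Q30, '5' = Q20, '#' = Q2.
trimmed_sequence, trimmed_quality = trim_3prime("ACGTACGTAC", "IIIIIII?5#")
```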

Pre-Processing: Generate QC Reports

- Quality should improve after trimming
- Base composition should be 25% for G,C,T,A

[Figure: good run vs. failed run]
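The base composition check can be approximated by counting, for each read position, which fraction of reads carries each nucleotide. A minimal sketch, with the function name `base_composition` and the four toy reads assumed for illustration:

```python
from collections import Counter
from typing import Dict, List


def base_composition(reads: List[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G and T observed at each read position."""
    longest = max(len(read) for read in reads)
    composition = []
    for index in range(longest):
        counts = Counter(read[index] for read in reads if index < len(read))
        total = sum(counts.values())
        composition.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return composition


# Toy input in which every position is perfectly balanced.
per_position = base_composition(["ACGT", "CGTA", "GTAC", "TACG"])
```

Positions that deviate strongly from the expected fractions are the ones a QC report would flag.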

Sequence-To-Variant: Read Mapping

Burrows-Wheeler Transformation:
- Highly efficient method to scan a string for substring matches
- Principle: Build prefix trie, scan top-down using reverse search.
  1. Generate all rotations (cyclic permutations) of the string
  2. Sort the rotated strings
  3. Last column = Burrows-Wheeler Transformation of string.
  4. Build prefix trie from BWT
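Steps 1 to 3 and the reverse search can be sketched in a few lines of Python. This is a naive teaching version, not how BWA is implemented: it sorts full rotations and counts characters by scanning, and it walks the prefix trie only implicitly through an interval over the sorted rotations. The function names `bwt` and `count_occurrences` and the toy reference are assumptions for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text, with '$' appended as end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string: str, pattern: str) -> int:
    """Count exact matches of pattern by searching backwards over the BWT."""
    # Offset of the first sorted rotation starting with each character.
    first = {}
    total = 0
    for char in sorted(set(bwt_string)):
        first[char] = total
        total += bwt_string.count(char)

    low, high = 0, len(bwt_string)  # interval of matching sorted rotations
    for char in reversed(pattern):
        if char not in first:
            return 0
        low = first[char] + bwt_string[:low].count(char)
        high = first[char] + bwt_string[:high].count(char)
        if low >= high:
            return 0
    return high - low


transformed = bwt("ACAACG")  # toy reference sequence
hits = count_occurrences(transformed, "AC")
```

Production aligners precompute the character counts so that each search step takes constant time, and they extend the search to tolerate mismatches and gaps.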

Pre-Processing: Generate QC Reports (after mapping)

- Insert Size
- Capture Efficiency

Sequence-To-Variant: Optimize Mapping

- Remove duplicate reads (picard)
  => Reduce computational time
  => Reduce amplification bias
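The idea behind duplicate removal can be sketched as follows: reads that map to the same position and strand are treated as copies of one original fragment, and only the best one is kept. This is a simplification for illustration; tools such as Picard use more elaborate criteria (for example, both mates of a read pair). The `AlignedRead` type and its fields are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    start: int
    strand: str
    mean_quality: float


def remove_duplicates(reads: List[AlignedRead]) -> List[AlignedRead]:
    """Keep the highest-quality read for each (chrom, start, strand)."""
    best: Dict[Tuple[str, int, str], AlignedRead] = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return list(best.values())


# Hypothetical alignments: r1 and r2 share position and strand.
unique_reads = remove_duplicates([
    AlignedRead("r1", "chr1", 1000, "+", 31.0),
    AlignedRead("r2", "chr1", 1000, "+", 36.5),
    AlignedRead("r3", "chr1", 1042, "-", 35.0),
])
```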

Sequence-To-Variant: Optimize Mapping

- Realign around indels (GATK)
  => InDels are hard to align
  => P(>1 SNPs) < P(1 indel)

If, at a certain locus, both an InDel AND multiple SNPs are found
=> Replace SNPs by one InDel
=> Reduction of false positives

Sequence-To-Variant: Call and Annotate Variants

- Call Variants (GATK)
  - Search for positions with statistically significant evidence for a non-reference nucleotide
  - Take into account: base quality, position in read, strand bias, ...
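To make "statistically significant evidence" concrete, the sketch below asks how likely it is to see a given number of non-reference bases at a position if they were all sequencing errors. This is a deliberately naive binomial test and not the model GATK uses: it ignores per-base quality, position in read and strand bias. The function names, the error rate of 0.001 and the threshold `alpha` are assumptions for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_candidate_variant(depth: int, alt_count: int,
                         error_rate: float = 0.001, alpha: float = 1e-6) -> bool:
    """True if alt_count non-reference bases are unlikely to be errors alone."""
    return binomial_tail(depth, alt_count, error_rate) < alpha


# Hypothetical pileups at 40x coverage.
heterozygous_like = is_candidate_variant(depth=40, alt_count=18)
single_error_like = is_candidate_variant(depth=40, alt_count=1)
```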

Sequence-To-Variant: Call and Annotate Variants

- Annotate Variants (ANNOVAR, snpEff, ...)
  - Add information to the variant to ease interpretation:
    - Effect on gene transcription (RefSeq, Ensembl, UCSC)
    - Quality parameters (GATK)
    - Occurrence in control populations (dbSNP, ESP, HapMap, 1KG, ...)
    - Known pathogenic variations (dbSNP, OMIM, ...)
    - Effect on gene function (PolyPhen, MutationTaster, SIFT, ...)
    - ...

NGS Based Variant Calling
From variant to knowledge: Interpretation flow

Step 1: Quality Filtering: GATK Variant Recalibration

“The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact.”

“The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.”

1. Train model on known variants (both positive and negative)
2. Apply model to experimental data
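The train-then-apply idea can be illustrated with a heavily simplified stand-in: one Gaussian per class fitted on a single annotation, instead of a Gaussian mixture over several covarying annotations. The training values below are invented, and the function `log_odds` is not GATK code.

```python
from math import log
from statistics import NormalDist

# Invented training values for one annotation (think of something like QD).
true_variant_values = [18.0, 20.0, 22.0, 19.0, 21.0]
artifact_values = [2.0, 4.0, 3.0, 5.0, 1.0]

# "Train": fit one Gaussian to each class.
true_model = NormalDist.from_samples(true_variant_values)
false_model = NormalDist.from_samples(artifact_values)


def log_odds(value: float) -> float:
    """Log odds of 'true variant' versus 'artifact' for one annotation value."""
    return log(true_model.pdf(value) / false_model.pdf(value))


# "Apply": score new calls; positive favours a true variant.
likely_true = log_odds(19.0)
likely_false = log_odds(4.0)
```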

Step 2: Select an inheritance model

- De Novo: Variant not present in either parent
- Dominant: Variant present in affected parent
- Recessive: Variant homozygous in patient, heterozygous in both parents
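The three models translate directly into rules on trio genotypes. In the sketch below a genotype is encoded as the number of variant alleles (0, 1 or 2); this encoding and the function name `inheritance_model` are assumptions made for the example.

```python
from typing import Optional


def inheritance_model(patient: int, mother: int, father: int,
                      mother_affected: bool = False,
                      father_affected: bool = False) -> Optional[str]:
    """Return the inheritance model a variant fits, or None if it fits none."""
    if patient == 0:
        return None  # patient does not carry the variant
    if mother == 0 and father == 0:
        return "de novo"
    if patient == 2 and mother == 1 and father == 1:
        return "recessive"
    if (mother > 0 and mother_affected) or (father > 0 and father_affected):
        return "dominant"
    return None


de_novo = inheritance_model(patient=1, mother=0, father=0)
recessive = inheritance_model(patient=2, mother=1, father=1)
dominant = inheritance_model(patient=1, mother=1, father=0, mother_affected=True)
```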

Step 3: Effect on gene function (~ from high to low)

- Variant causes gain/loss of a stop/start codon?
- Variant causes aberrant splicing of the transcript?
- Variant replaces a highly conserved nucleotide/amino acid?
- Variant replaces an amino acid, and is not reported in control populations?
- Variant can modify binding of regulatory elements?
- ...

Extended annotation is critical.

Manual inspection of > 20.000 variants/sample is impossible: automation is needed.
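Automating this step amounts to ranking annotated variants by effect and filtering on annotation fields. The sketch below shows the idea; the effect labels, the field names and the example variants are all hypothetical and do not correspond to the output of any particular annotation tool.

```python
from typing import Dict, List

# Hypothetical effect labels, ordered from high to low impact as in the list above.
EFFECT_RANK = {
    "stop_or_start_change": 0,
    "splicing": 1,
    "conserved_position": 2,
    "missense": 3,
    "regulatory": 4,
}


def prioritize(variants: List[Dict]) -> List[Dict]:
    """Drop variants seen in controls, then sort by predicted impact."""
    candidates = [v for v in variants
                  if not v["in_controls"] and v["effect"] in EFFECT_RANK]
    return sorted(candidates, key=lambda v: EFFECT_RANK[v["effect"]])


ranked = prioritize([
    {"id": "var1", "effect": "missense", "in_controls": False},
    {"id": "var2", "effect": "stop_or_start_change", "in_controls": False},
    {"id": "var3", "effect": "splicing", "in_controls": True},
])
```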

Final Remarks
Reducing Computational complexity: Web-Tools

Sequence-to-Variants: Galaxy
- A website offering an easy way to run complete pipelines
- No programming skills needed, very useful for dynamic analysis
- Support for almost all types of analysis
  - Variant Calling
  - RNA-seq: Expression / transcript identification
  - Metagenomics
  - ChIP-seq
- Many organisms available by default (on main servers)
- New organisms can be added on request (on Biomina servers)

Public Server: http://www.usegalaxy.org
Biomina Server: http://www.biomina.be/apps/galaxy

Variant Interpretation: VariantDB
- Extensive annotation
- Flexible filtering options
- Automatic updates
- Multiple output formats:
  - online (tabular)
  - offline (CSV)
  - API (JSON)

Final Remarks
Reducing Computational complexity: Future?

Future NGS assays will be:
- Real-Time
- On-Site
- Low-Cost
- ...
