A technical and methodological introduction to NGS (data) analysis.

biomina

Geert Vandeweyer 2015-04-24

Outline

Next Generation Sequencing:
- Technological principles
- Applications
NGS Data Description
Example Application: NGS based Variant Calling
- From sequence to variant: Analysis flow
- From variant to knowledge: Interpretation flow
Final Remarks: Reducing computational complexity

The digital code of DNA, Leroy Hood and David Galas Nature 421, 444-448, 23 January 2003

NGS Principles: Sanger

[Figure: Sample preparation (target, adaptors, library)]

NGS Principles: Illumina (sbs)

The figures above are provided by

[Figure: Sample preparation (target, adaptors, library), Cluster Generation, Sequencing by Synthesis]

NGS Applications: DNA-Seq

Whole Genome Sequencing
• Novel organisms, de novo reference genome
• Structural variation detection

“Selective” Sequencing
• Whole exome sequencing
• Gene panel resequencing
  • Candidate genes for disease
  • All genes in pathway
  • ...
  => PCR, Capture, MIPs, ...
• ChIP-Seq
• 16S metagenomics

NGS Applications: RNA-Seq

Whole Transcriptome Sequencing
• Gene/Transcript variant identification
• Gene Expression
  • Unbiased detection
  • Highly quantitative

“Selective” Sequencing
• Ribosome Profiling

NGS Data Description

What kind of data are we working with?
- Sanger Sequencing:
  - 1 amplicon / reaction
  - 1 sequence / amplicon (or 2)
  - Visual inspection for overlapping peaks
- Next-Generation Sequencing:
  - Massively parallel sequencing
  - small panel : few hundred targets
  - exome panel: > 200.000 targets
  - Multiple amplicons / target
  - optimal design: > 40 unique fragments covering every nucleotide in targets.
=> Amount of data : > 8.000.000 sequences / sample

What kind of data are we working with?
- Data format : FASTQ
- FASTA :

>Sequence_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
GCTCTAGATCGATCTATACCGAT

- Add Quality (fasta-Q => FASTQ)

@Read_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
+
BCEECEEFFECGECGECFGFF@?<<=??<>53@##
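As an illustration of the four-line FASTQ layout (name, sequence, "+", quality), here is a minimal Python sketch of a reader. The names `FastqRecord` and `parse_fastq` are made up for this example, and the sketch assumes the simple case where sequence and quality each sit on a single line.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @name, sequence, '+', quality."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Invented record, only to show the call.
example = ["@Read_Name", "ACGTACGT", "+", "BCEE@?<<"]
for record in parse_fastq(example):
    print(record.name, len(record.sequence))
```

An open file handle can be passed directly instead of a list, since it also iterates line by line.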

What kind of data are we working with?
- Data format : FASTQ

@Read_Name
AACTACATACTGATAGTATATCTC
+
BCEECEECGECGECFGFF@?<<

Standard == Sanger Format : Quality = phred + 33, ascii-encoded

Phred Score : correlates with the risk of error (probability that the basecall is wrong)
“High Quality” : P(error) < 0.001, i.e. Q > 30

What kind of data are we working with?
Quality = phred + 33, ascii-encoded

=> Example: Value == B
   Ascii-decode : 66
   Phred : 66-33 = 33
   Chance of error = 1/10^3.3
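The decoding example can be written out in a few lines of Python. This is a small sketch that assumes the Sanger offset of 33 and uses the standard Phred relation P(error) = 10^(-Q/10); the function names are chosen for illustration only.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(phred: int) -> float:
    """Probability that a basecall with this Phred score is wrong."""
    return 10 ** (-phred / 10)


q = phred_scores("B")[0]        # ord("B") is 66, so q is 33
print(q, error_probability(q))  # error probability is 1/10^3.3
```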

What kind of data are we working with?

- Data format : FASTQ << WARNING >>

@Read_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
+
BCEECEEFFECGECGECFGFF@?<<=??<>53@##

=> Standard == Sanger Format : Other scales exist (e.g. older Illumina pipelines used phred + 64) !
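Because the offset is not stored in the file, it is usually guessed from the range of characters seen. The Python sketch below is a heuristic under stated assumptions, not a definitive test: characters below ASCII 59 cannot occur with a +64 offset, and characters above ASCII 74 ("J") are treated as evidence against the +33 offset. The function name `guess_phred_offset` is hypothetical.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Return 33, 64, or None when the observed range is ambiguous."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 59:
        return 33  # too low for any +64 encoding
    if max(codes) > 74:
        return 64  # higher than typical +33 qualities
    return None    # range fits both encodings; inspect more reads


print(guess_phred_offset(["BCEECEEFFECGECGECFGFF@?<<=??<>53@##"]))
```

The more reads are scanned, the less likely the ambiguous case becomes.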

NGS Based Variant Calling From sequence to variant: Analysis flow

- Adapter Trimming: Remove artificial sequences
- Quality-Trimming: Remove low quality sequences to improve specificity
- Generate QC Reports: Visual inspection of main quality parameters
- Read Mapping: Place reads on the reference genome (BWA)
- Optimize Mapping: Remove Duplicate reads (picard), recalibrate base quality scores (GATK), realign around indels (GATK)
- Call and Annotate Variants: Call variants (GATK) and annotate using ANNOVAR, and snpEff (VariantDB)

Adapter Trimming Pre - Processing

[Figure: read layout showing Sequence Read 1, Sequence Read 2 and Sequence Barcode]

Scan all reads for the presence of artificial sequence & remove it from the reads.
Note: Adapters are present when length(targeted fragment) < read_length
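A minimal Python sketch of the idea: look for the adapter inside the read, or for a prefix of the adapter at the 3' end, and cut the read there. It assumes exact matching only (real trimmers tolerate mismatches), and the function name, the `min_overlap` parameter and the adapter string in the usage line are assumptions made for this example.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove the adapter and everything after it from the 3' end of a read."""
    # Case 1: the complete adapter is inside the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends partway through the adapter.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter found


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, adjust to the library kit
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER))
```

The `min_overlap` threshold is a trade-off: a lower value removes more adapter remnants but also trims more genuine bases that happen to match by chance.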

Pre - Processing

Quality-Trimming

Low quality leads to high error rates (cfr Phred Score)
=> Due to chemical degradation, 3’ ends have a lower quality
=> We want a limit of 1 error in 1000 positions
=> Trim everything on 3’ end with quality < 30
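The trimming rule (drop bases from the 3' end while their quality is below 30) can be sketched in Python as follows. This is the simplest possible version, assuming phred + 33 encoding; production trimmers typically use more robust windowed or running-sum criteria. The function name is chosen for illustration.

```python
def quality_trim_3prime(sequence: str, quality: str,
                        min_phred: int = 30, offset: int = 33) -> tuple[str, str]:
    """Strip trailing bases whose Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# "I" decodes to 40, "#" to 2 and "5" to 20 with offset 33.
print(quality_trim_3prime("ACGTACGT", "IIIII##5"))
```

Trimming stops at the first base from the 3' end that passes the threshold, so low-quality bases in the middle of a read are kept.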

Pre - Processing

Quality should improve after trimming

Generate QC Reports

Pre - Processing

Base composition should be roughly 25% each for G, C, T and A at every position in the read

Generate QC Reports

Good run Failed Run
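The base composition check in such a QC report boils down to counting bases per read position. The Python sketch below shows one way to compute it; the function name is hypothetical and the four reads are invented toy data.

```python
from collections import Counter


def base_composition(reads: list[str]) -> list[dict[str, float]]:
    """Fraction of A, C, G and T at each read position."""
    longest = max(len(read) for read in reads)
    composition = []
    for position in range(longest):
        counts = Counter(read[position] for read in reads if position < len(read))
        total = sum(counts.values())  # includes N or other symbols
        composition.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return composition


for position, fractions in enumerate(base_composition(["ACGT", "ACGA", "TCGT", "GCGT"])):
    print(position, fractions)
```

Positions where one base dominates across many reads are the ones to inspect more closely.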

Read Mapping Sequence – To – Variant

NGS Data analysis From sequence to variant: Analysis flow

Burrows-Wheeler Transformation:
- Highly efficient method to scan a string for substring matches
- Principle: Build Prefix Trie, scan top-down using reverse search.
  1. Generate all cyclic permutations (rotations) of the string
  2. Sort the permuted strings
  3. Last column = Burrows-Wheeler Transformation of the string
  4. Build prefix trie from BWT
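Steps 1 to 3 and the reverse (backward) search can be tried on a toy string in Python. This is a teaching sketch, not how BWA is implemented: it builds all rotations in memory and counts characters naively, which only works for short strings. The function names and the "$" terminator are conventions chosen for this example.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    padded = text + terminator
    rotations = sorted(padded[i:] + padded[:i] for i in range(len(padded)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string: str, pattern: str) -> int:
    """Count matches of pattern in the original text by backward search."""
    # first[c]: row of the first sorted rotation that starts with c
    first: dict[str, int] = {}
    for row, char in enumerate(sorted(bwt_string)):
        first.setdefault(char, row)
    low, high = 0, len(bwt_string)
    for char in reversed(pattern):  # scan the pattern from its last character
        if char not in first:
            return 0
        low = first[char] + bwt_string.count(char, 0, low)
        high = first[char] + bwt_string.count(char, 0, high)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed, count_occurrences(transformed, "ana"))
```

The search never reconstructs the original text: each pattern character only narrows a range of rows in the sorted rotation table, which is what makes the approach efficient on a full genome when the counts are precomputed.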

Pre - Processing

Generate QC Reports Insert Size

Pre - Processing

Generate QC Reports Capture Efficiency

Optimize Mapping Sequence – To – Variant

- Remove Duplicate reads (picard)
  => Reduce computational time
  => Reduce amplification bias
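The principle behind duplicate removal can be sketched in Python: reads that map to the same position and strand are assumed to be copies of one fragment, and only the best one is kept. This is a deliberate simplification and not Picard's actual algorithm; the `Alignment` fields and the "keep the highest mean quality" rule are assumptions for this example.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    start: int           # 5' mapping position of the read
    strand: str          # "+" or "-"
    mean_quality: float


def remove_duplicates(alignments: list[Alignment]) -> list[Alignment]:
    """Keep one alignment per (chrom, start, strand), preferring higher quality."""
    best: dict[tuple[str, int, str], Alignment] = {}
    for alignment in alignments:
        key = (alignment.chrom, alignment.start, alignment.strand)
        if key not in best or alignment.mean_quality > best[key].mean_quality:
            best[key] = alignment
    return list(best.values())


reads = [
    Alignment("r1", "chr1", 100, "+", 31.0),
    Alignment("r2", "chr1", 100, "+", 35.5),  # duplicate of r1, better quality
    Alignment("r3", "chr1", 100, "-", 30.0),  # other strand, not a duplicate
]
print([alignment.name for alignment in remove_duplicates(reads)])
```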

Optimize Mapping Sequence – To – Variant

- Realign around indels (GATK)
  => InDels are hard to align
  => P(>1 SNPs) < P(1 indel): several nearby SNPs are less likely than a single indel

If a locus shows both an InDel AND multiple SNPs
  => Replace SNPs by one InDel
  => Reduction of false positives

Call And Annotate Variants Sequence – To – Variant

- Call Variants (GATK)
  - Search for positions with statistically significant evidence for a non-reference nucleotide
  - Take into account: base-quality, position in read, strand bias, ...
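To make "statistically significant evidence" concrete, here is a toy Python sketch: it asks how likely the observed number of non-reference bases would be if they were all sequencing errors. This is a plain binomial test and not GATK's model, which works with genotype likelihoods and the additional covariates listed above; the error rate and the significance threshold are assumptions.

```python
from math import comb


def p_value_errors_only(depth: int, alt_count: int, error_rate: float = 0.001) -> float:
    """P(at least alt_count non-reference bases) if every one is an error."""
    return sum(
        comb(depth, k) * error_rate ** k * (1 - error_rate) ** (depth - k)
        for k in range(alt_count, depth + 1)
    )


def looks_like_variant(depth: int, alt_count: int,
                       error_rate: float = 0.001, alpha: float = 1e-6) -> bool:
    """Flag a position when errors alone are a very unlikely explanation."""
    return p_value_errors_only(depth, alt_count, error_rate) < alpha


print(looks_like_variant(depth=40, alt_count=20))  # half of the reads differ
print(looks_like_variant(depth=40, alt_count=1))   # a single deviating read
```

With an error rate of 0.001 (Q30), a single deviating base in 40 reads is easily explained by chance, while 20 out of 40 is not.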

Call And Annotate Variants Sequence – To – Variant

Annotate Variants (ANNOVAR, snpEff, ...)
- Add information to the variant to ease interpretation
  - Effect on Gene transcription (RefSeq, Ensembl, UCSC)
  - Quality parameters (GATK)
  - Occurrence in control populations (dbSNP, ESP, HapMap, 1KG, ...)
  - Known pathogenic variations (dbSNP, OMIM, ...)
  - Effect on gene function (PolyPhen, MutationTaster, Sift, ...)
  - ...

NGS Based Variant Calling From variant to knowledge: Interpretation flow

Step 1 : Quality Filtering: GATK Variant Recalibration

“The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact.”

“The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.”

1. Train model on known variants (both positive and negative)
2. Apply model to experimental data

Step 2 : Select an inheritance model

De Novo : Variant not present in either parent
Dominant : Variant present in affected parent
Recessive : Variant homozygous in patient, heterozygous in both parents
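The three inheritance models translate directly into genotype rules for a patient and both parents. The Python sketch below encodes each genotype as the number of variant alleles (0, 1 or 2); the function name, the encoding and the `*_affected` flags are assumptions made for this example.

```python
def matching_models(patient: int, mother: int, father: int,
                    mother_affected: bool = False,
                    father_affected: bool = False) -> list[str]:
    """Return the inheritance models a variant is compatible with."""
    models = []
    if patient > 0 and mother == 0 and father == 0:
        models.append("de novo")    # not present in either parent
    if patient > 0 and ((mother_affected and mother > 0)
                        or (father_affected and father > 0)):
        models.append("dominant")   # present in an affected parent
    if patient == 2 and mother == 1 and father == 1:
        models.append("recessive")  # homozygous in patient, both parents carriers
    return models


print(matching_models(patient=1, mother=0, father=0))
print(matching_models(patient=2, mother=1, father=1))
print(matching_models(patient=1, mother=1, father=0, mother_affected=True))
```

A variant that matches none of the selected models can be filtered out before the effect on gene function is assessed.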

Step 3 : Effect on gene function (~ from high to low)

- Variant causes gain/loss of a stop/start codon?
- Variant causes aberrant splicing of the transcript?
- Variant replaces a highly conserved nucleotide/amino acid?
- Variant replaces an amino acid, and is not reported in control populations?
- Variant can modify binding of regulatory elements?
- ...

Extended annotation is critical

Manual inspection of > 20.000 variants/sample is impossible: automation is needed.

Final Remarks
Reducing Computational complexity: Web-Tools

Sequence-to-Variants: Galaxy
- A website offering an easy way to run complete pipelines
- No programming skills needed, very useful for dynamic analysis
- Support for almost all types of analysis
  - Variant Calling
  - RNA seq : Expression / transcript identification
  - MetaGenomics
  - ChIP-seq
- Many organisms available by default (on main servers)
- New organisms can be added on request (on Biomina Servers)

Public Server: http://www.usegalaxy.org
Biomina Server: http://www.biomina.be/apps/galaxy

Variant Interpretation: VariantDB
- Extensive annotation
- Flexible filtering options
- Automatic updates
- Multiple output formats:
  - online (tabular)
  - offline (CSV)
  - API (JSON)

Final Remarks
Reducing Computational complexity: Future ?

Future NGS assays will be:
- Real-Time
- On-Site
- Low-Cost
- ....
