Variant Calling Pipeline - UT Southwestern · Variant Calling Pipeline Erika Villa Bioinformatics...

Post on 17-May-2020

Variant Calling Pipeline

Erika Villa

Bioinformatics Core Facility

10/17/2018

Genome

A genome is the entire set of genetic material for an organism: the blueprint of life that contains the information to grow, develop, survive and reproduce.

The human genome

~3 billion base pairs of DNA across 23 pairs of chromosomes.

~20,000 protein coding genes

No individuals are genetically identical

But we are more similar than we are different

More than 99 percent of the human genome is the same in all people.

Differences in less than 1 percent of our genome account for the vast diversity in humans across the globe.

Projects that give us insight about human differences

2015: The 1000 Genomes Project found that a typical individual differs from the reference at 4.1-5 million sites (~20 million bp).

2017: dbSNP lists 324 million variants for humans.

Exome

The exome is the subset of the genome composed only of exons. Exons are the coding regions of a gene.

The exons of all our genes make up approximately 1.5% of our genome. Exonic regions are thought to harbor ~85% of the mutations with large effects on disease. Some important DNA sequences lie outside the exome, in noncoding DNA, and have important biological functions, such as regulating the coding regions of the genome.

Sequencing Approaches

Whole Genome (WGS), Whole Exome (WES), Target Gene Panel

Target Gene Panel

• A gene panel is a gene subset of the exome

• It contains a subset of exons for a select group of genes

• Gene Panels are useful if you need to do deep sequencing > 1000X

• Many clinical tumor tests use gene panels.

What portion of the genome do you want to sequence?

Pros and Cons of WGS vs WES

Whole Genome

• Cost: ~$1300 for 30-40X coverage

• Detectable variants: all variants possible

• Pros: sequence can better predict large structural changes, including CNVs, large indels, etc.

• Pros: more uniform coverage of the protein-coding regions

Whole Exome

• Cost: ~$500 for 100X coverage

• Detectable variants: restricted to exonic regions

• In somatic/mosaic conditions you might need > 1000X coverage.

• Pros: generates less data to store and analyze
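A rough way to see why WES generates less data than WGS is to multiply the target size by the desired depth. The sketch below assumes the ~3 billion bp genome and ~1.5% exome fraction quoted earlier in this talk. The function name is illustrative, and real runs need extra sequencing to make up for off-target reads, duplicates and uneven coverage.

```python
GENOME_BP = 3_000_000_000          # ~3 billion bp (figure quoted in this talk)
EXOME_BP = GENOME_BP * 15 // 1000  # ~1.5% of the genome, i.e. ~45 Mb


def bases_needed(target_bp: int, depth: int) -> int:
    """Sequenced bases needed for a mean depth, assuming every base lands on target."""
    return target_bp * depth


wgs = bases_needed(GENOME_BP, 30)   # whole genome at 30X
wes = bases_needed(EXOME_BP, 100)   # whole exome at 100X

print(f"WGS 30X : {wgs / 1e9:.1f} Gb")
print(f"WES 100X: {wes / 1e9:.1f} Gb")
print(f"WGS needs about {wgs // wes}x more sequenced bases")
```

Even at more than three times the depth, the exome run needs far fewer bases, which is where the storage and analysis savings come from.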

Human Reference Genome

To make assertions about genetic variation we rely on a reference

Reference Genome: A representative example of a species' genetic makeup

Curated by Genome Reference Consortium (GRC)

• GRCh37/hg19: 2009, derived from thirteen anonymous volunteers from Buffalo, NY.

• GRCh38/hg38: Dec 2013, includes ALT contigs. More representative of the population.

Gaps per release: 2001 (150,000 gaps), 2009 (250 gaps), 2013 (12 gaps)

Build 38 was a significant ‘upgrade’, and due to its accuracy and reputation it is the ‘go to’ reference for many large scale projects

Catalogs of Human Variation

HapMap

• The International HapMap Project: SNP genotyping arrays to develop a haplotype map (HapMap) of the human genome.

1000G

• The 1000 Genomes Project sequenced > 1000 genomes in pure and admixed populations to identify variation in the human genome

ExAC

• ExAC collected the SNP and Indel calls in ~ 26K genomes/exomes and their prevalence in different populations

gnomAD

• The Genome Aggregation Database (gnomAD) aggregates and harmonizes exome and genome sequencing data from over 120K exomes and 15K genomes.

Types of variation in Genome

• Single Nucleotide Polymorphisms (SNPs or SNVs)

• Short Insertions/Deletions (Indels)

• Large Structural Variations (SVs)

[Figure: SNVs, INDELs and SVs shown against the reference sequence: substitutions, insertions, deletions, inversions]
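The small-variant categories above can be told apart by comparing the reference and alternate alleles. The sketch below assumes VCF-style REF/ALT strings. The function name and the "MNV" label for multi-base substitutions are illustrative choices, and large structural variants such as inversions need different handling that is not shown here.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "no change"
    if len(ref) == len(alt):
        # Same length: one substituted base is an SNV, several are an MNV.
        return "SNV" if len(ref) == 1 else "MNV"
    # Different lengths: bases were gained or lost relative to the reference.
    return "insertion" if len(alt) > len(ref) else "deletion"


for ref, alt in [("C", "T"), ("A", "ATT"), ("ATT", "A")]:
    print(ref, "->", alt, ":", classify_variant(ref, alt))
```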

Some SNPs of Interest

EXAMPLES

• Non-synonymous mutations

- Results in Amino Acid change
- Affects the Protein Sequence
- Types of non-synonymous mutations

* Missense

* Nonsense: also described as stop_gained

Diseases can be driven by various types of genetic alterations

Examine Variants and understand features

Type:        Original       Synonymous     Missense       Missense  Nonsense

Codon:       GAG            GAA            GAT            GTG       TAG

Amino acid:  Glutamic Acid  Glutamic Acid  Aspartic Acid  Valine    Stop codon
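The codon examples above can be reproduced with the standard genetic code. This is a minimal sketch: the function name and the returned labels are illustrative, and a real annotator also has to handle strand, reading frame and transcript choice.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label the protein-level effect of replacing one codon with another."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense (stop_gained)"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


for alt in ["GAA", "GAT", "GTG", "TAG"]:
    print("GAG ->", alt, CODON_TABLE[alt], classify_codon_change("GAG", alt))
```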

Features used in biological sequence annotation

Effects that we see in variants

In a gene? In an exon? Protein coding change? http://www.sequenceontology.org/

Structural Variants: variation in the structure of an organism's chromosome. Typically a structural variation affects a sequence length of about 1 kb to 3 Mb.

1 kb = 10^3 bp, 1 Mb = 10^6 bp, 1 Gb = 10^9 bp

Alterations in Genome

• A genetic disorder is a genetic problem caused by one or more abnormalities in the genome.

• A single-gene disorder is the result of a single mutated gene.

• Autosomal dominant disorders occur with only one mutated copy of the gene.

• Recessive disorders require that both copies of the gene are mutated.

• X-linked dominant disorders are caused by mutations in genes on the X chromosome.

• Mitochondrial inheritance, also known as maternal inheritance, applies to genes encoded by mitochondrial DNA.

Inherited Diseases

Complex Disease

• Complex diseases are caused by a combination of genetic, environmental, and lifestyle factors, most of which have not yet been identified.

• Some examples include Alzheimer's disease, scleroderma, asthma, Parkinson's disease, multiple sclerosis, osteoporosis, connective tissue diseases, kidney diseases, autoimmune diseases, etc

Somatic Mutations

Somatic Mutation Germline Mutation

Somatic mutation: An alteration in DNA that occurs after conception. Somatic mutations can occur in any of the cells of the body except the germ cells and therefore are not passed on to children. Can cause cancer or other diseases.

Somatic Disease

• Acquired diseases are caused by acquired mutations in a gene or group of genes that occur during a person's life.

• These include many cancers, as well as some forms of neurofibromatosis.

Mosaicism

• Mosaicism involves the presence of two or more populations of cells with different genotypes in one individual who has developed from a single fertilized egg.

• Intersex conditions can be caused by mosaicism where some cells in the body have XX and others XY chromosomes

• Other endogenous factors can also lead to mosaicism, including mobile elements, DNA polymerase slippage, and unbalanced chromosomal segregation.

• Exogenous factors include nicotine and UV radiation

Germline and Somatic Workflows for Variant Discovery

BICF and BioHPC

Alignment Workflow

First Step for Germline and Somatic Workflows

Alignment: BWA

Burrows-Wheeler Aligner

“BWA is carefully designed to achieve a good balance between performance and accuracy”

SE and PE reads

Difficulties: ambiguity caused by repeats and sequencing errors.

Human Reference Sequences

- GRCh37/hg19

- GRCh38

Other Organisms Reference Sequences

Available for e.g. Mouse (mm10/GRCm38)

Others not available

Alignment: Deduping

With or without UMI

Why are we so worried about sequence duplication?

• When DNA is sequenced, PCR is used to amplify the sequencing library to ensure that only DNA with “a known adapter” is sequenced.

• Since PCR has a small error rate, “early errors” can be amplified and could skew your results

• We remove duplicates to remove potential noise.
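The idea behind deduping, with or without UMIs, can be sketched as grouping reads that share the same alignment signature and keeping one representative per group. This is a simplified illustration, not the pipeline's actual implementation: the `Read` fields and the choice to keep the highest-quality read are assumptions, and production tools also consider mate positions and clipping.

```python
from typing import List, NamedTuple, Optional


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int            # alignment start position
    strand: str         # "+" or "-"
    mean_qual: float    # mean base quality, used to pick the representative
    umi: Optional[str] = None


def deduplicate(reads: List[Read], use_umi: bool = False) -> List[Read]:
    """Keep the best-quality read for each (chrom, pos, strand[, UMI]) group."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.umi if use_umi else None)
        if key not in best or read.mean_qual > best[key].mean_qual:
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, "+", 30.0, "AAA"),
    Read("r2", "chr1", 100, "+", 35.0, "AAA"),
    Read("r3", "chr1", 100, "+", 20.0, "CCC"),
    Read("r4", "chr1", 200, "-", 25.0, "AAA"),
]
print([r.name for r in deduplicate(reads)])                 # position only
print([r.name for r in deduplicate(reads, use_umi=True)])   # position + UMI
```

With UMIs, reads that start at the same position but carry different molecular tags are treated as distinct original molecules, so fewer true observations are discarded.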

Alignment: Indel Realignment

• Why does GATK need Indel Realignment?

• Sometimes, alignment algorithms align reads inconsistently, adding the alignment gaps to different places.

• Indel Realignment uses “known” gold standard indels to realign these gaps

Alignment Workflow: Recalibration

• Why does GATK need Base Recalibration?

• Every base has a quality score, and variant callers rely on these scores

• Quality scores are prone to different types of biases

• Base recalibration detects systematic errors made by the sequencer when it estimates the quality score of each base call
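Base quality scores are Phred-scaled, Q = -10 * log10(p), where p is the estimated probability that the base call is wrong. The sketch below shows that conversion and, as a deliberately simplified stand-in for recalibration, how an empirical quality could be estimated from observed mismatches. The `empirical_quality` helper and its +1/+2 pseudocount are assumptions for illustration, not GATK's actual model.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - offset for ch in qual_string]


def empirical_quality(mismatches: int, observations: int) -> float:
    """Phred-scaled mismatch rate with a small pseudocount to avoid log10(0)."""
    return error_prob_to_phred((mismatches + 1) / (observations + 2))


print(decode_fastq_quality("I5!"))
print(phred_to_error_prob(30))
# Bases reported as Q30 that mismatch ~1% of the time behave more like Q20.
print(round(empirical_quality(9, 998), 2))
```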

Germline Workflow

Germline Union VCF

Variant Callers

• Strelka2: https://github.com/Illumina/strelka

– Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Doruk Beyter, Peter Krusche, Christopher T Saunders. Strelka2: Fast and accurate variant calling for clinical sequencing applications.

• Speedseq: https://github.com/hall-lab/speedseq

– Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015 Oct;12(10):966-8.

• Platypus: http://www.well.ox.ac.uk/platypus

– Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SR; WGS500 Consortium, Wilkie AO, McVean G, Lunter G. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet. 2014 Aug;46(8):912-8.

• GATK: https://software.broadinstitute.org/gatk/

– Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.
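One way to picture a union of calls from several callers is a table keyed by variant that records which callers reported it. The sketch below is an assumption about how such a union could be assembled, not the workflow's actual code: the variant key (chrom, pos, ref, alt), the function name and the example positions are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def union_calls(calls_by_caller: Dict[str, Iterable[Variant]]) -> Dict[Variant, List[str]]:
    """Merge per-caller variant lists into one map of variant -> sorted caller names."""
    union = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            union[variant].add(caller)
    return {variant: sorted(callers) for variant, callers in union.items()}


# Hypothetical calls, for illustration only.
calls = {
    "strelka2": [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")],
    "gatk": [("chr1", 1000, "A", "G")],
    "platypus": [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "GA")],
}
for variant, callers in union_calls(calls).items():
    print(variant, callers, "callers:", len(callers))
```

Keeping the caller list alongside each variant makes it straightforward to apply a "called by 2+ callers" rule later.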

Recommended Filtering for Germline Testing

• Depth >20

• LOF or Missense (Coding Changes)

• Alt Read Ct >3

• Mutation Allele Frequency (MAF) >0.10

• If novel:

- Called by 2+ callers
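The germline criteria above can be written as a single predicate. This is a sketch under assumptions: the `GermlineCall` record and its field names are hypothetical, and "LOF"/"missense" stand in for whatever effect labels the annotation step produces.

```python
from dataclasses import dataclass

CODING_EFFECTS = {"LOF", "missense"}  # assumed effect labels


@dataclass
class GermlineCall:
    depth: int            # total read depth at the site
    alt_read_count: int   # reads supporting the alternate allele
    maf: float            # mutation allele frequency (alt reads / depth)
    effect: str           # predicted coding effect
    novel: bool           # True if absent from known-variant catalogs
    n_callers: int        # number of callers reporting the variant


def passes_germline_filter(call: GermlineCall) -> bool:
    """Apply the recommended germline thresholds listed above."""
    if call.depth <= 20 or call.alt_read_count <= 3 or call.maf <= 0.10:
        return False
    if call.effect not in CODING_EFFECTS:
        return False
    # Novel variants need support from at least two callers.
    return not call.novel or call.n_callers >= 2


known = GermlineCall(35, 12, 0.34, "missense", novel=False, n_callers=1)
novel_single = GermlineCall(35, 12, 0.34, "missense", novel=True, n_callers=1)
print(passes_germline_filter(known), passes_germline_filter(novel_single))
```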

Important Terminology to understand

Different tumor cells can show distinct morphological and phenotypic profiles, e.g. cell morphology and gene expression

Somatic Workflow

Somatic Variant Callers

• Shimmer: https://github.com/nhansen/Shimmer

– Hansen NF, Gartner JJ, Mei L, Samuels Y, Mullikin JC. Shimmer: detection of genetic alterations in tumors using next-generation sequence data. Bioinformatics. 2013 Jun 15;29(12):1498-503.

• Virmid: https://sourceforge.net/projects/virmid/

– Kim S, Jeong K, Bhutani K, Lee J, Patel A, Scott E, Nam H, Lee H, Gleeson JG, Bafna V. Virmid: accurate detection of somatic mutations with sample impurity inference. Genome Biol. 2013 Aug 29;14(8):R90

• VarScan: http://dkoboldt.github.io/varscan/

– Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012 Mar;22(3):568-76.

• Speedseq: https://github.com/hall-lab/speedseq

– Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015 Oct;12(10):966-8.

• MuTect: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php

– http://archive.broadinstitute.org/cancer/cga/mutect

Recommended Filtering for Somatic Mutations

• Depth >20

• LOF or Missense (Coding Changes)

• MAF(Normal) * 5 < MAF(Tumor)

• In COSMIC > 5 Subjects

- Tumor: Alt Read Ct > 3
- Tumor: MAF > 0.01

• Others

- Tumor: Alt Read Ct > 8
- Tumor: MAF > 0.05
- Tumor: Called by 2+ Callers
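The somatic criteria above use a two-tier rule: variants seen in more than 5 COSMIC subjects get lenient thresholds, everything else gets stricter ones. The sketch below encodes that logic under assumptions: the `SomaticCall` record and its field names are hypothetical, and a single `depth` value is used because the slide does not say whether the depth cutoff applies to tumor, normal or both.

```python
from dataclasses import dataclass

CODING_EFFECTS = {"LOF", "missense"}  # assumed effect labels


@dataclass
class SomaticCall:
    depth: int                  # read depth at the site (assumed single value)
    effect: str                 # predicted coding effect
    normal_maf: float           # mutation allele frequency in the normal sample
    tumor_maf: float            # mutation allele frequency in the tumor sample
    tumor_alt_read_count: int   # tumor reads supporting the alternate allele
    cosmic_subjects: int        # number of COSMIC subjects with this variant
    n_callers: int              # number of callers reporting the variant


def passes_somatic_filter(call: SomaticCall) -> bool:
    """Apply the recommended somatic thresholds listed above."""
    if call.depth <= 20 or call.effect not in CODING_EFFECTS:
        return False
    if not call.normal_maf * 5 < call.tumor_maf:
        return False
    if call.cosmic_subjects > 5:
        # Known COSMIC variant: lenient thresholds.
        return call.tumor_alt_read_count > 3 and call.tumor_maf > 0.01
    # All other variants: stricter thresholds plus multi-caller support.
    return (
        call.tumor_alt_read_count > 8
        and call.tumor_maf > 0.05
        and call.n_callers >= 2
    )


hotspot = SomaticCall(100, "missense", 0.0, 0.02, 4, cosmic_subjects=50, n_callers=1)
print(passes_somatic_filter(hotspot))
```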

Annotations

• ClinVar: ClinVar is a freely accessible, public archive that aggregates information about genomic variation and its relationship to human health.

• GWAS Catalog: GWAS Catalog is a quality controlled, manually curated, literature derived collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10^-5.

• Decipher: The DECIPHER database contains data from 27,302 patients who have given consent to broad data sharing; DECIPHER also supports more limited sharing via consortia. It is used by the clinical community to share and compare phenotypic and genotypic data.

Disease Studies

Cancer Datasets and Annotation

• Clinical Interpretation of Variants in Cancer (CIViC)

• Catalogue of Somatic Mutations in Cancer (COSMIC)

- Gene Fusions
- Gene Census
- Curated Genes
- Drug Resistance (so far 9 genes)
- Genome Wide Screens

• The Cancer Genome Atlas (TCGA)

- Tons of data: RNASeq, CNV, WES, WGS, etc.

Astrocyte - BioHPC Workflow Platform

astrocyte.biohpc.swmed.edu

or

portal.biohpc.swmed.edu: Cloud Services -> Astrocyte Workflow Platform

Standardized Workflow

Simple Web Forms

Online documentation & results visualization

Workflows run on HPC cluster without developer or user needing cluster knowledge

Bioinformatics Core Facility (BICF)

BICF provides bioinformatics, statistical and data management support to researchers on campus.

BICF functions as a conduit between bioinformatics research programs and the clinical and basic science research community at UTSW.

Please email bicf@utsouthwestern.edu with questions or comments about the workflow.

Create New Project

Add Data To Your Project

Adding Data To Your Project

# For NGS experiments, this is recommended

Data to Import

• Design File: tab-delimited .txt file with sample names, Family/Group names, fastq file names

• Fastq Files: One or two fastq files per sample

• Capture Bed file: tab delimited file with target capture region in bed format. (Must contain at least 3 columns specifying chromosome, chromosome start position and chromosome end position)
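Before uploading a capture file it can help to check that each line has the three required columns. The sketch below is a hypothetical pre-flight check, not part of the workflow. It assumes the usual BED convention of integer, 0-based start and end coordinates, and it additionally assumes that a capture region must have end greater than start.

```python
from typing import Tuple


def check_bed_line(line: str) -> Tuple[str, int, int]:
    """Parse one BED line and return (chrom, start, end), raising ValueError if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-delimited columns, got {len(fields)}")
    chrom = fields[0]
    start, end = int(fields[1]), int(fields[2])  # int() raises ValueError on non-numbers
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval {start}-{end} on {chrom}")
    return chrom, start, end


# Extra columns (for example a region name) are allowed and ignored here.
print(check_bed_line("chr1\t100\t200\tregionA\n"))
```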

Make A Design File

FamilyID: This ID will be used to call samples in batch

SampleID: This ID will be used to name all workflow produced files, e.g. S0001 will produce S0001.bam

FqR1: Name of the fastq file R1 (not full path)

FqR2: Name of the fastq file R2 (not full path)

Rules for Making Design File

• Use tab as delimiter

- Excel save as “Text (tab delimited)”

• If no SubjectID, use same number/character for all rows

• If no FqR2, leave them empty

• For all contents, no “-”

• For all contents, no spaces

• Column names MUST be exactly the same as documented
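The design file rules above are easy to check before uploading. The sketch below is a hypothetical helper, not part of Astrocyte. It assumes the four column names shown on the "Make A Design File" slide are the required ones, and it only enforces the tab delimiter, exact column names, and the no-hyphen and no-space rules.

```python
import csv
import io
from typing import List

REQUIRED_COLUMNS = ["FamilyID", "SampleID", "FqR1", "FqR2"]  # names from the slide


def validate_design(text: str) -> List[str]:
    """Return a list of problems found in a tab-delimited design file (empty if none)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    errors = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in header:
            value = row.get(column) or ""  # an empty FqR2 is allowed
            if "-" in value:
                errors.append(f"row {row_number}, {column}: contains '-'")
            if " " in value:
                errors.append(f"row {row_number}, {column}: contains a space")
    return errors


design = (
    "FamilyID\tSampleID\tFqR1\tFqR2\n"
    "F1\tS0001\tS0001.R1.fastq.gz\tS0001.R2.fastq.gz\n"
    "F1\tS0002\tS0002.R1.fastq.gz\t\n"
)
print(validate_design(design))
```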

Run Workflow in this Project

My Project: Select Project

SELECT YOUR FILES

GM12878.R1.fastq.gz
GM12878.R2.fastq.gz
mydesignfile.txt
mycapturefile.bed

Select your data files, set up the workflow and submit

Project is Queued/Running/Complete


Keep Trying: My first attempt below

Make sure you have all the appropriate files selected

BICF Help Desk: Email: bicf@utsouthwestern.edu Hours: 10-11am Daily Location: E4.380

Timeline of Germline workflow

One Sample

Key Files for Germline Pipeline

• VCF file: SNPs/Indels for each sample

- SampleID.germline.vcf.gz

• Coverage Histogram for each sample

- SampleID.coverage_histogram.png

• Cumulative Distribution Plot for all samples

- coverage_cdf.png

• QC for all samples

- SampleID.sequence.stats.txt

• Structural Variants (unfiltered)

- SampleID.sssv.sv.vcf.gz.annot.txt

• Copy Number for each sample

- SampleID.cnvcalls.txt

Key Files Somatic Mutation Pipeline

• VCF file — SNPs/Indels for each sample

• FamilyID.somatic.vcf.gz

• Match Check File

• FamilyID_matched.txt

• QC for tumor normal pairs

• FamilyID.sequence.stats.txt

BAM files can be viewed on http://newbam.iobio.io/

Reference: same as analysis reference

VCF files can be viewed on http://vcf.iobio.io

Thank you

Questions?