Mutation Analysis

Sebastian Bauer

Institut für Medizinische Genetik, Charité Universitätsmedizin Berlin

2012/03/22

Workflow for Mutation Analysis

Raw Data Generation: Sample preparation and sequencing

Raw Data Analysis: Base calling

Whole Genome Mapping: Alignment to a reference genome

Variant Calling: Detection of genetic variation

Annotation: Linking variants to biological information

Raw Data Generation

Prepare samples

Then sequence

Output is vendor-specific raw data

Raw Data Analysis: Base Calling

Transform raw data into sequences of bases

The exact procedure depends on the sequencing platform used

Most platforms additionally report a quality score for each base, which can be transformed into a Phred score

$Q_{\mathrm{Phred}} = -10 \log_{10} P(\mathrm{error})$

Example: $Q_{\mathrm{Phred}} = 20 \Leftrightarrow P(\mathrm{error}) = 1\%$
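The Phred relation can be checked numerically. The following is a minimal sketch; the function names `phred_to_error` and `error_to_phred` are illustrative and do not come from any particular tool.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred score: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P(error))."""
    return -10 * math.log10(p)


# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(f"Q = {q}: P(error) = {phred_to_error(q):g}")
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability, so Q = 20 is 1% and Q = 30 is 0.1%.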

One Sequence Entry (Read) in the Output FASTQ File

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
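A read entry like this one can be decoded with a few lines of Python. This is only a sketch: it assumes the Phred+33 quality encoding (ASCII code minus 33), which the slide does not state, and `parse_fastq_record` is a hypothetical helper name.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read id, bases, quality scores)."""
    header, bases, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], bases, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, bases, scores = parse_fastq_record(record)
print(read_id, len(bases), scores[:5])
```

Under this assumption the first quality character `!` decodes to a score of 0, the lowest possible value.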

Whole Genome Mapping / Aligning

Current methods assume mapping to a reference genome

This makes it possible to find variants with known disease associations, but also new suspects

Most short-read mappers use hash tables or data structures based on the Burrows-Wheeler transform

Output is some statistics and a SAM or BAM file
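To illustrate the Burrows-Wheeler idea mentioned above, here is a deliberately naive sketch: it builds the transform by sorting all rotations and counts exact pattern matches by backward search. Real mappers use compressed indexes and tolerate mismatches; the function names and the `$` sentinel are choices made for this example only.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for small inputs)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search on the BWT."""
    counts = Counter(bwt_text)
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for char in sorted(counts):
        first[char] = total
        total += counts[char]
    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted suffixes
    for char in reversed(pattern):
        if char not in first:
            return 0
        # Occurrences of char in bwt_text[:k], computed by a naive scan.
        lo = first[char] + bwt_text[:lo].count(char)
        hi = first[char] + bwt_text[:hi].count(char)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("TGCATTGCGTAGGC")
print(count_occurrences(index, "GC"))
```

The pattern is processed from its last character to its first, narrowing the range of suffixes that start with the part of the pattern seen so far.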

Variant Calling: Genetic Variation

Variant Calling: Identify regions that differ from the reference

Single nucleotide variants (SNVs)
TGCATTGCGTAGGC
TGCATTCCGTAGGC

Short indels (= insertion/deletion)
TGCATT---TAGGC
TGCATTCCGTAGGC

Microsatellites
TGCTCATCATCATCAGC
TGCTCATCA------GC

Minisatellites: ≤ 100 bp

Copy number variations (CNVs: large deletions, duplications, inversions): ≥ 1000 bp
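The SNV and indel examples above can be read off a pairwise alignment mechanically. The sketch below assumes that the first row of each example is the sample and the second row the reference (the slide does not say which is which) and that gaps are written as `-`; `list_differences` is a hypothetical helper.

```python
def list_differences(sample: str, reference: str):
    """Report SNVs and gap runs between two aligned sequences of equal length."""
    if len(sample) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    i = 0
    while i < len(sample):
        s, r = sample[i], reference[i]
        if s == r:
            i += 1
        elif s == "-" or r == "-":
            # Extend over the whole run of gaps on the same side.
            gapped = sample if s == "-" else reference
            j = i
            while j < len(gapped) and gapped[j] == "-":
                j += 1
            kind = "deletion" if s == "-" else "insertion"
            variants.append((kind, i, j - i))  # (type, start, length)
            i = j
        else:
            variants.append(("SNV", i, f"{r}>{s}"))  # (type, position, change)
            i += 1
    return variants


print(list_differences("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"))
print(list_differences("TGCATT---TAGGC", "TGCATTCCGTAGGC"))
```

Positions are 0-based offsets into the alignment, not genome coordinates.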

Variant Calling: Genotype

Easy Approach

Count alleles at each column X in the pileup and use cutoff rules:

1. Keep only bases with a Phred score ($Q_{\mathrm{Phred}}$) of at least 20

2. Call a genotype heterozygous if the non-reference allele fraction is between 20% and 80%, otherwise homozygous

Works reasonably well when coverage is > 20 (Nielsen et al. 2011)
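The two cutoff rules can be sketched as follows. The representation of a pileup column as a list of `(base, quality)` pairs and the function name `call_genotype` are assumptions for this example; if several non-reference alleles are present, only the most frequent one is considered, which is a simplification.

```python
from collections import Counter


def call_genotype(column, ref, min_q=20, low=0.2, high=0.8):
    """Call a genotype at one pileup column with simple cutoff rules.

    column: list of (base, phred_quality) pairs observed at this position.
    """
    # Rule 1: keep only bases with Phred quality >= min_q.
    bases = [base for base, quality in column if quality >= min_q]
    if not bases:
        return None  # no usable evidence at this position
    non_ref = {b: n for b, n in Counter(bases).items() if b != ref}
    if not non_ref:
        return ref + ref
    alt, alt_count = max(non_ref.items(), key=lambda item: item[1])
    fraction = alt_count / len(bases)
    # Rule 2: heterozygous between the cutoffs, otherwise homozygous.
    if low <= fraction <= high:
        return "".join(sorted(ref + alt))
    return alt + alt if fraction > high else ref + ref


column = [("G", 30)] * 10 + [("T", 30)] * 10
print(call_genotype(column, ref="G"))
```

With low coverage the observed allele fraction is noisy, which is why such fixed cutoffs are only reliable at higher coverage.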

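More elaborate approaches are based on probabilistic frameworks

$P(G \mid X) \propto P(X \mid G)\,P(G) = \left[\prod_i P(X_i \mid G)\right] P(G), \quad G \in \{\mathrm{A}, \mathrm{C}, \mathrm{T}, \mathrm{G}\}$

The likelihood $P(X \mid G)$ is computed from the quality scores, with one factor $P(X_i \mid G)$ for each entry i of the column

The prior $P(G)$ allows data-independent prior knowledge to be specified

The posterior $0 < P(G \mid X) < 1$ gives both the genotype call and its confidence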
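The posterior can be computed directly for a small example. The sketch below makes several assumptions that go beyond the slide: genotypes are diploid (unordered pairs of bases), each read samples one of the two alleles with probability 1/2, sequencing errors are spread evenly over the three other bases, and the prior is flat unless one is passed in. It is not the model of any specific caller.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# Ten unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]


def base_likelihood(observed, allele, error):
    """P(observed base | true allele), errors spread evenly over 3 bases."""
    return 1 - error if observed == allele else error / 3


def genotype_posteriors(column, prior=None):
    """Posterior P(G | X) for every genotype; column holds (base, phred) pairs."""
    if prior is None:
        prior = {g: 1 / len(GENOTYPES) for g in GENOTYPES}  # flat prior
    log_post = {}
    for g in GENOTYPES:
        log_p = math.log(prior[g])  # prior entries must be positive
        for base, quality in column:
            # Clamp so that a quality of 0 is uninformative, not impossible.
            error = min(10 ** (-quality / 10), 0.75)
            likelihood = (0.5 * base_likelihood(base, g[0], error)
                          + 0.5 * base_likelihood(base, g[1], error))
            log_p += math.log(likelihood)
        log_post[g] = log_p
    # Normalise in log space to avoid numerical underflow.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


column = [("G", 30)] * 10 + [("T", 30)] * 10
posteriors = genotype_posteriors(column)
best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])
```

Passing a different `prior` dictionary, for example one informed by a database entry, changes the posterior without touching the likelihood part.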

Variant Calling: Integrating Prior Knowledge

Single Sample Prior for a Given Position X

Suppose that a G/T polymorphism is reported in dbSNP. Then the prior over genotypes (G and T denote the homozygous genotypes) is:

G       G       T       GT       Other combinations
P(G)    0.454   0.454   0.0909   $< 10^{-4}$

GATK multi-sample calling uses allele frequencies estimated from larger sample sets, combined with Hardy-Weinberg equilibrium

GATK-Beagle additionally uses linkage disequilibrium data
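Under Hardy-Weinberg equilibrium, genotype priors follow from an allele frequency: with reference frequency p and alternative frequency q = 1 - p, the genotype probabilities are p², 2pq and q². The sketch below shows only this textbook relation; it is not GATK's implementation, and the function name and the example frequency of 0.1 are made up for illustration.

```python
def hardy_weinberg_prior(ref, alt, alt_frequency):
    """Genotype priors for a biallelic site under Hardy-Weinberg equilibrium."""
    if not 0 <= alt_frequency <= 1:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = alt_frequency
    p = 1 - q
    return {
        ref + ref: p * p,                      # homozygous reference
        "".join(sorted(ref + alt)): 2 * p * q,  # heterozygous
        alt + alt: q * q,                      # homozygous alternative
    }


prior = hardy_weinberg_prior("G", "T", alt_frequency=0.1)
for genotype, probability in prior.items():
    print(genotype, round(probability, 4))
```

The three probabilities always sum to 1, because (p + q)² = 1.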

Variant Calling: Other Examples of Extensions

(taken from the Illumina Website)

But all of this is of little use for rare mutations

Annotation

Between 3 and 5 million SNVs per individual

Only a few have a functional impact

Separating these from the rest is a challenge for bioinformatics

Many tools use supervised learning approaches (remember my last talk)

SNV features

cSNVs (protein-coding)
- Amino acid residue substitution properties
- Evolutionary history of the amino acid position
- Sequence-function relationship
- Structure-function relationship

rSNVs (regulatory)
- Transcription
- Pre-mRNA splicing
- MicroRNA binding
- Post-translational modification sites

Annotation: Protein-Sequence-Based

(Cooper et al.)

Annotation: DNA-Sequence-Based

(Cooper et al.)

Flow chart for informed use of SNV function prediction tools

(Cline et al.)

Final

Thanks for your attention!

References

Nielsen et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics (2011)

Cline et al. Using bioinformatics to predict the functional impact of SNVs. Bioinformatics (2011)

Cooper et al. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics (2011)