Mutation Analysiscompbio.charite.de/tl_files/mutation-analysis-2012.pdf · 2015. 11. 27. ·...

Mutation Analysis

Sebastian Bauer

Institut für Medizinische Genetik, Charité Universitätsmedizin Berlin

2012/03/22

Workflow for Mutation Analysis

Raw Data Generation: Sample preparation and sequencing

Raw Data Analysis: Base calling

Whole Genome Mapping: Alignment to a reference genome

Variant Calling: Detection of genetic variation

Annotation: Linking variants to biological information

Raw Data Generation

Prepare samples

Then sequence

Output is vendor-specific raw data

Raw Data Analysis: Base Calling

Transform raw data into sequences of bases

Exact procedure depends on the sequencing platform used

Most platforms additionally report a quality score for each base, which can be transformed into a Phred score

$Q_{\text{Phred}} = -10 \log_{10} P(\text{error})$

Example: $Q_{\text{Phred}} = 20 \Leftrightarrow P(\text{error}) = 1\%$
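The Phred definition above can be checked numerically. The following is a minimal sketch; the function names are illustrative and not taken from any sequencing toolkit.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Return Q = -10 * log10(P(error)) for an error probability in (0, 1]."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Invert the definition: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Print the error probability for a few common quality scores.
for q in (10, 20, 30, 40):
    print(f"Q = {q}: P(error) = {error_from_phred(q):.4%}")
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability, which is why a cutoff of 20 (1% error) is a common filter.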

One Sequence Entry (Read) in the Output Fastq File

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
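To make the four-line record concrete, here is a small sketch that parses the entry above and decodes its quality string. It assumes the Phred+33 (Sanger) encoding, in which `!` stands for a score of 0; the slide does not state the encoding, and other platforms have used different offsets. The function name is illustrative.

```python
from typing import List, Tuple


def parse_fastq_record(lines: List[str]) -> Tuple[str, str, List[int]]:
    """Parse one four-line FASTQ record into (read id, bases, Phred scores)."""
    header, bases, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so the ASCII code minus 33 is the score.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], bases, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, bases, scores = parse_fastq_record(record)
print(read_id, len(bases), min(scores), max(scores))
```

Under this assumption the first base of the example read has a score of 0, and the later bases reach scores above 30.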

Whole Genome Mapping / Aligning

Current methods assume mapping to a reference genome

Allows finding variants with known disease associations as well as new candidates

Most short-read mappers use hash tables or data structures based on the Burrows-Wheeler transform

Output consists of summary statistics and a SAM or BAM file
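The hash-based idea can be illustrated with a toy k-mer index. This is only a sketch of seeding and verification with exact matching; real mappers also tolerate mismatches and gaps, and none of the names or the toy reference below come from a specific tool.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to its 0-based start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read_exact(read: str, reference: str,
                   index: Dict[str, List[int]], k: int) -> List[int]:
    """Return all reference positions where the whole read matches exactly.

    The first k bases of the read act as the seed; every seed hit is then
    verified by comparing the full read against the reference.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "TGCATTCCGTAGGCTGCATT"  # hypothetical toy reference
index = build_kmer_index(reference, k=4)
print(map_read_exact("CCGTAGG", reference, index, k=4))  # prints [6]
print(map_read_exact("TGCATT", reference, index, k=4))   # prints [0, 14]
```

The second read maps to two positions, which is the kind of ambiguity that mapping quality values in SAM/BAM output are meant to express.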

Variant Calling: Genetic Variation

Variant Calling: Identify regions that differ from the reference

Single nucleotide variants (SNVs)
TGCATTGCGTAGGC
TGCATTCCGTAGGC

Short indels (= insertion/deletion)
TGCATT---TAGGC
TGCATTCCGTAGGC

Microsatellites
TGCTCATCATCATCAGC
TGCTCATCA------GC

Minisatellites

≤ 100 bp

Copy number variations (CNVs; large deletions, duplications, inversions)

≥ 1000 bp
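The small-scale variant types above can be read directly off a gapped pairwise alignment. The sketch below walks through two aligned rows and reports SNVs and gap runs. Which row is the sample and which is the reference is an assumption made for illustration; the slide does not label them.

```python
from typing import List, Tuple


def describe_differences(sample: str, reference: str) -> List[Tuple[str, int, int]]:
    """Return (kind, start column, length) for each difference between two
    rows of a gapped alignment, where '-' marks a gap."""
    if len(sample) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    events = []
    col = 0
    while col < len(sample):
        s, r = sample[col], reference[col]
        if s == r:
            col += 1
        elif s == "-" or r == "-":
            # A gap in the sample row is a deletion relative to the reference;
            # a gap in the reference row is an insertion.
            kind = "deletion" if s == "-" else "insertion"
            gap_row = sample if s == "-" else reference
            start = col
            while col < len(sample) and gap_row[col] == "-":
                col += 1
            events.append((kind, start, col - start))
        else:
            events.append(("SNV", col, 1))
            col += 1
    return events


print(describe_differences("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"))
print(describe_differences("TGCATT---TAGGC", "TGCATTCCGTAGGC"))
print(describe_differences("TGCTCATCATCATCAGC", "TGCTCATCA------GC"))
```

For the microsatellite pair the single event spans six columns, which is two copies of the three-base repeat unit TCA.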

Variant Calling: Genotype

Easy Approach

Count alleles at each column X in the pileup and apply cutoff rules:

1. Keep only bases with a Phred score ($Q_{\text{Phred}}$) of at least 20

2. Call a genotype heterozygous if the non-reference allele fraction is between 20% and 80%, otherwise homozygous

Works reasonably well when coverage is > 20 (Nielsen et al. 2011)

More elaborate approaches are based on probabilistic frameworks
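The two cutoff rules can be sketched in a few lines. The pileup is represented here as a list of (base, Phred score) pairs, which is a simplification chosen for illustration. When several non-reference alleles are present, the sketch considers only the most frequent one, which is also an assumption.

```python
from collections import Counter
from typing import List, Tuple


def call_genotype_cutoff(pileup: List[Tuple[str, int]], ref_base: str,
                         min_qual: int = 20,
                         low: float = 0.2, high: float = 0.8) -> str:
    """Call a genotype at one position using simple cutoff rules."""
    # Rule 1: keep only bases with a Phred score of at least min_qual.
    bases = [base for base, qual in pileup if qual >= min_qual]
    if not bases:
        return "no call"
    counts = Counter(bases)
    non_ref = {b: n for b, n in counts.items() if b != ref_base}
    if not non_ref:
        return ref_base * 2
    # Assumption: only the most frequent non-reference allele is considered.
    alt_base = max(non_ref, key=non_ref.get)
    alt_fraction = non_ref[alt_base] / len(bases)
    # Rule 2: heterozygous between 20% and 80%, otherwise homozygous.
    if low <= alt_fraction <= high:
        return "".join(sorted((ref_base, alt_base)))
    return alt_base * 2 if alt_fraction > high else ref_base * 2


pileup = [("G", 30)] * 12 + [("T", 30)] * 10 + [("T", 5)] * 3
print(call_genotype_cutoff(pileup, ref_base="G"))  # prints GT
```

At low coverage the allele fraction is a noisy estimate, which is why the rule is only recommended for coverage above 20.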

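$P(G \mid X) \propto P(X \mid G)\,P(G) = \Big[\prod_i P(X_i \mid G)\Big]\,P(G), \quad G \in \{A, C, T, G\}$

Likelihood $P(X \mid G)$ is the product of the terms $P(X_i \mid G)$, each derived from the quality score of entry $i$

$P(G)$ allows specifying data-independent prior knowledge

Posterior $0 < P(G \mid X) < 1$ assesses genotype and confidence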
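The posterior above can be computed directly for a small pileup. The sketch below follows the slide in letting G range over the four bases. The per-read error model is an assumption not stated on the slide: a read shows the true base with probability 1 - e and each of the other three bases with probability e/3, where e comes from the Phred score. The prior defaults to uniform and must be strictly positive.

```python
import math
from typing import Dict, List, Optional, Tuple

BASES = "ACGT"


def genotype_posterior(pileup: List[Tuple[str, int]],
                       prior: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return P(G | X) for G in {A, C, G, T} from (base, Phred score) pairs."""
    if prior is None:
        prior = {g: 1.0 / len(BASES) for g in BASES}
    log_post = {}
    for g in BASES:
        total = math.log(prior[g])
        for base, qual in pileup:
            # Cap the error at 0.75 so that a Q0 base carries no information.
            p_err = min(10.0 ** (-qual / 10.0), 0.75)
            # Assumed error model: errors are spread evenly over three bases.
            p_obs = 1.0 - p_err if base == g else p_err / 3.0
            total += math.log(p_obs)
        log_post[g] = total
    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


posterior = genotype_posterior([("A", 30)] * 5 + [("C", 30)])
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 6))
```

Working with log probabilities keeps the product over many reads numerically stable; the final normalisation turns the proportionality into a proper posterior.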

Variant Calling: Integrating Prior Knowledge

Single Sample Prior for a Given Position X

Suppose that a G/T polymorphism is reported in dbSNP. Then,

G       G       T       GT       Other combinations
P(G)    0.454   0.454   0.0909   < 10^-4
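To show how such a prior enters the genotype call, here is a sketch for diploid genotypes. It rests on several assumptions: the columns G and T in the table are read as the homozygous genotypes GG and TT, all other genotypes get the placeholder value 1e-4 (the table only gives an upper bound), each read samples either allele with probability 1/2, and errors are spread evenly over the other three bases.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]


def dbsnp_prior(allele1: str, allele2: str, other: float = 1e-4) -> Dict[str, float]:
    """Prior for a known polymorphism, using the values from the table."""
    prior = {g: other for g in GENOTYPES}  # placeholder for "< 10^-4"
    prior[allele1 * 2] = 0.454
    prior[allele2 * 2] = 0.454
    prior["".join(sorted((allele1, allele2)))] = 0.0909
    return prior


def diploid_posterior(pileup: List[Tuple[str, int]],
                      prior: Dict[str, float]) -> Dict[str, float]:
    """Return the posterior over diploid genotypes for (base, Phred) pairs."""
    log_post = {}
    for g in GENOTYPES:
        total = math.log(prior[g])
        for base, qual in pileup:
            p_err = min(10.0 ** (-qual / 10.0), 0.75)
            # Assumption: each allele of the genotype is read with probability 1/2.
            p_obs = sum(0.5 * (1.0 - p_err if base == allele else p_err / 3.0)
                        for allele in g)
            total += math.log(p_obs)
        log_post[g] = total
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


prior = dbsnp_prior("G", "T")
mixed = diploid_posterior([("G", 30)] * 6 + [("T", 30)] * 6, prior)
print(max(mixed, key=mixed.get))
```

With balanced, high-quality evidence for both alleles the likelihood outweighs the lower heterozygous prior; the prior matters most when coverage is low.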

GATK multi-sample uses allele frequencies estimated from larger sample sets, combined with Hardy-Weinberg equilibrium

GATK-Beagle additionally uses linkage disequilibrium data
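Under Hardy-Weinberg equilibrium, an allele frequency translates directly into genotype prior probabilities. The sketch below shows only this textbook relationship for a biallelic site; it is not a description of how GATK implements its multi-sample model, and the function name is illustrative.

```python
from typing import Dict


def hardy_weinberg_prior(ref: str, alt: str, alt_freq: float) -> Dict[str, float]:
    """Genotype priors for a biallelic site under Hardy-Weinberg equilibrium."""
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must be in [0, 1]")
    ref_freq = 1.0 - alt_freq
    return {
        ref + ref: ref_freq ** 2,            # homozygous reference: p^2
        ref + alt: 2.0 * ref_freq * alt_freq,  # heterozygous: 2pq
        alt + alt: alt_freq ** 2,            # homozygous alternative: q^2
    }


for freq in (0.5, 0.01):
    print(freq, hardy_weinberg_prior("G", "T", freq))
```

For a rare allele the homozygous alternative genotype receives a very small prior, which is consistent with the slide's later remark that such priors help little for rare mutations.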

Variant Calling: Other Examples of Extensions

(taken from the Illumina Website)

But ... all of this is of little use for rare mutations

Annotation

Between 3 and 5 million SNVs per individual

Only few have a functional impact

Separating them is a challenge of bioinformatics

Many tools use supervised learning approaches (remember my last talk)

SNV features

cSNVs (protein-coding)
– Amino acid residue substitution properties
– Evolutionary history of AA position
– Sequence-function relationship
– Structure-function relationship

rSNVs (regulatory)
– Transcription
– Pre-mRNA splicing
– MicroRNA binding
– Post-translational modification sites

Annotation: Protein-Sequence-Based

(Cooper et al.)

Annotation: DNA-Sequence-Based

(Cooper et al.)

Flow chart for informed use of SNV function prediction tools

Cline et al.

Thanks for your attention!

References

Nielsen et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics (2011).

Cline et al. Using bioinformatics to predict the functional impact of SNVs. Bioinformatics (2011).

Cooper et al. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics (2011).
