final_presentation

Biology 101 and 102, 202, 324, and 404...

DNA Structure

Chromosomes Genes Base pairs

Dad’s Mom’s

Cancer Biology Refresher

germline mutations: inherited mutations. Present since birth.

Present in both normal and tumor cells.

somatic mutations: arise over the course of an individual’s lifetime.

Present in tumor cells, but not in normal cells.

Reference: A T C G A T C G A T C G

Tumor: A T G G A T C G C T C G

Normal: A T G G A T C G A T C G

1 2 3 4 5 6 7 8 9 10 11 12
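The three sequences above can be compared position by position. The following is a minimal sketch, assuming the simplified rule from the definitions: a tumor variant that also appears in the normal sample is germline, and one that appears only in the tumor is somatic. The function name `classify_sites` is made up for illustration.

```python
# Sequences copied from the slide; positions are 1-based.
reference = "A T C G A T C G A T C G".split()
tumor     = "A T G G A T C G C T C G".split()
normal    = "A T G G A T C G A T C G".split()

def classify_sites(reference, normal, tumor):
    """Label positions where the tumor base differs from the reference."""
    calls = {}
    for pos, (ref, nrm, tum) in enumerate(zip(reference, normal, tumor), start=1):
        if tum == ref:
            continue                  # tumor matches the reference: nothing to call
        if nrm == tum:
            calls[pos] = "germline"   # variant present in both normal and tumor
        elif nrm == ref:
            calls[pos] = "somatic"    # variant present in the tumor only
    return calls

calls = classify_sites(reference, normal, tumor)
print(calls)  # {3: 'germline', 9: 'somatic'}
```

Real sequencing data is noisy, so a direct comparison like this is only a conceptual starting point.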

A Statement of Facts

Breast cancer is the leading type of cancer in women, accounting for 25% of all reported cases.

In 2013, there were 1.68 million cases and 522,000 deaths due to breast cancer.

Source: http://www.fanpop.com/clubs/breast-cancer-awareness/images/372389/title/pink-ribbonA

Goal

Given paired normal-tumor sequencing data from a breast cancer patient, identify the somatic mutations present in the cancer genome.

Example of Raw Sequencing Data

Approach

Can we use deep-sequenced data from other cancer patients to classify the current patient’s SNPs?

Use 9 machine learning algorithms to predict which SNPs are truly somatic.


(1) Number of reads covering or bridging the site
(2) Number of reference Q13 bases on the forward strand
(3) Number of reference Q13 bases on the reverse strand
(4) Number of non-reference Q13 bases on the forward strand
(5) Number of non-reference Q13 bases on the reverse strand
(6) Sum of reference base qualities
(7) Sum of squares of reference base qualities
(8) Sum of non-reference base qualities
(9) Sum of squares of non-reference base qualities
(10) Sum of reference mapping qualities
(11) Sum of squares of reference mapping qualities
(12) Sum of non-reference mapping qualities
(13) Sum of squares of non-reference mapping qualities
(14) Sum of tail distances for reference bases
(15) Sum of squares of tail distance for reference bases
(16) Sum of tail distances for non-reference bases
(17) Sum of squares of tail distance for non-reference bases
(18) P(D∣Gi=aa), phred-scaled, i.e. x is transformed to −10 log10(x)
(19) max over Gi≠aa of P(D∣Gi), phred-scaled
(20) ∑ over Gi≠aa of P(D∣Gi), phred-scaled

(Q13 means base quality greater than or equal to Phred score 13; D represents the three-dimensional vector (depth, number of reference bases, number of non-reference bases) at the current site; Gi∈{aa, ab, bb} is the genotype at site i, where a, b∈{A, C, T, G}, a is the reference allele, and b is the non-reference allele.)
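The Phred scaling used in features 18 to 20 and in the Q13 cutoff can be illustrated with a short sketch. The logarithm is base 10, as is standard for Phred scores; the function names here are made up for illustration.

```python
import math

def phred_scale(p):
    """Transform a probability p (0 < p <= 1) to the Phred scale: -10 * log10(p)."""
    return -10.0 * math.log10(p)

def phred_to_prob(q):
    """Invert the transform: a Phred score q corresponds to probability 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)

print(phred_scale(0.001))   # a probability of 1 in 1000 maps to a score of about 30
print(phred_to_prob(13))    # Q13 corresponds to a probability of roughly 0.05
```

Read as a base quality, Q13 therefore means an estimated base-call error probability of about 5%.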

(41) QUAL: phred-scaled probability of the call given data
(42) Allele count for non-ref allele in genotypes
(43) AF: allele frequency for each non-ref allele
(44) Total number of alleles in called genotypes
(45) Total (unfiltered) depth over all samples
(46) Fraction of reads containing spanning deletions
(47) HRun: largest contiguous homopolymer run of variant allele in either direction
(48) HaplotypeScore: estimate the probability that the reads at this locus are coming from no more than 2 local haplotypes
(49) MQ: root mean square mapping quality
(50) MQ0: total number of reads with mapping quality zero
(51) QD: variant confidence/unfiltered depth
(52) SB: strand bias (the variation being seen on only the forward or only the reverse strand)
(53) SumGLbyD
(54) Allelic depths for the ref-allele
(55) Allelic depths for the non-ref allele
(56) DP: read depth (only filtered reads used for calling)
(57) GQ: genotype quality computed based on the genotype likelihood
(58) P(D∣Gi=aa), phred-scaled
(59) P(D∣Gi=ab), phred-scaled
(60) P(D∣Gi=bb), phred-scaled

(98) Forward strand non-reference base ratio F24/F4
(99) Reverse strand non-reference base ratio F25/F5
(100) Sum of non-reference base quality ratio F28/F8
(101) Sum of squares of non-reference base quality ratio F29/F9
(102) Sum of non-reference mapping quality ratio F32/F12
(103) Sum of squares of non-reference mapping quality ratio F33/F13
(104) Sum of non-reference tail distance ratio F36/F16
(105) Sum of squares of non-reference tail distance ratio F37/F17
(106) Non-reference allele depth ratio F75/F55

From SAMtools: x1 - x20 for normal, x21 - x40 for tumor

From GATK: x41 - x60 for normal, x61 - x80 for tumor
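Given this numbering, each ratio feature (98 to 106) divides a tumor feature by the normal feature 20 positions earlier, for example F24/F4. A minimal sketch follows; the function `ratio_feature`, the sample values, and the choice to return a missing value when the normal feature is zero are all assumptions made for illustration.

```python
def ratio_feature(features, normal_index, offset=20, missing=None):
    """Return F[normal_index + offset] / F[normal_index] using 1-based feature numbers.

    `features` maps feature number -> value for a single SNP.
    A zero denominator yields `missing` (an assumption, not from the slides).
    """
    numerator = features[normal_index + offset]     # tumor feature
    denominator = features[normal_index]            # matching normal feature
    if denominator == 0:
        return missing
    return numerator / denominator

# Hypothetical values for one SNP, keyed by feature number.
site = {4: 2.0, 24: 10.0, 55: 0.0, 75: 7.0}

f98 = ratio_feature(site, 4)     # F24 / F4
f106 = ratio_feature(site, 55)   # F75 / F55, undefined here because F55 is zero
print(f98, f106)                 # 5.0 None
```

Ratios left undefined this way would count as missing data in the later steps.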

Feature Extraction

Source: “Feature-based classifiers for Somatic mutation detection in tumor-normal paired sequencing data” by Jiarui Ding, et al.

Selected 106 features that are computed from SAMtools and GATK (two popular genomics toolkits).

Random Forest, SVM, and Logistic Tree have already achieved good accuracy using these features.

Feature Selection and Merging

Merge all the features for normal and tumor SNPs detected from SAMtools and GATK (4 → 1).

Delete the uninformative features (e.g., number of non-reference Q13 bases on the tumor) and features with too much missing data.

62 features left!

Approximate missing data by substituting the mean of data present (“mean imputation”).
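The merge and mean-imputation steps can be sketched as follows. This is an illustration only: the table layout (`{position: {feature: value}}`), the source names, and the toy values are assumptions, and the sketch assumes every remaining column has at least one observed value.

```python
from statistics import mean

def merge_on_position(tables):
    """Combine several {position: {feature: value}} tables into one row per position."""
    merged = {}
    for source, table in tables.items():
        for pos, feats in table.items():
            row = merged.setdefault(pos, {})
            for name, value in feats.items():
                row[f"{source}.{name}"] = value   # prefix keeps the sources apart
    return merged

def mean_impute(rows):
    """Fill each missing value with the mean of the values present in that column."""
    columns = sorted({col for row in rows.values() for col in row})
    for col in columns:
        present = [row[col] for row in rows.values() if row.get(col) is not None]
        fill = mean(present)
        for row in rows.values():
            if row.get(col) is None:
                row[col] = fill
    return rows

# Toy input: position 310 was reported for the tumor sample only.
tables = {
    "normal_samtools": {101: {"depth": 30.0}, 205: {"depth": 20.0}},
    "tumor_samtools": {101: {"depth": 40.0}, 205: {"depth": 60.0}, 310: {"depth": 50.0}},
}
merged = mean_impute(merge_on_position(tables))
print(merged[310])  # normal_samtools.depth is filled with the column mean, 25.0
```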

[Figure: tumor and normal feature tables from SAMtools and GATK, aligned by SNP positions]

Missing Data

Normalization

Normalization prevents large-valued features from dominating the principal components.

Method 1: perform mean-centering and then divide by the standard deviation. Use to detect machine and experimental errors.

Method 2: divide by the maximum absolute value of the data. Use to normalize input for the machine learning algorithms.
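Both normalization methods can be sketched with NumPy. The sketch assumes the scaling is applied per feature (per column) and that no feature is constant or all zero, which would cause a division by zero; the function names are made up for illustration.

```python
import numpy as np

def zscore(X):
    """Method 1: mean-center each column, then divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def max_abs_scale(X):
    """Method 2: divide each column by its maximum absolute value (result lies in [-1, 1])."""
    X = np.asarray(X, dtype=float)
    return X / np.abs(X).max(axis=0)

# Toy data: the second feature is hundreds of times larger than the first.
X = np.array([[1.0, 200.0],
              [2.0, -400.0],
              [3.0, 600.0]])

Z = zscore(X)          # every column now has mean 0 and standard deviation 1
M = max_abs_scale(X)   # every column now has maximum absolute value 1
print(M)
```

After either transform the two features are on comparable scales, so neither dominates by magnitude alone.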

Principal Component Analysis (PCA)

Identifies an orthonormal basis that captures the greatest variance in our data.

Reduce the dimension to the top 10 principal components. These account for 81.5% of the variance in our data.

These principal components serve as “super feature” inputs for our machine learning algorithms.
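The reduction to 10 principal components can be sketched with NumPy's singular value decomposition. This is an illustration on synthetic data, not the project's code: the random matrix stands in for the 62 normalized features, and the function name `pca` is an assumption.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using the SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)                # fraction of variance per component
    components = Vt[:n_components]                 # orthonormal basis vectors (rows)
    scores = Xc @ components.T                     # the "super features"
    return scores, components, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 62))                     # synthetic stand-in for 62 features
scores, components, explained = pca(X, n_components=10)
print(scores.shape, explained.sum())
```

On the real data, the sum of `explained` for the first 10 components corresponds to the 81.5% figure quoted above.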

Initial classification by SAMtools and GATK:

[Figure legend: somatic, germline]

9 Machine Learning Algorithms

1. Use the first 10 principal components as features.

2. Train the algorithms on data from another patient (860 samples).

3. Pass each SNP through every algorithm, tracking whether it is classified as somatic or non-somatic.

4. Select a threshold (we used 8).

5. If more than the threshold number of algorithms classify a SNP as somatic, assign it a final label of somatic.
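The voting rule in steps 3 to 5 can be sketched as below. The classifier outputs are hypothetical booleans (True for somatic), and the function names are made up for illustration; the rule follows the text literally, requiring strictly more than the threshold number of votes.

```python
def ensemble_label(votes, threshold=8):
    """votes: one boolean per classifier (True = classified as somatic)."""
    n_somatic = sum(bool(v) for v in votes)
    return "somatic" if n_somatic > threshold else "non-somatic"

def label_snps(predictions, threshold=8):
    """predictions: {snp_id: [vote from each classifier]} -> {snp_id: final label}."""
    return {snp: ensemble_label(v, threshold) for snp, v in predictions.items()}

# Hypothetical votes from nine classifiers for three SNPs.
predictions = {
    "snp_a": [True] * 9,             # unanimous
    "snp_b": [True] * 8 + [False],   # one dissenting classifier
    "snp_c": [False] * 9,
}
labels = label_snps(predictions)
print(labels)
```

With nine classifiers and a threshold of 8, "more than the threshold" requires a unanimous vote, so `snp_b` is labeled non-somatic. If "at least 8" was intended, the comparison would be `>=` instead.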

QDA, RBF SVM, Linear SVM

Random Forest, Naïve Bayes

Decision Tree, Nearest Neighbors, LDA

+ Neural Network

(processing sped up by parallel programming)

OpenMP, Open MPI, CUDA

More on Neural Networks

How did we do?

Cross-reference our somatic SNPs against several databases (gene function, disease and phenotype association, etc.).

Compile a list of known breast cancer driver genes on chromosome 1 and search for them among our results:

1. RAP1A [151 SNPs]

2. PARP1 [234 SNPs]

3. TACSTD2 [13 SNPs]

In total, 8,660 associations with breast cancer in other research studies were found.

MAP1A

• microtubule assembly protein

• blocking dynamic instability of microtubules is used as a cancer treatment, preventing cell migration

PARP1

• involved in DNA damage repair

• interacts with BRCA1 and BRCA2 (two of the most cited breast cancer driver genes) during homologous recombination

TACSTD2

• tumor-associated calcium signal transducer

Sources:

http://www.proteinatlas.org/images_dictionary/microtubules__1__6376__1_blue_green.jpg

http://en.wikipedia.org/wiki/PARP1

www.sinobiological.com

Future Work

Deep-sequenced data is expensive to produce. For best results, we need the data to come from individuals of similar backgrounds (gender, ethnicity, etc.).

Building a repository of data with high coverage for each cancer type would increase our training set size and ward off the perils of over-fitting.

Learn how to overcome the sequencing errors that each sequencing technology is prone to.

References

1. Jiarui Ding, Ali Bashashati, et al., “Feature-based classifiers for somatic mutation detection in tumor-normal paired sequencing data”, Bioinformatics (2012), vol. 28, pp. 167-175

2. Xindong Wu, Vipin Kumar, et al., “Top 10 Algorithms in Data Mining”, Knowledge and Information Systems (2008) 14:1–37

3. Jonathan Shlens, “A Tutorial on Principal Component Analysis”, Google Research (2014)

4. A. Christoforides, J. Carpten, et al., “Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs”, BMC Genomics (2012) 14 (1): 302

5. Y. Shiraishi, Y. Sato, et al., “An empirical Bayesian framework for somatic mutation detection from cancer genome sequencing data”, Nucleic Acids Research (2013)

6. SciKit: http://scikit-learn.org/stable/

7. ANN: http://takinginitiative.wordpress.com/2008/04/23/basic-neural-network-tutorial-c-implementation-and-source-code/

Thank You!

Date post:	15-Jul-2015
Upload:	david-stevens