Detection, Diagnosis and Correction of Batch Effects in TCGA Data
Rehan Akbani
Dept. of Bioinformatics and Computational Biology, UT MD Anderson Cancer Center
Simplified Flow Diagram for TCGA
Tissue Source Sites (TSS): Mayo Clinic, MD Anderson, MSKCC, …
↓ Samples
Biospecimen Core Resources (BCR): International Genomics Consortium, Nationwide Children’s Hospital
↓ DNA/RNA
Cancer Genome Characterization Centers (CGCC) / Genome Sequencing Centers (GSC): Broad, USC, UNC, …
↓ Data
Data Coordination Center (DCC)
↓
TCGA Website
Step 1: Batch effects diagnoses
• Objectives
  – Detect / quantify batch effects
  – Identify source(s) of batch effects
• Tools / algorithms for diagnoses
  – MBatch R package http://bioinformatics.mdanderson.org/tcgabatcheffects/
    1. PCA-Plus plots (novel)
    2. BatchCorr algorithm (novel)
    3. Hierarchical clustering
    4. Clinical correlates
    5. Box plots
    6. ANOVA / MANOVA
• Disclaimer: No substitute for human input
1. PCA-Plus http://bioinformatics.mdanderson.org/tcgabatcheffects/
2. BatchCorr
Batch effects present: BatchCorr < 0.7 AND p-value < 0.05
[Figure: axes labeled "p-values" and "BatchCorr value"]
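The decision rule on this slide can be written as a small check. This is a minimal sketch: the function name and the example values are hypothetical, and it only applies the stated thresholds to numbers that are already computed. It does not implement the BatchCorr algorithm itself, which the slides do not describe.

```python
def batch_effects_present(batchcorr, p_value, corr_cutoff=0.7, alpha=0.05):
    """Apply the slide's rule: BatchCorr < 0.7 AND p-value < 0.05.

    Both inputs are assumed to come from an upstream analysis
    (for example, the MBatch R package); nothing is computed here.
    """
    return batchcorr < corr_cutoff and p_value < alpha


# Hypothetical (BatchCorr, p-value) pairs, for illustration only.
candidates = {
    "dataset_a": (0.55, 0.01),
    "dataset_b": (0.55, 0.20),
    "dataset_c": (0.92, 0.01),
}

for name, (corr, p) in candidates.items():
    print(name, batch_effects_present(corr, p))
```

Both conditions must hold, so a low BatchCorr value with a non-significant p-value is not flagged.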
3. Hierarchical Clustering
4. Clinical Correlates
[Figure: COAD/READ miRNA]
5. Box plots
Batch medians Batch means
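The per-batch summaries that such box plots compare can be computed directly. The sketch below is an assumption about the inputs (one value per sample plus a parallel list of batch labels) and is not the MBatch implementation; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_batch(values, batch_ids):
    """Return {batch: {"median": ..., "mean": ...}} for parallel lists."""
    groups = defaultdict(list)
    for value, batch in zip(values, batch_ids):
        groups[batch].append(value)
    return {
        batch: {"median": median(vals), "mean": mean(vals)}
        for batch, vals in groups.items()
    }


# Hypothetical measurements for two batches.
values = [1.0, 2.0, 3.0, 10.0, 20.0, 60.0]
batches = ["A", "A", "A", "B", "B", "B"]
for batch, stats in summarize_by_batch(values, batches).items():
    print(batch, stats)
```

Large differences between batch medians, or a batch whose mean drifts far from its median, are the kind of pattern worth inspecting by eye.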
6. ANOVA / MANOVA
[Figure: charts of percentage breakdowns by factor; labels include Batch ID, TSS, BCR and subtype]
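One standard way to express how much a single factor contributes is a one-way ANOVA sum-of-squares decomposition. The sketch below is a textbook version for one feature and one factor (here, batch). Whether the percentages on this slide were computed this way is an assumption, and all names are hypothetical.

```python
from collections import defaultdict


def one_way_anova(values, batch_ids):
    """One-way ANOVA for a single feature grouped by batch.

    Returns the F statistic and eta squared, the fraction of total
    variance explained by batch. No p-value is computed here.
    """
    groups = defaultdict(list)
    for value, batch in zip(values, batch_ids):
        groups[batch].append(value)

    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 batches and more samples than batches")

    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = ss_total - ss_between

    if ss_total == 0:
        # All values identical: no variance to attribute to batch.
        return {"f_stat": 0.0, "eta_squared": 0.0}

    if ss_within > 0:
        f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    else:
        f_stat = float("inf")  # batches separate the values perfectly
    return {"f_stat": f_stat, "eta_squared": ss_between / ss_total}


# Hypothetical values for two batches of three samples each.
print(one_way_anova([1, 2, 3, 4, 5, 6], ["A", "A", "A", "B", "B", "B"]))
```

Repeating this per factor (batch ID, TSS, and so on) gives a rough ranking of candidate sources, though factors that are confounded with each other cannot be separated this way.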
Step 2: Batch effects correction
• Correct the source of the problem whenever possible
• When not possible, or source unknown, algorithms can be used
• Some algorithms for correction:
  – ComBat (aka Empirical Bayes)
  – ANOVA
  – Median Polish
• Included in MBatch R package http://bioinformatics.mdanderson.org/tcgabatcheffects/
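Of the algorithms listed, median polish is the simplest to show in full. The sketch below implements Tukey's median polish on a generic two-way table, splitting it into an overall level, row effects, column effects and residuals. How MBatch lays out the table and applies the result is not stated in the slides. Treating columns as batches, so that column effects act as batch offsets, is an assumption made only for this example.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Tukey's median polish.

    Decomposes table[i][j] into
        overall + row_eff[i] + col_eff[j] + resid[i][j]
    by alternately sweeping out row and column medians.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(x) for x in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]

        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]

        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_eff, col_eff, resid


# Hypothetical table: rows are features, columns are batches (assumed layout).
table = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
```

Under the assumed layout, subtracting `col_eff[j]` from every value in column `j` removes the estimated batch offset while leaving row effects and residuals in place.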
Kidney cancer (KIRC) DNA methylation data (27k)
Dichotomy observed: Male / Female
New dichotomy based on batch ID appears
After removing sex chromosomes
After removing bad probes
Chromophobes
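The sex-chromosome filtering step in this example can be sketched as a simple probe filter. The data layout (a dict of probe IDs to values, plus a probe-to-chromosome annotation), the probe IDs and the function names are all hypothetical. The slides do not show how the filtering was actually done.

```python
SEX_CHROMOSOMES = {"X", "Y"}


def normalize_chromosome(label):
    """Map labels such as 'chrX' or 'x' to 'X'; None stays None."""
    if label is None:
        return None
    text = str(label).strip().upper()
    return text[3:] if text.startswith("CHR") else text


def drop_sex_chromosome_probes(data, probe_chromosome):
    """Keep only probes not annotated to chromosome X or Y.

    Probes missing from the annotation are kept, which is a design
    choice of this sketch and not something stated in the slides.
    """
    return {
        probe: values
        for probe, values in data.items()
        if normalize_chromosome(probe_chromosome.get(probe)) not in SEX_CHROMOSOMES
    }


# Hypothetical probe IDs, beta values and annotation.
data = {"probe_1": [0.1, 0.9], "probe_2": [0.4, 0.5], "probe_3": [0.8, 0.2]}
annotation = {"probe_1": "X", "probe_2": "7", "probe_3": "chrY"}
print(sorted(drop_sex_chromosome_probes(data, annotation)))
```

Re-running the diagnostic plots on the filtered data is what allows a remaining batch-related split to become visible once the sex-driven one is gone.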
TCGA MBatch website http://bioinformatics.mdanderson.org/tcgabatcheffects/
Zoom Pan Mouse-over
Acknowledgments
UT MD Anderson • Nianxiang Zhang • Tod D. Casasent • Chris Wakefield • James M. Melott • Anna K. Unruh • Thomas C. Motter • Bradley M. Broom • John N. Weinstein
In-Silico • James Cleland • Andy Wong • Mike Ryan
Poster # 50