Empirical evaluation of prediction- and correlation network methods applied to genomic data


Steve Horvath, University of California, Los Angeles

Content

• Review of weighted correlation network analysis (WGCNA)

• When Is Hub Gene Selection Better than Standard Meta-Analysis? Evaluating systems biologic gene selection methods

• The epigenetic clock: a highly accurate genomic predictor of age

What is weighted correlation network analysis (WGCNA)?

Construct a network
Rationale: make use of interaction patterns between genes

Identify modules
Rationale: module (pathway) based analysis

Relate modules to external information
Array information: clinical data, SNPs, proteomics
Gene information: gene ontology, EASE, IPA
Rationale: find biologically interesting modules

Find the key drivers in interesting modules
Tools: intramodular connectivity, causality testing
Rationale: experimental validation, therapeutics, biomarkers

Study module preservation across different data
Rationale:
• Same data: to check robustness of module definition
• Different data: to find interesting modules

Weighted correlation networks are valuable for a biologically meaningful…

• reduction of high dimensional data
– expression: microarray, RNA-seq
– gene methylation data, fMRI data, etc.

• integration of multiscale data
– expression data from multiple tissues
– SNPs (module QTL analysis)
– complex phenotypes

An anatomically comprehensive atlas of the adult human brain transcriptome

MJ Hawrylycz, E Lein,..,AR Jones (2012) Nature 489, 391-399

Allen Brain Institute

Data

• Brains from two healthy males (ages 24 and 39)
• 170 brain structures
• over 900 microarray samples per individual
• 64K Agilent microarray
• This data set provides a neuroanatomically precise, genome-wide map of transcript distributions

Global gene networks.

Modules in brain 1

How to construct a weighted correlation network?

Systems biology as a field of study: interactions between the components of biological systems

Network=Adjacency Matrix

• A network can be represented by an adjacency matrix, A=[aij], that encodes whether/how a pair of nodes is connected.
– A is a symmetric matrix with entries in [0,1]
– For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected)
– For weighted networks, the adjacency matrix reports the connection strength between node pairs
– Our convention: diagonal elements of A are all 1.

Two types of weighted correlation networks

Unsigned network (absolute value): $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$

Signed network (preserves sign info): $a_{ij} = |0.5 + 0.5\,\mathrm{cor}(x_i, x_j)|^{\beta}$

Default values: β=6 for unsigned and β=12 for signed networks. We prefer signed networks…

Zhang et al SAGMB Vol. 4: No. 1, Article 17.
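As a rough illustration of the two adjacency definitions (unsigned: |cor|^β; signed: (0.5 + 0.5·cor)^β), here is a minimal pure-Python sketch. The helper names `pearson` and `adjacency` and the toy expression profiles are made up for this example; this is not the WGCNA package code.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return max(-1.0, min(1.0, r))  # guard against rounding outside [-1, 1]

def adjacency(profiles, beta, signed):
    """Soft-thresholded adjacency matrix; diagonal entries are set to 1."""
    n = len(profiles)
    adj = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(profiles[i], profiles[j])
            base = 0.5 + 0.5 * r if signed else abs(r)
            adj[i][j] = adj[j][i] = base ** beta
    return adj

# Toy data (hypothetical): one row per gene, one column per sample.
genes = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.1, 5.9, 8.0],  # gene 1: strongly positively correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],  # gene 2: perfectly anti-correlated with gene 0
]
unsigned = adjacency(genes, beta=6, signed=False)
signed = adjacency(genes, beta=12, signed=True)
```

In the unsigned network the anti-correlated pair (genes 0 and 2) is strongly connected, whereas in the signed network its adjacency is essentially 0, which is the sense in which the signed network preserves sign information.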

Adjacency versus correlation in unsigned and signed networks

[Figure panels: Unsigned Network, Signed Network]

Advantages of soft thresholding with the power function

1. Robustness: Network results are highly robust with respect to the choice of the power β (Zhang et al 2005)

2. Calibration of different networks becomes straightforward, which facilitates consensus module analysis

3. Module preservation statistics are particularly sensitive for measuring connectivity preservation in weighted networks

4. Mathematical reason: see "Geometric Interpretation of Gene Co-Expression Network Analysis", PLoS Computational Biology 4(8): e1000117

How to detect network modules?

Systems biology as a paradigm, usually defined in antithesis to the so-called reductionist paradigm (biological organization)

Module Definition
• We often use average linkage hierarchical clustering coupled with the topological overlap dissimilarity measure.
• Next we use the dynamic tree cutting method to define clusters (Langfelder et al 2007).
• Based on the resulting cluster tree, we define modules as branches.
• Modules are either labeled by integers (1,2,3…) or equivalently by colors (turquoise, blue, brown, etc).
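The slides do not spell out the topological overlap measure. A commonly used form for weighted networks is TOM_ij = (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), with the sum over u ≠ i, j and k_i the connectivity of node i; the dissimilarity passed to hierarchical clustering is 1 − TOM. The sketch below assumes this form; the function name and the toy adjacency matrix are illustrative only.

```python
def topological_overlap(adj):
    """Topological overlap matrix for a symmetric adjacency matrix with entries in [0, 1]."""
    n = len(adj)
    # Connectivity of each node, excluding the self-connection on the diagonal.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connection strength shared through all other nodes u.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom

# Toy network (hypothetical): nodes 0-2 are tightly connected, node 3 is peripheral.
adj = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
tom = topological_overlap(adj)
diss_tom = [[1.0 - v for v in row] for row in tom]  # input for hierarchical clustering
```

Nodes inside the tight group end up with a small dissimilarity to each other and a large dissimilarity to the peripheral node, so they form one branch of the cluster tree.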

Defining modules based on a hierarchical cluster tree

Langfelder P, Zhang B et al (2007) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics 2008 24(5):719-720

Module=branch of a cluster tree

The dynamic hybrid branch cutting method combines the advantages of hierarchical clustering and PAM (partitioning around medoids) clustering

How does one find “consensus” modules based on multiple gene expression data (networks)?

Example: Multiple Human brain expression data sets from Huntington's Disease

Publicly available caudate nucleus gene expression data from HD subjects and controls:

1) Durrenberger et al (2012). Selection of novel reference genes for use in the human central nervous system: a BrainNet Europe Study. Acta Neuropathol. 124(6):893-903

2) Hodges et al, Luthi-Carter (2006). Regional and cellular gene expression changes in human Huntington’s disease brain. Human Molecular Genetics, Vol. 15, No. 6

Analysis steps of WGCNA

1. Construct a signed weighted correlation network based on 2 human gene expression data sets
Purpose: keep track of co-expression relationships

2. Identify consensus modules
Purpose: find robustly defined and reproducible modules
Technique: consensus adjacency is a quantile of the input adjacencies, e.g. minimum, lower quartile, median

3. Relate modules to external information
HD disease status
Gene information: gene ontology, cell marker genes
Purpose: find biologically meaningful modules
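The consensus step (consensus adjacency as a quantile of the input adjacencies) can be sketched as follows. This is a minimal illustration that assumes the input networks have already been calibrated to a comparable scale; the function names, the linear-interpolation quantile, and the toy matrices are choices made for this example, not the WGCNA implementation.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (0 <= q <= 1)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac

def consensus_adjacency(networks, q=0.0):
    """Entry-wise quantile across networks; q=0 is the minimum, q=0.5 the median."""
    n = len(networks[0])
    return [
        [quantile([net[i][j] for net in networks], q) for j in range(n)]
        for i in range(n)
    ]

# Three toy 2-gene networks (hypothetical values).
net_a = [[1.0, 0.8], [0.8, 1.0]]
net_b = [[1.0, 0.2], [0.2, 1.0]]
net_c = [[1.0, 0.5], [0.5, 1.0]]

cons_min = consensus_adjacency([net_a, net_b, net_c], q=0.0)
cons_median = consensus_adjacency([net_a, net_b, net_c], q=0.5)
```

With the minimum, two genes are connected in the consensus network only if they are connected in every input network, which is the most conservative choice; higher quantiles relax this requirement.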

Consensus dendrogram with module colors and meta-analysis significance for diagnosis. The colors correspond to the meta-analysis Z score (with weights proportional to the square root of the number of degrees of freedom); blue denotes genes that are down in HD versus controls, and red denotes genes that are up in HD versus controls.
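One common way to form such a meta-analysis Z score is a weighted Stouffer combination, Z = Σ w_i Z_i / sqrt(Σ w_i²) with w_i = sqrt(dof_i). The sketch below assumes this form; the function name and the example numbers are hypothetical.

```python
import math

def meta_z(z_scores, dofs):
    """Combine per-data-set Z scores with weights proportional to sqrt(degrees of freedom)."""
    weights = [math.sqrt(d) for d in dofs]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))

# Hypothetical gene: Z = 2.0 in a data set with 16 DOF and Z = 3.0 in one with 9 DOF.
z_combined = meta_z([2.0, 3.0], [16, 9])
```

A positive combined Z corresponds to the red end of the color scale (up in HD), a negative one to the blue end (down in HD).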

Question: How does one summarize the expression profiles in a module?

Answer: This has been solved.
Math answer: module eigengene = first principal component
Network answer: the most highly connected intramodular hub gene

Both turn out to be equivalent

[Figure: heat map of the brown module and bar plot of the brown module eigengene]

Module eigengene = measure of over-expression = average redness

Rows = genes, columns = microarrays

The brown module eigengene across samples

Module eigengenes are very useful
• 1) They allow one to relate modules to each other
– Allows one to determine whether modules should be merged
• 2) They allow one to relate modules to clinical traits (HD status) and genetic variation (e.g. CAG tri-nucleotide repeat length)
– Avoids the multiple comparison problem
• 3) They allow one to define a measure of module membership: kME=cor(x,ME)
– Can be used for finding centrally located hub genes
– Can be used to define gene lists for GO enrichment
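To make the eigengene and kME definitions concrete, here is a minimal sketch that computes the first principal component of a standardized module by power iteration and then kME = cor(x, ME) for each gene. It assumes the module genes are mostly positively correlated (as in a signed network), so that the mean profile is a valid starting vector; all names and the toy data are hypothetical, and in practice one would use an SVD routine.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(x):
    """Center a profile to mean 0 and scale it to unit sample variance."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return [(v - m) / sd for v in x]

def module_eigengene(profiles, iterations=200):
    """First principal component across samples, found by power iteration."""
    rows = [standardize(p) for p in profiles]
    n_genes, n_samples = len(rows), len(rows[0])
    # Start from the mean standardized profile; this also fixes the sign of the result.
    v = [sum(r[s] for r in rows) / n_genes for s in range(n_samples)]
    for _ in range(iterations):
        # One step v <- X^T (X v), followed by renormalization to unit length.
        scores = [sum(r[s] * v[s] for s in range(n_samples)) for r in rows]
        v = [sum(rows[g][s] * scores[g] for g in range(n_genes)) for s in range(n_samples)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v

# Toy module (hypothetical): three genes measured in five samples.
module = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 3.0, 5.0, 4.0, 6.0],
    [1.0, 3.0, 2.0, 5.0, 4.0],
]
eigengene = module_eigengene(module)
kme = [pearson(gene, eigengene) for gene in module]
hub_index = kme.index(max(kme))  # gene with the highest module membership
```

In this toy module gene 0 has the highest average correlation with the other genes and also the highest kME, illustrating why the eigengene and the top intramodular hub give similar summaries.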

When Is Hub Gene Selection Better than Standard Meta-Analysis? Evaluating systems biologic gene selection methods


When does hub gene selection lead to more meaningful gene lists than a standard statistical analysis based on significance testing?

• Here we address this question for the special case when multiple data sets are available.

• This is of great practical importance since for many research questions multiple gene expression or other -omics data sets are publicly available.

• In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules.

Intramodular hub genes versus whole network hubs

• Intramodular hubs have high intramodular connectivity kME with respect to a given module of interest

• Whole network hubs have high values of whole network connectivity k
– k = row sum of the adjacency matrix
– k = number of direct neighbors in case of an unweighted network
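A small sketch of whole-network connectivity as a row sum. It assumes the self-connection on the diagonal is left out of the sum (an assumption made here so that, for an unweighted network, k equals the number of direct neighbors); the function name and the toy network are hypothetical.

```python
def connectivity(adj):
    """Row sums of the adjacency matrix, excluding the diagonal entry."""
    return [sum(v for j, v in enumerate(row) if j != i) for i, row in enumerate(adj)]

# Unweighted toy network (hypothetical): node 0 is linked to nodes 1, 2 and 3.
star = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
k = connectivity(star)
whole_network_hub = k.index(max(k))
```

The same function applies unchanged to a weighted adjacency matrix, where k becomes the sum of connection strengths.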

Q & A
• 1. Are whole-network hub genes relevant, or should one focus exclusively on intramodular hubs?
– Answer: Focus exclusively on intramodular hubs in trait-related modules.
• 2. Do network-based gene selection strategies lead to gene lists that are biologically more informative than those based on standard marginal approaches?
– Answer: Yes, gene selection based on intramodular connectivity leads to biologically more informative gene lists than marginal approaches.
• 3. Do network-based gene selection strategies lead to gene lists that have more reproducible trait associations than those based on standard marginal approaches?
– Answer: Overall no. But when the signal is weak, networks can help.

Criteria for judging gene selection methods

• Criterion 1 evaluates the biological insights gained, i.e. it is relevant in basic research.

• Criterion 2 evaluates the validation success in independent data sets, i.e. it is relevant when it comes to developing diagnostic or prognostic biomarkers.

Data sets used in the empirical evaluation

• We compare standard meta-analysis with consensus network analysis in three comprehensive and unbiased empirical studies:

• (1) Find genes predictive of lung cancer survival
– Gold standard = cell proliferation related genes

• (2) Find age related DNA methylation markers
– Gold standard = Polycomb group target genes

• (3) Find genes related to total cholesterol in mouse liver tissues
– Gold standard = immune system related genes

R code in the WGCNA package

• For standard screening, we used the metaAnalysis function

• For finding hubs in consensus modules, we used the consensusKME function

Results

• The results demonstrate that intramodular hub gene status is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1).

• However, meta-analysis methods perform as well as (if not better than) a co-expression network approach in terms of validation success (criterion 2).

Overview of biological aging clocks

Here a biological aging clock

• is defined as a method for predicting the age (in years) of a subject/biological sample

• Examples
1. based on telomere length
2. based on gene expression levels
3. based on protein expression levels
4. based on DNA methylation levels

Telomere length versus age in white blood cells

• Relation between age and TRF in men (r=−0.45) and in women (r=−0.48)

Benetos A, et al (2001) Telomere Length as an Indicator of Biological Aging: The Gender Effect and Relation With Pulse Pressure and Pulse Wave Velocity. Hypertension.

p16INK4a clock

CDKN2A=p16Ink4A=tumor suppressor

• Tumor suppressor protein encoded by the CDKN2A gene
• Cyclin-dependent kinase inhibitor 2A (CDKN2A, p16Ink4A)
– also known as multiple tumor suppressor 1 (MTS-1)
• p16 plays an important role in regulating the cell cycle, and mutations in p16 increase the risk of developing a variety of cancers, notably melanoma.
• Increased expression of the p16 gene as organisms age reduces the proliferation of stem cells.
– This reduction in the division and production of stem cells protects against cancer while increasing the risks associated with cellular senescence.

p16INK4a clock

• R^2=0.40 means that the age correlation is sqrt(0.40) ≈ 0.63

• Liu Y et al (May 2009). "Expression of p16INK4a in peripheral blood T-cells is a biomarker of human aging". Aging Cell 8 (4): 439–48.

Disruptive clock technology based on DNA methylation levels

• State of the art of biological clocks before epigenetic markers
– Gene products (mRNA, protein levels) lead to an age correlation = 0.63
• DNA methylation levels (epigenetics) can be used to define drastically more accurate clocks
– Epigenetic clock leads to an age correlation = 0.96

DNA methylation age of human tissues and cell types.

Genome Biol. 2013 14(10):R115 PMID: 24138928

Training data sets

Data label (color) | DNA origin | Platform | Data use | n (prop. female) | Median age (range) | Citation
1 (turquoise) | Blood WB | 27K | Training | 715 (0.38) | 33 (16,88) | Horvath 2012
2 (blue) | Blood WB | 450K | Training | 94 (0.28) | 29 (18,65) | Horvath 2012
3 (brown) | Blood WB | 450K | Training | 656 (0.52) | 65 (19,100) | Hannum 2012
4 (blue2) | Blood PBMC | 450K | Training | 72 (0) | 3.1 (1,16) | Alisch 2012
5 (green) | Blood PBMC | 450K | Training | 48 (0.52) | 15 (3.5,76) | Harris et al 2012
6 (red) | Blood Cord | 27K | Training | 216 (0.51) | 0 (0,0) | Adkins 2011
7 (black) | Brain CRBLM | 27K | Training | 168 (NaN) | 45 (20,70) | Liu 2013
8 (pink) | Brain CRBLM | 27K | Training | 114 (0.3) | 44 (16,96) | Gibbs 2010
9 (magenta) | Brain FCTX | 27K | Training | 133 (0.32) | 43 (16,100) | Gibbs 2010
10 (purple) | Brain PONS | 27K | Training | 125 (0.3) | 43 (15,100) | Gibbs 2010
11 (greenyellow) | Brain Prefr.CTX | 27K | Training | 108 (0.48) | 19 (-0.5,84) | Numata 2012
12 (tan) | Brain Various Cells | 450K | Training | 145 (0.48) | 35 (13,79) | Guintivano 2013
13 (salmon) | Brain TCTX | 27K | Training | 127 (0.33) | 44 (15,100) | Gibbs 2010
14 (cyan) | Breast NL | 27K | Training | 23 (1) | 46 (19,75) | Zhuang 2012
15 (midnightblue) | Buccal | 27K | Training | 109 (0.61) | 15 (15,15) | Essex 2011
16 (indianred) | Buccal | 27K | Training | 8 (0.75) | 43 (16,68) | Rakyan 2010
17 (grey60) | Buccal | 450K | Training | 53 (0.45) | 0 (0,1.5) | Martino 2013
18 (green2) | Cartilage Knee | 27K | Training | 41 (0.49) | 66 (40,79) | Fernández-Tajes 2013
19 (gold) | Colon | 27K | Training | 35 (0.63) | 74 (43,90) | TCGA, COAD
20 (royalblue) | Colon | 450K | Training | 24 (0.54) | 14 (3.5,19) | Kellermayer 2013
21 (darkred) | Dermal fibroblast | 27K | Training | 14 (1) | 20 (6,73) | Koch 2011
22 (darkgreen) | Epidermis | 27K | Training | 10 (0) | 50 (26,71) | Gronniger 2010
23 (darkturquoise) | Gastric | 27K | Training | 52 (NaN) | 68 (25,88) | Zouridis 2012
24 (darkgrey) | Head+Neck | 450K | Training | 50 (0.24) | 62 (26,87) | TCGA, HNSC
25 (orange) | Heart | 27K | Training | 17 (0.41) | 55 (16,68) | Haas 2013
26 (darkorange) | Kidney | 450K | Training | 43 (0.3) | 66 (31,83) | TCGA, KIRP
27 (lightsteelblue2) | Kidney | 450K | Training | 160 (0.34) | 63 (38,90) | TCGA, KIRC
28 (skyblue) | Liver | 27K | Training | 57 (0.14) | 51 (20,79) | Shen 2012
29 (saddlebrown) | Lung NL Adj | 27K | Training | 27 (0.15) | 69 (52,83) | TCGA, LUSC
30 (steelblue) | Lung NL Adj | 27K | Training | 24 (0.58) | 66 (51,77) | TCGA, LUAD
31 (paleturquoise) | Lung NL Adj | 450K | Training | 40 (0.32) | 73 (40,85) | TCGA, LUSC
32 (violet) | MSC (bone marrow) | 27K | Training | 16 (0.38) | 52 (21,85) | Bork 2010
33 (darkolivegreen) | Placenta | 27K | Training | 28 (1) | 0 (0,0) | Gordon 2012
34 (darkmagenta) | Prostate NL | 27K | Training | 69 (0) | 61 (44,73) | Kobayashi 2011
35 (sienna3) | Prostate NL | 450K | Training | 44 (0) | 63 (44,72) | TCGA, PRAD
36 (yellowgreen) | Saliva | 27K | Training | 131 (0.015) | 29 (21,55) | Liu 2010
37 (skyblue3) | Saliva | 27K | Training | 69 (0) | 35 (21,55) | Bockland 2011
38 (plum1) | Stomach | 27K | Training | 41 (0.51) | 69 (43,87) | TCGA, STAD
39 (orangered4) | Thyroid | 450K | Training | 25 (0.8) | 40 (18,76) | TCGA, THCA

Test data sets

Data label (color) | DNA origin | Platform | Data use | n (prop. female) | Median age (range) | Citation
40 (mediumpurple3) | Blood WB | 27K | Test | 191 (0.51) | 43 (24,74) | Teschendorff 2010
41 (lightsteelblue1) | Blood WB | 27K | Test | 93 (1) | 63 (49,74) | Rakyan 2010
42 (darkcyan) | Blood WB | 27K | Test | 262 (1) | 67 (49,91) | Song 2010
43 (orange) | Blood WB | 27K | Test | 269 (1) | 64 (52,78) | Teschendorff 2010, Song 2009
44 (green) | Blood WB | 450K | Test | 689 (0.71) | 54 (17,70) | Liu 2013
45 (darkorange2) | Blood PBMC | 27K | Test | 386 (0) | 9.3 (3.6,18) | Alisch 2012
46 (brown4) | Blood PBMC | 450K | Test | 38 (0.74) | 44 (0,100) | Heyn 2012
47 (bisque4) | Blood PBMC | 27K | Test | 92 (NaN) | 33 (24,45) | Lam 2012
48 (darkslateblue) | Blood Cord | 27K | Test | 48 (0.021) | 0 (0,0) | Turan
49 (plum2) | Blood Cord | 27K | Test | 84 (0.52) | 0 (0,0.75) | Khulan 2012
50 (thistle2) | Blood Cord | 27K | Test | 53 (0.45) | 0 (0,0) | Gordon 2012
51 (darkblue) | Blood CD4 T cells | 450K | Test | 48 (NaN) | 0.5 (0,1) | Martino 2012
52 (salmon4) | Blood CD4+CD14 | 27K | Test | 50 (0.68) | 34 (16,69) | Rakyan 2010
53 (palevioletred3) | Blood Cell Types | 450K | Test | 16 (0.62) | 32 (17,60) | Heyn 2013
54 (brown3) | Brain Cerebellar | 27K | Test | 20 (0) | 22 (1,60) | Ginsberg 2012
55 (maroon) | Brain Occipital Cortex | 27K | Test | 16 (0) | 25 (1,60) | Ginsberg 2012
56 (lightpink4) | Breast NL Adj | 450K | Test | 81 (1) | 55 (28,90) | TCGA, BRCA
57 (lavenderblush3) | Breast NL Adj | 27K | Test | 27 (1) | 51 (35,88) | TCGA, BRCA
58 (deepskyblue) | Buccal | 450K | Test | 51 (0.45) | 0 (0,1.5) | Martino 2013
59 (darkseagreen4) | Colon | 450K | Test | 38 (0.45) | 72 (40,90) | TCGA, COAD
60 (coral1) | Fat Adip | 27K | Test | 10 (0.4) | 75 (73,78) | Ribel-Madsen 2012
61 (brown2) | Heart | 27K | Test | 6 (0) | 60 (55,71) | Pai 2011
62 (coral2) | Kidney | 27K | Test | 198 (0.35) | 60 (33,86) | TCGA, KIRC
63 (mediumorchid) | Liver | 450K | Test | 37 (0.35) | 68 (20,81) | TCGA, LIHC
64 (skyblue2) | Lung NL Adj | 450K | Test | 26 (0.46) | 66 (42,86) | TCGA, LUAD
65 (yellow4) | Muscle | 27K | Test | 22 (0.55) | 66 (53,78) | Ribel-Madsen 2012
66 (skyblue1) | Muscle | 27K | Test | 44 (0) | 25 (25,25) | Jacobsen 2012
67 (plum) | Placenta | 450K | Test | 40 (NaN) | 0 (0,0) | Blair 2013
68 (orangered3) | Saliva | 27K | Test | 52 (0.92) | 27 (21,55) | Liu 2010
69 (mediumpurple2) | Uterine Cervix | 27K | Test | 152 (1) | 25 (19,55) | Zhuang 2012
70 (lightsteelblue) | Uterine Endomet | 450K | Test | 28 (1) | 62 (35,90) | TCGA, UCEG
71 (lightcoral) | Various Tissues | 27K | Test | 44 (0.41) | 71 (0,83) | Myers 2012

72 (indianred4) | Chimp+Human Tissues | 27K | Other | 35 (0.4) | 47 (9,81) | Pai 2011
73 (firebrick4) | Ape WB | 450K | Other | 32 (0.62) | 22 (9,43) | Hernando-Herraez 2013
74 (darkolivegreen4) | Sperm | 27K | Other | 19 (1) | 0 (0,0) | Pacheco 2011
75 (brown2) | Sperm | 450K | Other | 26 (0) | 0 (0,0) | Krausz 2012
76 (blue2) | Vasc.Endoth (Umbilical) | 27K | Other | 42 (0.43) | 0 (0,0) | Gordon 2012
77 | Stem cells+Somatic Cells | 27K | Other | 271 (NA) | NA | Nazor 2012
78 | Stem cells+Somatic Cells | 450K | Other | 153 (0.63) | NA | Nazor 2012
79 | Reprogrammed mesenchymal stromal cells | 450K | Other | 24 (NA) | NA | Shao 2012
80 | hESC and normal primary tissue | 27K | Other | 34 (NA) | NA | Calvanese 2012
81 | hESC | 27K | Other | 6 (NA) | NA | Ramos-Mejía 2012
82 | Blood Cell Types | 450K | Other | 60 (0) | NA | Reinius 2012

Illumina data sets
• The first 39 data sets were used to construct ("train") the age predictor.
• Data sets 40-71 were used to test (validate) the age predictor.
• Data sets 72-82 served other purposes, e.g. to estimate the DNAm age of embryonic stem and iPS cells.
• Training data were chosen i) to represent a wide spectrum of tissues/cell types, ii) to involve samples whose mean age (43 years) is similar to that in the test data, and iii) to involve a high proportion of samples (37%) measured on the Illumina 450K platform since many on-going studies use this recent Illumina platform.
• Only studied 21369 CpGs (measured with the Infinium type II assay) which were present on both Illumina platforms (Infinium 450K and 27K) and had fewer than 10 missing values across the data sets.

Age predictor
• To ensure an unbiased validation in the test data, only the training data were used to define the age predictor.
• A transformed version of chronological age was regressed on the CpGs using a penalized regression model (elastic net).
• The elastic net regression model automatically selected 353 CpGs.
• I refer to the 353 CpGs as (epigenetic) clock CpGs since their weighted average (formed by the regression coefficients) amounts to an epigenetic clock.
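To illustrate how an elastic net can select predictors automatically, here is a toy coordinate-descent sketch for the objective (1/(2n))·Σ(y − Xb)² + λ·[α·|b|₁ + (1−α)/2·|b|²]. It is not the fitting procedure used for the clock; it assumes centered data with no intercept, and the function names, penalty values, and data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, lam, alpha, sweeps=100):
    """Cyclic coordinate descent for the elastic net (no intercept, centered data assumed)."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of feature j with the residual of the fit that leaves feature j out.
            rho = 0.0
            for i in range(n):
                fitted_without_j = sum(X[i][m] * coefs[m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            coefs[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
    return coefs

# Toy data (hypothetical): the outcome depends only on the first feature.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]
coefs = elastic_net(X, y, lam=0.5, alpha=0.5)

# The prediction is a weighted sum of the features with non-zero coefficients.
predicted = [sum(c * v for c, v in zip(coefs, row)) for row in X]
```

The uninformative second feature receives a coefficient of exactly zero, while the informative feature is kept with a coefficient shrunk below its unpenalized value of 2; this is the mechanism by which a penalized model can reduce many candidate CpGs to a small selected set.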

Accuracy across tissues and cell types (training)

Accuracy across test data

Accuracy in brain tissue

Results sent to me via email

Blood data from Marco Boks, Jan 2014
Blood data from Jim Pankow, Jan 2014
Median error=3.5 years

Aging clock applied to urine
• Figure created by Wei Guo from Zymo Research
• Median error=2.7 years
• Cor=0.98

Acknowledgements

• WGCNA analysis
– Lin Song
– Peter Langfelder

Date post:	23-Feb-2016