Biostatistics
Weighted gene co-expression network analysis (WGCNA) and network edge orienting (NEO)
Bin Zhang and Steve Horvath
University of California, Los Angeles, USA
Departments of Human Genetics and Biostatistics
Part I: WGCNA
Part II: NEO
Challenges of Modern Genetics
1. Genetic analysis of complex diseases is difficult
• Requires searching for many small-effect genes
• Difficult to detect signal at the DNA level
• RNA-level data may identify clusters of genes
2. Microarray technology measures RNA levels (gene expression)
• But these data are noisy!
• Focusing on single genes can lead to spurious results due to outliers or array artifacts
Network analysis of RNA data: “Gene Co-expression Network Analysis” (GCNA)
Scale-free Networks: Derek J de Solla Price
• Derek J de Solla Price was a professor of applied mathematics at Raffles College which became part of the University of Singapore in 1948.
• Singapore = great location for a systems biology conference!
• In 1965 he published the first example of a scale-free network.
• The network of scientific journal articles has connections (citations) that follow a power-law distribution.
Timeline for Scale-Free Gene Co-Expression Networks
1965: Concept first conceived by Derek J. de Solla Price
1999: Revived by Barabási and Albert, who showed that it applies to modeling the internet and biological networks.
2000: The concept of modeling gene expression data as a network was introduced by Butte and Kohane.
2002: Featherstone and Broadie showed that these networks exhibit scale-free topology.
Gene Co-Expression Network Analysis (GCNA) = Systems Genetics Approach
• Goal is to understand the “system” instead of reporting a list of individual parts
• Focus on gene clusters (“modules”) rather than individual genes
• Easily integrated with other types of data: genetic marker and protein data, clinical traits
• Network structure translates to biological pathways (can be confirmed and annotated using gene ontology software)
GCNA addresses issues in microarray data & complex disease genetics
• Individual gene expressions may be poorly measured, so it is safer to study this data at the module level.
• Modules are likely to represent pathways: genes which are co-regulated and/or interact.
• The signal from these pathways tends to be stronger than the signal from a single gene.
• Alleviates the multiple testing problem in traditional association/differential expression analyses.
Network Terminology
Barabási AL, Oltvai ZN (2004). Network biology: Understanding the cell's functional organization. Nature reviews genetics, 5, 101-113.
(A) Random Network: each node has approximately the same number of links, for example 2.
$$\Pr(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
(B) Scale-Free Network: a few nodes are very highly connected.
$$\Pr(k) \sim k^{-\gamma}$$
Definitions:
Node = object (e.g., gene)
Connection = link between 2 nodes
k = Degree(Node_i) = number of links to Node_i
Pr(k) = probability that Node_i has k links.
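The difference between the two network types shows up in the tail of Pr(k). The following is a minimal sketch, assuming a Poisson degree distribution for the random network and a power law truncated at a maximum degree for the scale-free network. The values λ = 2, γ = 2.5 and k_max = 100 are illustrative assumptions, not values from the slides.

```python
import math

def poisson_pmf(k, lam):
    """Pr(k) for a random network whose degrees follow Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def power_law_pmf(k, gamma, k_max):
    """Pr(k) proportional to k**(-gamma), normalized over k = 1..k_max."""
    norm = sum(j ** (-gamma) for j in range(1, k_max + 1))
    return k ** (-gamma) / norm

# Illustrative parameter choices (assumptions, not taken from the slides).
lam, gamma, k_max = 2.0, 2.5, 100

for k in (1, 2, 5, 10, 20):
    print(k, poisson_pmf(k, lam), power_law_pmf(k, gamma, k_max))
```

Under these assumed parameters, highly connected nodes (large k) are vanishingly rare in the random network but keep a noticeable probability in the scale-free network.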
How to construct a gene co-expression network?
A) Microarray gene expression data
B) Use Pearson correlation to determine concordance of gene expressions x_i and x_j: $r(x_i, x_j)$
C) The Pearson correlation matrix is transformed via an adjacency function:
• Step function: $a_{ij} = I(|r(x_i, x_j)| > \tau)$ gives an unweighted network
• Power function: $a_{ij} = |r(x_i, x_j)|^{\beta}$ gives a weighted network
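Steps B and C can be sketched in a few lines. This is a minimal illustration and not the WGCNA package itself: the function names, the toy expression values, and the defaults τ = 0.8 and β = 6 are assumptions made for the example, the absolute value of the correlation is used, and the diagonal is left at 0 for simplicity.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency(expr, kind="power", tau=0.8, beta=6):
    """expr: one list of expression values per gene.
    kind="step" gives an unweighted network, kind="power" a weighted one."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]   # diagonal stays 0 in this sketch
    for i in range(n):
        for j in range(i + 1, n):
            r = abs(pearson(expr[i], expr[j]))
            a = (1.0 if r > tau else 0.0) if kind == "step" else r ** beta
            adj[i][j] = adj[j][i] = a
    return adj

# Hypothetical expression profiles for three genes across five arrays.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 1.9, 3.2, 3.9, 5.1],   # strongly correlated with gene 0
    [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with gene 0
]
unweighted = adjacency(expr, kind="step")
weighted = adjacency(expr, kind="power")
```

The step function drops the weak gene 0 / gene 2 relationship entirely, while the power function keeps it as a small but nonzero connection strength.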
Two perspectives on scale-free networks: unweighted and weighted
Unweighted: $a_{ij} = I(|r(x_i, x_j)| > \tau)$
• Some genes are connected
• All connections are equal
• Hard thresholding ignores connection strength information.
Weighted: $a_{ij} = |r(x_i, x_j)|^{\beta}$
• All genes are connected
• Width of line = connection strength
Gene (x_i)-to-gene (x_j) relationships in a network
• Adjacency matrix A = network, where each aij entry gives the connection strength between xi and xj
• Connectivity of gene xi = row sum of gene xi’s connection strengths
• Topological overlap between xi and xj = measure of clustering or shared neighbors. Ravasz et al (2002)
$$k_i = \sum_{j} a_{ij}$$
$$\mathrm{TOM}_{ij} = \frac{\sum_{u} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$$
where $\sum_{u} a_{iu} a_{uj}$ is the number of genes connected to both x_i and x_j. (Note: this TOM definition is for an unweighted network.)
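The connectivity and topological overlap definitions can be checked on a small example. This is a minimal sketch for an unweighted network with a zero diagonal; the function names and the 4-gene adjacency matrix are hypothetical, and setting the diagonal of the overlap matrix to 1 is a convention assumed here.

```python
def connectivity(adj):
    """k_i = sum over j of a_ij, skipping the diagonal."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = connectivity(adj)
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Neighbors shared by genes i and j (u runs over the other genes).
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u not in (i, j))
            t = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            tom[i][j] = tom[j][i] = t
    return tom

# Hypothetical unweighted network: genes 0, 1, 2 form a triangle,
# gene 3 is linked to gene 2 only.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
k = connectivity(adj)
tom = topological_overlap(adj)
```

Genes 0 and 3 are not directly connected, yet they share gene 2 as a neighbor, so their overlap is nonzero even though their adjacency is 0.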
Average Linkage Hierarchical Clustering
[Figure I: agglomerative clustering. Figure II: average linkage.]
• Agglomerative clustering (Figure I) defines the clusters: start with n groups of 1 gene each and merge groups until a single group of size n remains.
• Clusters are merged using “average linkage” (Figure II): the pair of clusters with the smallest average distance (1 - TOM) is combined.
(source: http://www.resample.com/xlminer/help/HClst/HClst_intro.htm)
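The merging procedure can be written out directly. This is a naive sketch meant to show the logic rather than an efficient implementation; the function name and the 4-gene distance matrix (standing in for 1 - TOM values) are hypothetical.

```python
from itertools import combinations

def average_linkage(dist):
    """Agglomerative clustering with average linkage.
    dist: symmetric list of lists of pairwise distances (e.g. 1 - TOM).
    Returns the merges in order as (cluster_a, cluster_b, height)."""
    def avg(a, b):
        # Mean distance over all pairs with one gene from each cluster.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(dist))]   # start: 1 gene per group
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical distances for four genes.
dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.95],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.95, 0.20, 0.00],
]
merges = average_linkage(dist)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

The merge heights are what a dendrogram plots on its vertical axis, so cutting the tree at a chosen height corresponds to stopping the loop early.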
Defining Network Modules
1. Hierarchical clustering of overlap measures results in a cluster tree (dendrogram)
2. Trim the tree at a level that gives a manageable number of genes and gene clusters (~1,000 genes, 3-10 clusters)
• Gene clusters are called modules
• Grey colors indicate genes outside of the modules
Network Module Analysis
• Identify relevant modules according to one or more of the following strategies:
• Associate module with trait, SNPs and/or connectivity data.
• Annotate module members and primary functions using gene ontology software.
Types of Network Connectivity
Recall: connectivity of a gene i:
$$k_i = \sum_{j} a_{ij}$$
Whole network connectivity is the sum of connection strengths (aij) across all network genes.
Intra-modular connectivity is the sum of the connection strengths of gene i within its module.
Intra-modular connectivity is more biologically meaningful than whole network connectivity.
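The two connectivity measures differ only in which genes the sum runs over. A minimal sketch, assuming a weighted adjacency matrix with a zero diagonal and one module label per gene; the function names, the matrix values and the module colors are hypothetical.

```python
def whole_network_connectivity(adj, i):
    """Sum of connection strengths of gene i to all other genes."""
    return sum(a for j, a in enumerate(adj[i]) if j != i)

def intramodular_connectivity(adj, modules, i):
    """Sum of connection strengths of gene i to genes in its own module."""
    return sum(a for j, a in enumerate(adj[i])
               if j != i and modules[j] == modules[i])

# Hypothetical weighted network: genes 0-2 form one module, gene 3 is outside.
adj = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
modules = ["blue", "blue", "blue", "grey"]

k_whole = whole_network_connectivity(adj, 0)
k_intra = intramodular_connectivity(adj, modules, 0)
print(k_whole, k_intra)
```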
Applications of WGCNA Part I: inter-species comparison
1. Application to human and chimp brain tissue expression (2006)
• Modules that correspond to brain regions.
• Most and least conserved regions.
• Results agreed with known evolutionary hierarchy.
• Identified groups of genes that could be evolutionary drivers.
2. Application to two mouse strains (2007)
• Differential network analysis between BxH and BxD
• Identified pathways and genes related to weight.
Applications of WGCNA Part II: finding trait-related pathways and genes
1. Analysis of endothelial cell (EC) responses to oxidized lipids (2006)
• Identified 15 pathways characterizing response
• Identified potential gene targets for atherosclerosis
2. Integrated analysis of chronic fatigue syndrome data: microarray, SNP, traits (2008)
• Tutorial on integrated WGCNA, compared with standard microarray analysis
• Systems genetics screening criteria yield genes that are causal for their parent module
WGCNA Software: stand-alone and R package
Part II: Network Edge Orienting (NEO)
Undirected Weighted Network
Directed Weighted Network
Jason Aten (1,2) and Steve Horvath (3)
(1) Biomathematics, (2) Human Genetics and (3) Biostatistics
Motivation for Cause and Effect Analysis in Genetics
• Large-scale genetic marker and gene expression data sets can result in numerous genetic candidates for follow-up studies.
• Many are due to chance rather than a true clinical relationship.
• Cause and effect analysis can be performed on a weighted gene co-expression network when genetic marker data is available, based on the ‘Mendelian randomization’ concept.
• Such an analysis may:
  • Help prioritize among these gene candidates for follow-up analysis.
  • Reduce spurious findings.
Historical Rationale for Causal Inference in Genetics (Katan 1986)
1. DNA variation as measured by genetic markers can only be causal for, or have no effect on, gene expression and trait data; it is never reactive
2. Mendel’s law of independent assortment: genetic traits are inherited randomly (‘Mendelian randomization’)
3. People with a particular DNA variation (X) that conferred only a small physiological effect are otherwise comparable to people who have the normal allele (Y)
• The X subjects likely do not know of their particular genetic difference from the Y subjects, and lead comparable lives
• A study of this trait in X and Y adults would be equivalent to a prospective study that began with X and Y newborns and followed them through adulthood to see which developed the trait
How to infer causal relationships?
• Katan (1986) described how causal analysis in observational studies on the APOE gene (M) could determine whether there is a link between cholesterol (A) and cancer (B)
• Based on research findings:
  • APOE alleles influenced cholesterol levels
  • Observational studies found that low cholesterol was associated with cancer
• Three possible relationships:
1. M → A → B, |r(M,B)| > 0
2. M → A ← B, |r(M,B)| = 0
3. M → A ← Confounder → B, |r(M,B)| = 0 (equivalent to 2)
• Correlation information can distinguish relationship 1 from 2 and 3.
But, in practice, true causality is difficult to establish.
• r(M,B) = 0 is unlikely, particularly in large data sets or if B is a quantitative trait
• M → A: may be verified if SNP and gene expression correspond to the same gene
  • Often not possible: it is expensive to have high coverage of genes with both SNP markers and gene expression profiles.
  • Confounded by other markers in linkage disequilibrium with study marker(s)
• Relationships could be confounded by
  • Gene or environment interactions
  • Population stratification
• Causality inferred by genetic associations is best considered probable causality
Network Edge Orienting Software (NEO)
• Developed by Jason Aten and Steve Horvath (2008) for estimating edge orientations in a gene co-expression network
• Methods based on structural equation modeling (SEM)
  • First conceived of by geneticist Sewall Wright (1921)
  • Allows study of causal graphs in the context of statistical distributions
  • Each variable in a graph is modeled by combinations of 1 or more other variables using linear regression
• NEO calculates Local Edge Orienting (LEO) scores
  • Based on the relative probabilities of local structural equation models (models including only 3 nodes)
  • Higher scores indicate stronger evidence for a causal relationship
NEO software: Input
1. A set of quantitative variables (traits)
• Physiological traits
• Gene expression data
• Typically input both
2. SNP marker data (or other genetic marker data)
Unoriented Network Example
[Figure: unoriented network linking markers (Chr1, Chr2, …, ChrX), gene expressions (E1, E2, E3, E4) and clinical traits (HDL, Insulin)]
Key:
Chr1, …, ChrX = markers
E1, …, E4 = gene expressions
HDL, Insulin = clinical traits
1. Note that if the transcript corresponding to a SNP is known, the orientation of the edge is known
2. Edges between traits and gene expressions are not yet oriented
Network Edges Oriented
[Figure: the same network with directed edges among markers (Chr1, Chr2, ..., Chr22, ChrX), gene expressions (E1, E2, E3, E4) and clinical traits (HDL, Insulin); edge labels: LEO=1.5, LEO=3.5, LEO=0.5, LEO=0.8]
Edges are directed. A score, which measures the strength of evidence for this direction, is assigned to each directed edge.
NEO software: Output
1. Diagram of the directed network
2. Spreadsheet that summarizes LEO scores and provides hyperlinks to model fits (html files)
There are 5 models for a marker M and traits A and B
In the table below, “r” refers to correlation; the value 1 indicates |r| > 0, while 0 indicates r = 0.

Relationship     r(A,B)  r(M,A)  r(M,B)  r(M,A|B)  r(M,B|A)  r(A,B|M)
1. M → A → B     1       1       1       1         0         1
2. M → B → A     1       1       1       0         1         1
3. A ← M → B     1       1       1       1         1         0
4. M → A ← B*    1       1       0       1         1         1
5. M → B ← A*    1       0       1       1         1         1

*Note that models 4 and 5 are equivalent to the confounded model: M → X ← Confounder → Y.
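The conditional correlations in the table can be reproduced from the pairwise correlations with the standard first-order partial correlation formula. The following is a sketch for model 1 (M → A → B) only, assuming a linear model with standardized variables so that r(M,B) is the product of r(M,A) and r(A,B). The values 0.5 and 0.6 are hypothetical.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after conditioning on z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Model 1 (M -> A -> B) with hypothetical correlations.
r_ma, r_ab = 0.5, 0.6
r_mb = r_ma * r_ab   # M influences B only through A (assumed linear model)

signature = {
    "r(M,A|B)": partial_corr(r_ma, r_mb, r_ab),
    "r(M,B|A)": partial_corr(r_mb, r_ma, r_ab),
    "r(A,B|M)": partial_corr(r_ab, r_ma, r_mb),
}
for name, value in signature.items():
    print(name, round(value, 3))
```

Only r(M,B|A) vanishes, which matches the row for model 1: conditioning on the mediator A removes the association between the marker and B.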
Scores from NEO Software
1. Scores for model selection:
• Model p-values
• Local edge orienting score (LEO.NB.SingleMarker)
2. Traditional SEM measures for assessing model fit:
• Root Mean Square Error of Approximation (RMSEA)
• Comparative Fit Index (CFI)
• Standardized Root Mean Square Residual (SRMSR)
Model P-values
• H0: correlation = 0, H1: |correlation| > 0
• Correlations close to zero: H0 cannot be rejected; it is possible that the data fit the null distribution.
• Larger p-values = better model fit. P-value > 0.05 is considered to indicate good fit.
• Steps for calculating a model p-value:
1. Correlation between a pair of nodes (r) is transformed to a Z-score using Fisher’s Z transformation: $z = \frac{1}{2}\ln\frac{1+r}{1-r}$
2. The corresponding p-value for this score can be obtained from a standard normal distribution table.
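The two steps can be sketched as follows. The slides do not show how the transformed value is standardized, so this example assumes the usual approach: multiply Fisher's z by sqrt(n - 3), where n is the number of samples, and use a two-sided test. The function name and the sample values are hypothetical.

```python
import math

def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: correlation = 0, via Fisher's Z.
    r: sample correlation, n: number of samples (assumed n > 3)."""
    z = math.atanh(r)                  # equals 0.5 * ln((1 + r) / (1 - r))
    z_score = z * math.sqrt(n - 3)     # approximately standard normal under H0
    p = math.erfc(abs(z_score) / math.sqrt(2))   # two-sided normal tail area
    return z_score, p

z_null, p_null = fisher_z_pvalue(0.0, 30)   # no correlation at all
z_alt, p_alt = fisher_z_pvalue(0.5, 30)     # moderate correlation
print(p_null, p_alt)
```

`math.erfc` replaces the lookup in a standard normal table, so no external statistics library is needed.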
LEO Score = Relative Model Fit
• Compares p-value of model A → B with next best p-value.
• LEO score > 1 indicates possible causal model.
• Implies model p-value of causal model is 10^1 = 10-fold higher than best competing model.
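The 10-fold statement suggests reading the score as a base-10 logarithm of a ratio of model p-values. The following sketch makes that assumption explicit; the exact definition used inside the NEO software may differ, and the function name and p-values are hypothetical.

```python
import math

def leo_score(p_causal, p_competitors):
    """log10 of the causal model's p-value over the best competing p-value.
    Assumed form, inferred from the 10-fold interpretation of a score of 1."""
    return math.log10(p_causal / max(p_competitors))

# Hypothetical model p-values: causal model A -> B versus two alternatives.
score = leo_score(0.5, [0.05, 0.001])
print(score)
```

Because larger model p-values indicate better fit, a positive score means the causal model fits better than every competitor considered.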
SEM Measures for Assessing Model Fit
• Compare observed (Sm) and expected (Σ) covariance matrices.
• Σ consists of path coefficients among traits and genetic markers.
• Recommended thresholds for assessing likely causality:
  • RMSEA ≤ 0.05
  • CFI ≥ 0.90
  • SRMSR ≤ 0.10
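The three thresholds can be applied together as a simple screen. A minimal sketch: the function name is hypothetical, and requiring all three criteria at once is an assumption, since the slides list the thresholds without saying how to combine them.

```python
def passes_fit_thresholds(rmsea, cfi, srmsr):
    """True if a fitted model meets all three recommended thresholds."""
    return rmsea <= 0.05 and cfi >= 0.90 and srmsr <= 0.10

# Hypothetical fit indices for two candidate models.
print(passes_fit_thresholds(rmsea=0.03, cfi=0.95, srmsr=0.04))  # True
print(passes_fit_thresholds(rmsea=0.08, cfi=0.95, srmsr=0.04))  # False
```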
Multi-marker Models
• NEO analysis has been generalized to multiple markers.
• Two LEO scores per model rather than one.
• Common pleiotropic anchor (CPA) > 0.8
• Orthogonal candidate anchor (OCA) > 0.3
4 Multi-Marker Models
Multi-Marker NEO can Perform Marker Selection
• Methods
  • Selecting markers with the best correlation
  • Forward-stepwise multivariate regression approach
  • Combination
• The OCA and CPA scores are computed at each SNP selection step and should be robust to the number of SNPs selected
Multi-Marker Simulation Test
• Simulated a causal network consisting of the following nodes:
  • 5 gene expressions (E1-E5)
  • Each gene expression controlled by 3 SNPs (18 correct SNPs)
  • 82 Noise SNPs
  • Trait
  • Confounder
• Simulated edges: E1 → E2, E1 → E3, E3 ← Hidden Confounder → E4, E4 → Trait, Trait → E5
• Can NEO retrieve the correct SNPs and the correct edge orientations?
Simulation Results
• A red or orange square in position (i,j) indicates that a trait in row i causally affects the corresponding trait from column j.
• NEO successfully reproduced the simulated orientations.
• All 18 SNPs were identified.
NEO and WGCNA Software Available Online
• R software, tutorials, and simulated and real data sets for NEO and WGCNA can be found online:
• www.genetics.ucla.edu/labs/horvath/aten/NEO/
• http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
• Google search
  • weighted co-expression network
  • “WGCNA”
  • “co-expression network”
Summary: WGCNA & NEO
• WGCNA is a systems genetics approach that is useful for complex disease analysis
  • Genetic signal is weak for individual genes, problematic for traditional DNA-level analyses
  • RNA-level data analysis may identify clusters of genes corresponding to trait-related pathways
  • Helps alleviate the multiple testing problem
  • Focusing on clusters of genes rather than individual genes improves information quality from microarray data
• WGCNA is also useful for inter-species comparison of gene expression levels
• NEO can estimate edge orientation in a weighted gene co-expression network if relevant genetic marker data is available
• NEO can also perform marker selection
Key References:
Acknowledgements
• WGCNA developed by Bin Zhang and Steve Horvath
• NEO developed by Jason Aten and Steve Horvath
• Lab members: Peter Langfelder, Jun Dong, Tova Fuller, Ai Li, Wen Lin, Wei Zhao
• Collaborators: Jake Lusis, Tom Drake, Anatole Ghazalpour