Tutorial 7
Gene expression analysis
Gene expression analysis
• How to interpret an expression matrix
• Expression data DBs - GEO
• General clustering methods - Unsupervised clustering
• Hierarchical clustering
• K-means clustering
• Tools for clustering - EPCLUST
• Functional analysis - GO annotation
Gene expression data sources
• Microarrays
• RNA-seq experiments
How to interpret an expression data matrix
• Each column represents the expression levels of all genes:
  – In two-color array: from a single experiment.
  – In one-color array: from a single sample.
• Each row represents the expression of a gene across all experiments.
Exp1/Sample 1   Exp2/Sample 2   Exp3/Sample 3   Exp4/Sample 4   Exp5/Sample 5   Exp6/Sample 6
Gene 1 -1.2 -2.1 -3 -1.5 1.8 2.9
Gene 2 2.7 0.2 -1.1 1.6 -2.2 -1.7
Gene 3 -2.5 1.5 -0.1 -1.1 -1 0.1
Gene 4 2.9 2.6 2.5 -2.3 -0.1 -2.3
Gene 5 0.1 2.6 2.2 2.7 -2.1
Gene 6 -2.9 -1.9 -2.4 -0.1 -1.9 2.9
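The row and column reading of the matrix can be sketched in a few lines of Python. This is only an illustration: the first three rows of the table above are copied in, and the names `matrix`, `gene_profile`, and `sample_column` are hypothetical helpers, not part of any tool used in this tutorial.

```python
# Illustrative sketch: rows are genes, columns are experiments/samples.
# Values are the first three rows of the table above.
samples = ["Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5", "Sample 6"]
matrix = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    "Gene 3": [-2.5, 1.5, -0.1, -1.1, -1.0, 0.1],
}


def gene_profile(name):
    """Return one row: the expression of a gene across all samples."""
    return matrix[name]


def sample_column(sample):
    """Return one column: the expression of every gene in a single sample."""
    j = samples.index(sample)
    return {gene: values[j] for gene, values in matrix.items()}


print(gene_profile("Gene 1"))
print(sample_column("Sample 2"))
```

Clustering genes compares rows with each other; clustering samples would compare columns instead.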
How to interpret an expression data matrix
Each element is a log ratio:
• In two-color array: log2(T/R)
  T - the gene expression level in the testing sample
  R - the gene expression level in the reference sample
• In one-color array: log2(X)
  X - the gene expression level in the current sample
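As a small worked example of these definitions, the sketch below computes both kinds of matrix element. The intensity values are made up for illustration, and the function names are hypothetical.

```python
import math


def two_color_log_ratio(test, reference):
    """log2(T/R): positive when T > R, zero when T == R, negative when T < R."""
    return math.log2(test / reference)


def one_color_log_value(intensity):
    """log2(X) for a single-sample intensity X."""
    return math.log2(intensity)


# Made-up intensities: a four-fold increase, no change, a four-fold decrease.
for t, r in [(400, 100), (100, 100), (100, 400)]:
    print(t, r, two_color_log_ratio(t, r))
```

Because of the log scale, a four-fold increase and a four-fold decrease are symmetric around zero (+2 and -2).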
How to interpret an expression data matrix
In two-color array: Scale
Red indicates a positive log ratio: T>R
Black indicates a log ratio of zero: T≈R
Green indicates a negative log ratio: T<R
[Figure: heat map of Gene 1 to Gene 6 across Samp 1 to Samp 6]
In one-color array: Scale
Bright green indicates a high expression value
Black indicates no expression
[Figure: heat map of Gene 1 to Gene 6 across Expr.1 to Expr.6]
Microarray Data: Different representations
[Figure: two plots with axes labeled Exp and Log ratio; regions labeled T<R and T>R]
How to analyze gene expression data
Expression profiles DBs
• GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/
• Human genome browser: http://genome.ucsc.edu/
• ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
The current rate of submission and processing is over 10,000 Samples per month.
In 2002, Nature journals announced a requirement that microarray data be deposited in public databases.
Searching for expression profiles in the GEO
http://www.ncbi.nlm.nih.gov/geo/
GEO accession IDs
GPL**** - platform ID
GSM**** - sample ID
GSE**** - series ID
GDS**** - dataset ID
• A Series record defines a set of related Samples considered to be part of a group.
• A GDS record represents a collection of biologically and statistically comparable GEO samples. Not every experiment has a GDS.
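The prefix convention can be checked mechanically. The sketch below is a minimal illustration that assumes an accession is one of the four prefixes followed by digits; the function name and the example IDs are made up and are not taken from GEO.

```python
import re

# Record type for each accession prefix, as listed above.
GEO_PREFIXES = {"GPL": "platform", "GSM": "sample", "GSE": "series", "GDS": "dataset"}


def classify_geo_accession(accession):
    """Return the record type for an accession string, or None if it does not match."""
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)(\d+)", accession.strip().upper())
    if match is None:
        return None
    return GEO_PREFIXES[match.group(1)]


# Hypothetical IDs, used only to exercise the function.
for acc in ["GSE1234", "gds507", "ABC12"]:
    print(acc, classify_geo_accession(acc))
```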
Download dataset
Clustering
Statistical analysis
Clustering analysis
Clustering analysis – zoom in
Viewing the expression levels
Clustering
Grouping together “similar” genes
Clustering
• Unsupervised learning: The classes are unknown a priori and need to be “discovered” from the data.
• Supervised learning: The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects. This information is then used to classify future observations.
http://www.bioconductor.org/help/course-materials/2002/Seattle02/Cluster/cluster.pdf
Unsupervised Clustering
• Hierarchical methods - These methods provide a hierarchy of clusters, from the smallest set, where all objects are in one cluster, through to the largest set, where each observation is in its own cluster.
• Partitioning methods - These usually require the specification of the number of clusters. Then a mechanism for apportioning objects to clusters must be determined.
http://www.bioconductor.org/help/course-materials/2002/Seattle02/Cluster/cluster.pdf
Hierarchical Clustering
This clustering method is based on distances between the expression profiles of different genes. Genes with similar expression patterns are grouped together.
Rings a bell?...
• In both phylogenetic trees and clustering, we create a tree based on a distance matrix.
• When computing phylogenetic trees, we compute distances between sequences.
• When computing clustering dendrograms, we compute distances between expression values.
[Figure: two sequences compared by a score, next to the expression profiles of Gene 1 and Gene 2 (Expr.1 to Expr.6) compared by a score]
How to determine the similarity between two genes?
Patrik D'haeseleer, How does gene expression clustering work?, Nature Biotechnology 23, 1499 - 1501 (2005) , http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
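Two commonly used ways to score the similarity of two expression profiles are Euclidean distance and Pearson correlation. The sketch below is an illustration under that assumption; the slides themselves do not say which measure is used. The two profiles are Gene 1 and Gene 2 from the expression matrix shown earlier.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two profiles; 0 means identical values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Correlation of the two profiles' shapes: +1 same trend, -1 opposite trend."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


gene1 = [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9]
gene2 = [2.7, 0.2, -1.1, 1.6, -2.2, -1.7]
print(euclidean_distance(gene1, gene2))
print(pearson_correlation(gene1, gene2))
```

Euclidean distance is sensitive to the magnitude of the values, while correlation compares only the shape of the profiles, so the choice of measure can change which genes end up clustered together.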
Hierarchical clustering methods produce a tree, or dendrogram. They avoid specifying how many clusters are appropriate by providing a partition for each K. The partitions are obtained by cutting the tree at different levels.
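A bottom-up (agglomerative) version of this idea can be sketched as follows. The choices here are assumptions made for illustration: Euclidean distance, average linkage, and made-up two-dimensional profiles. Stopping the merging when K clusters remain gives the same partition as building the full tree and cutting it at the level that leaves K branches.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(profiles, k):
    """Start with one cluster per profile and merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up profiles: two tight pairs and one profile on its own.
profiles = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [8.0, 8.0]]
print(agglomerate(profiles, 3))
print(agglomerate(profiles, 2))
```

Each profile is referred to by its index in `profiles`, so the output lists which profiles share a cluster at each cut.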
[Figure: dendrogram cut at different levels: 2 clusters, 4 clusters, 6 clusters]
The more clusters you ask for, the higher the similarity within each cluster.
http://discoveryexhibition.org/pmwiki.php/Entries/Seo2009
Hierarchical clustering results
http://www.spandidos-publications.com/10.3892/ijo.2012.1644
Unsupervised Clustering – K-means clustering
An algorithm to classify the data into K groups.
[Figure: example with K=4]
How does it work?
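The algorithm iteratively divides the genes into K groups and calculates the center of each group. The result is a partition into K clusters in which every gene belongs to the cluster with the nearest center. The outcome depends on the initial centers and is not guaranteed to be the best possible partition.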
1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
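The four steps can be sketched directly in Python. This is a minimal illustration, not the implementation used by any particular tool: it assumes Euclidean distance, a fixed random seed for reproducibility, and made-up two-dimensional points standing in for expression profiles.

```python
import math
import random


def mean_point(points):
    """Centroid: the coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest(point, means):
    """Index of the mean closest to the point."""
    return min(range(len(means)), key=lambda i: math.dist(point, means[i]))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # step 1: k initial means picked from the data
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # step 2: assign every point to its nearest mean
            clusters[nearest(p, means)].append(p)
        # step 3: each centroid becomes the new mean (an empty cluster keeps its old mean)
        new_means = [mean_point(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:  # step 4: stop once the means no longer move
            break
        means = new_means
    return clusters, means


# Made-up points forming two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
clusters, means = k_means(points, 2, seed=1)
print(clusters)
print(means)
```

With data that is less cleanly separated, different seeds can give different final clusters, which is why the algorithm is often run several times.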
How should we determine K?
• Trial and error
• Take K as the square root of the number of genes
Tool for clustering - EPCLUST
http://www.bioinf.ebc.ee/EP/EP/EPCLUST/
Choose distance metric
Choose algorithm
Hierarchical clustering
Zoom in by clicking on the nodes
K-means clustering
Graphical representation of the cluster
Samples found in cluster
10 clusters, as requested
Now what?
Now that we have clusters, we want to know the function of each group.
This calls for some kind of generalization of gene functions.
Gene Ontology (GO)
http://www.geneontology.org/
The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains:
Cellular Component (CC) - the parts of a cell or its extracellular environment.
Molecular Function (MF) - the elemental activities of a gene product at the molecular level, such as binding or catalysis.
Biological Process (BP) - operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
Gene Ontology (GO)
The GO tree
GO sources
ISS - Inferred from Sequence/Structural Similarity
IDA - Inferred from Direct Assay
IPI - Inferred from Physical Interaction
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
IMP - Inferred from Mutant Phenotype
IGI - Inferred from Genetic Interaction
IEP - Inferred from Expression Pattern
IC - Inferred by Curator
ND - No Data available
IEA - Inferred from Electronic Annotation
DAVID
Functional Annotation Bioinformatics Microarray Analysis
• Identify enriched biological themes, particularly GO terms
• Discover enriched functional-related gene/protein groups
• Cluster redundant annotation terms
• Explore gene names in batch
http://david.abcc.ncifcrf.gov/
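To give a feel for what "enriched" means, the sketch below computes a plain hypergeometric tail probability: the chance of seeing at least the observed number of annotated genes in a list of this size if genes were drawn at random from the background. This is an illustrative assumption, not DAVID's own calculation; DAVID reports an EASE score, which its documentation describes as a modified, more conservative Fisher exact p-value. All counts and names below are made up.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """P(X >= list_hits) when list_size genes are drawn at random from a
    background of background_size genes, background_hits of which carry the term."""
    upper = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, x) * comb(background_size - background_hits, list_size - x)
        for x in range(list_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 8 of 50 cluster genes carry a term that annotates
# 100 of 10,000 background genes.
print(enrichment_p_value(8, 50, 100, 10000))
```

A small p-value suggests the term appears in the gene list more often than chance alone would explain.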
• ID conversion, annotation, classification
• Functional annotation: upload
• Charts for each category
• Genes from your list involved in this category
• Minimum number of genes for corresponding term
• Maximum EASE score / E-value
• E-value
• Enriched terms associated with your genes
• Source of term
A group of terms having similar biological meaning due to sharing similar gene members