Post on 21-Dec-2015
2
Microarrays
Measure the expression of genes in the cell
  “Count” the number of mRNA molecules that attach to biological probes
Expression data is gathered for many (thousands of) genes at once
Data is gathered for several experiments
  Either at several time points or under different conditions
3
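$$
\begin{pmatrix}
e_{11} & e_{12} & e_{13} & \cdots & e_{1M} \\
e_{21} &        &        &        & e_{2M} \\
e_{31} &        & e_{ij} &        & \vdots \\
\vdots &        &        & \ddots &        \\
e_{N1} & \cdots &        &        & e_{NM}
\end{pmatrix}
$$
Rows: the N genes. Columns: the M experiments.
$e_{ij}$: relative expression of gene i in experiment j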
4
Detecting Patterns in Expression Data
Genes may have similar expression patterns because
  They are part of the same complex (protein-protein interactions)
  They are part of the same pathway
  They have similar regulatory elements
  They have similar functions (part of a fail-safe mechanism)
A popular solution: clustering (which we have already seen)
  Hierarchical clustering, K-means, agglomerative, ...
Today: dimensionality reduction
  PCA
  SVD
5
Why Dimensionality Reduction
Using irrelevant data may harm accuracy
Clustering algorithms do not perform well in high dimensional data
Visualizing high dimensional data
Figure: each 2-D point $(x_i, y_i)$ is mapped to a 1-D coordinate $x'_i$
6
Principal Components Analysis (PCA)
PCA seeks a linear projection that best describes the data in the least-mean-squares sense
Finds a set of principal components (PCs)
  A PC defines a projection that captures the maximum amount of variation in the dataset
  Each PC is orthogonal to all other PCs
Reduce dimensionality by picking the most informative PCs
  Namely, to reduce from dimension d to dimension d', pick the d' most informative PCs
7
PCA - Steps
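Input: a dataset $S = \{s_1, \ldots, s_n\}$, where each sample is $s_i = (s_i^1, \ldots, s_i^d)$
Subtract the mean from each dimension
Compute the covariance matrix for the d dimensions
  The covariance of two variables X and Y:
  $$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
  The covariance matrix: the entry for dimensions X and Y is $\operatorname{cov}(X, Y)$; since $\operatorname{cov}(X, Y) = \operatorname{cov}(Y, X)$, the matrix is symmetric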
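The first two steps (centering and the covariance matrix) can be sketched in NumPy as follows. This is a minimal illustration, not code from the slides: the function name `covariance_matrix` and the toy data are assumptions made for the example, and samples are assumed to be stored one per row.

```python
import numpy as np

def covariance_matrix(samples):
    """Return the d x d covariance matrix of an (n, d) array of samples."""
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    # Subtract the mean from each dimension (each column).
    centered = samples - samples.mean(axis=0)
    # Entry (a, b) is the sum over samples of the centered products, over (n - 1).
    return centered.T @ centered / (n - 1)

# Hypothetical toy data: n = 5 samples in d = 3 dimensions.
rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 3))
C = covariance_matrix(toy)
```

The result agrees with `np.cov(toy, rowvar=False)`, which uses the same (n - 1) normalization by default.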
8
PCA – Steps (cont.)
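Compute the eigenvectors and eigenvalues of the covariance matrix
Choose the most informative PCs and construct a feature vector
  Eigenvectors with the highest eigenvalues carry the most information
  The feature vector is simply the combination of all chosen eigenvectors:
  FeatureVector = (eig_1, eig_2, …, eig_{d'})
Transform the dataset to the new axis system
  For $s \in S$:
  $$
  s' = \mathrm{FeatureVector}^T \, s, \quad \text{i.e.} \quad
  \begin{pmatrix} s'^1 \\ s'^2 \\ \vdots \\ s'^{d'} \end{pmatrix}
  =
  \begin{pmatrix} eig_1 \\ eig_2 \\ \vdots \\ eig_{d'} \end{pmatrix}
  \begin{pmatrix} s^1 \\ s^2 \\ \vdots \\ s^d \end{pmatrix}
  $$
  where each $eig_k$ is written as a row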
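Put together, the PCA steps might look like the sketch below. It is an illustration under stated assumptions: samples are stored one per row, `pca_project` and the toy data are hypothetical names introduced here, and `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_project(samples, d_prime):
    """Project an (n, d) dataset onto its d_prime most informative PCs."""
    samples = np.asarray(samples, dtype=float)
    centered = samples - samples.mean(axis=0)        # subtract the mean
    cov = np.cov(centered, rowvar=False)             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d_prime]      # indices of the largest eigenvalues
    feature_vector = eigvecs[:, order]               # d x d_prime, one eigenvector per column
    projected = centered @ feature_vector            # row i is FeatureVector^T applied to s_i
    return projected, feature_vector, eigvals[order]

# Hypothetical toy data: 50 samples in 4 dimensions with unequal spread.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
projected, feature_vector, eigenvalues = pca_project(toy, d_prime=2)
```

In the new axis system, the sample variance along each chosen PC equals the corresponding eigenvalue, which is one way to read "eigenvectors with the highest eigenvalues carry the most information".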
9
When Things Get Messy…
PCA is fine when the initial dimension is not too big
  Space and time complexity are O(d^2), the size of the covariance matrix
Otherwise, we have a problem…
  E.g. when d = 10^4, time/space complexity is O(10^8)…
Luckily, an alternative exists: SVD
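One standard way to see the connection, shown here only as an illustration and not as the procedure the slides go on to describe: the right singular vectors of the mean-centered data matrix are the principal components, and the squared singular values divided by (n - 1) are the covariance eigenvalues, so the d x d covariance matrix never has to be formed. The function name `pca_via_svd` and the toy data are assumptions made for the example.

```python
import numpy as np

def pca_via_svd(samples, d_prime):
    """PCA of an (n, d) dataset without building the d x d covariance matrix."""
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    centered = samples - samples.mean(axis=0)
    # Thin SVD: Vt has min(n, d) rows, each a unit-length direction in R^d.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:d_prime]                          # top d_prime PCs, one per row
    eigenvalues = singular_values[:d_prime] ** 2 / (n - 1)
    return centered @ components.T, components, eigenvalues

# Hypothetical "few samples, many dimensions" data: n = 6, d = 1000.
rng = np.random.default_rng(0)
wide = rng.normal(size=(6, 1000))
reduced, components, eigenvalues = pca_via_svd(wide, d_prime=3)
```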
10
Eigengenes, Eigenarrays and SVD
The idea:
  Use the singular value decomposition (SVD) theorem to transform the dataset from the gene/array space to the eigengene/eigenarray space
Eigengenes, eigenarrays and eigenvalues:
  Each dimension is represented by an eigengene/eigenarray/eigenvalue triplet
  Eigenvalues are used for ranking dimensions
Paper: Alter et al., 2000
11
Singular Value Decomposition (SVD)
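Theorem: if E is a real M by N matrix, then there exist orthogonal matrices
$$U = [u_1, \ldots, u_M] \in \mathbb{R}^{M \times M} \quad \text{and} \quad V = [v_1, \ldots, v_N] \in \mathbb{R}^{N \times N}$$
s.t.
$$E = U W V^T$$
where
$$W = \operatorname{diag}(\sigma_1, \ldots, \sigma_p)$$
and
$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0, \qquad p = \min(M, N)$$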
12
SVD
$\sigma_i$ is the ith singular value of E. $u_i$ and $v_i$ are the ith left singular vector and right singular vector of E, respectively.
It holds that
$$E v_i = \sigma_i u_i, \qquad E^T u_i = \sigma_i v_i, \qquad i = 1 : \min(M, N)$$
Efficient algorithms for calculating the SVD exist
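The statements on the last two slides can be checked numerically with `numpy.linalg.svd`. The sketch below uses a small random matrix as a stand-in for real expression data; the variable names (`sigma` for the singular values, `W` for the rectangular diagonal matrix) are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 6
E = rng.normal(size=(M, N))            # hypothetical real M by N matrix

# full_matrices=True returns the square orthogonal factors U (M x M) and V^T (N x N).
U, sigma, Vt = np.linalg.svd(E, full_matrices=True)
V = Vt.T
p = min(M, N)

# Build the M x N matrix W with the singular values on its diagonal.
W = np.zeros((M, N))
W[:p, :p] = np.diag(sigma)

reconstructed = U @ W @ Vt             # should reproduce E up to rounding

# Check the singular-vector relations for each i.
relations_hold = all(
    np.allclose(E @ V[:, i], sigma[i] * U[:, i])
    and np.allclose(E.T @ U[:, i], sigma[i] * V[:, i])
    for i in range(p)
)
```

NumPy returns the singular values already sorted in descending order, matching the ordering in the theorem.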
13
Orthogonality of Decomposition
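$$E = U W V^T, \qquad W = \operatorname{diag}(\sigma_1, \ldots, \sigma_p)$$
$$
E = \begin{pmatrix}
e_{11} & e_{12} & \cdots & e_{1N} \\
e_{21} &        &        & \vdots \\
\vdots &        &        &        \\
e_{M1} & \cdots &        & e_{MN}
\end{pmatrix}
$$
$$
U = [u_1, \ldots, u_M] = \begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1M} \\
u_{21} &        &        & \vdots \\
\vdots &        &        &        \\
u_{M1} & \cdots &        & u_{MM}
\end{pmatrix}, \qquad u_i = (u_{1i}, \ldots, u_{Mi})^T
$$
$$
V = [v_1, \ldots, v_N] = \begin{pmatrix}
v_{11} & v_{12} & \cdots & v_{1N} \\
v_{21} &        &        & \vdots \\
\vdots &        &        &        \\
v_{N1} & \cdots &        & v_{NN}
\end{pmatrix}, \qquad v_i = (v_{1i}, \ldots, v_{Ni})^T
$$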
14
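Orthogonality of Decomposition
With $\sigma_k$ the kth singular value, multiplying U by W scales the kth column of U by $\sigma_k$ (zero columns appear when N > M):
$$U W = [\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_p u_p, 0, \ldots, 0]$$
Multiplying by $V^T$ then gives E entry by entry:
$$e_{ij} = (U W V^T)_{ij} = \sum_{k=1}^{p} u_{ik} \, \sigma_k \, v_{jk}$$
where $u_{ik}$ is the ith component of $u_k$ and $v_{jk}$ is the jth component of $v_k$. Equivalently:
$$E = \sum_{k=1}^{p} \sigma_k \, u_k v_k^T$$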
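The sum-of-outer-products form can be checked directly. In this sketch the matrix is random toy data and `rank_k_approximation` is a hypothetical helper name; truncating the sum after k terms is shown as a well-known consequence of the decomposition, not as something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.normal(size=(5, 3))                       # hypothetical 5 x 3 matrix
U, sigma, Vt = np.linalg.svd(E, full_matrices=False)

def rank_k_approximation(k):
    """Sum of the first k terms sigma_i * u_i * v_i^T."""
    return sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

full = rank_k_approximation(len(sigma))           # all p terms: reproduces E
rank_one = rank_k_approximation(1)                # leading term only

# The Frobenius-norm error of the truncated sum depends only on the dropped singular values.
rank_one_error = np.linalg.norm(E - rank_one)
```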
15
SVD and Microarray analysis
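Reduction from the N genes x M arrays space to the p eigengenes x p eigenarrays space
  W is the eigenexpression matrix
  U represents the expression of genes over eigenarrays
  V represents the expression of eigengenes over arrays
The “fraction of eigenexpression”, where $\sigma_i$ is the ith singular value (the ith diagonal entry of W):
$$p_i = \frac{\sigma_i^2}{\sum_{k=1}^{p} \sigma_k^2}$$
“Shannon entropy” of the dataset:
$$0 \le d = \frac{-1}{\log(p)} \sum_{k=1}^{p} p_k \log(p_k) \le 1$$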
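Both quantities are simple functions of the singular values. The sketch below is illustrative: the function names are hypothetical, terms with a zero fraction are treated as contributing zero to the entropy (the usual 0 log 0 = 0 convention, an assumption not stated on the slide), and at least two singular values are required so that log(p) is nonzero.

```python
import numpy as np

def eigenexpression_fractions(singular_values):
    """Fraction of eigenexpression: squared singular values normalized to sum to 1."""
    squared = np.asarray(singular_values, dtype=float) ** 2
    return squared / squared.sum()

def shannon_entropy(fractions):
    """Normalized entropy in [0, 1]; needs at least two fractions."""
    fractions = np.asarray(fractions, dtype=float)
    p = fractions.size
    nonzero = fractions[fractions > 0]            # convention: 0 * log(0) = 0
    return -np.sum(nonzero * np.log(nonzero)) / np.log(p)

# Two extreme hypothetical cases with p = 4.
single_dominant = shannon_entropy(eigenexpression_fractions([3.0, 0.0, 0.0, 0.0]))
evenly_spread = shannon_entropy(eigenexpression_fractions([1.0, 1.0, 1.0, 1.0]))
```

An entropy of 0 corresponds to all expression being captured by a single eigengene, and an entropy of 1 to expression spread evenly over all of them.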
16
Example: Cell cycle of Saccharomyces cerevisiae
Data is available for 5981 genes over 14 time steps (at ½-hour intervals)
784 genes were classified as cell-cycle regulated (with no missing values)
17
Data Sorting
For eigengenes 1 and 2, plot the correlation of each gene with both on a 2-D plot
  The X-axis represents the correlation with eigengene 2, the Y-axis the correlation with eigengene 1
Sort by angular distance
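A sketch of this sorting step is given below, under several assumptions that are not in the slides: the expression matrix has one row per gene and one column per array, the eigengenes are taken as the first two rows of V^T from `numpy.linalg.svd`, "correlation" is computed as the Pearson correlation, and the toy data are synthetic sinusoids with random phases. The function name `sort_genes_by_angle` is hypothetical.

```python
import numpy as np

def sort_genes_by_angle(expression):
    """Order genes by their angle in the (corr. with eigengene 2, corr. with eigengene 1) plane."""
    expression = np.asarray(expression, dtype=float)
    _, _, Vt = np.linalg.svd(expression, full_matrices=False)
    eigengene_1, eigengene_2 = Vt[0], Vt[1]        # profiles over the arrays

    def correlation_with(profile):
        # Pearson correlation of every gene's row with the given profile.
        return np.array([np.corrcoef(gene, profile)[0, 1] for gene in expression])

    corr_1 = correlation_with(eigengene_1)         # Y-axis
    corr_2 = correlation_with(eigengene_2)         # X-axis
    angles = np.arctan2(corr_1, corr_2)            # angle of the point (x, y), in [-pi, pi]
    return np.argsort(angles), angles

# Synthetic toy data: 20 periodic "genes" over 14 time steps, each with its own phase.
rng = np.random.default_rng(3)
phases = rng.uniform(0.0, 2.0 * np.pi, size=20)
t = np.arange(14)
toy = np.cos(2.0 * np.pi * t / 14 - phases[:, None]) + 0.05 * rng.normal(size=(20, 14))
order, angles = sort_genes_by_angle(toy)
```

Genes with constant expression would give an undefined Pearson correlation, so real data would need such rows filtered out first.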