Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data

Li Zhang and Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260
lizhang,[email protected]

Murali Ramanathan
Department of Pharmaceutical Sciences
State University of New York at Buffalo
Buffalo, NY
[email protected]
Abstract
DNA microarray technology provides a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns and relationships in a complex dataset by presenting the data in a graphical format in which the key characteristics become more apparent. The purpose of this study is to present an interactive visualization technique that conveys the temporal patterns of gene expression data in a form intuitive for non-specialized end-users.
The first Fourier harmonic projection (FFHP) is introduced to translate multi-dimensional time series data into a two-dimensional scatter plot. The spatial relationships of the points reflect the structure of the original dataset, and relationships among clusters become two-dimensional. The proposed method was tested using two published, array-derived gene expression datasets. Our results demonstrate the effectiveness of the approach.
Keywords: visualization, gene expression, time series, Fourier harmonic projection
1 INTRODUCTION
Knowledge of the spectrum of genes expressed at a certain time or under given conditions is instrumental to understanding the workings of a living cell. DNA microarray technology allows the expression levels of thousands of genes to be measured simultaneously [9]. Extensive research has been conducted on the temporal patterns of gene expression [10, 16]. Clustering methods, which group genes or samples with similar patterns, have become a mainstream analysis tool [14]. Visualization can facilitate the discovery of structures, features, patterns, and relationships in data and may provide more insight than traditional numerical methods. Through visualization, we hope to gain some intuition regarding the data but, more importantly, we would like to understand the relationships between data points and detect the intrinsic structure or possible cluster tendencies. Visualization is especially important in the early stages of data analysis, in which qualitative analysis takes precedence over quantitative analysis. Early success will enhance the users' performance in the remaining stages of analysis. Array-derived gene expression datasets present analysis and visualization challenges because of their dimensionality, noise, and variety of patterns.
The parallel coordinates approach [18] is perhaps the simplest method to display patterns of gene expression profiles. Here the data in each dimension are plotted along a separate axis. Holter et al. [8] used parallel coordinates to visualize the temporal progression in the yeast cell cycle data. The self-organizing map (SOM) [12] is another approach. More recently, Hautaniemi et al. [6] presented a heat map-based strategy for visualizing the U-matrix from SOM. The most prominent visualization-enhanced analysis tool for gene expression data is TreeView [5], which provides a user-friendly computational and graphical environment for assessing the results of hierarchical clustering. The graphical presentations from TreeView include a dendrogram to reflect the distance relationships between clusters and a heat plot to visually convey gene expression changes between samples. The heat plot can be viewed as a variation of the parallel coordinates plot in which color is used to convey dimension values.
Here, we present an alternative mapping for multi-dimensional data that is based on the first harmonic of the discrete Fourier transform. The mapping has interesting properties and preserves certain key characteristics of a variety of datasets, especially time series data. Unlike parallel coordinates and the heat plot, which display the information in every individual dimension, our approach uses a two-dimensional point to represent each gene over time, at a computational cost of $O(N \log N)$. It focuses on one very important aspect of visualization: revealing the structure of the entire dataset. In tests on two published, array-derived gene expression time series datasets, temporal patterns were well reflected in the visualization: cluster relationships became two-dimensional, clusters were apparent, and outliers were clear. An interactive visualization tool, VizStruct, was implemented to perform the visualization.
The remainder of this paper is organized as follows. Section 2 presents the visualization model. In Section 3, we show our analysis results. The final section discusses other issues with this approach. Proofs of all mapping propositions are included in the appendix.
2 METHODS AND SYSTEM
The Mapping
Mapping converts multi-dimensional data to two dimensions for visualization. An array-derived profile for $M$ genes with $N$ measurements results in $M$ $N$-dimensional data points containing real-valued numerical data. Time series data in its simplest form is merely a set of data $y_t$, $t = 0, \ldots, N-1$, where the subscript $t$ indicates the time at which the datum $y_t$ was observed [4]. On the other hand, a discrete-time real signal on $N$ evenly distributed time points is represented as a sequence of $N$ real numbers indexed by $n = 0, \ldots, N-1$ and denoted by $x[n]$ [1]; each term of the sequence is likewise written $x[n]$. The notational similarity between time series and digital signals suggests that we can view each data point in a time series as a discrete-time real signal (the signal's time index need not coincide with the actual time points). In this setting, the problem of two-dimensional visualization of the time series is transformed into the problem of finding a two-dimensional point estimate for signals (data points).
The frequency-domain representation of a discrete-time signal is obtained through the discrete Fourier transform (DFT) [13]. The DFT of an $N$-point signal $x[n]$ is a frequency sequence of $N$ complex values, $F(x[n]) = [F_k(x[n])]$, where each

$$F_k(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, \ldots, N-1. \tag{1}$$

$W_N = e^{-i2\pi/N}$ is called the twiddle factor.

Each harmonic $F_k$ in the DFT is a measure of the $k$th sinusoidal frequency component in the signal. For example, the zeroth harmonic, $F_0$, is the sum of the samples ($N$ times the mean value); the first harmonic, $F_1$, measures the base frequency component; the second harmonic, $F_2$, measures the component at twice the base frequency; and so forth. Because Fourier harmonics are, in general, complex numbers, each provides a two-dimensional point estimate for mapping a multi-dimensional signal. For this reason, we refer to the mapping as the Fourier harmonic projection. In particular, we are interested in the first Fourier harmonic projection (FFHP):

$$F_1(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{n} = \sum_{n=0}^{N-1} x[n]\, e^{-i2\pi n/N}. \tag{2}$$
The relationship between the DFT and the mapping allows the fast Fourier transform (FFT) algorithm, originally discovered by Cooley and Tukey [13], to be used for computation. The FFT is computationally efficient, with a complexity of $O(N \log N)$.
The complex number $F_1(x[n])$ in Equation (2) can be expressed in terms of its magnitude $r$ and phase $\theta$, which provides a useful geometric interpretation of the mapping. The dataset is normalized so that the values of each dimension range from 0 to 1 across the dataset. For a data point with $N$ dimensions, the powers of the complex exponential divide a unit circle centered at the origin of the complex plane into $N$ equally spaced angles. The value of the first dimension is projected onto the radial line corresponding to $\theta = 0$ and, similarly, the value of the $k$th dimension is projected onto the radial line corresponding to $\theta = 2\pi(1-k)/N$ radians. The overall two-dimensional FFHP mapping is the complex sum of all $N$ projections from a data point. Figure 1 illustrates the geometric interpretation for a point with 6 dimensions.
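The geometric description above can be sketched in a few lines of Python. This is a minimal illustration and not the VizStruct implementation: the function name `ffhp` and the sample values in `point` are hypothetical, and the first harmonic is computed by direct summation of Equation (2) rather than by an FFT, purely for brevity.

```python
import cmath
import math

def ffhp(x):
    """First Fourier harmonic F_1 of a real sequence x (Equation (2)), by direct summation."""
    n_dims = len(x)
    # Dimension n is projected onto the radial line at angle -2*pi*n/N, then all projections are summed.
    return sum(value * cmath.exp(-2j * math.pi * n / n_dims)
               for n, value in enumerate(x))

# Hypothetical normalized 6-dimensional data point (values in [0, 1]).
point = [0.9, 0.7, 0.2, 0.1, 0.3, 0.8]
image = ffhp(point)
radius, phase = cmath.polar(image)  # magnitude r and phase theta of the 2-D image
print(f"Re = {image.real:.3f}, Im = {image.imag:.3f}, r = {radius:.3f}, theta = {phase:.3f}")
```

The real and imaginary parts of the returned value are the two scatter-plot coordinates of the data point.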
Figure 1. A geometric interpretation of the first Fourier harmonic projection. A normalized 6-dimensional data point $x[n]$ is shown on the right as a stem plot. The powers of the twiddle factor $W_6$ divide the unit circle centered at the origin into 6 equal angles, and each dimension of the data point is projected onto a different radial line (open circles). The projections are summed as complex numbers to give the 2-dimensional image $F_1(x[n])$ (filled circle).

Figure 2. Illustration of the effect of (A) amplitude shifting and amplitude scaling, and (B) time shifting on the first Fourier harmonic projection.
Properties of FFHP
The first Fourier harmonic projection has useful properties that preserve the correlation between dimensions in a multi-dimensional data point. We summarize them as propositions below. Detailed proofs of the propositions are provided in the appendix.
1. Data points with equal values in all dimensions are mapped to the origin. If $x[n] = [a, \ldots, a]$, then $F_1(x[n]) = 0$.
2. Two data points whose dimension values differ by an amplitude shift of a constant are mapped to the same point. If $y[n] = x[n] + a$, then $F_1(y[n]) = F_1(x[n])$.
3. Two data points whose dimension values differ by multiplication by a constant are mapped to two points on a line through the origin. If $y[n] = a\,x[n]$, then $F_1(y[n]) = a\,F_1(x[n])$. See Figure 2A.
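4. Two data points whose dimension values are transposes of each other, i.e. symmetric about the middle time point, are mapped to points symmetric about the real axis. If $y[n] = x[N-n-1]$, then $F_1(y[n]) = \overline{F_1(x[n])}$, where the overline denotes complex conjugation. (With Equation (2) as written, the exact identity carries a fixed rotation, $F_1(y[n]) = W_N^{-1}\,\overline{F_1(x[n])}$, so the symmetry holds up to an angle of $2\pi/N$.)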
5. Data points that differ only because they are "time-shifted" by $d$ dimensions relative to each other are mapped onto the same circle centered at the origin, and the angle between the points in the visualization is $\varphi = 2\pi d/N$. If $y[n] = x[n-d]$ (indices taken modulo $N$), then $F_1(y[n]) = F_1(x[n])\, W_N^{d}$. This property is illustrated in Figure 2B.
6. Let $w[n] = x[n] - y[n]$ be the difference between two $N$-dimensional points, $x[n]$ and $y[n]$. The squared distance between these two points in the visualization is:

$$\|F_1(w[n])\|^2 = g_0 N \left(1 + 2\sum_{k=1}^{N-1} r_k \cos(2\pi k/N)\right). \tag{3}$$
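Propositions 1–5 can be checked numerically. The sketch below is only a sanity check, not the appendix proofs; the profile `x`, the constant `a`, and the shift `d` are hypothetical values chosen for illustration. Two assumptions are made explicit: the time shift in Proposition 5 is treated as circular, and Proposition 4 is checked in its exact form for $y[n] = x[N-n-1]$, which is the complex conjugate times the fixed rotation $W_N^{-1}$.

```python
import cmath
import math

def ffhp(x):
    """First Fourier harmonic of a real sequence, by direct summation of Equation (2)."""
    n_dims = len(x)
    return sum(v * cmath.exp(-2j * math.pi * n / n_dims) for n, v in enumerate(x))

def close(a, b, tol=1e-9):
    return abs(a - b) < tol

x = [0.2, 0.9, 0.6, 0.1, 0.4, 0.7, 0.3]   # hypothetical 7-point profile
N = len(x)
W = cmath.exp(-2j * math.pi / N)          # twiddle factor W_N
a, d = 2.5, 2                             # hypothetical constant and shift

prop1 = close(ffhp([a] * N), 0)                           # constant profile maps to the origin
prop2 = close(ffhp([v + a for v in x]), ffhp(x))          # amplitude shift leaves the image unchanged
prop3 = close(ffhp([a * v for v in x]), a * ffhp(x))      # amplitude scaling moves along a line through the origin
prop4 = close(ffhp(x[::-1]), ffhp(x).conjugate() / W)     # reversal: conjugate, rotated by W_N^{-1}
shifted = [x[(n - d) % N] for n in range(N)]              # circular shift y[n] = x[n - d]
prop5 = close(ffhp(shifted), ffhp(x) * W ** d)            # time shift rotates the image by 2*pi*d/N
print(prop1, prop2, prop3, prop4, prop5)
```

Because a time shift only multiplies the image by a unit-modulus factor, `abs(ffhp(shifted))` equals `abs(ffhp(x))`, which is why shifted profiles share a circle about the origin.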
Theory in Practice
We will demonstrate in the next section that the relative locations of the mapped images of temporal profiles can be predicted from Propositions 1–5. For example, genes with relatively low levels of expression throughout the time line are mapped close to the origin. Genes that steadily increase over time and genes that steadily decrease over time are likely to be mapped symmetrically about the real axis.
In Equation (3), $g_0$ is the variance of $w[n]$ and $r_k$ is the $k$th-sample autocorrelation coefficient of $w[n]$. It can be shown [4] that $-1 \le r_k \le 1$ and that $r_k = 0$ for mutually independent random sequences or white noise. Equation (3) provides insight into the cluster delineation capabilities of the first Fourier harmonic projection. The variance and the $k$th-sample autocorrelation coefficients of the difference between two points within a given cluster are likely to be small because the points share similarities across many dimensions. Thus, points within a cluster are likely to map close to each other in the visualization.
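Equation (3) can also be verified numerically. The sketch below assumes particular estimator conventions that the text does not spell out: $g_0$ is taken as the biased sample variance (sum of squared deviations divided by $N$) and $r_k$ as the lag-$k$ sample autocovariance divided by the lag-0 value. The profiles `x` and `y` are hypothetical.

```python
import cmath
import math

def ffhp(x):
    """First Fourier harmonic of a real sequence, by direct summation of Equation (2)."""
    n_dims = len(x)
    return sum(v * cmath.exp(-2j * math.pi * n / n_dims) for n, v in enumerate(x))

def eq3_rhs(w):
    """Right-hand side of Equation (3) under the assumed definitions of g_0 and r_k."""
    n_dims = len(w)
    mean = sum(w) / n_dims
    c = [v - mean for v in w]                 # centering does not change F_1 (Proposition 2)
    energy = sum(v * v for v in c)            # must be nonzero: w may not be constant
    g0 = energy / n_dims                      # assumed: biased sample variance
    total = 0.0
    for k in range(1, n_dims):
        r_k = sum(c[n] * c[n + k] for n in range(n_dims - k)) / energy
        total += r_k * math.cos(2 * math.pi * k / n_dims)
    return g0 * n_dims * (1 + 2 * total)

x = [0.2, 0.9, 0.6, 0.1, 0.4, 0.7, 0.3]       # hypothetical profiles
y = [0.3, 0.8, 0.4, 0.2, 0.5, 0.9, 0.1]
w = [a - b for a, b in zip(x, y)]

lhs = abs(ffhp(x) - ffhp(y)) ** 2             # squared distance in the visualization
rhs = eq3_rhs(w)
print(lhs, rhs)
```

Under these conventions the two printed values agree up to floating-point rounding, which follows from the linearity of the projection: $F_1(x[n]) - F_1(y[n]) = F_1(w[n])$.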
The first Fourier harmonic projection is well suited for visualizing temporal patterns. FFHP separates classes better when the first harmonic is dominant among all harmonics. Since the first harmonic measures the base sinusoidal frequency component in the signal, a lower-frequency signal tends to have a larger first harmonic. This is the case for a large number of time series datasets, in which most trend patterns show low-frequency variations (without multiple cycles). The first Fourier harmonic projection is sensitive to dimension order; for time series data, however, the time points (dimensions) are naturally ordered.
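The role of first-harmonic dominance can be illustrated with a small experiment. The measure used here, the share of non-constant spectral power carried by the first harmonic, is an illustrative choice and is not defined in the text; the helper names and the two synthetic profiles are likewise hypothetical. Because a real signal has a mirror-symmetric spectrum, only harmonics 1 through $\lfloor N/2 \rfloor$ are counted.

```python
import cmath
import math

def harmonic(x, k):
    """k-th DFT harmonic of a real sequence, by direct summation of Equation (1)."""
    n_dims = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * n / n_dims) for n, v in enumerate(x))

def first_harmonic_share(x):
    """Fraction of the power in harmonics 1..N//2 that belongs to the first harmonic."""
    powers = [abs(harmonic(x, k)) ** 2 for k in range(1, len(x) // 2 + 1)]
    total = sum(powers)
    return powers[0] / total if total > 0 else 0.0

N = 12
slow = [0.5 + 0.5 * math.sin(2 * math.pi * n / N) for n in range(N)]  # one cycle over the series
fast = [0.5 + 0.5 * math.cos(math.pi * n) for n in range(N)]          # alternates 1, 0, 1, 0, ...

print(first_harmonic_share(slow), first_harmonic_share(fast))
```

The single-cycle profile puts essentially all of its power into the first harmonic and maps far from the origin, while the rapidly alternating profile has almost none and maps close to the origin, where it would be hard to distinguish from a flat profile.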
VizStruct System
The first Fourier harmonic projection approach is implemented as a visualization tool called VizStruct, which is written in Java and is available on request from the first author ([email protected]). The name VizStruct emphasizes its capability of visualizing the structure of the dataset.

Figure 3. Three snapshots from a dimension tour through a synthetic three-dimensional dataset containing 25,000 points. The parameter settings for (A)–(C) were 〈0.5, 0.5, 0.5〉, 〈0.3, 0.5, 1〉, and 〈0.05, 1, 1〉, respectively.
Dimension Tour
The dimension tour is a feature of VizStruct that allows the user to interact with the data via dynamic animations. It is analogous to the grand tour [2]: the user interacts with the visualization by changing the dimension parameter associated with each dimension. The default value of each dimension parameter is 0.5, and the individual parameters can be changed over the range −1 to +1, either manually or systematically by the program. Because no two-dimensional mapping can capture all the interesting properties of the original multi-dimensional space, two points that are close in the visualization can be far apart in the multi-dimensional space. The dimension tour, which creates animations that explore the dimension parameter space, can reveal structures in the multi-dimensional input that were hidden by overlap with other points in the visualization.
Figure 3 illustrates the capabilities of the dimension tour using a synthetic dataset containing 25,000 points in three dimensions. At the default dimension parameter settings (Figure 3A), 5 clusters are apparent. During the course of the animation (Figures 3B and 3C), however, the multi-layered structures of the original 5 clusters become increasingly apparent.
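The text does not state how the dimension parameters enter the mapping. The sketch below assumes, purely for illustration, that each dimension value is multiplied by its parameter before the first harmonic is taken; the function `weighted_ffhp` and the two data points are hypothetical and may differ from what VizStruct does.

```python
import cmath
import math

def weighted_ffhp(x, params):
    """First harmonic of x after scaling dimension n by params[n] (assumed form of the tour)."""
    n_dims = len(x)
    return sum(p * v * cmath.exp(-2j * math.pi * n / n_dims)
               for n, (v, p) in enumerate(zip(x, params)))

# Two hypothetical 3-dimensional points that differ by a constant amplitude shift.
x = [0.2, 0.5, 0.1]
y = [v + 0.4 for v in x]

default = [0.5, 0.5, 0.5]   # default setting, as in Figure 3A
tour = [0.3, 0.5, 1.0]      # setting quoted for Figure 3B

gap_default = abs(weighted_ffhp(x, default) - weighted_ffhp(y, default))
gap_tour = abs(weighted_ffhp(x, tour) - weighted_ffhp(y, tour))
print(gap_default, gap_tour)
```

Under this assumption the two points coincide at the default setting, as Proposition 2 predicts for equal parameters, and separate once the parameters are unequal, which is one way a tour could expose points that overlap in the default view.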
3 RESULTS
Data Sets for Visualization
Our approach was tested using two published array-derived datasets. The rat kidney array dataset of Stuart et al. [16] contains measurements of gene expression during rat kidney organogenesis. The data were downloaded from http://organogenesis.ucsd.edu/data.html. The dataset consists of 873 genes that vary significantly during kidney development, measured at 7 time points: gestational days 13, 15, 17, and 19; newborn (N); 1 week (W); and nonpregnant adult (A).
The fibroblast dataset of Iyer et al. [10] is the result of a study of the response of human fibroblasts to serum. It consists of gene expression measurements of the temporal changes in mRNA levels of 517 human genes at 13 time points, ranging from 15 minutes to 24 hours after serum stimulation. The data were downloaded from http://genome-www.stanford.edu/serum.
Rat Kidney Dataset
In the rat kidney dataset, there are 5 discrete patterns or groups of gene expression during nephrogenesis. Figure 4 illustrates these temporal profiles with idealized gene expression curves.
Figure 5A shows how genes are classified by a hierarchical clustering algorithm; it reproduces Figure 3 of [16]. Figure 5B shows the parallel coordinates of the dataset. The patterns of genes in each group conform to the profiles depicted in Figure 4 (with some noise).
Figure 4. Idealized temporal gene expression profiles during kidney development (panels 1–5). The groups were named 1 through 5 based on the timing of their peak expression during development. The seven time points were embryonic days 13, 15, 17, and 19; N, newborn; W, 1 week old; A, adult.

Figure 5. Visualization of the rat kidney dataset as a heat plot and parallel coordinates. (A) Dendrogram and heat plot from the hierarchical clustering algorithm. (B) Parallel coordinates for the entire dataset (All Data) and for each of the gene groups (Group 1 through Group 5).
Figures 6A–B show the visualization of the rat kidney dataset in VizStruct under the first Fourier harmonic projection for two dimension parameter settings. There are 5 sets of colored symbols, one for each of the 5 gene groups. Each symbol represents one gene across 7 time points. In Figure 6A, two big clusters are clearly apparent. The top cluster consists of genes from groups 3, 4, and 5. The bottom cluster comprises genes from groups 1 and 2. Furthermore, genes from each group are aggregated.
The formation of two large clusters can be explained by the temporal profiles. Groups 1 and 2, whose genes have very high relative levels of expression in early development, are quite different from groups 3, 4, and 5, whose genes have a relatively steady increase in expression throughout development. The visualization also indicates that points in the upper cluster are symmetric to points in the lower cluster. The properties of FFHP suggest the reason. The temporal profiles of groups 1 and 4 are somewhat symmetric about the middle time point (gestational day 19). By Proposition 4, they would be mapped to points symmetric about the real axis. On the other hand, groups 4 and 5 are mapped close together since they have similar profiles except for significant up-regulation at the last time point. The same arguments can be applied to group 1 vs. group 2 and to group 3 vs. group 4.
Because of noise, most boundaries between groups are not very clear. However, the separation between group 4 and group 5 is better in Figure 6B than in Figure 6A.
Figure 6. Visualization of the rat kidney dataset in VizStruct (axes: Re[F1] and Im[F1]). The five gene groups G1–G5 were represented by blue plus symbols, red circles, green triangles, magenta stars, and black cross symbols. The dimension parameters for (A) and (B) were 〈0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5〉 and 〈1, 0.5, −1, −0.5, 0.5, 1, −1〉, respectively.

Fibroblast Dataset
Temporal patterns are somewhat more complicated in the fibroblast dataset. The data were classified into 10 groups using the hierarchical clustering algorithm by the original authors [10]. Figure 7A shows the result of the hierarchical clustering; it is a copy of Figure 3 from [10]. Figure 7B gives the parallel coordinates of the entire dataset and of each gene group.

Figure 7. Visualization of the fibroblast dataset as a heat plot and parallel coordinates. (A) Dendrogram and heat plot from the hierarchical clustering algorithm. (B) Parallel coordinates for the entire dataset and for each of the gene groups (Group A through Group J). The gene group labelled U consists of genes left without a label after hierarchical clustering.

Figure 8. Visualization of the fibroblast dataset in VizStruct (axes: Re[F1] and Im[F1]). Eleven colored numbers were used, one for each of the gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization in (A).

Figure 9. Visualization of the fibroblast dataset in VizStruct (axes: Re[F1] and Im[F1]). Genes were grouped by the k-means clustering method. Eleven colored numbers were used, one for each of the gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization in (A).
Figure 8A shows the visualization of the fibroblast dataset in VizStruct under the first Fourier harmonic projection. Colored numbers 1 through 11 were used for the gene groups. The visualization reveals several outliers and no distinct clusters. The majority of the data points are too densely packed to be distinguished. Figure 8B shows an enlarged detail obtained by zooming. More numbers are spread out, but most blue numbers (1) are still covered by numbers 2 and 3.
Two observations can be made: (1) a large number of genes are close to the center, and (2) most clusters have a radial shape emanating from the center. Both can be explained by the FFHP properties. (1) The temporal patterns of group A and group B are very flat and have relatively small values. By Propositions 1 and 2, they tend to be mapped close to the origin. (2) In hierarchical clustering, Pearson's correlation coefficient was used to measure similarity. Genes whose values over time differ by an amplitude shift or by multiplication by a constant have very high correlation coefficients. By Propositions 2 and 3, they tend to be mapped close together along a line through the origin.

Figure 10. Comparison of Sammon's mapping and the first Fourier harmonic projection. (A) The result of Sammon's mapping of the rat kidney dataset (axes: x and y). The five gene groups are represented by the same 5 colored symbols used in Figure 6. (B)–(C) Comparison of relative gene locations between (B) VizStruct (axes: Re[F1] and Im[F1]) and (C) Sammon's mapping (axes: x and y). Five colors were used, one for each of the gene groups; the color scheme is the same as in (A). Notice that the visualization layouts of (A) and (C) are identical. The purpose of panel (C) is to allow a gene-by-gene comparison of locations with panel (B).
Some relative gene group locations can be predicted. For example, by Proposition 4, groups 4 and 5 are mapped symmetrically about the real axis. Close inspection suggests that the temporal pattern of group 8 is roughly a 2 or 3 time-point right shift of the pattern of group 6. By Proposition 5, each time-point shift corresponds to a rotation of $2\pi/13$ radians ($\approx 27.7^\circ$, or roughly $30^\circ$). In Figure 8B, genes in group 8 are mapped about $60^\circ$ clockwise from genes in group 6.
The lack of clear clusters in the visualization indicates that different clustering methods may yield different results. Figure 9 shows the grouping produced by k-means clustering, with Euclidean distance as the distance measure. The visualization reveals that genes close to the origin are grouped into one cluster.
Comparison of FFHP to Other Visualizations
The heat plot and parallel coordinates display the information in every individual dimension. In parallel coordinates, a polyline represents a gene over time. This format is concise and intuitive for displaying temporal profiles. However, parallel coordinates have an obvious drawback: as the dataset grows, the plot becomes increasingly unreadable, as indicated by the first (all-data) parallel coordinates panels of Figure 5 and Figure 7. The heat plot uses a color mosaic instead of polylines to overcome the overlap. Combined with the dendrogram from hierarchical clustering, it gives a global view of the dataset as well as of individual clusters by grouping genes with similar patterns together. Cluster relationships in the heat plot are one-dimensional: clusters of genes are listed one after another, as shown in Figure 5A and Figure 7A.
The first Fourier harmonic projection takes a different approach. It uses a two dimensional point to represent each gene over time. By doing so, it takes advantage of the spatial relationships among the points to reflect the structure of the original dataset. Thus the cluster relationships become two dimensional, and FFHP is better able to display outliers than the heat plot. Unlike the heat plot, which requires an algorithm to group similar genes in order to produce a meaningful visualization, FFHP maps the multi-dimensional data directly onto a two dimensional space without any prior knowledge of the dataset.
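The projection described above can be sketched in a few lines. This is a minimal illustration, assuming the mapping is the unnormalized first DFT coefficient used in the appendix proofs; the function name `ffhp` and the toy expression matrix are hypothetical, and any scaling or interaction provided by the authors' tool is not reproduced.

```python
import numpy as np

def ffhp(data):
    """Map each row (one gene's time series) to its first Fourier harmonic."""
    data = np.asarray(data, dtype=float)
    n_points = data.shape[1]
    n = np.arange(n_points)
    kernel = np.exp(-2j * np.pi * n / n_points)
    return data @ kernel             # one complex number per gene

# Toy expression matrix (made-up values): rows are genes, columns are time points.
genes = np.array([
    [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0],   # one oscillation
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0],      # constant profile
    [0.0, 2.0, 4.0, 2.0, 0.0, -2.0, -4.0, -2.0],   # same shape, doubled
])

points = ffhp(genes)
xy = np.column_stack([points.real, points.imag])   # scatter-plot coordinates
print(xy)
```

In this toy example the constant profile lands at the origin and the doubled profile lies on the same line through the origin as the first gene, twice as far out.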
Multidimensional scaling (MDS) [3] is a competing approach for visualizing multi-dimensional data. We compared FFHP to Sammon’s mapping [15], a variant of MDS that optimizes the following stress function $E$:

$$E = \frac{1}{\sum_{i}\sum_{j<i} d^{*}_{ij}} \sum_{i}\sum_{j<i} \frac{(d^{*}_{ij} - d_{ij})^{2}}{d^{*}_{ij}}, \qquad (4)$$

where $d^{*}_{ij}$ is the distance between points $i$ and $j$ in the $N$-dimensional space and $d_{ij}$ is the distance between $i$ and $j$ in the visualization.
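To make Eq. (4) concrete, the sketch below evaluates the stress of a given layout. It only computes $E$; it does not perform the iterative optimization of Sammon’s mapping. Euclidean distances are assumed, distinct input points are assumed so that no $d^{*}_{ij}$ is zero, and the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(high_dim, low_dim):
    """Stress E of Eq. (4) for a layout `low_dim` of the data `high_dim`."""
    d_star = pairwise_distances(high_dim)
    d = pairwise_distances(low_dim)
    i, j = np.tril_indices(len(d_star), k=-1)     # all pairs with j < i
    d_star, d = d_star[i, j], d[i, j]
    return ((d_star - d) ** 2 / d_star).sum() / d_star.sum()

# Made-up points that happen to lie in a plane of 3-D space.
data = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0],
                 [0.0, 4.0, 0.0], [3.0, 4.0, 0.0]])
print(sammon_stress(data, data[:, :2]))           # distance-preserving layout
print(sammon_stress(data, np.zeros((4, 2))))      # all points collapsed
```

A layout that preserves every pairwise distance has zero stress, while collapsing all points onto one location gives a stress of one.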
Figure 10 shows Sammon’s mapping of the rat kidney dataset. A comparison of Figure 10A with Figure 6A reveals extensive similarities between FFHP and Sammon’s mapping. The relative locations of individual genes are also remarkably similar, as indicated in Figure 10B and Figure 10C.
Sammon’s mapping has some drawbacks: (1) it provides a single final result and the user cannot intervene interactively during visualization, (2) the incremental addition of even a single point requires a complete repetition of the optimization procedure and possibly an extensive reorganization of all the previously mapped points to new locations, and (3) it requires time-consuming optimization procedures with time complexity $O(N^2)$ or greater.
Our results illustrate some of the strengths and weaknesses of FFHP. The visualization reflects the structure of the dataset: outliers, clusters and their relationships. Comprehending the structure of the dataset can facilitate both the choice of a clustering method and the understanding of its results. Although FFHP maps multi-dimensional data in a way that is distinctly different from Sammon’s mapping, it yields results that are consistently comparable. However, when the dataset contains a large number of patterns, it becomes difficult to separate the different patterns and the visualization may be difficult to interpret.
4 DISCUSSION
Visualization of microarray data is a challenge because of its high dimensionality. In this paper, we have explored the use of the first Fourier harmonic projection for visualizing multi-dimensional time course array datasets. Unlike the parallel coordinate and heat plot approaches, which display the information in all dimensions, FFHP uses a two dimensional point to represent each data point (a gene in this case). Our results indicate that temporal patterns are well reflected by spatial relationships: genes with similar patterns are aggregated, and the relative locations of gene groups can be predicted. Moreover, the first Fourier harmonic projection was shown to yield results similar to those from Sammon’s mapping.
Achieving a two dimensional mapping requires a trade-off: the mapping is lossy with respect to detailed dimensional information. Our approach attempts to preserve as much of the semantics of the data points as possible through their Fourier harmonic content. In particular, it characterizes the data using two descriptive measurements: the real and imaginary parts of the first discrete Fourier harmonic. A similar approach uses principal component analysis (PCA) [11]; that visualization employs a different pair of descriptive measurements, the first and second principal components.
In addition to the first Fourier harmonic projection, higher Fourier harmonics can also be used as mappings. It can be shown that for any harmonic (> 1), there exists an equivalent first harmonic of the original discrete signal whose time indices are systematically rearranged. Under certain conditions (such as temporal patterns with high frequency variations, i.e., multiple cycles), higher harmonic projections enhance substructure separation in the visualization. A detailed discussion is beyond the scope of this paper because of length constraints.
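The index rearrangement mentioned above can be illustrated for one simple case. The sketch below assumes the harmonic number k is coprime with the number of time points N, so that n → kn mod N is a permutation of the time indices; the paper does not spell out its construction, so this is one possible reading and the function names are hypothetical.

```python
import numpy as np
from math import gcd

def harmonic(x, k):
    """k-th discrete Fourier harmonic: sum_n x[n] * exp(-i*2*pi*k*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * k * n / len(x)))

def rearrange_for_harmonic(x, k):
    """Move x[n] to position (k*n mod N); assumes gcd(k, N) == 1."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    if gcd(k, N) != 1:
        raise ValueError("this sketch only handles k coprime with N")
    z = np.empty(N)
    z[(k * np.arange(N)) % N] = x
    return z

rng = np.random.default_rng(2)
x = rng.normal(size=13)              # synthetic signal, N = 13
z = rearrange_for_harmonic(x, 3)
print(harmonic(x, 3), harmonic(z, 1))
```

Under that assumption the third harmonic of the original signal and the first harmonic of the rearranged signal coincide.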
The FFHP mapping results in two-dimensional visualizations that are identical to those of radial coordinate visualization techniques, e.g., RadViz [7]. However, rather than the vector notation and the spring paradigm of RadViz, we have used complex number notation. This reformulation of the mapping provides valuable theoretical insights and allows important properties of the mapping, including its relationship to the DFT, to be easily derived. It also opens up possible extensions such as higher Fourier harmonic projections.
The first Fourier harmonic projection requires data without missing values. This requirement can readily be met because the estimation of missing values is a mature research field [17].
Gene expression data can be studied in either sample space or gene space. Here, we have reported only the visualization in gene space. In a separate report [19], we applied the first Fourier harmonic projection to the sample space and performed visualization-driven classification. Our experiments demonstrated that FFHP offers an alternative format of visualization. We believe that using FFHP alone or in combination with heat plots or parallel coordinates would give biologists additional powerful tools for analyzing and visualizing microarray datasets.
ACKNOWLEDGEMENTS
This work was supported by grants from the National Science Foundation. We also thank the anonymous reviewers for their constructive comments on the manuscript.
References
[1] Cadzow, J. A., Landingham, H. F. Signals, Systems, and Transforms. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985.
[2] Cook, C., Buja, A., Cabrera, J., and Hurley, C. Grand Tour and Projection Pursuit. Journal of Computational and Graphical Statistics, 2(3):225–250, 1995.
[3] Davison, M. L. Multidimensional Scaling. Krieger Publishing, Inc., Malabar, FL, 1992.
[4] Diggle, P. J. Time Series: A Biostatistical Introduction. Oxford University Press, Oxford OX2 6DP, 1990.
[5] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. Cluster Analysis and Display of Genome-wide Expression Patterns. Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868, December 1998.
[6] Hautaniemi, S., Yli-Harja, O., Astola, J., Kauraniemi, P., Kallioniemi, A., Wolf, M., Ruiz, J., Mousses, S., and Kallioniemi, O. Analysis and Visualization of Gene Expression Microarray Data in Human Cancer Using Self-Organizing Maps, 2003. To appear in Machine Learning: Special Issue on Methods in Functional Genomics.
[7] Hoffman, P. E., Grinstein, G. G., Marx, K., Grosse, I., and Stanley, E. DNA Visual and Analytic Data Mining. In IEEE Visualization ’97, pages 437–441, Phoenix, AZ, 1997.
[8] Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R., and Fedoroff, N. V. Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity. Proc. Natl. Acad. Sci. USA, Vol. 97(15):8409–8414, July 2000.
[9] Ideker, T., Galitski, T., and Hood, L. A New Approach to Decoding Life: Systems Biology. Annu. Rev. Genomics Hum. Genet., 2:343–372, July 2001.
[10] Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J. Jr., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P. O. The Transcriptional Program in the Response of Human Fibroblasts to Serum. Science, Vol. 283(1):83–87, January 1999.
[11] Yeung, K. Y. and Ruzzo, W. L. Principal Component Analysis for Clustering Gene Expression Data. Bioinformatics, Vol. 17(9):763–774, 2001.
[12] Kohonen, T. Self-Organizing Maps, Springer Series in Information Sciences, Vol. 30. Springer, Berlin, Heidelberg, New York, 1995.
[13] Morrison, N., editor. Introduction to Fourier Analysis. John Wiley & Sons, Inc., New York, NY, 1994.
[14] Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L., editors. The Analysis of Gene Expression Data: Methods and Software. Springer-Verlag New York, Inc., New York, NY, 2003.
[15] Sammon, J. W. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers, C-18(5):401–409, 1969.
[16] Stuart, R. O., Bush, K. T., and Nigam, S. K. Changes in Global Gene Expression Patterns During Development and Maturation of the Rat Kidney. Proc. Natl. Acad. Sci. USA, Vol. 98(10):5649–5654, May 2001.
[17] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics, Vol. 17(6):520–525, 2001.
[18] Ward, M. O. XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data. In IEEE Visualization 1994, pages 326–336, 1994.
[19] Zhang, L., Zhang, A., Ramanathan, M. et al. VizCluster and Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Databases, 13(1):79–97, January 2003.
Appendix: Proof for Propositions
Lemma 1

$$\left(\sum_{n=0}^{N-1} a_n\right)^{2} = \sum_{n=0}^{N-1} a_n^{2} + 2\sum_{k=1}^{N-1}\sum_{t=0}^{N-k-1} a_t\,a_{t+k}.$$
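Lemma 2 For any two complex numbers $z$ and $w$, (1) $\overline{z + w} = \overline{z} + \overline{w}$, (2) $\overline{z\,w} = \overline{z}\,\overline{w}$, (3) $\overline{\overline{z}} = z$.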
Lemma 3 Let $j \in \mathbb{N}$, with $j$ not a multiple of $N$. Then

$$\sum_{n=0}^{N-1} e^{-i2\pi jn/N} = \sum_{n=0}^{N-1} \cos(2\pi jn/N) = \sum_{n=0}^{N-1} \sin(2\pi jn/N) = 0.$$
Lemma 4 FFHP is homomorphic (linear): $F_1(a\,x[n] + b\,y[n]) = a\,F_1(x[n]) + b\,F_1(y[n])$.
Proposition 1 (Cancellation) If $x[n] = [a, \ldots, a]$, then $F_1(x[n]) = 0$.

Proof: From the formula in Eq. (2), we get

$$F_1(x[n]) = \sum_{n=0}^{N-1} a\,e^{-i2\pi n/N} = a\sum_{n=0}^{N-1} e^{-i2\pi n/N}.$$

By Lemma 3 with $j = 1$, $F_1(x[n]) = 0$. □
Proposition 2 (Amplitude Shifting) If $y[n] = x[n] + a$, then $F_1(y[n]) = F_1(x[n])$.

Proof: From the formula in Eq. (2), we get

$$F_1(y[n]) = \sum_{n=0}^{N-1} (x[n] + a)\,e^{-i2\pi n/N} = \sum_{n=0}^{N-1} x[n]\,e^{-i2\pi n/N} + \sum_{n=0}^{N-1} a\,e^{-i2\pi n/N} = F_1(x[n]) + 0.$$

The second summation is 0 by Proposition 1. □
Proposition 3 (Amplitude Multiplying) If $y[n] = a\,x[n]$, then $F_1(y[n]) = a\,F_1(x[n])$.

Proof: From the formula in Eq. (2), we get

$$F_1(y[n]) = \sum_{n=0}^{N-1} a\,x[n]\,e^{-i2\pi n/N} = a\sum_{n=0}^{N-1} x[n]\,e^{-i2\pi n/N} = a\,F_1(x[n]).$$

□
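Propositions 1 to 3 are easy to confirm numerically. The following sketch is only a sanity check on a randomly generated signal; the helper name `first_harmonic` and the constant `a` are arbitrary choices made for the illustration.

```python
import numpy as np

def first_harmonic(x):
    """First discrete Fourier harmonic: sum_n x[n] * exp(-i*2*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * n / len(x)))

rng = np.random.default_rng(1)
x = rng.normal(size=10)              # synthetic signal
a = 3.5                              # arbitrary constant

constant = first_harmonic(np.full(10, a))    # Proposition 1: maps to 0
shifted = first_harmonic(x + a)              # Proposition 2: unchanged
scaled = first_harmonic(a * x)               # Proposition 3: scaled by a

print(abs(constant))
print(abs(shifted - first_harmonic(x)))
print(abs(scaled - a * first_harmonic(x)))
```

All three printed differences should be zero up to floating-point rounding.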
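Proposition 4 (Transposing) If $y[n] = x[N - n - 1]$, then $F_1(y[n]) = \overline{F_1(x[n])}$, up to a rotation by one time point ($2\pi/N$).

Proof: By Lemma 2, we have

$$\overline{F_1(x[n])} = \overline{\sum_{n=0}^{N-1} x[n]\,e^{-i2\pi n/N}} = \sum_{n=0}^{N-1} \overline{x[n]}\,e^{i2\pi n/N} = F_1(\overline{x}[-n]).$$

However, when $x[n]$ is a real signal, $\overline{x[n]} = x[n]$. Then we have $\overline{F_1(x[n])} = F_1(x[-n])$, i.e., $F_1(x[n]) = \overline{F_1(x[-n])}$.

Under the periodic extension $x[n] = x[n + N]$, $x[N - n - 1] = x[-(n + 1)]$ is $x[-n]$ shifted by a single time point, so $F_1(y[n]) = e^{i2\pi/N}\,\overline{F_1(x[n])}$, which is $\overline{F_1(x[n])}$ up to a rotation of $2\pi/N$. □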
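A numerical check of the transposing property, written as a hedged sketch on a randomly generated real signal. It compares two ways of reversing the signal: the periodic reversal x[-n] (indices taken modulo N) and the plain index reversal x[N - 1 - n]. The variable names are hypothetical.

```python
import numpy as np

def first_harmonic(x):
    """First discrete Fourier harmonic: sum_n x[n] * exp(-i*2*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * n / len(x)))

rng = np.random.default_rng(4)
x = rng.normal(size=12)              # synthetic real signal
N = len(x)

periodic_reversal = x[(-np.arange(N)) % N]   # x[-n] with wrap-around
index_reversal = x[::-1]                     # x[N - 1 - n]

mirror = np.conj(first_harmonic(x))          # reflection about the real axis
one_step = np.exp(2j * np.pi / N)            # rotation by one time point

print(abs(first_harmonic(periodic_reversal) - mirror))
print(abs(first_harmonic(index_reversal) - one_step * mirror))
```

The periodic reversal reproduces the mirror image exactly, while the plain index reversal reproduces it after a further rotation of 2π/N, which is small when N is large.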
Proposition 5 (Time Shifting) If $y[n] = x[n - d]$, then $F_1(y[n]) = W_N^{d}\,F_1(x[n])$, where $W_N = e^{-i2\pi/N}$.

Proof: Assume $0 \le n < N$ and let $l = n - d$, so that $n = l + d$. When $n = 0$, $l = -d$, and when $n = N - 1$, $l = N - 1 - d$. From the formula in Eq. (2), we get

$$F_1(y[n]) = F_1(x[n - d]) = \sum_{l=-d}^{N-1-d} x[l]\,e^{-i2\pi(l+d)/N} = \sum_{l=-d}^{N-1-d} x[l]\,e^{-i2\pi l/N}\,e^{-i2\pi d/N} = W_N^{d} \sum_{l=-d}^{N-1-d} x[l]\,e^{-i2\pi l/N}.$$

However, since $e^{i2\pi n/N} = e^{i2\pi(n+N)/N}$ and $x[n] = x[n + N]$,

$$\sum_{l=-d}^{N-1-d} x[l]\,e^{-i2\pi l/N} = \sum_{l=-d}^{-1} x[l + N]\,e^{-i2\pi(l+N)/N} + \sum_{l=0}^{N-1-d} x[l]\,e^{-i2\pi l/N}.$$

Letting $t = l + N$ in the first summation and $t = l$ in the second summation, we get

$$\sum_{l=-d}^{N-1-d} x[l]\,e^{-i2\pi l/N} = \sum_{t=N-d}^{N-1} x[t]\,e^{-i2\pi t/N} + \sum_{t=0}^{N-1-d} x[t]\,e^{-i2\pi t/N} = \sum_{t=0}^{N-1} x[t]\,e^{-i2\pi t/N} = F_1(x[n]).$$

Therefore, $F_1(y[n]) = F_1(x[n])\,W_N^{d}$. □
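The key step of this proof is that a sum over the shifted window l = -d, ..., N - 1 - d equals the sum over 0, ..., N - 1 once the signal is extended periodically. The sketch below checks that step and the final identity on a random signal; N = 13 and d = 3 are arbitrary choices for the illustration.

```python
import numpy as np

N, d = 13, 3                          # arbitrary length and shift
rng = np.random.default_rng(5)
x = rng.normal(size=N)                # synthetic signal
n = np.arange(N)
W = np.exp(-2j * np.pi / N)           # W_N as it appears in the derivation

F1_x = np.sum(x * W ** n)

# Sum over the shifted window, reading x[l] through its periodic extension.
l = np.arange(-d, N - d)
window_sum = np.sum(x[l % N] * W ** l)

# Full statement: y[n] = x[n - d] (periodic) has F1(y) = W^d * F1(x).
y = x[(n - d) % N]
F1_y = np.sum(y * W ** n)

print(abs(window_sum - F1_x))
print(abs(F1_y - W ** d * F1_x))
```

Both differences should vanish up to rounding, which confirms that the shift only multiplies the projected point by the unit-modulus factor W_N^d.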
Definition 1 The mean of a signal $x[n]$ is defined as $\bar{x} = \sum_{n=0}^{N-1} x[n]/N$. The $k$-th sample autocovariance coefficient of a signal $x[n]$ is defined as $g_k = \sum_{n=0}^{N-1-k} (x[n] - \bar{x})(x[n + k] - \bar{x})/N$. $g_0$ is called the variance of $x[n]$. The $k$-th sample autocorrelation coefficient is defined as $r_k = g_k/g_0$.
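Proposition 6 (General Distance) Let $w[n] = x[n] - y[n]$ be the difference between $x[n]$ and $y[n]$. The squared distance between $F_1(x[n])$ and $F_1(y[n])$ is

$$\|F_1(w[n])\|^{2} = g_0 N \left(1 + 2\sum_{k=1}^{N-1} r_k \cos(2\pi k/N)\right),$$

where $g_0$ and $r_k$ are the variance and the autocorrelation coefficients of $w[n]$.

Proof: By Lemma 4, the distance between $F_1(y[n])$ and $F_1(x[n])$ is $\|F_1(w[n])\|$. From Eq. (2), we get

$$\|F_1(w[n])\| = \left\|\sum_{n=0}^{N-1} w[n]\,e^{-i2\pi n/N}\right\| = \left\|\sum_{n=0}^{N-1} \big(w[n]\cos(2\pi n/N) - i\,w[n]\sin(2\pi n/N)\big)\right\|.$$

Let $\omega = 2\pi/N$. By Lemma 3, we have $\sum_{n=0}^{N-1} \cos(n\omega) = \sum_{n=0}^{N-1} \sin(n\omega) = 0$, so subtracting the mean $\bar{w}$ of $w[n]$ from every term leaves both sums unchanged:

$$\|F_1(w[n])\|^{2} = \left(\sum_{n=0}^{N-1} w[n]\cos(n\omega)\right)^{2} + \left(\sum_{n=0}^{N-1} w[n]\sin(n\omega)\right)^{2} = \left(\sum_{n=0}^{N-1} (w[n] - \bar{w})\cos(n\omega)\right)^{2} + \left(\sum_{n=0}^{N-1} (w[n] - \bar{w})\sin(n\omega)\right)^{2}.$$

Expanding each squared term by Lemma 1, we get

$$\sum_{n=0}^{N-1} (w[n] - \bar{w})^{2}\big(\cos^{2}(n\omega) + \sin^{2}(n\omega)\big) + 2\sum_{k=1}^{N-1}\sum_{t=0}^{N-1-k} \big[(w[t] - \bar{w})(w[t + k] - \bar{w})\,\Omega\big],$$

where $\Omega = \cos(t\omega)\cos((t + k)\omega) + \sin(t\omega)\sin((t + k)\omega)$. By the trigonometric identity $\cos\theta\cos\phi + \sin\theta\sin\phi = \cos(\phi - \theta)$, we have $\Omega = \cos(k\omega)$. Now

$$\|F_1(w[n])\|^{2} = \sum_{n=0}^{N-1} (w[n] - \bar{w})^{2} + 2\sum_{k=1}^{N-1}\sum_{t=0}^{N-1-k} \big[(w[t] - \bar{w})(w[t + k] - \bar{w})\cos(k\omega)\big] = N\left(g_0 + 2\sum_{k=1}^{N-1} g_k\cos(k\omega)\right) = g_0 N\left(1 + 2\sum_{k=1}^{N-1} r_k\cos(2\pi k/N)\right).$$

□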
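The distance formula of Proposition 6 can be compared with a direct computation. The sketch below uses two randomly generated signals and the sample autocovariance of Definition 1 applied to their difference w[n]; the function names are hypothetical, and N = 13 is an arbitrary choice.

```python
import numpy as np

def first_harmonic(x):
    """First discrete Fourier harmonic: sum_n x[n] * exp(-i*2*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * n / len(x)))

def autocovariance(w, k):
    """k-th sample autocovariance coefficient g_k of Definition 1."""
    w = np.asarray(w, dtype=float)
    centered = w - w.mean()
    return np.sum(centered[: len(w) - k] * centered[k:]) / len(w)

N = 13
rng = np.random.default_rng(6)
x = rng.normal(size=N)               # synthetic signals
y = rng.normal(size=N)
w = x - y

# Left-hand side: squared distance between the two projected points.
squared_distance = abs(first_harmonic(x) - first_harmonic(y)) ** 2

# Right-hand side: g0 * N * (1 + 2 * sum_k r_k * cos(2*pi*k/N)).
g0 = autocovariance(w, 0)
r = [autocovariance(w, k) / g0 for k in range(1, N)]
formula = g0 * N * (1 + 2 * sum(r_k * np.cos(2 * np.pi * k / N)
                                for k, r_k in enumerate(r, start=1)))

print(squared_distance, formula)
```

The two printed values should agree up to rounding, so the separation of two genes in the plot depends only on the variance and autocorrelation structure of the difference of their profiles.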