CONTENTS Page
Table of Contents i
Abstract ii
1. INTRODUCTION 1
1.1 Multivariate Biomedical Data 1
1.2 Biomedical Cancer Genomic Data 3
1.3 Microarray and Gene Expression Levels 4
1.4 Data Under Study 6
1.4.1 Leukemia Cancer Gene Expression Data Set 8
1.5 Objectives of the Project 10
1.6 Summary Statistics for Multivariate Data Set 12
1.7 Relative Variance Covariance Matrix 13
2. EXPLORATORY DATA ANALYSIS 17
2.1 Histograms 17
2.2 Box and Whiskers Plots 18
2.3 Transformations 19
2.3.1 Box-Cox Transformation 20
2.4 Exploratory Data Analysis: A Graphical View 21
2.5 Probability Plots 32
2.6 Fitting a Probability Distribution 35
2.7 Extreme Value Distributions 36
2.7.1 Extreme Value Distribution Argument 37
2.7.2 Generalized Extreme Value Distributions 38
2.7.3 Gumbel Distribution 40
2.7.4 Fréchet Distribution 42
2.7.5 Weibull Distribution 43
2.8 Goodness of Fit Test 45
2.9 Probability Difference Plots 56
2.10 Correlation Structure of the Data Matrix 60
2.10.1 Interpreting Coefficient of Correlation 61
3. PRINCIPAL COMPONENT ANALYSIS AND ITS USE IN CLUSTERING TISSUE SAMPLES 68
3.1 Clustering Gene Expression Data 69
3.2 Cluster Analysis: A Comparison 71
3.3 Principal Component Analysis and Clustering 74
3.4 Principal Component Analysis 76
3.5 Principal Components 77
3.5.1 Principal Components Using Variance Covariance Matrix 77
3.5.2 Principal Components Using Correlation Matrix 79
3.5.3 Principal Components Using Relative Variance Matrix 81
3.6 Principal Components: Relative Variance Matrix versus Correlation Matrix 82
3.7 Principal Component Loadings 83
3.8 PCA Clustering: A Literature Review 84
3.9 Kaiser’s Criterion for Retaining Principal Components 86
3.10 PCA in Graphical Representation 88
3.10.1 The PC Plots 88
4. SCREENING AND CLUSTERING OF GENES 103
4.1 Gene Clustering 104
4.2 Screening of Genes 106
4.3 Role of Minimum Threshold Value ‘20’ 110
4.4 Garcia’s Criterion of Relative Variance 118
4.5 The High Variant Cluster of Genes 122
4.6 Discriminant Analysis 141
4.7 Discriminating the High Variant Gene Group 145
5. DISCUSSIONS, CONCLUSIONS WITH FUTURE RECOMMENDATIONS 150
5.1 Main Issues in Genomic Data Set 150
5.2 Addressing the Issues 151
5.3 Recommendations 155
APPENDIX 157
REFERENCES 164