8/21/2019 Data Mining Text Book
1/658
Data Mining and Analysis:
Fundamental Concepts and Algorithms
Mohammed J. Zaki
Wagner Meira Jr.
CONTENTS i
Contents
Preface 1
1 Data Mining and Analysis . . . 4
1.1 Data Matrix . . . 4
1.2 Attributes . . . 6
1.3 Data: Algebraic and Geometric View . . . 7
1.3.1 Distance and Angle . . . 9
1.3.2 Mean and Total Variance . . . 13
1.3.3 Orthogonal Projection . . . 14
1.3.4 Linear Independence and Dimensionality . . . 15
1.4 Data: Probabilistic View . . . 17
1.4.1 Bivariate Random Variables . . . 24
1.4.2 Multivariate Random Variable . . . 28
1.4.3 Random Sample and Statistics . . . 29
1.5 Data Mining . . . 31
1.5.1 Exploratory Data Analysis . . . 31
1.5.2 Frequent Pattern Mining . . . 33
1.5.3 Clustering . . . 33
1.5.4 Classification . . . 34
1.6 Further Reading . . . 35
1.7 Exercises . . . 36
I Data Analysis Foundations 37
2 Numeric Attributes . . . 38
2.1 Univariate Analysis . . . 38
2.1.1 Measures of Central Tendency . . . 39
2.1.2 Measures of Dispersion . . . 43
2.2 Bivariate Analysis . . . 48
2.2.1 Measures of Location and Dispersion . . . 49
2.2.2 Measures of Association . . . 50
2.3 Multivariate Analysis . . . 54
2.4 Data Normalization . . . 59
2.5 Normal Distribution . . . 61
2.5.1 Univariate Normal Distribution . . . 61
2.5.2 Multivariate Normal Distribution . . . 63
2.6 Further Reading . . . 68
2.7 Exercises . . . 68
3 Categorical Attributes . . . 71
3.1 Univariate Analysis . . . 71
3.1.1 Bernoulli Variable . . . 71
3.1.2 Multivariate Bernoulli Variable . . . 74
3.2 Bivariate Analysis . . . 81
3.2.1 Attribute Dependence: Contingency Analysis . . . 88
3.3 Multivariate Analysis . . . 93
3.3.1 Multi-way Contingency Analysis . . . 95
3.4 Distance and Angle . . . 98
3.5 Discretization . . . 100
3.6 Further Reading . . . 102
3.7 Exercises . . . 103
4 Graph Data . . . 105
4.1 Graph Concepts . . . 105
4.2 Topological Attributes . . . 110
4.3 Centrality Analysis . . . 115
4.3.1 Basic Centralities . . . 115
4.3.2 Web Centralities . . . 117
4.4 Graph Models . . . 126
4.4.1 Erdős-Rényi Random Graph Model . . . 129
4.4.2 Watts-Strogatz Small-world Graph Model . . . 133
4.4.3 Barabási-Albert Scale-free Model . . . 139
4.5 Further Reading . . . 147
4.6 Exercises . . . 148
5 Kernel Methods . . . 150
5.1 Kernel Matrix . . . 155
5.1.1 Reproducing Kernel Map . . . 156
5.1.2 Mercer Kernel Map . . . 158
5.2 Vector Kernels . . . 161
5.3 Basic Kernel Operations in Feature Space . . . 166
5.4 Kernels for Complex Objects . . . 173
5.4.1 Spectrum Kernel for Strings . . . 173
5.4.2 Diffusion Kernels on Graph Nodes . . . 175
5.5 Further Reading . . . 180
5.6 Exercises . . . 180

6 High-Dimensional Data . . . 182
6.1 High-Dimensional Objects . . . 182
6.2 High-Dimensional Volumes . . . 184
6.3 Hypersphere Inscribed within Hypercube . . . 187
6.4 Volume of Thin Hypersphere Shell . . . 189
6.5 Diagonals in Hyperspace . . . 190
6.6 Density of the Multivariate Normal . . . 191
6.7 Appendix: Derivation of Hypersphere Volume . . . 195
6.8 Further Reading . . . 200
6.9 Exercises . . . 200
7 Dimensionality Reduction . . . 204
7.1 Background . . . 204
7.2 Principal Component Analysis . . . 209
7.2.1 Best Line Approximation . . . 209
7.2.2 Best Two-dimensional Approximation . . . 213
7.2.3 Best r-dimensional Approximation . . . 217
7.2.4 Geometry of PCA . . . 222
7.3 Kernel Principal Component Analysis (Kernel PCA) . . . 225
7.4 Singular Value Decomposition . . . 233
7.4.1 Geometry of SVD . . . 234
7.4.2 Connection between SVD and PCA . . . 235
7.5 Further Reading . . . 237
7.6 Exercises . . . 238
II Frequent Pattern Mining 240
8 Itemset Mining . . . 241
8.1 Frequent Itemsets and Association Rules . . . 241
8.2 Itemset Mining Algorithms . . . 245
8.2.1 Level-Wise Approach: Apriori Algorithm . . . 247
8.2.2 Tidset Intersection Approach: Eclat Algorithm . . . 250
8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm . . . 256
8.3 Generating Association Rules . . . 260
8.4 Further Reading . . . 263
8.5 Exercises . . . 263
9 Summarizing Itemsets . . . 269
9.1 Maximal and Closed Frequent Itemsets . . . 269
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm . . . 273
9.3 Mining Closed Frequent Itemsets: Charm Algorithm . . . 275
9.4 Non-Derivable Itemsets . . . 278
9.5 Further Reading . . . 284
9.6 Exercises . . . 285
10 Sequence Mining . . . 289
10.1 Frequent Sequences . . . 289
10.2 Mining Frequent Sequences . . . 290
10.2.1 Level-Wise Mining: GSP . . . 292
10.2.2 Vertical Sequence Mining: SPADE . . . 293
10.2.3 Projection-Based Sequence Mining: PrefixSpan . . . 296
10.3 Substring Mining via Suffix Trees . . . 298
10.3.1 Suffix Tree . . . 298
10.3.2 Ukkonen's Linear Time Algorithm . . . 301
10.4 Further Reading . . . 309
10.5 Exercises . . . 309
11 Graph Pattern Mining . . . 314
11.1 Isomorphism and Support . . . 314
11.2 Candidate Generation . . . 318
11.2.1 Canonical Code . . . 320
11.3 The gSpan Algorithm . . . 323
11.3.1 Extension and Support Computation . . . 326
11.3.2 Canonicality Checking . . . 330
11.4 Further Reading . . . 331
11.5 Exercises . . . 333
12 Pattern and Rule Assessment . . . 337
12.1 Rule and Pattern Assessment Measures . . . 337
12.1.1 Rule Assessment Measures . . . 338
12.1.2 Pattern Assessment Measures . . . 346
12.1.3 Comparing Multiple Rules and Patterns . . . 349
12.2 Significance Testing and Confidence Intervals . . . 354
12.2.1 Fisher Exact Test for Productive Rules . . . 354
12.2.2 Permutation Test for Significance . . . 359
12.2.3 Bootstrap Sampling for Confidence Interval . . . 364
12.3 Further Reading . . . 367
12.4 Exercises . . . 368
III Clustering 370
13 Representative-based Clustering . . . 371
13.1 K-means Algorithm . . . 372
13.2 Kernel K-means . . . 375
13.3 Expectation Maximization (EM) Clustering . . . 381
13.3.1 EM in One Dimension . . . 383
13.3.2 EM in d Dimensions . . . 386
13.3.3 Maximum Likelihood Estimation . . . 393
13.3.4 Expectation-Maximization Approach . . . 397
13.4 Further Reading . . . 400
13.5 Exercises . . . 401
14 Hierarchical Clustering . . . 404
14.1 Preliminaries . . . 404
14.2 Agglomerative Hierarchical Clustering . . . 407
14.2.1 Distance between Clusters . . . 407
14.2.2 Updating Distance Matrix . . . 411
14.2.3 Computational Complexity . . . 413
14.3 Further Reading . . . 413
14.4 Exercises and Projects . . . 414
15 Density-based Clustering . . . 417
15.1 The DBSCAN Algorithm . . . 418
15.2 Kernel Density Estimation . . . 421
15.2.1 Univariate Density Estimation . . . 422
15.2.2 Multivariate Density Estimation . . . 424
15.2.3 Nearest Neighbor Density Estimation . . . 427
15.3 Density-based Clustering: DENCLUE . . . 428
15.4 Further Reading . . . 434
15.5 Exercises . . . 434
16 Spectral and Graph Clustering . . . 438
16.1 Graphs and Matrices . . . 438
16.2 Clustering as Graph Cuts . . . 446
16.2.1 Clustering Objective Functions: Ratio and Normalized Cut . . . 448
16.2.2 Spectral Clustering Algorithm . . . 451
16.2.3 Maximization Objectives: Average Cut and Modularity . . . 455
16.3 Markov Clustering . . . 463
16.4 Further Reading . . . 470
16.5 Exercises . . . 471
17 Clustering Validation . . . 473
17.1 External Measures . . . 474
17.1.1 Matching Based Measures . . . 474
17.1.2 Entropy Based Measures . . . 479
17.1.3 Pair-wise Measures . . . 482
17.1.4 Correlation Measures . . . 486
17.2 Internal Measures . . . 489
17.3 Relative Measures . . . 498
17.3.1 Cluster Stability . . . 505
17.3.2 Clustering Tendency . . . 508
17.4 Further Reading . . . 513
17.5 Exercises . . . 514
IV Classification 516
18 Probabilistic Classification . . . 517
18.1 Bayes Classifier . . . 517
18.1.1 Estimating the Prior Probability . . . 518
18.1.2 Estimating the Likelihood . . . 518
18.2 Naive Bayes Classifier . . . 524
18.3 Further Reading . . . 528
18.4 Exercises . . . 528
19 Decision Tree Classifier . . . 530
19.1 Decision Trees . . . 532
19.2 Decision Tree Algorithm . . . 535
19.2.1 Split-point Evaluation Measures . . . 536
19.2.2 Evaluating Split-points . . . 537
19.2.3 Computational Complexity . . . 545
19.3 Further Reading . . . 546
19.4 Exercises . . . 547
20 Linear Discriminant Analysis . . . 549
20.1 Optimal Linear Discriminant . . . 549
20.2 Kernel Discriminant Analysis . . . 556
20.3 Further Reading . . . 564
20.4 Exercises . . . 564
21 Support Vector Machines . . . 566
21.1 Linear Discriminants and Margins . . . 566
21.2 SVM: Linear and Separable Case . . . 572
21.3 Soft Margin SVM: Linear and Non-Separable Case . . . 577
21.3.1 Hinge Loss . . . 578
21.3.2 Quadratic Loss . . . 582
21.4 Kernel SVM: Nonlinear Case . . . 583
21.5 SVM Training Algorithms . . . 588
21.5.1 Dual Solution: Stochastic Gradient Ascent . . . 588
21.5.2 Primal Solution: Newton Optimization . . . 593
22 Classification Assessment . . . 602
22.1 Classification Performance Measures . . . 602
22.1.1 Contingency Table Based Measures . . . 604
22.1.2 Binary Classification: Positive and Negative Class . . . 607
22.1.3 ROC Analysis . . . 611
22.2 Classifier Evaluation . . . 616
22.2.1 K-fold Cross-Validation . . . 617
22.2.2 Bootstrap Resampling . . . 618
22.2.3 Confidence Intervals . . . 620
22.2.4 Comparing Classifiers: Paired t-Test . . . 625
22.3 Bias-Variance Decomposition . . . 627
22.3.1 Ensemble Classifiers . . . 632
22.4 Further Reading . . . 638
22.5 Exercises . . . 639
Index 641
Preface
This book is an outgrowth of data mining courses at RPI and UFMG; the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. While there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text that focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book include exploratory data analysis, frequent pattern mining, clustering and classification. The book lays the basic foundations of these tasks, and it also covers cutting-edge topics like kernel methods, high-dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines like machine learning and statistics, and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.

The book includes many examples to illustrate the main technical concepts. It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that the reader use their favorite data analysis and mining software to work through our examples, and to implement the algorithms we describe in the text; we recommend the R software, or the Python language with its NumPy package. The datasets used and other supplementary material like project ideas, slides, and so on, are available online at the book's companion site and its mirrors at RPI and UFMG:
http://dataminingbook.info
http://www.cs.rpi.edu/~zaki/dataminingbook
http://www.dcc.ufmg.br/dataminingbook

Having understood the basic principles and algorithms in data mining and data analysis, readers will be well equipped to develop their own methods or use more advanced techniques.
Suggested Roadmaps
The chapter dependency graph is shown in Figure 1. We suggest some typical roadmaps for courses and readings based on this book. For an undergraduate-level course, we suggest the following chapters: 1-3, 8, 10, 12-15, 17-19, and 21-22. For an undergraduate course without exploratory data analysis, we recommend Chapters 1, 8-15, 17-19, and 21-22. For a graduate course, one possibility is to quickly go over the material in Part I, or to assume it as background reading and to directly cover Chapters 9-22; the other parts of the book, namely frequent pattern mining (Part II), clustering (Part III), and classification (Part IV), can be covered in any order. For a course on data analysis the chapters must include 1-7, 13-14, 15 (Section 2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5, 7 (Sections 1-3), 11-12, 13 (Sections 1-2), and 16-17, 20-22.
Figure 1: Chapter Dependencies
Acknowledgments
Initial drafts of this book have been used in many data mining courses. We received many valuable comments and corrections from both the faculty and students. Our thanks go to:
Muhammad Abulaish, Jamia Millia Islamia, India
Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
Ayhan Demiriz, Sakarya University, Turkey
Murat Dundar, Indiana University Purdue University at Indianapolis
Jun Luke Huan, University of Kansas
Ruoming Jin, Kent State University
Latifur Khan, University of Texas, Dallas
Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
Suat Ozdemir, Gazi University, Turkey
Naren Ramakrishnan, Virginia Polytechnic and State University
Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei, Brazil
Saeed Salem, North Dakota State University
Ankur Teredesai, University of Washington, Tacoma
Hannu Toivonen, University of Helsinki, Finland
Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
Jason T.L. Wang, New Jersey Institute of Technology
Jianyong Wang, Tsinghua University, China
Jiong Yang, Case Western Reserve University
Jieping Ye, Arizona State University

We would like to thank all the students enrolled in our data mining courses at RPI and UFMG, and also the anonymous reviewers who provided technical comments on various chapters. In addition, we thank CNPq, CAPES, FAPEMIG, InWeb (the National Institute of Science and Technology for the Web), and Brazil's Science without Borders program for their support. We thank Lauren Cowles, our editor at Cambridge University Press, for her guidance and patience in realizing this book.
Finally, on a more personal front, MJZ would like to dedicate the book to Amina, Abrar, Afsah, and his parents, and WMJ would like to dedicate the book to Patricia, Gabriel, Marina and his parents, Wagner and Marlene. This book would not have been possible without their patience and support.
Mohammed J. Zaki, Troy
Wagner Meira, Jr., Belo Horizonte
Summer 2013
CHAPTER 1. DATA MINING AND ANALYSIS 4
Chapter 1
Data Mining and Analysis
Data mining is the process of discovering insightful, interesting, and novel patterns,
as well as descriptive, understandable and predictive models from large-scale data. We begin this chapter by looking at basic properties of data modeled as a data matrix. We emphasize the geometric and algebraic views, as well as the probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering and classification, laying out the road-map for the book.
1.1 Data Matrix
Data can often be represented or abstracted as an $n \times d$ data matrix, with $n$ rows and $d$ columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The $n \times d$ data matrix is given as
$$
\mathbf{D} =
\begin{array}{c|cccc}
 & X_1 & X_2 & \cdots & X_d \\
\hline
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array}
$$

where $\mathbf{x}_i$ denotes the $i$-th row, which is a $d$-tuple given as

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$$
and where Xj denotes the j-th column, which is an n-tuple given as
$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$

Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples, and
so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances $n$ is referred to as the size of the data, whereas the number of attributes $d$ is called the dimensionality of the data. The analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous analysis of two attributes is called bivariate analysis, and the simultaneous analysis of more than two attributes is called multivariate analysis.
             sepal    sepal    petal    petal
             length   width    length   width    class
             X1       X2       X3       X4       X5
    x1       5.9      3.0      4.2      1.5      Iris-versicolor
    x2       6.9      3.1      4.9      1.5      Iris-versicolor
    x3       6.6      2.9      4.6      1.3      Iris-versicolor
    x4       4.6      3.2      1.4      0.2      Iris-setosa
    x5       6.0      2.2      4.0      1.0      Iris-versicolor
    x6       4.7      3.2      1.3      0.2      Iris-setosa
    x7       6.5      3.0      5.8      2.2      Iris-virginica
    x8       5.8      2.7      5.1      1.9      Iris-virginica
    ...      ...      ...      ...      ...      ...
    x149     7.7      3.8      6.7      2.2      Iris-virginica
    x150     5.1      3.4      1.5      0.2      Iris-setosa
Table 1.1: Extract from the Iris Dataset
Example 1.1: Table 1.1 shows an extract of the Iris dataset; the complete data forms a $150 \times 5$ data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple

$$\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \texttt{Iris-versicolor})$$
Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences (e.g., DNA, proteins), text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. For example, given a database of images, we can create a data matrix where rows represent images and columns correspond to image features like color, texture, and so on. Sometimes, certain attributes may have special semantics associated with them requiring special treatment. For
instance, temporal or spatial attributes are often treated differently. It is also worth noting that traditional data analysis assumes that each entity or instance is independent. However, given the interconnected nature of the world we live in, this assumption may not always hold. Instances may be connected to other instances via various kinds of relationships, giving rise to a data graph, where a node represents an entity and an edge represents the relationship between two entities.
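A minimal sketch of the feature-extraction idea described above, using hypothetical grayscale "images" of different sizes: a fixed-length, normalized intensity histogram per image turns irregularly shaped raw data into the rows of an $n \times d$ data matrix (the 4-bin histogram is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw data: three tiny grayscale "images" of different sizes.
images = [rng.integers(0, 256, size=(8, 8)),
          rng.integers(0, 256, size=(16, 16)),
          rng.integers(0, 256, size=(4, 4))]

def histogram_features(img, bins=4):
    # A bins-length intensity histogram gives every image the same d attributes.
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    return hist / hist.sum()   # normalize so images of any size are comparable

# Stack one feature vector per image: an n x d data matrix (here 3 x 4).
D = np.vstack([histogram_features(img) for img in images])
```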
1.2 Attributes
Attributes may be classified into two main types depending on their domain, that is, depending on the types of values they take on.
Numeric Attributes A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age, with domain(Age) = $\mathbb{N}$, where $\mathbb{N}$ denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with domain(petal length) = $\mathbb{R}^+$ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set $\{0, 1\}$, it is called a binary attribute. Numeric attributes can be further classified into two types:
Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, the attribute temperature, measured in °C or °F, is interval-scaled. If it is 20 °C on one day and 10 °C on the following day, it is meaningful to talk about a temperature drop of 10 °C, but it is not meaningful to say that it is twice as cold as the previous day.

Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for the attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.
Categorical Attributes A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex and Education could be categorical attributes with their domains given as

domain(Sex) = {M, F}
domain(Education) = {HighSchool, BS, MS, PhD}

Categorical attributes may be of two types:
Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example,
Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with domain(class) = {Iris-setosa, Iris-versicolor, Iris-virginica}.
Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another) and inequality comparisons (is one value less than or greater than another) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute, since its domain values are ordered by increasing educational qualification.
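The nominal versus ordinal distinction can be made concrete with a small Python sketch, using the hypothetical Sex and Education domains above: an ordinal attribute admits a rank mapping that supports inequality comparisons, while a nominal attribute supports only equality:

```python
# Ordinal: domain values carry an order, encoded here as integer ranks.
EDUCATION_RANK = {"HighSchool": 0, "BS": 1, "MS": 2, "PhD": 3}

# Nominal: domain values are an unordered set of symbols.
SEX_DOMAIN = {"M", "F"}

def edu_less_than(a, b):
    # Inequality comparison is meaningful for an ordinal attribute.
    return EDUCATION_RANK[a] < EDUCATION_RANK[b]

def sex_equal(a, b):
    # Only equality comparison is meaningful for a nominal attribute.
    return a == b

print(edu_less_than("BS", "PhD"))  # True
print(sex_equal("M", "F"))         # False
```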
1.3 Data: Algebraic and Geometric View
If the $d$ attributes or dimensions in the data matrix $\mathbf{D}$ are all numeric, then each row can be considered as a $d$-dimensional point

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$$
or equivalently, each row may be considered as a $d$-dimensional column vector (all vectors are assumed to be column vectors by default)

$$
\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix}
= \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d
$$

where $^T$ is the matrix transpose operator.
The $d$-dimensional Cartesian coordinate space is specified via the $d$ unit vectors, called the standard basis vectors, along each of the axes. The $j$-th standard basis vector $\mathbf{e}_j$ is the $d$-dimensional unit vector whose $j$-th component is 1 and the rest of the components are 0:

$$\mathbf{e}_j = (0, \ldots, 1_j, \ldots, 0)^T$$

Any other vector in $\mathbb{R}^d$ can be written as a linear combination of the standard basis vectors. For example, each of the points $\mathbf{x}_i$ can be written as the linear combination

$$\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\,\mathbf{e}_j$$

where the scalar value $x_{ij}$ is the coordinate value along the $j$-th axis or attribute.
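This decomposition can be checked numerically; a minimal sketch with NumPy, where the rows of the identity matrix serve as the standard basis vectors:

```python
import numpy as np

x1 = np.array([5.9, 3.0, 4.2])
d = len(x1)
e = np.eye(d)  # rows e[0], ..., e[d-1] are the standard basis vectors

# x_i = sum_j x_ij * e_j: recombine the coordinates with the basis vectors.
recombined = sum(x1[j] * e[j] for j in range(d))
```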
Figure 1.1: Row $\mathbf{x}_1$ as a point and vector in (a) $\mathbb{R}^2$ and (b) $\mathbb{R}^3$
Figure 1.2: Scatter plot of sepal length versus sepal width. The solid circle shows the mean point.
Example 1.2: Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple $\mathbf{x}_1 = (5.9, 3.0, 4.2, 1.5, \texttt{Iris-versicolor})$ on the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatter plot of all the $n = 150$ points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows $\mathbf{x}_1$ as a point and vector in 3-dimensional space, by projecting the data onto the first three attributes. The point $(5.9, 3.0, 4.2)$ can be seen as specifying the coefficients in the linear combination of the standard basis vectors in $\mathbb{R}^3$:

$$
\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3
= 5.9\begin{pmatrix}1\\0\\0\end{pmatrix}
+ 3.0\begin{pmatrix}0\\1\\0\end{pmatrix}
+ 4.2\begin{pmatrix}0\\0\\1\end{pmatrix}
= \begin{pmatrix}5.9\\3.0\\4.2\end{pmatrix}
$$
Each numeric column or attribute can also be treated as a vector in an $n$-dimensional space $\mathbb{R}^n$:

$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
If all attributes are numeric, then the data matrix $\mathbf{D}$ is in fact an $n \times d$ matrix, also written as $\mathbf{D} \in \mathbb{R}^{n \times d}$, given as

$$
\mathbf{D} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
= \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
= \begin{pmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_d \\ | & | & & | \end{pmatrix}
$$

As we can see, we can consider the entire dataset as an $n \times d$ matrix, or equivalently as a set of $n$ row vectors $\mathbf{x}_i^T \in \mathbb{R}^d$ or as a set of $d$ column vectors $X_j \in \mathbb{R}^n$.
1.3.1 Distance and Angle
Treating data instances and attributes as vectors, and the entire dataset as a matrix,enables one to apply both geometric and algebraic methods to aid in the data miningand analysis tasks.
Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two $m$-dimensional vectors given as
$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$
Dot Product  The dot product between $\mathbf{a}$ and $\mathbf{b}$ is defined as the scalar value
$$\mathbf{a}^T\mathbf{b} = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i$$
Length  The Euclidean norm or length of a vector $\mathbf{a} \in \mathbb{R}^m$ is defined as
$$\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$
The unit vector in the direction of $\mathbf{a}$ is given as
$$\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}$$
By definition $\mathbf{u}$ has length $\|\mathbf{u}\| = 1$, and it is also called a normalized vector, which can be used in lieu of $\mathbf{a}$ in some analysis tasks.
The Euclidean norm is a special case of a general class of norms, known as the $L_p$-norm, defined as
$$\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m} |a_i|^p\right)^{\frac{1}{p}}$$
for any $p \neq 0$. Thus, the Euclidean norm corresponds to the case when $p = 2$.
Distance  From the Euclidean norm we can define the Euclidean distance between $\mathbf{a}$ and $\mathbf{b}$ as follows:
$$\delta(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\| = \sqrt{(\mathbf{a} - \mathbf{b})^T(\mathbf{a} - \mathbf{b})} = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2} \quad (1.1)$$
Thus, the length of a vector is simply its distance from the zero vector $\mathbf{0}$, all of whose elements are 0, that is, $\|\mathbf{a}\| = \|\mathbf{a} - \mathbf{0}\| = \delta(\mathbf{a}, \mathbf{0})$.
From the general $L_p$-norm we can define the corresponding $L_p$-distance function, given as follows:
$$\delta_p(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_p \quad (1.2)$$
Angle  The cosine of the smallest angle between vectors $\mathbf{a}$ and $\mathbf{b}$, also called the cosine similarity, is given as
$$\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T \left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \quad (1.3)$$
Thus, the cosine of the angle between $\mathbf{a}$ and $\mathbf{b}$ is given as the dot product of the unit vectors $\frac{\mathbf{a}}{\|\mathbf{a}\|}$ and $\frac{\mathbf{b}}{\|\mathbf{b}\|}$.
The Cauchy-Schwarz inequality states that for any vectors $\mathbf{a}$ and $\mathbf{b}$ in $\mathbb{R}^m$
$$|\mathbf{a}^T\mathbf{b}| \leq \|\mathbf{a}\|\,\|\mathbf{b}\|$$
It follows immediately from the Cauchy-Schwarz inequality that
$$-1 \leq \cos\theta \leq 1$$
Since the smallest angle $\theta \in [0^\circ, 180^\circ]$ and since $\cos\theta \in [-1, 1]$, the cosine similarity value ranges from $+1$, corresponding to an angle of $0^\circ$, to $-1$, corresponding to an angle of $180^\circ$ (or $\pi$ radians).
Orthogonality  Two vectors $\mathbf{a}$ and $\mathbf{b}$ are said to be orthogonal if and only if $\mathbf{a}^T\mathbf{b} = 0$, which in turn implies that $\cos\theta = 0$, that is, the angle between them is $90^\circ$ or $\frac{\pi}{2}$ radians. In this case, we say that they have no similarity.
Figure 1.3: Distance and angle. Unit vectors are shown in gray.
Example 1.3 (Distance and Angle): Figure 1.3 shows the two vectors
$$\mathbf{a} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$
Using Eq. (1.1), the Euclidean distance between them is given as
$$\delta(\mathbf{a}, \mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16 + 1} = \sqrt{17} = 4.12$$
The distance can also be computed as the magnitude of the vector
$$\mathbf{a} - \mathbf{b} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 4 \\ -1 \end{pmatrix}$$
since $\|\mathbf{a} - \mathbf{b}\| = \sqrt{4^2 + (-1)^2} = \sqrt{17} = 4.12$.
The unit vector in the direction of $\mathbf{a}$ is given as
$$\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2 + 3^2}} \begin{pmatrix} 5 \\ 3 \end{pmatrix} = \frac{1}{\sqrt{34}} \begin{pmatrix} 5 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.86 \\ 0.51 \end{pmatrix}$$
The unit vector in the direction of $\mathbf{b}$ can be computed similarly:
$$\mathbf{u}_b = \begin{pmatrix} 0.24 \\ 0.97 \end{pmatrix}$$
These unit vectors are also shown in gray in Figure 1.3.
By Eq. (1.3) the cosine of the angle between $\mathbf{a}$ and $\mathbf{b}$ is given as
$$\cos\theta = \frac{\begin{pmatrix} 5 \\ 3 \end{pmatrix}^T \begin{pmatrix} 1 \\ 4 \end{pmatrix}}{\sqrt{5^2 + 3^2}\,\sqrt{1^2 + 4^2}} = \frac{17}{\sqrt{34}\,\sqrt{17}} = \frac{1}{\sqrt{2}}$$
We can get the angle by computing the inverse of the cosine:
$$\theta = \cos^{-1}\left(1/\sqrt{2}\right) = 45^\circ$$
Let us consider the $L_p$-norm for $\mathbf{a}$ with $p = 3$; we get
$$\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$
The distance between $\mathbf{a}$ and $\mathbf{b}$ using Eq. (1.2) for the $L_p$-norm with $p = 3$ is given as
$$\|\mathbf{a} - \mathbf{b}\|_3 = \left\|(4, -1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
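These computations are easy to check numerically. A minimal sketch with NumPy (the vectors are the ones from Example 1.3):

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)  # Euclidean (L2) distance, Eq. (1.1)
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # Eq. (1.3)
theta_deg = np.degrees(np.arccos(cos_theta))
l3_norm = np.linalg.norm(a, ord=3)  # L3-norm of a

print(round(dist, 2), round(theta_deg, 1), round(l3_norm, 2))  # 4.12 45.0 5.34
```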
1.3.2 Mean and Total Variance
Mean  The mean of the data matrix $\mathbf{D}$ is the vector obtained as the average of all the row-vectors:
$$\text{mean}(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$$
Total Variance  The total variance of the data matrix $\mathbf{D}$ is the average squared distance of each point from the mean:
$$\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n} \delta(\mathbf{x}_i, \boldsymbol{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2 \quad (1.4)$$
Simplifying Eq. (1.4) we obtain
$$\begin{aligned} \text{var}(\mathbf{D}) &= \frac{1}{n}\sum_{i=1}^{n} \left(\|\mathbf{x}_i\|^2 - 2\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) \\ &= \frac{1}{n}\left(\sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - 2n\boldsymbol{\mu}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) + n\|\boldsymbol{\mu}\|^2\right) \\ &= \frac{1}{n}\left(\sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - 2n\boldsymbol{\mu}^T\boldsymbol{\mu} + n\|\boldsymbol{\mu}\|^2\right) \\ &= \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - \|\boldsymbol{\mu}\|^2 \end{aligned}$$
The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).
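A quick numerical check of this identity (a sketch with NumPy on synthetic data; any data matrix would do):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(150, 2))  # synthetic n x d data matrix
mu = D.mean(axis=0)

# Definition (1.4): average squared distance from the mean
var1 = np.mean(np.sum((D - mu) ** 2, axis=1))

# Simplified form: mean squared magnitude minus squared magnitude of the mean
var2 = np.mean(np.sum(D ** 2, axis=1)) - np.sum(mu ** 2)

print(np.isclose(var1, var2))  # True
```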
Centered Data Matrix  Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points:
$$\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} - \begin{pmatrix} \boldsymbol{\mu}^T \\ \boldsymbol{\mu}^T \\ \vdots \\ \boldsymbol{\mu}^T \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T - \boldsymbol{\mu}^T \\ \mathbf{x}_2^T - \boldsymbol{\mu}^T \\ \vdots \\ \mathbf{x}_n^T - \boldsymbol{\mu}^T \end{pmatrix} = \begin{pmatrix} \mathbf{z}_1^T \\ \mathbf{z}_2^T \\ \vdots \\ \mathbf{z}_n^T \end{pmatrix} \quad (1.5)$$
where $\mathbf{z}_i = \mathbf{x}_i - \boldsymbol{\mu}$ represents the centered point corresponding to $\mathbf{x}_i$, and $\mathbf{1} \in \mathbb{R}^n$ is the $n$-dimensional vector all of whose elements have value 1. The mean of the centered data matrix $\mathbf{Z}$ is $\mathbf{0} \in \mathbb{R}^d$, since we have subtracted the mean from all the points $\mathbf{x}_i$.
Figure 1.4: Orthogonal Projection
1.3.3 Orthogonal Projection
Often in data mining we need to project a point or vector onto another vector, for example to obtain a new point after a change of the basis vectors. Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two $m$-dimensional vectors. An orthogonal decomposition of the vector $\mathbf{b}$ in the direction of another vector $\mathbf{a}$, illustrated in Figure 1.4, is given as
$$\mathbf{b} = \mathbf{b}_{\parallel} + \mathbf{b}_{\perp} = \mathbf{p} + \mathbf{r} \quad (1.6)$$
where $\mathbf{p} = \mathbf{b}_{\parallel}$ is parallel to $\mathbf{a}$, and $\mathbf{r} = \mathbf{b}_{\perp}$ is perpendicular or orthogonal to $\mathbf{a}$. The vector $\mathbf{p}$ is called the orthogonal projection or simply projection of $\mathbf{b}$ on the vector $\mathbf{a}$. Note that the point $\mathbf{p} \in \mathbb{R}^m$ is the point closest to $\mathbf{b}$ on the line passing through $\mathbf{a}$. Thus, the magnitude of the vector $\mathbf{r} = \mathbf{b} - \mathbf{p}$ gives the perpendicular distance between $\mathbf{b}$ and $\mathbf{a}$, which is often interpreted as the residual or error vector between the points $\mathbf{b}$ and $\mathbf{p}$.
We can derive an expression for $\mathbf{p}$ by noting that $\mathbf{p} = c\mathbf{a}$ for some scalar $c$, since $\mathbf{p}$ is parallel to $\mathbf{a}$. Thus, $\mathbf{r} = \mathbf{b} - \mathbf{p} = \mathbf{b} - c\mathbf{a}$. Since $\mathbf{p}$ and $\mathbf{r}$ are orthogonal, we have
$$\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b} - c\mathbf{a}) = c\mathbf{a}^T\mathbf{b} - c^2\mathbf{a}^T\mathbf{a} = 0$$
which implies that
$$c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}$$
Therefore, the projection of $\mathbf{b}$ on $\mathbf{a}$ is given as
$$\mathbf{p} = \mathbf{b}_{\parallel} = c\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \quad (1.7)$$
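Equation (1.7) translates directly into code. A minimal sketch with NumPy (the helper name `project` and the example vectors from Figure 1.3 are our own choices for illustration):

```python
import numpy as np

def project(b, a):
    """Orthogonal projection of b onto the line through a, per Eq. (1.7)."""
    return (a @ b / (a @ a)) * a

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

p = project(b, a)  # component of b parallel to a
r = b - p          # residual vector, orthogonal to a
print(np.isclose(a @ r, 0.0))  # True: r is perpendicular to a
```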
Figure 1.5: Projecting the Centered Data onto the Line $\ell$
Example 1.4: Restricting the Iris dataset to the first two dimensions, sepal length and sepal width, the mean point is given as
$$\text{mean}(\mathbf{D}) = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$
which is shown as the black circle in Figure 1.2. The corresponding centered data is shown in Figure 1.5, and the total variance is $\text{var}(\mathbf{D}) = 0.868$ (centering does not change this value).
Figure 1.5 shows the projection of each point onto the line $\ell$, which is the line that maximizes the separation between the class iris-setosa (squares) and the other two classes (circles and triangles). The line $\ell$ is given as the set of all points $(x_1, x_2)^T$ satisfying the constraint
$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = c\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix}$$
for all scalars $c \in \mathbb{R}$.
1.3.4 Linear Independence and Dimensionality
Given the data matrix
$$\mathbf{D} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{pmatrix}^T = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}$$
we are often interested in the linear combinations of the rows (points) or the columns (attributes). For instance, different linear combinations of the original $d$ attributes yield new derived attributes, which play a key role in feature extraction and dimensionality reduction.
Given any set of vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k$ in an $m$-dimensional vector space $\mathbb{R}^m$, their linear combination is given as
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k$$
where $c_i \in \mathbb{R}$ are scalar values. The set of all possible linear combinations of the $k$ vectors is called the span, denoted $\text{span}(\mathbf{v}_1, \dots, \mathbf{v}_k)$, which is itself a vector space, being a subspace of $\mathbb{R}^m$. If $\text{span}(\mathbf{v}_1, \dots, \mathbf{v}_k) = \mathbb{R}^m$, then we say that $\mathbf{v}_1, \dots, \mathbf{v}_k$ is a spanning set for $\mathbb{R}^m$.
Row and Column Space  There are several interesting vector spaces associated with the data matrix $\mathbf{D}$, two of which are the column space and row space of $\mathbf{D}$. The column space of $\mathbf{D}$, denoted $\text{col}(\mathbf{D})$, is the set of all linear combinations of the $d$ column vectors or attributes $X_j \in \mathbb{R}^n$, i.e.,
$$\text{col}(\mathbf{D}) = \text{span}(X_1, X_2, \dots, X_d)$$
By definition $\text{col}(\mathbf{D})$ is a subspace of $\mathbb{R}^n$. The row space of $\mathbf{D}$, denoted $\text{row}(\mathbf{D})$, is the set of all linear combinations of the $n$ row vectors or points $\mathbf{x}_i \in \mathbb{R}^d$, i.e.,
$$\text{row}(\mathbf{D}) = \text{span}(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$$
By definition $\text{row}(\mathbf{D})$ is a subspace of $\mathbb{R}^d$. Note also that the row space of $\mathbf{D}$ is the column space of $\mathbf{D}^T$:
$$\text{row}(\mathbf{D}) = \text{col}(\mathbf{D}^T)$$
Linear Independence  We say that the vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$ are linearly dependent if at least one vector can be written as a linear combination of the others. Alternatively, the $k$ vectors are linearly dependent if there are scalars $c_1, c_2, \dots, c_k$, at least one of which is not zero, such that
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0}$$
On the other hand, $\mathbf{v}_1, \dots, \mathbf{v}_k$ are linearly independent if and only if
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \;\text{ implies }\; c_1 = c_2 = \cdots = c_k = 0$$
Simply put, a set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.
Dimension and Rank  Let $S$ be a subspace of $\mathbb{R}^m$. A basis for $S$ is a set of vectors in $S$, say $\mathbf{v}_1, \dots, \mathbf{v}_k$, that are linearly independent and span $S$, i.e., $\text{span}(\mathbf{v}_1, \dots, \mathbf{v}_k) = S$. In fact, a basis is a minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said to form an orthogonal basis for $S$. If, in addition, they are also normalized to be unit vectors, then they make up an orthonormal basis for $S$. For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors
$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} \quad \cdots \quad \mathbf{e}_m = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$
Any two bases for $S$ must have the same number of vectors, and the number of vectors in a basis for $S$ is called the dimension of $S$, denoted $\dim(S)$. Since $S$ is a subspace of $\mathbb{R}^m$, we must have $\dim(S) \leq m$.
It is a remarkable fact that, for any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix $\mathbf{D} \in \mathbb{R}^{n \times d}$, we have $\text{rank}(\mathbf{D}) \leq \min(n, d)$, which follows from the fact that the column space can have dimension at most $d$, and the row space can have dimension at most $n$. Thus, even though the data points are ostensibly in a $d$-dimensional attribute space (the extrinsic dimensionality), if $\text{rank}(\mathbf{D}) < d$, then the data points reside in a lower dimensional subspace of $\mathbb{R}^d$, and in this case $\text{rank}(\mathbf{D})$ gives an indication about the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods it is often possible to approximate $\mathbf{D} \in \mathbb{R}^{n \times d}$ with a derived data matrix $\mathbf{D}' \in \mathbb{R}^{n \times k}$, which has much lower dimensionality, i.e., $k \ll d$. In this case $k$ may reflect the true intrinsic dimensionality of the data.
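The rank of a data matrix is easy to inspect numerically. A small sketch (NumPy; the matrix below is a made-up example whose third column is a linear combination of the first two):

```python
import numpy as np

# 4 points in a 3-dimensional attribute space; column 3 = column 1 + column 2,
# so the points actually lie in a 2-dimensional subspace of R^3
D = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 4.0]])

print(np.linalg.matrix_rank(D))  # 2: intrinsic dimensionality is 2, not 3
```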
Example 1.5: The line $\ell$ in Figure 1.5 is given as $\ell = \text{span}\left((-2.15, 2.75)^T\right)$, with $\dim(\ell) = 1$. After normalization, we obtain the orthonormal basis for $\ell$ as the unit vector
$$\frac{1}{\sqrt{12.19}}\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix} = \begin{pmatrix} -0.615 \\ 0.788 \end{pmatrix}$$
1.4 Data: Probabilistic View
The probabilistic view of the data assumes that each numeric attribute $X$ is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, $X$ is a function $X: \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$, the domain of $X$, is the set of all possible outcomes of the experiment, also called the sample space, and $\mathbb{R}$, the range of $X$, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then $X: \mathcal{O} \to \mathcal{O}$ is simply the identity function: $X(v) = v$ for all $v \in \mathcal{O}$. The distinction between the outcomes and the value of the random variable is important, since we may want to treat the observed values differently depending on the context, as seen in Example 1.6.
A random variable $X$ is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas $X$ is called a continuous random variable if it can take on any value in its range.
5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9
5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8
5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6
4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6
6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6
4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3
4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7
4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3
5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9
5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1

Table 1.2: Iris dataset: sepal length (in centimeters)
Example 1.6: Consider the sepal length attribute ($X_1$) for the Iris dataset in Table 1.1. All $n = 150$ values of this attribute are shown in Table 1.2; they lie in the range $[4.3, 7.9]$, with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes $\mathcal{O}$.
By default, we can consider the attribute $X_1$ to be a continuous random variable, given as the identity function $X_1(v) = v$, since the outcomes (sepal length values) are all numeric.
On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7 cm or more, we can define a discrete random variable $A$ as follows:
$$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \geq 7 \end{cases}$$
Probability Mass Function  If $X$ is discrete, the probability mass function of $X$ is defined as
$$f(x) = P(X = x) \;\text{ for all } x \in \mathbb{R}$$
In other words, the function $f$ gives the probability $P(X = x)$ that the random variable $X$ has the exact value $x$. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of $X$, and is zero for all other values. $f$ must also obey the basic rules of probability. That is, $f$ must be non-negative:
$$f(x) \geq 0$$
and the sum of all probabilities should add to 1:
$$\sum_x f(x) = 1$$
Example 1.7 (Bernoulli and Binomial Distribution): In Example 1.6, $A$ was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have a sepal length of at least 7 cm. We can thus estimate the probability mass function of $A$ as follows:
$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p$$
and
$$f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$
In this case we say that $A$ has a Bernoulli distribution with parameter $p \in [0, 1]$, which denotes the probability of a success, i.e., the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, $1 - p$ is the probability of a failure, i.e., of not picking an Iris with long sepal length.
Let us consider another discrete random variable $B$, denoting the number of Irises with long sepal length in $m$ independent Bernoulli trials with probability of success $p$. In this case, $B$ takes on the discrete values $[0, m]$, and its probability mass function is given by the Binomial distribution:
$$f(k) = P(B = k) = \binom{m}{k} p^k (1 - p)^{m - k}$$
The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking $k$ long sepal length Irises out of the $m$ trials. For each selection of $k$ long sepal length Irises, the total probability of the $k$ successes is $p^k$, and the total probability of the $m - k$ failures is $(1 - p)^{m - k}$. For example, since $p = 0.087$ from above, the probability of observing exactly $k = 2$ Irises with long sepal length in $m = 10$ trials is given as
$$f(2) = P(B = 2) = \binom{10}{2} (0.087)^2 (0.913)^8 = 0.164$$
Figure 1.6 shows the full probability mass function for different values of $k$ for $m = 10$. Since $p$ is quite small, the probability of $k$ successes in so few trials falls off rapidly as $k$ increases, becoming practically zero for values of $k \geq 6$.
Figure 1.6: Binomial distribution: probability mass function ($m = 10$, $p = 0.087$)
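The binomial mass function in Example 1.7 can be checked directly. A sketch using Python's standard library (`math.comb` gives the binomial coefficient):

```python
from math import comb

def binom_pmf(k, m, p):
    """P(B = k): probability of k successes in m Bernoulli trials."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)

p = 13 / 150                 # estimated success probability from Table 1.2
f2 = binom_pmf(2, 10, p)
print(round(f2, 3))          # 0.164, matching Example 1.7
total = sum(binom_pmf(k, 10, p) for k in range(11))
print(round(total, 10))      # 1.0: the probabilities sum to one
```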
Probability Density Function  If $X$ is continuous, its range is the entire set of real numbers $\mathbb{R}$. The probability of any specific value $x$ is only one out of the infinitely many possible values in the range of $X$, which means that $P(X = x) = 0$ for all $x \in \mathbb{R}$. However, this does not mean that the value $x$ is impossible, since in that case we would conclude that all values are impossible! Rather, it means that the probability mass is spread so thinly over the range of values that it can be measured only over intervals $[a, b] \subset \mathbb{R}$, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which specifies the probability that the variable $X$ takes on values in any interval $[a, b] \subset \mathbb{R}$:
$$P\left(X \in [a, b]\right) = \int_a^b f(x)\,dx$$
As before, the density function $f$ must satisfy the basic laws of probability:
$$f(x) \geq 0, \;\text{ for all } x \in \mathbb{R}$$
and
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
We can get an intuitive understanding of the density function $f$ by considering the probability density over a small interval of width $2\epsilon > 0$, centered at $x$, namely $[x - \epsilon, x + \epsilon]$:
$$P\left(X \in [x - \epsilon, x + \epsilon]\right) = \int_{x - \epsilon}^{x + \epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x)$$
$$f(x) \simeq \frac{P\left(X \in [x - \epsilon, x + \epsilon]\right)}{2\epsilon} \quad (1.8)$$
$f(x)$ thus gives the probability density at $x$, given as the ratio of the probability mass to the width of the interval, i.e., the probability mass per unit distance. Thus, it is important to note that $P(X = x) \neq f(x)$.
Even though the probability density function $f(x)$ does not specify the probability $P(X = x)$, it can be used to obtain the relative probability of one value $x_1$ over another $x_2$, since for a given $\epsilon > 0$, by Eq. (1.8), we have
$$\frac{P(X \in [x_1 - \epsilon, x_1 + \epsilon])}{P(X \in [x_2 - \epsilon, x_2 + \epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \quad (1.9)$$
Thus, if $f(x_1)$ is larger than $f(x_2)$, then values of $X$ close to $x_1$ are more probable than values close to $x_2$, and vice versa.
Example 1.8 (Normal Distribution): Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$$
Figure 1.7: Normal distribution: probability density function ($\mu = 5.84$, $\sigma^2 = 0.681$)
There are two parameters of the normal density function, namely $\mu$, which represents the mean value, and $\sigma^2$, which represents the variance of the values (these parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic bell shape plot of the normal distribution. The parameters, $\mu = 5.84$ and $\sigma^2 = 0.681$, were estimated directly from the data for sepal length in Table 1.2.
Whereas $f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}} \exp\{0\} = 0.483$, we emphasize that the probability of observing $X = \mu$ is zero, i.e., $P(X = \mu) = 0$. Thus, $P(X = x)$ is not given by $f(x)$; rather, $P(X = x)$ is given as the area under the curve for an infinitesimally small interval $[x - \epsilon, x + \epsilon]$ centered at $x$, with $\epsilon > 0$. Figure 1.7 illustrates this with the shaded region centered at $\mu = 5.84$. From Eq. (1.8), we have
$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$
As $\epsilon \to 0$, we get $P(X = \mu) \to 0$. However, based on Eq. (1.9) we can claim that the probability of observing values close to the mean value $\mu = 5.84$ is 2.69 times the probability of observing values close to $x = 7$, since
$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$
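A quick check of these density values (a sketch in which the normal pdf is written out directly rather than imported):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

mu, var = 5.84, 0.681  # parameters estimated from Table 1.2
f_mu = normal_pdf(mu, mu, var)
ratio = f_mu / normal_pdf(7.0, mu, var)
print(round(f_mu, 3), round(ratio, 2))  # 0.483 2.69
```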
Cumulative Distribution Function  For any random variable $X$, whether discrete or continuous, we can define the cumulative distribution function (CDF) $F: \mathbb{R} \to [0, 1]$, which gives the probability of observing a value at most some given value $x$:
$$F(x) = P(X \leq x) \;\text{ for all } -\infty < x < \infty$$
Figure 1.9: Cumulative distribution function for the normal distribution. The marked point is $(\mu, F(\mu)) = (5.84, 0.5)$.
Figure 1.9 shows the cumulative distribution function for the normal density function shown in Figure 1.7. As expected, for a continuous random variable the CDF is also continuous, and non-decreasing. Since the normal distribution is symmetric about the mean, we have $F(\mu) = P(X \leq \mu) = 0.5$.
1.4.1 Bivariate Random Variables
Instead of considering each attribute as a random variable, we can also perform pairwise analysis by considering a pair of attributes, $X_1$ and $X_2$, as a bivariate random variable:
$$\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
$\mathbf{X}: \mathcal{O} \to \mathbb{R}^2$ is a function that assigns to each outcome in the sample space a pair of real numbers, i.e., a 2-dimensional vector $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{R}^2$. As in the univariate case, if the outcomes are numeric, then the default is to assume $\mathbf{X}$ to be the identity function.
Joint Probability Mass Function  If $X_1$ and $X_2$ are both discrete random variables then $\mathbf{X}$ has a joint probability mass function given as follows:
$$f(\mathbf{x}) = f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = P(\mathbf{X} = \mathbf{x})$$
$f$ must satisfy the following two conditions:
$$f(\mathbf{x}) = f(x_1, x_2) \geq 0 \;\text{ for all } -\infty < x_1, x_2 < \infty$$
and
$$\sum_{\mathbf{x}} f(\mathbf{x}) = \sum_{x_1}\sum_{x_2} f(x_1, x_2) = 1$$
Empirical Cumulative Distribution Function (CDF)  Given a random sample $\{x_1, x_2, \dots, x_n\}$ drawn from $X$, its empirical cumulative distribution function is given as
$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \leq x) \quad (2.1)$$
where
$$I(x_i \leq x) = \begin{cases} 1 & \text{if } x_i \leq x \\ 0 & \text{if } x_i > x \end{cases}$$
is a binary indicator variable that indicates whether the given condition is satisfied or not. Intuitively, to obtain the empirical CDF we compute, for each value $x \in \mathbb{R}$, how many points in the sample are less than or equal to $x$. The empirical CDF puts a probability mass of $\frac{1}{n}$ at each point $x_i$. Note that we use the notation $\hat{F}$ to denote the fact that the empirical CDF is an estimate for the unknown population CDF $F$.
Inverse Cumulative Distribution Function  Define the inverse cumulative distribution function or quantile function for a random variable $X$ as follows:
$$F^{-1}(q) = \min\{x \mid F(x) \geq q\} \;\text{ for } q \in [0, 1] \quad (2.2)$$
That is, the inverse CDF gives the least value of $X$ for which at least $q$ fraction of the values are lower or equal, and at most $1 - q$ fraction of the values are higher. The empirical inverse cumulative distribution function $\hat{F}^{-1}$ can be obtained from Eq. (2.1).
Empirical Probability Mass Function (PMF)  The empirical probability mass function of $X$ is given as
$$\hat{f}(x) = P(X = x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i = x) \quad (2.3)$$
where
$$I(x_i = x) = \begin{cases} 1 & \text{if } x_i = x \\ 0 & \text{if } x_i \neq x \end{cases}$$
The empirical PMF also puts a probability mass of $\frac{1}{n}$ at each point $x_i$.
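The empirical CDF and inverse CDF are straightforward to implement. A minimal sketch in plain Python (the helper names and the tiny sample are our own, chosen for illustration; a linear scan over the sorted values keeps the quantile definition of Eq. (2.2) explicit):

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of sample points <= x, per Eq. (2.1)."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def ecdf_inv(sample, q):
    """Empirical inverse CDF: least sample value x with F(x) >= q, per Eq. (2.2)."""
    for x in sorted(sample):
        if ecdf(sample, x) >= q:
            return x

sample = [4.3, 5.8, 5.8, 6.1, 7.9, 5.0]
print(ecdf(sample, 5.8))      # 4 of 6 values are <= 5.8
print(ecdf_inv(sample, 0.5))  # 5.8: the sample median
```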
2.1.1 Measures of Central Tendency
These measures give an indication about the concentration of the probability mass, the "middle" values, and so on.
CHAPTER 2. NUMERIC ATTRIBUTES
Mean

The mean, also called the expected value, of a random variable $X$ is the arithmetic average of the values of $X$. It provides a one-number summary of the location or central tendency for the distribution of $X$.
The mean or expected value of a discrete random variable $X$ is defined as
$$\mu = E[X] = \sum_x x \cdot f(x) \quad (2.4)$$
where $f(x)$ is the probability mass function of $X$.
The expected value of a continuous random variable $X$ is defined as
$$\mu = E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$
where $f(x)$ is the probability density function of $X$.
Sample Mean  The sample mean is a statistic, i.e., a function $\hat{\mu}: \{x_1, x_2, \dots, x_n\} \to \mathbb{R}$, defined as the average value of the $x_i$'s:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (2.5)$$
It serves as an estimator for the unknown mean value $\mu$ of $X$. It can be derived by plugging the empirical PMF $\hat{f}(x)$ into Eq. (2.4):
$$\hat{\mu} = \sum_x x \cdot \hat{f}(x) = \sum_x x \left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Sample Mean is Unbiased  An estimator $\hat{\theta}$ is called an unbiased estimator for parameter $\theta$ if $E[\hat{\theta}] = \theta$ for every possible value of $\theta$. The sample mean $\hat{\mu}$ is an unbiased estimator for the population mean $\mu$, since
$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu \quad (2.6)$$
where we use the fact that the random variables $x_i$ are IID according to $X$, which implies that they have the same mean as $X$, i.e., $E[x_i] = \mu$ for all $x_i$. We also used the fact that the expectation function $E$ is a linear operator, i.e., for any two random variables $X$ and $Y$, and real numbers $a$ and $b$, we have $E[aX + bY] = aE[X] + bE[Y]$.
Robustness  We say that a statistic is robust if it is not affected by extreme values (such as outliers) in the data. The sample mean is unfortunately not robust, since a single large value (an outlier) can skew the average. A more robust measure is the trimmed mean, obtained after discarding a small fraction of extreme values on one or both ends. Furthermore, the mean can be somewhat misleading in that it is typically not a value that occurs in the sample, and it may not even be a value that the random variable can actually assume (for a discrete random variable). For example, the number of cars per capita is an integer-valued random variable, but according to the US Bureau of Transportation Statistics, the average number of passenger cars in the US was 0.45 in 2008 (137.1 million cars, with a population size of 304.4 million). Obviously, one cannot own 0.45 cars; this can be interpreted as saying that on average there are 45 cars per 100 people.
Median

The median of a random variable is defined as the value $m$ such that
$$P(X \leq m) \geq \frac{1}{2} \;\text{ and }\; P(X \geq m) \geq \frac{1}{2}$$
In other words, the median $m$ is the "middle-most" value; half of the values of $X$ are less and half of the values of $X$ are more than $m$. In terms of the (inverse) cumulative distribution function, the median is therefore the value $m$ for which
$$F(m) = 0.5 \;\text{ or }\; m = F^{-1}(0.5)$$
The sample median can be obtained from the empirical CDF (2.1) or the empirical inverse CDF (2.2) by computing
$$\hat{F}(m) = 0.5 \;\text{ or }\; m = \hat{F}^{-1}(0.5)$$
A simpler approach to compute the sample median is to first sort all the values $x_i$ ($i \in [1, n]$) in increasing order. If $n$ is odd, the median is the value at position $\frac{n+1}{2}$. If $n$ is even, the values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$ are both medians.
Unlike the mean, the median is robust, since it is not affected very much by extreme values. Also, it is a value that occurs in the sample and a value the random variable can actually assume.
Mode
The mode of a random variable $X$ is the value at which the probability mass function or the probability density function attains its maximum value, depending on whether $X$ is discrete or continuous, respectively.
The sample mode is a value for which the empirical probability mass function (2.3) attains its maximum, given as
$$\text{mode}(X) = \arg\max_x \hat{f}(x)$$
The mode may not be a very useful measure of central tendency for a sample,since by chance an unrepresentative element may be the most frequent element.Furthermore, if all values in the sample are distinct, each of them will be the mode.
Figure 2.1: Sample mean ($\hat{\mu} = 5.843$) for sepal length. Multiple occurrences of the same value are shown stacked.
Example 2.1 (Sample Mean, Median, and Mode): Consider the attribute sepal length ($X_1$) in the Iris dataset, whose values are shown in Table 1.2. The sample mean is given as follows:
$$\hat{\mu} = \frac{1}{150}(5.9 + 6.9 + \cdots + 7.7 + 5.1) = \frac{876.5}{150} = 5.843$$
Figure 2.1 shows all 150 values of sepal length, and the sample mean. Figure 2.2a shows the empirical CDF and Figure 2.2b shows the empirical inverse CDF for sepal length.
Since $n = 150$ is even, the sample median is the value at positions $\frac{n}{2} = 75$ and $\frac{n}{2} + 1 = 76$ in sorted order. For sepal length both these values are 5.8; thus the sample median is 5.8. From the inverse CDF in Figure 2.2b, we can see that
$$\hat{F}(5.8) = 0.5 \;\text{ or }\; 5.8 = \hat{F}^{-1}(0.5)$$
The sample mode for sepal length is 5, which can be observed from the frequency of 5 in Figure 2.1. The empirical probability mass at $x = 5$ is
$$\hat{f}(5) = \frac{10}{150} = 0.067$$
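These three statistics can be computed in a few lines. A sketch using Python's standard `statistics` module (on a small made-up sample, since the full Table 1.2 is too long to repeat here):

```python
import statistics

sample = [5.9, 6.9, 5.0, 5.0, 5.0, 5.8, 4.6, 7.2]

print(statistics.mean(sample))    # arithmetic average of the sample
print(statistics.median(sample))  # middle value(s) of the sorted sample
print(statistics.mode(sample))    # most frequent value: 5.0
```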
Figure 2.2: Empirical CDF and inverse CDF for sepal length. Panel (a) shows the empirical CDF $\hat{F}(x)$; panel (b) shows the empirical inverse CDF $\hat{F}^{-1}(q)$.
2.1.2 Measures of Dispersion
The measures of dispersion give an indication about the spread or variation in thevalues of a random variable.
Range
The value range or simply range of a random variable $X$ is the difference between the maximum and minimum values of $X$, given as
$$r = \max\{X\} - \min\{X\}$$
The (value) range of $X$ is a population parameter, not to be confused with the range of the function $X$, which is the set of all the values $X$ can assume. Which range is being used should be clear from the context.
The sample range is a statistic, given as
$$\hat{r} = \max_{i=1}^{n}\{x_i\} - \min_{i=1}^{n}\{x_i\}$$
By definition, range is sensitive to extreme values, and thus is not robust.
Inter-Quartile Range
Quartiles are special values of the quantile function (2.2) that divide the data into four equal parts. That is, quartiles correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. The first quartile is the value $q_1 = F^{-1}(0.25)$, to the left of which 25% of the points lie; the second quartile is the same as the median value $q_2 = F^{-1}(0.5)$, to the left of which 50% of the points lie; the third quartile $q_3 = F^{-1}(0.75)$ is the value to the left of which 75% of the points lie; and the fourth quartile is the maximum value of $X$, to the left of which 100% of the points lie.
A more robust measure of the dispersion of $X$ is the inter-quartile range (IQR), defined as
$$\text{IQR} = q_3 - q_1 = F^{-1}(0.75) - F^{-1}(0.25) \quad (2.7)$$
The IQR can also be thought of as a trimmed range, where we discard 25% of the low and high values of $X$. Put differently, it is the range for the middle 50% of the values of $X$. The IQR is robust by definition.
The sample IQR can be obtained by plugging the empirical inverse CDF into Eq. (2.7):
$$\widehat{\text{IQR}} = \hat{q}_3 - \hat{q}_1 = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$$
Variance and Standard Deviation
The variance of a random variable $X$ provides a measure of how much the values of $X$ deviate from the mean or expected value of $X$. More formally, variance is the expected value of the squared deviation from the mean, defined as
$$\sigma^2 = \text{var}(X) = E\left[(X - \mu)^2\right] = \begin{cases} \sum_x (x - \mu)^2 f(x) & \text{if } X \text{ is discrete} \\[4pt] \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases} \quad (2.8)$$
The standard deviation, $\sigma$, is defined as the positive square root of the variance, $\sigma^2$.
We can also write the variance as the difference between the expectation of $X^2$ and the square of the expectation of $X$:
$$\sigma^2 = \text{var}(X) = E[(X - \mu)^2] = E[X^2 - 2X\mu + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - (E[X])^2 \quad (2.9)$$
It is worth noting that variance is in fact the second moment about the mean, corresponding to $r = 2$, which is a special case of the $r$-th moment about the mean for a random variable $X$, defined as $E[(x - \mu)^r]$.
Sample Variance  The sample variance is defined as
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2 \quad (2.10)$$
It is the average squared deviation of the data values $x_i$ from the sample mean $\hat{\mu}$, and can be derived by plugging the empirical probability function $\hat{f}$ from Eq. (2.3) into Eq. (2.8), since
$$\hat{\sigma}^2 = \sum_x (x - \hat{\mu})^2 \hat{f}(x) = \sum_x (x - \hat{\mu})^2 \left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
The sample standard deviation is given as the positive square root of the sample variance:
$$\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2}$$
The standard score, also called the z-score, of a sample value $x_i$ is the number of standard deviations the value is away from the mean:
$$z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$
Put differently, the z-score of $x_i$ measures the deviation of $x_i$ from the mean value $\hat{\mu}$, in units of $\hat{\sigma}$.
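Sample variance, standard deviation, and z-scores take only a few lines. A sketch with NumPy on a small made-up sample (note that `np.var` with the default `ddof=0` divides by $n$, matching the estimator of Eq. (2.10)):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu_hat = x.mean()
var_hat = np.var(x)           # ddof=0: divides by n, as in Eq. (2.10)
sigma_hat = np.sqrt(var_hat)
z = (x - mu_hat) / sigma_hat  # z-scores: deviations in units of sigma

print(mu_hat, var_hat, sigma_hat)  # 5.0 4.0 2.0
```

By construction the z-scores have zero mean.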
Geometric Interpretation of Sample Variance  We can treat the data sample for attribute $X$ as a vector in $n$-dimensional space, where $n$ is the sample size. That is, we write $X = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n$. Further, let
$$Z = X - \mathbf{1}\cdot\hat{\mu} = \begin{pmatrix} x_1 - \hat{\mu} \\ x_2 - \hat{\mu} \\ \vdots \\ x_n - \hat{\mu} \end{pmatrix}$$
denote the mean-subtracted attribute vector, where $\mathbf{1} \in \mathbb{R}^n$ is the $n$-dimensional vector all of whose elements have value 1. We can rewrite Eq. (2.10) in terms of the magnitude of $Z$, i.e., the dot product of $Z$ with itself:
$$\hat{\sigma}^2 = \frac{1}{n}\|Z\|^2 = \frac{1}{n} Z^T Z = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2 \quad (2.11)$$
The sample variance can thus be interpreted as the squared magnitude of the centered attribute vector, or the dot product of the centered attribute vector with itself, normalized by the sample size.
Example 2.2: Consider the data sample for sepal length shown in Figure 2.1. We can see that the sample range is given as
$$\max_i\{x_i\} - \min_i\{x_i\} = 7.9 - 4.3 = 3.6$$
From the inverse CDF for sepal length in Figure 2.2b, we can find the sample IQR as follows:
$$\hat{q}_1 = \hat{F}^{-1}(0.25) = 5.1 \qquad \hat{q}_3 = \hat{F}^{-1}(0.75) = 6.4$$
$$\widehat{\text{IQR}} = \hat{q}_3 - \hat{q}_1 = 6.4 - 5.1 = 1.3$$
The sample variance can be computed from the centered data vector via the expression in Eq. (2.11):
$$\hat{\sigma}^2 = \frac{1}{n}(X - \mathbf{1}\cdot\hat{\mu})^T(X - \mathbf{1}\cdot\hat{\mu}) = 102.168/150 = 0.681$$
The sample standard deviation is then
$$\hat{\sigma} = \sqrt{0.681} = 0.825$$
Variance of the Sample Mean  Since the sample mean $\hat{\mu}$ is itself a statistic, we can compute its mean value and variance. The expected value of the sample mean is simply $\mu$, as we saw in Eq. (2.6). To derive an expression for the variance of the sample mean, we utilize the fact that the random variables $x_i$ are all independent, and thus
$$\text{var}\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \text{var}(x_i)$$
8/21/2019 Data Mining Text Book
55/658
CHAPTER 2. NUMERIC ATTRIBUTES 47
Further since all the xis are identically distributed asX, they have the same varianceas X, i.e.,
var(xi) =2 for alli
Combining the above two facts, we get
var
ni=1
xi
=
ni=1
var(xi) =n
i=1
2 =n2 (2.12)
Further, note that
E
ni=1
xi
= n (2.13)
Using (2.9), (2.12), and (2.13), the variance of the sample mean $\hat\mu$ can be computed as

$$\begin{aligned}
\mathrm{var}(\hat\mu) &= E[(\hat\mu - \mu)^2] = E[\hat\mu^2] - \mu^2 = E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)^2\right] - \frac{1}{n^2}\left(E\left[\sum_{i=1}^{n}x_i\right]\right)^2 \\
&= \frac{1}{n^2}\left(E\left[\left(\sum_{i=1}^{n}x_i\right)^2\right] - \left(E\left[\sum_{i=1}^{n}x_i\right]\right)^2\right) = \frac{1}{n^2}\,\mathrm{var}\left(\sum_{i=1}^{n}x_i\right) = \frac{\sigma^2}{n} \quad (2.14)
\end{aligned}$$

In other words, the sample mean $\hat\mu$ varies or deviates from the mean $\mu$ in proportion to the population variance $\sigma^2$. However, the deviation can be made smaller by considering a larger sample size $n$.
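The $\sigma^2/n$ behavior in (2.14) can be checked by simulation. The sketch below draws many samples from a normal population; the distribution and its parameters are arbitrary choices for illustration:

```python
import random

random.seed(7)
n, trials = 25, 20000
mu_true, sigma = 10.0, 2.0               # hypothetical population mean and std

# Collect the sample mean of many independent samples of size n
sample_means = []
for _ in range(trials):
    sample = [random.gauss(mu_true, sigma) for _ in range(n)]
    sample_means.append(sum(sample) / n)

m = sum(sample_means) / trials
var_of_mean = sum((x - m) ** 2 for x in sample_means) / trials
# var_of_mean should land close to sigma**2 / n = 4.0 / 25 = 0.16
```

Doubling $n$ would roughly halve the observed variance of the sample mean.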
Sample Variance is Biased, but is Asymptotically Unbiased  The sample variance in (2.10) is a biased estimator for the true population variance $\sigma^2$, i.e., $E[\hat\sigma^2] \neq \sigma^2$. To show this we make use of the identity

$$\sum_{i=1}^{n}(x_i - \mu)^2 = n(\hat\mu - \mu)^2 + \sum_{i=1}^{n}(x_i - \hat\mu)^2 \quad (2.15)$$

Computing the expectation of $\hat\sigma^2$ by using (2.15) in the first step, we get

$$E[\hat\sigma^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right] - E[(\hat\mu - \mu)^2] \quad (2.16)$$
Recall that the random variables $x_i$ are IID according to $X$, which means that they have the same mean $\mu$ and variance $\sigma^2$ as $X$. This means that

$$E[(x_i - \mu)^2] = \sigma^2$$

Further, from (2.14) the sample mean $\hat\mu$ has variance $E[(\hat\mu - \mu)^2] = \frac{\sigma^2}{n}$. Plugging these into (2.16) we get

$$E[\hat\sigma^2] = \frac{1}{n}\cdot n\sigma^2 - \frac{\sigma^2}{n} = \left(\frac{n-1}{n}\right)\sigma^2$$

The sample variance $\hat\sigma^2$ is a biased estimator of $\sigma^2$, since its expected value differs from the population variance by a factor of $\frac{n-1}{n}$. However, it is asymptotically unbiased, that is, the bias vanishes as $n \to \infty$, since

$$\lim_{n\to\infty}\frac{n-1}{n} = \lim_{n\to\infty}\left(1 - \frac{1}{n}\right) = 1$$

Put differently, as the sample size increases, we have

$$E[\hat\sigma^2] \to \sigma^2 \text{ as } n \to \infty$$
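The bias factor $\frac{n-1}{n}$ is easy to observe by simulation. A sketch, using a deliberately small $n$ so the bias is pronounced (the standard normal population is an arbitrary choice for illustration):

```python
import random

random.seed(1)
n, trials = 5, 40000        # small n makes the bias factor (n-1)/n = 0.8 visible

# Average the biased (1/n) sample variance over many samples from N(0, 1)
est_sum = 0.0
for _ in range(trials):
    s = [random.gauss(0.0, 1.0) for _ in range(n)]
    mu_hat = sum(s) / n
    est_sum += sum((x - mu_hat) ** 2 for x in s) / n   # biased estimator
avg_est = est_sum / trials
# avg_est should land close to (n - 1) / n * sigma**2 = 0.8, not 1.0
```

Using the $\frac{1}{n-1}$ denominator instead would recover an unbiased estimate.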
2.2 Bivariate Analysis
In bivariate analysis, we consider two attributes at the same time. We are specifically interested in understanding the association or dependence between them, if any. We thus restrict our attention to the two numeric attributes of interest, $X_1$ and $X_2$, with the data $D$ represented as an $n\times 2$ matrix:

$$D = \begin{pmatrix} X_1 & X_2 \\ x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix}$$

Geometrically, we can think of $D$ in two ways. It can be viewed as $n$ points or vectors in two-dimensional space over the attributes $X_1$ and $X_2$, i.e., $x_i = (x_{i1}, x_{i2})^T \in \mathbb{R}^2$. Alternatively, it can be viewed as two points or vectors in an $n$-dimensional space comprising the points, i.e., each column is a vector in $\mathbb{R}^n$, as follows:

$$X_1 = (x_{11}, x_{21}, \dots, x_{n1})^T$$
$$X_2 = (x_{12}, x_{22}, \dots, x_{n2})^T$$

In the probabilistic view, the column vector $X = (X_1, X_2)^T$ is considered a bivariate vector random variable, and the points $x_i$ ($1 \le i \le n$) are treated as a random sample drawn from $X$, i.e., the $x_i$'s are considered independent and identically distributed as $X$.
Empirical Joint Probability Mass Function  The empirical joint probability mass function for $X$ is given as

$$\hat f(x) = P(X = x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i = x) \quad (2.17)$$
$$\hat f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = \frac{1}{n}\sum_{i=1}^{n} I(x_{i1} = x_1, x_{i2} = x_2)$$

where $I$ is an indicator variable that takes on the value one only when its argument is true:

$$I(x_i = x) = \begin{cases} 1 & \text{if } x_{i1} = x_1 \text{ and } x_{i2} = x_2 \\ 0 & \text{otherwise} \end{cases}$$

As in the univariate case, the probability function puts a probability mass of $\frac{1}{n}$ at each point in the data sample.
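The mass-of-$\frac{1}{n}$-per-point view translates directly into code. A sketch for a small discrete sample (the pairs are made up for illustration):

```python
from collections import Counter

# Hypothetical bivariate sample: each tuple is one point (x_i1, x_i2)
data = [(1, 0), (1, 0), (2, 1), (2, 1), (2, 1), (3, 0)]
n = len(data)

# Empirical joint PMF: each occurrence of a point contributes mass 1/n
counts = Counter(data)
f_hat = {x: c / n for x, c in counts.items()}
total_mass = sum(f_hat.values())   # all masses sum to 1
```

Points that repeat simply accumulate multiples of $\frac{1}{n}$.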
2.2.1 Measures of Location and Dispersion
Mean  The bivariate mean is defined as the expected value of the vector random variable $X$, defined as follows:

$$\mu = E[X] = E\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} E[X_1] \\ E[X_2] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad (2.18)$$

In other words, the bivariate mean vector is simply the vector of expected values along each attribute.
The sample mean vector can be obtained from $\hat f_{X_1}$ and $\hat f_{X_2}$, the empirical probability mass functions of $X_1$ and $X_2$, respectively, using (2.5). It can also be computed from the joint empirical PMF in (2.17):

$$\hat\mu = \sum_{x} x\,\hat f(x) = \sum_{x} x\left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (2.19)$$
Variance  We can compute the variance along each attribute, namely $\sigma_1^2$ for $X_1$ and $\sigma_2^2$ for $X_2$, using (2.8). The total variance (1.4) is given as

$$\mathrm{var}(D) = \sigma_1^2 + \sigma_2^2$$

The sample variances $\hat\sigma_1^2$ and $\hat\sigma_2^2$ can be estimated using (2.10), and the sample total variance is simply $\hat\sigma_1^2 + \hat\sigma_2^2$.
2.2.2 Measures of Association
Covariance  The covariance between two attributes $X_1$ and $X_2$ provides a measure of the association or linear dependence between them, and is defined as

$$\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)] \quad (2.20)$$

By linearity of expectation, we have

$$\begin{aligned}
\sigma_{12} &= E[(X_1 - \mu_1)(X_2 - \mu_2)] \\
&= E[X_1 X_2 - X_1\mu_2 - X_2\mu_1 + \mu_1\mu_2] \\
&= E[X_1 X_2] - \mu_2 E[X_1] - \mu_1 E[X_2] + \mu_1\mu_2 \\
&= E[X_1 X_2] - \mu_1\mu_2 \\
&= E[X_1 X_2] - E[X_1]E[X_2] \quad (2.21)
\end{aligned}$$

The expression above can be seen as a generalization of the univariate variance (2.9) to the bivariate case.
If $X_1$ and $X_2$ are independent random variables, then we conclude that their covariance is zero. This is because if $X_1$ and $X_2$ are independent, then we have

$$E[X_1 X_2] = E[X_1]\cdot E[X_2]$$

which in turn implies that

$$\sigma_{12} = 0$$

However, the converse is not true. That is, if $\sigma_{12} = 0$, one cannot claim that $X_1$ and $X_2$ are independent. All we can say is that there is no linear dependence between them, but we cannot rule out that there might be a higher-order relationship or dependence between the two attributes.

The sample covariance between $X_1$ and $X_2$ is given as
$$\hat\sigma_{12} = \frac{1}{n}\sum_{i=1}^{n}(x_{i1} - \hat\mu_1)(x_{i2} - \hat\mu_2) \quad (2.22)$$

It can be derived by substituting the empirical joint probability mass function $\hat f(x_1, x_2)$ from (2.17) into (2.20), as follows:

$$\begin{aligned}
\hat\sigma_{12} &= E[(X_1 - \hat\mu_1)(X_2 - \hat\mu_2)] \\
&= \sum_{x=(x_1,x_2)^T}(x_1 - \hat\mu_1)(x_2 - \hat\mu_2)\,\hat f(x_1, x_2) \\
&= \frac{1}{n}\sum_{x=(x_1,x_2)^T}\sum_{i=1}^{n}(x_1 - \hat\mu_1)(x_2 - \hat\mu_2)\, I(x_{i1} = x_1, x_{i2} = x_2) \\
&= \frac{1}{n}\sum_{i=1}^{n}(x_{i1} - \hat\mu_1)(x_{i2} - \hat\mu_2)
\end{aligned}$$
Notice that the sample covariance is a generalization of the sample variance (2.10), since

$$\hat\sigma_{11} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu_1)(x_i - \hat\mu_1) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu_1)^2 = \hat\sigma_1^2$$

and similarly, $\hat\sigma_{22} = \hat\sigma_2^2$.
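Equation (2.22) translates directly into code; with both arguments equal it reduces to the sample variance, matching $\hat\sigma_{11} = \hat\sigma_1^2$. A sketch with made-up values:

```python
def sample_cov(X1, X2):
    """Biased (1/n) sample covariance between two attribute lists."""
    n = len(X1)
    m1, m2 = sum(X1) / n, sum(X2) / n
    return sum((a - m1) * (b - m2) for a, b in zip(X1, X2)) / n

A = [1.0, 2.0, 3.0, 4.0]
B = [4.0, 3.0, 2.0, 1.0]       # perfectly decreasing as A increases
cov_AA = sample_cov(A, A)      # equals the sample variance of A
cov_AB = sample_cov(A, B)      # negative, reflecting the inverse relationship
```

Here `cov_AA` is $1.25$ and `cov_AB` is $-1.25$, the extreme negative case for these magnitudes.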
Correlation  The correlation between variables $X_1$ and $X_2$ is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as

$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1\sigma_2} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2\sigma_2^2}} \quad (2.23)$$

The sample correlation for attributes $X_1$ and $X_2$ is given as

$$\hat\rho_{12} = \frac{\hat\sigma_{12}}{\hat\sigma_1\hat\sigma_2} = \frac{\sum_{i=1}^{n}(x_{i1} - \hat\mu_1)(x_{i2} - \hat\mu_2)}{\sqrt{\sum_{i=1}^{n}(x_{i1} - \hat\mu_1)^2}\sqrt{\sum_{i=1}^{n}(x_{i2} - \hat\mu_2)^2}} \quad (2.24)$$
Figure 2.3: Geometric Interpretation of Covariance and Correlation. The two centered attribute vectors $Z_1$ and $Z_2$ are shown in the (conceptual) $n$-dimensional space $\mathbb{R}^n$ spanned by the $n$ points.
Geometric Interpretation of Sample Covariance and Correlation  Let $Z_1$ and $Z_2$ denote the centered attribute vectors in $\mathbb{R}^n$, given as follows:

$$Z_1 = X_1 - \mathbf{1}\cdot\hat\mu_1 = \begin{pmatrix} x_{11} - \hat\mu_1 \\ x_{21} - \hat\mu_1 \\ \vdots \\ x_{n1} - \hat\mu_1 \end{pmatrix} \qquad Z_2 = X_2 - \mathbf{1}\cdot\hat\mu_2 = \begin{pmatrix} x_{12} - \hat\mu_2 \\ x_{22} - \hat\mu_2 \\ \vdots \\ x_{n2} - \hat\mu_2 \end{pmatrix}$$
The sample covariance (2.22) can then be written as

$$\hat\sigma_{12} = \frac{Z_1^T Z_2}{n}$$

In other words, the covariance between the two attributes is simply the dot product between the two centered attribute vectors, normalized by the sample size. The above can be seen as a generalization of the univariate sample variance given in (2.11).

The sample correlation (2.24) can be written as

$$\hat\rho_{12} = \frac{Z_1^T Z_2}{\sqrt{Z_1^T Z_1}\sqrt{Z_2^T Z_2}} = \frac{Z_1^T Z_2}{\|Z_1\|\,\|Z_2\|} = \left(\frac{Z_1}{\|Z_1\|}\right)^T\left(\frac{Z_2}{\|Z_2\|}\right) = \cos\theta \quad (2.25)$$

Thus, the correlation coefficient is simply the cosine of the angle (1.3) between the two centered attribute vectors, as illustrated in Figure 2.3.
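The cosine identity in (2.25) can be sketched as follows; perfectly linearly related attributes give $\cos\theta = \pm 1$ (the sample values are illustrative only):

```python
import math

def sample_corr(X1, X2):
    """Sample correlation as the cosine of the angle between centered vectors."""
    n = len(X1)
    m1, m2 = sum(X1) / n, sum(X2) / n
    Z1 = [x - m1 for x in X1]              # centered vector for X1
    Z2 = [x - m2 for x in X2]              # centered vector for X2
    dot = sum(a * b for a, b in zip(Z1, Z2))
    norm1 = math.sqrt(sum(a * a for a in Z1))
    norm2 = math.sqrt(sum(b * b for b in Z2))
    return dot / (norm1 * norm2)

r_pos = sample_corr([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # angle 0   -> +1
r_neg = sample_corr([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])   # angle 180 -> -1
```

An angle near $90^\circ$ (correlation near zero) indicates nearly orthogonal centered vectors, i.e., no linear dependence.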
Covariance Matrix  The variance-covariance information for the two attributes $X_1$ and $X_2$ can be summarized in the square $2\times 2$ covariance matrix, given as

$$\begin{aligned}
\Sigma &= E[(X - \mu)(X - \mu)^T] \\
&= E\left[\begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{pmatrix}\begin{pmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{pmatrix}\right] \\
&= \begin{pmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] \end{pmatrix} \\
&= \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} \quad (2.26)
\end{aligned}$$

Since $\sigma_{12} = \sigma_{21}$, $\Sigma$ is a symmetric matrix. The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.
The total variance of the two attributes is given as the sum of the diagonal elements of $\Sigma$, which is also called the trace of $\Sigma$, given as

$$\mathrm{var}(D) = \mathrm{tr}(\Sigma) = \sigma_1^2 + \sigma_2^2$$

We immediately have $\mathrm{tr}(\Sigma) \ge 0$.

The generalized variance of the two attributes also considers the covariance, in addition to the attribute variances, and is given as the determinant $\det(\Sigma)$ of the covariance matrix $\Sigma$; it is also denoted as $|\Sigma|$. The generalized variance is non-negative, since

$$|\Sigma| = \det(\Sigma) = \sigma_1^2\sigma_2^2 - \sigma_{12}^2 = \sigma_1^2\sigma_2^2 - \rho_{12}^2\sigma_1^2\sigma_2^2 = (1 - \rho_{12}^2)\sigma_1^2\sigma_2^2$$

where we used (2.23), i.e., $\sigma_{12} = \rho_{12}\sigma_1\sigma_2$. Note that $|\rho_{12}| \le 1$ implies that $\rho_{12}^2 \le 1$, which in turn implies that $\det(\Sigma) \ge 0$, i.e., the determinant is non-negative.
The sample covariance matrix is given as

$$\hat\Sigma = \begin{pmatrix} \hat\sigma_1^2 & \hat\sigma_{12} \\ \hat\sigma_{12} & \hat\sigma_2^2 \end{pmatrix}$$

The sample covariance matrix $\hat\Sigma$ shares the same properties as $\Sigma$, i.e., it is symmetric and $|\hat\Sigma| \ge 0$, and it can be used to easily obtain the sample total and generalized variance.
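The sample covariance matrix and the identity $\det\hat\Sigma = (1-\hat\rho_{12}^2)\hat\sigma_1^2\hat\sigma_2^2$ can be checked numerically; the data below is made up for illustration:

```python
import math

# Hypothetical bivariate sample, one list per attribute
X1 = [1.0, 2.0, 4.0, 5.0]
X2 = [1.0, 3.0, 3.0, 5.0]
n = len(X1)
m1, m2 = sum(X1) / n, sum(X2) / n
s11 = sum((a - m1) ** 2 for a in X1) / n                       # variance of X1
s22 = sum((b - m2) ** 2 for b in X2) / n                       # variance of X2
s12 = sum((a - m1) * (b - m2) for a, b in zip(X1, X2)) / n     # covariance

Sigma = [[s11, s12], [s12, s22]]   # sample covariance matrix (symmetric)
total_var = s11 + s22              # trace: sample total variance
gen_var = s11 * s22 - s12 * s12    # determinant: sample generalized variance
rho = s12 / math.sqrt(s11 * s22)   # sample correlation
# identity: gen_var == (1 - rho**2) * s11 * s22, hence gen_var >= 0
```

The determinant shrinks toward zero as $|\hat\rho_{12}|$ approaches 1, i.e., as the two attributes become linearly dependent.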
Figure 2.4: Correlation between sepal length and sepal width ($X_1$: sepal length, $X_2$: sepal width).
Example 2.3 (Sample Mean and Covariance): Consider the sepal length and sepal width attributes for the Iris dataset, plotted in Figure 2.4. There are $n = 150$ points in the $d = 2$ dimensional attribute space. The sample mean vector is given as

$$\hat\mu = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$

The sample covariance matrix is given as

$$\hat\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$

The variance for sepal length is $\hat\sigma_1^2 = 0.681$, and that for sepal width is $\hat\sigma_2^2 = 0.187$. The covariance between the two attributes is $\hat\sigma_{12} = -0.039$, and the correlation between them is

$$\hat\rho_{12} = \frac{-0.039}{\sqrt{0.681\cdot 0.187}} = -0.109$$

Thus, there is a very weak negative correlation between these two attributes, as evidenced by the best linear fit line in Figure 2.4. Alternatively, we can consider the attributes sepal length and sepal width as two points in $\mathbb{R}^n$. The correlation is then the cosine of the angle between them; we have

$$\hat\rho_{12} = \cos\theta = -0.109, \text{ which implies that } \theta = \cos^{-1}(-0.109) = 96.26^\circ$$

The angle is close to $90^\circ$, i.e., the two attribute vectors are almost orthogonal, indicating weak correlation. Further, the angle being greater than $90^\circ$ indicates negative correlation.

The sample total variance is given as

$$\mathrm{tr}(\hat\Sigma) = 0.681 + 0.187 = 0.868$$

and the sample generalized variance is given as

$$|\hat\Sigma| = \det(\hat\Sigma) = 0.681\cdot 0.187 - (-0.039)^2 = 0.126$$
2.3 Multivariate Analysis
In multivariate analysis, we consider all the $d$ numeric attributes $X_1, X_2, \dots, X_d$. The full data is an $n\times d$ matrix, given as

$$D = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \\ x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

In the row view, the data can be considered as a set of $n$ points or vectors in the $d$-dimensional attribute space

$$x_i = (x_{i1}, x_{i2}, \dots, x_{id})^T \in \mathbb{R}^d$$

In the column view, the data can be considered as a set of $d$ points or vectors in the $n$-dimensional space spanned by the data points

$$X_j = (x_{1j}, x_{2j}, \dots, x_{nj})^T \in \mathbb{R}^n$$

In the probabilistic view, the $d$ attributes are modeled as a vector random variable, $X = (X_1, X_2, \dots, X_d)^T$, and the points $x_i$ are considered to be a random sample drawn from $X$, i.e., they are independent and identically distributed as $X$.
Mean  Generalizing (2.18), the multivariate mean vector is obtained by taking the mean of each attribute, given as

$$\mu = E[X] = \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_d] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{pmatrix}$$

Generalizing (2.19), the sample mean is given as

$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Covariance Matrix  Generalizing (2.26) to $d$ dimensions, the multivariate covariance information is captured by the $d\times d$ (square) symmetric covariance matrix that gives the covariance for each pair of attributes:

$$\Sigma = E[(X-\mu)(X-\mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$

The diagonal element $\sigma_i^2$ specifies the attribute variance for $X_i$, whereas the off-diagonal elements $\sigma_{ij} = \sigma_{ji}$ represent the covariance between attribute pairs $X_i$ and $X_j$.
Covariance Matrix is Positive Semi-definite  It is worth noting that $\Sigma$ is a positive semi-definite matrix, i.e.,

$$a^T\Sigma a \ge 0 \text{ for any } d\text{-dimensional vector } a$$

To see this, observe that

$$a^T\Sigma a = a^T E\left[(X-\mu)(X-\mu)^T\right]a = E\left[a^T(X-\mu)(X-\mu)^T a\right] = E[Y^2] \ge 0$$

where $Y$ is the random variable $Y = a^T(X-\mu) = \sum_{i=1}^{d} a_i(X_i - \mu_i)$, and we use the fact that the expectation of a squared random variable is non-negative.
Since $\Sigma$ is also symmetric, this implies that all the eigenvalues of $\Sigma$ are real and non-negative. In other words, the $d$ eigenvalues of $\Sigma$ can be arranged from the largest to the smallest as follows: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. A consequence is that the determinant of $\Sigma$ is non-negative:

$$\det(\Sigma) = \prod_{i=1}^{d}\lambda_i \ge 0 \quad (2.27)$$
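Positive semi-definiteness and the non-negative determinant of (2.27) can be sketched for a $2\times 2$ case, where the eigenvalues follow from the quadratic formula (the matrix entries below are made up for illustration):

```python
import math
import random

random.seed(3)
S = [[2.0, 0.6], [0.6, 1.0]]   # hypothetical symmetric 2x2 covariance matrix

# The quadratic form a^T S a stays non-negative over random directions a
min_q = float("inf")
for _ in range(1000):
    a0, a1 = random.uniform(-1, 1), random.uniform(-1, 1)
    q = a0 * a0 * S[0][0] + 2 * a0 * a1 * S[0][1] + a1 * a1 * S[1][1]
    min_q = min(min_q, q)

# Eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial:
# lambda^2 - tr*lambda + det = 0
tr = S[0][0] + S[1][1]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
disc = math.sqrt(tr * tr / 4.0 - det)
lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc   # lam1 >= lam2 >= 0
# det equals the product lam1 * lam2, hence det >= 0 as in (2.27)
```

For general $d$, the same conclusion follows from the spectral decomposition of the symmetric matrix $\Sigma$.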
Total and Generalized Variance  The total variance is given as the trace of the covariance matrix:

$$\mathrm{var}(D) = \mathrm{tr}(\Sigma) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_d^2 \quad (2.28)$$
Being a sum of squares, t