Benchmarking of Classification Algorithms Using a Challenging Dataset

Chao Ji
December 14, 2009

Introduction

Multivariate statistics is concerned with the situation where each data object is described by more than one statistical variable, typically more than 100 in real applications. One type of problem studied in the multivariate statistics and machine learning communities is supervised learning, in which one wants to predict the values of outputs from observed input variables by learning the input-output relationship from training data. The prediction task is also known as classification if the outputs are categorical, discrete or qualitative. Multiple classification algorithms have been developed and are widely used.

The purpose of this class project is to benchmark the performance of Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest-Neighbor (KNN) and Support Vector Machine (SVM) on a single challenging dataset, and to assess the impact of various factors (e.g. the number of PCs used, the number of nearest neighbors in KNN) on the performance of these algorithms.

The dataset is obtained from the UCI Machine Learning Repository. It is an artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features. Based on those 20 features one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of distractor features that have no predictive power were also added. The order of the features and patterns was randomized. Specifically, the dataset contains 1000 positive and 1000 negative instances, each of which is described by 500 features.

Results

Data Visualization

Due to the high-dimensional nature of this dataset, it is sensible to perform PCA before data visualization.
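To make the dataset description in the Introduction concrete, the following is a minimal sketch of how data with that structure could be generated. It is not the procedure used to build the actual UCI dataset: the function name `make_hypercube_dataset`, the sample size, the number of distractors, the cluster noise level and the weight range are all assumptions chosen to keep the example small, and the sketch does not enforce the exact 1000/1000 class balance.

```python
import random


def make_hypercube_dataset(n_samples=200, n_informative=5, n_redundant=15,
                           n_distractors=30, cluster_std=0.3, seed=0):
    """Generate toy data shaped like the dataset described in the text.

    Every default value here is an illustrative assumption, not a
    parameter taken from the original dataset.
    """
    rng = random.Random(seed)

    # One cluster per hypercube vertex: 2**5 = 32 vertices at (+-1, ..., +-1).
    vertices = [
        [1.0 if (v >> bit) & 1 else -1.0 for bit in range(n_informative)]
        for v in range(2 ** n_informative)
    ]
    # Each cluster (not each point) receives a random class label.
    vertex_labels = [rng.choice([-1, 1]) for _ in vertices]

    # Random weights that define the redundant linear combinations.
    weights = [
        [rng.uniform(-1.0, 1.0) for _ in range(n_informative)]
        for _ in range(n_redundant)
    ]

    rows, labels = [], []
    for _ in range(n_samples):
        c = rng.randrange(len(vertices))
        # Informative features: the vertex coordinates plus Gaussian noise.
        informative = [x + rng.gauss(0.0, cluster_std) for x in vertices[c]]
        # Redundant features: linear combinations of the informative ones.
        redundant = [sum(w * x for w, x in zip(ws, informative))
                     for ws in weights]
        # Distractors: pure noise, independent of the label.
        distractors = [rng.gauss(0.0, 1.0) for _ in range(n_distractors)]
        rows.append(informative + redundant + distractors)
        labels.append(vertex_labels[c])

    # Shuffle the column order so useful features cannot be found by position.
    order = list(range(n_informative + n_redundant + n_distractors))
    rng.shuffle(order)
    rows = [[row[j] for j in order] for row in rows]
    return rows, labels, order


X, y, order = make_hypercube_dataset()
print(len(X), len(X[0]))  # number of instances, number of features
```

Because the labels are attached to clusters at random, neighbouring vertices of the hypercube can carry opposite labels, which is one way to see why a single linear boundary is unlikely to separate the classes.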
Figure ?? shows the scatterplots of the data points projected onto the subspace spanned by the first two and the first three PCs. Red and blue dots represent positive (+1) and negative (-1) instances respectively. Visually, there is no clearly discernible pattern or structure. In addition, as can be seen in the scree plot (Figure 2), only 30% of the variance is captured even if we use the first 10 principal components, and it is not clear which principal components are the 5 "informative" ones. In fact, the first 225 principal components would be needed to explain 90% of the total variance. Figure ?? shows the h-plots superimposed on the 2-D and 3-D representations of the data, in which most of the vectors have negligibly small lengths and those vectors with substantial lengths seem to point in random directions. Taken together, these results indicate that this dataset is highly non-linearly separable and that more coordinates must be preserved to reveal its structure.
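The count of principal components needed to reach a given share of the variance, as read off the scree plot above, can be computed from the per-component variances (the eigenvalues of the covariance matrix). The sketch below shows that calculation; the helper name `components_for_variance` is hypothetical and the toy variances are made-up numbers, not the eigenvalues of the dataset studied here.

```python
def components_for_variance(variances, target=0.90):
    """Return the smallest k such that the k largest components
    together account for at least `target` of the total variance."""
    ordered = sorted(variances, reverse=True)  # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")

    running = 0.0
    for k, var in enumerate(ordered, start=1):
        running += var
        if running / total >= target:
            return k
    # Only reached through rounding error when target is 1.0.
    return len(ordered)


# Toy spectrum (illustrative values only): cumulative shares 0.4, 0.7, 0.9, 1.0.
toy_variances = [4.0, 3.0, 2.0, 1.0]
k80 = components_for_variance(toy_variances, target=0.80)
print(k80)  # 3 for this toy spectrum
```

Applied to the 500 eigenvalues of the real data with `target=0.90`, the same loop would return the number of components reported in the text, provided the variances are computed the same way as in the original analysis.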