Machine Learning Practical NITP Summer Course 2015
Pamela K. Douglas, UCLA Semel Institute
Email: [email protected]
Topics Covered
Part I: WEKA Basics
Part II: MONK Data Set & Feature Selection (from Kohavi & John, 1997)
• We will run this part together
Part III: Applying WEKA to the Haxby data set
Part 1. Weka Basics

Background. What is Weka? Weka is data mining software written in Java. It contains a collection of machine learning algorithms (supervised & unsupervised), regression tools, and feature selection methods. Weka is open source and freely available at: http://www.cs.waikato.ac.nz/ml/weka/

Currently, Weka only deals with "flat" files. However, an import NIfTI button will be added to the next version of Weka. Input files are called Attribute-Relation File Format (.arff) files. Until their "brain button" is released, you must first convert your data into this format. Example MATLAB files to do this are available on the NITP website.

Benefits of Weka:
1.) It is very easy to do cross validation & nested cross validation with the simple use of a flag.
2.) There are many classifiers available. Each of these has been vetted by the machine learning community.
3.) The classification part is very fast. The art of using WEKA is in the feature selection step.
The Weka File Format

In WEKA, features are called "attributes." The first section of the input file is the header. In this section, one simply names (or initializes) all of the features. Each feature must be declared on a separate line starting with @attribute, followed by the feature name, and then the variable type.
Example: @attribute HaxbyVoxel1 real
The <variable type> can be any of the options supported by Weka:
numeric
integer
real
string
date
Note that for most neuroimaging features, we will use either a real or numeric variable type. However, we may also wish to include sex or behavioral data that may be a string (e.g., "female"). The next part of the .arff file contains the data: each example is a comma-separated list of feature values followed by the class label (e.g., face). The entries on each line must correspond to the order of the features/attributes listed in the header. For example, if there are 4 voxels being used as features, and the first example (or instance) is one in which the subject viewed a face, the first data line may look like this:

-0.23, 0.56, 0.78, 0.51, face

A complete minimal file along these lines is sketched below.
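To make the format concrete, here is a minimal, hypothetical .arff file for the 4-voxel example (all names are illustrative, and the class labels assume a two-category face/house problem, not any supplied data set):

@relation HaxbyExample

@attribute Voxel1 real
@attribute Voxel2 real
@attribute Voxel3 real
@attribute Voxel4 real
@attribute class {face, house}

@data
-0.23, 0.56, 0.78, 0.51, face
0.11, -0.42, 0.33, 0.90, house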
Testing (a Variety of) Machine Learning Classifiers

Why Test Multiple Classifiers? According to the Wolpert & Macready "no free lunch" theorem, there is no single learning algorithm that universally performs best across all domains. Most supervised ML algorithms differ in the complexity of the model g(x|θ) that they use to describe the inputs x using the parameters θ (the inductive bias), the loss function used, and/or the optimization procedure used to best fit the model parameters to the data. You may therefore wish to test a series of classifier model hypotheses. Let's try a few using the WEKA GUI on one of the supplied test sets.

Launch the graphical user interface for Weka by navigating to WEKA-3-6. Then double click on the Weka icon.
From here, select the icon for Explorer on the main Weka menu (see below).
You should now see a screen like that shown below.
From the Preprocess menu (at the top), select ‘Open File…’, and navigate to the iris.arff file. Once you have selected this file, the data should be loaded in, and should look like what you see below.
Select the button, “Visualize All,” to view a histogram of each attribute’s distribution by class. Are there some attributes that are more informative than others?
Now select "Classify" from the top menu. We will now select the classifier to use on the iris data set. We will start with the J48 algorithm, which implements the C4.5 decision tree originally described by Quinlan (1993). You can find this algorithm by selecting classifiers >> trees >> J48.
There are various approaches to determining the performance of classifiers; cross-validation, however, is the most popular. In cross-validation, a number of folds n is specified. The dataset is randomly reordered and then split into n folds of equal size. In each iteration, one fold is used for testing and the other n-1 folds are used for training the classifier. The test results are collected and averaged over all folds, giving the cross-validation estimate of the accuracy. (The same evaluation can also be run from the command line, as sketched below.)
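As a hedged sketch of the command-line form (assuming weka.jar is on your Java classpath and iris.arff is in the current directory):

java -cp weka.jar weka.classifiers.trees.J48 -t iris.arff -x 10

Here -t names the training file and -x sets the number of cross-validation folds; this will come in handy later for batching.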
The default is to randomly assign your data to folds. However, you can create folds that contain approximately the same proportion of each class; this is called stratified cross-validation. Leave-one-out (LOO) cross-validation means that n is equal to the number of examples. Out of necessity, LOO CV has to be non-stratified, i.e., the class distributions in the test set are not related to those in the training data. Leave-one-out CV can be useful for datasets with a small number of exemplars, since it utilizes the greatest amount of training data. However, it provides only one example per fold for testing and assessing accuracy.

Here, we will start by using the default, 10-fold (randomly assigned) cross-validation. Now click Start. The Weka bird will pace back and forth while you classify. The output should look like what you see below:
You may wish to examine the confusion matrix, which indicates the number of correctly classified instances on the diagonal, and misclassified instances on the off-diagonal.
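For instance, a confusion matrix in Weka's output format might look like the following (the counts here are purely hypothetical, not your expected results; rows are true classes, columns are predicted classes):

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

In this hypothetical output, 49 setosa examples were classified correctly and 1 was mistaken for versicolor.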
Now try out a different classifier, Naïve Bayes. How does Naïve Bayes compare to the J48 tree? How about a Support Vector Machine? (Hint: it's located under classifiers > functions, and is called SMO.)
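If you prefer the command line, the equivalent runs would be (again a sketch, assuming weka.jar on the classpath):

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t iris.arff -x 10
java -cp weka.jar weka.classifiers.functions.SMO -t iris.arff -x 10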
Part 2. Monk-1 Data Set – Feature Selection

Background. The classic Monk-1 dataset (Thrun et al. 1991), available from the UCI machine learning repository (http://archive.ics.uci.edu/ml/machine-learning-databases/monks-problems/monks.names), was the first dataset used in an international competition applying ML algorithms to the same data. It is sometimes still used for ML benchmarking purposes.

Now, load monk1_train.arff using the Preprocess tab at the top, as you did previously for the iris data. Next click on Classify, and select "Supplied test set." Navigate to the file called monk1_test.arff. Click Close. Try using AdaBoost to classify this data set (located under the "meta" menu tab).

The dangers of "circular logic" have been discussed in detail in the neuroimaging literature (see Kriegeskorte, Circular analysis in systems neuroscience: the dangers of double dipping. Nat Neurosci. 2009). In order to avoid this pitfall, one can set aside a separate set altogether, called a validation set, which has not been touched at all in any of the processing.

** NOTE – Feature selection should be run on your training data only!!! Using your test data in feature selection is another form of "peeking."

Feature Selection Step. Redundant (or highly correlated) features degrade classifier performance. Neuroimaging data have many spatially contiguous, highly correlated voxels. In this Monk-1 data set, see how only a few redundant features cause problems! With the Monk-1 data set, features 1 and 2 are highly correlated. Try going back to the Preprocess screen, and select the second feature for removal. Using AdaBoost, try classifying the data set again. (Note: you will need to use the validation set called monk1_test_minus2.arff.) Did the classifier perform better?

There are a number of approaches to feature subset selection. Forward selection begins with an empty set of features, whereas backward elimination refers to a search that begins with the full set of features. Each of these methods is generally performed iteratively. You may wish to try out a few of the feature selection methods on the original monk1_train.arff data set. To do so, click on the Weka tab called "Select Attributes" at the top. The same removal-and-classification steps can also be scripted from the command line, as sketched below.
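A command-line sketch of the removal and re-classification (assuming weka.jar on the classpath; the intermediate file name monk1_train_minus2.arff is illustrative):

java -cp weka.jar weka.filters.unsupervised.attribute.Remove -R 2 -i monk1_train.arff -o monk1_train_minus2.arff
java -cp weka.jar weka.classifiers.meta.AdaBoostM1 -t monk1_train_minus2.arff -T monk1_test_minus2.arff

Here -R 2 deletes the second attribute, -t names the training file, and -T names the supplied test set.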
Part 3. Running the Haxby et al. 2001 Data Set with Weka

3.1. Background & Setup. In the well-known Haxby et al. 2001 Science paper, fMRI data were collected while subjects passively viewed stimuli from one of eight categories of objects. Distributed and overlapping response patterns to each stimulus category were identified – even within regions that responded maximally to only one category. Here we will explore these data using Weka and MVPA scripts.

First, let's get started with Weka, and see how different classifiers perform on these data. Open MATLAB, and add the folder NITP_ML_2015 to your path. Make sure to "add with subfolders." Note, this folder contains the following:
a.) Scripts for converting NIFTI data to Weka format
b.) The MVPA toolbox
c.) The MATLAB NIFTI toolbox
If you already have MVPA and the NIFTI toolbox installed, you may wish to add only the main folder, or selected subfolders as appropriate; a sketch of the addpath command appears below.
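From the MATLAB command line, adding the folder with all of its subfolders might look like this (the path is hypothetical; substitute your own location):

% Add the course folder and all of its subfolders to the MATLAB path
addpath(genpath('/Users/NITP_Student/NITP_ML_2015'));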
Step 1: Convert the Haxby Data to Weka format. Open the script called "create_arff_nifti_2015.m". Notice that it is rather straightforward to convert the data to Weka format. In this first script, we apply the ventral temporal lobe masks provided with the Haxby data set to select all voxels within that ROI. To run this script, you simply provide two input variables. The first is the path to the data (string). We will try this out for subject 1. The second is what you wish to name your output Weka file (string). For example, at the MATLAB command line, you might type:

data_dir = '/Users/NITP_Student/NITP_ML_2015/subj1/';
Weka_file = 'Haxby_Subj1_ROI';

Now, run the script:

create_arff_nifti_2015(data_dir, Weka_file)

Check the output. A file called Haxby_Subj1_ROI.arff should have been created in the data directory.
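To give a feel for what such a conversion involves, here is a rough sketch in MATLAB (this is NOT the actual contents of create_arff_nifti_2015.m; file names, the labels variable, and the use of the NIFTI toolbox's load_nii are illustrative assumptions):

% Illustrative sketch of a NIfTI-to-ARFF conversion
nii   = load_nii(fullfile(data_dir, 'bold.nii'));      % 4-D fMRI time series
mask  = load_nii(fullfile(data_dir, 'mask4_vt.nii'));  % ventral temporal ROI mask
idx   = find(mask.img > 0);                            % voxels inside the ROI
nVols = size(nii.img, 4);

fid = fopen([Weka_file '.arff'], 'w');
fprintf(fid, '@relation %s\n\n', Weka_file);
for v = 1:numel(idx)
    fprintf(fid, '@attribute HaxbyVoxel%d real\n', v);
end
fprintf(fid, '@attribute class {face,house,cat,bottle,scissors,shoe,chair,scrambledpix}\n\n@data\n');
for t = 1:nVols
    vol = nii.img(:,:,:,t);
    fprintf(fid, '%g,', double(vol(idx)));   % one value per ROI voxel
    fprintf(fid, '%s\n', labels{t});         % class label per volume (assumed to be given)
end
fclose(fid);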
Load the file into Weka, and test out the support vector machine (SMO). On the 8-class problem, you should get ~65% accuracy. Right click on the bar at the top to see all the options available. You can try out different kernels, and change parameters like the 'C' penalty term. To get a detailed description of your options, click "More." An equivalent command-line run is sketched below.
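A hedged command-line equivalent (weka.jar on the classpath; the kernel specification follows Weka 3.6 conventions, and the memory request is illustrative):

java -Xmx2g -cp weka.jar weka.classifiers.functions.SMO -C 1.0 -K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0" -t Haxby_Subj1_ROI.arff -x 10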
Try out some other classifiers. You might try the NaiveBayes classifier, which applies the conditional independence assumption. As you will see, it performs very poorly here. On the other hand, you might try the MultiClassClassifier under "Meta." How does this perform under the default parameters? (Better than SVM?) Note, with many of the "meta" classifiers, you have the option to choose from a variety of base classifiers whose outputs are typically boosted (e.g., AdaBoost) or voted upon, perhaps after a bagging procedure (e.g., Random Forest). OK! Now you should be somewhat familiar with Weka!
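Before moving on, here is one more command-line sketch showing how a meta classifier wraps a base classifier via the -W option (DecisionStump is AdaBoostM1's default base learner):

java -cp weka.jar weka.classifiers.meta.AdaBoostM1 -t Haxby_Subj1_ROI.arff -x 10 -W weka.classifiers.trees.DecisionStump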
Regularization & Hyperparameter Tuning

"To validate the generalization ability of a classifier with hyperparameters one has to perform a nested cross-validation. On each training set of the outer cross-validation, an inner cross-validation is performed for different values of the hyperparameters. The one with minimum (inner) cross-validation error is selected and evaluated on the test set of the outer cross-validation." – Müller et al. (2004)

Why perform parameter tuning? One example that illustrates why this is important comes from the Alpaydin (2004) textbook: when too few neighbors are used in K-NN (e.g., k = 1), the algorithm overfits the training data, and won't generalize well to incoming data sets.

Instead of performing feature selection as a separate step, you may wish to use regularization. A regularization term can be added to an ML algorithm's objective function that trades off complexity and accuracy. The 'C' term in SVM is essentially a regularization term. If time permits, you should try out different values for the C parameter. You may need to make large (order of magnitude) changes to see a difference. Tuning the 'C' parameter can be vital. Note – in order to do this properly you will need to perform a "Nested Cross-Validation." Within MVPA, you can do this easily, and there are a number of good tutorial walk-throughs. For the purposes of this exercise, we will simply test how this parameter influences the outcome using cross-validation on the training data. If using Weka for this purpose, it is useful to optimize hyperparameters using command-line scripts.

Note on Batching Weka. To batch Weka, you can create scripts (bash, Perl, etc.) that use its command-line options. The bar that you right click to get the options also gives you the line that you would type to run Weka at the command line. With large neuroimaging data sets, you may need to request more memory: use the JVM '-Xmx' flag followed by the memory request, as sketched below.
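A minimal bash sketch of a C-parameter sweep (the file name and the 2 GB memory request are illustrative):

#!/bin/bash
# Sweep the SMO complexity parameter C over several orders of magnitude,
# estimating accuracy by 10-fold cross-validation on the training data.
for C in 0.01 0.1 1 10 100; do
    echo "=== C = $C ==="
    java -Xmx2g -cp weka.jar weka.classifiers.functions.SMO -C $C -t Haxby_Subj1_ROI.arff -x 10
done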
MVPA: The Haxby et al. 2001 Data Set

MVPA is perfectly suited for classifying and decoding fMRI data. It has a number of capabilities built in for feature selection, cross-validation, and parameter tuning. One of the nice aspects of MVPA is the ability to run searchlight feature selection within the toolbox. Furthermore, there are tools for running permutation tests (where the labels are shuffled). In doing so, you can create null distributions from these shuffled labels to test your accuracy outcome against; a generic sketch of this idea appears at the end of this section. There are a number of detailed tutorials available. We suggest you try out some of the tutorial scripts available as part of the MVPA toolbox. They can be found here: /mvpa/core/tutorial_easy

Further Explorations

MVPA and Weka may also be run together. You can perform all of your preprocessing in MVPA, and then run some of the additional classifiers available in Weka from MATLAB. If time permits, you might also try WEKA's feature selection tools under the "Select Attributes" menu. Try the linear forward search. Do the "informative features" found match the ones that you would expect from visual inspection of the data? To see what command you should use, try holding your cursor over the main classifier window; WEKA should give you the command needed. This comes in handy when running a number of optimizations.

** Note – We will not do this today; however, I wanted to let you know that scripts for converting functional connectivity matrices into Weka format are available from the previous Weka NITP tutorial. See the NITP website (2013).
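Finally, as promised, a generic illustration of the permutation-test idea in plain MATLAB (this is not the MVPA toolbox's own routine; classify_fn is a hypothetical placeholder for whatever train-and-test function you are using, returning cross-validated accuracy):

% Generic permutation test sketch: build a null distribution of
% accuracies by re-running classification with shuffled labels.
nPerm   = 1000;
trueAcc = classify_fn(X, y);               % accuracy with the real labels
nullAcc = zeros(nPerm, 1);
for p = 1:nPerm
    yShuf      = y(randperm(numel(y)));    % shuffle the class labels
    nullAcc(p) = classify_fn(X, yShuf);    % accuracy under the null
end
% p-value: fraction of shuffled runs that match or beat the true accuracy
pval = mean(nullAcc >= trueAcc);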
Useful References

Feature Selection & Parameter Tuning
Kerr WT, Douglas PK, Anderson A, Cohen MS. The utility of data-driven feature selection: re: Chu et al. 2012. Neuroimage. 2014 Jan 1;84:1107-10.
Kriegeskorte N, Goebel R, Bandettini P. Information-based functional brain mapping. Proc Natl Acad Sci U S A. 2006 Mar 7;103(10):3863-8. Epub 2006 Feb 28.
Müller K-R, et al. Machine Learning Techniques for Brain-Computer Interfaces. (2004) http://doc.ml.tu-berlin.de/bbci/publications/MueKraDorCurBla04.pdf
Lemm S, et al. Introduction to Machine Learning for Brain Imaging. Neuroimage. 2011 May 15;56(2):387-99. PMID: 21172442

Why You Should Avoid Interpreting Feature Weights Directly
Haufe S, et al. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage. 2014 Feb 15;87:96-110.

Neuroimaging & WEKA
If using these scripts in your own work, please cite:
Douglas PK, Harris S, Yuille A, Cohen MS. Performance Comparison of Machine Learning Algorithms and Number of Independent Components Used in fMRI Decoding of Belief vs. Disbelief. Neuroimage. 2011 May;56(2):544-53. PMID: 21073969