
ITB Term Paper

Date post:26-Aug-2014

Data mining techniques using WEKA

Submitted by: Shashidhar Shenoy N (10BM60083), MBA, 2nd Year, Vinod Gupta School of Management, IIT Kharagpur

As part of the course IT for Business Intelligence

Introduction to Weka

Weka stands for Waikato Environment for Knowledge Analysis. It is free, open-source software developed at the University of Waikato, New Zealand. It is a very popular suite for machine learning, containing a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. Although not as sophisticated as some other statistical packages, Weka owes its popularity to the fact that it is not only free but also open source, which means that new algorithms can be implemented by building on and modifying the existing ones.

Weka can be used to carry out a wide variety of operations on data. Some of the important ones are:

Classification of data
Regression analysis and prediction
Clustering of data
Associating data

A quick guide on how to carry out some of these operations is described in this document.

Quick note on the data used in the guide

Unless meaningfully interpreted, any data is meaningless. Most machine learning software will accept any data in the specified format without understanding why it is being used. Thus, the onus lies on the user to choose proper data and feed it to the software in order to derive meaningful insights. Rather than using the pre-built examples shipped with the Weka suite, an attempt is made here to use freely available data from the internet, and the best place to find such data sets is the UCI Machine Learning Repository. The About page on its website says:

"The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science."

For the demonstrations, two of these data sets have been used. Regression uses the Auto MPG data, while classification uses the Contraceptive Method Choice data. More details on the data and its attributes are given in the subsequent sections.

VGSoM, IIT Kharagpur


Regression using Weka

Simple regression involving two variables

Regression involves building a model to predict the dependent variable from one or more independent variables. A simple example would be to predict the body weight of a mammal given its brain weight. Here, body weight is the dependent variable and brain weight is the independent variable:

Figure 1: Brain weight v Body weight

The data is imported into Weka in its native ARFF (Attribute-Relation File Format) format. Weka can also import the ubiquitous .csv format. To load a file, click Explorer in the Weka GUI Chooser and then choose Open file... under the Preprocess tab.
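To make the ARFF layout concrete, here is a minimal Python sketch that builds ARFF text for a two-attribute numeric data set like the one described above. The relation name, the helper `to_arff`, and the data rows are illustrative assumptions (the numbers are placeholders, not the values used in this paper); only the `@relation`, `@attribute` and `@data` keywords come from the ARFF format itself.

```python
# Placeholder rows only: these numbers are NOT the paper's data set.
rows = [
    (1.0, 10.0),
    (2.0, 25.0),
    (3.5, 40.0),
]


def to_arff(relation, attributes, rows):
    """Build ARFF text: a header naming the relation and each numeric
    attribute, followed by comma-separated data rows."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        # Quote names that contain spaces so the header stays parseable.
        quoted = f"'{name}'" if " " in name else name
        lines.append(f"@attribute {quoted} numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# "mammals" is a hypothetical relation name chosen for this sketch.
arff_text = to_arff("mammals", ["brain weight", "body weight"], rows)
print(arff_text)
```

Saving `arff_text` to a file with the .arff extension should give something that can be opened from the Preprocess tab in the same way as any other ARFF file.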

Figure 2: Opening a file in Weka Suite


Once the file is loaded, a variety of pre-processing operations can be performed on the data. The data can also be edited using the Edit option. The left section of the Explorer window lists all of the columns in the data (Attributes) and the number of rows of data supplied (Instances). When a column is selected, the right section of the Explorer window shows information about the data in that column. There is also a visual way of examining the data, which we can see by clicking the Visualize All button.

The next step is to perform the regression analysis. For this, we go to the Classify tab and click on the Choose button. Since we are running a simple linear regression, we select classifiers > functions > SimpleLinearRegression. Once this is done, we need to supply the test options for building the regression model. The following options are available:

Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.

Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.

Cross-validation. The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field.

Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
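The last two test options can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a simple shuffle of instance indices; the function names and the `seed` parameter are made up for this sketch, and Weka's own implementation may differ in how it randomizes or stratifies the data.

```python
import random


def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle instance indices, keep train_percent of them for training
    and hold out the rest for testing."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = round(n_instances * train_percent / 100)
    return indices[:n_train], indices[n_train:]


def cross_validation_folds(n_instances, n_folds, seed=1):
    """Yield (train, test) index lists so that every instance is used
    for testing exactly once across the folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        held_out = indices[k::n_folds]  # every n_folds-th index, offset k
        held_out_set = set(held_out)
        training = [i for i in indices if i not in held_out_set]
        yield training, held_out


train_idx, test_idx = percentage_split(10, 80)
print(len(train_idx), len(test_idx))          # 8 2

folds = list(cross_validation_folds(10, 5))
print([len(held_out) for _, held_out in folds])  # [2, 2, 2, 2, 2]
```

With "Use training set" the same instances would serve as both lists, which is why that option tends to give an optimistic picture of the model.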

Choose one of these options, make sure that the dependent variable shown in the field below is body weight (kg), and click Start. This is the output we get:

Figure 3: Output of simple regression


It gives the model summary and the details of the regression. Thus, a simple linear regression model has been built using the Weka suite.
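For readers who want to see what such a model amounts to, the following Python sketch fits a line y = slope * x + intercept by ordinary least squares. It is a textbook closed-form fit, not Weka's code, and the toy points are invented so that the answer is easy to check by hand (they lie exactly on y = 2x + 1).

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy points on the line y = 2x + 1; not the mammal data from the paper.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

In the example above, xs would play the role of brain weight and ys the role of body weight.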

Multiple linear regression with many variables

In multiple regression, there is one dependent variable which depends on several independent variables. Many real-world situations call for multiple regression models, where one variable depends on many others. Here, we use a well-known example data set to demonstrate multiple regression in Weka.

Data used for multiple regression

This data set is taken from the UCI Machine Learning Repository and regresses automobile mileage against certain basic attributes of the model. The data can be downloaded from the repository and a corresponding ARFF file created. This sample data file is used to build a regression model that predicts the miles per gallon (MPG) of a car based on several of its attributes (the data covers 1970 to 1982). The model includes these possible attributes of the car: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car make. The data set has 398 rows of data.

Data Set Characteristics:    Multivariate
Attribute Characteristics:   Categorical, Real
Associated Tasks:            Regression
Number of Instances:         398
Number of Attributes:        8
Missing Values?              Yes

8 instances of the variable horsepower are removed because they have an unknown value.
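Removing instances with an unknown value can be sketched as a simple filter. The snippet below assumes the raw rows are comma-separated in the attribute order given above, with the class (MPG) last, and that an unknown value is written as `?`, the marker ARFF uses for missing values. The second row is made up for illustration, and `drop_unknown` is a hypothetical helper, not a Weka function.

```python
HORSEPOWER = 2  # position of horsepower in the attribute order listed above

raw_lines = [
    "8,390,190,3850,8.5,70,1,15",  # the example row used later in this paper
    "4,100,?,2000,15.0,71,1,25",   # made-up row with unknown horsepower
]


def drop_unknown(lines, column, marker="?"):
    """Keep only the rows whose value in the given column is known."""
    kept = []
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if fields[column] != marker:
            kept.append(fields)
    return kept


clean_rows = drop_unknown(raw_lines, HORSEPOWER)
print(len(clean_rows))    # 1
print(clean_rows[0][HORSEPOWER])  # 190
```

Dropping rows is only one way to deal with unknown values; it is shown here because it matches what the text describes.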

This data set is loaded into the Weka suite using the Open file option as explained earlier. This is how the window looks once the data is imported.

Figure 4: Imported data in Weka


The first seven attributes are all independent variables, while the eighth one, i.e., CLASS, is the dependent variable for which we try to build a predictive model. Before doing so, we can use as many visualizations of the data as necessary to see the relevant information in each attribute.

Figure 5: Visualize the data in Weka

The next step is to perform the regression. Go to the Classify tab, click the Choose button, and select classifiers > functions > LinearRegression. Once this is done, we need to supply the test options for building the regression model, in the same manner as for simple linear regression. We initially choose a Percentage split of 80%, so that 80% of the data is used for training and the rest is held out for testing, and see the output:

Figure 6: Run information shown by Weka


Figure 7: The regression model output by Weka

Figure 8: Regression model details

This model might appear complex to beginners, but it is not. For example, the first line of the regression model, -2.2744 * cylinders=6,3,5,4, means that if the car has six cylinders you would place a 1 in this column, and if it has eight cylinders you would place a 0. We could use a test set to see the deviation from the expected results and calculate the error. Example data:

data = 8,390,190,3850,8.5,70,1,15

class (aka MPG) =
   -2.2744 * 0 +
   -4.4421 * 0 +
    6.74   * 0 +
    0.012  * 390 +
   -0.0359 * 190 +
   -0.0056 * 3850 +
    1.6184 * 0 +
    1.8307 * 0 +
    1.8958 * 0 +
    1.7754 * 0 +
    1.167  * 0 +
    1.2522 * 0 +
    2.1363 * 0 +
   37.9165

Expected Value = 15 mpg
Regression Model Output = 14.2 mpg
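The arithmetic for this example car can be checked with a short Python sketch. The coefficients and values are copied from the worked example; the labels displacement, horsepower and weight in the comments are an assumption inferred from the order of the example row, and the `indicator` helper is a hypothetical name used only to show how a term such as cylinders=6,3,5,4 turns into a 0 or a 1.

```python
def indicator(value, members):
    """Return 1 if a nominal value belongs to the listed set, else 0."""
    return 1 if value in members else 0


cylinders = 8  # first field of the example row

# (coefficient, value) pairs in the order they appear in the model.
terms = [
    (-2.2744, indicator(cylinders, {6, 3, 5, 4})),  # 0 for eight cylinders
    (-4.4421, 0),
    (6.74, 0),
    (0.012, 390),     # assumed to be displacement
    (-0.0359, 190),   # assumed to be horsepower
    (-0.0056, 3850),  # assumed to be weight
    (1.6184, 0),
    (1.8307, 0),
    (1.8958, 0),
    (1.7754, 0),
    (1.167, 0),
    (1.2522, 0),
    (2.1363, 0),
]
INTERCEPT = 37.9165

predicted_mpg = INTERCEPT + sum(coef * value for coef, value in terms)
expected_mpg = 15

print(round(predicted_mpg, 1))                      # 14.2
print(round(abs(expected_mpg - predicted_mpg), 1))  # 0.8
```

Only the three non-zero terms and the constant contribute here, so the prediction is 37.9165 + 4.68 - 6.821 - 21.56, which is about 14.2 mpg, within roughly 0.8 mpg of the expected 15.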

So, we see that the regression model output is quite close to the expected value, and thus we have a basic predictive model. We could continue to refine this model to improve its accuracy. We can also use visualization to plot each independent variable against the dependent one and see how the variation occurs. A sample plot of horsepower versus miles per gallon is shown. The relationship is inverse: mileage falls as horsepower rises.

Figure 9: Visualizing the regression output

Classification using Weka

In classification, different attributes of a product are analysed to classify the product into one of a set of predefined classes. For example, a cricket player can be classified as a batsman, bowler, wicket keeper or all-rounder depending on attributes like "Can bat?", "Can bowl?", etc.

TrainSet: The trainset is the data which is used to train the software. Here, the classification has already been made based on a few attributes. The machine just observes the patterns and tries to create a rule which can be used to explain how the training set data is classified. If the model built by the machine in first instance is not reliable, intelligent algorithms might be used t
