Data Mining Techniques using WEKA

IT for Business Intelligence

Term paper on Weka

Submitted by:

Saurabh Singh 10BM60082

Introduction

Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other programming languages, plus data preprocessing utilities in C and a Makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research. Advantages of Weka include:

free availability under the GNU General Public License

portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform

a comprehensive collection of data preprocessing and modeling techniques

ease of use due to its graphical user interfaces

Weka primarily consists of the following four screens, each launched from the GUI Chooser: Explorer, Experimenter, KnowledgeFlow and Simple CLI.


K-means clustering in WEKA

Suppose a company wants to segment its market based on the attributes collected by its research team. K-means clustering in Weka is an effective way to do this. The attributes used are as follows:

ID

AGE

SEX

RELIGION

INCOME

MARRIED

CHILDREN

CAR

SAVING A/C

CURRENT A/C

LOAN

PENSION PLAN

Weka accepts several input file formats, such as .csv and .arff. We use a .csv file as the input in our example. The data file consists of 1600 instances and the 12 attributes listed above.
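To make the difference between the two formats concrete, the sketch below converts a small CSV table into ARFF text. The sample rows are made up for illustration and are not taken from the bank data file. The rule used to guess attribute types (numeric if every value parses as a number, otherwise nominal) is an assumption of this sketch, not a description of how Weka's own CSV loader works.

```python
import csv
import io


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds column names."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        column = [row[index] for row in data]
        if all(is_number(value) for value in column):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct values in sorted order.
            labels = ",".join(sorted(set(column)))
            lines.append("@attribute " + name + " {" + labels + "}")
    lines.append("")
    lines.append("@data")
    for row in data:
        lines.append(",".join(row))
    return "\n".join(lines)


# Hypothetical sample rows, not taken from the paper's data file.
SAMPLE_CSV = """AGE,SEX,INCOME,MARRIED
48,FEMALE,17546.0,NO
40,MALE,30085.1,YES
51,FEMALE,16575.4,YES
"""

arff_text = csv_to_arff(SAMPLE_CSV, "bank")
print(arff_text)
```

Attribute names that contain spaces, such as SAVING A/C, would need to be quoted in a real ARFF file; the sketch skips that detail.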

Steps in K-means analysis:

Step 1:

Weka startup screen

Step 2:

Choose the Explorer option from the menu. It is sufficient for all the operations we need to perform on the data.

Step 3:

Load the .csv file of bank accounts data.

Step 4:

Since we intend to find clusters in the data, click the Cluster tab and choose SimpleKMeans from the choices that appear. The following screen appears.

Step 5:

Click the text box next to the Choose button; the following options menu appears.

Step 6:

Enter the value 4 in the ‘numClusters’ box.

Step 7:

Click Start to begin the clustering process. The following screen appears.

Step 8:

The result can be viewed in a separate window. The following screen appears.

From the results above, we can interpret the clusters as follows.

Cluster 0:

Centers on the male population.

Mainly lives in a town area.

Is mostly unmarried.

Does not own a car and has no previous loan.

Owns a savings a/c and a current a/c.

Does not yet have a pension plan.

Hence we can conclude that cluster 0 is the cluster most likely to buy a pension plan. The other clusters can be interpreted in the same way, according to requirements.

Step 9:

We can use the Visualize All option to see the distribution of all the variables in the population.
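The clustering that Steps 4 to 7 run through the Explorer can be sketched in a few lines of code. The version below is a minimal k-means loop for numeric data only, written for illustration; it is not Weka's SimpleKMeans implementation, which also handles nominal attributes such as SEX or MARRIED. The six (age, income in thousands) points and the choice of k = 2 are made up, and seeding the centroids with the first k points is an assumption made to keep the run repeatable.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: the first k points seed the centroids, to keep the run
    # deterministic. Real tools usually pick the seeds at random.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        for index in range(k):
            members = [p for p, a in zip(points, assignments) if a == index]
            if members:  # keep the old centroid if a cluster is empty
                centroids[index] = mean_point(members)
    return centroids, assignments


# Hypothetical customers: (age in years, income in thousands).
customers = [(25, 20.0), (52, 61.0), (27, 22.0),
             (23, 18.0), (50, 60.0), (48, 58.0)]

centroids, assignments = k_means(customers, k=2)
for index, center in enumerate(centroids):
    print("Cluster", index, "center:", center)
print("Assignments:", assignments)
```

Each centroid plays the same role as a row of cluster centers in the Weka output: reading its values attribute by attribute gives the profile of the cluster, as done for Cluster 0 above.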

Linear Regression using WEKA

Regression

A regression model can answer questions such as how much should be charged for a given model of car with a certain set of features. It uses past data on car sales, car prices, the features provided and other attributes to estimate the price of future models.
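As a small illustration of the idea, the sketch below fits a straight line, price = intercept + slope × feature, by ordinary least squares and uses it to estimate the price of a new model. It uses a single predictor and made-up numbers (engine displacement in litres, price in thousands); a regression over several features works on the same principle but needs matrix algebra that is left out here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Estimate y for a new x using the fitted line."""
    return intercept + slope * x


# Hypothetical past data, not taken from the paper's car data file.
displacement = [1.0, 2.0, 3.0, 4.0]   # litres
price = [10.0, 12.0, 17.0, 21.0]      # thousands

intercept, slope = fit_line(displacement, price)
estimate = predict(intercept, slope, 2.5)
print("price = %.2f + %.2f * displacement" % (intercept, slope))
print("estimated price for 2.5 litres: %.2f" % estimate)
```

A positive slope means that the price rises with the feature, which is the kind of relationship the regression output is read for.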

Regression in WEKA

Suppose a company wants to regress the price of a car on the various features associated with it. It can run the regression in WEKA by choosing the appropriate independent variables and then obtaining a regression equation that relates the independent variables to the dependent variable. The following example illustrates this procedure:

Step 1:

Weka startup screen

Step 2:

Choose the Explorer option from the menu. It is sufficient for all the operations we need to perform on the data.

Step 3:

Load the .csv file of car specification data.

Step 4:

Click the Classify tab, click the Choose button, and select LinearRegression from the functions group. The following screen appears.

Step 5:

Click the Start button. The following output is generated.

Interpretation of the output: the output above shows that the selling price is positively correlated with engine displacement and with none of the other factors.

Step 6:

Right-click the entry in the result list and select Visualize classifier errors to obtain the following screen.

Step 7:

If we click on any point in the plot, Weka displays a summary of that data point.
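The errors examined in Steps 6 and 7 are the gaps between the actual and the predicted values. The sketch below computes them for a handful of made-up prices, together with two summary figures commonly reported for regression models, the mean absolute error and the root mean squared error. The numbers are hypothetical, and taking the error as predicted minus actual is a sign convention assumed for this sketch.

```python
import math

# Hypothetical actual and predicted prices (thousands), for illustration.
actual = [10.0, 12.0, 17.0, 21.0]
predicted = [9.0, 13.0, 17.0, 23.0]

# Error of each instance: predicted value minus actual value.
errors = [p - a for p, a in zip(predicted, actual)]

# Mean absolute error: average size of the errors, ignoring their sign.
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalises large errors more heavily.
root_mean_squared_error = math.sqrt(sum(e * e for e in errors) / len(errors))

# The instance with the largest absolute error is the first one to inspect.
worst_index = max(range(len(errors)), key=lambda i: abs(errors[i]))

print("errors:", errors)
print("MAE: %.3f" % mean_absolute_error)
print("RMSE: %.3f" % root_mean_squared_error)
print("largest error at instance", worst_index)
```

Instances with large errors are the ones worth clicking on in the plot, since they show where the regression equation fits the data poorly.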

References:

http://en.wikipedia.org/wiki/Weka_(machine_learning)

http://www.cs.waikato.ac.nz/ml/weka/


Date post:	25-Dec-2014