ITB TERM PAPER

DATA MINING TECHNIQUES (LINEAR MODELLING AND CLASSIFICATION)

RAHUL MAHAJAN (10BM60066)


Table of Contents

INTRODUCTION
    ABOUT WEKA
    ABOUT R
LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE
    DATA
CASE 1
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CASE 2
    THE CODE
    THE RESULT
    INTERPRETATION OF THE RESULT
CLASSIFICATION
    THE DATASET
    CLASSIFICATION PROCEDURE
    INTERPRETING THE RESULTS


INTRODUCTION

In this term paper I have demonstrated two data mining techniques:

LINEAR MODELLING TECHNIQUE
o The linear modelling technique is demonstrated using R.

CLASSIFICATION
o The classification technique is demonstrated using WEKA.

ABOUT WEKA

Weka is a Java-based, open-source collection of data mining and machine learning algorithms, including:

o Data pre-processing
o Classification
o Clustering
o Association rule extraction

ABOUT R

R is an open source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and for data analysis.


LINEAR MODELLING TECHNIQUE USING R - PREDICTION OF FUTURE SHARE PRICE

Here I will try to use a GARCH-type model to predict future share prices. GARCH models give us the liberty to define a model using previous share prices and volatility over a defined period. There are many versions of GARCH models, designed to give better estimates in different scenarios.

Case 1 - Using the previous day's share price and the standard deviation.

In the example explained in this term paper, tomorrow's price is expressed as a function of yesterday's price and the standard deviation of the last 3 days' prices.

Case 2 - Using the previous day's share price and the previous day's gain.

It is generally known that share prices move with momentum: for a period of time share prices go up, then comes a period when prices go down. This model takes advantage of this behaviour of stock prices.

So, using statistical techniques, I will compare the models developed in case 1 and case 2. It is widely accepted that the model developed in case 2 fits better than the model developed in case 1.
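Reading off the code in the sections below (notation mine, with P(t) denoting the day-t price taken from the file), the two specifications actually fitted are:

Case 1:  P(t+2) = b0 + b1 * P(t) + b2 * sd(P(t), P(t+1), P(t+2)) + error
Case 2:  P(t+1) = b0 + b1 * P(t) + b2 * D(t) + error,  where D(t) = (P(t) - P(t+1)) * 100 / P(t)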

DATA

Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research, and I have used a few files from it. In both cases I have used the February 2008 share price data of Tata Motors. Except for the traded data, all of this data is available in the public domain.

The file contains the following items:

i) Symbol,
ii) Series,
iii) Date,
iv) Prev Close,
v) Open Price,
vi) High Price,
vii) Low Price,
viii) Last Price,
ix) Close Price,
x) Average Price,
xi) Total Traded Quantity,
xii) Turnover in Lacs

This text file is available at this link - http://bit.ly/TM_PVD


CASE 1

The program first reads the file and extracts the price data. It creates vectors for the prices of the previous 3 days, i.e. A, B and C. Then, using a for loop, it finds the standard deviation of the prices of the past 3 days. Finally, using linear modelling, it tries to fit a model to predict future prices.

Before running the code, one thing we need to keep in mind is to change R's working directory to the place where we have saved the text file. The functions required by this code come with a standard R installation, so there is no need to add any packages.
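A minimal sketch of that setup step, assuming a hypothetical folder C:/data:

setwd("C:/data")   # point R at the folder containing tatamotors.txt
getwd()            # confirm the working directory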

THE CODE

TFile <- "tatamotors.txt"
Trade <- read.table(TFile)    # read the NSE price file
A <- Trade[, 4]               # Prev Close price series

# Align three vectors: A = price on day t, B = day t+1, C = day t+2
B <- A[-1]
C <- B[-1]
l <- length(C)
A <- A[1:l]
B <- B[1:l]

# D[i] = standard deviation of the prices on days t, t+1 and t+2
D <- numeric(l)               # initialize before filling by index
for (i in 1:l) D[i] <- sd(c(A[i], B[i], C[i]), na.rm = FALSE)

# Regress the day t+2 price on the day t price and the 3-day standard deviation
summary(lm(C ~ A + D))

THE RESULT

The result of the above code is shown in figure 1 below.


Figure 1 The output of case 1

INTERPRETATION OF THE RESULT

The p-values and the F-statistic show that the model is not able to predict the prices well.

             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -1.891e+03  1.589e+03   -1.190   0.445
A             3.694e+00  2.257e+00    1.637   0.349
D            -9.471e-02  1.066e-01   -0.888   0.538
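As a side note, the same table can be pulled out programmatically; a minimal sketch, assuming the fit is stored first instead of being passed straight to summary():

fit1 <- lm(C ~ A + D)        # store the case 1 fit
coef(summary(fit1))          # estimates, standard errors, t values, p-values
summary(fit1)$fstatistic     # overall F statistic with its degrees of freedom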


CASE 2

The program first reads the file. Then it extracts the price data into vector A. Then, using vectors B and C, it finds the gain for each of the first n-1 days (where n is the total number of days available). This data is stored in vector D. Now, using the linear model function, one can find the statistical significance of the model.

We know that in this case the regressors will be highly correlated, so we set the correlation flag of summary() to TRUE; this makes the function also report the correlation matrix of the coefficient estimates, so the strong dependence between the regressors is visible in the output.

THE CODE

TFile <- "tatamotors.txt"
Trade <- read.table(TFile)    # read the NSE price file
A <- Trade[, 4]               # Prev Close price series
l <- length(A)
B <- A[-1]                    # price on day t+1
C <- A[-l]                    # price on day t
D <- (C - B) * 100 / C        # percentage fall from day t to day t+1 (negative of the gain)

# correlation = TRUE makes summary() also print the correlation of the coefficient estimates
summary(lm(B ~ C + D), correlation = TRUE)

THE RESULT

The result of the above code is shown in figure 2 below.


Figure 2 The output of case 2


INTERPRETATION OF THE RESULT

             Estimate   Std. Error  t value   Pr(>|t|)
(Intercept)  -2.335684  2.392506      -0.976  0.333
C             1.002842  0.003261     307.518  <2e-16 ***
D            -7.187997  0.042478    -169.215  <2e-16 ***

Here we see that the significance of the model is very high, and the adjusted R-squared is also high. However, such a high adjusted R-squared also points to autocorrelation in the data, which is very evident in this case. Still, the F-statistic shows that this model is able to predict share prices in a better way.

So we confirm our assumption that the previous-day-gain model (case 2) fits better than the standard-deviation model (case 1).
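As a rough numeric check of this comparison (a sketch, not part of the original analysis): assuming each case's variables were rebuilt as in its code section and the fits stored as fit1 and fit2 before calling summary(), the adjusted R-squared values can be compared directly. Note that both cases reuse the names A, B, C and D, so run them in separate sessions or rename the variables.

fit1 <- lm(C ~ A + D)          # case 1, with that case's variables in scope
fit2 <- lm(B ~ C + D)          # case 2, with that case's variables in scope
summary(fit1)$adj.r.squared    # adjusted R-squared, case 1
summary(fit2)$adj.r.squared    # adjusted R-squared, case 2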


CLASSIFICATION

Classification is demonstrated here through decision trees. A decision tree is basically an algorithm that creates a rule to determine the output of a new data instance.

It creates a tree where each node represents an attribute of our dataset. A decision is made at these nodes based on the input, and by moving from one node to the next you reach the end of the tree, which gives a predicted output.

This is illustrated using the following example.

THE DATASET

The dataset used in this example was found on the internet. The data can be downloaded from the link http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv.

Let's say there is a bank ABC. It has data on 600 people who have either opted for its product or not. It has the following information about these people: age, gender, income, marital status, region and mortgage. The bank can use this information to create a rule to predict whether a new potential customer would opt for its product, based on the known attributes of the customer.
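Since the rest of this paper's code is in R, here is a minimal sketch of loading the same CSV in R; the row and column counts come from the dataset description above, while the exact column names (used in the later sketches) are assumptions about the CSV header.

bank <- read.csv("bank-data.csv", stringsAsFactors = TRUE)
str(bank)   # expect 600 observations of 12 variables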

CLASSIFICATION PROCEDURE

Load the data in Weka. To load the data, click on Open file and specify the path. The window shown in figure 3 should appear after loading.

One will note that there are 12 attributes in the dataset, as seen in the attributes pane of the window. For this example we will use only the following attributes: age, sex, region, income, married, mortgage, savings and product.

Here we will try to predict the response of a new customer using the 7 attributes age, sex, region, income, married, mortgage and savings.

To remove the remaining attributes, tick the checkbox on the left side of each unwanted attribute and click Remove. After removing the attributes one should get the window shown in figure 4.

Now click on the Classify tab at the top. Under the classifier panel, click Choose > trees > J48, as shown in figure 5.

J48 is an algorithm used to generate a decision tree, developed by Ross Quinlan. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by J48 use the concept of information entropy.


Now we can create the model in WEKA. First ensure that 'Use training set' is selected, so that the data we have loaded is used for creating the model. Click Start. The output of this model will look as shown in figure 6.
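For readers without the Weka GUI, a hedged sketch of the same model driven from R: the RWeka package (which requires Java) exposes Weka's J48 directly, and the column names pep (product) and save_act (savings) are assumptions about the CSV header.

library(RWeka)   # R interface to Weka; provides J48
fit <- J48(pep ~ age + sex + region + income + married + mortgage + save_act,
           data = bank)
fit              # prints the pruned decision tree as text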

INTERPRETING THE RESULTS

The important results to focus on are:

1. "Correctly Classified Instances" (75.66 percent) and "Incorrectly Classified Instances" (24.33 percent), which tell us about the accuracy of the model. Our model is neither very good nor very bad; it is OK, and further modification is needed. (A hedged R sketch of this check follows the list.)

2. The confusion matrix, which shows the numbers of false positives and false negatives. In this case 117 instances of class a are incorrectly classified as b and 29 instances of class b are incorrectly classified as a.

3. The ROC area, which measures the discrimination ability of the forecast. Although there is some discrimination whenever the ROC area is > 0.5, in most situations the discrimination ability of the forecast is not considered useful in practice unless the ROC area is > 0.7. For our model the ROC area is greater than 0.7 (0.787).

4. The decision tree, which is the main output. It is the rule that will help predict the outcome of new data instances. To view the decision tree, right-click on the model and select Visualize tree. You will get the window shown in figure 7.
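Assuming the RWeka fit from the earlier sketch is stored as fit and the data frame as bank, the accuracy and confusion-matrix checks above can be reproduced on the training set as follows (a sketch, not Weka's exact output):

pred <- predict(fit, bank)                  # predicted class for each customer
table(actual = bank$pep, predicted = pred)  # confusion matrix
mean(pred == bank$pep)                      # fraction of correctly classified instances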


Figure 3 Window after loading dataset

Figure 4 Window after removing unwanted attributes


Figure 5 Choosing the J48 tree

Figure 6 Output of the classification process


Figure 7 The decision tree

