ISEN 613_Team3_Final Project Report

ISEN 613- Engineering Data Analysis

Naman Kapoor Vinayak Nair Rahul Garg Omkar Deshpande Adriana De La Cruz

Multi-Attribute Classification of Steel Plate Defects

Team 3


Executive Summary

Anomaly detection is vital in industry and can be the difference between success and bankruptcy. Manufacturing processes need to be continuously monitored so that any change in the process can be quickly identified and controlled, preventing production loss. This project deals with the prediction of faults that can occur in the manufacturing of steel plates, taking into consideration the available historical data. The main objective is to compare the performance of different classification models and propose one final model with the least misclassification rate (highest prediction accuracy).

Many studies have been done by researchers on the comparative performance of multiclass classification techniques. This project adds a new dimension by comparing the error rates of multiclass techniques with those of individual classification techniques built for each class. Although modelling for individual defects gives a very high accuracy rate, the combined hierarchical model will not be as efficient in practice, because its overall accuracy is the product of the individual accuracies of the models it chains together.
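As a rough illustration of this compounding effect (an assumed example, not a figure from the project): if each of seven chained binary classifiers were 95% accurate, the hierarchical model's best-case combined accuracy would be about 0.95^7 ≈ 0.70, i.e. roughly 70%.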

In this project, performances of techniques such as Linear Discriminant Analysis, Logistic Regression (individual and multivariate), Random Forests (individual and multivariate), Single Decision Trees, Bagging, Support Vector Machines, and Artificial Neural Networks have been compared and analyzed. Principal Component Analysis was also used to reduce the dimensions of the given data.

The challenge was to decide whether to build different models for each of the seven defects or one model for all defects combined. The dataset also had many attributes, so it was difficult to select the most significant predictors and avoid over-fitting. As there were more than two response classes, the code used for binary classification was not applicable; multiclass classification had to be handled and new coding techniques explored. Methods such as Artificial Neural Networks and C5.0 were used. These methods were completely new to us, and effort was required in terms of literature review and coding to implement them.

The following table gives the misclassification error rate and area under the ROC curve for each modeling technique used.

Modeling Technique        Misclassification Error Rate (%)   Area Under Curve
LDA                       32.4                               0.790
Decision Tree             36.0                               0.784
Bagging                   20.8                               0.824
Random Forest             22.8                               0.797
SVM                       27.6                               0.804
Neural Network Analysis   53.9                               0.605
C5.0                      19.4                               0.831

From the above table it is clear that the C5.0 modeling technique gives the least misclassification error (19.4%) and the highest area under the ROC curve.

The main objective of this project was to compare the working of different classification models built using different modeling techniques and propose one final model with the least misclassification rate (highest prediction accuracy). The results show the successful completion of this objective. The final model proposed for predicting the faults is C5.0, which achieves 80.6% prediction accuracy.


INTRODUCTION

Importance of the problem:

The present era is the era of quality: in today's world of cut-throat competition and large-scale production, only those manufacturers survive who can provide good-quality products and services that meet or exceed customer expectations. Manufacturing processes must be continuously monitored so that any change in the process can be identified quickly and rectified to prevent production loss. In manufacturing, operations managers can use advanced analytics to take a deep dive into historical process data, identify patterns and relationships among discrete process steps and inputs, and then optimize the factors that prove to have the greatest effect on yield. Many global manufacturers in a range of industries and geographies now have an abundance of real-time shop-floor data and the capability to conduct such sophisticated statistical assessments. They are taking previously isolated data sets, aggregating them, and analyzing them to reveal important insights. In the steel industry, specifically alloy steel, producing defective products imposes a high cost on the manufacturer. One common fault in producing low-carbon steel grades is the pits-and-blister defect; removing it requires grinding the surface of the steel product, which wastes time and increases production cost. The incidence of defects is related to numerous factors, including material composition and the production processes. If we can correctly predict these defects from the important parameters, we then know which parameters to control, and how tightly, to minimize the defects. The problem at hand in this project therefore deals with data from the steel industry, and the results obtained can be used to predict faults and implement the necessary changes.

Objective: This project deals with the prediction of faults that can occur in the manufacturing of steel plates by taking into consideration the available historical data. The main objective is to compare the working of different classification models built using different classification techniques and propose one final model with the least misclassification rate (highest prediction accuracy). Various data mining techniques can be used to predict steel plate faults from the given data. In this project, the results of classification techniques such as Linear Discriminant Analysis, Logistic Regression (individual and multivariate), Random Forests (individual and multivariate), Single Decision Trees, Bagging, Support Vector Machines, and Artificial Neural Networks are compared and the best model is proposed. The model building also uses Principal Component Analysis to reduce the dimensions of the given data, as sketched below.
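As a minimal sketch of the PCA step referred to above (the data frame name steel is a placeholder for the faults dataset, not a name from the report; the 27 numeric attributes are assumed to occupy the first 27 columns):

# Hypothetical sketch: PCA on the 27 numeric attributes.
# 'steel' is a placeholder name for the faults data frame.
pr <- prcomp(steel[, 1:27], center = TRUE, scale. = TRUE)
summary(pr)           # proportion of variance explained by each component
plot(pr, type = "l")  # scree plot, used to choose how many components to keep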

Scope of Work:

[GANTT chart: project schedule, November-December 2015. The original bars are not reproduced; activities were scheduled across the windows Nov 13th-15th, 16th-19th, 20th-23rd, 24th-27th, 28th-31st and Dec 1st-4th, 5th-8th, 9th-12th, 12th-15th.]

S No   Activity
1      Retrieving data and understanding its details
2      Literature review & selecting a suitable supervised neural network method
3      Model building using classification techniques learned in class
4      Model building using the selected neural network method
5      Predicting results and concluding the best modeling method
6      Report making & documentation


LITERATURE REVIEWS

The following papers were selected:

1. Steel Plates Faults Diagnosis with Data Mining Models. Fakhr,M., Elsayad, A. M. (2012). (Reviewed by Naman Kapoor)

2. Machine Learning Techniques for Anomaly Detection: An Overview. Omar,S. Ngadi,A. and Jebur, H. H. (2013). (Reviewed by Naman Kapoor)

3. Neuralnet: Training of neural networks. Frauke Günther and Stefan Fritsch (2008). (Reviewed by Omkar Deshpande)

4. An Empirical Comparison of Supervised Learning Algorithms. Caruana, R., & Niculescu-Mizil, A. (2006). (Reviewed by Omkar Deshpande)

5. A SVM-based pipeline leakage detection and pre-warning system. Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin. (2010). (Reviewed by Rahul Garg)

6. Steel faults diagnosis under predictive analysis. Sanjay Jain, Chandreshekhar Azad, Vijay Kumar Jha (2013). (Reviewed by Rahul Garg)

7. Classification of EEG signals using neural network and logistic regression. A. Subasi and E. Erçelebi (2005). (Reviewed by Rahul Garg)

8. A study of decision tree ensembles and feature selection for steel plates faults detection. Halawani, M. (2014). (Reviewed by Vinayak Nair)

9. Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons. Tsoi, A.C., Pearson, R.A. (1991). (Reviewed by Vinayak Nair)

10. Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Pohar, M., Blas, M., & Turk, S. (2004). (Reviewed by Vinayak Nair)

Combined Takeaways

Advanced decision trees are extremely efficient modeling techniques for multiclass classification problems.

Artificial Neural Networks are powerful but complex algorithms; they have convergence and variable-selection issues that need to be addressed.

Supervised machine learning techniques significantly outperform unsupervised ones on multiclass classification problems.

LDA is advisable, in comparison to logistic regression, when the variables are normally distributed.


Reviewed by: Naman Kapoor

Steel Faults Diagnosis with Data Mining Models

Mahmoud Fakhr and Alaa M. Elsayad, "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science, vol. 8, no. 4, pp. 506-514, 2012.

Objective:

The key problem this paper addresses is the formation of an appropriate intelligent data mining model for anomaly detection in the manufacturing industry on a particular dataset. Addressing this problem is important due to the need to create intelligent fault diagnostic models with the help of data mining, to enhance the quality of manufacturing and to lessen the cost of product testing. It not only helps to avoid product quality problems but also facilitates precautionary maintenance. The key objective of the paper is to use predictive analytics to select the best classification model for the selected steel plate faults detection dataset by comparing different models using certain statistical measures. The authors address this problem by evaluating the performances of three popular and effective data mining models (using supervised learning techniques) on the selected dataset and presenting their views and outcomes. They found that the C5.0 decision tree with boosting achieved the best results on the dataset, which implies that decision trees have a greater impact on fault diagnosis than fellow supervised learning techniques.

Approach:

The authors approached the problem by applying three multiclass classification techniques, namely C5.0 decision tree (C5.0 DT) with boosting, Multilayer Perceptron Neural Network (MLPNN) with pruning, and Logistic Regression (LR) with forward stepwise selection, to the steel plates fault dataset obtained from the University of California at Irvine (UCI) machine learning repository. These models were formulated to diagnose seven commonly occurring faults of steel plates, namely: Pastry, Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps and other faults. A brief description of the techniques used is presented below:

I. C5.0 decision tree

The C5.0 DT algorithm is an improved version of the C4.5 and ID3 algorithms. C5.0 uses information gain, which is based on the notion of entropy, as its measure of purity. This method proved to be a major takeaway for our project. The three methods used in C5.0 tree construction are boosting, pruning and winnowing. While boosting and pruning were known to us, we were introduced to the concept of winnowing, which preselects a subset of the attributes used to construct the tree, ensuring that irrelevant attributes are excluded from the tree-building process (a minimal sketch follows). The authors used only 13 of the 27 attributes in the dataset to build the C5.0 tree.

II. Multilayer Perceptron Neural Network (MLPNN)

Artificial Neural Networks (ANNs) are biologically motivated and highly sophisticated analytical techniques capable of modelling extremely complex nonlinear functions. The MLPNN is considered a powerful function approximator for prediction and classification problems; its structure is organized into layers of neurons: an input layer, an output layer and hidden layers. The MLPNN was trained using the Back Propagation (BP) training technique. In this study the network was trained using the pruning approach, which starts with a large network and removes (prunes) the weakest neurons in the hidden and input layers as training proceeds.


III. Logistic Regression

Logistic regression is a nonlinear regression technique for predicting a dichotomous (binary) class attribute in terms of the predictive ones. The algorithm does not predict the class attribute directly but predicts the odds of its occurrence using the log-odds (logit) function.
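In standard notation (the general logistic model, stated here for reference rather than reproduced from the paper):

log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk,  so that  p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))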

Results

The performance of each model was evaluated using three statistical measures: classification accuracy (the complement of the misclassification error rate), sensitivity and specificity. These measures are defined using the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN):
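Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)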

The charts show that the C5.0 learning algorithm is the best model on both the training and test subsets, the neural network model is second best, and logistic regression is the worst.

Summary

The major takeaways from this study were as follows:

Advanced Decision Trees (C5.0 DT) are a very powerful data mining tool for predictive analytics of multiclass anomaly detection, achieving very high accuracy.

Although the Multilayer Perceptron Neural Network with back propagation is standard and simple to implement, it still suffers from convergence issues and requires initialization and adjustment of many individual parameters to optimize its performance.

Logistic Regression, while a very powerful modeling tool, assumes that the class attribute (the log odds, not the event itself) is linear in the coefficients of the predictive attributes. The right inputs must be chosen along with their functional relationship to the class attribute.

Amount, quality and the measuring process of data are key components of diagnostic accuracy.


Reviewed by: Naman Kapoor

Machine Learning Techniques for Anomaly Detection: An Overview -S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.

Objective:

The key problem this paper addresses is anomaly detection in industry, which the authors aim to aid with machine learning techniques. Addressing this problem is important because, even after many years of research, the anomaly detection community is still confronting difficult problems, and the authors aim to advance research on the issue. The key objective of the paper is to present an overview of research directions for applying supervised and unsupervised methods to the problem of anomaly detection. The authors address this by providing a general architecture of anomaly intrusion detection systems and conducting detailed discussions of the various supervised and unsupervised machine learning techniques, along with their strengths and weaknesses in handling anomaly detection.

Approach:

The authors approached the problem by comparing different supervised and unsupervised machine learning techniques and bringing out their strengths and weaknesses for anomaly detection. An overview of the two approaches is given below:

I. Supervised Anomaly Detection

Supervised methods (also known as classification methods) require a labelled training set containing both normal and anomalous samples to construct the predictive model. Theoretically, supervised methods provide a better detection rate than semi-supervised and unsupervised methods, since they have access to more information. However, some technical issues make these methods less accurate than they are supposed to be.

II. Unsupervised Anomaly Detection

These techniques do not need training data. Instead, they rest on two basic assumptions. First, they presume that most of the network connections are normal traffic and only a very small percentage of traffic is abnormal. Second, they anticipate that malicious traffic is statistically different from normal traffic. Under these assumptions, groups of similar instances that appear frequently are assumed to be normal traffic, while infrequent instances that differ considerably from the majority are regarded as malicious.

The different techniques compared are shown in the table below:

Machine Learning Techniques for Anomaly Detection:

Supervised Machine Learning      Unsupervised Machine Learning
K-Nearest Neighbours             Self-Organising Maps
Neural Networks                  K-means Clustering
Decision Trees                   Fuzzy C-means Clustering
Support Vector Machines          Expectation-Maximization Meta


Results

The results are shown in the table below:

Summary

The major takeaways from this review were:

Machine learning techniques have received considerable attention among anomaly detection researchers.

Anomaly detection comprises both supervised and unsupervised techniques.

The experiments demonstrated that supervised learning methods significantly outperform unsupervised ones if the test data contains no unknown attacks.

Among the supervised methods, the best performance is achieved by non-linear methods, such as SVM, the multi-layer perceptron and rule-based methods.

Unsupervised techniques such as K-Means, SOM, and one-class SVM achieved better performance than the other techniques, although they differ in their ability to detect all attack classes efficiently.


An Empirical Comparison of Supervised Learning Algorithms

Reviewed by: Omkar Deshpande

Reference: Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning - ICML '06. http://dx.doi.org/10.1145/1143844.1143865

Objective:

The objective of this paper is to give an empirical comparison of supervised learning algorithms such as SVMs, neural nets, logistic regression, naive Bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. The main motivation is to publish a comparison of these newly developed algorithms, since the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90's. The comparison is based on a variety of performance criteria such as Precision/Recall, ROC, Lift, Accuracy, F-score, squared error, etc. The empirical comparison found that boosted trees were the best learning algorithm overall, with random forests a close second, followed by un-calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets. The models that performed poorest were naive Bayes, logistic regression, decision trees, and boosted stumps. This implies that a model trained using boosted trees will generally predict better than models built with methods such as random forests or SVMs.

Approach:

The datasets used by the authors are ADULT, COV TYPE and LETTER from the UCI Repository (Blake & Merz, 1998). COV TYPE was converted to a binary problem by treating the largest class as positive and the rest as negative. A random sample of 5000 cases was taken as the training set and the rest as the test set; of those 5000 cases, 4000 were used for training and 1000 for calibrating the model. Various metrics such as ROC area, accuracy and lift were then calculated for the different algorithms, and a column was obtained giving the mean normalized score over the eight metrics when model selection is done by "cheating" and looking at the final test sets. The means in this column represent the best performance that could be achieved with each learning method if model selection were done optimally.

Results:

The comparison shows that the models which perform best overall can perform worse than average-performing models on particular problems. For example, the best models on ADULT are calibrated boosted stumps, random forests and bagged trees; boosted trees perform much worse there. Bagged trees and random forests also perform very well on MG and SLAC. On MEDIS, the best models are random forests, neural nets and logistic regression. The only models that never exhibit excellent performance on any problem are naive Bayes and memory-based learning. Overall, boosted trees were the best learning algorithm, with random forests a close second, followed by un-calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets; naive Bayes, logistic regression, decision trees, and boosted stumps performed poorest. The table below gives the results used for the comparison of the techniques.


Summary:

Boosted trees were the best learning algorithm overall, with random forests a close second, followed by un-calibrated bagged trees, calibrated SVMs, and un-calibrated neural nets; naive Bayes, logistic regression, decision trees, and boosted stumps performed poorest. But this is not always the case: the evaluation metric must be chosen carefully, and then the technique that works best for that metric selected. For example, Precision/Recall measures are used in information retrieval, medicine prefers ROC area, and Lift is appropriate for some marketing tasks. So for the medical domain, the model that performs best on ROC area would be the best choice.


Reviewed By: Omkar Deshpande

Training of Neural Networks

Reference: Frauke Günther and Stefan Fritsch, "neuralnet: Training of Neural Networks", The R Journal, Vol. 2/1, June 2010.

Objective:

The objective of this paper is to discuss the algorithm used in the neuralnet package, demonstrate its application in R, and discuss the advantages of the neuralnet package over generalized linear models. The main reason for the paper is to document the neuralnet package developed by the authors and give a working example using the infert dataset in R. Artificial neural networks can be applied to approximate any complex functional relationship between input and output variables. Unlike generalized linear models, it is not necessary to pre-specify the type of relationship between covariates and response variables (for instance, a linear combination). This makes artificial neural networks a valuable statistical tool; they are direct extensions of GLMs and can be applied in a similar manner.

Approach:

In this paper the authors first discuss the algorithm used in building the neuralnet package, then the training of a neuralnet model in R, using the infert dataset. The number of hidden neurons is determined in relation to the needed complexity; a neural network with, for example, two hidden neurons is trained (see the sketch below). The results of backprop via nnet and via neuralnet are then compared. The paper also discusses additional features, such as the compute and confidence.interval functions, that come with the neuralnet package.
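A minimal sketch of the training call described here, following the paper's infert example with two hidden neurons (the exact arguments in the published example may differ slightly):

library(neuralnet)   # infert ships with base R's datasets package

# Train a small network with two hidden neurons on the infert data.
set.seed(1)
nn <- neuralnet(case ~ age + parity + induced + spontaneous,
                data = infert, hidden = 2,
                err.fct = "ce", linear.output = FALSE)

# compute() propagates new covariate values through the fitted network.
out <- compute(nn, infert[, c("age", "parity", "induced", "spontaneous")])
head(out$net.result)  # fitted probabilities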

Results:

Being an informative paper, it discusses the various functions available in the neuralnet package and how to use them in R. The few comparative results provided include a comparison with the nnet package: neural networks are trained with the same parameter settings using neuralnet with algorithm = "backprop" and using nnet. nn.bp and nn.nnet show equal results; both training processes last only a few iteration steps, and the error is approximately 158. In this small comparison, the model fit is thus less satisfying than that achieved by resilient backpropagation.


Summary:

This paper introduced the multilayer perceptron and supervised learning, and covered the use of the R package neuralnet for modeling functional relationships between covariates and response variables. neuralnet contains a very flexible function that trains multilayer perceptrons on a given data set in the context of regression analyses. It is a very flexible package, since most parameters can be easily adapted; for example, the activation function and the error function can be chosen arbitrarily and defined through the usual definition of functions in R.


Reviewed By: Rahul Garg

A SVM-BASED PIPELINE LEAKAGE DETECTION AND PRE-WARNING SYSTEM

Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system", Measurement, vol. 43, no. 4, pp. 513-519, 2010.

Objective:

This paper addresses the detection of pipeline leakages, which may occur for reasons such as manual digging and illegal construction. It indicates the effectiveness of SVM over traditional machine learning techniques, which rest on the assumption of effectively unlimited training data. Gas leakages are a concern for industry: they lead not only to large monetary losses but can also have tragic outcomes such as outbreaks of disease and even deaths, so the timely detection of suspected leakages benefits both industry and the general public. The objective of the paper is a new pipeline leakage detection and pre-warning system that monitors and locates possible abnormal events (e.g. manual digging above a pipeline, illegal construction, etc., which might cause a pipeline leakage) along the pipeline before a leakage takes place. The authors employ SVM as the classifier to recognize these abnormal events. Three cases (gas leakage, manual digging and human walking above the pipeline) were created, and a series of experimental trials was used to train the model. The model was then used to classify abnormal events, and it provided quite accurate results. The authors found that SVM can be a much better and more accurate technique for predicting gas leakages along pipelines than the empirical risk minimization method. This implies that although SVM is a comparatively new technique, it is quite accurate for predictive analytics in multiclass classification problems.

Approach:

The authors followed a multiclass predictive-analytics approach. Since no historical data were available, they collected training data by conducting trials of two types: abnormal-event identification trials and abnormal-event location trials. Three cases, namely gas leakage, manual digging and human walking above the pipeline, were created, with eight predictor terms. Twenty samples were collected at random from each case for training, and ten samples from each case were used to test the trained SVM model. The misclassification rate on the test data indicates how accurately the model performs and whether it can be deployed in practice. For training, the "one-against-one" method is employed (sketched below). The multi-class SVM classifier the authors obtained is shown in the figure below; the two axes are the first two of the eight predictors, and the circled data points are the support vectors.
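A minimal R sketch of a multiclass SVM in this style (using the e1071 package, whose svm() handles multiclass problems via one-against-one voting by default; this is an illustration on a toy dataset, not the authors' code):

library(e1071)

# Toy multiclass SVM: e1071::svm() fits one-against-one binary machines
# and combines them by voting, the scheme the paper recommends.
set.seed(1)
idx  <- sample(nrow(iris), 100)          # training rows
fit  <- svm(Species ~ ., data = iris[idx, ])
pred <- predict(fit, iris[-idx, ])
mean(pred != iris$Species[-idx])         # test misclassification rate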

Results:


The detection results from the SVM recognized the correct cause of leakage more than 95% of the time and located abnormal events quite accurately. The figure below shows the prediction results, where 1, 2 and 3 are the three categories of abnormality; only sample 12 was recognized incorrectly.

Summary:

This paper presented pipeline leakage detection as a multiclass predictive-classification problem. The major takeaway from this review is that SVM can work quite accurately for multiclass classification, especially when the training data are not very large, as in this leakage-detection case. The technique is far better than traditional machine learning methods such as ERM, and among the methods available for multi-class classification, the "one-against-one" SVM method is the most suitable for practical use.


Reviewed By: Rahul Garg

STEEL FAULTS DIAGNOSIS USING PREDICTIVE ANALYSIS

Sanjay Jain, Chandreshekhar Azad, Vijay Kumar Jha, "Steel faults diagnosis under predictive analysis", International Journal of Computer Engineering and Applications, Volume IV, Issue II/III, Oct. 2013.

Objective:

The key problem this paper discusses is the generation of various types of defects in manufactured steel plates, especially those made of alloyed steel. Addressing this problem is imperative because rectifying these defects by grinding or milling wastes time and raises production cost, both of which could be prevented. The paper performs steel fault diagnosis using predictive analytics so that the defect generation rate can be minimized by finely tuning the factors responsible for it. The authors use classification techniques, namely Decision Trees, Multilayer Perceptron Neural Networks and Logistic Regression, to develop a model that diagnoses the faults as accurately as possible. After developing the different models, they found that decision trees provide the best results, with the lowest misclassification rate, implying that a Decision Tree model could be a good option for steel fault diagnosis using data mining techniques.

Approach:

The dataset used in this review has been taken from the UCI repository and classifies steel plate faults into seven different types, making this a multiclass classification problem. The authors tried various classification methods and then selected the best one based on misclassification rate. The methods used are decision trees, multilayer perceptron neural networks and logistic regression; the C4.5 boosting algorithm with 10 trials was used for the decision trees. After all the models were built and one of the three selected, a genetic algorithm was used to find the best optimal solution. It works as follows (a generic code sketch is given after the steps):

1. Initialize a random population of n chromosomes.

2. Evaluate the fitness value f(x) of each chromosome x in the population.

3. Create a new population by repeating the following steps:

   Select two parent chromosomes from the population according to fitness (chromosomes with better fitness values have a bigger chance of being chosen).

   Cross over the parents to form a new offspring. If there is no crossover, the offspring is an exact copy of the parents.

   Mutate the new offspring at each locus.

   Place the new offspring in the new population.

4. Use the newly generated population for a further execution.

5. If the end condition is fulfilled, stop and return the best solution in the current population.

6. Otherwise, go to step 2.
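A compact, generic R skeleton of the loop above (purely illustrative: the fitness function, real-valued encoding and parameter values are assumptions, not the paper's implementation):

# Generic genetic-algorithm skeleton mirroring steps 1-6 above (illustrative).
set.seed(1)
fitness <- function(x) -sum((x - 0.5)^2)         # placeholder fitness f(x)

n_pop <- 20; n_genes <- 8; n_gen <- 50; p_mut <- 0.05
pop <- matrix(runif(n_pop * n_genes), n_pop)     # 1. random initial population

for (gen in 1:n_gen) {
  fit <- apply(pop, 1, fitness)                  # 2. evaluate fitness
  w   <- fit - min(fit) + 1e-9                   #    selection weights
  new_pop <- pop
  for (i in 1:n_pop) {                           # 3. build the new population
    parents <- pop[sample(n_pop, 2, prob = w), ] #    fitness-biased selection
    cut <- sample(n_genes - 1, 1)                #    single-point crossover
    child <- c(parents[1, 1:cut], parents[2, (cut + 1):n_genes])
    m <- runif(n_genes) < p_mut                  #    mutate at each locus
    child[m] <- runif(sum(m))
    new_pop[i, ] <- child
  }
  pop <- new_pop                                 # 4. next generation
}                                                # 5./6. loop until done
best <- pop[which.max(apply(pop, 1, fitness)), ] # best solution found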

The best optimal solution chosen in this case was solution number seven, based on the output of this genetic algorithm.


Results:

The results from this review are shown in the table below:

S No.   Method                  Classification Accuracy   Classification Error
1.      Decision Tree           94.38 %                   5.62 %
2.      Multilayer Perceptron   83.87 %                   16.13 %
3.      Logistic Regression     72.64 %                   27.36 %

The table shows that, of the three classification techniques used, decision trees gave the best results, with the lowest misclassification rate.

Summary:

This review provided insights into the methods we can try for a multiclass predictive analytics problem and various ways to improve those models: the C4.5 algorithm can be used to improve decision trees, and a pruning algorithm can improve the multilayer perceptron model. Another important takeaway was that boosted Decision Trees with the C4.5 package performed best of the three models in classifying the various steel defects, especially when the results have to be interpreted by humans.


Reviewed By: Rahul Garg

CLASSIFICATION OF EEG SIGNALS USING NEURAL NETWORK AND LOGISTIC REGRESSION

A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.

Objective:

This paper concerns the detection of epileptiform discharges in the EEG using logistic regression and artificial neural network models. Epileptic seizures can occur in many different ways; EEG signals carry a lot of information, and their accurate classification and evaluation may prove a breakthrough in the medical domain. The paper compares the traditional method of logistic regression with more advanced neural network techniques as mathematical tools for developing classifiers for the detection of epileptic seizure in multi-channel EEG. The authors developed two models, one using logistic regression and one using an artificial neural network: a multilayer perceptron neural network (MLPNN) with back propagation and the Levenberg-Marquardt training algorithm. Comparing the results of the two methods, the authors concluded that the neural network proved a better model than logistic regression. This implies that the MLPNN is more accurate and easier to build, since for developing logistic regression equations we start with no knowledge of the best combination of parameters or of the shape and degree of nonlinearity required to produce an optimal model.

Approach:

The EEG data used in this study came from 24-h EEG recordings of both epileptic patients and normal subjects. To assess the performance of the classifier, 500 EEG segments were selected containing spike-and-wave complexes, artifacts and normal background EEG. Twenty absence seizures (petit mal) from five epileptic patients admitted for video-EEG monitoring were analyzed, and each signal was inspected by experienced neurologists to score epileptic and normal signals. Wavelet transform analysis was then applied, as it captures transient features and localizes them accurately in both time and frequency. Logistic regression and neural network classifiers were developed by randomly selecting 300 of the 500 available examples as the training set, keeping the remaining 200 for testing and validating the developed models. The optimal network was selected by monitoring the variation of error and several accuracy parameters as the hidden layer was expanded and across training cycles. The sum of squared errors was used for choosing the optimal model, and the optimal number of nodes in the hidden layer was found to be 21. Finally, after testing both models, the better one was chosen based on misclassification error rate and sensitivity-specificity analysis. The table below shows the division of the collected data into training and testing sets.


Results:

The table below compares the two models on test data in terms of classification accuracy and sensitivity-specificity analysis. The MLPNN clearly has higher accuracy and a larger area under the ROC curve.

Summary: This paper helped build a better understanding of neural network analysis, a technique beyond those learnt in class. It provided insights into the procedure for choosing the optimal number of hidden nodes and into the limitations of the logistic regression model. Another major takeaway is the evaluation and comparison of the traditional logistic regression classifier with the much newer multilayer perceptron neural network. Last but not least, the paper introduced wavelet transform analysis, which is very effective at capturing transient features and localizing them in both the time and frequency domains.


Reviewed by: Vinayak Nair

A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAULTS DETECTION

Halawani, M. (2014). A study of decision tree ensembles and feature selection for steel plates faults detection. International Journal of Technical Research and Applications, 2(4), 127-131.

1. Objective: Detection of steel plate defects is a serious problem in industry; often it is performed by human operators, which is expensive and slow, and it can be tackled by automating the process. The paper shows the application of decision tree ensembles for fault detection. Several decision tree ensembles (random subspace, bagging, AdaBoost.M1 and random forests) are used to perform steel plate fault detection, and the best method for this problem is identified. The effect of removing insignificant features is also studied. The results suggest that AdaBoost.M1 and random subspace are the best ensemble methods, with prediction accuracy greater than 80%.

2. Approach: Random subspace, bagging, AdaBoost.M1 and random forest classifier ensembles were applied to the UCI dataset and the prediction accuracies were calculated. Different selections of predictors were also tried.

3. Results: The classification errors are tabulated in the paper for three predictor sets: all predictors included, the 20 most important predictors, and the 15 most important predictors.


Random Subspace performed best for the first and third predictor sets; AdaBoost.M1 came first for the second. When the best 20 predictors were selected, the results of all methods except Random Subspace improved. With only 15 predictors, model performance dropped, indicating that some important predictors had been missed.

4. Summary: The single decision tree model always gave worse results than Random Subspace, AdaBoost.M1, Bagging and Random Forests, which means we will have to use decision tree ensembles in this project as well. Feature selection is also very important, as selecting the most important predictors reduces the error rate.


Reviewed by: Vinayak Nair

COMPARISON OF THREE CLASSIFICATION TECHNIQUES, CART, C4.5 AND MULTI-LAYER PERCEPTRONS

Tsoi, A.C., Pearson, R.A. (1991) Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons. Advances in Neural Information Processing Systems 3 R. Morgan Kaufmann Publishers, San Mateo, CA. 963-969.

1. Objective: There are many popular algorithms, such as CART (classification and regression tree), the MLP (multilayer perceptron) and C4.5, and there is a need to know how these methods compare against each other. By comparing different methods on constrained data, we can make qualitative statements about the methods, which can help practitioners make fewer mistakes when applying a particular method to practical problems. The key objective is to compare the three algorithms, CART, MLP and C4.5, on classification and generalization capabilities. The algorithms are run on a version of the Penzias example and the results are summarized. It was found that, in general, the MLP has better classification and generalization accuracies than the other two algorithms.

2. Approach: For comparing classification performance, data known as the clump example (8th-order Penzias) was used; all 256 examples served as both training and testing sets. For comparing generalization performance, the same data was used with the first 200 examples as the training set and the rest as the test set. Parameters used: in the MLP, both the learning rate and the momentum are set at 0.1, with an architecture of 8 input neurons, 5 hidden-layer neurons and 4 output neurons. In CART, the prior probabilities are set equi-probable, and pruning is performed when the probability of the leaf node equals 0.5. In C4.5, all default values are used.

3. Results: In the classification results (tabulated in the paper), mlp1 and mlp2 denote the MLP run for 10,000 and 100,000 iterations respectively. The MLP accuracies improve with the number of iterations (up to about 20,000 iterations). Generalization results on the same data follow.


The generalization accuracy of the MLP is observed to be better than CART, and is comparable to C4.5.

4. Summary: Once converged, the MLP in general has better classification and generalization accuracies than CART or C4.5. On the other hand, the prediction errors made by each algorithm differ, which indicates a possibility of combining these algorithms in such a way that their prediction accuracies could be improved. This is presented as a challenge for future research.


Reviewed by: Vinayak Nair

COMPARISON OF LOGISTIC REGRESSION AND LINEAR DISCRIMINANT ANALYSIS: A SIMULATION STUDY

Pohar, M., Blas, M., & Turk, S. (2004). Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study. Metodoloski zvezki, 1(1), 143-161.

1. Objective: Linear Discriminant Analysis (LDA) and Logistic Regression (LR) are two widely used statistical methods. Though both can be used to build linear classification models, we need a set of guidelines for proper selection: LR makes no assumptions about the distribution of the explanatory data, while LDA was developed for normally distributed explanatory variables, and the method appropriate to a problem will always give better results. The objective of the paper is to understand when to choose LDA and when to choose LR. The two methods are compared and their performance is studied using simulations. The results of LDA and LR were found to be close whenever the normality assumptions are not too badly violated, and guidelines were set for recognizing these situations; the inappropriateness of LDA in all other cases is discussed.

2. Approach: The simplest and most frequently used criterion for comparing the two methods is classification error (the percentage of incorrectly classified objects; CE). However, classification error is a very insensitive and statistically inefficient measure (Harrell, 1997). Harrell and Lee (1985) proposed four measures of predictive accuracy, the indexes A, B, C and Q, which are better and more efficient criteria: they tell us how well the models discriminate between the groups and/or how good the prediction is. In these indexes, Pk denotes an estimate of P(Yk = 1 | Xk), I is an indicator function, Pi is the probability of classification into group i, Yi is the actual group membership (1 or 0), and n is the sample size of both populations. Random samples of sizes n and m are drawn from two multivariate normal populations with different mean vectors but equal covariance matrix Σ. The mean vector of one group is always set at (0,0); the distance to the other is measured using the Mahalanobis distance, while the direction is set as the angle (denoted by υ) to the direction of the eigenvector of the covariance matrix. Each sample is then randomly divided into two parts, a training and a test sample; the coefficients of LDA and LR are computed on the first and predictions are made on the second. The sampling experiment is replicated 50 times, each time computing the indexes for both methods; finally, the average values of the indexes and the proportion of simulations in which LR performs better are recorded. After sampling, the normally distributed variables can be categorized (either only one or both of them): the minimum and maximum values are computed, and the whole interval is divided into a certain number of categories of equal size.
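For reference, the Mahalanobis distance between the two group means used here is the standard quantity D^2 = (mu1 - mu2)' Σ^(-1) (mu1 - mu2), where mu1 and mu2 are the group mean vectors.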

3. Results: The sample size has the most obvious impact on the difference between the methods. LDA assumes normality, and the errors it makes in prediction are due only to errors in estimating the mean and variance from the sample. LR, by contrast, adapts itself to the distribution and assumes nothing about it. Therefore, with small samples, the difference between the distribution of the training sample and that of the test sample can be substantial; as the sample size increases, the sampling distributions become more stable, which leads to better results for LR.


Consequently, the results of the two methods are getting closer because the populations are normally distributed.

The results from Table 1 confirm this. As the sample size increases, the LDA coefficient estimates become more accurate, and all four indexes improve; the LR indexes increase even faster, thus approaching those of LDA. The decreasing difference between the two methods is best seen in the Q index, which is the most sensitive one. As the differences between index means are negligible, it is also interesting to look at the proportion of simulations where LR performs better: the rates we pay special attention to, those of the B index and the Q index, increase steadily.

For the other changes studied, the results of the two methods remained very close; in fact LDA is only slightly better than LR. Simulations were also carried out to study the effects of categorization and non-linearity, but they are not presented in this literature review for lack of space; the major takeaways are summarized in the next section.

4. Summary: LDA is the more appropriate method when the explanatory variables are normally distributed. With categorized variables, LDA remains preferable and fails only when the number of categories is very small (2 or 3). The results of LR in all these cases are consistently close to, and slightly worse than, those of LDA. But whenever the assumptions of LDA are not met, its use is not justified, while LR gives good results regardless of the distribution. As the LR estimates are obtained by the maximum likelihood method, they also have a number of nice asymptotic properties.


Project Approach

Analysis Flow Chart:

Problem Description

Propose the best model, with the highest prediction accuracy, that can be implemented in the steel plate manufacturing process to detect faults during production and thus help reduce them through proper preventive measures. The assumptions are:

The data available is exactly the data taken from the production line, with no manipulation.

The data is not biased: it was randomly selected from the different production lines (if present) and collected over a period of time.

Given Data

The data used for this project is taken from the UCI library. The dataset consists of 7 different steel plate faults and 27 attributes describing the features of the manufactured steel plate and the manufacturing process.

Data Set Information:

Type of dependent variables (7 Types of Steel Plates Faults):

1. Pastry

2. Z_Scratch

3. K_Scatch

4. Stains

5. Dirtiness

6. Bumps

7. Other_Faults

Attribute Information:

27 independent variables:

X_Minimum

X_Maximum

Y_Minimum

Y_Maximum

Pixels_Areas

X_Perimeter

Y_Perimeter

Sum_of_Luminosity

Minimum_of_Luminosity

Maximum_of_Luminosity

Length_of_Conveyer

TypeOfSteel_A300

TypeOfSteel_A400

Steel_Plate_Thickness

Edges_Index

Empty_Index

Square_Index

Outside_X_Index

Edges_X_Index

Edges_Y_Index

Outside_Global_Index

LogOfAreas

Log_X_Index

Log_Y_Index

Orientation_Index


Luminosity_Index

SigmoidOfAreas

Preliminary analysis of the data:

The first six rows of the dataset (R output; the wide data frame is printed in column blocks):

   X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas
 1        42        50    270900    270944          267
 2       645       651   2538079   2538108          108
 3       829       835   1553913   1553931           71
 4       853       860    369370    369415          176
 5      1289      1306    498078    498335         2409
 6       430       441    100250    100337          630
   X_Perimeter Y_Perimeter Sum_of_Luminosity
 1          17          44             24220
 2          10          30             11397
 3           8          19              7972
 4          13          45             18996
 5          60         260            246930
 6          20          87             62357
   Minimum_of_Luminosity Maximum_of_Luminosity
 1                    76                   108
 2                    84                   123
 3                    99                   125
 4                    99                   126
 5                    37                   126
 6                    64                   127
   Length_of_Conveyor TypeOfSteel_A300 TypeOfSteel_A400
 1               1687                1                0
 2               1687                1                0
 3               1623                1                0
 4               1353                0                1
 5               1353                0                1
 6               1387                0                1
   Steel_Plate_Thickness Edges_Index Empty_Index
 1                    80      0.0498      0.2415
 2                    80      0.7647      0.3793
 3                   100      0.9710      0.3426
 4                   290      0.7287      0.4413
 5                   185      0.0695      0.4486
 6                    40      0.6200      0.3417
   Square_Index Outside_X_Index Edges_X_Index Edges_Y_Index
 1       0.1818          0.0047        0.4706        1.0000
 2       0.2069          0.0036        0.6000        0.9667
 3       0.3333          0.0037        0.7500        0.9474
 4       0.1556          0.0052        0.5385        1.0000
 5       0.0662          0.0126        0.2833        0.9885
 6       0.1264          0.0079        0.5500        1.0000
   Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index
 1                    1     2.4265      0.9031      1.6435
 2                    1     2.0334      0.7782      1.4624
 3                    1     1.8513      0.7782      1.2553
 4                    1     2.2455      0.8451      1.6532
 5                    1     3.3818      1.2305      2.4099
 6                    1     2.7993      1.0414      1.9395
   Orientation_Index Luminosity_Index SigmoidOfAreas Pastry
 1            0.8182          -0.2913         0.5822      1
 2            0.7931          -0.1756         0.2984      1
 3            0.6667          -0.1228         0.2150      1
 4            0.8444          -0.1568         0.5212      1
 5            0.9338          -0.1992         1.0000      1
 6            0.8736          -0.2267         0.9874      1
   Z_Scratch K_Scratch Stains Dirtiness Bumps Other_Faults
 1         0         0      0         0     0            0
 2         0         0      0         0     0            0
 3         0         0      0         0     0            0
 4         0         0      0         0     0            0
 5         0         0      0         0     0            0
 6         0         0      0         0     0            0

The preliminary data analysis shows that:

There is at least one defect associated with every row of attributes.

There are 1941 defect entries in the whole dataset, which is equal to the total number of rows in the dataset.

Other_Faults account for the majority of the defects: almost 35% of the recorded defects are Other_Faults. It can therefore be anticipated that the misclassification error may be high for this class.

No two defects are associated with a single row of input; only one defect occurs for a particular row of attributes.

Type of Defect    Number of Occurrences
Pastry (1)        158
Z_Scratch (2)     190
K_Scratch (3)     391
Stains (4)        72
Dirtiness (5)     55
Bumps (6)         402
Other_Faults (7)  673
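These observations can be checked with a few lines of R (a sketch; it assumes the seven 0/1 fault indicators sit in columns 28-34 of a data frame named data, consistent with the column indices used in the Implementation section):

# Sanity checks behind the preliminary observations (column layout assumed).
defects <- data[, 28:34]            # the seven 0/1 fault indicator columns
nrow(data)                          # 1941 rows in total
sum(rowSums(defects) != 1)          # 0 -> exactly one defect per row
colSums(defects)                    # occurrences per defect type (table above)
max(colSums(defects)) / nrow(data)  # Other_Faults share: 673/1941 ~ 0.35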

Description of new techniques used:

The new techniques chosen for this project are neural network analysis and C5.0 Decision trees.

1) Artificial Neural Networks

What is a Neural Network?

An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information.



[Figure: components of a neuron and the synapse]

The figure above shows the structure of the human neural system. In the human brain, a neuron receives signals from all parts of the body through a huge number of dendrites, and sends signals as electrical activity through the axon. Learning occurs through changes in the energy levels of the neurons.

What is an Artificial Neural Network?

An artificial neural network is a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.

The structure of a neural-network algorithm has three layers:

The input layer feeds past data values into the next (hidden) layer. The black circles represent nodes of the neural network.

The hidden layer encapsulates several complex functions that create predictors; often those functions are hidden from the user. A set of nodes (black circles) at the hidden layer represents mathematical functions that modify the input data; these functions are called neurons.

The output layer collects the predictions made in the hidden layer and produces the final result: the model's prediction.

Neurons in a neural network can use sigmoid functions to map inputs to outputs. When used that way, a sigmoid function is called a logistic function, and its formula is:

f(input) = 1 / (1 + e^(-input))
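As a one-line R illustration of this activation (not code from the project):

# Logistic (sigmoid) activation used by the neurons described above.
sigmoid <- function(x) 1 / (1 + exp(-x))
curve(sigmoid, from = -6, to = 6, ylab = "f(x)")  # the familiar S-shaped curve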

Artificial Neural Network Representation

2) C5.0 Decision Trees

A decision tree can be considered a system for organizing a huge amount of information graphically.


A decision tree consists of internal nodes that represent decisions corresponding to hyperplanes or split points (i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, labeled with the majority class. A region is characterized by the subset of data points that lie in it.

One advantage of decision trees is that they produce models that are relatively easy to interpret. In particular, a tree can be read as a set of decision rules, with each rule's antecedent comprising the decisions on the internal nodes along a path to a leaf, and its consequent being the label of the leaf node. Further, because the regions are all disjoint and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions.

An example of the decision tree is seen in the following figure.

The C5.0 algorithm acts like ID3 but improves on several of ID3's behaviors. The new features (versus ID3) are:

1) Accepts both continuous and discrete features.

2) Handles incomplete data points.

3) Pruning is already included in the package, so the reported results are post-pruning.

4) Can use attributes with different weights.

5) Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or cores.


IMPLEMENTATION

Linear Discriminant Analysis

The modified multiclass dataset was modeled using Linear Discriminant Analysis to obtain the confusion matrix and misclassification error rate for the test dataset. K-fold cross validation was also performed on the test data to confirm our results.

> ## LDA
> library(MASS)    # lda() lives in MASS
> lda.model = lda(alldefects ~ ., data = train2)
> lda_pred = predict(lda.model, test2)
> table(lda_pred$class, test.alldefects)

   test.alldefects
      A   B   C   D   E   F   G
  A  25   0   0   0   0   1  15
  B   5  50   0   0   0   4  10
  C   2   0  91   0   0   0   4
  D   0   0   1  26   0   0   2
  E   3   0   0   0  18   0   6
  F   4   2   4   0   1  67  40
  G   9   5  27   1   4  39 117

> mean(lda_pred$class != test.alldefects)
[1] 0.3241852

> ## Cross-validation
> lda.cv = lda(alldefects ~ ., test2, CV = TRUE)
> table(lda.cv$class, test.alldefects)

   test.alldefects
      A   B   C   D   E   F   G
  A  24   0   1   0   1   1  15
  B   5  48   0   0   0   5  10
  C   0   0 104   0   0   0   2
  D   1   0   1  24   0   0   1
  E   3   0   0   0  16   0   9
  F   4   3   1   0   1  61  42
  G  11   6  16   3   5  44 115

> mean(lda.cv$class != test.alldefects)
[1] 0.3276158

The misclassification and cross-validation error rates were 32.42% and 32.76%, respectively.

Decision Tree

A single decision tree was then modelled on the modified dataset. The tree was also pruned to reduce the number of branches and simplify the tree.

> ##tree
> library(tree)
> tree1 = tree(train2$alldefects~., data=train2)
> plot(tree1)
> text(tree1, pretty=0)


> tree.pred = predict(tree1, test2, type="class")
> table(tree.pred, test.alldefects)
         test.alldefects
tree.pred  A  B  C  D  E  F   G
        A  0  0  0  0  0  0   0
        B  3 51  0  0  0  2   5
        C  0  0 98  0  0  0   2
        D  0  0  0 23  0  0   1
        E  0  0  0  0  0  0   0
        F  6  0  0  1  7 80  56
        G 39  6 25  3 16 29 130
> mean(tree.pred != test.alldefects)
[1] 0.3447684

> ##pruning
> set.seed(1)
> cv.data = cv.tree(tree1, FUN=prune.misclass)
> plot(cv.data$size, cv.data$dev, type="b")
> plot(cv.data$k, cv.data$dev, type="b")
> prune.data = prune.misclass(tree1, best=9)
> plot(prune.data)
> text(prune.data, pretty=0)
> tree.pred2 = predict(prune.data, test2, type="class")
> table(tree.pred2, test.alldefects)
          test.alldefects
tree.pred2  A  B  C  D  E  F   G
         A  0  0  0  0  0  0   0
         B  3 51  0  0  0  2   5
         C  0  0 88  0  0  1   8
         D  0  0  0 23  0  0   1
         E  0  0  0  0  0  0   0
         F  2  0  0  0  1 59  28
         G 43  6 35  4 22 49 152
> mean(tree.pred2 != test.alldefects)
[1] 0.3602058


The misclassification error rates obtained from the original and the pruned tree were 34.5% and 36%, respectively. The error rate increased only slightly, which justifies pruning to make the decision tree more readable.
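The choice best = 9 was read off the cross-validation plots; as a small sketch, the size with the lowest cross-validated misclassification count can also be picked programmatically from the cv.data object created above:

# Tree size minimizing cross-validated misclassification (dev counts errors here)
best.size <- cv.data$size[which.min(cv.data$dev)]
prune.data <- prune.misclass(tree1, best = best.size)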

Bagging

Bagging was used on the dataset to reduce the variance of a single decision tree: many trees are grown on bootstrap resamples of the training data (effectively enlarging the training set) and their predictions are aggregated.

> ## bAGGING
> set.seed(1)
> library(randomForest)
> bag.data = randomForest(alldefects~., data=train2, mtry=27, importance=TRUE)
> yhat.bag = predict(bag.data, test2)
> plot(yhat.bag, test.alldefects)
> abline(0,1)
> table(yhat.bag, test.alldefects)
        test.alldefects
yhat.bag  A  B   C  D  E  F   G
       A 30  0   0  0  0  5   7
       B  0 50   0  0  0  1   0
       C  0  0 112  0  0  0   1
       D  0  0   0 24  0  0   1
       E  1  0   0  0 19  1   2
       F  4  0   0  1  2 76  32
       G 13  7  11  2  2 28 151
> mean(yhat.bag != test.alldefects)
[1] 0.2075472

The misclassification error rate obtained was 20.75%.

Random Forest

The random forest method was then applied to the modified dataset to de-correlate the bagged trees, further reducing the variance.


> #randomforest
> set.seed(1)
> rf = randomForest(alldefects~., data=train2, importance=TRUE)
> yhat.rf = predict(rf, test2)
> table(yhat.rf, test.alldefects)
       test.alldefects
yhat.rf  A  B   C  D  E  F   G
      A 25  0   0  0  0  6   5
      B  1 50   0  0  0  0   4
      C  0  0 112  0  0  0   1
      D  0  0   0 24  0  0   1
      E  0  0   0  0 19  0   3
      F  3  1   0  1  2 73  33
      G 19  6  11  2  2 32 147
> mean(yhat.rf != test.alldefects)
[1] 0.2281304

The misclassification error rate obtained was 22.81%, slightly higher than bagging; however, the de-correlated trees give a lower-variance model, which generally generalizes better to future data points.
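The only substantive difference between the two fits is the mtry argument: bagging considers all 27 predictors at every split, while randomForest's default for classification is floor(sqrt(p)) randomly chosen predictors, which is what de-correlates the trees. A minimal sketch of the contrast:

library(randomForest)
p <- 27   # number of predictors in train2
bag.fit <- randomForest(alldefects ~ ., data = train2, mtry = p)               # bagging
rf.fit  <- randomForest(alldefects ~ ., data = train2, mtry = floor(sqrt(p)))  # random forest default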

C5.0

An advanced decision tree technique known as C5.0 was also used to model the modified dataset.

> #C50
> crx <- data[sample(nrow(data)), ]
> X <- crx[,1:27]
> y <- crx[,35]
> trainx <- X[1:1358,]
> trainy <- y[1:1358]
> testx <- X[1358:1941,]
> testy <- y[1358:1941]
> model <- C5.0(trainx, trainy, trials=75)
> p <- predict(model, testx, type="class")
> table(p, testy)
   testy
p    A  B   C  D  E  F   G
  A 32  0   0  0  1  2   6
  B  1 39   0  0  0  1   4
  C  0  0 111  0  0  1   3
  D  0  0   0 26  0  2   0
  E  1  0   0  0 14  2   2
  F 11  0   0  1  0 96  19
  G 15  1   7  1  1 31 153
> mean(p != testy)
[1] 0.1934932

The misclassification error rate obtained was 19.35%.

Support Vector Machines

SVM models were fitted on the training dataset for different values of the cost parameter C; the best results were obtained with C=15.
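The search over C was done by refitting manually; the e1071 package also provides a tune() helper that performs the same search by 10-fold cross-validation. A sketch, with an illustrative cost grid:

library(e1071)
tune.out <- tune(svm, alldefects ~ ., data = train2,
                 kernel = "polynomial", degree = 3,
                 ranges = list(cost = c(1, 5, 10, 15, 20)))
summary(tune.out)    # cross-validation error for each cost value
tune.out$best.model  # model refitted at the best cost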

> svm.fit = svm(alldefects~., data=train2, type="C", kernel="polynomial", degree=3, cost=15)
> summary(svm.fit)

Call:
svm(formula = alldefects ~ ., data = train2, type = "C", kernel = "polynomial",
    degree = 3, cost = 15)


Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  polynomial
       cost:  15
     degree:  3
      gamma:  0.03703703704
     coef.0:  0

Number of Support Vectors:  819
 ( 43 225 347 87 74 18 25 )

Number of Classes:  7

Levels:
 A B C D E F G

The plot of the SVM fit on the training data is shown in the figure below. It is a 2-D plot with axes Edges_X_Index and Edges_Y_Index. The circular symbols show the data points and the triangles show the support vectors.

> predicted = predict(svm.fit, test2)
> table(predicted, test.alldefects)
         test.alldefects
predicted  A  B   C  D  E  F   G
        A 28  1   0  0  0  6   7
        B  0 49   0  0  0  3   6
        C  0  1 113  0  0  0   4
        D  0  0   0 25  0  0   0
        E  0  0   0  0 17  0   3
        F  6  0   3  1  5 69  53
        G 14  6   7  1  1 33 121
> mean(predicted != test.alldefects)
[1] 0.2761578045

The misclassification rate for SVM on the testing data is about 27.62%.

Artificial Neural Networks

For this project, artificial neural network models were developed using two different methods. The first model was fitted with the nnet function from the nnet library in R. The second model was created with a multilayer perceptron, using the mlp function provided by the RSNNS library in R.

1st Method

In this method the model was built using the nnet function from the nnet library in R. Many configurations were tried by changing the number of hidden units, expressed as size in the code. The rang, decay and maxit arguments were also varied to lower the misclassification rate. The best model had 20 hidden units, with the other arguments as shown in the code.

train.nnet <- nnet(alldefects~., data=train2, size=20, rang=0.1, Hess=FALSE, decay=0.001, maxit=10000)
# weights: 707
initial value 2736.596055
iter  10 value 2082.470121
iter  20 value 2006.626658
iter  30 value 1963.775429
iter  40 value 1907.670254
iter  50 value 1901.104216
iter  60 value 1841.389091
iter  70 value 1815.725249
iter  80 value 1804.856698
iter  90 value 1801.382263
iter 100 value 1801.021638
iter 110 value 1797.549455
iter 120 value 1797.305184
iter 130 value 1797.183004
iter 140 value 1796.918336
iter 150 value 1795.256115
iter 160 value 1793.025804
final value 1792.714314
converged

test.nnet <- predict(train.nnet, test2, type=("class"))
table(test2$alldefects, test.nnet)
   test.nnet
      1   3   7
  1   0   5  43
  2   0   7  50
  3   0  98  25
  4   0   0  27
  5   0   1  22
  6   0   6 105
  7   1  22 171
mean(test.nnet != test2$alldefects)
[1] 0.5385935


The misclassification rate for the ANN on the testing data is about 53.9%.
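The manual search over size and decay described above can be organized as a small grid search; a sketch (the grid values are illustrative, and the test set is used for scoring, as in the report):

library(nnet)
grid <- expand.grid(size = c(5, 10, 20), decay = c(0.1, 0.01, 0.001))
err <- apply(grid, 1, function(g) {
  fit  <- nnet(alldefects ~ ., data = train2, size = g["size"],
               decay = g["decay"], maxit = 2000, trace = FALSE)
  pred <- predict(fit, test2, type = "class")
  mean(pred != test2$alldefects)   # test misclassification rate
})
cbind(grid, err)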

2nd Method

In this method the artificial neural network model was built using the mlp function from the RSNNS library in R. This multilayer perceptron takes the predictors, the responses, and the hidden layer sizes as inputs.

> model = mlp(x[samp,], y[samp,], size=c(10,10,5), linOut=F)
> test.cl(y[-samp,], predict(model, x[-samp,]))
    cres
true   3   7
   1   3  42
   2   2  52
   3  50  63
   4   0  25
   5   0  16
   6   3 110
   7  16 200
> test.cl(y[samp,], fitted.values(model))
    cres
true   3   7
   1   5 108
   2   6 130
   3 119 159
   4   0  47
   5   4  35
   6   4 285
   7  33 424

The misclassification rate for ANN on testing data is about 60.01%.

The misclassification rates of the models built using artificial neural networks are very high. The lowest error rate, about 53.9%, was achieved with the nnet package.

Logistic Regression

We developed seven different logistic regression models, one for the classification of each type of defect, since we noticed that each defect relied on a different set of predictors. The aim is a hierarchical model: check whether the first type of defect is present; if present, stop (the dataset implies that each steel plate has only one kind of defect); if absent, continue and check for the next type of defect, and so on.
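A minimal sketch of this hierarchy, assuming the seven fitted binary models named as in the code below (log_pastry, log_zs, and so on) and a single new observation newx:

# Apply the binary models in a fixed order; stop at the first detected defect
models <- list(Pastry = log_pastry, Z_Scratch = log_zs, K_Scratch = log_ks,
               Stains = log_stains, Dirtiness = log_dirt, Bumps = log_bumps,
               Other_Faults = log_of)
classify_plate <- function(newx, threshold = 0.5) {
  for (defect in names(models)) {
    p <- predict(models[[defect]], newx, type = "response")
    if (p > threshold) return(defect)  # each plate has only one kind of defect
  }
  "No defect detected"
}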

Logistic regression model for the first type of defect: Pastry

> train_pastry = train[,-c(29,30,31,32,33,34,35)]
> fix(train_pastry)
> test_pastry = test[,-c(29,30,31,32,33,34,35)]
> log_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family="binomial")
> summary(log_pastry)

Call:
glm(formula = Pastry ~ LogOfAreas + TypeOfSteel_A300 + Sum_of_Luminosity +
    Log_X_Index + Square_Index + Orientation_Index + Log_Y_Index +
    Maximum_of_Luminosity + X_Maximum + X_Minimum + Length_of_Conveyor +
    Minimum_of_Luminosity, family = "binomial", data = train_pastry)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.01159  -0.26886  -0.05515   0.00000   3.08897

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -1.178e+01  3.366e+00  -3.500 0.000465 ***
LogOfAreas             4.551e+00  2.015e+00   2.258 0.023940 *
TypeOfSteel_A300      -5.827e-01  3.077e-01  -1.894 0.058260 .
Sum_of_Luminosity      7.952e-06  2.324e-06   3.422 0.000621 ***
Log_X_Index            1.377e+01  6.238e+00   2.207 0.027308 *
Square_Index          -4.378e+00  1.186e+00  -3.692 0.000223 ***
Orientation_Index      4.450e+00  1.267e+00   3.512 0.000445 ***
Log_Y_Index           -9.260e+00  2.309e+00  -4.010 6.07e-05 ***
Maximum_of_Luminosity  3.396e-02  9.677e-03   3.510 0.000449 ***
X_Maximum             -7.516e-01  2.283e-01  -3.292 0.000995 ***
X_Minimum              7.521e-01  2.283e-01   3.294 0.000987 ***
Length_of_Conveyor     3.830e-03  8.421e-04   4.548 5.41e-06 ***
Minimum_of_Luminosity -4.196e-02  8.145e-03  -5.151 2.59e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 763.76  on 1357  degrees of freedom
Residual deviance: 427.59  on 1345  degrees of freedom
AIC: 453.59

Number of Fisher Scoring iterations: 14

> log_pastry_pred = predict(log_pastry, test_pastry, type="response")
> log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment
> log_pastry_pred_y[log_pastry_pred > 0.5] = 1
> table(log_pastry_pred_y, test_pastry[,28])
log_pastry_pred_y   0   1
                0 528  33
                1   7  15
> mean(log_pastry_pred_y != test_pastry[,28])
[1] 0.06861063

We see that the misclassification error rate is less than 7%, which is acceptable for this individual model. We also cross-validated these results using the K-fold cross-validation technique.

> # cross validation
> cv_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family="binomial")
> cv.glm(train_pastry, cv_pastry, K=10)$delta[1]
[1] 0.06969064

The 10-fold cross-validation error estimate obtained was 6.97%.
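One caveat: by default cv.glm reports delta based on average squared prediction error. To cross-validate the misclassification rate itself, a cost function can be supplied; a sketch using the cv_pastry fit above:

library(boot)
# Cost = share of cases whose predicted probability falls on the wrong side of 0.5
miss.cost <- function(y, p) mean(abs(y - p) > 0.5)
cv.glm(train_pastry, cv_pastry, cost = miss.cost, K = 10)$delta[1]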

Similarly, individual classification models were developed for the rest of the defects, and the results obtained are tabulated below:


Defect        Error rate   CV error
Pastry        0.068        0.069
Z_Scratch     0.041        0.030
K_Scratch     0.045        0.018
Stains        0.0189       0.020
Dirtiness     0.031        0.015
Bumps         0.16         0.124
Other Faults  0.252        0.176

Confusion matrices for the individual logistic regression models (rows = predicted, columns = actual):

Pastry:
log_pastry_pred_y   0   1
                0 528  33
                1   7  15

Z_Scratch:
log_zs_pred_y   0   1
            0 511   9
            1  15  48

K_Scratch:
log_ks_pred_y   0   1
            0 453  19
            1   7 104

Stains:
log_stains_pred_y   0   1
                0 554   9
                1   2  18

Dirtiness:
log_dirt_pred_y   0   1
              0 554  12
              1   6  11

Bumps:
log_bumps_pred_y   0   1
               0 447  68
               1  25  43

Other Faults:
log_of_pred_y   0   1
            0 346 104
            1  43  90

The combined accuracy of the hierarchical model would be
(1-0.068)*(1-0.041)*(1-0.045)*(1-0.0189)*(1-0.031)*(1-0.16)*(1-0.252) = 0.51 = 51%.

*Since the defects are independent of each other, the probabilities of the individual models being correct are multiplied.

Random Forest

Individual responses were then modeled with random forest to get the respective error rates.

> ##randomforest with Pastry only
> train_pastry$Pastry = factor(train_pastry$Pastry)
> test_pastry$Pastry = factor(test_pastry$Pastry)
> rf_pastry = randomForest(Pastry~., data=train_pastry, importance=TRUE)
> yhat.rf_pastry = predict(rf_pastry, test_pastry)
> table(yhat.rf_pastry, test_pastry[,28])
yhat.rf_pastry   0   1
             0 529  32
             1   6  16
> mean(yhat.rf_pastry != test_pastry[,28])
[1] 0.0651801

The misclassification error rate obtained was 6.52%.


Similarly, random forest models were developed for the other individual defects, with the results tabulated below:

Defect        Error rate
Pastry        0.065
Z_Scratch     0.022
K_Scratch     0.0257
Stains        0.0051
Dirtiness     0.0189
Bumps         0.111
Other Faults  0.17

Confusion matrices for the individual random forest models (rows = predicted, columns = actual):

Pastry:
yhat.rf_pastry   0   1
             0 529  32
             1   6  16

Z_Scratch:
yhat.rf_zs   0   1
         0 522   9
         1   4  48

K_Scratch:
yhat.rf_ks   0   1
         0 459  14
         1   1 109

Stains:
yhat.rf_stains   0   1
             0 556   3
             1   0  24

Dirtiness:
yhat.rf_dirt   0   1
           0 558   9
           1   2  14

Bumps:
yhat.rf_bumps   0   1
            0 460  53
            1  12  58

Other Faults:
yhat.rf_of   0   1
         0 363  73
         1  26 121

The combined accuracy of the hierarchical model would be
(1-0.065)*(1-0.022)*(1-0.0257)*(1-0.005)*(1-0.0189)*(1-0.111)*(1-0.17) = 0.6417 = 64.17%.

*Since the defects are independent of each other, the probabilities of the individual models being correct are multiplied.

Principal Component Analysis

A dimensionality reduction technique was applied to the dataset, following the 80/20 rule to extract the "vital few" prediction terms from the "trivial many".

> #PCA on complete data set
> datap = data[,-(28:35)]
> fit <- princomp(datap, cor=TRUE)
> summary(fit) # print variance accounted for


Importance of components:

Component  Standard deviation  Proportion of Variance  Cumulative Proportion
Comp.1     2.8815442           0.3075295               0.3075295
Comp.2     1.8493487           0.1266700               0.4341995
Comp.3     1.6443472           0.1001436               0.5343431
Comp.4     1.49596665          0.08288579              0.61722894
Comp.5     1.40409954          0.07301835              0.69024729
Comp.6     1.27423486          0.06013609              0.75038338
Comp.7     1.17387303          0.05103622              0.80141960
Comp.8     0.99985628          0.03702639              0.83844599
Comp.9     0.96006830          0.03413819              0.87258418
Comp.10    0.88369299          0.02892271              0.90150689
Comp.11    0.84586524          0.02649956              0.92800645
Comp.12    0.73974691          0.02026761              0.94827406
Comp.13    0.62701635          0.01456109              0.96283515
Comp.14    0.54299087          0.01091997              0.97375512
Comp.15    0.489154397         0.008861927             0.982617047
Comp.16    0.434472983         0.006991362             0.989608409
Comp.17    0.316363504         0.003706884             0.993315293
Comp.18    0.243539500         0.002196722             0.995512015
Comp.19    0.235660669         0.002056887             0.997568902
Comp.20    0.211674618         0.001659487             0.999228388
Comp.21    0.1093589373        0.0004429399            0.9996713283
Comp.22    0.0837014355        0.0002594789            0.9999308072
Comp.23    3.693381e-02        5.052246e-05            9.999813e-01
Comp.24    2.217081e-02        1.820536e-05            9.999995e-01
Comp.25    3.543139e-03        4.649569e-07            1.000000e+00
Comp.26    5.417823e-06        1.087141e-12            1.000000e+00
Comp.27    1.218954e-08        5.503141e-18            1.000000e+00

> plot(fit, type="lines") # scree plot
> biplot(fit)

From this analysis and the scree plot, the first 7 principal components were selected, as they explain just over 80% of the variability in the sample space (cumulative proportion 0.8014).
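The cutoff can be verified directly from the fitted princomp object; a short sketch using fit from above:

pve <- fit$sdev^2 / sum(fit$sdev^2)   # proportion of variance per component
cumsum(pve)[7]                        # roughly 0.80 for the first 7 components
which(cumsum(pve) >= 0.80)[1]         # smallest number of PCs covering 80%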

The principal components were then extracted and were stored in two different files:

One with only the top 7 principal components

Another with the original data and the top 7 principal components combined

> axes <- predict(fit, newdata = datap)
> fix(axes)


> data1 = axes[,1:7]
> fix(data1)
> write.csv(data1, file="pcadata.csv")    # data file with the top 7 PCs
> data2 = data.frame(data, data1)
> write.csv(data2, file="comb_data.csv")  # data file with the original data and the 7 PCs combined

These two data files were used for further modelling.

Model Formation with principal components

Logistic Regression

Logistic Regression was performed on the extracted principal components for individual responses.

Logistic regression model for the first type of defect: Pastry (Using the PCs)

> #######################logistic regression
> #00000000000000000000_pastry
> train_pastry = train[,-c(9,10,11,12,13,14,15)]
> fix(train_pastry)
> test_pastry = test[,-c(9,10,11,12,13,14,15)]
> log_pastry = glm(Pastry~., data=train_pastry, family="binomial")
> summary(log_pastry)

Call:
glm(formula = Pastry ~ ., family = "binomial", data = train_pastry)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5477  -0.4039  -0.1326  -0.0243   3.5952

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.51776    0.38758 -11.656  < 2e-16 ***
Comp.1      -0.63814    0.11686  -5.461 4.75e-08 ***
Comp.2      -1.17226    0.14334  -8.178 2.88e-16 ***
Comp.3      -0.37291    0.08243  -4.524 6.07e-06 ***
Comp.4      -0.25626    0.08068  -3.176  0.00149 **
Comp.5       0.42288    0.07491   5.645 1.65e-08 ***
Comp.6       0.17290    0.09726   1.778  0.07546 .
Comp.7       0.40363    0.10328   3.908 9.31e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 763.76  on 1357  degrees of freedom
Residual deviance: 541.13  on 1350  degrees of freedom
AIC: 557.13

Number of Fisher Scoring iterations: 8

> log_pastry_pred = predict(log_pastry, test_pastry, type="response")
> log_pastry_pred_y = rep(0, length(test_pastry[,8])) # default assignment
> log_pastry_pred_y[log_pastry_pred > 0.5] = 1
> table(log_pastry_pred_y, test_pastry[,8])
log_pastry_pred_y   0   1
                0 529  44
                1   6   4
> mean(log_pastry_pred_y != test_pastry[,8])
[1] 0.08576329

> # cross validation
> cv_pastry = glm(Pastry~., data=train_pastry, family="binomial")
> cv.glm(train_pastry, cv_pastry, K=10)$delta[1]
[1] 0.06398761

Similarly, individual classification models were developed for the rest of the defects, and the results obtained are tabulated below:

Defect        Error rate   CV error
Pastry        0.0857       0.064
Z_Scratch     0.0857       0.0525
K_Scratch     0.0634       0.0257
Stains        0.0172       0.0157
Dirtiness     0.0446       0.0224
Bumps         0.19         0.14
Other Faults  0.285        0.198

Confusion matrices for the individual logistic regression models on the principal components (rows = predicted, columns = actual):

Pastry:
log_pastry_pred_y   0   1
                0 529  44
                1   6   4

Z_Scratch:
log_zs_pred_y   0   1
            0 506  30
            1  20  27

K_Scratch:
log_ks_pred_y   0   1
            0 452  29
            1   8  94

Stains:
log_stains_pred_y   0   1
                0 553   7
                1   3  20

Dirtiness:
log_dirt_pred_y   0   1
              0 557  23
              1   3   0

Bumps:
log_bumps_pred_y   0   1
               0 447  86
               1  25  25

Other Faults:
log_of_pred_y   0   1
            0 344 121
            1  45  73

The combined accuracy of the hierarchical model would be
(1-0.0857)*(1-0.0857)*(1-0.0634)*(1-0.0172)*(1-0.0446)*(1-0.19)*(1-0.285) = 0.4258 = 42.58%.

*Since the defects are independent of each other, the probabilities of the individual models being correct are multiplied.

Random Forest

The dataset consisting of the principal components was then used with a random forest model.

> #randomforest with Pastry only
> set.seed(1)
> train_pastry$Pastry = factor(train_pastry$Pastry)
> test_pastry$Pastry = factor(test_pastry$Pastry)


> rf_pastry = randomForest(Pastry~., data=train_pastry, importance=TRUE)
> yhat.rf_pastry = predict(rf_pastry, test_pastry)
> table(yhat.rf_pastry, test_pastry[,8])
yhat.rf_pastry   0   1
             0 529  38
             1   6  10
> mean(yhat.rf_pastry != test_pastry[,8])
[1] 0.0754717

The misclassification error rate obtained was 7.54%.

Since the random forest implementation was giving better results (as expected), it was decided to model all the individual responses with random forest using both datasets: the one with only the 7 PCs, and the one with the original predictors plus the 7 PCs.

The results obtained are tabulated below.

Misclassification error rates for individual random forests with different prediction terms:

S No.  Type of defect     Using first 7 PCs   Using all 27 predictors   Using all predictors + 7 PCs
1      Pastry (A)         0.075               0.065                     0.065
2      Z-scratch (B)      0.046               0.022                     0.024
3      K-Scratch (C)      0.036               0.026                     0.027
4      Stains (D)         0.012               0.005                     0.005
5      Dirtiness (E)      0.019               0.019                     0.015
6      Bumps (F)          0.042               0.111                     0.127
7      Other Defects (G)  0.196               0.170                     0.168

Accuracy for the combined model:  0.635       0.641                     0.631
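All the per-defect fits behind this table follow one template, so they can be generated in a loop; a sketch for the all-27-predictors case (column positions follow the original layout, with the defect indicators in columns 28-34):

library(randomForest)
defects <- names(data)[28:34]  # Pastry, Z_Scratch, ..., Other_Faults
err <- sapply(defects, function(d) {
  tr <- data.frame(train[, 1:27], y = factor(train[[d]]))
  te <- data.frame(test[, 1:27],  y = factor(test[[d]]))
  fit <- randomForest(y ~ ., data = tr)
  mean(predict(fit, te) != te$y)   # test misclassification rate for this defect
})
round(err, 3)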


RESULTS

ROC Analysis

ROC analysis was conducted on the different multiclass models to aid us in selecting the best model.
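The AUC values below were computed with the pROC package, mirroring the calls listed in the Appendix; the basic pattern for any one of the multiclass models is:

library(pROC)
# Numeric-coded predicted classes vs. true labels give a multiclass AUC
predictions <- as.numeric(yhat.rf)  # e.g., the random forest predictions
multiclass.roc(test.alldefects, predictions, plot = T)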

Comparison of different Multiclass Models:

S No.  Modelling Technique            Misclassification Error rate   ROC curve AUC
1      LDA                            0.324                          0.790
2      Decision Tree (After Pruning)  0.360                          0.784
3      Bagging                        0.208                          0.824
4      Random Forest                  0.228                          0.797
5      SVM                            0.276                          0.804
6      Neural Network Analysis        0.539                          0.605
7      C5.0                           0.194                          0.831

(The confusion matrix for each model is shown in the Implementation section.)


The C5.0 Decision Tree had the best performance on the testing dataset.

ROC analysis was also conducted for the logistic regression and random forest models developed for each individual defect; the results are tabulated below:

ROC for Individual defects using Logistic Regression:

AUC values for the individual logistic regression models (the corresponding ROC curves are shown in the figures):

Defect   Pastry  Z_Scratch  K_Scratch  Stains  Dirtiness  Bumps  Other Faults
AUC      0.65    0.91       0.92       0.83    0.73       0.67   0.68


ROC for individual defects using Random Forest

AUC values for the individual random forest models (the corresponding ROC curves are shown in the figures):

Defect   Pastry  Z_Scratch  K_Scratch  Stains  Dirtiness  Bumps  Other Faults
AUC      0.65    0.91       0.92       0.83    0.73       0.67   0.68


CONCLUSION

The major takeaways from this project were:

Advanced decision trees such as C5.0 and Random Forest were the most efficient techniques for multiclass anomaly detection using machine learning.

Although modelling for individual defects gives a very high accuracy rate for almost every one of them, the combined hierarchical model that would utilize them in practice will not be as efficient, because its accuracy is the product of the individual accuracies of the models used in the hierarchy.

Logistic regression, although a very powerful tool, does not seem to be a good fit for multiclass anomaly detection problems, because a logistic regression model does not predict the type of defect directly but rather the probability of that defect occurring, via the log-likelihood function.

SVM also turned out to be a good tool for multiclass classification, as its accuracy rate was high; however, we still prefer C5.0 over SVM, since the SVM model required a very large number of support vectors.

The artificial neural network results on this dataset were not satisfactory, with a very high misclassification rate. This can be due to many reasons, but the major one is the small dataset. There is also no systematic method to fine-tune the number of hidden units, and a slight change in their number causes a significant change in misclassification. It can therefore be said that either the right parameter values were not found even after a lot of trial and error, or the model could not be trained properly because of the small dataset.

Future Scope:

The dataset considered in this project calls for multiclass classification techniques because the defects are not correlated. For the same reason, some techniques were applied in a multi-univariate fashion, that is, using a different model for each fault. Multi-label classification applies when the same predictor values cause two or more defects at a time, which is not the case in this particular dataset, so it was not required and was not used. This project and its results are therefore limited to multiclass classification with uncorrelated faults. If, in future data, the defects become correlated, multi-label classification would have to be used; that is the future scope of this project.


References

Z. Qu, H. Feng, Z. Zeng, J. Zhuge and S. Jin, "A SVM-based pipeline leakage detection and pre-warning system", Measurement, vol. 43, no. 4, pp. 513-519, 2010.

S. Jain, C. Azad and V. K. Jha, "Steel faults diagnosis under predictive analysis", International Journal of Computer Engineering and Applications, vol. IV, issue II/III, Oct. 2013.

A. Subasi and E. Erçelebi, "Classification of EEG signals using neural network and logistic regression", Computer Methods and Programs in Biomedicine, vol. 78, no. 2, pp. 87-99, 2005.

M. Fakhr and A. M. Elsayad, "Steel Plates Faults Diagnosis with Data Mining Models", Journal of Computer Science, vol. 8, no. 4, pp. 506-514, 2012.

S. Omar, A. Ngadi and H. H. Jebur, "Machine Learning Techniques for Anomaly Detection: An Overview", International Journal of Computer Applications, vol. 79, no. 2, pp. 33-41, 2013.

R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms", Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 2006. http://dx.doi.org/10.1145/1143844.1143865

F. Günther and S. Fritsch, "neuralnet: Training of Neural Networks", The R Journal, vol. 2/1, June 2010.

M. Halawani, "A study of decision tree ensembles and feature selection for steel plates faults detection", International Journal of Technical Research and Applications, vol. 2, no. 4, pp. 127-131, 2014.

A. C. Tsoi and R. A. Pearson, "Comparison of three classification techniques, CART, C4.5 and Multi-Layer Perceptrons", Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Mateo, CA, pp. 963-969, 1991.

M. Pohar, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study", Metodoloski zvezki, vol. 1, no. 1, pp. 143-161, 2004.

M. Caudill, "Neural Network Primer: Part I", AI Expert, Feb. 1989.

http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf

http://saiconference.com/Downloads/SpecialIssueNo10/Paper_3A_comparative_study_of_decision_tree_ID3_and_C4.5.pdf

APPENDIX

1. On Original Dataset

library(ISLR)

library(boot)


library(MASS)
library(pROC)   # for roc() and multiclass.roc() used below

data=read.csv(file.choose(), header=T)

attach(data)

data$alldefects="A"

for(i in 1:1941) {

if ( Z_Scratch[i]==1) {data$alldefects[i]="B"}

if ( K_Scratch[i]==1) {data$alldefects[i]="C"}

if ( Stains[i]==1) {data$alldefects[i]="D"}

if ( Dirtiness[i]==1) {data$alldefects[i]="E"}

if ( Bumps[i]==1) {data$alldefects[i]="F"}

if ( Other_Faults[i]==1) {data$alldefects[i]="G"} }

data$alldefects=factor(data$alldefects)

set.seed(1)

trainingsample=sample(1:nrow(data), size=0.70*nrow(data))

train=data[trainingsample,]

test=data[-trainingsample,]

write.csv(train,file="exportedtrainingdata.csv")

write.csv(test,file="exportedtestingdata.csv")

train2=train[,-(28:34)]

test2=test[,-(28:34)]

test.alldefects=test2[,28]

#LDA

lda.model= lda(alldefects~., data = train2)

lda_pred= predict(lda.model, test2)

table(lda_pred$class, test.alldefects)

mean(lda_pred$class!= test.alldefects)

mean(lda_pred$class== test.alldefects)

lda.cv=lda(alldefects~.,test2, CV=TRUE)

table(lda.cv$class,test.alldefects)

mean(lda.cv$class!= test.alldefects)

predictions <- as.numeric(lda_pred$class, type="response")

multiclass.roc(test.alldefects, predictions, plot=T)

y=rep(0,length(lda_pred$class))

y[lda_pred$class==test.alldefects]=1

x=rep(0,length(test.alldefects))

x[test.alldefects==test.alldefects]=1

roc(x,y,plot=TRUE,main="LDA")

predictions_lda <- as.numeric(lda_pred,type="vote")

multiclass.roc(test.alldefects, predictions_lda, plot=T)


#qda

qda.model= qda(alldefects~., data = train2)

qda_pred= predict(qda.model, test2)

table(qda_pred$class, test.alldefects)

mean(qda_pred$class!= test.alldefects)

##tree

library(tree)

tree1=tree(train2$alldefects~.,data=train2)

plot(tree1)

text(tree1 ,pretty =0)

tree.pred=predict(tree1,test2,type="class")

table(tree.pred ,test.alldefects)

mean(tree.pred!=test.alldefects)

predictions_tree <- as.numeric(tree.pred,type="response")

multiclass.roc(test.alldefects, predictions_tree, plot=T)

##pruning

set.seed (1)

cv.data =cv.tree(tree1 ,FUN=prune.misclass )

names(cv.data)

cv.data

par(mfrow =c(1,1))

plot(cv.data$size ,cv.data$dev ,type="b")

plot(cv.data$k ,cv.data$dev ,type="b")

prune.data = prune.misclass(tree1 ,best =9)

plot(prune.data)

text(prune.data,pretty =0)

tree.pred2=predict(prune.data , test2 ,type="class")

table(tree.pred2 ,test.alldefects)

mean(tree.pred2!=test.alldefects)

predictions_tree <- as.numeric(tree.pred2,type="response")

multiclass.roc(test.alldefects, predictions_tree, plot=T)

## bAGGING

set.seed(1)
library(randomForest)
bag.data = randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2, mtry=10, importance=TRUE)

bag.data =randomForest(alldefects~.,data=train2 , mtry=27,importance =TRUE)

bag.data

yhat.bag = predict (bag.data ,test2)

plot(yhat.bag , test.alldefects)

abline (0,1)

table(yhat.bag, test.alldefects)

mean( yhat.bag!=test.alldefects)

predictions_bag <- as.numeric(yhat.bag,type="response")

multiclass.roc(test.alldefects, predictions_bag, plot=T)

#randomforest

set.seed (1)

library(randomForest)

rf =randomForest(alldefects~.,data=train2 , importance =TRUE)

yhat.rf = predict (rf ,test2)

table(yhat.rf, test.alldefects)

mean( yhat.rf !=test.alldefects)

predictions <- as.numeric(predict(rf, test2, type = 'response'))

multiclass.roc(test.alldefects, predictions, plot=T)

#randomforest with important predictors

set.seed (1)

library(randomForest)

rrf = randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2, importance=TRUE)

yhat.rrf = predict (rrf ,test2)

table(yhat.rrf, test.alldefects)

mean( yhat.rrf !=test.alldefects)

#randomforest with Pastry only

set.seed (1)

train_pastry$Pastry=factor(train_pastry$Pastry)

test_pastry$Pastry=factor(test_pastry$Pastry)

rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)

yhat.rf_pastry = predict (rf_pastry ,test_pastry)

table(yhat.rf_pastry, test_pastry[,28])

mean( yhat.rf_pastry!=test_pastry[,28])

#randomforest with z_scratch only


set.seed (1)

train_zs$Z_Scratch=factor(train_zs$Z_Scratch)

test_zs$Z_Scratch=factor(test_zs$Z_Scratch)

rf_zs =randomForest(Z_Scratch~.,data=train_zs, importance =TRUE)

yhat.rf_zs = predict (rf_zs ,test_zs)

table(yhat.rf_zs, test_zs[,28])

mean( yhat.rf_zs!=test_zs[,28])

#randomforest with K_scratch only

set.seed (1)

train_ks$K_Scratch=factor(train_ks$K_Scratch)

test_ks$K_Scratch=factor(test_ks$K_Scratch)

rf_ks =randomForest(K_Scratch~.,data=train_ks, importance =TRUE)

yhat.rf_ks = predict (rf_ks ,test_ks)

table(yhat.rf_ks, test_ks[,28])

mean( yhat.rf_ks!=test_ks[,28])

#randomforest with stains only

set.seed (1)

train_stains$Stains=factor(train_stains$Stains)

test_stains$Stains=factor(test_stains$Stains)

rf_stains =randomForest(Stains~.,data=train_stains, importance =TRUE)

yhat.rf_stains = predict (rf_stains ,test_stains)

table(yhat.rf_stains, test_stains[,28])

mean( yhat.rf_stains!=test_stains[,28])

#randomforest with dirt only

set.seed (1)

train_dirt$Dirtiness=factor(train_dirt$Dirtiness)

test_dirt$Dirtiness=factor(test_dirt$Dirtiness)

rf_dirt =randomForest(Dirtiness~.,data=train_dirt, importance =TRUE)

yhat.rf_dirt = predict (rf_dirt ,test_dirt)

table(yhat.rf_dirt, test_dirt[,28])

mean( yhat.rf_dirt!=test_dirt[,28])

#randomforest with bumps only

set.seed (1)

train_bumps$Bumps=factor(train_bumps$Bumps)

test_bumps$Bumps=factor(test_bumps$Bumps)

rf_bumps =randomForest(Bumps~.,data=train_bumps, importance =TRUE)

yhat.rf_bumps = predict (rf_bumps ,test_bumps)

table(yhat.rf_bumps, test_bumps[,28])

mean( yhat.rf_bumps!=test_bumps[,28])

#randomforest with other faults only


set.seed (1)

train_of$Other_Faults=factor(train_of$Other_Faults)

test_of$Other_Faults=factor(test_of$Other_Faults)

rf_of =randomForest(Other_Faults~.,data=train_of, importance =TRUE)

yhat.rf_of = predict (rf_of ,test_of)

table(yhat.rf_of, test_of[,28])

mean( yhat.rf_of!=test_of[,28])

rf.cv=randomForest(train_of$Other_Faults~.,data=train_of, CV=TRUE)

table(rf.cv$predicted, train_of[,28])   # out-of-bag predictions vs. actual

r = randomForest(alldefects~., data = train2, importance =TRUE, do.trace = 100)

varImpPlot(r)

###################################################################logistic regression

#####################################################

#000000000000000000000000000000000000000000000000_pastry

train_pastry=train[,-c(29,30,31,32,33,34,35)]

fix(train_pastry)

test_pastry= test[,-c(29,30,31,32,33,34,35)]

attach(train_pastry)

attach(test_pastry)

log_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family = "binomial")

summary(log_pastry)

log_pastry_pred = predict(log_pastry, test_pastry, type ="response")

log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment

log_pastry_pred_y[log_pastry_pred> 0.5]= 1

table(log_pastry_pred_y, test_pastry[,28])

mean(log_pastry_pred_y != test_pastry[,28])

#ROC

y=rep(0,length(log_pastry_pred_y))

y[log_pastry_pred_y==1]=1

x=rep(0,length(test_pastry[,28]))

x[test_pastry[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ PASTRY")

# cross validation

cv_pastry = glm(Pastry~LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family = "binomial")


cv.glm(train_pastry,cv_pastry,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_zs

train_zs=train[,-c(28,30,31,32,33,34,35)]

fix(train_zs)

test_zs= test[,-c(28,30,31,32,33,34,35)]

attach(train_zs)

attach(test_zs)

log_zs = glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Luminosity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300, data=train_zs, family = "binomial")

summary(log_zs)

log_zs_pred = predict(log_zs, test_zs, type ="response")

log_zs_pred_y = rep(0, length(test_zs[,28])) # default assignment

log_zs_pred_y[log_zs_pred> 0.5]= 1

table(log_zs_pred_y, test_zs[,28])

mean(log_zs_pred_y != test_zs[,28])

#ROC

y=rep(0,length(log_zs_pred_y))

y[log_zs_pred_y==1]=1

x=rep(0,length(test_zs[,28]))

x[test_zs[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Z_skratch")

#CV

log_zs=step(glm(Z_Scratch~.,data=train_zs,family="binomial"),direction="backward")

cv_zs = glm(Z_Scratch~Pixels_Areas+Edges_X_Index+Sum_of_Luminosity+X_Perimeter+Y_Perimeter+Log_Y_Index+Y_Maximum+Y_Minimum+Steel_Plate_Thickness+X_Minimum+X_Maximum+Orientation_Index+Edges_Index+Minimum_of_Luminosity+Maximum_of_Luminosity+Length_of_Conveyor+TypeOfSteel_A300, data=train_zs, family = "binomial")

cv.glm(train_zs,cv_zs,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_ks

train_ks=train[,-c(28,29,31,32,33,34,35)]

test_ks= test[,-c(28,29,31,32,33,34,35)]

attach(train_ks)

attach(test_ks)

log_ks = glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness, data=train_ks, family = "binomial")

summary(log_ks)

log_ks_pred = predict(log_ks, test_ks, type ="response")


log_ks_pred_y = rep(0, length(test_ks[,28])) # default assignment

log_ks_pred_y[log_ks_pred> 0.5]= 1

table(log_ks_pred_y, test_ks[,28])

mean(log_ks_pred_y != test_ks[,28])

log_ks=step(glm(K_Scratch~.,data=train_ks,family="binomial"),direction="backward")

#ROC

y=rep(0,length(log_ks_pred_y))

y[log_ks_pred_y==1]=1

x=rep(0,length(test_ks[,28]))

x[test_ks[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ k_skratch")

#CV

cv_ks = glm(K_Scratch~X_Maximum+X_Minimum+Outside_X_Index+Square_Index+SigmoidOfAreas+Y_Maximum+Y_Minimum+X_Perimeter+Y_Perimeter+Minimum_of_Luminosity+Edges_Index+Outside_Global_Index+Edges_X_Index+Log_X_Index+Empty_Index+Orientation_Index+Log_Y_Index+Luminosity_Index+Steel_Plate_Thickness, data=train_ks, family = "binomial")

cv.glm(train_ks,cv_ks,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_Stains

train_stains=train[,-c(28,29,30,32,33,34,35)]

test_stains= test[,-c(28,29,30,32,33,34,35)]

log_stains = glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_X_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family="binomial")
# refit using Outside_Global_Index instead of Outside_X_Index
log_stains = glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_Global_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family = "binomial")

summary(log_stains)

log_stains_pred = predict(log_stains, test_stains, type ="response")

log_stains_pred_y = rep(0, length(test_stains[,28])) # default assignment

log_stains_pred_y[log_stains_pred> 0.5]= 1

table(log_stains_pred_y, test_stains[,28])

mean(log_stains_pred_y != test_stains[,28])

log_stains=step(glm(Stains~.,data=train_stains,family="binomial"),direction="backward")

#ROC

y=rep(0,length(log_stains_pred_y))

y[log_stains_pred_y==1]=1

x=rep(0,length(test_stains[,28]))


x[test_stains[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Stains")

# cross validation

cv_stains = glm(Stains~Y_Perimeter+Minimum_of_Luminosity+X_Perimeter+SigmoidOfAreas+Steel_Plate_Thickness+Edges_Index+LogOfAreas+Orientation_Index+X_Minimum+Length_of_Conveyor+Edges_Y_Index+Sum_of_Luminosity+Maximum_of_Luminosity+Outside_Global_Index+Y_Minimum+Log_X_Index+Empty_Index+Square_Index+Y_Maximum, data=train_stains, family = "binomial")

cv.glm(train_stains,cv_stains,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_Dirtiness

train_dirt=train[,-c(28,29,30,31,33,34,35)]

test_dirt= test[,-c(28,29,30,31,33,34,35)]

log_dirt = glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_Maximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index, data=train_dirt, family = "binomial")

summary(log_dirt)

log_dirt_pred = predict(log_dirt, test_dirt, type ="response")

log_dirt_pred_y = rep(0, length(test_dirt[,28])) # default assignment

log_dirt_pred_y[log_dirt_pred> 0.5]= 1

table(log_dirt_pred_y, test_dirt[,28])

mean(log_dirt_pred_y != test_dirt[,28])

log_dirt=step(glm(Dirtiness~.,data=train_dirt,family="binomial"),direction="backward")

# cross validation

cv_dirt = glm(Dirtiness~LogOfAreas+Empty_Index+Orientation_Index+Edges_Index+Y_Maximum+X_Perimeter+X_Minimum+X_Maximum+Length_of_Conveyor+Outside_X_Index+Y_Perimeter+Square_Index, data=train_dirt, family = "binomial")

cv.glm(train_dirt,cv_dirt,K=10)$delta[1]

y=rep(0,length(log_dirt_pred_y))

y[log_dirt_pred_y==1]=1

x=rep(0,length(test_dirt[,28]))

x[test_dirt[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ Dirtiness")

#000000000000000000000000000000000000000000000000_Bumps

train_bumps=train[,-c(28,29,30,31,32,34,35)]

test_bumps= test[,-c(28,29,30,31,32,34,35)]

log_bumps = glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+Y_Minimum+TypeOfSteel_A300, data=train_bumps, family = "binomial")

summary(log_bumps)

log_bumps_pred = predict(log_bumps, test_bumps, type ="response")


log_bumps_pred_y = rep(0, length(test_bumps[,28])) # default assignment

log_bumps_pred_y[log_bumps_pred> 0.5]= 1

table(log_bumps_pred_y, test_bumps[,28])

mean(log_bumps_pred_y != test_bumps[,28])

log_bumps=step(glm(Bumps~.,data=train_bumps,family="binomial"),direction="backward")

# cross validation

cv_bumps = glm(Bumps~Log_X_Index+Minimum_of_Luminosity+Log_Y_Index+Y_Perimeter+Square_Index+X_Maximum+Steel_Plate_Thickness+Maximum_of_Luminosity+Luminosity_Index+Edges_Y_Index+Outside_X_Index+Edges_Index+Y_Maximum+Y_Minimum+TypeOfSteel_A300, data=train_bumps, family = "binomial")

cv.glm(train_bumps,cv_bumps,K=10)$delta[1]

y=rep(0,length(log_bumps_pred_y))

y[log_bumps_pred_y==1]=1

x=rep(0,length(test_bumps[,28]))

x[test_bumps[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ bumps")

#000000000000000000000000000000000000000000000000_otherfaults

train_of=train[,-c(28,29,30,31,32,33,35)]

test_of= test[,-c(28,29,30,31,32,33,35)]

log_of = glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Conveyor+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel_Plate_Thickness, data=train_of, family = "binomial")

summary(log_of)

log_of_pred = predict(log_of, test_of, type ="response")

log_of_pred_y = rep(0, length(test_of[,28])) # default assignment

log_of_pred_y[log_of_pred> 0.5]= 1

table(log_of_pred_y, test_of[,28])

mean(log_of_pred_y != test_of[,28])

log_of=step(glm(Other_Faults~.,data=train_of,family="binomial"),direction="backward")

#ROC

y=rep(0,length(log_of_pred_y))

y[log_of_pred_y==1]=1

x=rep(0,length(test_of[,28]))

x[test_of[,28]==1]=1

roc(x,y,plot=TRUE,main="LOGISTIC REGRESSION _ other faults")

# cross validation

cv_of = glm(Other_Faults~Edges_X_Index+Log_Y_Index+Outside_Global_Index+Edges_Y_Index+Y_Perimeter+Length_of_Conveyor+TypeOfSteel_A300+Luminosity_Index+X_Perimeter+Log_X_Index+Minimum_of_Luminosity+Orientation_Index+Steel_Plate_Thickness, data=train_of, family = "binomial")

cv.glm(train_of,cv_of,K=10)$delta[1]

##########PCA

#PCA on complete data set

datap=data[,-(28:35)]

fit <- princomp(datap, cor=TRUE)

summary(fit) # print variance accounted for

loadings(fit) # pc loadings

plot(fit,type="lines") # scree plot

fit$scores # the principal components

biplot(fit)

axes <- predict(fit, newdata = datap)

head(axes, 4)

fix(axes)

data1=axes[,1:7]

write.csv(data1,file="pcadata.csv")

data2=data.frame(data,data1)

write.csv(data2,file="comb_data.csv")

#SVM

install.packages("e1071")

library(e1071)

svm.fit=svm(alldefects~.,data=train2,type="C",kernel="polynomial",degree=3, cost=15)

summary(svm.fit)

predicted=predict(svm.fit,test2)

table(predicted,test.alldefects)

mean(predicted!=test.alldefects)

plot(svm.fit, train2, Length_of_Conveyor~X_Maximum, slice=list(X_Perimeter=3, Y_Perimeter=4), svSymbol=1, dataSymbol=2, color.palette=terrain.colors)

# ROC

predictions=as.numeric(predicted,type="response")

multiclass.roc(test.alldefects,predictions,plot=T,main="ROC for SVM")

#ANN

library(nnet)

train.nnet<-nnet(alldefects~.,data=train2,size=20,rang=0.1,Hess=FALSE,decay=0.001,maxit=10000)

test.nnet<-predict(train.nnet,test2,type=("class"))

table(test2$alldefects,test.nnet)

mean(test.nnet!=test2$alldefects)

library(pROC)

predictions=as.numeric(test.nnet,type="response")


multiclass.roc(test2$alldefects,predictions,plot=T,main="ROC for ANN")

#read data stored in CSV file.

data=read.csv("Steel_faults.csv",header=TRUE)

attach(data)

x=data[,1:27] # input variables

y=data[,28:34] # response variables

n=1941 # total number of observations

n1=round(n*0.7) # number of observations for training

samp=sample(1:n,n1,replace=FALSE) # to select random observation

## following is the userdefined function to obtain confusion matrix

test.cl = function(true, pred) {

true = max.col(true)

cres = max.col(pred)

table(true, cres)

}

###another package for NNA

install.packages("RSNNS")

library(RSNNS)

model=mlp(x[samp,], y[samp,], size=c(10,10,5),linOut=F)

model=mlp(train2[,-28], train2[,28], size=2,linOut=F)

#library(devtools)

#plot.nnet(model)

test.cl(y[-samp,], predict(model, x[-samp,]))  # confusion matrix for testing data
test.cl(y[samp,], fitted.values(model))        # confusion matrix for training data

#C50
library(C50)

crx <- data[ sample( nrow( data ) ), ]

X <- crx[,1:27]

y <- crx[,35]

trainx<- X[1:1358,]

trainy <- y[1:1358]

testx <- X[1358:1941,]

testy <- y[1358:1941]

model <- C5.0( trainx, trainy, trials=75 )

p <- predict( model, testx, type="class" )

sum( p == testy ) / length( p )

table(p, testy)

mean(p != testy)

predictions_c5 <- as.numeric(p,type="response")

multiclass.roc(testy, predictions_c5, plot=T)


2. On PCA data

library(ISLR)
library(boot)
library(MASS)   # for lda()/qda() used below

data=read.csv(file.choose(), header=T)

attach(data)

data$alldefects="A"

for(i in 1:1941) {

if ( Z_Scratch[i]==1) {data$alldefects[i]="B"}

if ( K_Scratch[i]==1) {data$alldefects[i]="C"}

if ( Stains[i]==1) {data$alldefects[i]="D"}

if ( Dirtiness[i]==1) {data$alldefects[i]="E"}

if ( Bumps[i]==1) {data$alldefects[i]="F"}

if ( Other_Faults[i]==1) {data$alldefects[i]="G"} }

data$alldefects=factor(data$alldefects)

set.seed(1)

trainingsample=sample(1:nrow(data), size=0.70*nrow(data))

train=data[trainingsample,]

test=data[-trainingsample,]

write.csv(train,file="exportedtrainingdata.csv")

write.csv(test,file="exportedtestingdata.csv")

train2=train[,-(8:14)]

test2=test[,-(8:14)]

test.alldefects=test2[,8]

#LDA

lda.model= lda(alldefects~., data = train2)

lda_pred= predict(lda.model, test2)

table(lda_pred$class, test.alldefects)

mean(lda_pred$class!= test.alldefects)

#qda

qda.model= qda(alldefects~., data = train2)

qda_pred= predict(qda.model, test2)

table(qda_pred$class, test.alldefects)

mean(qda_pred$class!= test.alldefects)

##tree

library(tree)

tree1 = tree(train2$alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2)

plot(tree1)

text(tree1 ,pretty =0)

tree.pred=predict(tree1,test2,type="class")


table(tree.pred ,test.alldefects)

mean(tree.pred!=test.alldefects)

##pruning

set.seed (1)

cv.data =cv.tree(tree1 ,FUN=prune.misclass )

names(cv.data)

cv.data

par(mfrow =c(1,2))

plot(cv.data$size ,cv.data$dev ,type="b")

plot(cv.data$k ,cv.data$dev ,type="b")

prune.data = prune.misclass(tree1 ,best =9)

plot(prune.data)

text(prune.data,pretty =0)

tree.pred2=predict(prune.data , test2 ,type="class")

table(tree.pred2 ,test.alldefects)

mean(tree.pred2!=test.alldefects)

## bAGGING with important predictors

set.seed (1)

bag.data = randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2, mtry=13, importance=TRUE)

bag.data

yhat.bag = predict (bag.data ,test2)

plot(yhat.bag , test.alldefects)

abline (0,1)

mean( yhat.bag!=test.alldefects)

#randomforest

set.seed (1)

library(randomForest)

rf =randomForest(alldefects~.,data=train2 , importance =TRUE)

yhat.rf = predict (rf ,test2)

mean( yhat.rf !=test.alldefects)

#randomforest with important predictors

set.seed (1)

library(randomForest)

rrf = randomForest(alldefects~Pixels_Areas+Length_of_Conveyor+Log_X_Index+Sum_of_Luminosity+Steel_Plate_Thickness+Outside_X_Index+LogOfAreas+X_Minimum+Minimum_of_Luminosity, data=train2, importance=TRUE)

yhat.rrf = predict (rrf ,test2)

mean( yhat.rrf !=test.alldefects)


#randomforest with Pastry only

set.seed (1)

library(randomForest)

train_pastry$Pastry=factor(train_pastry$Pastry)

test_pastry$Pastry=factor(test_pastry$Pastry)

rf_pastry =randomForest(Pastry~.,data=train_pastry, importance =TRUE)

yhat.rf_pastry = predict (rf_pastry ,test_pastry)

table(yhat.rf_pastry, test_pastry[,8])

mean( yhat.rf_pastry!=test_pastry[,8])

#randomforest with z_scratch only

set.seed (1)

library(randomForest)

train_zs$Z_Scratch=factor(train_zs$Z_Scratch)

test_zs$Z_Scratch=factor(test_zs$Z_Scratch)

rf_zs =randomForest(Z_Scratch~.,data=train_zs, importance =TRUE)

yhat.rf_zs = predict (rf_zs ,test_zs)

table(yhat.rf_zs, test_zs[,8])

mean( yhat.rf_zs!=test_zs[,8])

#randomforest with K_scratch only

set.seed (1)

library(randomForest)

train_ks$K_Scratch=factor(train_ks$K_Scratch)

test_ks$K_Scratch=factor(test_ks$K_Scratch)

rf_ks =randomForest(K_Scratch~.,data=train_ks, importance =TRUE)

yhat.rf_ks = predict (rf_ks ,test_ks)

table(yhat.rf_ks, test_ks[,8])

mean( yhat.rf_ks!=test_ks[,8])

#randomforest with stains only

set.seed (1)

library(randomForest)

train_stains$Stains=factor(train_stains$Stains)

test_stains$Stains=factor(test_stains$Stains)

rf_stains =randomForest(Stains~.,data=train_stains, importance =TRUE)

yhat.rf_stains = predict (rf_stains ,test_stains)

table(yhat.rf_stains, test_stains[,8])

mean( yhat.rf_stains!=test_stains[,8])

#randomforest with dirt only

set.seed (1)

library(randomForest)

train_dirt$Dirtiness=factor(train_dirt$Dirtiness)

test_dirt$Dirtiness=factor(test_dirt$Dirtiness)

rf_dirt =randomForest(Dirtiness~.,data=train_dirt, importance =TRUE)


yhat.rf_dirt = predict (rf_dirt ,test_dirt)

table(yhat.rf_dirt, test_dirt[,8])

mean( yhat.rf_dirt!=test_dirt[,8])

#randomforest with bumps only

set.seed (1)

library(randomForest)

train_bumps$Bumps=factor(train_bumps$Bumps)

test_bumps$Bumps=factor(test_bumps$Bumps)

rf_bumps =randomForest(Bumps~.,data=train_bumps, importance =TRUE)

yhat.rf_bumps = predict (rf_bumps ,test_bumps)

table(yhat.rf_bumps, test_bumps[,8])

mean( yhat.rf_bumps!=test_bumps[,8])

#randomforest with other faults only

set.seed (1)

library(randomForest)

train_of$Other_Faults=factor(train_of$Other_Faults)

test_of$Other_Faults=factor(test_of$Other_Faults)

rf_of =randomForest(Other_Faults~.,data=train_of, importance =TRUE)

yhat.rf_of = predict (rf_of ,test_of)

table(yhat.rf_of, test_of[,8])

mean( yhat.rf_of!=test_of[,8])

rf.cv=randomForest(train_of$Other_Faults~.,data=train_of, CV=TRUE)

table(rf.cv$predicted, train_of[,8])   # out-of-bag predictions vs. actual

r = randomForest(alldefects~., data = train2, importance =TRUE, do.trace = 100)

varImpPlot(r)

###################################################################logistic regression

#000000000000000000000000000000000000000000000000_pastry

train_pastry=train[,-c(9,10,11,12,13,14,15)]

fix(train_pastry)

test_pastry= test[,-c(9,10,11,12,13,14,15)]

attach(train_pastry)

attach(test_pastry)

log_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")

summary(log_pastry)

log_pastry_pred = predict(log_pastry, test_pastry, type ="response")

log_pastry_pred_y = rep(0, length(test_pastry[,8])) # default assignment

log_pastry_pred_y[log_pastry_pred> 0.5]= 1

table(log_pastry_pred_y, test_pastry[,8])


mean(log_pastry_pred_y != test_pastry[,8])

log_pastry=step(glm(Pastry~., data=train_pastry,family="binomial"),direction="backward")

#####################################################

#000000000000000000000000000000000000000000000000_pastry

train_pastry=train[,-c(29,30,31,32,33,34,35)]

fix(train_pastry)

test_pastry= test[,-c(29,30,31,32,33,34,35)]

attach(train_pastry)

attach(test_pastry)

log_pastry = glm(Pastry~Y_Minimum+LogOfAreas+TypeOfSteel_A300+Sum_of_Luminosity+Log_X_Index+Square_Index+Orientation_Index+Log_Y_Index+Maximum_of_Luminosity+X_Maximum+X_Minimum+Length_of_Conveyor+Minimum_of_Luminosity, data=train_pastry, family = "binomial")

summary(log_pastry)

log_pastry_pred = predict(log_pastry, test_pastry, type ="response")

log_pastry_pred_y = rep(0, length(test_pastry[,28])) # default assignment

log_pastry_pred_y[log_pastry_pred> 0.5]= 1

table(log_pastry_pred_y, test_pastry[,28])

mean(log_pastry_pred_y != test_pastry[,28])

# cross validation

cv_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")

cv.glm(train_pastry,cv_pastry,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_zs

train_zs=train[,-c(8,10,11,12,13,14,15)]

fix(train_zs)

test_zs= test[,-c(8,10,11,12,13,14,15)]

attach(train_zs)

attach(test_zs)

log_zs = glm(Z_Scratch~.,data=train_zs,family = "binomial")

summary(log_zs)

log_zs_pred = predict(log_zs, test_zs, type ="response")

log_zs_pred_y = rep(0, length(test_zs[,8])) # default assignment

log_zs_pred_y[log_zs_pred> 0.5]= 1

table(log_zs_pred_y, test_zs[,8])

mean(log_zs_pred_y != test_zs[,8])

cv_zs = glm(Z_Scratch~.,data=train_zs,family = "binomial")

cv.glm(train_zs,cv_zs,K=10)$delta[1]

log_zs=step(glm(Z_Scratch~.,data=train_zs,family="binomial"),direction="backward")

#000000000000000000000000000000000000000000000000_ks


train_ks=train[,-c(8,9,11,12,13,14,15)]

test_ks= test[,-c(8,9,11,12,13,14,15)]

attach(train_ks)

attach(test_ks)

log_ks = glm(K_Scratch~.,data=train_ks,family = "binomial")

summary(log_ks)

log_ks_pred = predict(log_ks, test_ks, type ="response")

log_ks_pred_y = rep(0, length(test_ks[,8])) # default assignment

log_ks_pred_y[log_ks_pred> 0.5]= 1

table(log_ks_pred_y, test_ks[,8])

mean(log_ks_pred_y != test_ks[,8])

cv_ks = glm(K_Scratch~.,data=train_ks,family = "binomial")

cv.glm(train_ks,cv_ks,K=10)$delta[1]

log_ks=step(glm(K_Scratch~.,data=train_ks,family="binomial"),direction="backward")

#000000000000000000000000000000000000000000000000_Stains

train_stains=train[,-c(7,8,9,10,12,13,14,15)]

test_stains= test[,-c(7,8,9,10,12,13,14,15)]

log_stains = glm(Stains~.,data=train_stains,family = "binomial")

summary(log_stains)

log_stains_pred = predict(log_stains, test_stains, type ="response")

log_stains_pred_y = rep(0, length(test_stains[,7])) # default assignment

log_stains_pred_y[log_stains_pred> 0.5]= 1

table(log_stains_pred_y, test_stains[,7])

mean(log_stains_pred_y != test_stains[,7])

log_stains=step(glm(Stains~.,data=train_stains,family="binomial"),direction="backward")

# cross validation

cv_stains = glm(Stains~.,data=train_stains,family = "binomial")

cv.glm(train_stains,cv_stains,K=10)$delta[1]

cv_bumps = glm(Bumps~.,data=train_bumps,family = "binomial")

cv.glm(train_bumps,cv_bumps,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_Dirtiness

train_dirt=train[,-c(8,9,10,11,13,14,15)]

test_dirt= test[,-c(8,9,10,11,13,14,15)]

log_dirt = glm(Dirtiness~.,data=train_dirt,family = "binomial")

summary(log_dirt)

log_dirt_pred = predict(log_dirt, test_dirt, type ="response")

log_dirt_pred_y = rep(0, length(test_dirt[,8])) # default assignment

log_dirt_pred_y[log_dirt_pred> 0.5]= 1

table(log_dirt_pred_y, test_dirt[,8])


mean(log_dirt_pred_y != test_dirt[,8])

# cross validation

cv_dirt = glm(Dirtiness~.,data=train_dirt,family = "binomial")

cv.glm(train_dirt,cv_dirt,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_Bumps

train_bumps=train[,-c(8,9,10,11,12,14,15)]

test_bumps= test[,-c(8,9,10,11,12,14,15)]

log_bumps = glm(Bumps~.,data=train_bumps,family = "binomial")

summary(log_bumps)

log_bumps_pred = predict(log_bumps, test_bumps, type ="response")

log_bumps_pred_y = rep(0, length(test_bumps[,8])) # default assignment

log_bumps_pred_y[log_bumps_pred> 0.5]= 1

table(log_bumps_pred_y, test_bumps[,8])

mean(log_bumps_pred_y != test_bumps[,8])

log_bumps=step(glm(Bumps~.,data=train_bumps,family="binomial"),direction="backward")

# cross validation

cv_bumps = glm(Bumps~.,data=train_bumps,family = "binomial")

cv.glm(train_bumps,cv_bumps,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_otherfaults

train_of=train[,-c(8,9,10,11,12,13,15)]

test_of= test[,-c(8,9,10,11,12,13,15)]

log_of = glm(Other_Faults~.,data=train_of,family = "binomial")

summary(log_of)

log_of_pred = predict(log_of, test_of, type ="response")

log_of_pred_y = rep(0, length(test_of[,8])) # default assignment

log_of_pred_y[log_of_pred> 0.5]= 1

table(log_of_pred_y, test_of[,8])

mean(log_of_pred_y != test_of[,8])

log_of=step(glm(Other_Faults~.,data=train_of,family="binomial"),direction="backward")

# cross validation

cv_of = glm(Other_Faults~.,data=train_of,family = "binomial")

cv.glm(train_of,cv_of,K=10)$delta[1]

3. On Combined dataset (Original+PCA)

library(ISLR)

library(boot)

library(MASS)


data=read.csv(file.choose(), header=T)

attach(data)

data$alldefects="A"

for(i in 1:1941) {

if ( Z_Scratch[i]==1) {data$alldefects[i]="B"}

if ( K_Scratch[i]==1) {data$alldefects[i]="C"}

if ( Stains[i]==1) {data$alldefects[i]="D"}

if ( Dirtiness[i]==1) {data$alldefects[i]="E"}

if ( Bumps[i]==1) {data$alldefects[i]="F"}

if ( Other_Faults[i]==1) {data$alldefects[i]="G"} }

data$alldefects=factor(data$alldefects)

set.seed(1)

trainingsample=sample(1:nrow(data), size=0.70*nrow(data))

train=data[trainingsample,]

test=data[-trainingsample,]

write.csv(train,file="exportedtrainingdata.csv")

write.csv(test,file="exportedtestingdata.csv")

train2=train[,-(35:41)]

test2=test[,-(35:41)]

test.alldefects=test2[,35]

#LDA

lda.model= lda(alldefects~., data = train2)

lda_pred= predict(lda.model, test2)

table(lda_pred$class, test.alldefects)

mean(lda_pred$class!= test.alldefects)

#qda

qda.model= qda(alldefects~., data = train2)

qda_pred= predict(qda.model, test2)

table(qda_pred$class, test.alldefects)

mean(qda_pred$class!= test.alldefects)

##tree

library(tree)

tree1 = tree(alldefects~., data = train2) # naming the response inside the formula keeps it out of the "." predictor set

plot(tree1)

text(tree1, pretty = 0)

tree.pred = predict(tree1, test2, type = "class")

table(tree.pred, test.alldefects)

mean(tree.pred != test.alldefects)

##pruning


set.seed(1)

cv.data = cv.tree(tree1, FUN = prune.misclass) # CV over subtree sizes, scored by misclassification count

names(cv.data)

cv.data

par(mfrow = c(1,1))

plot(cv.data$size, cv.data$dev, type = "b")

plot(cv.data$k, cv.data$dev, type = "b")
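The best subtree size was read off these plots; it can also be extracted programmatically:

best.size = cv.data$size[which.min(cv.data$dev)] # size with the smallest CV misclassification count

best.size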

prune.data = prune.misclass(tree1, best = 9)

plot(prune.data)

text(prune.data, pretty = 0)

tree.pred2 = predict(prune.data, test2, type = "class")

table(tree.pred2, test.alldefects)

mean(tree.pred2 != test.alldefects)

## Bagging with important predictors

set.seed(1)

library(randomForest) # loaded here because randomForest is first called in this block

bag.data = randomForest(alldefects~., data = train2, mtry = 10, importance = TRUE) # 10 candidate predictors per split; see the true-bagging sketch below

bag.data

yhat.bag = predict(bag.data, test2)

plot(yhat.bag, test.alldefects) # spineplot of predicted vs. actual classes

abline(0,1) # carryover from regression examples; not meaningful for factor predictions

table(yhat.bag, test.alldefects)

mean(yhat.bag != test.alldefects)
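With mtry = 10 the fit above is really a random forest over 10 candidate predictors per split; strict bagging considers every predictor at each split. A minimal sketch on the same frames, assuming alldefects is the 35th and last column of train2 as built above:

set.seed(1)

bag.full = randomForest(alldefects~., data = train2, mtry = ncol(train2) - 1, importance = TRUE) # mtry = p gives true bagging

mean(predict(bag.full, test2) != test.alldefects)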

#randomforest

set.seed(1)

library(randomForest)

rf = randomForest(alldefects~., data = train2, importance = TRUE) # default mtry = floor(sqrt(p)) for classification

yhat.rf = predict(rf, test2)

table(yhat.rf, test.alldefects)

mean(yhat.rf != test.alldefects)
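The forest above keeps the default mtry; tuneRF from the same package searches over mtry using out-of-bag error. A sketch, assuming alldefects is column 35 of train2 as above:

set.seed(1)

tuneRF(train2[,-35], train2$alldefects, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)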

#randomforest with important predictors

set.seed(1)

library(randomForest)

rrf = randomForest(alldefects ~ Pixels_Areas + Length_of_Conveyor + Log_X_Index + Sum_of_Luminosity + Steel_Plate_Thickness + Outside_X_Index + LogOfAreas + X_Minimum + Minimum_of_Luminosity, data = train2, importance = TRUE)

yhat.rrf = predict(rrf, test2)

table(yhat.rrf, test.alldefects)

mean(yhat.rrf != test.alldefects)

#randomforest with Pastry only (uses the per-defect train/test frames built in the logistic regression section further below)

set.seed(1)


train_pastry$Pastry = factor(train_pastry$Pastry)

test_pastry$Pastry = factor(test_pastry$Pastry)

rf_pastry = randomForest(Pastry~., data = train_pastry, importance = TRUE)

yhat.rf_pastry = predict(rf_pastry, test_pastry)

table(yhat.rf_pastry, test_pastry[,35])

mean(yhat.rf_pastry != test_pastry[,35])

#randomforest with z_scratch only

set.seed(1)

train_zs$Z_Scratch = factor(train_zs$Z_Scratch)

test_zs$Z_Scratch = factor(test_zs$Z_Scratch)

rf_zs = randomForest(Z_Scratch~., data = train_zs, importance = TRUE)

yhat.rf_zs = predict(rf_zs, test_zs)

table(yhat.rf_zs, test_zs[,35])

mean(yhat.rf_zs != test_zs[,35])

#randomforest with K_scratch only

set.seed(1)

train_ks$K_Scratch = factor(train_ks$K_Scratch)

test_ks$K_Scratch = factor(test_ks$K_Scratch)

rf_ks = randomForest(K_Scratch~., data = train_ks, importance = TRUE)

yhat.rf_ks = predict(rf_ks, test_ks)

table(yhat.rf_ks, test_ks[,35])

mean(yhat.rf_ks != test_ks[,35])

#randomforest with stains only

set.seed(1)

train_stains$Stains = factor(train_stains$Stains)

test_stains$Stains = factor(test_stains$Stains)

rf_stains = randomForest(Stains~., data = train_stains, importance = TRUE)

yhat.rf_stains = predict(rf_stains, test_stains)

table(yhat.rf_stains, test_stains[,35])

mean(yhat.rf_stains != test_stains[,35])

#randomforest with dirt only

set.seed(1)

train_dirt$Dirtiness = factor(train_dirt$Dirtiness)

test_dirt$Dirtiness = factor(test_dirt$Dirtiness)

rf_dirt = randomForest(Dirtiness~., data = train_dirt, importance = TRUE)

yhat.rf_dirt = predict(rf_dirt, test_dirt)

table(yhat.rf_dirt, test_dirt[,35])

mean(yhat.rf_dirt != test_dirt[,35])

#randomforest with bumps only

set.seed(1)

train_bumps$Bumps = factor(train_bumps$Bumps)


test_bumps$Bumps = factor(test_bumps$Bumps)

rf_bumps = randomForest(Bumps~., data = train_bumps, importance = TRUE)

yhat.rf_bumps = predict(rf_bumps, test_bumps)

table(yhat.rf_bumps, test_bumps[,35])

mean(yhat.rf_bumps != test_bumps[,35])

#randomforest with other faults only

set.seed(1)

train_of$Other_Faults = factor(train_of$Other_Faults)

test_of$Other_Faults = factor(test_of$Other_Faults)

rf_of = randomForest(Other_Faults~., data = train_of, importance = TRUE)

yhat.rf_of = predict(rf_of, test_of)

table(yhat.rf_of, test_of[,35])

mean(yhat.rf_of != test_of[,35])

# randomForest has no CV argument; its out-of-bag (OOB) predictions play the same role

rf.cv = randomForest(Other_Faults~., data = train_of)

table(rf.cv$predicted, train_of$Other_Faults)

r = randomForest(alldefects~., data = train2, importance = TRUE, do.trace = 100) # do.trace = 100 prints the OOB error every 100 trees

varImpPlot(r)
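varImpPlot ranks the variables graphically; the scores behind the plot can also be tabulated directly:

imp = importance(r)

head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]) # strongest predictors first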

###################################################################logistic regression

#000000000000000000000000000000000000000000000000_pastry

train_pastry=train[,-c(36,37,38,39,40,41,42)]

fix(train_pastry) # opens an interactive spreadsheet view of the training frame

test_pastry = test[,-c(36,37,38,39,40,41,42)]

attach(train_pastry)

attach(test_pastry) # note: both frames share column names, so the second attach masks the first

log_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")

summary(log_pastry)

log_pastry_pred = predict(log_pastry, test_pastry, type ="response")

log_pastry_pred_y = rep(0, nrow(test_pastry)) # default assignment

log_pastry_pred_y[log_pastry_pred > 0.5] = 1

table(log_pastry_pred_y, test_pastry$Pastry) # reference the response by name; in this frame Pastry is column 35, not 28

mean(log_pastry_pred_y != test_pastry$Pastry)
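Each defect class is rare, so the overall error rate can hide poor sensitivity. A sketch deriving both rates from the confusion table above (assuming both 0 and 1 appear among the predictions, so both row labels exist):

cm = table(log_pastry_pred_y, test_pastry$Pastry)

cm["1","1"] / sum(cm[,"1"]) # sensitivity: true positives / actual positives

cm["0","0"] / sum(cm[,"0"]) # specificity: true negatives / actual negatives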

# cross validation

cv_pastry = glm(Pastry~.,data=train_pastry,family = "binomial")

cv.glm(train_pastry,cv_pastry,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_zs

train_zs=train[,-c(35,37,38,39,40,41,42)]


fix(train_zs)

test_zs= test[,-c(35,37,38,39,40,41,42)]

attach(train_zs)

attach(test_zs)

log_zs = glm(Z_Scratch~.,data=train_zs,family = "binomial")

summary(log_zs)

log_zs_pred = predict(log_zs, test_zs, type ="response")

log_zs_pred_y = rep(0, nrow(test_zs)) # default assignment

log_zs_pred_y[log_zs_pred > 0.5] = 1

table(log_zs_pred_y, test_zs$Z_Scratch) # Z_Scratch is column 35 in this frame, not 28

mean(log_zs_pred_y != test_zs$Z_Scratch)

# cross validation_zs

cv_zs = glm(Z_Scratch~.,data=train_zs,family = "binomial")

cv.glm(train_zs,cv_zs,K=10)$delta[1]

log_zs=step(glm(Z_Scratch~.,data=train_zs,family="binomial"),direction="backward")

#000000000000000000000000000000000000000000000000_ks

train_ks=train[,-c(35,36,38,39,40,41,42)]

test_ks= test[,-c(35,36,38,39,40,41,42)]

attach(train_ks)

attach(test_ks)

log_ks = glm(K_Scratch~.,data=train_ks,family = "binomial")

summary(log_ks)

log_ks_pred = predict(log_ks, test_ks, type ="response")

log_ks_pred_y = rep(0, nrow(test_ks)) # default assignment

log_ks_pred_y[log_ks_pred > 0.5] = 1

table(log_ks_pred_y, test_ks$K_Scratch) # K_Scratch is column 35 in this frame, not 8

mean(log_ks_pred_y != test_ks$K_Scratch)

cv_ks = glm(K_Scratch~.,data=train_ks,family = "binomial")

cv.glm(train_ks,cv_ks,K=10)$delta[1]

log_ks=step(glm(K_Scratch~.,data=train_ks,family="binomial"),direction="backward")

#000000000000000000000000000000000000000000000000_Stains

train_stains=train[,-c(35,36,37,39,40,41,42)]

test_stains= test[,-c(35,36,37,39,40,41,42)]

log_stains = glm(Stains~.,data=train_stains,family = "binomial")

summary(log_stains)

log_stains_pred = predict(log_stains, test_stains, type ="response")

log_stains_pred_y = rep(0, nrow(test_stains)) # default assignment

log_stains_pred_y[log_stains_pred > 0.5] = 1

table(log_stains_pred_y, test_stains$Stains) # Stains is column 35 in this frame, not 8

mean(log_stains_pred_y != test_stains$Stains)


log_stains=step(glm(Stains~.,data=train_stains,family="binomial"),direction="backward")

# cross validation

cv_stains = glm(Stains~.,data=train_stains,family = "binomial")

cv.glm(train_stains,cv_stains,K=10)$delta[1]


#000000000000000000000000000000000000000000000000_Dirtiness

train_dirt=train[,-c(35,36,37,38,40,41,42)]

test_dirt= test[,-c(35,36,37,38,40,41,42)]

log_dirt = glm(Dirtiness~.,data=train_dirt,family = "binomial")

summary(log_dirt)

log_dirt_pred = predict(log_dirt, test_dirt, type ="response")

log_dirt_pred_y = rep(0, nrow(test_dirt)) # default assignment

log_dirt_pred_y[log_dirt_pred > 0.5] = 1

table(log_dirt_pred_y, test_dirt$Dirtiness) # Dirtiness is column 35 in this frame, not 8

mean(log_dirt_pred_y != test_dirt$Dirtiness)

# cross validation

cv_dirt = glm(Dirtiness~.,data=train_dirt,family = "binomial")

cv.glm(train_dirt,cv_dirt,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_Bumps

train_bumps=train[,-c(35,36,37,38,39,41,42)]

test_bumps= test[,-c(35,36,37,38,39,41,42)]

log_bumps = glm(Bumps~.,data=train_bumps,family = "binomial")

summary(log_bumps)

log_bumps_pred = predict(log_bumps, test_bumps, type ="response")

log_bumps_pred_y = rep(0, nrow(test_bumps)) # default assignment

log_bumps_pred_y[log_bumps_pred > 0.5] = 1

table(log_bumps_pred_y, test_bumps$Bumps) # Bumps is column 35 in this frame, not 8

mean(log_bumps_pred_y != test_bumps$Bumps)

log_bumps=step(glm(Bumps~.,data=train_bumps,family="binomial"),direction="backward")

# cross validation

cv_bumps = glm(Bumps~.,data=train_bumps,family = "binomial")

cv.glm(train_bumps,cv_bumps,K=10)$delta[1]

#000000000000000000000000000000000000000000000000_otherfaults

train_of=train[,-c(35,36,37,38,39,40,42)]


test_of= test[,-c(35,36,37,38,39,40,42)]

log_of = glm(Other_Faults~.,data=train_of,family = "binomial")

summary(log_of)

log_of_pred = predict(log_of, test_of, type ="response")

log_of_pred_y = rep(0, nrow(test_of)) # default assignment

log_of_pred_y[log_of_pred > 0.5] = 1

table(log_of_pred_y, test_of$Other_Faults) # Other_Faults is column 35 in this frame, not 8

mean(log_of_pred_y != test_of$Other_Faults)

log_of=step(glm(Other_Faults~.,data=train_of,family="binomial"),direction="backward")

# cross validation

cv_of = glm(Other_Faults~.,data=train_of,family = "binomial")

cv.glm(train_of,cv_of,K=10)$delta[1]
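For comparison with the per-defect binary models above, the seven-class response can also be fit in one multinomial logistic regression. A sketch using nnet::multinom (an assumption; the nnet package is not loaded in this listing):

library(nnet)

mlogit = multinom(alldefects~., data = train2, maxit = 500) # maxit raised since the default 100 may not converge

mean(predict(mlogit, test2) != test.alldefects)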

