ZEROTH REVIEW
SUBANYA.B
10CSR021
LAVANYA.M
10CSL149
RAJA.R
10CSR025
PROJECT GUIDE : Dr.R.R.RAJALAXMI
INTRODUCTION
DATA MINING
• Data mining is the process of extracting knowledge from large amounts of data
• Also referred to as Knowledge Discovery in Databases (KDD)
BASIC DATA MINING TASKS
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
CLASSIFICATION
• Predicts categorical class labels
• Classifies data (constructs a model) based on the training set
• An algorithm that implements classification is known as a classifier
DIMENSIONALITY REDUCTION
• FEATURE EXTRACTION
    • Linear
    • Non-Linear
• FEATURE SELECTION
    • Feature Ranking
    • Subset Selection: Filter Approaches, Wrapper Approaches, Embedded Approaches
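As an illustration of the feature ranking branch, here is a minimal sketch of a filter-style ranking. It assumes the absolute Pearson correlation between each feature and the class label as the ranking score; that choice, the function names `pearson` and `rank_features`, and the toy data are illustrative assumptions and are not taken from the surveyed papers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / math.sqrt(var_x * var_y)

def rank_features(rows, labels):
    """Return feature indices sorted by |correlation with the label|, best first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores]

# Toy data (assumption): feature 0 tracks the label, feature 1 is constant,
# feature 2 is mostly noise.
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.8],
]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels)
print(ranking)
```

A filter like this scores each feature independently of any classifier, which is what distinguishes it from the wrapper approaches, where a classifier's accuracy is the score.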
LITERATURE SURVEY
• Reducing bioinformatics data dimension with ABC-KNN
• Feature Selection for medical diagnosis: Evaluation for cardiovascular diseases
• Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients
PAPER-I
Reducing bioinformatics data dimension with ABC-KNN
• Authors: Thananan Prasartvit, Anan Banharnsakun, Boonserm Kaewkamnerdpong, Tiranee Achalakul
• Year: 2013
PROBLEM
• Analyzing a large amount of data often consumes extensive computational resources and execution time
• Not all data features contribute equally to the end results
• The major contributing features need to be identified so that features with low contribution can be eliminated
• The need for dimension reduction arises because biological data can be massive, with tens of thousands of features to be explored
• The objective is to design an effective algorithm that can selectively remove irrelevant dimensions from data while preserving the semantics of the original data
PROPOSED WORK
• Artificial Bee Colony (ABC) is proposed as a method for data dimension reduction in classification problems
• The K-Nearest Neighbor (KNN) method is used for fitness evaluation within the ABC framework
• The result is an ABC feature selection method wrapped with KNN for classification (ABC-KNN)
Artificial Bee Colony (ABC)
Begin:
    Initialize Solutions
    Repeat
        // Employed Bees Process
        Updating_Feasible_Solutions
        // Onlooker Bees Process
        Selecting_Feasible_Solutions
        Updating_Feasible_Solutions
        // Scout Bee Process
        Avoiding_Sub-Optimal_Solutions
    Until (maximum number of iterations or the stopping criterion is met)
End
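The ABC loop above can be sketched in Python as a binary feature-selection search. This is a simplified illustration under stated assumptions, not the implementation from the paper: each food source is a 0/1 feature mask, a neighbour is produced by flipping one bit, and the parameters `n_sources`, `limit`, `max_iter` and the toy fitness function are hypothetical. In ABC-KNN the fitness would be the KNN classification accuracy on the selected features.

```python
import random

def abc_feature_selection(n_features, fitness, n_sources=10, limit=5,
                          max_iter=50, seed=0):
    """Binary ABC sketch: each food source is a 0/1 mask over the features."""
    rng = random.Random(seed)

    def random_mask():
        return [rng.randint(0, 1) for _ in range(n_features)]

    # Initialize Solutions
    sources = [random_mask() for _ in range(n_sources)]
    scores = [fitness(m) for m in sources]
    trials = [0] * n_sources  # consecutive failed updates per source
    top = max(range(n_sources), key=lambda i: scores[i])
    best = {"mask": sources[top][:], "score": scores[top]}

    def update(i):
        """Updating_Feasible_Solutions: flip one bit, keep the change if better."""
        candidate = sources[i][:]
        j = rng.randrange(n_features)
        candidate[j] = 1 - candidate[j]
        score = fitness(candidate)
        if score > scores[i]:
            sources[i], scores[i], trials[i] = candidate, score, 0
            if score > best["score"]:
                best["mask"], best["score"] = candidate[:], score
        else:
            trials[i] += 1

    for _ in range(max_iter):
        # Employed bees: one local update per food source.
        for i in range(n_sources):
            update(i)
        # Onlooker bees: choose sources with probability growing with fitness.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            update(i)
        # Scout bees: replace sources that stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_mask()
                scores[i] = fitness(sources[i])
                trials[i] = 0
                if scores[i] > best["score"]:
                    best["mask"], best["score"] = sources[i][:], scores[i]
    return best["mask"], best["score"]

# Toy fitness (assumption): features 0, 3 and 7 are relevant,
# and every selected feature costs 0.1.
RELEVANT = {0, 3, 7}

def toy_fitness(mask):
    chosen = {j for j, bit in enumerate(mask) if bit}
    return len(chosen & RELEVANT) - 0.1 * len(chosen)

mask, score = abc_feature_selection(10, toy_fitness)
print(mask, score)
```

The search is stochastic, so the returned mask depends on the seed; the best solution is memorised separately so that a scout replacement never discards it.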
K-Nearest Neighbor (KNN)
Begin:
    For i = 1 to number of training data items
        Store_data
    End
    For j = 1 to number of testing data items
        Measure_distance
        Sort_by_distance
        Evaluate_data_class
    End
End
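The KNN steps above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and majority voting; the helper `knn_accuracy`, its `mask` parameter (a 0/1 list marking which features are kept) and the toy data are assumptions added to show how KNN accuracy could serve as a fitness value for a feature subset.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Measure_distance: Euclidean distance from the query to every stored row.
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_rows, train_labels)]
    # Sort_by_distance: nearest neighbours first.
    distances.sort(key=lambda pair: pair[0])
    # Evaluate_data_class: most common label among the k nearest.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_rows, train_labels, test_rows, test_labels, mask, k=3):
    """Accuracy of KNN using only the features where mask[j] == 1."""
    keep = [j for j, bit in enumerate(mask) if bit]
    if not keep:
        return 0.0  # an empty feature subset cannot classify anything

    def project(row):
        return [row[j] for j in keep]

    train = [project(row) for row in train_rows]
    hits = sum(knn_predict(train, train_labels, project(row), k) == label
               for row, label in zip(test_rows, test_labels))
    return hits / len(test_rows)

# Toy data (assumption): the first two features separate the classes,
# the third is noise.
train_rows = [[1, 1, 50], [1, 2, 10], [2, 1, 90],
              [8, 8, 40], [9, 8, 70], [8, 9, 20]]
train_labels = ["a", "a", "a", "b", "b", "b"]
test_rows = [[2, 2, 30], [8, 7, 60]]
test_labels = ["a", "b"]

acc_informative = knn_accuracy(train_rows, train_labels,
                               test_rows, test_labels, [1, 1, 0])
acc_noise = knn_accuracy(train_rows, train_labels,
                         test_rows, test_labels, [0, 0, 1])
print(acc_informative, acc_noise)
```

On this toy data, dropping the noisy feature gives a higher accuracy than using it alone, which is the signal a wrapper method relies on when comparing feature subsets.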
RESULTS
[Figure: classification accuracy (axis 0 to 100) by data name for LS-SVM, PCA-FDA, MSDR-LGC, LLDE-KNN and ABC-KNN on the Colon_cancer, Acute_leukemia, Hepatocellular_Carcinoma, High_grade_Glioma and Prostate_Cancer datasets]
CONCLUSION
• The experimental results of the gene expression analysis show that the proposed method can effectively reduce the data dimension while maintaining high classification accuracy
• ABC-KNN can thus be employed to exclude the non-essential data as well as identify the vital elements from a vast amount of biological data
PAPER-II
Feature Selection for medical diagnosis: Evaluation for cardiovascular diseases
• Authors: Swati Shilaskar, Ashok Ghatol
• Year: 2013
PROBLEM
• To find a suitable algorithm that generates a smaller feature subset from high-dimensional data and improves diagnostic ability for cardiovascular diseases
PROPOSED METHOD
FEATURE SELECTION METHODS
• Forward Feature Inclusion
• Back-elimination Feature Selection
• Forward Feature Selection
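The general idea behind forward and backward selection can be sketched as two greedy searches. This is a generic illustration, not the specific algorithms defined in the paper: the `score` callback (higher is better), the stopping rules and the toy scoring function are assumptions. In practice `score` would be a classifier's validation accuracy on the given feature subset.

```python
def forward_selection(n_features, score):
    """Start empty; repeatedly add the feature that most improves the score."""
    selected = []
    best = score(selected)
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        trials = [(score(selected + [j]), j) for j in candidates]
        top_score, top_j = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_j)
        best = top_score
    return sorted(selected), best

def backward_elimination(n_features, score):
    """Start with all features; repeatedly drop one while the score does not fall."""
    selected = list(range(n_features))
    best = score(selected)
    while selected:
        trials = [(score([f for f in selected if f != j]), j) for j in selected]
        top_score, top_j = max(trials, key=lambda t: t[0])
        if top_score < best:
            break  # every possible removal hurts the score
        selected.remove(top_j)
        best = top_score
    return selected, best

# Toy score (assumption): features 1 and 4 are relevant,
# and every selected feature costs 0.05.
RELEVANT = {1, 4}

def toy_score(subset):
    return len(set(subset) & RELEVANT) - 0.05 * len(subset)

fwd_subset, fwd_score = forward_selection(6, toy_score)
bwd_subset, bwd_score = backward_elimination(6, toy_score)
print(fwd_subset, fwd_score)
print(bwd_subset, bwd_score)
```

Both searches evaluate far fewer subsets than an exhaustive search over all 2^n combinations, at the cost of possibly missing the best subset when features only help in combination.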
DATA SETS DESCRIPTION
DATASET          NO. OF SAMPLES   NO. OF FEATURES   CATEGORIES
ARRHYTHMIA       452              279               16
SPECTF CARDIAC   267              44                2
HEART DISEASE    303              14 used           4
RESULTS
Classification performance with all features vs. classification performance with the proposed feature selection algorithm
DATA SET         No. of all features   Accuracy with all features   No. of features in subset   Accuracy with feature subset
Arrhythmia       258                   0.79                         23                          0.88
SPECTF cardiac   44                    0.75                         19                          0.78
Heart Disease    10                    0.81                         4                           0.85
CONCLUSION
• Accuracy gives a proper estimate of classifier performance when the dataset is balanced
• If the dataset is unbalanced, accuracy is found not to be a correct estimate of classifier performance
• The feature ranking methods investigated in this research work well for the arrhythmia and heart disease datasets
• The hybrid forward feature selection algorithm successfully reduces feature dimensions and improves classifier accuracy
• The highest accuracy is achieved when the forward selection algorithm is used
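A small worked example shows why plain accuracy can mislead on an unbalanced dataset. Balanced accuracy (the mean of per-class recall) is used here as one common alternative; it is an illustrative choice and is not stated in the source as the measure the authors used. The 95/5 class split is made-up toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(y_pred[i] == cls for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

# Toy unbalanced data (assumption): 95 healthy (0) and 5 diseased (1) samples.
y_true = [0] * 95 + [1] * 5
# A useless classifier that always predicts "healthy".
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)
```

The always-healthy classifier scores 0.95 on accuracy while detecting none of the diseased cases, and its balanced accuracy of 0.5 exposes that.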
PAPER-III
Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients
• Authors: Susana M. Vieira, Luís F. Mendonça, Gonçalo J. Farinha, João M.C. Sousa
• Year: 2013
PROBLEM
• The medical condition studied is sepsis, a common clinical condition defined by a whole-body inflammatory state called systemic inflammatory response syndrome (SIRS)
• Sepsis has different degrees of severity and can progress to severe sepsis and later to septic shock
PROPOSED METHOD
• A modified binary particle swarm optimization (MBPSO) method for feature selection with simultaneous optimization of the SVM kernel parameters
• MBPSO is an enhanced version of BPSO, designed to cope with the premature convergence of the BPSO algorithm
• MBPSO is used as a wrapper method
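For orientation, the baseline that MBPSO builds on can be sketched as a standard binary PSO with a sigmoid transfer function. This sketch covers plain BPSO only: it does not include the paper's modifications for premature convergence or the SVM kernel optimization. The parameter values (`w`, `c1`, `c2`, `v_max`) and the toy fitness are assumptions; in a wrapper setting the fitness would be the classifier's accuracy on the selected features.

```python
import math
import random

def binary_pso(n_features, fitness, n_particles=10, max_iter=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Plain binary PSO: each particle is a 0/1 mask over the features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]               # personal best masks
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]  # global best

    for _ in range(max_iter):
        for i in range(n_particles):
            for j in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][j]
                     + c1 * r1 * (pbest[i][j] - pos[i][j])
                     + c2 * r2 * (gbest[j] - pos[i][j]))
                # Clamp the velocity so the sigmoid never saturates completely.
                vel[i][j] = max(-v_max, min(v_max, v))
                # Sigmoid turns the velocity into the probability of bit = 1.
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob_one else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score

# Toy fitness (assumption): features 2 and 5 are relevant,
# and every selected feature costs 0.1.
RELEVANT = {2, 5}

def toy_fitness(mask):
    chosen = {j for j, bit in enumerate(mask) if bit}
    return len(chosen & RELEVANT) - 0.1 * len(chosen)

best_mask, best_score = binary_pso(8, toy_fitness)
print(best_mask, best_score)
```

Because every particle is pulled toward the same global best, a swarm like this can settle early on a mediocre subset, which is the premature convergence the modified version is designed to address.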
DATASETS DESCRIPTION
NUMBER   DATABASE               SAMPLES     FEATURES   CLASSES
1        German (credit card)   1000        24         2
2        Sonar                  208         60         2
3        WBCO                   683 (699)   9          2
4        WPBC                   198         32         2
5        WDBC                   569         30         2
6        Colon Cancer           62          2000       2
CONCLUSION
• MBPSO shows better performance than the other PSO-based methods and similar or better results than GA
[Figure: accuracy by database (German, Sonar, WBCO, WPBC, WDBC, Colon) for NO-FS, BPSO, IBPSO, GA and MBPSO; axis ticks 0 to 120]
FUTURE WORK
• Future work will test the introduced algorithm (MBPSO) on other medical databases in order to compare its performance more consistently with other feature selection techniques
FINDINGS FROM THE LITERATURE SURVEY
• MBPSO or ABC-KNN can be applied to heart disease databases to improve the accuracy of heart disease diagnosis
• Hybrid models such as PSO-KNN, GA-KNN, or ABC combined with other classification algorithms can be developed and applied to these databases to improve the efficiency of finding feature subsets
REFERENCES
[1] Thananan Prasartvit, Anan Banharnsakun, Boonserm Kaewkamnerdpong, Tiranee Achalakul, "Reducing bioinformatics data dimension with ABC-kNN", Neurocomputing 116 (2013), 367-381
[2] Swati Shilaskar, Ashok Ghatol, "Feature selection for medical diagnosis: Evaluation for cardiovascular diseases", Expert Systems with Applications 40 (2013), 4146-4153
[3] Susana M. Vieira, Luís F. Mendonça, Gonçalo J. Farinha, João M.C. Sousa, "Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients", Applied Soft Computing 13 (2013), 3494-3504