KDD – KNOWLEDGE DISCOVERY IN DATABASES PREDICTION METHODS – CLASSIFICATION AND REGRESSION
Daniela Barreiro Claro
Introduction
KDD
Pre-Processing
Data Mining
Tasks
Post-Processing
Outline
Are you ready for the BigData era?
Introduction
Big Data = cloud+social+mobile
Introduction
What is BIG DATA?
Big data is data that exceeds the processing capacity
of conventional database systems.
The data is too big, moves too fast, or doesn’t fit the
structures of a database architecture
The term became a buzzword around 2012
Introduction
Physical Objects
+
Controller, Sensor, and Actuators
+
Internet
=
Internet of Things
Source: Adrian McEwen & Hakim Cassimally. Designing the Internet of Things.
Internet of Things
Integrate things into the existing web
HTML and REST
Smart things
Internet of Things
Huge amount of data
Urgent need for new techniques and tools that automate the
process of extracting data
These techniques and tools may help to transform this huge
amount of data into relevant and useful information.
“Necessity is the mother of invention”
Data mining
Automated analysis of huge data sets.
BIG Data
A large number of transactions run every day, for instance at Walmart and Carrefour
Remote sensors
Telecommunications networks
Medical records, patient records, etc.
Traffic Sensors
Devices
BIG Data
“The World is Data Rich but information poor”
Collected data is being stored into large repositories.
Data Tombs
Archived data that is rarely visited
Ex. Video camera footage
BIG Data
Knowledge discovery is the process of extracting knowledge from stored data
Following Fayyad (1996), KDD is:
“The nontrivial process of identifying valid, novel, potentially
useful and ultimately understandable patterns in data”
KDD has some steps:
Selection, pre-processing, transformation, data mining, interpretation/evaluation and
knowledge
KDD – Knowledge Discovery in Databases
1. Domain knowledge
2. Creation of the dataset
3. Pre-processing and Transformations
4. Choice of DM technique
5. Choice of DM algorithm
6. Interpretation and evaluation of patterns found
7. Knowledge discovery
KDD - Knowledge Discovery in Databases
KDD - Knowledge Discovery in Databases
Some steps of KDD can be visualized as a Data
Warehouse (DW)
Three macro steps
Pre-Processing
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Mining
Techniques of DM
Algorithms of DM
Post-Processing
Analysis and evaluation of the patterns discovered
KDD - Knowledge Discovery in Databases
Real data normally have the following characteristics:
Incomplete
Attribute values are missing, or only aggregate data is available
Incorrect
There are errors; attributes with unexpected values
Inconsistent
There are discrepancies among data items; attributes that represent the same concept can have distinct names in different databases.
Huge amount of data
A large volume of data makes the data mining process very slow
Pre-Processing
Pre-processing comprises four steps:
Data Cleaning
To clean the data
To fill in the data that is missing
To resolve inconsistencies
To smooth out errors
To eliminate or minimize discrepancies among data
If the data is dirty, the results will be unreliable
Pre-Processing
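One possible way to complete missing data is to fill each gap with the mean of the known values. The sketch below is only an illustration of that idea: the function name and the ages column are hypothetical, and mean imputation is one strategy among several, not necessarily the one intended here.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Return a copy of `values` where None entries are replaced by the mean."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from, leave the gaps as-is
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing values.
ages = [25, None, 35, 30, None]
cleaned_ages = fill_missing_with_mean(ages)
print(cleaned_ages)
```

The original list is left untouched, so the raw data can still be inspected after cleaning.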
Data Integration
Integrate the data from different databases, data cubes, file systems, etc.
Some attributes that represent the same concept can have different names in different databases.
Ex. IdCliente, ClienteID, Cli_ID
Some attributes can be inferred from others
Ex. Annual salary, total amount
Data integration can often generate redundancy. In such cases, the Data Cleaning step must be re-executed to eliminate the redundancy generated by this phase.
Pre-Processing
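A minimal sketch of the integration step, assuming records arrive as dictionaries. The canonical name customer_id, the helper names, and the sample records are hypothetical; only IdCliente, ClienteID and Cli_ID come from the example above. Exact duplicates produced by the merge are dropped, which is the redundancy clean-up mentioned above in its simplest form.

```python
# Hypothetical mapping from source-specific names to one canonical name.
CANONICAL_NAMES = {
    "IdCliente": "customer_id",
    "ClienteID": "customer_id",
    "Cli_ID": "customer_id",
}


def unify_columns(record):
    """Rename the keys of one record to their canonical names."""
    return {CANONICAL_NAMES.get(key, key): value for key, value in record.items()}


def integrate(*sources):
    """Merge records from several sources and drop exact duplicates."""
    merged, seen = [], set()
    for source in sources:
        for record in source:
            unified = unify_columns(record)
            fingerprint = tuple(sorted(unified.items()))
            if fingerprint not in seen:
                seen.add(fingerprint)
                merged.append(unified)
    return merged


# Hypothetical records from two databases describing overlapping customers.
sales = [{"IdCliente": 1, "city": "Salvador"}]
support = [{"Cli_ID": 1, "city": "Salvador"}, {"Cli_ID": 2, "city": "Recife"}]
customers = integrate(sales, support)
print(customers)
```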
Data Transformation
This step covers two main procedures
Aggregation
Combination of two or more objects into a single object
Ex. Aggregate 365 days into 12 months
Change of scale
Smaller datasets need less memory and processing time
Aggregate quantities, such as averages and totals, have less variance than individual objects.
Disadvantages
Loss of interesting details
Pre-Processing
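The "365 days into 12 months" example can be sketched as follows. The daily sales values are made up for illustration (one unit per day of a non-leap year), and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_by_month(daily_values):
    """Sum (day, value) pairs into one total per month number."""
    totals = defaultdict(float)
    for day, value in daily_values:
        totals[day.month] += value
    return dict(totals)


# Hypothetical data: a value of 1.0 for each of the 365 days of 2023.
start = date(2023, 1, 1)
daily_sales = [(start + timedelta(days=i), 1.0) for i in range(365)]
monthly_sales = aggregate_by_month(daily_sales)
print(len(daily_sales), "objects reduced to", len(monthly_sales))
```

After aggregation the individual days can no longer be recovered, which is the loss of detail listed as a disadvantage.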
Data Transformation
Normalization or Standardization
Discrete data sets have some properties
If different variables need to be combined, it is necessary to transform
them so that large values do not dominate the results.
Ex. Two variables: age and salary
The two ranges differ widely: salary (thousands of dollars) and age
(less than 130)
Pre-Processing
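Two common transformations for the age and salary example are min-max scaling and z-score standardization. The sketch below assumes small made-up columns and hypothetical function names; both functions assume the column is not constant, otherwise they would divide by zero.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values on the mean and divide by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical columns on very different scales.
ages = [20, 30, 40, 50]
salaries = [2000, 4000, 6000, 8000]

scaled_ages = min_max(ages)
scaled_salaries = min_max(salaries)
standardized_salaries = z_score(salaries)
print(scaled_ages, scaled_salaries)
```

After either transformation the two variables vary over comparable ranges, so salary no longer dominates a combined measure such as a distance.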
Data Reduction
Reduction of the data representation in volume, while still
producing the same (or a similar) analytical result.
Strategies
Aggregation
To construct a data cube
Attribute selection
To eliminate irrelevant attributes by the use of a correlation analysis
Dimension reduction
Data discretization
Pre-Processing
Data Reduction
Dimension reduction
Dimensionality refers to the number of attributes
Can eliminate irrelevant features and reduce noise
Can generate a more comprehensible model
Can reduce the data and often makes them easier to examine
Often combines old attributes to generate new attributes
Data Discretization
Transforming a continuous attribute into a categorical (discrete) attribute or
into binary attributes (binarization)
Pre-Processing
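Discretization and binarization can be sketched with equal-width bins and a threshold. Equal-width binning is only one possible discretization strategy; the function names, labels, ages and threshold below are hypothetical, and the binning assumes the values are not all equal.

```python
def equal_width_bins(values, labels):
    """Map each continuous value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


def binarize(value, threshold):
    """Turn a continuous value into a binary attribute."""
    return 1 if value > threshold else 0


# Hypothetical continuous attribute.
ages = [5, 17, 25, 42, 67, 80]
age_groups = equal_width_bins(ages, ["young", "adult", "senior"])
is_over_30 = [binarize(a, 30) for a in ages]
print(age_groups, is_over_30)
```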
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amounts
of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, information
harvesting, business intelligence, etc.
Data Mining
Data Mining
Figure: Data Mining in relation to Machine Learning, Statistics, Applications (BI / Web Search), Visualization, and Database
It is one of the steps in a KDD process
Two macro aims:
Prediction
Description
Prediction
Predict the values of future or unknown variables.
Description
Discover patterns that describe the data set
Data Mining
TECHNIQUES
Data Mining
Prediction: Classification, Regression
Description: Clustering, Summarization, Association
Data Mining
Supervised vs. Unsupervised Learning
Supervised learning (prediction)
Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (description)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to
establish the existence of classes, clusters, or associations in the data
Classification
Predicts categorical class labels (discrete or nominal)
Constructs a model from the training set and the class labels in the classifying attribute, and uses the model to classify new data
Classification techniques
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of each test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set must be independent of the training set (otherwise overfitting goes undetected)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
Classification techniques
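The two-step process can be sketched as: hold out an independent test set, construct a model on the rest, then measure the accuracy rate on the held-out samples. The records mirror the ten labeled rows (Tid 1 to 10) used in these slides; the function names, the 30% test fraction, the seed, and the deliberately trivial majority-class "model" are assumptions made only to keep the sketch short.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_class_model(training_set):
    """Toy model construction: always predict the most frequent training label."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: majority


def accuracy(model, test_set):
    """Fraction of test samples whose known label matches the prediction."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


# (Attrib1, Attrib2, Attrib3 in thousands) -> Class
records = [
    (("Yes", "Large", 125), "No"), (("No", "Medium", 100), "No"),
    (("No", "Small", 70), "No"), (("Yes", "Medium", 120), "No"),
    (("No", "Large", 95), "Yes"), (("No", "Medium", 60), "No"),
    (("Yes", "Large", 220), "No"), (("No", "Small", 85), "Yes"),
    (("No", "Medium", 75), "No"), (("No", "Small", 90), "Yes"),
]

training_set, test_set = train_test_split(records)
model = majority_class_model(training_set)
acc = accuracy(model, test_set)
print(f"accuracy on {len(test_set)} held-out samples: {acc:.2f}")
```

Any real classifier can replace majority_class_model; the split and the accuracy computation stay the same.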
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Induction: the learning algorithm learns a model from the training set.
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Deduction: the model is applied to the test set.
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Classification algorithms
Classification- Decision tree
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree
Splitting attributes: Refund, MarSt, TaxInc
Refund = Yes: NO
Refund = No: test MarSt
    MarSt = Married: NO
    MarSt = Single, Divorced: test TaxInc
        TaxInc < 80K: NO
        TaxInc > 80K: YES
Model: Decision Tree (same training data)
MarSt = Married: NO
MarSt = Single, Divorced: test Refund
    Refund = Yes: NO
    Refund = No: test TaxInc
        TaxInc < 80K: NO
        TaxInc > 80K: YES
There could be more than one tree that fits the same data!
Classification- Decision tree
Another example
Decision Tree Classification Task
Induction: a tree induction algorithm learns a decision tree from the training set.
Deduction: the decision tree is applied to the test set.
Apply Model to Test Data
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Start from the root of the tree.
Assign Cheat to “No”
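Applying the model amounts to walking the tree from the root to a leaf. The sketch below hard-codes the Refund / MarSt / TaxInc tree from these slides as nested conditions; the function name is hypothetical, and because the tree only labels the branches "< 80K" and "> 80K", sending an income of exactly 80K down the NO branch is an assumption.

```python
def predict_cheat(refund, marital_status, taxable_income_k):
    """Walk the Refund -> MarSt -> TaxInc tree and return the Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (in thousands).
    # Assumption: exactly 80K is treated like the "< 80K" branch.
    return "Yes" if taxable_income_k > 80 else "No"


# Test record from the slides: Refund = No, Married, 80K.
prediction = predict_cheat("No", "Married", 80)
print(prediction)

# The ten training records, to check that the tree fits them.
training_data = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
fits_training_data = all(
    predict_cheat(refund, status, income) == cheat
    for refund, status, income, cheat in training_data
)
```

For the test record the walk stops at the MarSt node: Refund is No, the marital status is Married, so the leaf NO is reached without looking at the income.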
4 macro steps:
1. Split the data into a training data set and a test data set
2. Choose the classification attribute (labeled attribute)
Decide which features of the data are relevant to the target class we want to predict.
Verify the relevant attributes (entropy and information gain)
3. Generate the decision tree
4. Test the performance of the classification algorithm using the test data set
Classification- Decision tree
Entropy
It is a measure of impurity. It is defined for a binary class with values a/b as:
Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
Information gain
It is usually a good measure for deciding the relevance of an attribute
It is used to define a preferred sequence of attributes to investigate, so as to rapidly narrow down the state
of the predicted class
A notable problem occurs when information gain is applied to attributes that can take on a large
number of distinct values
Ex. One of the input attributes might be the customer's credit card number: it identifies each customer uniquely, so its information gain is high, but it does not generalize to new customers.
Classification- Decision tree
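Entropy and information gain can be computed directly on the ten training records used in these slides. The sketch assumes base-2 logarithms (the formula above does not state a base) and uses hypothetical function names; Taxable Income is left out because it would first need a split threshold. The Tid column plays the role of the credit card number: every value is distinct, so its gain equals the full entropy of the class.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Impurity of a list of class labels, in bits (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


columns = ("Tid", "Refund", "MarSt", "Cheat")
rows = [dict(zip(columns, r)) for r in [
    (1, "Yes", "Single", "No"), (2, "No", "Married", "No"),
    (3, "No", "Single", "No"), (4, "Yes", "Married", "No"),
    (5, "No", "Divorced", "Yes"), (6, "No", "Married", "No"),
    (7, "Yes", "Divorced", "No"), (8, "No", "Single", "Yes"),
    (9, "No", "Married", "No"), (10, "No", "Single", "Yes"),
]]

class_entropy = entropy([row["Cheat"] for row in rows])
gains = {a: information_gain(rows, a, "Cheat") for a in ("Tid", "Refund", "MarSt")}
for attribute, gain in gains.items():
    print(f"{attribute}: {gain:.3f}")
```

On this data MarSt has a higher gain than Refund, while Tid scores highest of all even though it is useless for classifying new records.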
Classification- Exercise
Classification- Decision tree - Results
Using a decision tree algorithm
Name      Gender
Pedro     M
Miguel    M
Ana       F
Gabriela  F
Daniela   ?
Predict Daniela’s gender
Classification- Decision tree - Exercise
Features
Ends with a vowel
Number of vowels
Length
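As a starting point for the exercise, the three features can be extracted from each name as below. This is only a sketch of the feature step: the function and key names are hypothetical, plain a/e/i/o/u are assumed to be the vowels, and building the tree from these features is left to the exercise.

```python
VOWELS = set("aeiou")


def extract_features(name):
    """Compute the three features listed above for one name."""
    lowered = name.lower()
    return {
        "ends_vowel": lowered[-1] in VOWELS,
        "num_vowels": sum(1 for ch in lowered if ch in VOWELS),
        "length": len(name),
    }


labeled_names = [("Pedro", "M"), ("Miguel", "M"), ("Ana", "F"), ("Gabriela", "F")]
for name, gender in labeled_names:
    print(name, gender, extract_features(name))

daniela_features = extract_features("Daniela")
print("Daniela", "?", daniela_features)
```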
Regression learns a function that predicts a number
Ex. Predict the height of a child given the child’s age
Linear regression is the simplest to use
Algorithm examples
GLM – Generalized Linear Model
Based on statistical techniques
SVM – Support Vector Machines
Supports linear and non-linear regression
Regression techniques
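A minimal sketch of simple linear regression for the age and height example, fitted by ordinary least squares. The ages and heights are made-up illustrative numbers, not real growth data, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: child's age in years, height in centimeters.
ages_years = [2, 4, 6, 8]
heights_cm = [86, 100, 114, 128]

slope, intercept = fit_line(ages_years, heights_cm)
predicted_height = slope * 5 + intercept  # prediction for a 5-year-old
print(f"height = {slope:.1f} * age + {intercept:.1f}")
```

Unlike classification, the prediction is a number on a continuous scale rather than a class label.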
Analyze retrieved information
Generate knowledge
Often this is a cyclic process, that is, earlier steps must be
redone in order to find useful information
KDD is a slow process
Post-Processing
/formasresearchgroup /formasresearch
www.formas.ufba.br
Semantic Applications and Formalisms Research Group
Prof. Daniela Barreiro Claro
Email: [email protected]
Our course: formas.ufba.br/dclaro