KDD KNOWLEDGE DISCOVERY IN DATABASESformas.ufba.br/dclaro/mat700/Aula 3 - KDD e Mecatronica -...

KDD – KNOWLEDGE DISCOVERY IN DATABASES PREDICTION METHODS – CLASSIFICATION AND REGRESSION

Daniela Barreiro Claro

Outline

Introduction
KDD
Pre-Processing
Data Mining
Tasks
Post-Processing


Are you ready for the BigData era?

Introduction



Big Data = cloud+social+mobile



What is BIG DATA?

Big data is data that exceeds the processing capacity of conventional database systems.

The data is too big, moves too fast, or doesn’t fit the structures of a database architecture.

The term became a buzzword around 2012.

Physical Objects + Controllers, Sensors, and Actuators + Internet = Internet of Things

1. Adrian McEwen & Hakim Cassimally. Designing the Internet of Things.

Internet of Things

Integrate things into the existing web

HTML and REST

Smart things



Huge amount of data

There is an urgent need for new techniques and tools that automate the process of extracting information from data.

These techniques and tools help to transform this huge amount of data into relevant and useful information.

“Necessity is the mother of invention”

Data mining

Automated analysis of huge data sets.

BIG Data

A large number of transactions run each day, for instance at Walmart and Carrefour

Remote sensors

Telecommunications networks

Medical records, patient records, etc.

Traffic sensors

Devices

“The World is Data Rich but information poor”

Collected data is stored in large repositories.

Data tombs

Archived data that is rarely visited

Ex. camera video

Knowledge discovery process over stored data

According to Fayyad et al. (1996), KDD is:

“The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

KDD has several steps:

Selection, pre-processing, transformation, data mining, and interpretation/evaluation, leading to knowledge

KDD – Knowledge Discovery in Databases


1. Domain knowledge

2. Creation of the dataset

3. Pre-processing and transformations

4. Choice of DM technique

5. Choice of DM algorithm

6. Interpretation and evaluation of the patterns found

7. Knowledge discovery




Some steps of KDD can be visualized as a Data Warehouse (DW)

Three macro steps

Pre-Processing
  Data Cleaning
  Data Integration
  Data Transformation
  Data Reduction

Data Mining
  Techniques of DM
  Algorithms of DM

Post-Processing
  Analysis and evaluation of the patterns discovered


Real-world data normally has the following characteristics:

Incomplete
  Attribute values are missing, or only aggregate data is available

Wrong
  There are errors; attributes hold unexpected values

Inconsistent
  There are discrepancies among data items; attributes that represent the same concept can have distinct names in different databases

Huge amount of data
  A large volume of data makes the data mining process very slow

Pre-Processing


Pre-processing can be divided into four steps:

Data Cleaning
  To clean the data
  To fill in missing values
  To resolve inconsistencies
  To smooth out errors (noise)
  To eliminate or minimize discrepancies among the data

If the data is dirty, the results will be unreliable
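The cleaning actions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ages` list is invented, `None` stands for a missing value, missing values are filled with the mean of the known ones, and errors are smoothed with a small moving average. Other fill and smoothing strategies are equally valid.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing value.
ages = [23, None, 31, 27, None, 39]


def fill_missing_with_mean(values):
    """Replace each None with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def smooth(values, window=3):
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    result = []
    for i in range(len(values)):
        neighbours = values[max(0, i - half): i + half + 1]
        result.append(sum(neighbours) / len(neighbours))
    return result


cleaned = fill_missing_with_mean(ages)
smoothed = smooth(cleaned)
print(cleaned)
print(smoothed)
```

Filling with the mean keeps the attribute's average unchanged but reduces its variance, which is one reason the choice of strategy depends on the later mining step.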


Data Integration
  Integrate data from different databases, data cubes, file systems, etc.
  Attributes that represent the same concept can have different names in different databases
    Ex. IdCliente, ClienteID, Cli_ID
  Some attributes can be derived from others
    Ex. annual salary, total amount
  Data integration often generates redundancy. In these cases, the Data Cleaning step must be re-executed to eliminate the redundancy introduced by this phase
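A minimal sketch of the naming problem above, assuming two invented sources (`sales` and `crm`) that identify the same customers under the column names IdCliente, ClienteID, and Cli_ID. The canonical name `customer_id` and the helper functions are hypothetical.

```python
# Hypothetical mapping from source-specific names to one canonical name.
CANONICAL = {
    "IdCliente": "customer_id",
    "ClienteID": "customer_id",
    "Cli_ID": "customer_id",
}


def normalize_keys(record):
    """Rename known synonyms; leave every other attribute untouched."""
    return {CANONICAL.get(key, key): value for key, value in record.items()}


def integrate(*sources):
    """Merge records from all sources, one output record per customer."""
    merged = {}
    for source in sources:
        for record in source:
            rec = normalize_keys(record)
            merged.setdefault(rec["customer_id"], {}).update(rec)
    return list(merged.values())


# Invented example data.
sales = [{"IdCliente": 1, "total": 250.0}, {"IdCliente": 2, "total": 80.0}]
crm = [{"ClienteID": 1, "name": "Ana"}, {"Cli_ID": 3, "name": "Pedro"}]

customers = integrate(sales, crm)
print(customers)
```

Merging on the canonical key is also what removes the redundancy: customer 1 appears in both sources but ends up as a single record.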


Data Transformation
  This step covers two main procedures: aggregation and normalization

Aggregation
  Combination of two or more objects into a single object
    Ex. aggregate 365 days into 12 months
  Change of scale
  Smaller datasets need less memory and processing time
  Aggregate quantities, such as averages and totals, have less variance than individual objects

Disadvantage
  Loss of interesting details
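The "365 days into 12 months" example can be sketched as below. The daily values are invented (day n of a non-leap year sells n units) purely so that the totals are easy to check by hand.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily sales for 2023: day n of the year sold n units.
start = date(2023, 1, 1)
daily = {start + timedelta(days=i): i + 1 for i in range(365)}

# Aggregate 365 daily objects into 12 monthly objects.
monthly_total = defaultdict(int)
monthly_count = defaultdict(int)
for day, units in daily.items():
    monthly_total[day.month] += units
    monthly_count[day.month] += 1

monthly_avg = {m: monthly_total[m] / monthly_count[m] for m in monthly_total}

print(len(daily), "daily values ->", len(monthly_total), "monthly values")
print("January total:", monthly_total[1], "average:", monthly_avg[1])
```

The monthly view is 30 times smaller, but the day-to-day detail inside each month is gone, which is the disadvantage named above.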

Data Transformation

Normalization or Standardization
  Transforms a whole set of values so that it has a particular property, such as a common scale
  If different variables need to be combined, they must be transformed so that variables with large values do not dominate the results
    Ex. two variables: age and salary
    Salary is measured in thousands of dollars while age is less than 130, so untransformed salary differences would dominate
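Two common ways to put age and salary on comparable scales are min-max normalization and z-score standardization. The sketch below uses invented values and assumes the values of a variable are not all identical (otherwise both functions would divide by zero).

```python
from statistics import mean, pstdev

# Hypothetical values for the two variables in the example.
ages = [25, 40, 60]
salaries = [2000, 5000, 8000]


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print(min_max(ages))
print(min_max(salaries))
print(z_score(salaries))
```

After either transformation, a difference of one unit means roughly the same thing for age as for salary, so neither variable dominates a distance or a sum.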


Data Reduction
  Obtain a reduced representation of the data that is much smaller in volume yet produces the same (or a similar) analytical result

Strategies
  Aggregation
    To construct a data cube
  Attribute selection
    To eliminate irrelevant attributes by means of correlation analysis
  Dimension reduction
  Data discretization


Data Reduction

Dimension reduction
  The dimensionality is the number of attributes
  Can eliminate irrelevant features and reduce noise
  Can generate a more comprehensible model
  Can reduce the data and often makes it easier to examine
  Often done by joining attributes to generate new ones, that is, combinations of the old attributes

Data discretization
  Transforming a continuous attribute into a categorical (discrete) attribute or into binary attributes (binarization)
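Discretization and binarization can be sketched as follows. Equal-width binning is only one possible strategy and is chosen here for simplicity; the ages and the three labels are invented, and the values are assumed not to be all equal.

```python
LABELS = ["young", "middle", "old"]  # hypothetical category names


def equal_width_bins(values, labels):
    """Map each continuous value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value would fall just outside the last interval; clamp it.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


def binarize(categories, labels):
    """Turn each category into one binary attribute per label."""
    return [[1 if c == label else 0 for label in labels] for c in categories]


ages = [5, 17, 30, 45, 62, 80]  # invented continuous attribute
categories = equal_width_bins(ages, LABELS)
print(categories)
print(binarize(categories, LABELS))
```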


Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.


Data Mining


[Figure: Data Mining shown together with Machine Learning, Statistics, Applications, BI / Web Search, Visualization, and Database]

It is one of the steps in the KDD process

Two macro aims:
  Prediction
    Predict the values of future or unknown variables
  Description
    Discover patterns that describe the data set


TECHNIQUES

Data Mining
  Prediction: Classification, Regression
  Description: Clustering, Summarization, Association

Supervised vs. Unsupervised Learning

Supervised learning (prediction)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set

Unsupervised learning (description)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes, clusters, or associations in the data



Classification
  Predicts categorical class labels (discrete or nominal)
  Constructs a model from the training set and the values (class labels) of a classifying attribute, and uses that model to classify new data

Classification techniques

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the model's classification
    The accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set (otherwise the estimate is inflated by overfitting)
  If the accuracy is acceptable, use the model to classify new data

Note: if the test set is used to select models, it is called a validation (test) set
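The two steps can be sketched as follows. The records, the deliberately trivial majority-class learner, and all function names are hypothetical; the point is only that the model is built from the training set and its accuracy is measured on a separate test set.

```python
# Hypothetical labeled records: (taxable income in K, class label).
training_set = [(125, "No"), (100, "No"), (70, "No"),
                (95, "Yes"), (60, "No"), (85, "Yes")]
test_set = [(120, "No"), (220, "No"), (75, "No"), (90, "Yes")]


def learn_majority_class(records):
    """Step 1, model construction: always predict the most frequent class."""
    labels = [label for _, label in records]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


def accuracy(model, records):
    """Step 2, model usage: fraction of records classified correctly."""
    correct = sum(1 for features, label in records if model(features) == label)
    return correct / len(records)


model = learn_majority_class(training_set)
print("accuracy on the independent test set:", accuracy(model, test_set))
```

A real learner would replace `learn_majority_class`; the `accuracy` function stays the same whatever model is plugged in.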

[Figure: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction)]

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification algorithms

Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

Classification - Decision tree

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes)

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

Another example

The same training data also fits a tree that splits on marital status first:

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

[Figure: a tree induction algorithm learns a decision tree from the training set (induction); the tree is then applied to the test set (deduction). The training and test sets are the Attrib1/Attrib2/Attrib3 tables shown earlier.]

Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

Start from the root of the tree.
  Refund = No: follow the No branch to MarSt
  MarSt = Married: follow the Married branch to the leaf NO
  Assign Cheat to “No”
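The walk from the root to a leaf can be written as nested tests. This sketch hard-codes the Refund / MarSt / TaxInc tree from the slides; incomes are in thousands, and because the slides only label the branches "< 80K" and "> 80K", sending exactly 80K to the "> 80K" branch is an assumption made here.

```python
def classify(record):
    """Follow the decision tree from the root and return the Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (in K).
    # Assumption: an income of exactly 80 takes the "> 80K" branch.
    return "No" if record["TaxInc"] < 80 else "Yes"


test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))

# The same tree reproduces every label in the training data.
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
predictions = [
    classify({"Refund": refund, "MarSt": marst, "TaxInc": income})
    for refund, marst, income, _ in training
]
print(predictions == [cheat for *_, cheat in training])
```

For the test record the income is never inspected: the Married branch already ends in a leaf.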


4 macro steps:

1. Split the data into a training data set and a test data set

2. Choose the classification attribute (label attribute)
   Decide which features of the data are relevant to the target class we want to predict
   Assess the relevance of the attributes (entropy and information gain)

3. Generate the decision tree

4. Evaluate the performance of the classification algorithm using the test data set



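Entropy
  It is a measure of impurity. For a binary class with values a and b it is defined as:

  Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))

  where p(a) and p(b) are the proportions of each class; the logarithm is usually taken in base 2, which gives the entropy in bits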
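As a worked instance of the formula, the sketch below computes the entropy of the Cheat column of the training data (3 Yes, 7 No). Base-2 logarithms are assumed, and the function name `entropy` is illustrative.

```python
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


# Cheat column of the ten training records (Tid 1 to 10).
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(round(entropy(cheat), 4))   # -0.3*log2(0.3) - 0.7*log2(0.7)
print(entropy(["a", "b"]))        # evenly mixed: maximum impurity
print(entropy(["a", "a"]))        # pure: no impurity
```

A 50/50 split gives the maximum of 1 bit and a pure set gives 0, so lower entropy after a split means purer groups.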

Information gain
  It is usually a good measure for deciding the relevance of an attribute
  It is used to define a preferred sequence of attributes to investigate, so as to narrow down the predicted class most rapidly
  A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values
    Ex. one of the input attributes might be the customer's credit card number: it identifies each customer uniquely, so its information gain is high, yet it says nothing about customers not seen in training
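The problem with many-valued attributes can be seen on the training data itself. In this sketch the Tid column plays the role of the credit card number, since it is unique per record. Information gain is computed as the entropy of the class minus the weighted entropy of the groups produced by the split; base-2 logarithms and the function names are assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target="Cheat"):
    """Class entropy minus the weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


COLUMNS = ("Tid", "Refund", "MarSt", "Cheat")
DATA = [
    (1, "Yes", "Single", "No"), (2, "No", "Married", "No"),
    (3, "No", "Single", "No"), (4, "Yes", "Married", "No"),
    (5, "No", "Divorced", "Yes"), (6, "No", "Married", "No"),
    (7, "Yes", "Divorced", "No"), (8, "No", "Single", "Yes"),
    (9, "No", "Married", "No"), (10, "No", "Single", "Yes"),
]
rows = [dict(zip(COLUMNS, record)) for record in DATA]

gains = {a: information_gain(rows, a) for a in ("Tid", "Refund", "MarSt")}
for attribute, gain in gains.items():
    print(attribute, round(gain, 4))
```

Tid gets the highest gain because every group it produces holds a single record and is therefore pure, even though a record identifier cannot help classify new records.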



Classification- Exercise


Classification- Decision tree - Results


Classification - Decision tree - Exercise

Using a decision tree algorithm, predict Daniela's gender.

Name      Gender
Pedro     M
Miguel    M
Ana       F
Gabriela  F
Daniela   ?

Features
  Ends in a vowel
  Number of vowels
  Length
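As a starting point for the exercise, the three features can be extracted from each name as below. The function name, the dictionary keys, and the choice of a, e, i, o, u as the vowel set are assumptions; building the tree from the resulting table is left as the exercise.

```python
VOWELS = set("aeiou")  # assumption: unaccented vowels only


def features(name):
    """Compute the three exercise features for one name."""
    lower = name.lower()
    return {
        "ends_vowel": lower[-1] in VOWELS,
        "n_vowels": sum(ch in VOWELS for ch in lower),
        "length": len(lower),
    }


names = ["Pedro", "Miguel", "Ana", "Gabriela", "Daniela"]
table = {name: features(name) for name in names}
for name, row in table.items():
    print(name, row)
```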

Regression techniques

Represents a function to predict a number
  Ex. predict the height of a child given the child’s age

Linear regression is the simplest to use

Algorithm examples
  GLM: Generalized Linear Model
    Based on statistical techniques
  SVM: Support Vector Machines
    Supports linear and non-linear regression
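A minimal sketch of simple linear regression by ordinary least squares, applied to the age and height example. The data points are invented and deliberately lie on a perfect line (height = 72 + 7 * age) so the fitted coefficients are easy to verify; they are not real growth measurements.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: age in years, height in cm.
ages = [2, 4, 6, 8, 10]
heights = [86, 100, 114, 128, 142]

slope, intercept = fit_line(ages, heights)


def predict_height(age):
    """Predict a number (height) from the fitted line."""
    return intercept + slope * age


print(slope, intercept, predict_height(5))
```

Unlike classification, the output is a number on a continuous scale, here a height for an age that does not appear in the data.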


Post-Processing

Analyze the retrieved information

Generate knowledge

This is often a cyclic process: earlier steps must be redone until useful information is found

KDD is a slow process

/formasresearchgroup /formasresearch

www.formas.ufba.br

Semantic Applications and Formalisms Research Group


Email: [email protected]

Our course: formas.ufba.br/dclaro


http://www.dcc.ufba.br/~dclaro

