Ch2_DMIntro_1

8/8/2019 Ch2_DMIntro_1


Chapter 2. Introduction to Data Mining

Prof. Keith [email protected]



The Course Book

Data Mining: A Tutorial-Based Primer

by Richard J. Roiger and Michael Geatz

Amazon.com

Paperback: 408 pages; Dimensions (in inches): 0.67 x 9.14 x 7.44

Publisher: Addison-Wesley Publishing; Book and CD-ROM edition (September 26, 2002)

ISBN: 0201741288

List Price: $40.00

Availability: Usually ships within 2 to 3 days



1.1 Data Mining: A Definition

Data mining is the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Induction-based learning is the process of forming generally applicable models (or concept definitions) by observing specific examples.



Concepts

Definition: A concept is a set of objects, symbols or events grouped together because they share certain characteristics.

Concept: roughly a set, class, group or cluster.

Classical View: A concept is a set with well-defined, deterministic inclusion rules. E.g. A home owner is a good credit risk.

Probabilistic View: A concept is a set with probabilistic inclusion rules. E.g. A home owner has an 80% chance of being a good credit risk.

Exemplar View: A given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept. E.g. Mr. Smith owns his own home and is a good credit risk.



An Investment Dataset

Table 1.3 Acme Investors Incorporated

Customer  Account     Margin   Transaction  Trades/               Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
1013      Custodial   No       Broker       0.5      F    50-59  Skiing      80-99K
1245      Joint       No       Online       3.6      M    20-29  Golf        20-39K
2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
1001      Individual  Yes      Online       5.0      M    40-49  Golf        60-79K

The flat file of data is in attribute-value format.

Each row (record) is also called a case or instance.

Each column gives the values of one attribute (or variable) for each of the cases.

Attributes are either discrete (categorical, factorial), having a fixed number of possible values (e.g. Sex, and Age, which is recorded here in ranges), or real, having a continuous range of possible values (e.g. average Trades/Month).
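As a rough illustration of the attribute-value format, the five rows of Table 1.3 can be held as one record per case, with each attribute labelled as categorical or real. This is only a sketch: the snake_case attribute names and the `attribute_kind` helper are illustrative assumptions, not part of the course material.

```python
# Table 1.3 in attribute-value form: one dict per case (instance).
# The snake_case attribute names are illustrative, not from the source.
FIELDS = ["customer_id", "account_type", "margin_account", "transaction_method",
          "trades_per_month", "sex", "age", "favorite_recreation", "annual_income"]
RECORDS = [
    (1005, "Joint", "No", "Online", 12.5, "F", "30-39", "Tennis", "40-59K"),
    (1013, "Custodial", "No", "Broker", 0.5, "F", "50-59", "Skiing", "80-99K"),
    (1245, "Joint", "No", "Online", 3.6, "M", "20-29", "Golf", "20-39K"),
    (2110, "Individual", "Yes", "Broker", 22.3, "M", "30-39", "Fishing", "40-59K"),
    (1001, "Individual", "Yes", "Online", 5.0, "M", "40-49", "Golf", "60-79K"),
]
customers = [dict(zip(FIELDS, record)) for record in RECORDS]


def attribute_kind(values):
    """Return 'real' if every value is a float, otherwise 'categorical'."""
    return "real" if all(isinstance(v, float) for v in values) else "categorical"


# customer_id is an identifier rather than a descriptive attribute, so skip it.
kinds = {
    name: attribute_kind([row[name] for row in customers])
    for name in FIELDS
    if name != "customer_id"
}

for name, kind in kinds.items():
    print(f"{name:20s} {kind}")
```

Only Trades/Month comes out as real here; Age and Annual Income are stored as ranges, so they behave as categorical attributes.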



Possible Business Questions

Questions about Table 1.3 (Acme Investors Incorporated):

Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)

What characteristics distinguish between Online and Broker investors? (DISCRIMINATION)

Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION)



Supervised Learning

In the last two questions, we single out ONE of the attributes that we would like to be able to determine from the values of the others.

What characteristics distinguish between Online and Broker investors? (DISCRIMINATION). Transaction Method (categorical) is the target variable.

Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION). Trades/Month (real) is the target variable.

The target variable is called the output variable.

The other variables are called input variables.

Clearly, which attributes are the output and input variables depends on your question.

For these questions we KNOW the values of the output variable for the cases in the dataset.

In such cases we say that we do SUPERVISED learning, since the learning is controlled by the known values of the output variable in the dataset.
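The point that the question decides which attribute is the output can be sketched in a few lines. The `split_input_output` helper and the snake_case attribute names are hypothetical, chosen only for this illustration; the case itself is customer 1005 from Table 1.3.

```python
def split_input_output(case, target):
    """Split one case into (input attributes, output value) for a chosen target."""
    inputs = {name: value for name, value in case.items() if name != target}
    return inputs, case[target]


# Customer 1005 from Table 1.3 (identifier omitted; names are illustrative).
case = {
    "account_type": "Joint",
    "margin_account": "No",
    "transaction_method": "Online",
    "trades_per_month": 12.5,
    "sex": "F",
    "age": "30-39",
    "favorite_recreation": "Tennis",
    "annual_income": "40-59K",
}

# DISCRIMINATION question: Transaction Method is the output variable.
inputs_a, output_a = split_input_output(case, "transaction_method")

# PREDICTION question: Trades/Month is the output variable.
inputs_b, output_b = split_input_output(case, "trades_per_month")

print(output_a, sorted(inputs_a))
print(output_b, sorted(inputs_b))
```

The same case yields two different input/output splits, one per question.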



Unsupervised Learning

For the question:

Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION),

NO particular attribute is singled out as an OUTPUT variable.

The question is open-ended.

We do not know if there are any different investor types at all.

If there are different investor types, we do not know how many types there are.

If there are different investor types, then we do not know what the various investor types (or classes, or concepts) mean. We have to determine the meaning of the concepts, and appropriate names, after we have determined that they exist.

The method of induction-based learning used is said to be UNSUPERVISED in such a situation, because there are no known output classes to control the learning process.



Another Example Dataset

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient  Sore           Swollen
ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis
1        Yes     Yes    Yes      Yes         Yes       Strep throat
2        No      No     No       Yes         Yes       Allergy
3        Yes     Yes    No       Yes         No        Cold
4        Yes     No     Yes      No          No        Strep throat
5        No      Yes    No       Yes         No        Cold
6        No      No     No       Yes         No        Allergy
7        No      No     Yes      No          No        Strep throat
8        Yes     No     No       Yes         Yes       Allergy
9        No      Yes    No       Yes         Yes       Cold
10       Yes     Yes    No       Yes         Yes       Cold

In this example dataset there are categorical attributes corresponding to Symptoms, and a categorical attribute of Diagnosis.

The natural question is to predict the Diagnosis (class) [the output variable] from the symptoms [the input variables].

This requires supervised classification learning.



The Two Concept Learning Paradigms

Supervised Learning

Builds a learner model, or concept definitions, using data instances of known origin, and uses the model to determine the outcome of new instances of unknown origin.

Unsupervised Learning

A data mining method that builds models from data without predefined classes.

Usually used for classification/clustering.



Supervised Learning: A Decision Tree Example

A Decision Tree is a tree structure where non-terminal nodes represent tests/decisions on one or more attributes and terminal nodes reflect decision outcomes.

Let us consider the Symptoms/Diagnosis dataset for a supervised classification.



Table 1.1 Hypothetical Training Data for Disease Diagnosis (the ten patient records shown earlier)

Consider each of the attributes in turn, to see which would be a good one to start our Decision Tree with.

Is there a perfect 1-1 relationship between any of the input variables and the output variable?

Sore Throat and Fever don't seem very good. However, {Swollen Glands = Yes} corresponds 1-1 with {Diagnosis = Strep throat},

i.e. If {Swollen Glands = Yes} then {Diagnosis = Strep throat}

Hence we use Swollen Glands for our first Decision Node.

Repeating the process on the remaining cases (Swollen Glands = No), Fever separates Cold from Allergy, and we get:



Swollen Glands   (first test/decision node)
  Yes: Diagnosis = Strep Throat   (terminal decision node)
  No:  Fever
         Yes: Diagnosis = Cold      (terminal decision node)
         No:  Diagnosis = Allergy   (terminal decision node)
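The search for a perfect 1-1 relationship between an input attribute and the Diagnosis can be checked mechanically on Table 1.1. The sketch below is one way to do it, under the assumption that "1-1" means the cases with a given attribute value are exactly the cases of one diagnosis; the function and variable names are hypothetical.

```python
ATTRIBUTES = ["sore_throat", "fever", "swollen_glands", "congestion", "headache"]

# Table 1.1, patients 1-10: the five symptoms in ATTRIBUTES order, then Diagnosis.
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def one_to_one_pairs(attr_index):
    """Return (value, diagnosis) pairs whose sets of cases coincide exactly."""
    pairs = []
    values = sorted({row[attr_index] for row in ROWS})
    diagnoses = sorted({row[-1] for row in ROWS})
    for value in values:
        with_value = {i for i, row in enumerate(ROWS) if row[attr_index] == value}
        for diagnosis in diagnoses:
            with_class = {i for i, row in enumerate(ROWS) if row[-1] == diagnosis}
            if with_value == with_class:
                pairs.append((value, diagnosis))
    return pairs


results = {name: one_to_one_pairs(i) for i, name in enumerate(ATTRIBUTES)}
for name, pairs in results.items():
    print(name, pairs)
```

Note that Congestion = No also implies Strep throat in this data, but not the other way round (patient 1 has Strep throat with congestion), so it is not reported as a 1-1 relationship.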



Notes on this Decision Tree:

The tree is drawn upside down, with the root at the top.

The Decision Tree fits the training data perfectly. There are no errors. Accuracy on the training data = 100%.

The Decision Tree discards the unnecessary attributes (Sore Throat, Congestion and Headache are not used).

A computer algorithm to construct Decision Trees would be fairly easy to program, and would do the job much quicker than we humans can.
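To give a feel for how short such a program can be, here is a minimal recursive tree builder run on Table 1.1. The slides do not name an algorithm, so the splitting criterion used here (lowest weighted entropy, in the style of ID3) is an assumption, as are all function and attribute names.

```python
import math
from collections import Counter

ATTRIBUTES = ["sore_throat", "fever", "swollen_glands", "congestion", "headache"]

# Table 1.1, patients 1-10: the five symptoms in ATTRIBUTES order, then Diagnosis.
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def entropy(rows):
    """Shannon entropy (in bits) of the Diagnosis column of rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())


def partition(rows, attr_index):
    """Group rows by their value of one attribute."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr_index], []).append(row)
    return parts


def build_tree(rows, attr_indices):
    """Return a diagnosis (leaf) or a pair (attribute name, {value: subtree})."""
    if len({row[-1] for row in rows}) == 1 or not attr_indices:
        # Leaf: all cases agree, or no attributes are left; use the majority class.
        return Counter(row[-1] for row in rows).most_common(1)[0][0]

    def split_entropy(i):
        return sum(len(p) / len(rows) * entropy(p) for p in partition(rows, i).values())

    best = min(attr_indices, key=split_entropy)
    remaining = [i for i in attr_indices if i != best]
    branches = {value: build_tree(part, remaining)
                for value, part in partition(rows, best).items()}
    return (ATTRIBUTES[best], branches)


def classify(tree, instance):
    """Follow the tests from the root until a leaf (a diagnosis string) is reached."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[instance[name]]
    return tree


tree = build_tree(ROWS, list(range(len(ATTRIBUTES))))
correct = sum(classify(tree, dict(zip(ATTRIBUTES, row))) == row[-1] for row in ROWS)
print(tree)
print(f"training accuracy: {correct}/{len(ROWS)}")
```

On this data the sketch tests Swollen Glands first and Fever second, and classifies all ten training cases correctly. A simpler criterion such as counting misclassified cases would tie between Swollen Glands and Fever at the root, which is why entropy was chosen for the illustration.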



Use of the Decision Tree for Prediction

We may now use the Decision Tree for future diagnoses (or prediction of diagnosis). Consider the following symptomatic data:

Table 1.2 Data Instances with an Unknown Classification

Patient  Sore           Swollen
ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis
11       No      No     Yes      Yes         Yes       ?
12       Yes     Yes    No       No          Yes       ?
13       No      No     No       No          Yes       ?

What are the predicted diagnoses?

Are these likely to be 100% accurate?



Production Rules

We may summarize the Decision Tree by listing the decisions along each path from the starting node to each terminal node.

1. IF Swollen Glands = Yes
   THEN Diagnosis = Strep Throat

2. IF Swollen Glands = No & Fever = Yes
   THEN Diagnosis = Cold

3. IF Swollen Glands = No & Fever = No
   THEN Diagnosis = Allergy
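The three production rules translate directly into a small function, which can then be applied to the unclassified patients 11 to 13 of Table 1.2 and checked against the training data of Table 1.1. The function name `diagnose` and the tuple layout are illustrative choices, not from the source.

```python
def diagnose(swollen_glands, fever):
    """Apply production rules 1-3; both arguments are the strings 'Yes' or 'No'."""
    if swollen_glands == "Yes":
        return "Strep Throat"   # rule 1
    if fever == "Yes":
        return "Cold"           # rule 2
    return "Allergy"            # rule 3


# Patients 11-13 from Table 1.2: (patient id, swollen glands, fever).
unknown = [(11, "Yes", "No"), (12, "No", "Yes"), (13, "No", "No")]
predictions = {pid: diagnose(glands, fever) for pid, glands, fever in unknown}
print(predictions)

# Patients 1-10 from Table 1.1: (swollen glands, fever, recorded diagnosis).
training = [
    ("Yes", "Yes", "Strep throat"), ("No", "No", "Allergy"), ("No", "Yes", "Cold"),
    ("Yes", "No", "Strep throat"), ("No", "Yes", "Cold"), ("No", "No", "Allergy"),
    ("Yes", "No", "Strep throat"), ("No", "No", "Allergy"), ("No", "Yes", "Cold"),
    ("No", "Yes", "Cold"),
]
# Compare case-insensitively: the table writes "Strep throat", the rules "Strep Throat".
errors = sum(diagnose(glands, fever).lower() != label.lower()
             for glands, fever, label in training)
print("training errors:", errors)
```

Zero errors on the ten training cases says nothing about how accurate the predictions for new patients will be, which is the point of the question on the previous slide.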



Unsupervised Clustering

A data mining method that builds models from data without predefined output classes.

Questions about Table 1.3 (Acme Investors Incorporated):

What attribute similarities group customers together?

What differences in attribute values segment the customers?

How many significant clusters are there?
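One simple way to make "attribute similarities" concrete is to count, for each pair of customers in Table 1.3, the fraction of categorical attributes on which they agree. This simple-matching measure is an assumption chosen for illustration; the slides do not say which similarity measure or clustering method is used, and the attribute names are hypothetical.

```python
from itertools import combinations

# Categorical attributes of Table 1.3 (Trades/Month is real, so it is left out).
CATEGORICAL = ["account_type", "margin_account", "transaction_method",
               "sex", "age", "favorite_recreation", "annual_income"]

# Customer ID -> values in CATEGORICAL order.
customers = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def simple_matching(a, b):
    """Fraction of categorical attributes on which two customers agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


similarity = {
    (i, j): simple_matching(customers[i], customers[j])
    for i, j in combinations(customers, 2)
}

# List the pairs from most to least similar.
for pair, score in sorted(similarity.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A clustering method would go on to group customers whose pairwise similarities are high; with only five cases the scores are best read as an illustration of the idea rather than as evidence of real clusters.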



1.3 Is Data Mining Appropriate for My Problem?

Data Mining or Data Query (using SQL and OLAP)?

It depends on the type of question you want to answer, and the type of knowledge you want to discover.

Shallow Knowledge: simple summaries (e.g. averages) or aggregates (totals) of an attribute over a selected set of cases. You need to know which cases to select. SQL can do this.

Multidimensional Knowledge: information about the frequent occurrence of values of different attributes (known as Association Analysis). OLAP on the data cube can do this.

Hidden Knowledge: knowledge about patterns or relationships that cannot be guessed at prior to data mining.

Deep Knowledge: knowledge about hidden patterns and relationships which can only be discovered using prior scientific or meta-knowledge. This is the research frontier for Data Mining.
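The shallow-knowledge case can be sketched with an ordinary SQL aggregate. The example below loads three columns of Table 1.3 into an in-memory SQLite database through Python's standard `sqlite3` module; the table name `investors` and the column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investors ("
    "customer_id INTEGER, transaction_method TEXT, trades_per_month REAL)"
)
# Customer ID, Transaction Method and Trades/Month from Table 1.3.
conn.executemany(
    "INSERT INTO investors VALUES (?, ?, ?)",
    [(1005, "Online", 12.5), (1013, "Broker", 0.5), (1245, "Online", 3.6),
     (2110, "Broker", 22.3), (1001, "Online", 5.0)],
)

# Shallow knowledge: an average over a set of cases we already know how to select.
rows = conn.execute(
    "SELECT transaction_method, AVG(trades_per_month), COUNT(*) "
    "FROM investors GROUP BY transaction_method ORDER BY transaction_method"
).fetchall()

for method, average, count in rows:
    print(f"{method}: {average:.2f} trades/month over {count} customers")
conn.close()
```

The query answers a question whose shape was known in advance (average trades per month by transaction method); it does not discover which attributes are worth grouping by, which is where data mining comes in.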



Data Mining vs. OLAP vs. Data Query

Use data query if you already almost know what you are looking for, and you wish to work with large databases.

Use OLAP if you wish to discover simple associations in large databases.

Use data mining to find patterns and relationships in data that are not obvious.

Because of the relative slowness of data mining algorithms, this often means that the database has to be small, or sampled. Devising data mining algorithms which scale to large databases is a current research topic in Data Mining.



Data Mining Applications

Data mining is a young discipline with wide and diverse applications

There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications

Some application domains:

Biomedical and DNA data analysis

Financial data analysis

Retail industry

Telecommunication industry



Biomedical Data Mining and DNA Analysis

DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).

Gene: a sequence of hundreds of individual nucleotides arranged in a particular order

Humans were once estimated to have around 100,000 genes; current estimates are closer to 20,000 protein-coding genes

Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes

Semantic integration of heterogeneous, distributed genome databases

Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data

Data cleaning and data integration methods developed in data mining will help



DNA Analysis: Examples

Similarity search and comparison among DNA sequences

Compare the frequently occurring patterns of each class (e.g., diseased and healthy)

Identify gene sequence patterns that play roles in various diseases

Association analysis: identification of co-occurring gene sequences

Most diseases are not triggered by a single gene but by a combination of genes acting together

Association analysis may help determine the kinds of genes that are likely to co-occur in target samples

Path analysis: linking genes to different disease development stages

Different genes may become active at different stages of the disease

Develop pharmaceutical interventions that target the different stages separately

Visualization tools and genetic data analysis



Data Mining for Financial Data Analysis

Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality

Design and construction of data warehouses for multidimensional data analysis and data mining

View the debt and revenue changes by month, by region, by sector, and by other factors

Access statistical information such as max, min, total, average, trend, etc.

Loan payment prediction/consumer credit policy analysis

Feature selection and attribute relevance ranking

Loan payment performance

Consumer credit rating



Financial Data Mining

Classification and clustering of customers for targeted marketing

Multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer with an appropriate customer group

Detection of money laundering and other financial crimes

Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)

Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)



Data Mining for Retail Industry

Retail industry: huge amounts of data on sales, customer shopping history, etc.

Applications of retail data mining:

Identify customer buying behaviors

Discover customer shopping patterns and trends

Improve the quality of customer service

Achieve better customer retention and satisfaction

Enhance goods consumption ratios

Design more effective goods transportation and distribution policies



Data Mining in Retail Industry: Examples

Design and construction of data warehouses based on the benefits of data mining

Multidimensional analysis of sales, customers, products, time, and region

Analysis of the effectiveness of sales campaigns

Customer retention: analysis of customer loyalty

Use customer loyalty card information to register sequences of purchases of particular customers

Use sequential pattern mining to investigate changes in customer consumption or loyalty

Suggest adjustments on the pricing and variety of goods

Purchase recommendation and cross-reference of items



Data Mining for Telecomm. Industry (1)

A rapidly expanding and highly competitive industry with a great demand for data mining

Understand the business involved

Identify telecommunication patterns

Catch fraudulent activities

Make better use of resources

Improve the quality of service

Multidimensional analysis of telecommunication data

Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.



Data Mining for Telecomm. Industry (2)

Fraudulent pattern analysis and the identification of unusual patterns

Identify potentially fraudulent users and their atypical usage patterns

Detect attempts to gain fraudulent entry to customer accounts

Discover unusual patterns which may need special attention

Multidimensional association and sequential pattern analysis

Find usage patterns for a set of communication services by customer group, by month, etc.

Promote the sales of specific services

Improve the availability of particular services in a region

Use of visualization tools in telecommunication data analysis
