Date post: | 10-Apr-2018 |
Upload: | mamatadalei |
8/8/2019 Ch2_DMIntro_1
Chapter 2. Introduction to Data Mining
Prof. Keith [email protected]
The Course Book
Data Mining: A Tutorial-Based Primer
by Richard J. Roiger and Michael Geatz.
Amazon.com
Paperback: 408 pages; Dimensions (in inches): 0.67 x 9.14 x 7.44
Publisher: Addison-Wesley Publishing; Book and CD-ROM edition (September 26, 2002)
ISBN: 0201741288
List Price: $40.00
Availability: Usually ships within 2 to 3 days
1.1 Data Mining: A Definition
Data mining is the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based learning is the process of forming generally applicable models (or concept definitions) by observing specific examples.
Concepts
Definition: A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.
Roughly speaking, concept = set, class, group, or cluster.
Classical View: A concept is a set with well-defined, deterministic inclusion rules.
E.g. A home owner is a good credit risk.
Probabilistic View: A concept is a set with probabilistic inclusion rules.
E.g. A home owner has an 80% chance of being a good credit risk.
Exemplar View: A given instance is an example of a particular concept if it is similar enough to a set of one or more known examples of the concept.
E.g. Mr. Smith owns his own home and is a good credit risk.
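The three views can be contrasted in a few lines of Python. This is only an illustrative sketch: the function names, the `employed` attribute, the similarity measure, and the 0.5 threshold are assumptions made for the example; only the home-owner rule and the 80% figure come from the slide.

```python
# Three views of the concept "good credit risk", using the slide's home-owner example.

def classical_view(owns_home: bool) -> bool:
    """Deterministic inclusion rule: a home owner is a good credit risk."""
    return owns_home


def probabilistic_view(owns_home: bool):
    """Probabilistic inclusion rule: probability of being a good credit risk."""
    # 0.8 is the slide's figure for home owners; the slide gives no figure
    # for non-owners, so None is returned in that case.
    return 0.8 if owns_home else None


def similarity(a: dict, b: dict) -> float:
    """Assumed measure: fraction of shared attributes that have equal values."""
    shared = [key for key in a if key in b]
    if not shared:
        return 0.0
    return sum(a[key] == b[key] for key in shared) / len(shared)


def exemplar_view(instance: dict, exemplars: list, threshold: float = 0.5) -> bool:
    """An instance belongs to the concept if it is similar enough to a known example."""
    return any(similarity(instance, example) >= threshold for example in exemplars)


# Hypothetical attribute values; only "owns his own home" comes from the slide.
mr_smith = {"owns_home": True, "employed": True}
applicant = {"owns_home": True, "employed": False}

print(classical_view(True))
print(probabilistic_view(True))
print(exemplar_view(applicant, [mr_smith]))
```

The exemplar view is the only one that needs stored examples and a similarity measure; the other two need only a rule.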
An Investment Dataset
Table 1.3 Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/               Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
1013      Custodial   No       Broker       0.5      F    50-59  Skiing      80-99K
1245      Joint       No       Online       3.6      M    20-29  Golf        20-39K
2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
1001      Individual  Yes      Online       5.0      M    40-49  Golf        60-79K
The flat file of data is in attribute-value format.
Each row/record is also called a case or instance.
Each column gives values for an attribute (or variable) for each of the cases.
Attributes are either discrete/categorical/factorial, having a fixed number of possible values (e.g. sex and age), or real, having a continuous range of possible values (e.g. average trades/month).
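As a rough sketch of the attribute-value format, Table 1.3 can be held as one record (dictionary) per case. The helper name `attribute_kind` and the rule "all values are floats means real" are assumptions made for this illustration, not part of the course text.

```python
# Table 1.3 in attribute-value form: one dict per case, one key per attribute.
COLUMNS = ["Customer ID", "Account Type", "Margin Account", "Transaction Method",
           "Trades/Month", "Sex", "Age", "Favorite Recreation", "Annual Income"]
ROWS = [
    ("1005", "Joint", "No", "Online", 12.5, "F", "30-39", "Tennis", "40-59K"),
    ("1013", "Custodial", "No", "Broker", 0.5, "F", "50-59", "Skiing", "80-99K"),
    ("1245", "Joint", "No", "Online", 3.6, "M", "20-29", "Golf", "20-39K"),
    ("2110", "Individual", "Yes", "Broker", 22.3, "M", "30-39", "Fishing", "40-59K"),
    ("1001", "Individual", "Yes", "Online", 5.0, "M", "40-49", "Golf", "60-79K"),
]
records = [dict(zip(COLUMNS, row)) for row in ROWS]


def attribute_kind(records, name):
    """Assumed rule of thumb: an attribute is real if every value is a float."""
    values = [record[name] for record in records]
    return "real" if all(isinstance(v, float) for v in values) else "categorical"


# Customer ID is stored as a string because it is an identifier, not a quantity.
kinds = {name: attribute_kind(records, name) for name in COLUMNS}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Age and Annual Income come out as categorical here because the table records them as ranges rather than exact numbers.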
Possible Business Questions
Table 1.3 Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/               Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
1013      Custodial   No       Broker       0.5      F    50-59  Skiing      80-99K
1245      Joint       No       Online       3.6      M    20-29  Golf        20-39K
2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
1001      Individual  Yes      Online       5.0      M    40-49  Golf        60-79K
Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION)
What characteristics distinguish between Online and Broker investors? (DISCRIMINATION)
Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION)
Supervised Learning
In the last two questions, we single out ONE of the attributes that we would like to be able to determine from the values of the others.
What characteristics distinguish between Online and Broker investors? (DISCRIMINATION). Transaction Method (categorical) is the target variable.
Can I develop a model which will predict the average trades/month for a new investor? (PREDICTION). Trades/Month (real) is the target variable.
The target variable is also called the output variable.
The other variables are called input variables.
Clearly, which attributes are the output and input variables depends on your question.
For these questions, we KNOW the values of the output variable for the cases in the dataset.
In such cases we say that we do SUPERVISED learning, since the learning is controlled by the known values of the output variable in the dataset.
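The dependence of the input/output split on the question can be sketched as follows. The helper `split_inputs_output` is a hypothetical name, and only four of the Table 1.3 attributes are kept to make the example short.

```python
# A cut-down copy of Table 1.3 (four attributes only, to keep the sketch short).
records = [
    {"Transaction Method": "Online", "Trades/Month": 12.5, "Sex": "F", "Age": "30-39"},
    {"Transaction Method": "Broker", "Trades/Month": 0.5, "Sex": "F", "Age": "50-59"},
    {"Transaction Method": "Online", "Trades/Month": 3.6, "Sex": "M", "Age": "20-29"},
    {"Transaction Method": "Broker", "Trades/Month": 22.3, "Sex": "M", "Age": "30-39"},
    {"Transaction Method": "Online", "Trades/Month": 5.0, "Sex": "M", "Age": "40-49"},
]


def split_inputs_output(records, output_name):
    """Separate the chosen output variable from the remaining input variables."""
    inputs = [{k: v for k, v in r.items() if k != output_name} for r in records]
    outputs = [r[output_name] for r in records]
    return inputs, outputs


# DISCRIMINATION question: the output variable is categorical.
X_disc, y_disc = split_inputs_output(records, "Transaction Method")
# PREDICTION question: the output variable is real.
X_pred, y_pred = split_inputs_output(records, "Trades/Month")

print(y_disc)
print(y_pred)
```

The same dataset yields two different supervised problems; an attribute that is the output for one question is an input for the other.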
Unsupervised Learning
For the question:
Can I develop a general characterisation/profile of different investor types? (CLASSIFICATION),
NO particular attribute is singled out as an OUTPUT variable.
The question is open-ended.
We do not know if there are any different investor types at all.
If there are different investor types, we do not know how many types there are.
If there are different investor types, then we do not know what the various investor types (or classes, or concepts) mean. We have to determine the meaning of the concepts, and appropriate names, after we have determined that they exist.
The method of induction-based learning used is said to be UNSUPERVISED in such a situation, because there are no known output classes to control the learning process.
Another Example Dataset
Table 1.1 Hypothetical Training Data for Disease Diagnosis
Patient  Sore           Swollen
ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis
1        Yes     Yes    Yes      Yes         Yes       Strep throat
2        No      No     No       Yes         Yes       Allergy
3        Yes     Yes    No       Yes         No        Cold
4        Yes     No     Yes      No          No        Strep throat
5        No      Yes    No       Yes         No        Cold
6        No      No     No       Yes         No        Allergy
7        No      No     Yes      No          No        Strep throat
8        Yes     No     No       Yes         Yes       Allergy
9        No      Yes    No       Yes         Yes       Cold
10       Yes     Yes    No       Yes         Yes       Cold
In this example dataset there are categorical attributes corresponding to symptoms, and a categorical attribute of Diagnosis.
The natural question is to predict the Diagnosis (class) [the output variable] from the symptoms [the input variables].
This requires supervised classification learning.
The Two Concept Learning Paradigms
Supervised Learning
builds a learner model, or concept definitions, using data instances of known origin,
and uses the model to determine the outcome of new instances of unknown origin.
Unsupervised Learning
A data mining method that builds models from data without predefined classes.
Usually for classification/clustering.
Supervised Learning:
A Decision Tree Example
A Decision Tree is a tree structure where non-terminal nodes represent tests/decisions on one or more attributes and terminal nodes reflect decision outcomes.
Let us consider the Symptoms/Diagnosis dataset for a supervised classification.
Table 1.1 Hypothetical Training Data for Disease Diagnosis
Patient  Sore           Swollen
ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis
1        Yes     Yes    Yes      Yes         Yes       Strep throat
2        No      No     No       Yes         Yes       Allergy
3        Yes     Yes    No       Yes         No        Cold
4        Yes     No     Yes      No          No        Strep throat
5        No      Yes    No       Yes         No        Cold
6        No      No     No       Yes         No        Allergy
7        No      No     Yes      No          No        Strep throat
8        Yes     No     No       Yes         Yes       Allergy
9        No      Yes    No       Yes         Yes       Cold
10       Yes     Yes    No       Yes         Yes       Cold
Consider each of the attributes in turn, to see which would be a good one to start our Decision Tree with.
Is there a perfect 1-1 relationship between a value of any of the input variables and a value of the output variable?
Sore Throat and Fever don't seem very good. However,
{Swollen Glands = Yes} corresponds 1-1 with {Diagnosis = Strep throat}
i.e. If {Swollen Glands = Yes} then {Diagnosis = Strep throat}
Hence we use Swollen Glands for our first Decision Node.
Continuing in the same way with the remaining cases, we get the following tree.
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No:  Fever
         Yes: Diagnosis = Cold
         No:  Diagnosis = Allergy
Swollen Glands is the first test/decision node; each Diagnosis leaf is a terminal decision node.
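The search for an attribute value that corresponds 1-1 with a diagnosis can be sketched in Python. This is a minimal illustration over the Table 1.1 data, not a full tree-induction algorithm; the function name `one_to_one_pairs` is made up for the example.

```python
# Table 1.1 training data (patient IDs 1 to 10, in order).
COLUMNS = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache", "Diagnosis"]
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def one_to_one_pairs(data, target="Diagnosis"):
    """Find (attribute, value, class) triples where attribute == value
    holds for every case of that class and for no case of any other class."""
    pairs = []
    for attr in (c for c in data[0] if c != target):
        for value in sorted({r[attr] for r in data}):
            classes = {r[target] for r in data if r[attr] == value}
            if len(classes) != 1:
                continue  # this value occurs with more than one diagnosis
            (cls,) = classes
            # Converse check: every case of this class must carry the value.
            if all(r[attr] == value for r in data if r[target] == cls):
                pairs.append((attr, value, cls))
    return pairs


print(one_to_one_pairs(data))
```

On this dataset the only triple found is Swollen Glands = Yes with Strep throat. Congestion = No also occurs only with Strep throat, but patient 1 has Strep throat with Congestion = Yes, so the correspondence is not 1-1.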
Notes on this Decision Tree:
The tree is drawn upside down, with the root at the top.
The Decision Tree fits the data perfectly.
There are no errors. Accuracy = 100%.
The Decision Tree discards the unnecessary attributes.
A computer algorithm to construct Decision Trees would be fairly easy to program, and would do the job much more quickly than we humans can.
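These notes can be checked mechanically. In the sketch below the tree is written by hand as a nested dictionary (an assumed representation chosen for this example), then run against the ten training cases of Table 1.1.

```python
# The decision tree as nested dicts: internal nodes test an attribute,
# leaves are plain strings holding the diagnosis.
TREE = {
    "attribute": "Swollen Glands",
    "branches": {
        "Yes": "Strep throat",
        "No": {
            "attribute": "Fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}

COLUMNS = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache", "Diagnosis"]
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch for each tested value."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def attributes_used(tree):
    """Collect the attributes that the tree actually tests."""
    if not isinstance(tree, dict):
        return set()
    used = {tree["attribute"]}
    for child in tree["branches"].values():
        used |= attributes_used(child)
    return used


accuracy = sum(classify(TREE, r) == r["Diagnosis"] for r in data) / len(data)
print(accuracy)
print(sorted(attributes_used(TREE)))
```

The accuracy is measured on the same cases the tree was built from, so it says nothing yet about how well the tree does on new patients.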
Use of the Decision Tree for Prediction
We may now use the Decision Tree for future diagnoses (or prediction of diagnosis). Consider the following symptomatic data:
Table 1.2 Data Instances with an Unknown Classification
Patient  Sore           Swollen
ID#      Throat  Fever  Glands   Congestion  Headache  Diagnosis
11       No      No     Yes      Yes         Yes       ?
12       Yes     Yes    No       No          Yes       ?
13       No      No     No       No          Yes       ?
What are the predicted diagnoses?
Are these likely to be 100% accurate?
Production Rules
We may summarize the Decision Tree by listing
the decisions along each path from the starting
node to each terminal node.
1. IF Swollen Glands = Yes
   THEN Diagnosis = Strep Throat
2. IF Swollen Glands = No & Fever = Yes
   THEN Diagnosis = Cold
3. IF Swollen Glands = No & Fever = No
   THEN Diagnosis = Allergy
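The three rules translate directly into data plus a small matcher. The sketch below applies them to patients 11 to 13 of Table 1.2; the names `RULES` and `apply_rules` are assumptions for this illustration.

```python
# Each production rule: (conditions that must all hold, resulting diagnosis).
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]

# Patients 11 to 13 from Table 1.2 (diagnosis unknown).
NEW_PATIENTS = {
    11: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "Yes",
         "Congestion": "Yes", "Headache": "Yes"},
    12: {"Sore Throat": "Yes", "Fever": "Yes", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
    13: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
}


def apply_rules(rules, instance):
    """Return the diagnosis of the first rule whose conditions all match."""
    for conditions, diagnosis in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return diagnosis
    return None  # no rule fired


predictions = {pid: apply_rules(RULES, p) for pid, p in NEW_PATIENTS.items()}
for pid, diagnosis in predictions.items():
    print(pid, diagnosis)
```

The rules give Strep Throat, Cold, and Allergy for patients 11, 12, and 13 respectively. A perfect fit on ten training cases does not guarantee that these predictions are correct.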
Unsupervised Clustering
A data mining method that builds models from data without predefined output classes.
Table 1.3 Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/               Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30-39  Tennis      40-59K
1013      Custodial   No       Broker       0.5      F    50-59  Skiing      80-99K
1245      Joint       No       Online       3.6      M    20-29  Golf        20-39K
2110      Individual  Yes      Broker       22.3     M    30-39  Fishing     40-59K
1001      Individual  Yes      Online       5.0      M    40-49  Golf        60-79K
What attribute similarities group customers together?
What differences in attribute values segment the customers?
How many significant clusters are there?
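The first question can be explored with a simple matching count over the categorical attributes of Table 1.3. This is only a sketch of a similarity measure that a clustering method might build on, not a clustering algorithm; the measure itself (fraction of attributes with equal values) is an assumption chosen for the example, and Trades/Month is left out because it is real-valued.

```python
from itertools import combinations

CATEGORICAL = ["Account Type", "Margin Account", "Transaction Method", "Sex",
               "Age", "Favorite Recreation", "Annual Income"]
# Customer ID -> categorical attribute values from Table 1.3.
CUSTOMERS = {
    "1005": ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    "1013": ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    "1245": ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    "2110": ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    "1001": ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def shared_attributes(a, b):
    """Names of the attributes on which customers a and b have equal values."""
    return [name for name, x, y in zip(CATEGORICAL, CUSTOMERS[a], CUSTOMERS[b])
            if x == y]


# Simple matching similarity for every pair of customers.
similarity = {
    (a, b): len(shared_attributes(a, b)) / len(CATEGORICAL)
    for a, b in combinations(CUSTOMERS, 2)
}

for (a, b), score in sorted(similarity.items(), key=lambda item: -item[1]):
    print(a, b, round(score, 2), shared_attributes(a, b))
```

With only five customers, several pairs tie for the highest similarity, which illustrates why the number of significant clusters is itself an open question.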
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query (using SQL and OLAP)?
It depends on the type of question you want to answer, and the type of knowledge you want to discover.
Shallow Knowledge: simple summaries (e.g. averages), or aggregates (totals) of an attribute over a selected set of cases.
You need to know the cases to select. SQL can do this.
Multidimensional Knowledge: Information about the frequent co-occurrence of values of different attributes (known as Association Analysis). OLAP on the data cube can do this.
Hidden Knowledge: Knowledge about patterns or relationships that cannot be guessed at prior to data mining.
Deep Knowledge: Knowledge about hidden patterns and relationships which can only be discovered using prior scientific or meta-knowledge.
This is the research frontier for Data Mining.
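As an illustration of shallow knowledge, the sketch below loads three columns of Table 1.3 into an in-memory SQLite database and asks for a simple summary. The table and column names (`investors`, `trades_per_month`, and so on) are made up for the example.

```python
import sqlite3

# Three columns of Table 1.3: customer ID, transaction method, trades/month.
rows = [
    (1005, "Online", 12.5),
    (1013, "Broker", 0.5),
    (1245, "Online", 3.6),
    (2110, "Broker", 22.3),
    (1001, "Online", 5.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investors ("
    "customer_id INTEGER, transaction_method TEXT, trades_per_month REAL)"
)
conn.executemany("INSERT INTO investors VALUES (?, ?, ?)", rows)

# We already know which cases to select (Online investors) and which
# summary we want (count and average), so a plain query is enough.
count_online, avg_online = conn.execute(
    "SELECT COUNT(*), AVG(trades_per_month) FROM investors "
    "WHERE transaction_method = ?",
    ("Online",),
).fetchone()
conn.close()

print(count_online, round(avg_online, 2))
```

The query answers a question whose shape was known in advance; it does not discover that Online and Broker investors differ unless someone thinks to ask.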
Data Mining vs. OLAP vs. Data Query
Use data query if you already almost know what you are looking for, and you wish to work with large databases.
Use OLAP if you wish to discover simple associations in large databases.
Use data mining to find patterns and relationships in data that are not obvious.
Because of the relative slowness of data mining algorithms, this often means that the database has to be small, or sampled. Devising data mining algorithms which scale to large databases is a current research topic in Data Mining.
Data Mining Applications
Data mining is a young discipline with wide and diverse applications.
There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications.
Some application domains
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
Biomedical Data Mining and DNA Analysis
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
Humans have roughly 20,000 protein-coding genes (early estimates were as high as 100,000)
Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
Semantic integration of heterogeneous, distributed genome databases
Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
Data cleaning and data integration methods developed in data mining will help
DNA Analysis: Examples
Similarity search and comparison among DNA sequences
Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
Identify gene sequence patterns that play roles in various diseases
Association analysis: identification of co-occurring gene sequences
Most diseases are not triggered by a single gene but by a combination of genes acting together
Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
Path analysis: linking genes to different disease development stages
Different genes may become active at different stages of the disease
Develop pharmaceutical interventions that target the different stages separately
Visualization tools and genetic data analysis
Data Mining for Financial Data Analysis
Financial data collected in banks and financial institutions are
often relatively complete, reliable, and of high quality
Design and construction of data warehouses for
multidimensional data analysis and data mining
View the debt and revenue changes by month, by region, by
sector, and by other factors
Access statistical information such as max, min, total, average, trend, etc.
Loan payment prediction/consumer credit policy analysis
feature selection and attribute relevance ranking
Loan payment performance
Consumer credit rating
Financial Data Mining
Classification and clustering of customers for targeted marketing
multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer with an appropriate customer group
Detection of money laundering and other financial crimes
integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
Data Mining for Retail Industry
Retail industry: huge amounts of data on sales,
customer shopping history, etc.
Applications of retail data mining
Identify customer buying behaviors
Discover customer shopping patterns and trends
Improve the quality of customer service
Achieve better customer retention and satisfaction
Enhance goods consumption ratios
Design more effective goods transportation and
distribution policies
Data Mining in Retail Industry: Examples
Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time, and region
Analysis of the effectiveness of sales campaigns
Customer retention: Analysis of customer loyalty
Use customer loyalty card information to register sequences of purchases of particular customers
Use sequential pattern mining to investigate changes in customer consumption or loyalty
Suggest adjustments on the pricing and variety of goods
Purchase recommendation and cross-reference of items
Data Mining for Telecomm. Industry (1)
A rapidly expanding and highly competitive industry with a great demand for data mining
Understand the business involved
Identify telecommunication patterns
Catch fraudulent activities
Make better use of resources
Improve the quality of service
Multidimensional analysis of telecommunication data
Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.
Data Mining for Telecomm. Industry (2)
Fraudulent pattern analysis and the identification of unusual
patterns
Identify potentially fraudulent users and their atypical usage patterns
Detect attempts to gain fraudulent entry to customer accounts
Discover unusual patterns which may need special attention
Multidimensional association and sequential pattern analysis
Find usage patterns for a set of communication services by customer
group, by month, etc.
Promote the sales of specific services
Improve the availability of particular services in a region
Use of visualization tools in telecommunication data analysis