Chapter 20
Data Analysis and Mining
Data Analysis and Mining
Decision Support Systems
• Obtain high-level information for decision making from the detailed information stored in (database) transaction-processing systems
Data Analysis and OLAP (OnLine Analytical Processing)
• Develop tools/techniques for generating summarized data from very large DBs for data analysis
Data Warehousing
• Integrate data from multiple sources with a unified schema at a single site, where materialized views are created
Data Mining
• Adopt AI/statistical analysis techniques for knowledge discovery in very large DBs
Decision Support Systems
Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
Examples of business decisions:
What items to stock?
What insurance premium to charge?
To whom to send advertisements?
Examples of data used for making decisions
Retail sales transaction details
Customer profiles (income, age, gender, etc.)
Data Mining
Data mining (DM) is the process of semi-automatically analyzing large databases to find useful patterns
Prediction (a DM technique) based on past history
• Predict whether a credit-card applicant poses a good credit risk, using some attributes (income, job type, age) and past payment history
• Predict whether a pattern of phone calling-card usage is likely to be fraudulent
Some examples of prediction mechanisms:
Classification
• Given a new item whose class is unknown, predict to which class it belongs
Regression formulae
• Given a set of mappings for an unknown function, predict the function result for a new parameter value
Classification Rules
Classification rules help assign new objects to classes
e.g., given a new automobile insurance applicant, should (s)he be classified as low risk, medium risk, or high risk?
Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
∀ person P, P.degree = “masters” and P.income > 100,000 ⇒ P.credit = excellent
∀ person P, P.degree = “bachelors” and (P.income ≥ 75,000 and P.income ≤ 100,000) ⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications
Classification rules can be shown compactly as a decision tree
Decision Tree
Construction of Decision Trees
Training set: a data sample in which the classification is already known.
Greedy top-down generation of decision trees
Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition (on samples) for the node
Leaf node:
• all (or most) of the items at the node belong to the same class, or
• all attributes have been considered, and no further partitioning is possible
Decision trees can also be represented as sets of IF-THEN rules
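The greedy top-down construction described above can be sketched in a few lines. The following Python sketch is not from the slides: the sample representation (a list of (attribute-dict, class-label) pairs) and the caller-supplied choose_attribute function (e.g., the Gain-based selection described on the following slides) are assumptions made for illustration.

```python
def build_tree(samples, attributes, choose_attribute):
    """Greedy top-down decision-tree construction (illustrative sketch).

    samples: list of (attribute_dict, class_label) pairs
    attributes: list of attribute names still available for partitioning
    choose_attribute(samples, attributes): picks the partitioning attribute,
    e.g. the one with the highest information gain (see the following slides).
    """
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                    # all items belong to the same class
        return ("leaf", labels[0])
    if not attributes:                           # no further partitioning possible
        return ("leaf", max(set(labels), key=labels.count))   # majority class

    best = choose_attribute(samples, attributes)          # partitioning attribute
    children = {}
    for value in {attrs[best] for attrs, _ in samples}:   # one branch per value
        subset = [(a, c) for a, c in samples if a[best] == value]
        remaining = [a for a in attributes if a != best]
        children[value] = build_tree(subset, remaining, choose_attribute)
    return ("node", best, children)
```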
Construction of Decision Trees
Example. A decision tree for the concept Play_Tennis
The instance (Outlook = Sunny, Humidity = High, Temperature = Hot, Wind = Strong) is classified as a negative instance (i.e., predicting Play_Tennis = no)
Decision-Tree Induction: Attribute Selection
Information gain (Gain) measure: select the test attribute at each node N in the decision tree T
The attribute A with the highest Gain (or greatest entropy reduction) is chosen as the test/partitioning attribute of N
A reflects the least randomness or impurity in the partition
The expected information (I) needed to classify a given sample in the entire sample data set S is given by

I(s1, s2, …, sm) = - ∑_{i=1}^{m} pi log2 pi

where si (1 ≤ i ≤ m) is the number of data samples of class Ci in S, which contains s data samples in total,
m is the number of distinct classes: C1, C2, …, Cm,
pi is the probability that an arbitrary sample belongs to Ci, i.e., pi = si / s, and
log2 is used as information is encoded in bits
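As a quick illustration (not part of the original slides), the expected information I can be computed directly from the class counts; the function name expected_info is an assumption made here.

```python
from math import log2

def expected_info(counts):
    """I(s1, ..., sm) for a list of class counts [s1, ..., sm]."""
    s = sum(counts)                                    # total number of samples
    return -sum((si / s) * log2(si / s) for si in counts if si > 0)

# e.g., for the 9 'yes' / 5 'no' split used in the example later: I(9, 5) ~ 0.94
print(round(expected_info([9, 5]), 2))                 # 0.94
```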
Decision-Tree Induction: Entropy
Use entropy to determine the expected information of an attribute
Entropy (or expected information) E of an attribute A
Let v be the number of distinct values {a1, a2, …, av} of A
Suppose A partitions S into v subsets {S1, S2, …, Sv}, where Sj (1 ≤ j ≤ v) contains the samples in S that have value aj of A
If A is the selected node label, i.e., the best attribute for splitting, then {S1, …, Sv} correspond to the labeled branches grown from the node labeled A

E(A) = ∑_{j=1}^{v} ((s1j + s2j + … + smj) / s) I(s1j, s2j, …, smj)

where (s1j + s2j + … + smj) / s is the weight of Sj (the samples having value aj of A) in S, and
sij (1 ≤ i ≤ m) is the number of data samples of class Ci in Sj
The smaller the E(A) value, the greater the purity of the subset partition, i.e., {S1, S2, …, Sv}
Decision-Tree Induction: Information Gain
Given the entropy of A:

E(A) = ∑_{j=1}^{v} ((s1j + s2j + … + smj) / s) I(s1j, s2j, …, smj)

I(s1j, s2j, …, smj) is defined as

I(s1j, s2j, …, smj) = - ∑_{i=1}^{m} pij log2 pij

where pij = sij / |Sj|, the probability that a sample in Sj is in Ci

The information gain of attribute A, i.e., the information gained by using A (as a node label), is defined as

Gain(A) = I(s1, s2, …, sm) - E(A)

Gain(A) is the expected reduction in entropy using the values of A
The attribute with the highest information gain is chosen for the given set of samples S (in a recursive manner)
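To tie the three formulas together, here is a minimal Python sketch (not from the slides). It assumes the per-value class counts of an attribute are given as `splits`, one list of counts [s1j, …, smj] per attribute value aj; the function names are illustrative.

```python
from math import log2

def expected_info(counts):
    """I(s1, ..., sm) for a list of class counts [s1, ..., sm]."""
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def entropy(splits):
    """E(A): weighted expected information over the subsets S1, ..., Sv of A."""
    s = sum(sum(counts) for counts in splits)
    return sum((sum(counts) / s) * expected_info(counts) for counts in splits)

def gain(splits):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    class_totals = [sum(col) for col in zip(*splits)]   # s1, ..., sm over all of S
    return expected_info(class_totals) - entropy(splits)
```

Selecting the test attribute at a node then amounts to evaluating gain(...) on each candidate attribute's splits and keeping the largest, which is exactly what the worked example on the following slides does by hand.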
Decision-Tree Induction
Example. Given the following training (sample) data, use a decision tree to predict which customers buy computers
Two classes of samples {yes, no}, i.e., S1 = ‘yes’, S2 = ‘no’, and m = 2
|S| = s = 14; |S1| = 9; |S2| = 5
Since I(s1, s2, …, sm) = - ∑_{i=1}^{m} pi log2 pi,
I(s1, s2) = I(9, 5) = - 9/14 log2 9/14 - 5/14 log2 5/14 = 0.94
which is the expected information needed to classify a sample in the entire data set
Next, compute the entropy of each attribute, i.e., Age, Income, etc.
Consider Age, with 3 distinct values: “<= 30”, “31..40”, “> 40”
Example: Consider Age
Since E(A) = ∑_{j=1}^{v} ((s1j + s2j + … + smj) / s) I(s1j, s2j, …, smj), and
I(s1j, s2j, …, smj) = - ∑_{i=1}^{m} pij log2 pij,
• Age = “<= 30”: s11 = 2, s21 = 3, I(s11, s21) = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971
• Age = “31..40”: s12 = 4, s22 = 0, I(s12, s22) = -4/4 log2 4/4 - 0/4 log2 0/4 = 0
• Age = “> 40”: s13 = 3, s23 = 2, I(s13, s23) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
• E(Age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
• Gain(Age) = I(s1, s2) - E(Age) = 0.94 - 0.694 = 0.246
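For readers who want to check the arithmetic, a few lines of Python reproduce the Age numbers above (the per-value class counts 2/3, 4/0 and 3/2 are taken from the slide):

```python
from math import log2

# Class counts per Age value: "<=30" -> (2, 3), "31..40" -> (4, 0), ">40" -> (3, 2)
i_all  = -(9/14) * log2(9/14) - (5/14) * log2(5/14)   # I(9, 5) ~ 0.940
i_le30 = -(2/5) * log2(2/5) - (3/5) * log2(3/5)       # I(2, 3) ~ 0.971
i_3140 = 0.0                                          # I(4, 0) =  0
i_gt40 = -(3/5) * log2(3/5) - (2/5) * log2(2/5)       # I(3, 2) ~ 0.971
e_age  = (5/14) * i_le30 + (4/14) * i_3140 + (5/14) * i_gt40
print(round(e_age, 3), round(i_all - e_age, 3))       # 0.694 0.247
```

(The slide's 0.246 comes from rounding 0.94 - 0.694 first; the unrounded gain is about 0.247.)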
Example: Consider Credit_Rating
Two classes of samples {yes, no}: S1 = ‘yes’, S2 = ‘no’, I(s1, s2) = 0.94
Since E(A) = ∑_{j=1}^{2} ((s1j + s2j + … + smj) / s) I(s1j, s2j, …, smj), and
I(s1j, s2j, …, smj) = - ∑_{i=1}^{m} pij log2 pij,
• Credit_Rating = “Fair”: s11 = 6 and s21 = 2,
  I(s11, s21) = - 6/8 log2 6/8 - 2/8 log2 2/8 = 0.81
• Credit_Rating = “Excellent”: s12 = 3 and s22 = 3,
  I(s12, s22) = - 3/6 log2 3/6 - 3/6 log2 3/6 = 1
• E(Credit_Rating) = 8/14 I(s11, s21) + 6/14 I(s12, s22) = 0.89
• Gain(Credit_Rating) = I(s1, s2) - E(Credit_Rating) = 0.94 - 0.89 = 0.05
Example: Consider Income
Two classes of samples {yes, no}: S1 = ‘yes’, S2 = ‘no’, I(s1, s2) = 0.94
Since E(A) = ∑_{j=1}^{3} ((s1j + s2j + … + smj) / s) I(s1j, s2j, …, smj), and
I(s1j, s2j, …, smj) = - ∑_{i=1}^{m} pij log2 pij,
• Income = “High”: s11 = 2 and s21 = 2,
  I(s11, s21) = - 2/4 log2 2/4 - 2/4 log2 2/4 = 1
• Income = “Low”: s12 = 3 and s22 = 1,
  I(s12, s22) = - 3/4 log2 3/4 - 1/4 log2 1/4 = 0.81
• Income = “Medium”: s13 = 4 and s23 = 2,
  I(s13, s23) = - 4/6 log2 4/6 - 2/6 log2 2/6 = 0.92
• E(Income) = 4/14 · 1 + 4/14 · 0.81 + 6/14 · 0.92 = 0.91
• Gain(Income) = I(s1, s2) - E(Income) = 0.94 - 0.91 = 0.03
Example: Consider Student
Two classes of samples {yes, no}: S1 = ‘yes’, S2 = ‘no’, I(s1, s2) = 0.94
Since E(A) = ∑_{j=1}^{2} ((s1j + s2j + … + smj) / s) I(s1j, s2j, …, smj), and
I(s1j, s2j, …, smj) = - ∑_{i=1}^{m} pij log2 pij,
• Student = “Yes”: s11 = 6 and s21 = 1,
  I(s11, s21) = - 6/7 log2 6/7 - 1/7 log2 1/7 = 0.59
• Student = “No”: s12 = 3 and s22 = 4,
  I(s12, s22) = - 3/7 log2 3/7 - 4/7 log2 4/7 = 0.98
• E(Student) = 7/14 · 0.59 + 7/14 · 0.98 = 0.78
• Gain(Student) = I(s1, s2) - E(Student) = 0.94 - 0.78 = 0.15
Decision-Tree Induction: Example
Since Gain(Income) = 0.03, Gain(Student) = 0.15, Gain(Rating) = 0.05, and Gain(Age) = 0.246, Age is the chosen attribute (to split on):
• A node is created and labeled with Age
• Branches are grown for each of the attribute’s values
• The Age = “31..40” branch becomes a leaf node (in the ‘Yes’ class)
Decision-Tree Induction: Example
Consider the leftmost branch (Age = “<= 30”), whose samples are:

Income   Student   Rating      Class
High     No        Fair        No
High     No        Excellent   No
Medium   No        Fair        No
Low      Yes       Fair        Yes
Medium   Yes       Excellent   Yes

Since Gain(Student) = 0.971, Gain(Income) = 0.571, and Gain(Rating) = 0.02, Student is the chosen attribute (to split on):
• A node is created and labeled with Student
• Branches are grown for each of the attribute’s values

The Student = “No” branch is a leaf node (in the ‘No’ class):

Income   Rating      Class
High     Fair        No
High     Excellent   No
Medium   Fair        No

The Student = “Yes” branch is a leaf node (in the ‘Yes’ class):

Income   Rating      Class
Low      Fair        Yes
Medium   Excellent   Yes
Decision-Tree Induction: Example
Example. Given the training (sample) data, the constructed final decision tree to predict customers who buy computers is:
Association Rules
Retail shops are often interested in associations between the different items that people buy
Someone who buys bread is quite likely also to buy milk
A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts
Association information can be used in several ways
e.g., when a customer buys a particular book, an online shop may suggest associated books
Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks
Left-hand side: antecedent; right-hand side: consequent
An association rule must have an associated population; the population consists of a set of instances
• e.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Association Rules (Cont.)
Rules have an associated support, as well as an associated confidence
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule
e.g., suppose only 0.001% of all purchases include milk and screwdrivers; then the support for the rule milk ⇒ screwdrivers is low
Confidence is a measure of how often the consequent is true when the antecedent is true.
e.g., the rule bread ⇒ milk has a confidence of 80% if 80% of the purchases that include bread also include milk
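As an illustration (not from the slides), support and confidence can be computed directly from a population of transactions; the toy purchase data and function names below are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent holds among transactions satisfying the antecedent."""
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante if ante else 0.0

# Hypothetical toy population of five purchases
purchases = [{"bread", "milk"}, {"bread", "milk", "eggs"},
             {"bread", "butter"}, {"milk"}, {"bread", "milk", "butter"}]
print(support(purchases, {"bread", "milk"}))        # 3/5 = 0.6
print(confidence(purchases, {"bread"}, {"milk"}))   # 3/4 = 0.75
```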
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support ≥ 2%)
Naïve algorithm
1. Consider all possible sets of relevant items
2. For each set find its support (i.e., count how many transactions purchase all items in the set)
Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association rules
From itemset S generate the rule (S - s) ⇒ s for each s ∈ S
• Support of rule = support(S)
• Confidence of rule = support(S) / support(S - s)
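A small sketch (not from the slides) of step 3, generating rules from a large itemset: the `supports` dictionary of itemset supports is assumed to have been computed already (e.g., by the a priori passes on the next slide), and the minimum-confidence threshold is an illustrative parameter.

```python
def rules_from_itemset(S, supports, min_confidence=0.5):
    """Emit every rule (S - {s}) => {s} with its support and confidence."""
    rules = []
    for s in S:
        antecedent = S - {s}
        if not antecedent:
            continue
        conf = supports[S] / supports[antecedent]   # support(S) / support(S - s)
        if conf >= min_confidence:
            rules.append((antecedent, frozenset({s}), supports[S], conf))
    return rules

# Hypothetical precomputed supports
supports = {frozenset({"bread"}): 0.6, frozenset({"milk"}): 0.7,
            frozenset({"bread", "milk"}): 0.5}
print(rules_from_itemset(frozenset({"bread", "milk"}), supports))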
Finding Support
Determine the support of itemsets via a single pass over the set of transactions
Large itemsets: sets with a high count at the end of the pass
If there is not enough memory to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass
Optimization: Once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
The a priori technique to find large itemsets:
Pass 1: count the support of all sets with just 1 item; eliminate those items with low support
Pass i: candidates include every set of i items such that all of its (i - 1)-item subsets are large
• Count the support of all candidates; stop if there are no candidates
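A compact sketch (not from the slides) of the level-wise a priori passes described above; it assumes transactions are sets of items and that min_support is given as a fraction of all transactions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return the large itemsets (as frozensets) with their supports."""
    n = len(transactions)

    def count(itemsets):
        return {s: sum(1 for t in transactions if s <= t) / n for s in itemsets}

    # Pass 1: all 1-item sets, keeping only those with sufficient support
    items = {frozenset({x}) for t in transactions for x in t}
    large = {s: sup for s, sup in count(items).items() if sup >= min_support}
    all_large, k = dict(large), 2

    while large:
        # Candidate k-item sets: unions of large (k-1)-item sets ...
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # ... kept only if every (k-1)-item subset is large (a priori pruning)
        candidates = {c for c in candidates
                      if all(frozenset(sub) in large for sub in combinations(c, k - 1))}
        large = {s: sup for s, sup in count(candidates).items() if sup >= min_support}
        all_large.update(large)
        k += 1
    return all_large

# Hypothetical toy usage
txns = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}, {"milk", "eggs"}]
print(apriori(txns, min_support=0.5))   # e.g. includes {bread}, {milk}, {bread, milk}
```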