3
Distribution Definitions
• Discrete Probability Distribution
• Continuous Probability Distribution
• Cumulative Distribution Function
4
Discrete Distribution
• A r.v. X is discrete if it takes countably many values {x1,x2,….}
• The probability function or probability mass function for X is given by – fX(x)= P(X=x)
• From previous example
fX(x) = 1/4  if x = 0
        1/2  if x = 1
        1/4  if x = 2
        0    otherwise
5
Continuous Distributions
• A r.v. X is continuous if there exists a function fX such that
fX(x) ≥ 0

∫_{−∞}^{∞} fX(x) dx = 1

P(a ≤ X ≤ b) = ∫_a^b fX(x) dx
6
Example: Continuous Distribution
• Suppose X has the pdf
• This is the Uniform (0,1) distribution
fX(x) = 1  if 0 ≤ x ≤ 1
        0  otherwise
7
Binomial Distribution
• A coin flips Heads with probability p. Flip it n times and let X be the number of Heads. Assume flips are independent.
• Let f(x) =P(X=x), then
f(x) = (n choose x) p^x (1 − p)^(n − x)   for x = 0, 1, …, n
       0                                  otherwise
8
Binomial Example
• Let p =0.5; n = 5 then
P(X = 4) = (5 choose 4) × 0.5^4 × (1 − 0.5)^(5−4) = 0.1562

• In Matlab: >> binopdf(4,5,0.5)
9
Normal Distribution
• X has a Normal (Gaussian) distribution with parameters μ and σ if
f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

• X is standard Normal if μ = 0 and σ = 1. It is denoted as Z.
• If X ~ N(μ, σ²) then (X − μ)/σ ~ Z
10
Normal Example
• The number of spam emails received by an email server in a day follows a Normal distribution N(1000, 500). What is the probability of receiving 2000 spam emails in a day?
• Let X be the number of spam emails received in a day. We want P(X = 2000)?
• The answer is P(X=2000) = 0;
• It is more meaningful to ask P(X >= 2000);
11
Normal Example
• This is
• In Matlab: >> 1 –normcdf(2000,1000,500)
• The answer is 1 – 0.9772 = 0.0228 or 2.28%
• This type of analysis is so common that there is a special name for it: cumulative distribution function F.
P(X ≥ 2000) = 1 − P(X < 2000) = 1 − ∫_{−∞}^{2000} f(x) dx

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
12
Conditional Independence
• If A and B are independent then P(A|B)=P(A)
• P(AB) = P(A|B)P(B)
• Law of Total Probability.
14
Question 1
• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?
Gender | % of credit card holders | % of gender who default
Male   | 60                       | 55
Female | 40                       | 35
15
Answer to Question 1
P(G=F | D=Y) = P(D=Y | G=F) P(G=F) / [P(D=Y | G=F) P(G=F) + P(D=Y | G=M) P(G=M)]
             = (0.35 × 0.40) / (0.35 × 0.40 + 0.55 × 0.60)
             ≈ 0.30
But what does G=F and D=Y mean? We have not even formally defined them.
17
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
19
Hierarchical Clustering
[Figure: four panels over points p1–p4 — a traditional hierarchical clustering with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram]
20
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple (a sketch is given below)
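A minimal Matlab sketch of this basic algorithm (illustrative only; with the Statistics Toolbox one would normally just call kmeans). It assumes X is an n-by-d data matrix, K and maxIter are chosen by the user, and MATLAB R2016b+ for implicit expansion; save it as simple_kmeans.m.

function [labels, centroids] = simple_kmeans(X, K, maxIter)
    n = size(X, 1);
    centroids = X(randperm(n, K), :);   % K random points as initial centroids
    labels = zeros(n, 1);
    for iter = 1:maxIter
        % Assignment step: each point goes to the cluster with the closest centroid
        for i = 1:n
            d = sum((centroids - X(i, :)).^2, 2);   % squared Euclidean distances
            [~, labels(i)] = min(d);
        end
        % Update step: each centroid becomes the mean of its assigned points
        for k = 1:K
            if any(labels == k)
                centroids(k, :) = mean(X(labels == k, :), 1);
            end
        end
    end
end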
21
K-means Clustering – Details
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to 'Until relatively few points change clusters'
• Complexity is O(n * K * I * d)
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
22
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centroid
  – To get SSE, we square these errors and sum them.
  – x is a data point in cluster Ci and mi is the representative point for cluster Ci
    • It can be shown that mi corresponds to the center (mean) of the cluster
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(mi, x)²
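A hedged Matlab sketch of this SSE computation, assuming X, labels and centroids in the same form as the K-means sketch above:

sse = 0;
for k = 1:size(centroids, 1)
    diffs = X(labels == k, :) - centroids(k, :);   % distances to the cluster's centroid
    sse = sse + sum(sum(diffs .^ 2));              % square the errors and sum them
end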
23
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
[Figure: example dendrogram over six points (merge heights from 0 to 0.2) and the corresponding nested clusters in the plane]
24
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
25
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
26
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
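A hedged Matlab sketch (assuming the Statistics Toolbox): linkage performs the merge loop above, dendrogram draws the tree, and cluster 'cuts' it into a chosen number of clusters.

X = rand(20, 2);                       % toy data, for illustration only
Z = linkage(pdist(X), 'single');       % proximity matrix + repeated closest-pair merges
dendrogram(Z);                         % nested clusters drawn as a tree
labels = cluster(Z, 'maxclust', 3);    % cut the dendrogram to get 3 clusters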
28
Missing Data
• We think of clustering as a problem of estimating missing data.
• The missing data are the cluster labels.
• Clustering is only one example of a missing data problem. Several other problems can be formulated as missing data problems.
29
Missing Data Problem
• Let D = {x(1),x(2),…x(n)} be a set of n observations.
• Let H = {z(1),z(2),..z(n)} be a set of n values of a hidden variable Z.– z(i) corresponds to x(i)
• Assume Z is discrete.
30
EM Algorithm
• The log-likelihood of the observed data is
  l(θ) = log p(D|θ) = log Σ_H p(D, H|θ)
• Not only do we have to estimate θ but also H
• Let Q(H) be the probability distribution on the missing data.
31
EM Algorithm
The inequality follows from Jensen's inequality. This means that F(Q, θ) is a lower bound on l(θ).
Notice that the log of a sum has become a sum of logs.
32
EM Algorithm
• The EM algorithm alternates between maximizing F with respect to Q (with θ fixed) and then maximizing F with respect to θ (with Q fixed).
34
EM Algorithm
• The M-step reduces to maximizing the first term with respect to θ, since θ does not appear in the second term.
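To make the E-step/M-step alternation concrete, here is a minimal Matlab sketch for a two-component 1-D Gaussian mixture (an assumed special case, not the general derivation on these slides). It assumes x is a column vector of observations, MATLAB R2016b+ for implicit expansion, and the Statistics Toolbox for normpdf; save it as em_gmm_sketch.m.

function [mu, sigma, w, z] = em_gmm_sketch(x, maxIter)
    n = numel(x);
    mu = [min(x); max(x)];  sigma = [std(x); std(x)];  w = [0.5; 0.5];
    for iter = 1:maxIter
        % E-step: Q(H) = posterior responsibilities for the hidden labels
        r = [w(1) * normpdf(x, mu(1), sigma(1)), ...
             w(2) * normpdf(x, mu(2), sigma(2))];
        r = r ./ sum(r, 2);
        % M-step: re-estimate the parameters to maximize the expected log-likelihood
        Nk = sum(r, 1)';
        mu = (r' * x) ./ Nk;
        sigma = sqrt(sum(r .* (x - mu').^2, 1)' ./ Nk);
        w = Nk / n;
    end
    [~, z] = max(r, [], 2);   % hard assignments of the missing cluster labels
end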
36
What is Association Rule Mining?
• Association rule mining finds:
  – combinations of items that typically occur together in a database (market-basket analysis)
  – sequences of items that occur frequently in a database (sequential analysis)
• Originally introduced for market-basket analysis -- useful for analysing the purchasing behaviour of customers.
37
Market-Basket Analysis – Examples
 Where should strawberries be placed to maximize their sale?
 Services purchased together by telecommunication customers (e.g. broadband Internet, call forwarding, etc.) help determine how to bundle these services together to maximize revenue
 Unusual combinations of insurance claims can be a sign of fraud
 Medical histories can give indications of complications based on combinations of treatments
 Sport: analyzing game statistics (shots blocked, assists, and fouls) to gain competitive advantage
  • "When player X is on the floor, player Y's shot accuracy decreases from 75% to 30%"
  • Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data. Data Mining and Knowledge Discovery, 1(1), pp. 121-125
38
Support and Confidence - Example
• What is the support and confidence of the following rules?
  • {Beer} → {Bread}
  • {Bread, PeanutButter} → {Jelly}

support(X → Y) = support(X ∪ Y)
confidence(X → Y) = support(X ∪ Y) / support(X)
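A hedged Matlab sketch of these definitions, assuming a binary transaction matrix T (rows = transactions, columns = items) and itemsets given as vectors of column indices; the item indices BEER = 1 and BREAD = 2 and the tiny database are made up for illustration.

support    = @(T, items) mean(all(T(:, items), 2));
confidence = @(T, X, Y) support(T, [X Y]) / support(T, X);

T = [1 1 0; 1 0 1; 0 1 1; 1 1 1];   % made-up 4-transaction database
s = support(T, [1 2])               % support({Beer, Bread}) = 0.5
c = confidence(T, 1, 2)             % confidence(Beer -> Bread) = 0.5 / 0.75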
39
Association Rule Mining Problem Definition
• Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf,
• Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf
• I.e., we want rules with high confidence and support
• We call these rules interesting
• We would like to:
  • Design an efficient algorithm for mining association rules in large data sets
  • Develop an effective approach for distinguishing interesting rules from spurious ones
40
Generating Association Rules – Approach 1 (Naïve)
• Enumerate all possible rules and select those that satisfy the minimum support and confidence thresholds
• Not practical for large databases
• For a given dataset with m items, the total number of possible rules is 3^m − 2^(m+1) + 1 (Why?*)
• And most of these will be discarded!
• We need a strategy for rule generation -- generate only the promising rules
  • rules that are likely to be interesting, or, more accurately, don't generate rules that can't be interesting
*hint: use the inclusion-exclusion principle
41
Generating Association Rules – Approach 2
• What do these rules have in common?
  A,B → C    A,C → B    B,C → A
  Answer: they have the same support: support({A,B,C})
• The support of a rule X → Y depends only on the support of its itemset X ∪ Y
• Hence, a better approach: find frequent itemsets first, then generate the rules
• A frequent itemset is an itemset that occurs at least minsup times
• If an itemset is infrequent, all the rules that contain it will have support < minsup and there is no need to generate them
42
• 2-step approach:
  Step 1: Generate frequent itemsets -- Frequent Itemset Mining (i.e. support ≥ minsup)
    • e.g. {A,B,C} is frequent (so A,B → C, A,C → B and B,C → A satisfy the minsup threshold)
  Step 2: From them, extract rules that satisfy the confidence threshold (i.e. confidence ≥ minconf)
    • e.g. maybe only A,B → C and C,B → A are confident
• Step 1 is the computationally difficult part (the next slides explain why, and a way to reduce the complexity…)
Generating Association Rules – Approach 2
43
Frequent Itemset Generation (Step 1) – Brute-Force Approach
• Enumerate all possible itemsets and scan the dataset to calculate the support for each of them
• Example: I={a,b,c,d,e}
Given d items, there are 2^d − 1 possible (non-empty) candidate itemsets => not practical for large d

[Figure: the itemset lattice (search space) showing superset/subset relationships]
44
A subset of any frequent itemset is also frequent
Example: If {c,d,e} is frequent then {c,d}, {c,e}, {d,e}, {c}, {d}, {e} are also frequent
Frequent Itemset Generation (Step 1)-- Apriori Principle (1)
45
If an itemset is not frequent, then any superset of it is also not frequent

Frequent Itemset Generation (Step 1)-- Apriori Principle (2)

Example: If we know that {a,b} is infrequent, the entire sub-graph of its supersets can be pruned.
I.e.: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,c,d}, {a,b,c,e}, {a,b,d,e} and {a,b,c,d,e} are infrequent
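To make the pruning concrete, here is a hedged level-wise sketch in Matlab: it only ever extends itemsets that are already frequent, so supersets of infrequent itemsets are never generated. It is not the textbook Apriori join/candidate-generation step, just the principle; T is assumed to be a binary transaction matrix (rows = transactions, columns = items) and minsup a fraction. Save as apriori_sketch.m.

function freq = apriori_sketch(T, minsup)
    nItems = size(T, 2);
    support = @(items) mean(all(T(:, items), 2));
    % Level 1: frequent single items
    level = num2cell(find(arrayfun(@(i) support(i), 1:nItems) >= minsup));
    freq = level;
    while ~isempty(level)
        next = {};
        for c = 1:numel(level)
            base = level{c};
            for item = (max(base) + 1):nItems   % extend with larger items only
                cand = [base item];
                if support(cand) >= minsup      % supersets of infrequent sets are never kept
                    next{end+1} = cand;         %#ok<AGROW>
                end
            end
        end
        freq = [freq next];
        level = next;
    end
end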
46
Recall the 2 Step process for Association Rule Mining
Step 1: Find all frequent Itemsets
So far: main ideas and concepts (Apriori principle).
Later: algorithms
Step 2: Generate the association rules from the frequent itemsets.
47
ARGen Algorithm (Step 2)
• Generates interesting rules from the frequent itemsets
• Already know the rules are frequent (Why?), just need to check confidence.
ARGen algorithm:
for each frequent itemset F
    generate all non-empty subsets S
    for each s in S do
        if confidence(s → F−s) ≥ minConf
            then output rule s → F−s
end
Example: F={a,b,c}
S={{a,b}, {a,c}, {b,c}, {a}, {b}, {c}}
rules output: {a,b} → {c}, etc.
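A hedged Matlab rendering of the ARGen loop above (assumptions: itemsets are vectors of item indices, and support is a function handle such as the one sketched earlier, closed over a fixed transaction matrix). Save as argen_sketch.m.

function rules = argen_sketch(F, support, minConf)
    rules = {};
    for k = 1:numel(F) - 1                      % all non-empty proper subsets s of F
        subsets = nchoosek(F, k);
        for r = 1:size(subsets, 1)
            s = subsets(r, :);
            conf = support(F) / support(s);     % confidence(s -> F \ s)
            if conf >= minConf
                rules{end+1} = {s, setdiff(F, s), conf};   %#ok<AGROW>
            end
        end
    end
end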
48
ARGen - Example
• minsup = 30%, minconf = 50%
• The set of frequent itemsets
L={{Beer},{Bread}, {Milk},
{PeanutButter},
{Bread, PeanutButter}}
• Only the last itemset from L consists of 2 nonempty subsets of frequent itemsets – Bread and PeanutButter.
• => 2 rules will be generated
confidence(Bread → PeanutButter) = support({Bread, PeanutButter}) / support({Bread}) = 60/80 = 0.75 ≥ minconf

confidence(PeanutButter → Bread) = support({Bread, PeanutButter}) / support({PeanutButter}) = 60/60 = 1 ≥ minconf
49
Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional Probability:
  P(C|A) = P(A,C) / P(A)
  P(A|C) = P(A,C) / P(C)
• Bayes theorem:
  P(C|A) = P(A|C) P(C) / P(A)
50
Example of Bayes Theorem
• Given: – A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability he/she has meningitis?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
51
• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers
52
Bayesian Classifiers• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
– Choose value of C that maximizes P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
53
Naïve Bayes Classifier
• Assume independence among attributes Ai
when class is given: – P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
– Can estimate P(Ai| Cj) for all Ai and Cj.
– A new point is classified to Cj if P(Cj) Π_i P(Ai | Cj) is maximal.
54
• Class: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes:
P(Ai | Ck) = |Aik|/ Nc
– where |Aik| is number of instances having attribute Ai and belongs to class Ck
– Examples:
  P(Status=Married | No) = 4/7
  P(Refund=Yes | Yes) = 0
How to Estimate Probabilities from Data?
55
How to Estimate Probabilities from Data?
• For continuous attributes:
  – Discretize the range into bins
    • one ordinal attribute per bin
    • violates the independence assumption
  – Two-way split: (A < v) or (A > v)
    • choose only one of the two splits as a new attribute
  – Probability density estimation:
    • Assume the attribute follows a normal distribution
    • Use data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    • Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|c)
56
How to Estimate Probabilities from Data?
• Normal distribution:
– One for each (Ai,ci) pair
P(Ai | cj) = (1 / (√(2π) σij)) exp(−(Ai − μij)² / (2σij²))

• For (Income, Class=No):
  – sample mean = 110, sample variance = 2975

P(Income=120 | No) = (1 / (√(2π) × 54.54)) exp(−(120 − 110)² / (2 × 2975)) = 0.0072
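As a quick check, the same value can be obtained with Matlab's normpdf (Statistics Toolbox), noting that it expects the standard deviation rather than the variance:

>> normpdf(120, 110, sqrt(2975))   % ans ~ 0.0072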
57
Example of Naïve Bayes Classifier
Given a test record:  X = (Refund = No, Status = Married, Income = 120K)

P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024

P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                 = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes),
therefore P(No|X) > P(Yes|X)  =>  Class = No
58
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:
Original:   P(Ai | C) = Nic / Nc

Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)

m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)

where
  c: number of classes
  p: prior probability
  m: parameter
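A quick hedged illustration in Matlab of why smoothing matters, using made-up counts (Nic = 0, Nc = 7) and c = 2 classes:

Nic = 0;  Nc = 7;  c = 2;
original = Nic / Nc                 % 0, which would zero out the whole product
laplace  = (Nic + 1) / (Nc + c)     % 1/9 ~ 0.11, stays positive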
59
Example of Naïve Bayes Classifier
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals
Give Birth Can Fly Live in Water Have Legs Class
yes no yes no ?
A: attributes, M: mammals, N: non-mammals

P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A|M) P(M) = 0.06 × 7/20 = 0.021
P(A|N) P(N) = 0.0042 × 13/20 = 0.0027

P(A|M) P(M) > P(A|N) P(N)
=> Mammals
60
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some attributes– Use other techniques such as Bayesian Belief
Networks (BBN)
62
Motivation
• Bulk of data has a time component
• For example, retail transactions, stock prices
• Data set can be organized as N x M table
• N customers and the price of the calls they made in 365 days
• M << N
63
Objective
• Compress the data matrix X into Xc, such that
  – the compression ratio is high and the average error between the original and the compressed matrix is low
  – N could be in the order of millions and M in the order of hundreds
64
Example database
Customer | Wed 7/10 | Thu 7/11 | Fri 7/12 | Sat 7/13 | Sun 7/14
ABC      |    1     |    1     |    1     |    0     |    0
DEF      |    2     |    2     |    2     |    0     |    0
GHI      |    1     |    1     |    1     |    0     |    0
KLM      |    5     |    5     |    5     |    0     |    0
smith    |    0     |    0     |    0     |    2     |    2
john     |    0     |    0     |    0     |    3     |    3
tom      |    0     |    0     |    0     |    1     |    1
65
Decision Support Queries
• What was the amount of sales to GHI on July 11?
• Find the total sales to business customers for the week ending July 12th?
68
SVD Definition
• More importantly X can be written as
X = λ1 u1 v1ᵀ + λ2 u2 v2ᵀ + … + λr ur vrᵀ

where the singular values λ1 ≥ λ2 ≥ … ≥ λr are in decreasing order. Keeping only the k largest terms gives the compressed matrix

Xc = λ1 u1 v1ᵀ + λ2 u2 v2ᵀ + … + λk uk vkᵀ,   with k < r
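A hedged Matlab sketch of the rank-k truncation this defines (k is an assumed parameter; svd is built in):

[U, S, V] = svd(X, 'econ');                   % singular values on the diagonal of S, decreasing
k = 10;                                       % assumed number of terms kept, k < r
Xc = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';    % rank-k approximation of X
err = norm(X - Xc, 'fro') / norm(X, 'fro');   % relative compression error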
71
Density-based: LOF approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the
average of the ratios of the density of sample p and the density of its nearest neighbors
• Outliers are points with largest LOF value
[Figure: points p1 and p2 relative to two clusters of different density]
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
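A simplified Matlab sketch of this idea (not the full Breunig et al. LOF definition: density here is just the inverse of the average distance to the k nearest neighbors; pdist and squareform assume the Statistics Toolbox). Save as lof_sketch.m.

function lof = lof_sketch(X, k)
    D = squareform(pdist(X));                    % pairwise Euclidean distances
    [sortedD, idx] = sort(D, 2);                 % neighbors ordered by distance
    knnDist = sortedD(:, 2:k+1);                 % column 1 is the distance to itself
    knnIdx  = idx(:, 2:k+1);
    density = 1 ./ mean(knnDist, 2);             % local density of each point
    lof = mean(density(knnIdx), 2) ./ density;   % large LOF => likely outlier
end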
72
Clustering-Based
• Basic idea:
– Cluster the data into groups of different density
– Choose points in small cluster as candidate outliers
– Compute the distance between candidate points and non-candidate clusters.
• If candidate points are far from all other non-candidate points, they are outliers
75
Base Rate Fallacy in Intrusion Detection
• I: intrusive behavior, ¬I: non-intrusive behavior; A: alarm, ¬A: no alarm
• Detection rate (true positive rate): P(A|I)
• False alarm rate: P(A|¬I)
• Goal is to maximize both
  – the Bayesian detection rate, P(I|A)
  – and P(¬I|¬A)
76
Detection Rate vs False Alarm Rate
• Suppose:
• Then:
• False alarm rate becomes more dominant if P(I) is very low
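A hedged numerical illustration in Matlab with made-up figures (the slide's actual numbers are not reproduced here); it shows how a small P(I) drags down the Bayesian detection rate P(I|A):

P_I    = 1e-5;    % assumed prior probability of intrusive behavior
P_A_I  = 0.7;     % assumed detection rate P(A|I)
P_A_nI = 1e-3;    % assumed false alarm rate P(A|~I)
P_I_A  = (P_A_I * P_I) / (P_A_I * P_I + P_A_nI * (1 - P_I))
% ~ 0.007: even a good detector yields P(I|A) below 1% when P(I) is tiny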