Classification & Regression
COSC 526 Class 7
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
2
Last Class
• Streaming Data Analytics:
– sampling, counting, frequent item sets, etc.
• Berkeley Data Analytic Stack (BDAS)
3
Project Deliverables and Deadlines
Deliverable                          Due date         % Grade
Initial selection of topics [1, 2]   Jan 27, 2015     10
Project Description and Approach     Feb 20, 2015     20
Initial Report                       Mar 20, 2015     10
Project Demonstration                Apr 16-19, 2015  10
Final Project Report (10 pages)      Apr 21, 2015     25
Poster (12-16 slides)                Apr 23, 2015     25

• [1] Projects can come with their own data (e.g., from your project), or data can be provided
• [2] Datasets need to be open! Please don’t use datasets that have proprietary limitations
• All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles
4
This class and next…
• Classification
• Decision Tree Methods
• Instance based Learning
• Support Vector Machines (SVM)
• Latent Dirichlet Allocation (LDA)
• How do we make these work with large-scale datasets?
5
Part I: Classification and a few new terms…
6
Classification
• Given a collection of records (training):
– Each record has a set of attributes [x1,x2, …, xn]
– One attribute is referred to as the class [y]
• Training Goal: Build a model for the class attribute as a function of the set of attributes:
• Testing Goal: Previously unseen records should be assigned the class attribute as accurately as possible
7
Illustration of Classification

Training data:
ID  Attrib1  Attrib2  Attrib3  Class
1   Yes      Large    120,000  No
2   No       Medium   100,000  No
3   No       Small    70,000   No
4   Yes      Medium   120,000  No
5   No       Large    95,000   Yes
6   Yes      Large    220,000  Yes
…   …        …        …        …

Test data:
ID  Attrib1  Attrib2  Attrib3  Class
20  No       Small    55,000   ?
21  Yes      Medium   90,000   ?
22  Yes      …        …        ?

Workflow: the training data is fed to a learning algorithm, which learns a model; the model is then applied to the test data.
8
Examples of classification
• Predicting whether tumor cells are benign or cancerous
• Classifying credit card transactions as legitimate or fraudulent
• Classifying whether a protein sequence forms a helix, a beta-sheet, or a random coil
• Classifying whether a tweet is talking about sports, finance, terrorism, etc.
9
Two Problem Settings
• Supervised Learning:
– Training and testing data
– Training data includes labels (the class attribute y)
– Testing data consists of all other attributes
• Unsupervised Learning:
– Training data does not usually include labels
– We will cover unsupervised learning later (mid Feb)
10
Two more terms…
• Parametric:
– A particular functional form is assumed, e.g., Naïve Bayes
– Simplicity: easy to interpret and understand
– High bias: real data might not follow the assumed functional form
• Non-parametric:
– Estimation is purely data-driven
– No functional form is assumed
11
Part II: Classification with Decision Trees
Examples illustrated from Tan, Steinbach and Kumar, Introduction to Data Mining textbook
12
Decision Trees
• One of the more popular learning techniques
• Uses a tree-structured plan of a set of attributes to test in order to predict the output
• The decision of which attribute to test next is based on information gain
13
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)

Model (decision tree); Refund is the first splitting attribute:
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               ≥ 80K → YES
14
Another Example of Decision Tree

Same training data; this tree splits on Marital Status first:
MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               ≥ 80K → YES

There could be more than one tree that fits the same data!
15
Decision Tree Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set → Tree Induction algorithm → Learn Model (induction) → Decision Tree
Test Set → Apply Model (deduction) with the learned Decision Tree
16
Apply Model to Test Data

Model:
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               ≥ 80K → YES

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch matching the record at each node: Refund = No → MarSt = Married → leaf NO.

Assign Cheat to “No”
22
Decision Tree Classification Task (recap)

Training Set → Tree Induction algorithm → Learn Model (induction) → Decision Tree
Test Set → Apply Model (deduction) with the learned Decision Tree

(Same training and test tables as on the earlier Decision Tree Classification Task slide.)
23
Decision Tree Induction
• Many algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
24
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
– Recursively apply the procedure to each subset
(Dt here is the running example: the subset of the Refund / Marital Status / Taxable Income / Cheat training records from earlier that reach the node t currently being split.)
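Hunt’s general procedure above maps directly onto a short recursive function. Below is a minimal Python sketch (not the course’s reference code): attributes are assumed categorical, and the splitting attribute is picked naively in list order rather than by an impurity measure, which the following slides address.

```python
# Minimal sketch of Hunt's algorithm for categorical attributes.
# NOTE: the split attribute is chosen naively (first unused attribute);
# real implementations choose it with an impurity measure such as Gini.
from collections import Counter

def hunt(records, attrs, default):
    """records: list of (attribute-dict, label); attrs: attribute names."""
    if not records:
        return default                              # empty Dt: default class
    labels = [y for _, y in records]
    if len(set(labels)) == 1:
        return labels[0]                            # pure node: leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # no attributes left
    a, rest = attrs[0], attrs[1:]
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for v in set(x[a] for x, _ in records):
        subset = [(x, y) for x, y in records if x[a] == v]
        children[v] = hunt(subset, rest, majority)   # recurse on each subset
    return (a, children, majority)

def classify(tree, x):
    while isinstance(tree, tuple):                   # descend until a leaf label
        a, children, majority = tree
        tree = children.get(x[a], majority)
    return tree

data = [({"refund": "Yes", "marst": "Single"},   "No"),
        ({"refund": "No",  "marst": "Married"},  "No"),
        ({"refund": "No",  "marst": "Single"},   "Yes"),
        ({"refund": "No",  "marst": "Divorced"}, "Yes"),
        ({"refund": "Yes", "marst": "Married"},  "No")]
model = hunt(data, ["refund", "marst"], "No")
```

On this toy data the procedure first makes the Refund = Yes branch a pure “No” leaf, then splits the Refund = No branch on marital status.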
25
Hunt’s Algorithm (on the Cheat training data from earlier)

Step 1: all records at the root; predict the majority class → Don’t Cheat.

Step 2: split on Refund:
Refund?
  Yes → Don’t Cheat
  No  → Don’t Cheat

Step 3: refine the Refund = No branch by Marital Status:
Refund?
  Yes → Don’t Cheat
  No  → MaritalStatus?
          Single, Divorced → Cheat
          Married → Don’t Cheat

Step 4: refine the Single/Divorced branch by Taxable Income:
Refund?
  Yes → Don’t Cheat
  No  → MaritalStatus?
          Single, Divorced → TaxableIncome?
                               < 80K  → Don’t Cheat
                               >= 80K → Cheat
          Married → Don’t Cheat
26
Tree Induction
• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion
• Issues:
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting
27
Tree Induction (recap): first, how to specify the attribute test condition.
28
How to Specify Test Condition?
• Depends on attribute type:
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split:
– 2-way split
– Multi-way split
29
Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as distinct values.
  CarType? → Family | Sports | Luxury

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType? → {Family, Luxury} | {Sports}   OR   CarType? → {Sports, Luxury} | {Family}
30
Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as distinct values.
  Size? → Small | Medium | Large

• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size? → {Small, Medium} | {Large}   OR   Size? → {Medium, Large} | {Small}

• What about Size? → {Small, Large} | {Medium}? It does not preserve the ordering of the attribute values.
31
Splitting Based on Continuous Attributes
• Different ways of handling:
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute intensive
32
Splitting Based on Continuous Attributes

(i) Binary split:    TaxableIncome > 80K?  → Yes | No

(ii) Multi-way split: TaxableIncome? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
33
Tree Induction (recap): next, how to determine the best split.
34
How to determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Candidate splits:

Own Car?     Yes: C0: 6, C1: 4        No: C0: 4, C1: 6

Car Type?    Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7

Student ID?  one tiny partition per ID (c1 … c20), each holding a single record
             (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?
35
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
– Non-homogeneous, high degree of impurity: C0: 5, C1: 5
– Homogeneous, low degree of impurity: C0: 9, C1: 1
36
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
37
How to Find the Best Split

Before splitting, the parent node has class counts (C0: N00, C1: N01) with impurity M0.

Splitting on A? (Yes/No) yields nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2; their weighted combination is M12.

Splitting on B? (Yes/No) yields nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4; their weighted combination is M34.

Compare Gain = M0 – M12 vs. M0 – M34, and choose the attribute with the larger gain.
38
Measure of Impurity: GINI

• Gini index for a given node t:

  GINI(t) = 1 − Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/nc, for nc classes) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
39
Examples for computing GINI, with GINI(t) = 1 − Σ_j [p(j | t)]²:

C1: 0, C2: 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5:  P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4:  P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 − (2/6)² − (4/6)² = 0.444
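These worked examples are easy to check in code. The `gini` helper below is an illustrative sketch (the name is mine, not from the lecture):

```python
def gini(counts):
    """Gini index of a node: GINI(t) = 1 - sum_j p(j|t)^2,
    where counts[j] is the number of records of class j at the node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)
```

Applied to the three nodes above it reproduces 0.0, 0.278, and 0.444.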
40
Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as:

  GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.
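As a sketch, the split-quality formula can be computed directly from per-child class counts (function names are mine, not from the lecture):

```python
def gini(counts):
    """Gini index from a list of per-class counts at one node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split: sum_i (n_i / n) * GINI(child i),
    where partitions[i] holds the class counts at child i."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)
```

On the CarType count matrices shown two slides ahead this gives 0.400 for {Sports, Luxury} | {Family}, 0.419 for {Sports} | {Family, Luxury}, and 0.393 for the multi-way split.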
41
Binary Attributes: Computing the GINI Index

• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1: 6, C2: 6, Gini = 0.500

Split on B? (Yes → N1, No → N2):
      N1   N2
C1    5    1
C2    2    4

Gini(N1) = 1 − (5/7)² − (2/7)² ≈ 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 ≈ 0.371
42
Categorical Attributes: Computing the Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:
      CarType
      Family  Sports  Luxury
C1    1       2       1
C2    4       1       1
Gini = 0.393

Two-way split (find the best partition of values):
      {Sports, Luxury} | {Family}        {Sports} | {Family, Luxury}
C1     3                 1                2          2
C2     2                 4                1          5
Gini = 0.400                             Gini = 0.419
43
Continuous Attributes: Computing the Gini Index

• Use binary decisions based on one value (e.g., TaxableIncome > 80K? Yes/No)
• Several choices for the splitting value:
– Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it:
– Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

(Running example: the Refund / Marital Status / Taxable Income / Cheat training data from earlier.)
44
Continuous Attributes: Computing the Gini Index…

• For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted values (with Cheat label): 60(No) 70(No) 75(No) 85(Yes) 90(Yes) 95(Yes) 100(No) 120(No) 125(No) 220(No)

Candidate split positions: 55    65    72    80    87    92    97    110   122   172   230
Gini at each position:     0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at 97 (Gini = 0.300): the ≤ 97 partition holds {Yes: 3, No: 3} and the > 97 partition holds {Yes: 0, No: 4}.
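The sorted linear scan can be sketched as follows (an illustrative implementation, not CLOUDS or any of the named algorithms): sort once, then sweep candidate thresholds (midpoints between consecutive distinct values) while updating the two count matrices incrementally.

```python
def gini_counts(counts, n):
    """Gini index from a dict of class counts over n records."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """One sorted linear scan over candidate thresholds, updating the
    left/right class-count matrices incrementally at each step."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {c: 0 for c in set(labels)}
    right = {c: labels.count(c) for c in set(labels)}
    best_g, best_thr = float("inf"), None
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1                      # record i moves to the left partition
        right[y] -= 1
        if v == pairs[i + 1][0]:
            continue                      # no threshold between equal values
        thr = (v + pairs[i + 1][0]) / 2
        nl, nr = i + 1, n - (i + 1)
        g = nl / n * gini_counts(left, nl) + nr / n * gini_counts(right, nr)
        if g < best_g:
            best_g, best_thr = g, thr
    return best_g, best_thr
```

On the Taxable Income data above it recovers the minimum Gini of 0.300 at the threshold between 95 and 100.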
45
How do we do this with large datasets?

• GINI computation is expensive:
– Sorting along numeric attributes is O(n log n)!
– Need an in-memory hash table to split based on attributes
• Empirical observations:
– GINI scores tend to increase or decrease slowly along a sorted attribute
– The minimum Gini score of the splitting attribute is lower than at other data points
• CLOUDS (Classification for Large OUt-of-core Datasets) exploits these observations
46
How to hasten GINI computation? (1)

• Sampling of Splitting points (SS):
– Use quantiles to divide the points into q intervals
– Evaluate the GINI index at each interval boundary
– Determine the minimum value, GINI_min
– Split along the attribute with GINI_min
• Pro:
– Only one pass over the entire dataset
• Cons:
– Greedy
– Time depends on the number of points in each interval
47
How to hasten GINI computation? (2)

• Sampling Splitting points with Estimation (SSE):
– Find GINI_min as outlined in SS
– Estimate a lower bound GINI_est for each interval
– Eliminate (prune) all intervals where GINI_est ≥ GINI_min
– Derive the list of surviving (“alive”) intervals, within which the correct split can be determined

• How do we determine what to prune?
– Goal: estimate the minimum GINI attainable within an interval [v_l, v_u]
– Approach: a hill-climbing heuristic
– Inputs: n, the number of data points (records); c, the number of classes
– Compute: x_i (y_i), the number of elements of class i that are ≤ v_l (≤ v_u); c_i, the total number of elements of class i; n_l (n_u), the number of elements ≤ v_l (≤ v_u)
– Take the partial derivative along the subset of classes; the class with the minimum gradient determines the candidate split point for the next step

• Pros:
– Instead of depending on n, the hill-climbing heuristic depends on c
– Other optimizations can include prioritizing alive intervals based on the estimated GINI_min
– Better accuracy than SS in estimating the splitting points
• Cons:
– Paying more for splitting the data (I/O costs are higher)
48
Summary
• Decision trees are an elegant way to design classifiers:
– Simple, with an intuitive meaning associated with the splits
• More than one tree representation is possible for the same data:
– Many heuristic measures are available for deciding how to construct a tree
• For Big Data, we need to design heuristics that provide approximate (yet closely accurate) solutions
49
Part III: Classification with k-Nearest Neighbors
Examples illustrated from Tom Mitchell, Andrew Moore, Jure Leskovec, …
50
Instance based Learning
• Given a number of examples {(x, y)} and a query vector q:
– Find the closest example(s) x*
– Predict y*
• Learning:
– Store all training examples
• Prediction:
– Classification problem: return the majority class among the k closest examples
– Regression problem: return the average y value of the k closest examples
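A minimal k-NN predictor covering both cases might look like this (an illustrative sketch; the function name is mine):

```python
import math
from collections import Counter

def knn_predict(train, q, k, task="classify"):
    """train: list of (x, y), with x a tuple of floats; Euclidean distance.
    Returns the majority class (classification) or mean y (regression)
    over the k nearest neighbors of query q."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], q))[:k]
    ys = [y for _, y in neighbors]
    if task == "classify":
        return Counter(ys).most_common(1)[0][0]   # majority vote
    return sum(ys) / k                            # average of the k y-values
```

Note that "learning" really is just storing `train`; all the work happens at query time.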
51
Application Examples
• Collaborative filtering:
– Find the k most similar people to user x that have rated movies in a similar way
– Predict the rating y_x of x as an average of the y values of those k users
52
Visual Interpretation: How k-Nearest Neighbors works

(Figure: decision regions around the training points for k = 1 and k = 3.)
53
1-Nearest Neighbor

• Distance metric:
– e.g., Euclidean
• How many neighbors to look at?
– One
• Weighting function (optional):
– Not used
• How to fit with local points?
– Predict the same output as the nearest neighbor
54
k-Nearest Neighbor

• Distance metric:
– e.g., Euclidean
• How many neighbors to look at?
– k (e.g., k = 9)
• Weighting function (optional):
– Not used
• How to fit with local points?
– Predict the majority (or average) output of the k nearest neighbors
55
Generalizing k-NN: Kernel Regression
• Instead of k neighbors, we look at all points
• Weight the points using a Gaussian function, e.g.:

  w_i = exp(−d(x_i, q)² / K_w²)

  so w_i is largest (w_i = 1) when d(x_i, q) = 0
• Fit the local points with the weighted average
56
Distance Weighted k-NN
• Points that are close by may be weighted more heavily than points farther away
• Prediction rule (weighted average):

  ŷ = Σ_i w_i y_i / Σ_i w_i

• where w_i can be, for example:

  w_i = 1 / d(x, x_i)²

• where d is the distance between x and x_i
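That prediction rule can be sketched with the inverse-square weights above (the helper name is mine):

```python
import math

def weighted_knn(train, q, k):
    """Distance-weighted prediction: y_hat = sum(w_i * y_i) / sum(w_i),
    with w_i = 1 / d(q, x_i)^2; an exact match returns its own y
    (its weight would otherwise be infinite)."""
    nbrs = sorted(train, key=lambda p: math.dist(p[0], q))[:k]
    if math.dist(nbrs[0][0], q) == 0:
        return nbrs[0][1]
    ws = [1.0 / math.dist(x, q) ** 2 for x, _ in nbrs]
    return sum(w * y for w, (_, y) in zip(ws, nbrs)) / sum(ws)
```

With two training points at x = 0 and x = 2, a query at 0.5 sits four times closer to the first point, so its weight dominates and the prediction is pulled toward that point's y.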
57
How to compute d(x,xi)?
• One approach: Shepard’s method (inverse-distance weighting), w_i = 1 / d(x, x_i)^p for some power p
58
How to find nearest neighbors?
• Given: a set P of n points in R^d
• Goal: given a query point q,
– NN: find the nearest neighbor p of q in P
– Range search: find one/all points in P within distance r from q
59
Issues with large datasets
• Which distance measure to use?
• Curse of dimensionality:
– In high dimensions, the nearest neighbor might not be near at all
• Irrelevant attributes:
– Many attributes (in big data) are not informative
• k-NN wants all the data in memory!
• It must make a pass over the entire data to classify each query
60
Distance Metrics
• D(x, x_i) must be positive
• D(x, x_i) = 0 iff x = x_i (reflexive)
• D(x, x_i) = D(x_i, x) (symmetric)
• For any other data point y: D(x, x_i) + D(x_i, y) ≥ D(x, y) (triangle inequality)

Euclidean distance is one of the most commonly used metrics.
61
But… Euclidean is not fully useful!
• Euclidean distance makes sense when all attributes are expressed in the same (or comparable) units
• It generalizes to the L_k norm, also called the Minkowski distance:

  L_k(x, y) = (Σ_i |x_i − y_i|^k)^(1/k)

• L1: Manhattan distance (L1 norm)
• L2: Euclidean distance (L2 norm)
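The L_k norm is essentially one line of Python (an illustrative helper):

```python
def minkowski(x, y, k):
    """L_k (Minkowski) distance: (sum_i |x_i - y_i|^k)^(1/k).
    k=1 gives Manhattan distance, k=2 gives Euclidean distance."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1 / k)
```

For the points (0, 0) and (3, 4), k = 1 gives 7 and k = 2 gives the familiar 5.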
62
Curse of Dimensionality…
• What happens if d is very large?
• Say we have 10,000 points uniformly distributed in a unit hypercube, and we want the k = 5 nearest neighbors of a query point at the origin (0, 0, …, 0):
– In d = 1: we have to go out 5/10000 = 0.0005 on average to capture our 5 neighbors
– In d = 2: we have to go out (0.0005)^(1/2) ≈ 0.022 in each dimension
– In general d: we have to go out (0.0005)^(1/d), which approaches 1 as d grows
• Very expensive, and nearest-neighbor search can break down!
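The per-dimension reach (0.0005)^(1/d) is easy to tabulate, mirroring the argument above (a toy calculation; the function name is mine):

```python
def edge_fraction(frac, d):
    """Per-dimension distance needed to enclose a fraction `frac` of
    uniformly distributed points in a d-dimensional unit hypercube,
    for a query at the origin: frac ** (1/d)."""
    return frac ** (1 / d)
```

At d = 100 the value already exceeds 0.9: the "5 nearest" neighbors span almost the entire cube in every dimension, so they are hardly "near" at all.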
63
Irrelevant features…
• A large number of features may be uninformative about the data:
– They do not discriminate across classes (though they may vary within a class)
• We may be left with a classic “hole” in the data when computing the neighbors
64
Efficient indexing (for computing neighbors)
• A kd-tree is an efficient data structure:
– Split along the median value of the dimension with the highest variance
– Points are stored in the leaves
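A toy kd-tree in Python (an illustrative sketch, not the course's code): for simplicity it cycles through dimensions rather than choosing the highest-variance one, and it stores points at internal nodes as well as leaves.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree by splitting on the median, cycling through
    dimensions (depth mod d) at each level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, q, best=None):
    """Descend toward q's side first, then check the far side only if the
    splitting plane is closer than the best point found so far."""
    if node is None:
        return best
    if best is None or dist2(node["point"], q) < dist2(best, q):
        best = node["point"]
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, q, best)
    if diff ** 2 < dist2(best, q):        # far side may still hold a closer point
        best = nearest(far, q, best)
    return best
```

Queries prune whole subtrees whenever the splitting plane is farther away than the current best candidate, which is what yields the O(log n) behavior in typical cases.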
65
Space and Time Complexity of Kd-trees
• Space: O(n)!
– Can reduce storage with other efficient data structures
– e.g., the ball-tree structure (does better with higher-dimensional datasets)
• Time:
– O(n) in the worst case
– O(log n) in typical scenarios
66
Now, how do we make this online?
• Instead of assuming a static k-NN classifier, how about making our k-NN adapt to incoming data?
– Locally weighted linear regression (LWLR)
• Form an explicit approximation f̂ for a region surrounding the query q:
– Called a piecewise approximation
– Minimize error over the k nearest neighbors of q
– Minimize error over all training examples (weighting by distance)
– Or combine the two
67
LWLR: Mathematical form
• Linear regression: f̂(x) = w₀ + w₁a₁(x) + … + w_n a_n(x)
• Error: E ≡ ½ Σ_x (f(x) − f̂(x))²
• Minimize error over the k nearest neighbors of q:
  E₁(q) ≡ ½ Σ_{x ∈ k nearest nbrs of q} (f(x) − f̂(x))²
• Minimize error over the entire set D of examples, weighting by distance:
  E₂(q) ≡ ½ Σ_{x ∈ D} (f(x) − f̂(x))² K(d(q, x))
• Combine the two:
  E₃(q) ≡ ½ Σ_{x ∈ k nearest nbrs of q} (f(x) − f̂(x))² K(d(q, x))
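For a single feature, the weighted least-squares fit has a closed form, so LWLR can be sketched in a few lines (illustrative; variable names are mine). A Gaussian kernel K(d) = exp(−d² / K_w²) supplies the weights:

```python
import math

def lwlr_predict(xs, ys, q, kw=1.0):
    """Locally weighted linear regression, one feature: fit y = w0 + w1*x
    by weighted least squares with Gaussian weights exp(-(x-q)^2 / kw^2),
    then evaluate the local line at the query q."""
    ws = [math.exp(-((x - q) ** 2) / kw ** 2) for x in xs]
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * sxx - sx * sx            # weighted normal equations
    if abs(denom) < 1e-12:
        return sy / sw                    # degenerate: weighted mean
    w1 = (sw * sxy - sx * sy) / denom
    w0 = (sy - w1 * sx) / sw
    return w0 + w1 * q
```

On exactly linear data the local fit recovers the line regardless of the kernel width; the weighting matters when the data is only locally linear.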
68
Example of LWLR

(Figure: an LWLR fit compared with simple linear regression.)
69
Large-scale Classification & Regression

• Decision Trees:
– Querying and inference are easy
– Induction needs special constructs for handling large-scale datasets
– Can adapt to incoming data streams
• k-NN:
– Intuitive algorithm
– Needs efficient data structures to handle large datasets
– Can be made online
• Next class: more classification & regression!