Seminar of Interest
Friday, September 15, at 11:00 am, EMS W220.
Dr. Hien Nguyen of the University of Wisconsin-Whitewater.
"Hybrid User Model for Information Retrieval: Framework and Evaluation".
Overview of Today's Lecture
• Last Time: representing examples (feature selection), HW0, intro to supervised learning
• HW0 due on Tuesday
• Today: K-NN wrapup, Naïve Bayes
• Reading Assignment: Sections 2.1 and 2.2, Chapter 5
Nearest-Neighbor Algorithms
(a.k.a. exemplar models, instance-based learning (IBL), case-based learning)
• Learning ≈ memorize training examples
• Problem solving = find the most similar example in memory; output its category
[Figure: 2-D plot of '+' and '−' training examples with a query point '?'; the nearest-neighbor decision regions partition the space into "Voronoi Diagrams" (pg 233)]
Sample Experimental Results

Testbed              Test-Set Correctness
                     IBL     D-Trees   Neural Nets
Wisconsin Cancer     98%     95%       96%
Heart Disease        78%     76%       ?
Tumor                37%     38%       ?
Appendicitis         83%     85%       86%

Simple algorithm works quite well!
Simple Example – 1-NN
(1-NN ≡ one nearest neighbor)

Training Set
1. a=0, b=0, c=1   +
2. a=0, b=1, c=1   −
3. a=1, b=1, c=1   −

Test Example
• a=0, b=1, c=0   ?

"Hamming Distance" to each training example:
• Ex 1 = 2
• Ex 2 = 1
• Ex 3 = 2

So the nearest neighbor is Ex 2, and the output is −
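A minimal sketch of this computation in Python (the data layout and names are illustrative, not from the lecture):

```python
# Toy 1-NN with Hamming distance, matching the example above.
train = [((0, 0, 1), '+'),   # Ex 1
         ((0, 1, 1), '-'),   # Ex 2
         ((1, 1, 1), '-')]   # Ex 3
test = (0, 1, 0)

def hamming(x, y):
    """Number of features on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y))

dists = [(hamming(x, test), label) for x, label in train]
print(dists)            # [(2, '+'), (1, '-'), (2, '-')]
print(min(dists)[1])    # '-' : the nearest neighbor (Ex 2) is negative
```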
K-NN Algorithm

Collect the K nearest neighbors and select the majority classification (or somehow combine their classes)
• What should K be?
  • Problem dependent
  • Can use tuning sets (later) to select a good setting for K

[Figure: tuning-set error rate plotted against K = 1, 2, 3, 4, 5]
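A hedged sketch of selecting K this way; `knn_predict` and the train/tune split are hypothetical placeholders, not code from the course:

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training examples nearest to x (Hamming distance)."""
    nearest = sorted(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def choose_k(train, tune, candidates=(1, 2, 3, 4, 5)):
    """Return the candidate K with the lowest error rate on the tuning set."""
    def error_rate(k):
        wrong = sum(knn_predict(train, x, k) != y for x, y in tune)
        return wrong / len(tune)
    return min(candidates, key=error_rate)
```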
What is the "distance" between two examples?

One possibility: sum the distances between the individual features:

  dist(e_1, e_2) = \sum_{i=1}^{\#\text{features}} w_i \times d_i(e_1, e_2)

where dist(e_1, e_2) is the distance between examples 1 and 2, w_i is a numeric feature-specific weight, and d_i(e_1, e_2) is the distance for feature i only.
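As a sketch of this sum (assuming numeric features, with absolute difference as the per-feature distance):

```python
def weighted_distance(e1, e2, weights):
    """dist(e1, e2) = sum over features i of  w_i * d_i(e1, e2), with d_i = absolute difference."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, e1, e2))

# e.g. weighted_distance((0, 1, 0), (0, 0, 1), weights=(1.0, 1.0, 1.0))  ->  2.0
```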
Using K neighbors to classify an example

Given: nearest neighbors e_1, ..., e_k with output categories O_1, ..., O_k.

The output for example e_t is

  O_t = \arg\max_{c \,\in\, \text{possible categories}} \sum_{i=1}^{k} \delta(O_i, c) \times K(e_i, e_t)

where K(e_i, e_t) is the kernel function and \delta(O_i, c) is the "delta" function (= 1 if O_i = c, else 0).
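A sketch of this arg-max; `kernel` can be any of the kernel functions discussed next (the names are illustrative):

```python
def knn_output(neighbors, categories, e_t, kernel):
    """O_t = argmax over classes c of  sum_i delta(O_i, c) * K(e_i, e_t)."""
    scores = {}
    for e_i, o_i in zip(neighbors, categories):
        # the delta function is implicit: only class o_i receives this kernel weight
        scores[o_i] = scores.get(o_i, 0.0) + kernel(e_i, e_t)
    return max(scores, key=scores.get)
```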
Kernel Functions
• The term "kernel" comes from statistics
• A major topic for support vector machines (later)
• Weights the interaction between pairs of examples
  • Can involve a similarity measure
Kernel Function K(e_i, e_t) Examples

• K(e_i, e_t) = 1  →  simple majority vote ('?' classified as −)
• K(e_i, e_t) = 1 / dist(e_i, e_t)  →  inverse-distance weight ('?' could be classified as +)

Here the example '?' has three neighbors, two of which are '−' and one of which is '+'.
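The flip can be seen with made-up distances (the two '−' neighbors far from '?', the one '+' neighbor close); the numbers below are purely illustrative:

```python
neighbors = [('-', 3.0), ('-', 4.0), ('+', 0.5)]   # (class, distance from '?')

def vote(neighbors, kernel):
    scores = {}
    for label, d in neighbors:
        scores[label] = scores.get(label, 0.0) + kernel(d)
    return max(scores, key=scores.get)

print(vote(neighbors, kernel=lambda d: 1.0))      # '-' : simple majority vote
print(vote(neighbors, kernel=lambda d: 1.0 / d))  # '+' : inverse-distance weighting
```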
Gaussian Kernel: popular in SVMs

  K(e_i, e_t) = e^{-\frac{\operatorname{dist}(e_i, e_t)^2}{2\sigma^2}}

where e is Euler's constant, dist(e_i, e_t) is the distance between the two examples, and σ is the "standard deviation".
[Figure: comparison of the weighting curves y = 1/x, y = 1/exp(x²), and y = 1/x²]
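A one-line sketch of this kernel (σ is a free parameter; the default of 1.0 below is only a placeholder):

```python
import math

def gaussian_kernel(dist, sigma=1.0):
    """K(e_i, e_t) = exp(-dist(e_i, e_t)**2 / (2 * sigma**2))."""
    return math.exp(-dist ** 2 / (2.0 * sigma ** 2))

# gaussian_kernel(0.0) -> 1.0, gaussian_kernel(2.0) -> ~0.135: the weight falls off smoothly
```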
Instance-Based Learning (IBL) and Efficiency
• IBL algorithms postpone work from training to testing
• Pure NN/IBL just memorizes the training data
• Computationally intensive: must match all features of all training examples
Instance-Based Learning (IBL) and Efficiency
• Possible speed-ups:
  • Use a subset of the training examples (Aha)
  • Use clever data structures (A. Moore), e.g. KD trees, hash tables, Voronoi diagrams (sketched below)
  • Use a subset of the features (feature selection)
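For the "clever data structures" bullet, one possibility is SciPy's KD-tree; this sketch assumes purely numeric feature vectors, and the random data are placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_X = rng.random((1000, 5))       # 1000 placeholder training examples, 5 numeric features

tree = cKDTree(train_X)               # built once, at "training" time
dists, idx = tree.query(rng.random(5), k=3)   # 3 nearest neighbors without a full linear scan
print(idx, dists)
```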
Feature Selection as a Search Problem
• State = a set of features
• Start state:
  • no features (forward selection), or
  • all features (backward selection)
• Operators = add/subtract a feature
• Scoring function = accuracy on the tuning set
Forward and Backward Selection of Features
• Hill-climbing ("greedy") search, illustrated in the diagram below (a code sketch follows it)
[Diagram: greedy search over feature subsets, scored by accuracy on the tuning set (our heuristic function). Forward selection starts from the empty set {} (50%) and adds one feature at a time, e.g. add F1 → {F1} (62%), add FN → {FN} (71%). Backward selection starts from the full set {F1, F2, ..., FN} (73%) and subtracts one feature at a time, e.g. subtract F1 → {F2, ..., FN} (79%). Each node is a candidate set of features to use.]
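A greedy forward-selection sketch of this search; `tuning_accuracy(features)` is a hypothetical callback meaning "train on these features, then score on the tuning set" (backward selection is the mirror image, starting from all features and subtracting):

```python
def forward_select(all_features, tuning_accuracy):
    """Hill-climb: repeatedly add the single feature that most improves tuning-set accuracy."""
    chosen, best = set(), tuning_accuracy(set())
    while True:
        candidates = [(tuning_accuracy(chosen | {f}), f)
                      for f in all_features if f not in chosen]
        if not candidates:
            return chosen
        score, f = max(candidates, key=lambda sf: sf[0])
        if score <= best:          # local maximum reached: no single addition helps
            return chosen
        chosen.add(f)
        best = score
```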
Forward vs. Backward Feature Selection

Forward:
• Faster in the early steps because there are fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)

Backward:
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
  • Example: area is important, but the features are length and width
Feature Selection and Machine Learning
Filtering-Based Feature Selection:
  all features → FS algorithm → subset of features → ML algorithm → model

Wrapper-Based Feature Selection:
  all features → FS algorithm ⇄ ML algorithm → model
  (the FS algorithm calls the ML algorithm many times and uses it to help select features)
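By contrast with the wrapper-style forward-selection sketch above, a filtering-based sketch scores each feature once, without consulting the ML algorithm. The absolute-correlation score below is just one possible choice, and it assumes numeric features and a numeric (e.g. 0/1) class label:

```python
import numpy as np

def filter_select(X, y, num_to_keep):
    """Keep the num_to_keep features whose |correlation| with the label is highest."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:num_to_keep])   # indices of the chosen features
```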
Number of Features and Performance
• Too many features can hurt test-set performance
• Too many irrelevant features mean many spurious-correlation possibilities for an ML algorithm to detect
"Vanilla" K-NN Report Card

Learning Efficiency         A+
Classification Efficiency   F
Stability                   C
Robustness (to noise)       D
Empirical Performance       C
Domain Insight              F
Implementation Ease         A
Incremental Ease            A

But it is a good baseline!
K-NN Summary
• K-NN can be an effective ML algorithm
  • Especially if there are few irrelevant features
• A good baseline for experiments
A Different Approach to Classification: Probabilistic Models
• Indicate confidence in the classification
• Given a feature vector:
  F = (f_1 = v_1, ..., f_n = v_n)
• Output a probability:
  P(class = + | F)
  (the probability that the class is positive, given the feature vector)
Probabilistic K-NN
• Output a probability using the k neighbors
• Possible algorithm:

  P(class = + | F) = (number of "+" neighbors) / k
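As a one-line sketch (the neighbor labels are placeholder data):

```python
def prob_positive(neighbor_labels):
    """P(class = + | F) estimated as (# of '+' neighbors) / k."""
    return neighbor_labels.count('+') / len(neighbor_labels)

# prob_positive(['+', '-', '+', '+', '-'])  ->  0.6  (k = 5)
```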
Bayes' Rule
• Definitions:
  P(A ∧ B) ≡ P(B) · P(A | B)
  P(A ∧ B) ≡ P(A) · P(B | A)
• So (assuming P(B) > 0):
  P(B) · P(A | B) = P(A) · P(B | A)

  P(A | B) = P(A) · P(B | A) / P(B)      ← Bayes' rule

[Venn diagram of events A and B]
Conditional Probabilities
• Note the difference (in the Venn diagram, A is much smaller than B and lies mostly inside it):
  • P(A | B) is small
  • P(B | A) is large
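As a made-up numeric illustration: if P(A) = 0.01, P(B) = 0.5, and P(B | A) = 0.9, then Bayes' rule gives P(A | B) = 0.01 × 0.9 / 0.5 = 0.018, so P(B | A) can be large while P(A | B) stays small.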
Bayes' Rule Applied to ML

  P(class | F) = P(F | class) · P(class) / P(F)

Why do we care about Bayes' rule? Because while P(class | F) is typically difficult to measure directly, the values on the right-hand side are often easy to estimate (especially if we make simplifying assumptions).

Here P(class | F) is shorthand for P(class = c | f_1 = v_1, ..., f_n = v_n).
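One common simplifying assumption, previewed in today's outline as Naïve Bayes, is that the features are independent given the class, so P(F | class) factors into per-feature terms that can each be estimated by counting. A hedged sketch of such count-based estimates (not the lecture's code; the names are illustrative):

```python
from collections import Counter, defaultdict

def estimate(examples):
    """examples: list of (feature_tuple, class_label).
    Returns count-based estimates of P(class) and P(f_i = v | class)."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)        # (class, feature index) -> counts of values
    for features, label in examples:
        for i, v in enumerate(features):
            value_counts[(label, i)][v] += 1

    p_class = {c: n / len(examples) for c, n in class_counts.items()}

    def p_feature(i, v, c):
        return value_counts[(c, i)][v] / class_counts[c]

    return p_class, p_feature

def unnormalized_posterior(features, c, p_class, p_feature):
    """P(F | class = c) * P(class = c), assuming feature independence given the class."""
    p = p_class[c]
    for i, v in enumerate(features):
        p *= p_feature(i, v, c)
    return p
```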