Seminar of Interest
Friday, September 15, at 11:00 am, EMS W220.
Dr. Hien Nguyen of the University of Wisconsin-Whitewater.
"Hybrid User Model for Information Retrieval: Framework and Evaluation".
Overview of Today's Lecture
• Last Time: representing examples (feature selection), HW0, intro to supervised learning
• HW0 due on Tuesday
• Today: K-NN wrapup, Naïve Bayes
• Reading Assignment: Sections 2.1 and 2.2, Chapter 5
Nearest-Neighbor Algorithms
(a.k.a. exemplar models, instance-based learning (IBL), case-based learning)
• Learning ≈ memorize training examples
• Problem solving = find the most similar example in memory; output its category
[Figure: 2-D plot of '+' and '−' training examples with a query point '?'; the nearest-neighbor decision regions partition the space into "Voronoi Diagrams" (pg 233)]
Sample Experimental Results

Testbed              Test-Set Correctness
                     IBL     D-Trees   Neural Nets
Wisconsin Cancer     98%     95%       96%
Heart Disease        78%     76%       ?
Tumor                37%     38%       ?
Appendicitis         83%     85%       86%

Simple algorithm works quite well!
Simple Example – 1-NN
(1-NN ≡ one nearest neighbor)

Training Set
1. a=0, b=0, c=1   +
2. a=0, b=1, c=1   −
3. a=1, b=1, c=1   −

Test Example
• a=0, b=1, c=0   ?

"Hamming Distance" to each training example:
• Ex 1 = 2
• Ex 2 = 1
• Ex 3 = 2

So the nearest neighbor is Ex 2, and the output is −
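A minimal sketch of this computation in Python (the data layout and names are illustrative, not from the lecture):

```python
# Toy 1-NN with Hamming distance, matching the example above.
train = [((0, 0, 1), '+'),   # Ex 1
         ((0, 1, 1), '-'),   # Ex 2
         ((1, 1, 1), '-')]   # Ex 3
test = (0, 1, 0)

def hamming(x, y):
    """Number of features on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y))

dists = [(hamming(x, test), label) for x, label in train]
print(dists)            # [(2, '+'), (1, '-'), (2, '-')]
print(min(dists)[1])    # '-' : the nearest neighbor (Ex 2) is negative
```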
K-NN Algorithm

Collect the K nearest neighbors and select the majority classification (or somehow combine their classes)
• What should K be?
  • Problem dependent
  • Can use tuning sets (later) to select a good setting for K

[Figure: tuning-set error rate plotted against K = 1, 2, 3, 4, 5]
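A hedged sketch of selecting K this way; `knn_predict` and the train/tune split are hypothetical placeholders, not code from the course:

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training examples nearest to x (Hamming distance)."""
    nearest = sorted(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def choose_k(train, tune, candidates=(1, 2, 3, 4, 5)):
    """Return the candidate K with the lowest error rate on the tuning set."""
    def error_rate(k):
        wrong = sum(knn_predict(train, x, k) != y for x, y in tune)
        return wrong / len(tune)
    return min(candidates, key=error_rate)
```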
What is the "distance" between two examples?

One possibility: sum the distances between the individual features:

  dist(e_1, e_2) = \sum_{i=1}^{\#\text{features}} w_i \times d_i(e_1, e_2)

where dist(e_1, e_2) is the distance between examples 1 and 2, w_i is a numeric feature-specific weight, and d_i(e_1, e_2) is the distance for feature i only.
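As a sketch of this sum (assuming numeric features, with absolute difference as the per-feature distance):

```python
def weighted_distance(e1, e2, weights):
    """dist(e1, e2) = sum over features i of  w_i * d_i(e1, e2), with d_i = absolute difference."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, e1, e2))

# e.g. weighted_distance((0, 1, 0), (0, 0, 1), weights=(1.0, 1.0, 1.0))  ->  2.0
```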
Using K neighbors to classify an example

Given: nearest neighbors e_1, ..., e_k with output categories O_1, ..., O_k.

The output for example e_t is

  O_t = \arg\max_{c \,\in\, \text{possible categories}} \sum_{i=1}^{k} \delta(O_i, c) \times K(e_i, e_t)

where K(e_i, e_t) is the kernel function and \delta(O_i, c) is the "delta" function (= 1 if O_i = c, else 0).
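A sketch of this arg-max; `kernel` can be any of the kernel functions discussed next (the names are illustrative):

```python
def knn_output(neighbors, categories, e_t, kernel):
    """O_t = argmax over classes c of  sum_i delta(O_i, c) * K(e_i, e_t)."""
    scores = {}
    for e_i, o_i in zip(neighbors, categories):
        # the delta function is implicit: only class o_i receives this kernel weight
        scores[o_i] = scores.get(o_i, 0.0) + kernel(e_i, e_t)
    return max(scores, key=scores.get)
```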
Kernel Functions
• The term "kernel" comes from statistics
• A major topic for support vector machines (later)
• Weights the interaction between pairs of examples
  • Can involve a similarity measure
Kernel Function K(e_i, e_t) Examples

• K(e_i, e_t) = 1  →  simple majority vote ('?' classified as −)
• K(e_i, e_t) = 1 / dist(e_i, e_t)  →  inverse-distance weight ('?' could be classified as +)

Here the example '?' has three neighbors, two of which are '−' and one of which is '+'.
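The flip can be seen with made-up distances (the two '−' neighbors far from '?', the one '+' neighbor close); the numbers below are purely illustrative:

```python
neighbors = [('-', 3.0), ('-', 4.0), ('+', 0.5)]   # (class, distance from '?')

def vote(neighbors, kernel):
    scores = {}
    for label, d in neighbors:
        scores[label] = scores.get(label, 0.0) + kernel(d)
    return max(scores, key=scores.get)

print(vote(neighbors, kernel=lambda d: 1.0))      # '-' : simple majority vote
print(vote(neighbors, kernel=lambda d: 1.0 / d))  # '+' : inverse-distance weighting
```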
Gaussian Kernel: popular in SVMs

  K(e_i, e_t) = e^{-\frac{\operatorname{dist}(e_i, e_t)^2}{2\sigma^2}}

where e is Euler's constant, dist(e_i, e_t) is the distance between the two examples, and σ is the "standard deviation".
[Figure: comparison of the weighting curves y = 1/x, y = 1/exp(x²), and y = 1/x²]
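A one-line sketch of this kernel (σ is a free parameter; the default of 1.0 below is only a placeholder):

```python
import math

def gaussian_kernel(dist, sigma=1.0):
    """K(e_i, e_t) = exp(-dist(e_i, e_t)**2 / (2 * sigma**2))."""
    return math.exp(-dist ** 2 / (2.0 * sigma ** 2))

# gaussian_kernel(0.0) -> 1.0, gaussian_kernel(2.0) -> ~0.135: the weight falls off smoothly
```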
Instance-Based Learning (IBL) and Efficiency
• IBL algorithms postpone work from training to testing
• Pure NN/IBL just memorizes the training data
• Computationally intensive: must match all features of all training examples
Instance-Based Learning (IBL) and Efficiency
• Possible speed-ups:
  • Use a subset of the training examples (Aha)
  • Use clever data structures (A. Moore), e.g. KD trees, hash tables, Voronoi diagrams (sketched below)
  • Use a subset of the features (feature selection)
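For the "clever data structures" bullet, one possibility is SciPy's KD-tree; this sketch assumes purely numeric feature vectors, and the random data are placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_X = rng.random((1000, 5))       # 1000 placeholder training examples, 5 numeric features

tree = cKDTree(train_X)               # built once, at "training" time
dists, idx = tree.query(rng.random(5), k=3)   # 3 nearest neighbors without a full linear scan
print(idx, dists)
```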
Feature Selection as a Search Problem
• State = a set of features
• Start state:
  • no features (forward selection), or
  • all features (backward selection)
• Operators = add/subtract a feature
• Scoring function = accuracy on the tuning set
Forward and Backward Selection of Features
• Hill-climbing ("greedy") search, illustrated in the diagram below (a code sketch follows it)
[Diagram: greedy search over feature subsets, scored by accuracy on the tuning set (our heuristic function). Forward selection starts from the empty set {} (50%) and adds one feature at a time, e.g. add F1 → {F1} (62%), add FN → {FN} (71%). Backward selection starts from the full set {F1, F2, ..., FN} (73%) and subtracts one feature at a time, e.g. subtract F1 → {F2, ..., FN} (79%). Each node is a candidate set of features to use.]
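A greedy forward-selection sketch of this search; `tuning_accuracy(features)` is a hypothetical callback meaning "train on these features, then score on the tuning set" (backward selection is the mirror image, starting from all features and subtracting):

```python
def forward_select(all_features, tuning_accuracy):
    """Hill-climb: repeatedly add the single feature that most improves tuning-set accuracy."""
    chosen, best = set(), tuning_accuracy(set())
    while True:
        candidates = [(tuning_accuracy(chosen | {f}), f)
                      for f in all_features if f not in chosen]
        if not candidates:
            return chosen
        score, f = max(candidates, key=lambda sf: sf[0])
        if score <= best:          # local maximum reached: no single addition helps
            return chosen
        chosen.add(f)
        best = score
```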
Forward vs. Backward Feature Selection

Forward:
• Faster in the early steps because there are fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)

Backward:
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
  • Example: area is important, but the features are length and width
Feature Selection and Machine Learning
Filtering-Based Feature Selection:
  all features → FS algorithm → subset of features → ML algorithm → model

Wrapper-Based Feature Selection:
  all features → FS algorithm ⇄ ML algorithm → model
  (the FS algorithm calls the ML algorithm many times and uses it to help select features)
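By contrast with the wrapper-style forward-selection sketch above, a filtering-based sketch scores each feature once, without consulting the ML algorithm. The absolute-correlation score below is just one possible choice, and it assumes numeric features and a numeric (e.g. 0/1) class label:

```python
import numpy as np

def filter_select(X, y, num_to_keep):
    """Keep the num_to_keep features whose |correlation| with the label is highest."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:num_to_keep])   # indices of the chosen features
```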
Number of Features and Performance
• Too many features can hurt test-set performance
• Too many irrelevant features mean many spurious-correlation possibilities for an ML algorithm to detect
"Vanilla" K-NN Report Card

Learning Efficiency         A+
Classification Efficiency   F
Stability                   C
Robustness (to noise)       D
Empirical Performance       C
Domain Insight              F
Implementation Ease         A
Incremental Ease            A

But it is a good baseline!
K-NN Summary
• K-NN can be an effective ML algorithm
  • Especially if there are few irrelevant features
• A good baseline for experiments
A Different Approach to Classification: Probabilistic Models
• Indicate confidence in the classification
• Given a feature vector:
  F = (f_1 = v_1, ..., f_n = v_n)
• Output a probability:
  P(class = + | F)
  (the probability that the class is positive, given the feature vector)
Probabilistic K-NN
• Output a probability using the k neighbors
• Possible algorithm:

  P(class = + | F) = (number of "+" neighbors) / k
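As a one-line sketch (the neighbor labels are placeholder data):

```python
def prob_positive(neighbor_labels):
    """P(class = + | F) estimated as (# of '+' neighbors) / k."""
    return neighbor_labels.count('+') / len(neighbor_labels)

# prob_positive(['+', '-', '+', '+', '-'])  ->  0.6  (k = 5)
```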
Bayes' Rule
• Definitions:
  P(A ∧ B) ≡ P(B) · P(A | B)
  P(A ∧ B) ≡ P(A) · P(B | A)
• So (assuming P(B) > 0):
  P(B) · P(A | B) = P(A) · P(B | A)

  P(A | B) = P(A) · P(B | A) / P(B)      ← Bayes' rule

[Venn diagram of events A and B]
Conditional Probabilities
• Note the difference (in the Venn diagram, A is much smaller than B and lies mostly inside it):
  • P(A | B) is small
  • P(B | A) is large
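As a made-up numeric illustration: if P(A) = 0.01, P(B) = 0.5, and P(B | A) = 0.9, then Bayes' rule gives P(A | B) = 0.01 × 0.9 / 0.5 = 0.018, so P(B | A) can be large while P(A | B) stays small.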
Bayes' Rule Applied to ML

  P(class | F) = P(F | class) · P(class) / P(F)

Why do we care about Bayes' rule? Because while P(class | F) is typically difficult to measure directly, the values on the right-hand side are often easy to estimate (especially if we make simplifying assumptions).

Here P(class | F) is shorthand for P(class = c | f_1 = v_1, ..., f_n = v_n).
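One common simplifying assumption, previewed in today's outline as Naïve Bayes, is that the features are independent given the class, so P(F | class) factors into per-feature terms that can each be estimated by counting. A hedged sketch of such count-based estimates (not the lecture's code; the names are illustrative):

```python
from collections import Counter, defaultdict

def estimate(examples):
    """examples: list of (feature_tuple, class_label).
    Returns count-based estimates of P(class) and P(f_i = v | class)."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)        # (class, feature index) -> counts of values
    for features, label in examples:
        for i, v in enumerate(features):
            value_counts[(label, i)][v] += 1

    p_class = {c: n / len(examples) for c, n in class_counts.items()}

    def p_feature(i, v, c):
        return value_counts[(c, i)][v] / class_counts[c]

    return p_class, p_feature

def unnormalized_posterior(features, c, p_class, p_feature):
    """P(F | class = c) * P(class = c), assuming feature independence given the class."""
    p = p_class[c]
    for i, v in enumerate(features):
        p *= p_feature(i, v, c)
    return p
```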