© Jude Shavlik 2006, David Page 2007
CS 760 – Machine Learning (UW-Madison), Lecture #1, Slide 1
CS760 – Machine Learning
• Course Instructor: David Page
  • email: [email protected]
  • office: MSC 6743 (University & Charter)
  • hours: TBA
• Teaching Assistant: Daniel Wong
  • email: [email protected]
  • office: TBA
  • hours: TBA
Textbooks & Reading Assignment
• Machine Learning (Tom Mitchell)
• Selected on-line readings
• Read in Mitchell (posted on class web page):
  • Preface
  • Chapter 1
  • Sections 2.1 and 2.2
  • Chapter 8
Monday, Wednesday, and Friday?
• We’ll meet 30 times this term (may or may not include exam in this count)
• We’ll meet on FRIDAY this and next week, in order to cover material for HW 1 (plus I have some business travel this term)
• Default: we WILL meet on Friday unless I announce otherwise
Course "Style"
• Primarily algorithmic & experimental
• Some theory, both mathematical & conceptual (much on statistics)
• "Hands on" experience, interactive lectures/discussions
• Broad survey of many ML subfields, including:
  • "symbolic" (rules, decision trees, ILP)
  • "connectionist" (neural nets)
  • support vector machines, nearest-neighbors
  • theoretical ("COLT")
  • statistical ("Bayes rule")
  • reinforcement learning, genetic algorithms
"MS vs. PhD" Aspects
• MS’ish topics
  • mature, ready for practical application
  • first 2/3 – 3/4 of semester
  • Naive Bayes, Nearest-Neighbors, Decision Trees, Neural Nets, Support Vector Machines, ensembles, experimental methodology (10-fold cross validation, t-tests)
• PhD’ish topics
  • inductive logic programming, statistical relational learning, reinforcement learning, SVMs, use of prior knowledge
• Other machine learning material covered in Bioinformatics CS 576/776 and Jerry Zhu’s CS 838
Two Major Goals
• to understand what a learning system should do
• to understand how (and how well) existing systems work
  • Issues in algorithm design
  • Choosing algorithms for applications
Background Assumed
• Languages
  • Java (see CS 368 tutorial online)
• AI Topics
  • Search
  • FOPC
  • Unification
  • Formal Deduction
• Math
  • Calculus (partial derivatives)
  • Simple prob & stats
• No previous ML experience assumed (so some overlap with CS 540)
Requirements
• Bi-weekly programming HW’s
  • "hands on" experience valuable
  • HW0 – build a dataset
  • HW1 – simple ML algo’s and exper. methodology
  • HW2 – decision trees (?)
  • HW3 – neural nets (?)
  • HW4 – reinforcement learning (in a simulated world)
• "Midterm" exam (in class, about 90% through semester)
• Find project of your choosing
  • during last 4-5 weeks of class
Grading
HW’s                 35%
"Midterm"            40%
Project              20%
Quality Discussion    5%
Late HW's Policy
• HW's due @ 4pm
• you have 5 late days to use over the semester
  (Fri 4pm → Mon 4pm is 1 late "day")
• SAVE UP late days!
  • extensions only for extreme cases
• Penalty points after late days exhausted
• Can't be more than ONE WEEK late
Academic Misconduct (also on course homepage)
All examinations, programming assignments, and written homeworks must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (see the Academic Misconduct Guide for Students). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss ideas, approaches, and techniques broadly with your peers, the TAs, or the instructor, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act.
What Do You Think Learning Means?
What is Learning?
"Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time."
  – Herbert Simon
"Learning is making useful changes in our minds."
  – Marvin Minsky
Today’s Topics
• Memorization as Learning
• Feature Space
• Supervised ML
• K-NN (K-Nearest Neighbor)
Memorization (Rote Learning)
• Employed by the first machine learning systems, in the 1950s
  • Samuel’s Checkers program
  • Michie’s MENACE: Matchbox Educable Noughts and Crosses Engine
• Prior to these, some people believed computers could not improve at a task with experience
Rote Learning is Limited
• Memorize I/O pairs and perform exact matching with new inputs
• If the computer has not seen the precise case before, it cannot apply its experience
• Want the computer to "generalize" from prior experience
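To make the limitation concrete: rote learning amounts to an exact-match lookup table. The sketch below is purely illustrative (the class name and the example I/O pairs are invented, not from the course materials):

```python
# A rote learner: memorize input/output pairs, answer only on exact matches.

class RoteLearner:
    def __init__(self):
        self.memory = {}

    def train(self, example, label):
        self.memory[example] = label   # memorize the I/O pair verbatim

    def predict(self, example):
        # Exact matching only: an unseen input yields no answer at all.
        return self.memory.get(example, "unknown")

learner = RoteLearner()
learner.train((0, 0, 1), "+")
print(learner.predict((0, 0, 1)))  # seen before -> "+"
print(learner.predict((0, 1, 1)))  # one bit different -> "unknown"
```

The second query shows the problem: a single changed feature value and the learner has nothing to say, because it has no notion of "similar" inputs.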
Some Settings in Which Learning May Help
• Given an input, what is the appropriate response (output/action)?
  • Game playing – board state/move
  • Autonomous robots (e.g., driving a vehicle) – world state/action
  • Video game characters – state/action
  • Medical decision support – symptoms/treatment
  • Scientific discovery – data/hypothesis
  • Data mining – database/regularity
Broad Paradigms of Machine Learning
• Inducing Functions from I/O Pairs
  • Decision trees (e.g., Quinlan’s C4.5 [1993])
  • Connectionism / neural networks (e.g., backprop)
  • Nearest-neighbor methods
  • Genetic algorithms
  • SVM’s
• Learning without Feedback/Teacher
  • Conceptual clustering
  • Self-organizing systems
  • Discovery systems
  (not in Mitchell’s textbook; covered in CS 776)
IID (Completion of Lec #2)
• We are assuming examples are IID: independent and identically distributed
• E.g., we are ignoring temporal dependencies (covered in time-series learning)
• E.g., we assume the learner has no say in which examples it gets (covered in active learning)
Supervised Learning Task Overview
[Diagram:]
  Real World
    ↓ Feature Selection (usually done by humans; HW 0)
  Feature Space
    ↓ Classification Rule Construction (done by learning algorithm; HW 1-3)
  Concepts/Classes/Decisions
Supervised Learning Task Overview (cont.)
• Note: the mappings on the previous slide are not necessarily 1-to-1
  • Bad for the first mapping?
  • Good for the second (in fact, it’s the goal!)
Empirical Learning: Task Definition
• Given
  • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members)
• Produce
  • A description that covers (includes) all/most of the positive examples and none/few of the negative examples
    (and, hopefully, properly categorizes most future examples! ← The Key Point!)
Note: one can easily extend this definition to handle more than two classes
Example
[Figure: a page of positive-example figures and negative-example figures, plus a query symbol: "How does this symbol classify?"]
• Concept
  • Solid Red Circle in a (Regular?) Polygon
• What about?
  • Figures on left side of page
  • Figures drawn before 5pm 2/2/89 <etc>
Concept Learning
Learning systems differ in how they represent concepts:
[Diagram: Training Examples fed to different learners yield different representations:]
  • Backpropagation → Neural Net
  • C4.5, CART → Decision Tree
  • AQ, FOIL → Rules (e.g., Φ ← X ∧ Y;  Φ ← Z)
  • SVMs → numeric decision surface (e.g., If 5x1 + 9x2 – 3x3 > 12 Then +)
Feature Space
If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.
[Figure: a 3-D plot with axes Size, Weight, and Color; a query point "?" at Size = Big, Weight = 2500, Color = Gray]
A "concept" is then a (possibly disjoint) volume in this space.
Learning from Labeled Examples
• Most common and successful form of ML
[Venn diagram: a cloud of + and – points in feature space]
• Examples – points in a multi-dimensional "feature space"
• Concepts – "function" that labels every point in feature space (as +, –, and possibly ?)
Brief Review
• Conjunctive Concept ("and")
  • Color(?obj1, red) ∧ Size(?obj1, large)
• Disjunctive Concept ("or")
  • Color(?obj2, blue) ∨ Size(?obj2, small)
• More formally, a "concept" is of the form
  ∀x ∀y ∀z  F(x, y, z) → Member(x, Class1)
  (where x, y, z range over instances)
Empirical Learning and Venn Diagrams
[Venn diagram of feature space: + points fall inside two regions A and B; – points lie everywhere else]
Concept = A or B (disjunctive concept)
Examples = labeled points in feature space
Concept = a label for a set of points
Aspects of an ML System
• "Language" for representing classified examples (HW 0)
• "Language" for representing "Concepts"
• Technique for producing a concept "consistent" with the training examples (other HW’s)
• Technique for classifying new instances
Each of these limits the expressiveness/efficiency of the supervised learning algorithm.
Nearest-Neighbor Algorithms
(aka exemplar models, instance-based learning (IBL), case-based learning)
• Learning ≈ memorize training examples
• Problem solving = find most similar example in memory; output its category
[Venn diagram: + and – points in feature space with a query point "?"; the induced decision boundaries form "Voronoi Diagrams" (pg 233)]
Simple Example: 1-NN
(1-NN ≡ one nearest neighbor)
Training Set
  1. a=0, b=0, c=1  +
  2. a=0, b=0, c=0  –
  3. a=1, b=1, c=1  –
Test Example
  • a=0, b=1, c=0  ?
"Hamming Distance" to each training example:
  • Ex 1 = 2
  • Ex 2 = 1
  • Ex 3 = 2
So output – (the class of the nearest example, Ex 2)
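The computation above can be written out in a few lines; this is an illustrative sketch (the function names are ours, not a required HW interface):

```python
# 1-NN with Hamming distance over Boolean feature vectors.
# Reproduces the slide's example: three training examples, one query.

def hamming(x, y):
    """Number of positions where the two feature vectors disagree."""
    return sum(a != b for a, b in zip(x, y))

def one_nn(train, query):
    """Return the label of the single nearest training example."""
    nearest = min(train, key=lambda ex: hamming(ex[0], query))
    return nearest[1]

train = [((0, 0, 1), "+"),   # Ex 1
         ((0, 0, 0), "-"),   # Ex 2
         ((1, 1, 1), "-")]   # Ex 3

query = (0, 1, 0)            # a=0, b=1, c=0
print([hamming(x, query) for x, _ in train])  # distances: [2, 1, 2]
print(one_nn(train, query))                   # Ex 2 is nearest, so "-"
```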
Sample Experimental Results (see UCI archive for more)

Testbed             Testset Correctness
                    1-NN    D-Trees   Neural Nets
Wisconsin Cancer    98%     95%       96%
Heart Disease       78%     76%       ?
Tumor               37%     38%       ?
Appendicitis        83%     85%       86%

Simple algorithm works quite well!
K-NN Algorithm
Collect K nearest neighbors, select majority classification (or somehow combine their classes)
• What should K be?
  • It probably is problem dependent
  • Can use tuning sets (later) to select a good setting for K
[Plot: tuning-set error rate vs. K = 1, 2, 3, 4, 5; shouldn’t really "connect the dots" (Why?)]
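The majority-vote rule can be sketched as follows (an illustrative sketch; note ties are broken arbitrarily here by `Counter.most_common`, and a real implementation would want an explicit tie policy):

```python
# K-NN: majority class among the K training examples nearest to the query.
from collections import Counter

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def knn(train, query, k):
    """Majority vote over the k nearest neighbors of query."""
    neighbors = sorted(train, key=lambda ex: hamming(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0, 1), "+"), ((0, 0, 0), "-"), ((1, 1, 1), "-")]
print(knn(train, (0, 1, 0), k=1))  # "-": same answer as 1-NN
print(knn(train, (0, 1, 0), k=3))  # "-": two of the three neighbors are negative
```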
Data Representation
• Creating a dataset of fixed-length feature vectors
• Be sure to include – on a separate 8x11 sheet – a photo and a brief bio
• HW0 out on-line
  • Due next Friday
HW0 – Create Your Own Dataset (repeated from lecture #1)
• Think about before next class
• Read HW0 (on-line)
• Google to find:
  • UCI archive (or UCI KDD archive)
  • UCI ML archive (UCI ML repository)
  • More links in HW0’s web page
HW0 – Your "Personal Concept"
• Step 1: Choose a Boolean (true/false) concept
  • Books I like/dislike; Movies I like/dislike; www pages I like/dislike
    • Subjective judgment (can’t articulate)
  • "time will tell" concepts
    • Stocks to buy
    • Medical treatment
      • at time t, predict outcome at time (t + ∆t)
  • Sensory interpretation
    • Face recognition (see textbook)
    • Handwritten digit recognition
    • Sound recognition
  • Hard-to-Program Functions
Some Real-World Examples
• Car Steering (Pomerleau, Thrun)
  [Digitized camera image → Learned Function → Steering Angle]
• Medical Diagnosis (Quinlan)
  [Medical record (e.g., age=13, sex=M, wgt=18) → Learned Function → sick vs healthy]
• DNA Categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing
HW0 – Your "Personal Concept"
• Step 2: Choosing a feature space (this defines a space)
  • We will use fixed-length feature vectors
    • Choose N features
    • Each feature has Vi possible values
    • Each example is represented by a vector of N feature values (i.e., is a point in the feature space)
      e.g.: <red, 50, round>
            color weight shape
  • Feature Types (in HW0 we will use a subset – see next slide)
    • Boolean
    • Nominal
    • Ordered
    • Hierarchical
• Step 3: Collect examples ("I/O" pairs)
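Steps 2 and 3 can be made concrete with a toy dataset; the features and their values below are hypothetical, chosen only to show the fixed-length-vector idea:

```python
# Hypothetical 3-feature space: color (nominal), weight (continuous), shape (nominal).
# Each example is a fixed-length vector of N = 3 feature values plus a class label.

FEATURES = ("color", "weight", "shape")

examples = [                       # Step 3: collected "I/O" pairs
    (("red",  50, "round"),  "+"),
    (("blue", 12, "square"), "-"),
    (("red",  47, "round"),  "+"),
]

for vec, label in examples:
    # Every example must be a point in the same N-dimensional space.
    assert len(vec) == len(FEATURES)
    print(dict(zip(FEATURES, vec)), "->", label)
```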
Standard Feature Types
(for representing training examples – a source of "domain knowledge")
• Nominal
  • No relationship among possible values
    e.g., color є {red, blue, green} (vs. color = 1000 Hertz)
• Linear (or Ordered)
  • Possible values of the feature are totally ordered
    e.g., size є {small, medium, large}  ← discrete
          weight є [0…500]  ← continuous
• Hierarchical
  • Possible values are partially ordered in an ISA hierarchy
    e.g., for shape:
      closed
        ├ polygon: triangle, square
        └ continuous: circle, ellipse
Our Feature Types
(for CS 760 HW’s)
• Discrete
  • tokens (char strings, w/o quote marks and spaces)
• Continuous
  • numbers (int’s or float’s)
  • If only a few possible values (e.g., 0 & 1), use discrete
• i.e., merge nominal and discrete-ordered (or convert discrete-ordered into 1, 2, …)
• We will ignore hierarchical info and only use the leaf values (common approach)
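The "convert discrete-ordered into 1, 2, …" option amounts to a simple mapping; the value set below is a hypothetical example:

```python
# Convert a discrete-ordered feature into integers so a learner can treat it
# as continuous. The ordering {small < medium < large} is an illustrative choice.

SIZE_ORDER = ["small", "medium", "large"]
SIZE_TO_INT = {value: i + 1 for i, value in enumerate(SIZE_ORDER)}

print(SIZE_TO_INT["small"])   # 1
print(SIZE_TO_INT["large"])   # 3
# Now "large" - "small" = 2 is meaningful. For a *nominal* feature such as
# color, an integer encoding like this would impose a bogus ordering.
```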
Example Hierarchy (KDD* Journal, Vol 5, No. 1-2, 2001, page 17)
[Tree: Product → 99 Product Classes (e.g., Pct Foods) → 2302 Product Subclasses (e.g., Tea, Canned Cat Food, Dried Cat Food) → ~30k Products (e.g., Friskies Liver, 250g)]
• Structure of one feature!
• "the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper."
  – From the eds.’ intro to the special issue (on applications) of the KDD journal, Vol 15, 2001
* Officially, "Data Mining and Knowledge Discovery", Kluwer Publishers
HW0: Creating Your Dataset
Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for the target function (category)
[Schema:
  Studio (Name, Country, List of movies) –Made→ Movie
  Director/Producer (Name, Year of birth, List of movies) –Directed→ / –Produced→ Movie
  Actor (Name, Year of birth, Gender, Oscar nominations, List of movies) –Acted in→ Movie
  Movie (Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season)]
HW0: Sample DB
Choose a Boolean or binary-valued target function (category)
• Opening weekend box-office receipts > $2 million
• Movie is drama? (action, sci-fi, …)
• Movies I like/dislike (e.g., TiVo)
HW0: Representing as a Fixed-Length Feature Vector
<discuss on chalkboard>
Note: some advanced ML approaches do not require such "feature mashing" (eg, ILP)
IMDB@umass
David Jensen’s group at UMass uses Naïve Bayes and other ML algo’s on the IMDB
• Opening weekend box-office receipts > $2 million
  • 25 attributes
  • Accuracy = 83.3%
  • Default accuracy = 56% (default algo?)
• Movie is drama?
  • 12 attributes
  • Accuracy = 71.9%
  • Default accuracy = 51%
http://kdl.cs.umass.edu/proximity/about.html
First Algorithm in Detail
• K-Nearest Neighbors / Instance-Based Learning (k-NN/IBL)
  • Distance functions
  • Kernel functions
  • Feature selection (applies to all ML algo’s)
  • IBL Summary
Chapter 8 of Mitchell
Some Common Jargon

• Classification
  • Learning a discrete-valued function
• Regression
  • Learning a real-valued function

IBL is easily extended to regression tasks (and to multi-category classification) – i.e., to both discrete and real outputs.
Variations on a Theme

• IB1 – keep all examples
• IB2 – keep the next instance only if it is incorrectly classified by the previously kept instances
  • Uses less storage (good)
  • Order dependent (bad)
  • Sensitive to noisy data (bad)

(From Aha, Kibler and Albert in the Machine Learning journal)
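The IB2 rule above can be sketched as follows. This is a minimal illustration, not the authors’ original code; the 1-NN classification step and the `distance` argument are simplifying assumptions.

```python
def ib2(stream, distance):
    """IB2: keep an incoming instance only if the instances kept
    so far would misclassify it (here, by its single nearest stored
    neighbor). Note the result depends on the order of `stream`."""
    kept = []
    for x, label in stream:
        if kept:
            nearest = min(kept, key=lambda s: distance(x, s[0]))
            correct = (nearest[1] == label)
        else:
            correct = False  # nothing stored yet, so keep the first instance
        if not correct:
            kept.append((x, label))
    return kept
```

For example, with one-dimensional data and absolute-difference distance, `ib2([(0.0, '-'), (0.1, '-'), (5.0, '+'), (5.1, '+')], lambda a, b: abs(a - b))` stores only one representative per well-separated cluster.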
Variations on a Theme (cont.)

• IB3 – extends IB2 to decide more intelligently which examples to keep (see article)
  • Better handling of noisy data
• Another idea – cluster the examples into groups and keep one example from each (median/centroid)
  • Less storage, faster lookup
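As one concrete (and deliberately simple) version of the clustering idea, the sketch below keeps a single centroid per class; a real implementation might cluster within each class and keep one prototype per cluster. The function name is hypothetical.

```python
from collections import defaultdict

def class_centroids(examples):
    """Condense a training set: keep one prototype per class,
    the feature-wise mean (centroid) of that class's examples."""
    groups = defaultdict(list)
    for x, label in examples:
        groups[label].append(x)
    return {label: [sum(col) / len(xs) for col in zip(*xs)]
            for label, xs in groups.items()}
```

New examples can then be classified by nearest centroid, giving less storage and faster lookup at some cost in accuracy.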
Distance Functions

• Key issue in IBL (instance-based learning)
• One approach: assign weights to each feature
Distance Functions (sample)

  d(e1, e2) = Σ_{i = 1}^{#features} w_i × d_i(e1, e2)

where
  d(e1, e2)    is the distance between examples 1 and 2
  w_i          is a numeric weighting factor
  d_i(e1, e2)  is the distance for feature i only between examples 1 and 2
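The weighted distance above can be sketched in a few lines. This is an illustration under one common assumption: the per-feature distance d_i is the absolute difference of numeric feature values.

```python
def weighted_distance(e1, e2, weights):
    """d(e1, e2) = sum over features i of w_i * d_i(e1, e2),
    taking d_i to be the absolute difference on feature i."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, e1, e2))
```

Setting a feature’s weight to 0 removes it from the distance entirely, which is one way feature weighting shades into feature selection.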
Kernel Functions and k-NN

• Term “kernel” comes from statistics
• Major topic in support vector machines (SVMs)
• Weights the interaction between pairs of examples
Kernel Functions and k-NN (continued)

• Assume we have
  • k nearest neighbors e1, ..., ek
  • associated output categories O1, ..., Ok
• Then the output for test case et is

  output(et) = argmax_{c ∈ possible categories} Σ_{i = 1}^{k} δ(Oi, c) × K(ei, et)

where δ is the kernel “delta” function (δ(Oi, c) = 1 if Oi = c, else 0) and K is the kernel function.
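The argmax above translates directly into code. A minimal sketch (the function name and argument layout are illustrative choices, not from the slides):

```python
def kernel_knn_output(neighbors, kernel, test):
    """output(et) = argmax over categories c of
    sum_i delta(O_i, c) * K(e_i, e_t), where `neighbors` is a list of
    (example, category) pairs for the k nearest neighbors of `test`."""
    categories = {o for _, o in neighbors}
    def weight(c):
        # delta(O_i, c) simply selects the neighbors whose category is c
        return sum(kernel(e, test) for e, o in neighbors if o == c)
    return max(categories, key=weight)
```

Passing `kernel=lambda e, t: 1.0` recovers a simple majority vote; other kernels weight closer neighbors more heavily.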
Sample Kernel Functions K(ei, et)

• K(ei, et) = 1  →  simple majority vote
• K(ei, et) = 1 / dist(ei, et)  →  inverse distance weighting

[Diagram: example ‘?’ has three neighbors, two ‘-’ and one ‘+’. A simple majority vote classifies ‘?’ as ‘-’; with inverse distance weighting, ‘?’ could be classified as ‘+’ if the ‘+’ neighbor is much closer.]
Gaussian Kernel

• Heavily used in SVMs

  K(ei, et) = e^( −dist(ei, et)² / (2σ²) )

where e is Euler’s number (≈ 2.718) and σ controls the kernel’s width.
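The Gaussian kernel is one line of Python; the sketch below takes the distance function as a parameter, which is an implementation choice rather than anything fixed by the formula.

```python
import math

def gaussian_kernel(ei, et, sigma, dist):
    """K(ei, et) = exp(-dist(ei, et)**2 / (2 * sigma**2)).
    Equals 1 when the examples coincide and decays toward 0 with distance;
    sigma controls how quickly the weight falls off."""
    return math.exp(-dist(ei, et) ** 2 / (2.0 * sigma ** 2))
```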
Local Learning

• Collect the k nearest neighbors
• Give them to some supervised ML algorithm
• Apply the learned model to the test example

[Diagram: a test example ‘?’ sits in a cloud of ‘+’ and ‘-’ points; we train only on its nearest neighbors.]
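The three steps above can be sketched generically; `fit` stands in for whatever supervised learner you plug in (the function names are illustrative, not from the slides).

```python
def local_learn(train, test, k, distance, fit):
    """Local learning: gather the k nearest neighbors of `test`,
    fit a supervised model on just those neighbors, then apply
    the learned model to `test`."""
    nearest = sorted(train, key=lambda s: distance(test, s[0]))[:k]
    model = fit(nearest)   # any supervised ML algorithm
    return model(test)
```

With `fit` returning a majority-class predictor this reduces to plain k-NN; the interesting cases use a richer learner (e.g., a linear model) on the local neighborhood.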
Instance-Based Learning (IBL) and Efficiency

• IBL algorithms postpone work from training to testing
  • Pure k-NN/IBL just memorizes the training data
  • Sometimes called lazy learning
• Computationally intensive
  • Match all features of all training examples
Instance-Based Learning (IBL) and Efficiency

• Possible speed-ups
  • Use a subset of the training examples (Aha)
  • Use clever data structures (A. Moore)
    • k-d trees, hash tables, Voronoi diagrams
  • Use a subset of the features
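To make the data-structure idea concrete, here is a small k-d tree sketch: build by splitting on alternating coordinates, then answer 1-NN queries while pruning subtrees the splitting plane rules out. This is a teaching sketch, not production code (no balancing heuristics, squared distances throughout).

```python
def build_kdtree(points, depth=0):
    """Recursively split points on alternating coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid],
            'left':  build_kdtree(points[:mid], depth + 1),
            'right': build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, depth=0, best=None):
    """1-NN search returning (squared_distance, point); a subtree is
    skipped when the splitting plane is farther than the best so far."""
    if node is None:
        return best
    point = node['point']
    d = sum((a - b) ** 2 for a, b in zip(point, query))
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = ((node['left'], node['right']) if diff < 0
                 else (node['right'], node['left']))
    best = nearest(near, query, depth + 1, best)
    if diff ** 2 < best[0]:   # the plane could still hide a closer point
        best = nearest(far, query, depth + 1, best)
    return best
```

On well-spread data this turns each query from a linear scan into roughly logarithmic work, which is exactly why such structures matter for lazy learners.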
Number of Features and Performance
• Too many features can hurt test set performance
• Too many irrelevant features mean many spurious correlation possibilities for a ML algorithm to detect
• “Curse of dimensionality”
Feature Selection and ML (general issue for ML)

Filtering-based feature selection:
  all features → FS algorithm → subset of features → ML algorithm → model

Wrapper-based feature selection:
  all features → FS algorithm ⇄ ML algorithm → model
  (the FS algorithm calls the ML algorithm many times, using it to help select features)
Feature Selection as a Search Problem

• State = set of features
• Start state = empty (forward selection) or full (backward selection)
• Goal test = highest-scoring state
• Operators
  • add/subtract features
• Scoring function
  • accuracy on the training (or tuning) set of the ML algorithm using this state’s feature set
Forward and Backward Selection of Features

• Hill-climbing (“greedy”) search
• States are feature sets to use; the heuristic function is accuracy on the tuning set

Forward: start with {} (50%); operators add one feature at a time:
  add F1 → {F1} 62%,  ...,  add FN → {FN} 71%

Backward: start with {F1, F2, ..., FN} (73%); operators subtract one feature at a time:
  subtract F1 → {F2, ..., FN} 79%,  subtract F2 → ...
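The forward direction of this greedy search can be sketched as follows; `score` stands in for tuning-set accuracy of the ML algorithm on a candidate feature set (the function names are illustrative).

```python
def forward_selection(features, score):
    """Greedy hill-climbing forward selection: repeatedly add the
    single feature that most improves the score; stop when no
    addition helps. Returns (chosen_features, best_score)."""
    chosen, best = set(), score(set())
    improved = True
    while improved:
        improved = False
        for f in features - chosen:
            s = score(chosen | {f})
            if s > best:
                best, best_f, improved = s, f, True
        if improved:
            chosen.add(best_f)
    return chosen, best
```

Backward selection is the mirror image: start from the full set and repeatedly subtract the feature whose removal most improves the score.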
Forward vs. Backward Feature Selection

Forward:
• Faster in early steps because fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)

Backward:
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
  • Example: area is important, but the features are length and width
Some Comments on k-NN

Positive:
• Easy to implement
• Good “baseline” algorithm / experimental control
• Incremental learning is easy
• Psychologically plausible model of human memory

Negative:
• No insight into the domain (no explicit model)
• Choice of distance function is problematic
• Doesn’t exploit/notice structure in examples
Questions about IBL (Breiman et al. – CART book)

• Computationally expensive to save all examples; slow classification of new examples
  • Addressed by IB2/IB3 of Aha et al. and the work of A. Moore (CMU; now Google)
  • Is this really a problem?
Questions about IBL (Breiman et al. – CART book)

• Intolerant of noise
  • Addressed by IB3 of Aha et al.
  • Addressed by the k-NN version
  • Addressed by feature selection – can discard the noisy feature
• Intolerant of irrelevant features
  • Since the algorithm is very fast, we can experimentally choose good feature sets (Kohavi, Ph.D. – now at Amazon)
More IBL Criticisms

• High sensitivity to choice of similarity (distance) function
  • Euclidean distance might not be the best choice
• Handling non-numeric features and missing feature values is not natural, but doable
  • How might we do this? (Part of HW1)
• No insight into the task (learned concept not interpretable)
Summary

• IBL can be a very effective machine learning algorithm
• Good “baseline” for experiments