Post on 16-Jul-2018
transcript
Topics in Artificial Intelligence
308-424A
Gregory Dudek
Office: MC 404
Lecture outline
Introductions
Administrative details
What is AI?
What the course will contain.
Overview of AI sub-topics.
Overview of AI applications.
What is AI?
Is artificial intelligence about solving problems, or about core scientific problems?
If we are working on artificial intelligence, how concerned do we have to be about the natural kind?
How can we decide that we have solved the problem?
Answer the following questions:
• What is Intelligence?
• What is Artificial Intelligence?
• When do you expect us to achieve artificial intelligence (already, soon, in 10 years, never)?
AI is about duplicating what the (human) brain DOES.
AI is about duplicating what the (human) brain SHOULD do.
Is AI computer science?
• Yes: it deals with algorithms, efficiency, tractability, etc.
• No: it includes philosophy, cognitive science, and engineering.
• Maybe: we don’t know yet how to define the area or the techniques. Maybe it’s a field in its own right?
• Who cares: the problems are exciting and important; isn’t this classification needless pedantry?
AI is truly interdisciplinary.
It relates to psychology, neurophysiology, mathematics, control theory (EE), etc.
Why AI (in its broadest sense) is the best part of science (a personal confession):
• Understanding the mind is one of the oldest and most challenging questions considered by modern science.
• It allows you to see an idea come to fruition in a tangible and useful way.
• You have wide latitude to select a preferred mix of theory, construction, data collection, and data analysis.
• It can have enormous potential practical impact.
Course contents: We will overview selected topics. This is not a comprehensive view of all of AI (it can’t be).
Key topics include:
PART 1. Knowledge representation: predicate calculus. Search: A*, alpha-beta search.
PART 2. Learning: a couple of flavors.
PART 3. Perception: vision.
What is AI today?
3 stereotypical components to actual systems:
Perception
Reasoning
Action
AI as a whole has fragmented: there are many sub-areas with reduced interaction between them. Perception, and vision in particular, has become a distinct community.
Robotics (i.e. action) has also become largely separate.
By and large, deliberative reasoning has held on to the title “(traditional) Artificial Intelligence”.
Within each major branch, sub-areas have developed.
Within reasoning, different approaches have developed their own styles and even jargon. E.g. neural networks, learning, game playing, reasoning with uncertainty, randomized search.
Example AI system
Scheduling
• Perception: Trivial task description language.
• Reasoning: Constraint Satisfaction, Stochastic Optimization, Linear programming, Genetic Algorithms
• Action: Trivial
Example AI system
Medical Diagnosis (e.g. Pathfinder by Heckerman at Microsoft)
• Perception: Symptoms, test results.
• Reasoning: Bayes Network inference, Machine Learning, Monte Carlo simulation
• Actions: Suggest tests, make diagnoses
There are big questions .....
• Can we make something that is as intelligent as a human?
• Can we make something that is as intelligent as a bee?
• Does intelligence depend on a model of the physical world?
• Can we get something that is really evolutionary and self improving and autonomous and flexible....?
And little questions.....
• Can we save this plant $20 million a year by improved pattern recognition?
• Can we save this bank $50 million a year by automatic fraud detection?
• Can we start a new industry of handwriting recognition / software agents?
Historical Context
Computer-science-based AI is commonly agreed to have started with “the Dartmouth Conference” in 1956.
Some of the attendees:
• John McCarthy: LISP, time-sharing, application of logic to reasoning.
• Marvin Minsky: Popularized neural nets and showed limits of neural nets; slots and frames.
• Claude Shannon: Information Theory, open-loop 5-ball juggling
• Allen Newell & Herb Simon: Bounded Rationality, Logic Theorist / General Problem Solver / SOAR
Historical context
• Reasoning was once seen as *the* AI problem.
• Chess, and related games, were once considered pivotal to understanding intelligence.
– They are now seen as a sub-domain of limited relevance to the bulk of AI research.
– While playing chess is a “solved problem”, understanding how humans play chess (so well) is hardly solved at all.
• Vision (almost all of it) was once given to an MIT graduate student as a “summer project”.
– More recently, a major figure said roughly: it is so hard that “if it were not for the human existence proof, we would have given up a long time ago”.
Intelligence implies….
• Reasoning (plan)
– Modelling the world: objects and interactions
– Inferring implicit relationships
– Problem solving, search for an answer, planning
• Interaction with the outside world (sense & act)
– Perception: the inference of objects and relationships from what sensors deliver.
• Sensors deliver “arrays of numbers”
Early Chronology
• George Boole, Gottlob Frege, Alfred Tarski: the formalization of human thought
• Alan Turing, John von Neumann, Claude Shannon:
– Cybernetics
– Equivalence/analogy between computation and thought !!!
• AI: The 40s and 50s
– McCulloch and Pitts: Described neural networks that could compute any computable function
– Samuel: Checker playing machine that learned to play better.
– "Dartmouth Conference" (1956): McCarthy coined the term "Artificial Intelligence"
– McCarthy: Defined LISP.
– Newell and Simon: The Logic Theorist. It was able to prove most of the theorems in Russell and Whitehead's Principia Mathematica. Bounded Rationality; the Logic Theorist becomes the General Problem Solver.
• Early Successes
• Minsky: microworlds
• Evans' ANALOGY solved geometric analogy problems that appear on IQ tests
• Bobrow's STUDENT solved algebra word problems
• Gelernter: Geometry Theorem Prover used axioms plus diagram information.
Expert Systems and the Commercialization of AI
• Buchanan and Feigenbaum: DENDRAL (1969)
• MYCIN (1976): diagnose infections.
• LUNAR (1973): First natural language question/answer system used in real life
• Rejuvenation of neural nets
– In theory, they can learn almost any function.
• In practice, it might take a millennium.
– Neural nets, while having obvious limitations, have surpassed hand-crafted systems in some key domains.
State-of-the-Art
• Almost grandmaster chess (ask me about checkers).
• Real-time speech recognition
• Expert systems "aren't really AI anymore", but many exist
The Bad News
• Heavily oversold, with ensuing backlash.
• Almost every AI problem is NP-complete.
– Lighthill report (1973).
• Perceptrons (a kind of neural network) shown to have extremely limited representation ability (Minsky and Papert).
• Some of AI seen as poorly formalized hackery or as mathematical self-indulgence.
An early intelligent system: the brain
• “Intelligent” processing in the brain is carried out by neurons, mainly in the cerebral cortex.
– Roughly 10^12 neurons (10^11 if you participated in frosh week) and thousands of connections per neuron.
– “Clock speed” (refractory period): 1 to 10 milliseconds
– Processing involves massive parallelism and distributed representation.
Comparison
• Computers
– Roughly 10 million transistors per chip
– Parallel machines: hundreds of CPU elements, 10^10 bits of RAM
– Clock speed: roughly 1 nanosecond
– Recall rate (for stored data) appears much faster.
• Does the different hardware imply that fundamentally different approaches must be used?
What is intelligence?
• Stock answer: “the ability to learn and to solve problems” [Webster’s]
– The ability to adapt to new situations.
• Your answers (paraphrased):
– “The ability to laugh at humorous situations.”
– “Understanding how other agents behave…”
– “The ability to analyze and solve a problem” *
– “The ability to work with abstract concepts” *
– “The ability to recognize patterns…”
What does the future hold?
• Many of you thought artificially intelligent systems were a long way off.
– “Never”
– “In some timescale comparable to the evolution of intelligence in animals”
– “In 50 years”
– “Not very soon”
• Some of the same people said things like:
– “Intelligence is the ability to solve problems that would be complex for a human being”
• In 1997 the computer “Deep Blue” played the human world chess champion Garry Kasparov
– (whom some have claimed is the best chess player in history!)
– DB: roughly 200 million board positions per second.
• 11 ply
– GK: 7 ply?
Monty Newborn at SOCS/McGill has been a pioneer in computer chess. Moderated the…
Playing Chess
“We’ll never really have artificial intelligence”
• Garry Kasparov:
“I could feel -- I could smell -- a new kind of intelligence across the table.”
• Drew McDermott:
“Saying Deep Blue doesn’t really think about chess is like saying an airplane doesn’t really fly because it doesn’t flap its wings”
• Robbins’ problem:
– In 1932 E. V. Huntington presented a basis for Boolean algebra: commutativity, associativity and the Huntington equation.
– Herbert Robbins conjectured it could be replaced by one simpler equation (the Robbins equation), leading (later) to Robbins algebras. Are all Robbins algebras Boolean algebras?
– Despite work by Robbins and Huntington and Tarski and others, no solution was found.
In November 1997, a computer solved the Robbins conjecture.
• First “creative” proof by computer.
– Qualitative difference from prior results based more heavily on exhaustive search, such as the four-color theorem:
• Any planar map can be colored using 4 colors so that no two edge-adjacent regions have the same color.
• Proven in 1976 with a combination of human effort and “sophisticated computing” that enumerated many special cases.
Learning
• Backgammon: TD-gammon [Tesauro]
– Plays world-champion level backgammon.
– Learns suitable strategies by playing games against itself.
– Plays millions of games.
– Based on a neural network trained using “backpropagation”: incremental changes based on observed errors.
• The method has not generalized too well to other domains.
Important challenges
• Domain specificity:
– Successful systems are restricted to narrow domains and specific tasks.
• Coping with noisy data
– Most successes have been in domains where the objectives and the “rules” were closely specified and formalized.
• Incorporation of commonsense knowledge
– Does every little thing have to be encoded or derived explicitly?
Domain specificity
• Natural language systems work “well” only when the “domain of discourse” is restricted.
• If not, things get very hard very fast.
• Consider these alternative meanings of “give”:
– John gave Pete a book. [tangible object delivered]
– John gave Pete a hard time. [mode of behavior]
– John gave Pete a black eye. [specific action]
Problem progression in Speech
• “Word spotting” is good today. Key is to ignore all of an utterance except keywords of interest.
• Speaker dependent continuous speech: quite good.
• Speaker independent continuous speech is getting good. Works well with a limited vocabulary.
Topic 2: Propositional logic
How do we explicitly represent our knowledge about the world?
References: Dean, Allen, Aloimonos, Chapter 3; Russell and Norvig, Chapter 6.
One of two or three logical languages we will consider.
Logical languages are analogous to programming languages: systems for describing knowledge that have a rigid syntax.
Logical languages (unlike programming languages) emphasize syntax. In principle, the semantics is irrelevant (in a narrow sense).
Knowledge Representation
• Most programs are a set of procedures that accomplish something using rules and “knowledge” embedded in the program itself.
• This is an example of implicitly encoded information.
– If you want to change the way Microsoft Word implements variables in macros, you have to hack the code.
– When my tax program needs to be upgraded for…
Explicit knowledge
When we encode rules in a separate rule book or Knowledge Base (KB), we have explicitly encoded (some of) the information of interest.
i.e. the rules are separate from the procedures for interpreting them.
– Explicit knowledge encoding, in general, makes it easier to update and manipulate (assuming…)
Knowledge and reasoning
Objective: to explicitly represent knowledge about the world.
– So that a computer can use it efficiently….
• Simply to use the facts we have encoded
• To make inferences about things it doesn’t know yet
– So that we can easily enter facts and modify our knowledge base.
• The combination of a formal language and an inference procedure is what we call a logic.
Wff’s
• In practice, with logical languages we combine symbols to express truths, or relationships, about the world.
• If we put the symbols together in a permitted way, we get a well-formed formula or wff.
• A proposition is another term for an allowed formula.
• A propositional variable is a proposition that is atomic: that is, it cannot be subdivided into other (smaller) propositions.
Terminology
• A set of wffs connected by AND's is a conjunction.
• A set of wffs connected by OR's is a disjunction.
• Literals are plain propositional variables, or their negations: P and ¬P.
Semantics
• We attach meaning to wffs in 2 steps: 1. By assigning truth values to the propositional variables (an interpretation); 2. By evaluating the connectives via their truth tables.
Discovering “new” truths
• Want to be able to generate new sentences that must be true, given the facts in the KB.
• Generation of new true sentences from the KB is called entailment.
• We do this with an inference procedure.
• If the inference procedure works “right”, we only get entailed sentences. Then the procedure is called sound.
Knowing about knowing
• We would like to have knowledge both about the world, as well as about the state of our own knowledge (i.e. meta-knowledge).
• Ontological commitments refer to the guarantees given by our logic and KB regarding the real world.
• Epistemological commitments relate to the states of knowledge, or kinds of knowledge, that an agent can have.
A particular set of truth assignments associated with propositional variables is a model IF THE ASSOCIATED FORMULA (or formulae) comes out with the value true.
e.g. For the formula
(A and B) implies (C and D)
the assignment
A=true B=true C=true D=true
is a model.
The assignment
A=false B=true C=true D=true
is also a model (a false antecedent makes the implication true).
Satisfiability
• If *no model is possible* for a formula, then the formula is NOT SATISFIABLE, otherwise it is satisfiable.
• A Theory is a set of formulae (in the context of propositional logic).
• If no model is possible for the negation of a formula, then we say the original formula is valid (also: a formula that is always true is a tautology).
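For small formulas, the definitions above (model, satisfiability, validity) can be checked mechanically by enumerating truth assignments. The following Python sketch is my own illustration (the function and variable names are not from the course); it tests the slide's example formula:

```python
from itertools import product

def models(formula, variables):
    """Enumerate every truth assignment over `variables` and keep the
    assignments (models) under which `formula` evaluates to true."""
    result = []
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            result.append(assignment)
    return result

# The slide's example: (A and B) implies (C and D).
# (P implies Q) is logically equivalent to ((not P) or Q).
f = lambda v: (not (v["A"] and v["B"])) or (v["C"] and v["D"])

ms = models(f, ["A", "B", "C", "D"])
satisfiable = len(ms) > 0      # some model exists
valid = len(ms) == 2 ** 4      # every assignment is a model (tautology)
```

With 4 variables there are 16 assignments; 13 of them are models here (the 12 where A∧B is false, plus the one where all four variables are true), so the formula is satisfiable but not valid.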
Completeness
• The set of steps used by a sound procedure to generate new sentences is a proof.
• If it is possible to find a proof for any sentence that is entailed, then the inference procedure is complete.
• A set of rules is refutation complete: if a set of sentences cannot be satisfied, then resolution will derive a contradiction, i.e. we can derive both P and not(P) for some variable P.
• Effective: can get an answer in a finite number of steps.
Rules of Inference
• Modus Ponens: from α ⇒ β and α, infer β.
• And-Elimination: from α ∧ β, infer α.
• Or-Introduction: from α, infer α ∨ β.
• Double-Negation Elimination: from ¬¬α, infer α.
• Unit Resolution: from α ∨ β and ¬β, infer α.
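Repeated application of Modus Ponens over a set of Horn rules gives a simple forward-chaining prover. This is a hedged Python sketch of the idea; the rule set is a made-up toy example, not from the course:

```python
def forward_chain(facts, rules):
    """Repeatedly apply Modus Ponens: once every premise of a rule is
    known, add its conclusion. Stop when nothing new can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known

# Hypothetical toy KB: ({"P1", "P2"}, "Q") stands for (P1 ∧ P2) ⇒ Q.
rules = [({"rain", "outside"}, "wet"),
         ({"wet"}, "cold")]
derived = forward_chain({"rain", "outside"}, rules)
```

Starting from the two given facts, the loop fires the first rule to derive "wet", then the second to derive "cold", and then terminates because no rule adds anything new.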
Complexity
• Determination of the satisfiability of an arbitrary formula is a key hard problem. It is in the class of NP-complete problems.
• Except:
– For a formula in CNF, if each clause has only 2 literals, we can “efficiently” determine satisfiability.
• Note: A is valid only if not(A) is not satisfiable.
– Thus validity is a hard question too.
Automated Theorem Proving
• Assume proper axioms of the form
(P1 ∧ P2 ∧ … ∧ Pn) ⇒ Q
• A fact is a propositional variable that is given.
• If we want to prove goal Q, we can do that by proving (P1 ∧ P2 ∧ … ∧ Pn).
– Q is reduced to (P1 ∧ P2 ∧ … ∧ Pn).
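The goal reduction described above can be sketched as a recursive backward chainer in Python. The function name and the toy rule are my own illustrative assumptions:

```python
def backward_chain(goal, facts, rules, seen=frozenset()):
    """Prove `goal` by reducing it, as on the slide, to the premises
    (P1 ∧ P2 ∧ … ∧ Pn) of some rule that concludes it."""
    if goal in facts:
        return True
    if goal in seen:                      # guard against circular rules
        return False
    return any(all(backward_chain(p, facts, rules, seen | {goal})
                   for p in premises)
               for premises, conclusion in rules
               if conclusion == goal)

# Hypothetical axiom (p1 ∧ p2) ⇒ q, with facts p1 and p2 given.
rules = [(["p1", "p2"], "q")]
provable = backward_chain("q", {"p1", "p2"}, rules)
```

Proving q succeeds because both of its premises are given facts; if p2 were missing from the fact set, the same call would fail.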
Predicate Calculus
• Also known as first order logic.
• A formal system with a “world” made up of
– Objects
– Properties of objects
– Relations between objects.
– Adds quantification over objects to propositional logic.
• Note: second order logic includes quantification over classes.
FOL components
• Relations can be functions
Hair_color_of()
Is_student()
Took_ai424()
But they don’t have to be
Son_of()
Owns_CD_titled()
FOL terminology
• Terms: represent objects, can be constants or expressions.
• Predicate symbols: a relation (sometimes functional).
• Sentences: as with propositional logic
• Arity: number of arguments to a relation
• Atomic sentence: predicate symbols and terms
Owns_printer_model(brother_of(Sue), HP_D550)
Complexity of ATP in FOL?
• First order logic is universal.
– Any inference or computation we know of can be described.
• We can describe the operation of a Turing machine.
– Thus, entailment is semidecidable.
• We can’t tell if a computation halts except by running it and waiting… maybe forever.
• Much effort on restricting FOL to assure it is decidable.
– It still may be “exponentially difficult”.
Lecture 4
308-424A
Topics in AI
Gregory Dudek
Predicate calculus cont’d
• Review & new stuff (in PDF form).
Clausal Form
• Any predicate calculus proposition can be converted into clauses in 6 steps:
– Removing implications
– Moving negation inwards
– Skolemising
– Moving universal quantifiers outwards
– Distributing AND over OR (for CNF)
– Putting into clauses (notation only)
• Read Dean, Allen, Aloimonos, Sec. 3.5, p. 96
Getting Horn(y)
• We have observed an equivalence between arbitrary sentences in FOL and CNF.
– This extends to HORN CLAUSES.
• Thus, we can use Horn clauses to express anything in FOL.
• The PROLOG language uses Horn clauses explicitly as its notation.
PROLOG
• PROLOG: a logic programming language.
– Name derives from PROgramming in LOGic
• Based on theorem proving, first-order logic.
• Small, unusual, influential language.
• Developed in the 1970s.
• On-line: documentation and executables for SOCS (lisa) and home PCs.
PROLOG notation
, (comma)      AND
; (semicolon)  OR
:-             IMPLIES
not            NOT
Variables start with uppercase.
Predicates & bound variables in lower case.
Facts & rules
• In prolog we can state facts like “natasha likes nicholas” by defining a suitable predicate:
– likes(natasha, nicholas).
• We can define rules that allow inference.
• Uses the closed world assumption: anything is false unless it is provably true.
Rules
• Rules:
– One predicate as conclusion.
• Implication works to the left.
• Left hand predicate must be a positive literal.
– Resolution and unification are the “internal” mechanisms.
• Prolog is based on satisfying goals using a resolution theorem prover.
PROLOG Examples
likes(X, dudek)
likes(Everybody, cs424)
Everybody likes cs424
likes(richard, X), likes(eric, X)
Things liked by both richard and eric
likes(phil, X) :- likes(eric, X)
If eric likes something, so does phil.
The Montreal Student Domain
goodstudent(X) :- awakeinclass(X), csstudent(X).
csstudent(X) :- smart(X), (adventurous(X) ; sensible(X)).
adventurous(X) :- ( montrealer(X) ; rockclimber(X) ).
awakeinclass(X) :- drinks(X,Y), hasdrug(Y,Z), stimulant(Z).
smart(X) :- not(rockclimber(X)), reader(X).
Facts about people.
• montrealer(jane).
• smart(jane).
• nerd(jane).
• drinks(jane,coffee).
• montrealer(bob).
• nerd(bob).
• drinks(bob,sprite).
• owns(teapot,ted).
More facts...
• reader(ted).
• reader(mary).
• reader(helen).
• fatherof(mary,ted).
• drinks(helen,sprite).
hasdrug(tea,caffiene).
hasdrug(tea,tannin).
hasdrug(tea,theobromine).
hasdrug(coffee,caffiene).
hasdrug(coffee,oil).
hasdrug(quat,foo).
hasdrug(sprite,sugar).
stimulant(caffiene).
% stimulant(theobromine).
Simple stuff
• reader( ted ).
yes
• reader(X).
X = ted ;
X = mary ;
X = helen ;
no
On-line examples...
Run prolog
This is “open prolog” for the Mac.
Lecture 5
G. Dudek
Topics in AI
McGill University
Today’s lecture
• Administrative issues
– Comments on assignment
– PDF files
– Class notes
• Knowledge representation: wrap-up
– Prolog details
– Non-monotonic logic
– Forward and backward chaining
• Introduction to search
Don’t care
• Symbol _ (underscore) is used to match an argument that we don’t plan to use on the right-hand side.
• It’s like a dummy variable.
E.g. likes(a,b). would return true no matter what a & b are.
We can use likes(a, _).
Prolog (continued)
• Supports lists of items
[] - empty list
[1,2,3] - 3 items
[bob, ted, alice] - three objects
[[a], [1,2,3], []] - a list of lists
To examine a sub-part,
[ H | L ]
refers to a list decomposed into a head H (the first element) and a tail L (the rest of the list).
Lists: [H|T]
[1,2,3] - head is 1
[bob, ted, alice] - head is bob
[ [a], [1,2,3], [ ] ] - head is [a]
[ [ [1,2,3]], [1,2,3], [ ] ] - head is [ [1,2,3] ]
[ ] - cannot match [H|T]
Testing membership
• Now we can easily define a predicate to test for list membership.
• Step 1: the head
member(H, [H|L]).
• First argument is an item.
• Second argument is a list.
– This matches if H is the head of the list.
member(bob, [bob,alice]) unifies with member(H, [H|L]) if we let bob match H and [bob,alice] match [H|L] (so L = [alice]).
Membership: the body.
• Step 2: if it’s not the head, then there must be a sublist for which it is the head.
– Recursive definition
• See if the item is the head of the tail portion.
member(Item, [Head|Tail]) :- member(Item, Tail).
Membership: complete
member(Item, [ Item | _ ]).
member(Item, [ _ | Rest ]) :- member(Item, Rest).
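For comparison, the same two-clause recursion can be written in Python. This is only a sketch of the logic: Prolog's member/2 can also enumerate members on backtracking, while this function is a plain boolean test.

```python
def member(item, lst):
    """A Python analogue of the two Prolog clauses:
       member(Item, [Item | _]).                        % item is the head
       member(Item, [_ | Rest]) :- member(Item, Rest).  % else recurse on the tail
    """
    if not lst:                  # empty list: no clause applies, so fail
        return False
    head, *rest = lst
    return head == item or member(item, rest)
```

As in the Prolog version, the base case succeeds when the item matches the head, and otherwise the tail is searched recursively; the empty list always fails.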
Unification: examining combinations
• Remember: prolog execution proceeds by repeated unifications, applied recursively.
• Consider:
– foo(x) :- bar(x).
– bar(x) :- foo(x).
– This will lead to a problem: the resolution loops and never terminates!
Recursion fix
• How can we fix the infinite recursion?
– Never re-examine an already-considered unifier (i.e. solution).
1. Within the definition, save the previous solutions (unifications).
2. Check if the new unifier (solution) is one of those.
How?
Use a list!
Improved foo!
foo(y, []).
member …
foo(X,L) :- not(member(X,L)), bar(X,[X|L]).
bar(X,L) :- not(member(X,L)), foo(X,[X|L]).
foo(a).
Concept Description Language
• A specialized language for efficientinference.
• Represent– classes of objects,
– sub-classes of classes,
– instances of classes,
– properties of instances (and classes).
• Akin to inheritance in object-orientedprogramming.
Nonmonotonic Logic
A monotonic logic: things that are theorems remain theorems as we add additional axioms.
Non-monotonic logic: formulas that were once theorems may not remain so as the theory is augmented.
• Idea: add “default” assumptions to the theory in the absence of complete knowledge.
• These assumptions may be retracted.
Minimal (nonmonotonic) models
• Can induce a preference ordering on interpretations.
Deductive retrieval
• Deductive retrieval uses a KB to store information, and uses rules to achieve goals.
• A system for maintaining a knowledge base. Includes retracting conclusions based on information that changes or which is deleted.
• Use forward chaining and backward chaining.
– Forward: start from the KB and see what can be inferred (leading to the goal, we hope). Especially happens when a question involves new knowledge.
• Q. if pigs fly then will I pass?
– Backward: goal-directed deduction; start from the goal and work back toward facts in the KB.
Search
• Reference: DAA Chapter 4.
• The process of explicitly examining a set of objects to find a particular one, or satisfy some goal.
• In this context, the objects are typically possible configurations of some problem representation.
• Search is a central topic in AI
– Originated with Newell and Simon's work on problem solving. Famous book: ``Human Problem Solving'' (1972)
• Automated reasoning is a natural search task. More recently: given that almost all AI formalisms (planning, learning, etc.) are NP-complete or worse, some form of search is unavoidable.
Problem Definition
• [State space] -- described by an initial state and the set of possible actions available (operators).
• A path is any sequence of actions that leads from one state to another.
• [Goal test] -- applicable to a single state to determine if it is the goal state.
• [Path cost] -- relevant if more than one path leads to the goal, and we want the cheapest one.
Example I : Cryptarithmetic
SEND
+ MORE
--------
MONEY
• Find substitution of digits for letters such that the resulting sum is arithmetically correct.
Cryptarithmetic, cont.
• [States:] a (partial) assignment of digits to letters.
• [Operators:] the act of assigning digits to letters.
• [Goal test:] all letters have been assigned digits and sum is correct.
• [Path cost:] zero. All solutions are equally valid.
Familiar ideas….
• Blind search: just examine “successive” possibilities.
• Depth first search (DFS)
• Breadth first search (BFS).
Lecture 6
• Conclusion of KR
• Search
– Blind search
State space
• In AI, search usually refers to search of a state space.
• State space: the ensemble of possible configurations of the domain of interest.
– Like phase space in physics
• E.g.
– Chess: The set of allowed arrangements of pieces on a chess board.
– Speech understanding: The set of possible arrangements of words that make valid sentences.
Search: Problem Definition
State space -- described by an initial state and the set of possible actions available (operators).
– A path is any sequence of actions that leads from one state to another.
Goal test -- applicable to a single state to determine if it is a (the) goal state.
Path cost -- relevant if more than one path leads to the goal, and we want the cheapest one.
Graphs, Trees & Search
• We can visualize generic state space search in terms of searching a graph or tree.
• Graph search corresponds to looking for a particular state given an arbitrary transition diagram.
– A graph is defined as G = (V, E)
– V: set of vertices (i.e. states)
– E: set of transitions ei = (vj, vk). Can be directed or undirected.
Traversal
• To traverse means to visit the vertices in some systematic order. You should be familiar with various traversal methods for trees:
preorder: visit each node before its children.
postorder: visit each node after its children.
inorder (for binary trees only): visit left subtree, node, right subtree.
Familiar ideas….
• Blind search: just examine “successive” alternative possibilities.
• Does not exploit knowledge of what states to examine first.
– In what order should we consider the states?
• Sequentially along a path? Leads to Depth First Search (DFS)
Depth First Search
• Key idea: pursue a sequence of successive states as long as possible.

unmark all vertices
choose some starting vertex x
mark x
list L = [x]
tree T = x
while L nonempty
    remove the vertex v from the front of L
    visit v
    for each unmarked neighbour w of v
        mark w
        add w to the front of L    (treat L as a stack)
        add edge (v,w) to T
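A minimal Python rendering of this traversal, using an explicit stack as the “front of the list”. The toy graph is a made-up example, not from the course:

```python
def dfs(graph, start):
    """Depth-first traversal: vertices are taken from the front of the
    list and new neighbours are pushed back onto the front (LIFO), so
    one path is pursued as far as possible before backtracking."""
    marked = {start}
    order = []
    stack = [start]
    while stack:
        v = stack.pop()                        # front of the list
        order.append(v)
        for w in reversed(graph.get(v, [])):   # reversed: try neighbours in listed order
            if w not in marked:
                marked.add(w)
                stack.append(w)
    return order

# Hypothetical 4-vertex graph given as adjacency lists.
g = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
```

On this graph, dfs(g, "a") follows the path a, b, d all the way down before backtracking to visit c.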
Depth First Search
[DFS illustrated]

BFS
• Key: explore nodes at the same distance from the start at the same time
unmark all vertices
choose some starting vertex x
mark x
list L = [x]
tree T = x
while L nonempty
    remove the vertex v from the front of L
    visit v
    for each unmarked neighbour w of v
        mark w
        add w to the back of L    (treat L as a queue)
        add edge (v,w) to T
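The only change from depth-first search is the list discipline: new neighbours go to the back, making the list a queue. A Python sketch on the same hypothetical toy graph:

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal: vertices are taken from the front and new
    neighbours appended to the BACK (FIFO), so all vertices at distance d
    are visited before any vertex at distance d+1."""
    marked = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()          # front of the list
        order.append(v)
        for w in graph.get(v, []):
            if w not in marked:
                marked.add(w)
                queue.append(w)      # back of the list
    return order

# Hypothetical 4-vertex graph given as adjacency lists.
g = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
```

Here bfs(g, "a") visits both of a's neighbours (b, c) before d, which is two steps away, in contrast to the depth-first order.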
Breadth First Search
[BFS illustrated]
Key issues in search
• Here’s what to keep in mind.
• Completeness: are we assured to find a solution (if one exists)?
• Space complexity: how much storage do we need?
• Time complexity: how many operations do we need?
• Solution quality: how good is the solution we find?
Example I : Cryptarithmetic
• Find a substitution of digits for letters such that the resulting sum is arithmetically correct.
• Each letter must stand for a different digit.
  SEND
+ MORE
------
MONEY
Cryptarithmetic, cont.
• States: a (partial) assignment of digits to letters.
• Operators: the act of assigning digits to letters.
• Goal test: all letters have been assigned digits and the sum is correct.
• Path cost: zero. All solutions are equally good.
• Solution method?
• Depth first search:
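A blind solver can simply enumerate digit assignments. A sketch in Python; the one piece of problem knowledge used is that M, as the carry out of a four-digit sum, must be 1:

```python
from itertools import permutations

def solve_send_more_money():
    """Blind search: try digit assignments until the sum checks out."""
    # M = 1 (the carry out of a four-digit addition), so the remaining
    # seven letters draw distinct digits from the other nine.
    for s, e, n, d, o, r, y in permutations([0, 2, 3, 4, 5, 6, 7, 8, 9], 7):
        if s == 0:                         # no leading zero on SEND
            continue
        send  = 1000*s + 100*e + 10*n + d
        more  = 1000*1 + 100*o + 10*r + e
        money = 10000*1 + 1000*o + 100*n + 10*e + y
        if send + more == money:
            return send, more, money
    return None
```

The unique solution is 9567 + 1085 = 10652.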
Search performance
• Key issues that determine the nature of the problem are:
– Branching factor of the search space: how many options do we have at any time?
• Typically summarized by the worst-case branching factor, which may be quite pessimistic.
– Solution depth: how long is the path to the first solution?
Example: knight’s tour
• Tour executed by a chess knight to cover (touch) every square on a chess board.
– Sub-problem: find a path from one position to another.
• States: possible positions on the chess board.
• Operators: the ways a knight moves.
• Goal: the positions of the pawns.
• Path cost: the number of moves used to get to the goal.
Knight’s tour
• How large is the state space?
– A knight has up to 8 moves per turn.
– Each possible tour must be verified up to the end of the trip. For a board of width N, there are N*N squares.
• Thus, each tour can be up to N*N states in length.
– If the correct solution is found last, we might consider every wrong tour first.
O(8^(N*N)) states to examine!
Example: Knight’s heuristics
• Zero: a board can be considered a dead end if any square has zero paths remaining to it. Any square with no paths to it would be unreachable, so no Knight's Tour would exist.
• Two ones: a board with more than one square with only one path to it is a dead end. A square with one path left is necessarily a dead end, so two of them indicate a dead end position.
• How well will we do if we use “blind” DFS?
• That is, if we consider random tours?
• Three heuristics based on the number of paths remaining to each square were implemented and tested in combination, as well as a representational speedup. The optimal combination of the heuristics is to eliminate boards with either a square with zero paths remaining or a square with two ones remaining; this combination led to a 950-fold speedup on a 6x6 board. The representational speedup led to a 2.5-fold additional speedup on a 6x6 board.
– Michael Bernstein
Heuristics: Knight’s Performance
BFS
• Consider a state space with a uniform branching factor of b.
• At each level we have b nodes for every node we had before:
1 + b + b^2 + b^3 + b^4 + ... + b^d
So, a solution at depth d implies we must expand O(b^d) nodes!
• internal nodes: (b^d - 1)/(b - 1)
Depth  Nodes   Time           Memory
0      1       1 millisecond  100 bytes
2      111     .1 seconds     11 kilobytes
4      11,111  11 seconds     1 megabyte
6      10^6    18 minutes     111 megabytes
8      10^8    31 hours       11 gigabytes
10     10^10   128 days       1 terabyte
12     10^12   35 years       111 terabytes
14     10^14   3,500 years    11,111 terabytes
Lecture 7
• Review
• Blind search
• Chess & search
Depth First Search
• Key idea: pursue a sequence of successive states as long as possible.
unmark all vertices
choose some starting vertex x
mark x
list L = x
tree T = x
while L nonempty
  choose the vertex v from the front of L
  visit v
  for each unmarked neighbour w of v
    mark w; add w to the front of L; add edge (v, w) to T
BFS
• Key: explore nodes at the same distance from the start at the same time.
unmark all vertices
choose some starting vertex x
mark x
list L = x
tree T = x
while L nonempty
  choose the vertex v from the front of L
  visit v
  for each unmarked neighbour w of v
    mark w; add w to the back of L; add edge (v, w) to T
Key issues in search
• Here’s what to keep in mind.
• Completeness: are we assured to find a solution (if one exists)?
• Space complexity: how much storage do we need?
• Time complexity: how many operations do we need?
• Solution quality: how good is the solution?
Search performance
• Key issues that determine the nature of the problem are:
– Branching factor of the search space: how many options do we have at any time?
• Typically summarized by the worst-case branching factor, which may be quite pessimistic.
– Solution depth: how long is the path to the first solution?
Example: knight’s tour
• Tour executed by a chess knight to cover (touch) every square on a chess board.
– Sub-problem: find a path from one position to another.
• States: possible positions on the chess board.
• Operators: the ways a knight moves.
• Goal: the positions of the pawns.
• Path cost: the number of moves used to get to the goal.
Knight’s tour
• How large is the state space?
– A knight has up to 8 moves per turn.
– Each possible tour must be verified up to the end of the trip. For a board of width N, there are N*N squares.
• Thus, each tour can be up to N*N states in length.
– If the correct solution is found last, we might consider every wrong tour first.
O(8^(N*N)) states to examine!
Example: Knight’s heuristics
• Zero: a board can be considered a dead end if any square has zero paths remaining to it. Any square with no paths to it would be unreachable, so no Knight's Tour would exist.
• Two ones: a board with more than one square with only one path to it is a dead end. A square with one path left is necessarily a dead end, so two of them indicate a dead end position.
• How well will we do if we use “blind” DFS?
• That is, if we consider random tours?
• Three heuristics based on the number of paths remaining to each square were implemented and tested in combination, as well as a representational speedup. The optimal combination of the heuristics is to eliminate boards with either a square with zero paths remaining or a square with two ones remaining; this combination led to a 950-fold speedup on a 6x6 board. The representational speedup led to a 2.5-fold additional speedup on a 6x6 board.
– Michael Bernstein
Heuristics: Knight’s Performance
BFS
• Consider a state space with a uniform branching factor of b.
• At each level we have b nodes for every node we had before:
1 + b + b^2 + b^3 + b^4 + ... + b^d
So, a solution at depth d implies we must expand O(b^d) nodes!
• internal nodes: (b^d - 1)/(b - 1)
BFS: solution quality
• How good is the solution quality from BFS?
• Remember that edges in the search tree can have weights.
• BFS will always find the shallowest (shortest depth) solution first.
• This will also be the minimum-cost solution if…
BFS & Memory
• For BFS, we must record all the nodes at the current level.
– BFS can have large (enormous) memory requirements!
• Memory can be a worse constraint than time.
• When the problem is exponential, however, both run out quickly.
Depth  Nodes   Time           Memory
0      1       1 millisecond  100 bytes
2      111     .1 seconds     11 kilobytes
4      11,111  11 seconds     1 megabyte
6      10^6    18 minutes     111 megabytes
8      10^8    31 hours       11 gigabytes
10     10^10   128 days       1 terabyte
12     10^12   35 years       111 terabytes
14     10^14   3,500 years    11,111 terabytes
Comparison to DFS?
• Worst case:
– If the search tree can be arbitrarily deep, DFS may never terminate!
– If the search tree has maximum depth m, then DFS (at worst) visits every node up to depth m.
– Time complexity O(b^m)
• If there are lots of solutions (or you’re lucky), DFS may do better than BFS.
– If it gets a solution on the first try, it only looks at d nodes.
Uniform Cost Search
• Note the edges in the search tree may have costs.
• All nodes at a given depth may not have the same cost, or desirability.
• Uniform cost search: expand nodes in order of increasing cost from the start.
• This is assured to find the cheapest path first, so long as ….
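A sketch with the open list as a priority queue ordered by accumulated cost g(n); the guarantee in the last bullet holds when no edge cost is negative (the graph format here is an assumption of this example):

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Expand nodes in order of increasing path cost g(n) from the start.

    graph: dict mapping node -> list of (neighbour, edge_cost) pairs.
    Returns (cost, path) for the cheapest path, or None.
    """
    frontier = [(0, start, [start])]     # priority queue keyed on g
    best = {start: 0}                    # cheapest known cost to each node
    while frontier:
        g, node, path = heapq.heappop(frontier)
        if node == goal:                 # cheapest path reaches goal first,
            return g, path               # provided all edge costs are >= 0
        for nbr, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best.get(nbr, float('inf')):
                best[nbr] = g2
                heapq.heappush(frontier, (g2, nbr, path + [nbr]))
    return None
```

With edges A-B (1), B-C (1), A-C (5), the direct but expensive A-C edge is bypassed.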
Iterative-Deepening
• Combine benefits of DFS (less memory) and BFS (best solution depth/time).
• Repeatedly apply DFS up to a maximum depth “diameter”.
• Incrementally increase the diameter. Unlike BFS, we do not store the leaves.
Iterative-Deepening Performance
Idea: expand the same nodes as BFS, and in the same order as BFS!
Don’t save intermediate results, so nodes must be re-expanded.
How can this be good?!
This is a classic time-space tradeoff.
Because the search tree is exponential, wasted work near the top doesn’t matter much.
Asymptotically optimal, complete.
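A sketch of the scheme: a depth-limited DFS re-run with a growing diameter (adjacency-dictionary graphs again assumed; `max_depth` is an illustrative cap):

```python
def depth_limited(graph, node, goal, limit, path):
    """DFS that gives up below the current depth limit."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for nbr in graph.get(node, []):
        found = depth_limited(graph, nbr, goal, limit - 1, path + [nbr])
        if found:
            return found
    return None

def iterative_deepening(graph, start, goal, max_depth=50):
    """Repeat DFS with diameter 0, 1, 2, ...

    Finds the shallowest solution (like BFS) with DFS's memory footprint;
    work near the root is redone each round, but the exponential growth of
    the tree makes that wasted work asymptotically negligible.
    """
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found:
            return found
    return None
```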
SEMINAR
Deep Blue: IBM's Massively ParallelChess Machine
Wednesday, September 23, 1998
TIME: 11:00 A.M.
Strathcona Anatomy and Dentistry Building
3640 University Street
Room M-1
Dr Gabriel M Silberman
Seminar abstract
IBM's premiere chess system, based on an IBM RS/6000 SP scalable parallel processor, made history by defeating world chess champion Garry Kasparov. Deep Blue's chess prowess stems from its capacity to examine over 200 million board positions per second, utilizing the computing resources of a 32-node IBM RS/6000-SP, populated with 512 special purpose chess accelerators.
In this talk we describe some of the technology behind Deep Blue, how chess knowledge was incorporated into its software, as well as the attitude of the media and general public during the match.
Computer Chess & DB
• Background for tomorrow’s seminar
– Presentation on Deep Blue and Chess(not available here)
Not in text
Bidirectional search
• Remember that the bottom of the search tree is where most of the work is.
• Idea:
– keep the tree smaller
– Search from both the initial state and the goal.
• Recall forwards + backwards chaining.
– Each leads to a half-size tree (and with luck, they meet)!
• If the goal is at depth d, time is O(b^(d/2)) (space too)
Not in text
Lecture 8
• Heuristic search
• Motivational video
• Assignment 2
Informed Search
• Search methods so far are based on expanding nodes in the search space based on distance from the start node.
– Obviously, we always know that!
• How about using the estimated distance h’(n) to the goal?
– What’s the problem?
DAA Ch. 4
Heuristic Search
• What if we just have a guess as to the distance to the goal: a heuristic (like “Eureka!”).
Best-First Search
• At any time, expand the most promising node.
• Recall our general search algorithm (from last lecture).
Compare this to uniform-cost search, which is, in some sense, the opposite!
Best-First
• Best-First is like DFS.
• HOW much like DFS depends on the character of the heuristic evaluation function h’(n).
– If it’s zero all the time, we get BFS.
• Best-first is a greedy method.
– Greedy methods maximize short-term advantage.
Example: route planning
• Consider planning a path along a road system.
• The straight-line distance from one place to another is a reasonable heuristic measure.
• Is it always right?
– Clearly not: some roads are very circuitous.
Example: The Road to Bucharest
[Figure: map of Romania showing cities (Arad, Zerind, Oradea, Sibiu, Timisoara, Lugoj, Mehadia, Dobreta, Craiova, Fagaras, Rimnicu Vilcea, Pitesti, Bucharest, Giurgiu, Urziceni, Hirsova, Eforie, Vaslui, Iasi, Neamt) with road distances on the edges (e.g. Arad-Sibiu 140, Sibiu-Fagaras 99, Sibiu-Rimnicu Vilcea 80, Fagaras-Bucharest 211, Rimnicu Vilcea-Pitesti 97, Pitesti-Bucharest 101), plus a table of straight-line distances to Bucharest (e.g. Arad 366, Sibiu 253, Fagaras 178, Rimnicu Vilcea 193, Pitesti 98, Bucharest 0).]
Problem: Too Greedy
• From Arad to Sibiu to Fagaras --- but to Rimnicu would have been better.
• Need to consider: cost of getting from the start node (Arad) to intermediate nodes!
Intermediate nodes
• Desirability of an intermediate node:
– How much it costs to get there
– How much farther one has to go afterwards
• Leads to an evaluation function of the form:
e(n) = g(n) + h’(n)
– As before, h’(n) estimates the remaining distance to the goal.
– Use g(n) to express the accumulated cost to get to this node.
Details...
• Set L to be the initial node(s).
• Let n be the node on L that minimizes e(n).
• If L is empty, fail.
• If n is a goal node, stop and return it (and the path from the initial node to n).
• Otherwise, remove n from L, add all of n's successors to L, and repeat.
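In Python, with the list L as a priority queue keyed on e(n) = g(n) + h'(n); the graph and heuristic formats are assumptions of this sketch:

```python
import heapq

def a_star(graph, h, start, goal):
    """Best-first search on e(n) = g(n) + h'(n).

    graph: node -> list of (neighbour, edge_cost); h: callable giving the
    heuristic estimate h'(n). With an admissible (and here consistent) h,
    the first goal popped lies on an optimal path.
    """
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        e, g, node, path = heapq.heappop(frontier)  # n minimizing e(n)
        if node == goal:
            return g, path
        for nbr, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(nbr, float('inf')):
                best_g[nbr] = g2
                heapq.heappush(frontier, (g2 + h(nbr), g2, nbr, path + [nbr]))
    return None
```

On the Romania fragment with the distances quoted on these slides, this expands Rimnicu (e = 413) before Fagaras (e = 417) and returns the 418-cost route through Pitesti.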
Finds Optimal Path
• Now expands Rimnicu (f = (140 + 80) + 193 = 413)
over
Fagaras (f = (140 + 99) + 178 = 417).
• Q. What if h(Fagaras) = 170 (also an admissible estimate)?
Admissibility
• An admissible heuristic always finds the best (lowest-cost) solution first.
– Sometimes we refer to the algorithm being admissible -- a minor “abuse of notation”.
– Is BFS admissible?
• If
h’(n) <= h(n)
then the heuristic is admissible.
A*
• The effect is that it is never overly optimistic (overly adventurous) about exploring paths in a DFS-like way.
• If we use this type of evaluation function with an admissible heuristic, we have algorithm A*.
– A* search
Monotonicity
• Let's also assume (true for most admissible heuristics) that e is monotonic, i.e., along any path from the root e never decreases.
• Can often modify a heuristic to become monotonic if it isn’t already.
• E.g. let n be the parent of n’. Suppose that g(n) = 3 and h(n) = 4, so e(n) = 7, and g(n’) = 4 and h(n’) = 2, so e(n’) = 6.
• But both lie on the same path, so we may use e(n’) = max(e(n), g(n’) + h(n’)) = 7 (the “pathmax” correction).
A* is good
• If h’(n) = h(n) then we know “everything” and e(n) is exact.
• If e(n) were exact, we could expand only the nodes on the actual optimal path to the goal.
• Note that in practice, e(n) is always an underestimate.
• So, we always expand more nodes than needed to find the optimal path.
• A* finds the optimal path (first).
• A* is complete.
• A* is optimally efficient for a given heuristic: if we ever skipped expanding a node, we might make a serious mistake.
Video
• Nick Roy (former McGill grad student) describes a tour-guide robot on the Discovery Channel.
• Note: such a mobile robot might well use A* search.
– Obstacles are a major source of the mismatch between h’ and h.
– Straight-line distance provides a natural estimate for h’.
Hill-climbing search
• Usually used when we do not have a specific goal state.
• Often used when we have a solution/path/setting and we want to improve it: an iterative improvement algorithm.
• Always move to a successor that increases the desirability (can minimize or maximize cost).
Lecture 9
• Heuristic search, continued
A* revisited
• Reminder: with A* we want to find the best-cost (C*) path to the goal first.
– To do this, all we have to do is make sure our cost estimates are less than the actual costs.
– Since our desired path has the lowest actual cost, no other path can be fully expanded first, since at some point its estimated cost will have to be higher than the cost C*.
Heuristic Functions: Example
[Figure: 8-puzzle start and goal states; tiles 1–8 in a 3x3 grid with one blank, the start being a scrambled arrangement of the goal.]
8-puzzle
hC = number of misplaced tiles
hM = Manhattan distance
• Admissible?
• Which one should we use?
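Both heuristics take only a few lines. In this sketch a state is a tuple of 9 entries read row by row, with 0 for the blank (a representation chosen for the example, not prescribed by the slides):

```python
def misplaced(state, goal):
    """h_C: number of tiles out of place (the blank is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal):
    """h_M: sum over tiles of horizontal plus vertical displacement."""
    total = 0
    for tile in range(1, 9):
        i, j = state.index(tile), goal.index(tile)
        total += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return total
```

Each misplaced tile needs at least one move, and at least its Manhattan distance in moves, so both underestimate the true cost and hC <= hM always holds.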
Choosing a good heuristic
hC <= hM <= hopt
• Prefer hM.
Note: Expand all nodes with e(n) = g(n) + h(n) < e*
• So, g(n) < e* - h(n); a higher h means fewer n's.
• When one h function is always larger than another, it dominates it, and expands no more nodes.
Inventing Heuristics
• Automatically, from relaxed versions of the problem. The 8-puzzle rule is:
• A tile can move from sq A to sq B if A is adjacent to B and B is blank.
• Relaxations:
• (a) A tile can move from sq A to sq B if A is adjacent to B.
• (b) A tile can move from sq A to sq B if B is blank.
• (c) A tile can move from sq A to sq B.
IDA*
• Problem with A*: like BFS, it uses too much memory.
• Can combine with iterative deepening to limit the depth of search.
• Only add nodes to search list L if they are within a limit on the e cost.
A Different Approach
• So far, we have considered methods that systematically explore the full search space, possibly using principled pruning (A* etc.).
• The current best such algorithms (IDA* / SMA*) can handle search spaces of up to 10^100 states or around 500 binary valued variables. (These are “ballpark” figures only!)
And if that’s not enough?
• What if we have 10,000 or 100,000 variables / search spaces of up to 10^30,000 states?
• A completely different kind of method is called for:
Local Search Methods
or
Iterative Improvement Methods
When?
Applicable when we're interested in the goal state, not in how to get there.
E.g. N-Queens, VLSI layout, or map coloring.
Iterative Improvement
• Simplest case:
– Start with some (initial) state
– Evaluate its quality
– Apply a perturbation ∂ to move to an adjacent state
– Evaluate the new state
– Accept the new state sometimes, based on decision D(∂)
∂ can be random.
D(.) can be: always.
Or….
Or….
Can we be smarter?
• Can we do something smarter than
A) moving randomly
B) accepting every change?
A) ….. move in the “right” direction
B) accept changes that get us towards a better solution
Hill-climbing search
• Often used when we have a solution/path/setting and we want to improve it, and we have a suitable space: an iterative improvement algorithm.
• Always move to a successor that increases the desirability (can minimize or maximize cost).
Hill-climbing problems
• Hill climbing corresponds to function optimization by simply moving uphill along the gradient.
– Why isn’t this a reliable way of getting to a global maximum?
• 1. Local maxima
• 2. Plateaus
• 3. Ridges
Gradient ascent
• In continuous spaces, there are better ways to perform hill climbing.
– Acceleration methods
• Try to avoid sliding along a ridge
– Conjugate gradient methods
• Take steps that do not counteract one another.
Stochastic search
• Simply moving uphill (or downhill) isn’t good enough.
– Sometimes, we have to go against the obvious trend.
• Idea: Randomly make a “non-intuitive” move.
– Key question: How often?
• One answer (simple intuition):
Simulated Annealing
• Initially act more randomly.
• As time passes, we assume we are doing better and act less randomly.
• Analogy: associate each state with an energy or “goodness”.
• Specifics:
– Pick a random successor s2 to the current state s1.
– Compute ∂E = energy(s2) - energy(s1).
– If ∂E good (positive), go to the new state.
– Otherwise, go anyway with a probability that shrinks as a “temperature” parameter is lowered over time.
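A sketch following the slide's convention that larger energy is better. The Boltzmann acceptance rule exp(∂E/T) and the exponential cooling schedule are standard choices, not specified on the slide:

```python
import math
import random

def simulated_annealing(state, neighbour, energy, t0=1.0, cooling=0.995,
                        steps=5000):
    """Accept improvements always; accept worsening moves with probability
    exp(dE / T), so the walk is nearly random while T is high and nearly
    greedy once T has cooled."""
    t = t0
    best = state
    for _ in range(steps):
        s2 = neighbour(state)                 # random successor s2 of s1
        de = energy(s2) - energy(state)       # dE = energy(s2) - energy(s1)
        if de > 0 or random.random() < math.exp(de / t):
            state = s2
        if energy(state) > energy(best):
            best = state                      # remember the best state seen
        t *= cooling                          # cooling schedule
    return best
```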
GA’s
Genetic Algorithms
Plastic transparencies & blackboard.
Adversary Search
Basic formalism
• 2 players.
• Each has to beat the other (competing objective functions).
• Complete information.
• Alternating moves for the 2 players.
Minimax
• Divide tree into plies
• Propagate information up from terminal nodes.
• Storage needs grow exponentially with depth.
Lecture 10
• Annealing (final comments)
• Adversary Search
• Genetic Algorithms (genetic search)
A comment
• On minimal length paths….
• The Dynamic-Programming Principle [Winston]:
“The best way through a particular, intermediate place is the best way to it from the starting place, followed by the best way from it to the goal. There is no need to look at any other paths to or from the intermediate place.”
Simulated Annealing
• Initially act more randomly.
• As time passes, we assume we are doing better and act less randomly.
• Analogy: associate each state with an energy or “goodness”.
• Specifics:
– Pick a random successor s2 to the current state s1.
– Compute ∂E = energy(s2) - energy(s1).
– If ∂E good (positive), go to the new state.
– Otherwise, go anyway with a probability that shrinks as a “temperature” parameter is lowered over time.
Adversary Search
Basic formalism
• 2 players.
• Each has to beat the other (competing objective functions).
• Complete information.
• Alternating moves for the 2 players.
The players
• Players want to achieve “opposite” goals:
– What’s bad for A is good for B, and vice versa.
• With respect to our static evaluation functions, one wants to maximize the function, the other wants to minimize it.
• Use a game tree to describe the state space.
Game Tree
• Nodes represent board configurations.
• Edges represent allowed moves from one configuration to another.
• For most 2-player games, alternating levels in the tree refer to alternating moves by the 2 players.
• Static evaluation function measures the desirability of a board configuration.
See overhead
MINIMAX
• If the limit of search has been reached, compute the static value of the current position relative to the appropriate player. Return the result.
• Otherwise, use MINIMAX on the children of the current position.
– If the level is a minimizing level, return the minimum of the results.
– If the level is a maximizing level, return the maximum of the results.
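A compact recursive version. Here a position is abstract, with `children` and `evaluate` supplied by the game; the interface is illustrative, not from the slides:

```python
def minimax(node, depth, maximizing, children, evaluate):
    """Return the backed-up value of `node`.

    children(node): successor positions (empty for a terminal position);
    evaluate(node): static value from the maximizing player's viewpoint.
    """
    succ = children(node)
    if depth == 0 or not succ:      # search limit reached, or terminal
        return evaluate(node)
    values = [minimax(c, depth - 1, not maximizing, children, evaluate)
              for c in succ]
    return max(values) if maximizing else min(values)
```

For the two-ply tree [[3, 5], [2, 9]] with MAX to move, the backed-up value is max(min(3, 5), min(2, 9)) = 3.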
MINIMAX observations
• The static evaluation function is…
– Crucial: all decisions are eventually based on the value it returns.
– Irrelevant: if we search to the end of the game, we always know the outcome.
• In practice, if the game has a large search space then the static evaluation is especially important.
Minimax
• Divide tree into plies
• Propagate information up from terminal nodes.
• Storage needs grow exponentially with depth.
Alpha-Beta Search
• Can we avoid all the search-tree expansion implicit in MINIMAX?
• Yes, with a simple observation closely related to A*:
Once you have found a good path, you only need to consider alternative paths that are better than that one.
The α−β principle.
• “If you have an idea that issurely bad, do not take time to
see how truly awful it is.”[Winston]
α−β cutoff: How good is it?
• What is the best-case improvement for alpha-beta?
• What is the worst-case improvement for alpha-beta?
• Best case: only the leftmost “pre-terminal” nodes are examined fully, roughly doubling the searchable depth for the same effort.
• Worst case: no cutoffs occur, and alpha-beta examines the same nodes as MINIMAX.
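Minimax with the cutoff added, using the same illustrative `children`/`evaluate` interface. Alpha is the best value MAX can already guarantee, beta the best for MIN; a branch is abandoned as soon as alpha >= beta:

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Minimax with pruning: drop a branch once it is provably worse than
    an alternative already available higher in the tree."""
    succ = children(node)
    if depth == 0 or not succ:
        return evaluate(node)
    if maximizing:
        value = float('-inf')
        for c in succ:
            value = max(value, alphabeta(c, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:        # beta cutoff: MIN will avoid this line
                break
        return value
    value = float('inf')
    for c in succ:
        value = min(value, alphabeta(c, depth - 1, alpha, beta,
                                     True, children, evaluate))
        beta = min(beta, value)
        if alpha >= beta:            # alpha cutoff: MAX will avoid this line
            break
    return value
```

On [[3, 5], [2, 9]] it returns the same value as minimax (3), but prunes the 9 leaf: once MIN can force 2 in the second subtree, MAX already has 3 elsewhere.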
Not in text
Progressive deepening
• How do you deal with (potential) time pressure?
• Search progressively deeper: first depth 1, then depth 2, etc.
• As before, the ratio of leaf to interior nodes is
b^d(b-1)/(b^d-1), or roughly (b-1)
Not in text
Issues
• Horizon effect
• Search-until-quiescent heuristic
• Singular-extension heuristic
Not in text
GA’s
Genetic Algorithms
Plastic transparencies & blackboard.
Lecture 11
• Learning
– What is learning?
– Supervised vs. unsupervised
– Supervised learning
• ALVINN demo (cf. DAA p. 181)
• Inductive inference
Problem Solving as Search: recap
Uninformed search:
• DFS
• BFS
• Uniform cost search
• time / space complexity
– size of search space: up to approx. 10^11 nodes
Better….
• Informed search: use a heuristic function to guide to the goal
– Greedy search
– A* search: provably optimal
• Search space up to approx. 10^25
– Local search (incomplete)
– Greedy / hillclimbing / GSAT
– Simulated annealing
– Genetic Algorithms / Genetic Programming
– Larger search spaces
Adversary search
Adversary search / game playing
– Minimax
• Up to around 10^10 nodes, 6--7 ply in chess.
– alpha-beta pruning
• Up to around 10^20 nodes, 14 ply in chess.
Genetic Algorithms
• Example: worked out on blackboard.
– Using a GA to generate a course lecture schedule.
Kinds of Learning (Q&A)
• What do you associate with the term learning?
What is for you the prototypical learning task?
• Memorization and rote learning, like flashcards?
• Skill acquisition, such as learning to ski or learning to do symbolic integration?
• Theory or discovery learning, like discovering a new economics model for predicting market fluctuations?
What is learning?
Definition: “… changes to the content and organization of an agent's knowledge enabling it to improve its performance on a particular task or population of tasks” [Herb Simon].
Types of learning
• Inferential basis
– Inductive learning
– Deductive learning
Inductive
• inductive learning and the acquisition of new knowledge
– inferring generalities from particulars. Note that this type of learning is not sound; e.g., learn what foods served at the cafeteria are digestible.
Deductive
• deductive learning and the organization of existing knowledge
– making explicit deductive consequences of existing axioms. This type of learning generally is sound; e.g., expedite deductive inference by adding new axioms:
from
  forall x, loves(x,x) and
  forall x,y, loves(x,y) -> (has-money(x) -> pay-bill(x,y))
we can conclude that
  forall x, has-money(x) -> pay-bill(x,x)
Learning: pedagogical classification
• Pedagogical basis
– How do we teach the learning system?
– Supervised vs. unsupervised.
Supervised
• supervised learning involves a teacher that provides examples of the form (description, solution).
– I.e. pairs where the solution for a specific problem instance is explicitly indicated.
• The description is a representation of a particular problem instance or situation, and the solution is a representation of the desired solution or response.
Unsupervised
• unsupervised learning need not involve a teacher at all.
• if it does, the teacher only provides rewards and punishments.
• The learner not only has to figure out what constitutes a problem instance but also has to figure out an appropriate response.
• e.g., learn chess by trial and error, or discovery learning as in learning new concepts in mathematics (perfect numbers) or game playing.
Concepts as functions
• f: X -> Y
Input space:
• X = {0,1} x {0,1} x ... x {0,1} = {0,1}^n
i.e., the set of all possible assignments to n boolean variables
• X = R^n, where R is the real numbers
• X = set of all descriptions of some class of objects
f: X -> Y
Output space:
• Y = {0,1}: this is called concept learning
• Y = {1,2,3,...,n}: this is called classification
– think of the integers 1 through n as representing classes, e.g., compact, sporty, subcompact, midsize, full size, too_damn_big
• Y = set of possible actions to take
Supervised learning
• Example:
• The learning system is given a set of training examples of the form (x, y).
• For example, ((0 1),0), ((1 0),0), ((1 1),1) are training examples for a two-input boolean function: conjunction.
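The training set pins down only part of the target function, which can be checked mechanically. A sketch that enumerates all 16 two-input boolean functions (as truth tables) and keeps the ones consistent with the three examples:

```python
from itertools import product

# Training examples for an unknown two-input boolean function.
examples = [((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# A hypothesis is just a truth table over the four possible inputs.
tables = [dict(zip(product((0, 1), repeat=2), bits))
          for bits in product((0, 1), repeat=4)]

def consistent(table):
    """True if the hypothesis agrees with every training example."""
    return all(table[x] == y for x, y in examples)

survivors = [t for t in tables if consistent(t)]
# The input (0, 0) never appears in the training data, so two hypotheses
# survive: AND, and the equality test (which also outputs 1 on (0, 0)).
```

Both survivors match the data perfectly and differ only on the unseen input (0, 0): the examples alone cannot finish the job, which is where inductive bias (below) comes in.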
Example presentation
• In batch problems, all of the examples are given at once.
• In online problems, the learner is given one example at a time and is assumed not to have the storage necessary to keep track of all of the examples seen so far.
• Why is online interesting? Examples?
A learning system
• ALVINN demo video….
• This system does real learning.
– Is it supervised?
– Is it batch?
• For now, consider learning problems in which the training examples are noise free.
• This is not always the case.
– e.g., what if the driver made a bad decision while training ALVINN?
• Generally, some sort of statistical analysis is needed to avoid being misled by mis-classified training examples.
• Other examples?
• Learning web-page preferences.
• Movie selections (grouchy mood).
• Medical diagnosis (erratic symptom/dishonest patient).
Learning: formalism
Come up with some function f such that
• f(x) = y for all training examples (x, y), and
• f (somehow) generalizes to yet unseen examples.
Inductive bias: intro
• There has to be some structure apparent in the inputs in order to support generalization.
• Consider the following pairs from the phone book.
Inputs            Outputs
Ralph Student     941-2983
Louie Reasoner    456-1935
Harry Coder       247-1993
Fred Flintstone   ???-????
There is not much to go on here.
• Suppose we were to add zip code information.
• Suppose phone numbers were issued based on the spelling of a person's last name.
• Suppose the outputs were user passwords?
Example 2
• Consider the problem of fitting a curve to a set of (x,y) pairs.
| x x
|-x----------x---
| x
|__________x_____
– Should you fit a linear, quadratic, cubic, piecewise-linear function?
– It would help to have some idea of how smooth the target function is, or to know from what family of functions (e.g., polynomials of bounded degree) it comes.
Inductive Bias: definition
• This "some idea of what to choose from" iscalled an inductive bias.
• TerminologyH, hypothesis space - a set of functions tochoose from
C, concept space - a set of possible functionsto learn
• Often in learning we search for a hypothesisf in H that is consistent with the trainingexamples, i.e., f(x)= y for all training
Lecture 14
• Learning
– Inductive inference
– Probably approximately correct learning
What is learning?
Key point: all learning can be seen as learning the representation of a function.
Will become clearer with more examples!
Example representations:
• propositional if-then rules
• first-order if-then rules
• first-order logic theories
• decision trees
• neural networks
Learning: formalism
Come up with some function f such that
• f(x) = y for all training examples (x, y), and
• f (somehow) generalizes to yet unseen examples.
Inductive bias: intro
• There has to be some structure apparent in the inputs in order to support generalization.
• Consider the following pairs from the phone book.
Inputs            Outputs
Ralph Student     941-2983
Louie Reasoner    456-1935
Harry Coder       247-1993
Fred Flintstone   ???-????
There is not much to go on here.
• Suppose we were to add zip code information.
• Suppose phone numbers were issued based on the spelling of a person's last name.
• Suppose the outputs were user passwords?
Example 2
• Consider the problem of fitting a curve to a set of (x,y) pairs.
| x x
|-x----------x---
| x
|__________x_____
– Should you fit a linear, quadratic, cubic, piecewise-linear function?
– It would help to have some idea of how smooth the target function is, or to know from what family of functions (e.g., polynomials of bounded degree) it comes.
Inductive Learning
Given a collection of examples (x, f(x)), return a function h that approximates f.
h is called the hypothesis and is chosen from the hypothesis space.
• What if f is not in the hypothesis space?
Inductive Bias: definition
• This "some idea of what to choose from" iscalled an inductive bias.
• TerminologyH, hypothesis space - a set of functions tochoose from
C, concept space - a set of possible functionsto learn
• Often in learning we search for a hypothesisf in H that is consistent with the trainingexamples,
Which hypothesis?
[Figure: the same set of data points fit four different ways, panels (a)-(d), ranging from a simple smooth curve to one that passes through every point.]
Bias explanation
How does a learning algorithm decide among consistent hypotheses?
Bias leads them to prefer one hypothesis over another.
Two types of bias:
• preference bias (or search bias): depending on how the hypothesis space is explored, you get different answers
• restriction bias (or language bias): the “language” used: Java, FOL, etc. (some hypotheses h are simply not expressible)
Issues in selecting the bias
Tradeoff (similar in reasoning):
the more expressive the language, the harder it is to find (compute) a good hypothesis.
Compare: propositional Horn clauses with first-order logic theories or Java programs.
• Also, often need more examples.
Occam’s Razor
• Most standard and intuitive preference bias:
Occam’s Razor (aka Ockham’s Razor)
The most likely hypothesis is the simplest one that is consistent with all of the observations.
Implications
• The world is simple.
• The chances of an accidentally correct explanation are low for a simple theory.
Probably Approximately Correct (PAC) Learning
Two important questions that we have yet to address:
• Where do the training examples come from?
• How do we test performance, i.e., are we doing a good job learning?
• PAC learning is one approach to dealing with these questions.
Classifier example
Consider learning the predicate Flies(Z) in {true, false}.
We are assigning objects to one of two categories: recall we call this a classifier.
Suppose that X = {pigeon, dodo, penguin, 747}, Y = {true, false}, and that
Pr(pigeon) = 0.3   Flies(pigeon) = true
Pr(dodo) = 0.1     Flies(dodo) = false
• Note that if we mis-classified dodos but got everything else right, then we would still be doing pretty well, in the sense that 90% of the time we would get the right answer.
• We formalize this as follows.
• The approximate error associated with a hypothesis f is
error(f) = sum over {x : f(x) ≠ Flies(x)} of Pr(x)
• We say that a hypothesis is approximately correct with error at most ε if error(f) <= ε.
• The chances that a theory is correct increase with the number of consistent examples it predicts.
• Or….
• A badly wrong theory will probably be uncovered after only a few tests.
PAC: definition
Relax this requirement by not requiring that the learning program necessarily achieve a small error, but only that it keep the error small with high probability.
Probably approximately correct (PAC) with probability δ and error at most ε: given any set of training examples drawn according to the fixed distribution, the program outputs a hypothesis f with error(f) > ε only with probability at most δ.
PAC
• Idea:
• Consider the space of hypotheses.
• Divide these into “good” and “bad” sets.
• Want to assure that we can close in on the set of good hypotheses that are close approximations of the correct theory.
PAC Training examples
Theorem:
If the number of hypotheses |H| is finite, then a program that returns a hypothesis that is consistent with
m = ln(δ/|H|) / ln(1 - ε)
training examples (drawn according to Pr) is guaranteed to be PAC with probability δ and error bounded by ε.
PAC theorem: proof
If f is not approximately correct then error(f) > ε, so the probability of f being correct on one example is < 1 − ε, and the probability of f being correct on m examples is < (1 − ε)^m.
Suppose that H = {f, g}. The probability that f correctly classifies all m examples is < (1 − ε)^m. The probability that g correctly classifies all m examples is < (1 − ε)^m. The probability that one of f or g correctly classifies all m examples is < 2(1 − ε)^m.
To ensure that any hypothesis consistent with m training examples is correct with error at most ε with probability 1 − δ:
Generalizing, there are |H| hypotheses in the restricted hypothesis space, and hence the probability that there is some hypothesis in H that correctly classifies all m examples is bounded by
|H| (1 − ε)^m.
Solving for m in

|H| (1 − ε)^m < δ

we obtain

m ≥ ln(δ/|H|) / ln(1 − ε).
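The bound is easy to evaluate numerically. A minimal sketch (the function name and the example numbers are illustrative, not from the lecture):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with |H|(1-eps)^m < delta,
    i.e. m >= ln(delta/|H|) / ln(1-eps)."""
    return math.ceil(math.log(delta / h_size) / math.log(1.0 - epsilon))

# e.g. 1000 hypotheses, 10% error, 95% confidence
m = pac_sample_bound(1000, 0.10, 0.05)   # -> 94
```

Note how weakly m depends on |H|: it enters only through a logarithm.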
Stationarity
• Key assumption of PAC learning:
Past examples are drawn randomly from the same distribution as future examples: stationarity.
The number m of examples required is called the sample complexity.
A class of concepts C is said to be PAC learnable for a hypothesis space H if (roughly) there exists a polynomial-time algorithm such that:
for any c in C, distribution Pr, epsilon, and delta,
if the algorithm is given a number of training examples polynomial in 1/epsilon and 1/delta, then with probability 1 − delta it returns a hypothesis with error at most epsilon.
Overfitting
Consider the error of hypothesis h over:
• the training data: error_train(h)
• the entire distribution D of data: error_D(h)
• Hypothesis h ∈ H overfits the training data if
– there is an alternative hypothesis h' ∈ H such that
• error_train(h) < error_train(h')
but
• error_D(h) > error_D(h')
Lecture 14
• Learning
– Probably approximately correct learning (cont’d)
– Version spaces
– Decision trees
PAC: definition
Relax this requirement by not requiring that the learning program necessarily achieve a small error, but only that it keep the error small with high probability.
Probably approximately correct (PAC) with probability δ and error at most ε if, given any set of training examples drawn according to the fixed distribution, the program outputs a hypothesis f such that error(f) ≤ ε.
PAC Training examples
Theorem:
If the number of hypotheses |H| is finite, then a program that returns a hypothesis that is consistent with

m = ln(δ/|H|) / ln(1 − ε)

training examples (drawn according to Pr) is guaranteed to be PAC with probability δ and error bounded by ε.
We want….
• PAC (so far) describes the accuracy of the hypothesis, and the chances of finding such a concept.
– How many examples do we need to rule out the “really bad” hypotheses?
• We also want the process to proceed quickly.
PAC learnable spaces
A class of concepts C is said to be
PAC learnable for a hypothesis space H if there exists a polynomial-time algorithm A
such that:
for any c ∈ C, distribution Pr, ε, and δ,
if A is given a quantity of training examples polynomial in 1/ε and 1/δ,
then with probability 1 − δ
the algorithm will return a hypothesis f from H with error at most ε.
Observations on PAC
• PAC learnability doesn’t tell us how to find the learning algorithm.
• The number of examples needed grows slowly as the concept space increases, and with the other key parameters.
Example
• Target and learned concepts are conjunctions of up to n predicates. (This is our bias.)
– Each predicate may appear in positive or negated form, or be absent: 3 options.
– This gives 3^n possible conjunctions in the hypothesis space.
Result
• I have such a formula in mind.
• I’ll give you some examples.
• You try to guess what the formula is.
A concept that matches all our examples will be PAC if m is at least
(n/ε) ln(3/δ)
How
• How can we actually find a suitable concept?
• One key approach: start with the examples themselves, and try to generalize.
• E.g. given f(3,5) and f(5,5), we might try replacing the first argument with a variable X: f(X,5).
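This generalization step can be sketched in code (the names are illustrative): positions where the examples agree are kept, and positions where they disagree become variables.

```python
WILD = 'X'  # a variable standing for "any value"

def generalize(a, b):
    """Least generalization of two argument tuples: keep agreeing
    positions, replace disagreements with a variable."""
    return tuple(x if x == y else WILD for x, y in zip(a, b))

# f(3,5) and f(5,5) generalize to f(X,5)
print(generalize((3, 5), (5, 5)))   # -> ('X', 5)
```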
Version Space [DAA 5.3]
• Deals with conjunctive concepts.
• Consider a concept C as being identified with the set of positive examples associated with it.
– C: odd-numbered hearts = {3♥, 5♥, 7♥, 9♥}.
• A concept C1 is a specialization of concept C2 if the examples associated with C1 are a subset of those associated with C2.
Specialization/Generalization

           Cards
          /     \
      Black     Red
               /    \
        Even red    Odd red
                       |
                Odd-♥ (red is implied)
                       |
                 3♥ 5♥ 7♥ 9♥
Immediate
• Immediate specialization: no intermediate.
• Red is not the immediate specialization of 2-of-hearts.
• Red is the immediate specialization of hearts and diamonds.
– Note: this observation depends on knowing the hypothesis space restriction.
Algorithm outline
• Incrementally process training data.
• Keep lists of the most and least specific concepts consistent with the observed data.
– For two concepts A and B that are consistent with the data, the concept C = (A AND B) will also be consistent, yet more specific.
• Tied in a subtle way to conjunctions.
– Disjunctive concepts can be obtained trivially by joining examples, but they’re not interesting.
• 4♥ : no
• 5♥ : yes
• 5♣ : no
• 7♥ : yes
• 9♠ : ---
• 3♥ : yes
VS example

           Cards
          /     \
      Black     Red
               /    \
        Even red    Odd red
                       |
                Odd-♥ (red is implied)
                       |
                 3♥ 5♥ 7♥ 9♥
Algorithm specifics
• Maintain two bounding concepts:
– The most specialized (DAA: specific boundary).
– The broadest (DAA: general boundary).
• Each example we see is either positive (yes) or negative (no).
• Positive examples (+) tend to make the concept more general (or inclusive). Negative examples (−) are used to make the concept more exclusive (to reject them).
Observations
• It allows you to GENERALIZE from a training set to examples never before seen!
– In contrast, consider table lookup or rote learning.
• Why is that good?
1. It allows you to infer things about new data (the whole point of learning).
2. It allows you to (potentially) remember old data.
Restaurant Selector
Example attributes:
• 1. Alternate
• 2. Bar
• 3. Fri/Sat
• 4. Hungry
• 5. Patrons
• 6. Price
• etc.
e.g. Patrons(Full) AND …
Example 2
Maybe we should have made a reservation? (using a decision tree)
• Restaurant lookup: you’ve heard Joe’s is good.
• Lookup Joe’s
• Lookup Chez Joe
• Lookup Restaurant Joe’s
Decision trees: issues
• Constructing a decision tree is easy… really easy!
– Just add examples in turn.
• Difficulty: how can we extract a simplified decision tree?
– This implies (among other things) establishing a preference order (bias) among alternative trees.
Office size example
Training examples:
1. large ^ cs ^ faculty -> yes
2. large ^ ee ^ faculty -> no
3. large ^ cs ^ student -> yes
4. small ^ cs ^ faculty -> no
5. small ^ cs ^ student -> no
The questions about office size, department and status tell us something about the classification.
Decision tree #1
           size
          /    \
      large    small
       /          \
     dept       no (4,5)
     /  \
   cs    ee
   /       \
 yes (1,3)  no (2)
Decision tree #2
              status
             /      \
       faculty      student
        /              \
      dept             dept
      /  \             /  \
    cs    ee         ee    cs
    /      \         |      \
  size      no       ?      size
  /  \                      /  \
large small            large small
 yes   no               yes   no
Making a tree
How can we build a decision tree (that might be good)?
Objective: an algorithm that builds a decision tree from the root down.
Each node in the decision tree is associated with a set of training examples that are split among its children.
Procedure: Buildtree
If all of the training examples are in the same class,
  then quit,
  else
    1. Choose an attribute to split the examples.
    2. Create a new child node for each attribute value.
    3. Redistribute the examples among the children according to the attribute values.
    4. Apply buildtree to each child node.
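The Buildtree procedure can be sketched as a toy recursion; `choose_attribute` is left pluggable, since how to choose a split is discussed later. The data layout (attribute dicts) is illustrative.

```python
from collections import defaultdict

def buildtree(examples, attributes, choose_attribute):
    """examples: list of (attribute_dict, label). Returns a leaf label or
    a (attribute, {value: subtree}) pair."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                          # all in the same class: quit
        return labels.pop()
    attr = choose_attribute(examples, attributes)  # 1. choose a split
    children = defaultdict(list)                   # 2./3. one child per value
    for attrs, label in examples:
        children[attrs[attr]].append((attrs, label))
    rest = [a for a in attributes if a != attr]
    return (attr, {value: buildtree(kids, rest, choose_attribute)  # 4. recurse
                   for value, kids in children.items()})
```

On the office-size examples, splitting on `size` first and then `dept` reproduces decision tree #1.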
A Bad tree
• To identify an animal (goat, dog, housecat, tiger):
• Is it a dog?
• Is it a housecat?
• Is it a tiger?
• Is it a goat?
• A good tree?
• Is it a cat? (if yes, what kind?)
• Is it a dog?
• Max depth: 2 questions.
Best Property
• Need to select a property / feature / attribute.
• Goal: find a short tree (Occam's razor).
– Base this on MAXIMUM depth.
• Select the most informative feature.
– One that best splits (classifies) the examples.
Entropy
• Measures the (im)purity in a collection S of examples.
• Entropy(S) = − p+ log2(p+) − p− log2(p−)
• p+ is the proportion of positive examples.
• p− is the proportion of negative examples.
• Learning
– Decision trees
• Building them
• Building good ones
– Sub-symbolic learning
• Neural networks
Decision trees: issues
• Constructing a decision tree is easy… really easy!
– Just add examples in turn.
• Difficulty: how can we extract a simplified decision tree?
– This implies (among other things) establishing a preference order (bias) among alternative trees.
Office size example
Training examples:
1. large ^ cs ^ faculty -> yes
2. large ^ ee ^ faculty -> no
3. large ^ cs ^ student -> yes
4. small ^ cs ^ faculty -> no
5. small ^ cs ^ student -> no
The questions about office size, department and status tell us something about the classification.
Decision tree #1
           size
          /    \
      large    small
       /          \
     dept       no (4,5)
     /  \
   cs    ee
   /       \
 yes (1,3)  no (2)
Decision tree #2
              status
             /      \
       faculty      student
        /              \
      dept             dept
      /  \             /  \
    cs    ee         ee    cs
    /      \         |      \
  size      no       ?      size
  /  \                      /  \
large small            large small
 yes   no               yes   no
Making a tree
How can we build a decision tree (that might be good)?
Objective: an algorithm that builds a decision tree from the root down.
Each node in the decision tree is associated with a set of training examples that are split among its children.
Procedure: Buildtree
If all of the training examples are in the same class,
  then quit,
  else
    1. Choose an attribute to split the examples.
    2. Create a new child node for each value of the attribute.
    3. Redistribute the examples among the children according to the attribute values.
    4. Apply buildtree to each child node.
A “Bad” tree
• To identify an animal (goat, dog, housecat, tiger):
• Is it a wolf?
• Is it in the cat family?
• Is it a tiger?

[Figure: a linear chain of yes/no questions — wolf, cat family, tiger — ending at dog]
• Max depth: 3.
• To get to fish or goat, it takes three questions.
• In general, a bad tree for N categories can take N questions.
• Can’t we do better? A good tree?
• Max depth: 2 questions. More generally, log(N) questions.
[Figure: a balanced tree — “Cat family?” at the root, then “Tiger?” and “Dog?” below]
Best Property
• Need to select a property / feature / attribute.
• Goal: find a short tree (Occam's razor).
1. Base this on MAXIMUM depth.
2. Base this on AVERAGE depth:
A) over all leaves
B) over all queries
• Select the most informative feature.
Optimizing the tree
All based on buildtree.
To minimize maximum depth, we want to build a balanced tree.
• Put the training set (TS) into any order.
• For each question Q:
– Construct a K-tuple of 0s and 1s.
• The jth entry in the tuple is
– 1 if the jth instance in the TS has answer YES to Q
– 0 if it has answer NO
Min Max Depth
• Minimize max depth:
• At each query, come as close as possible to cutting the number of samples in the subtree in half.
• This suggests the number of questions per subtree is given by the log2 of the number of sample categories to be subdivided.
Entropy
Measures the (im)purity in a collection S of examples:
Entropy(S) = − [ p+ log2(p+) + p− log2(p−) ]
• p+ is the proportion of positive examples.
• p− is the proportion of negative examples.
Example
• S: 14 examples, 9 positive, 5 negative.
Entropy([9+,5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
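The same computation, as a small sketch:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                       # 0 * log(0) is taken as 0
            e -= p * math.log2(p)
    return e

print(round(entropy(9, 5), 3))   # -> 0.94
```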
Intuition / Extremes
• Entropy in a collection is zero if all examples are in the same class.
• Entropy is 1 if there are equal numbers of positive and negative examples.
Intuition:
If you pick a random example, how many bits do you need to specify what class that example belongs to?
Entropy: definition
• Often referred to as “randomness”.
• How useful is a question:
– How much guessing does knowing the answer save?
• How much “surprise” value is there in a question?
Information Gain
General definition
• Entropy(S) = − ∑_{i=1}^{c} p_i log2(p_i)
• In this lecture we consider some alternative hypothesis spaces based on continuous functions. Consider the following boolean circuit.
[Circuit diagram: x2 passes through a NOT gate; x1 and NOT(x2) feed an AND gate; the AND output and x3 feed an OR gate, producing f(x1,x2,x3)]
The topology is fixed and the logic elements are fixed, so there is a single Boolean function.
Is there a fixed topology that can be used to represent
a family of functions?
Yes! Neural-like networks (aka artificial neural networks) allow us this flexibility.
The idealized neuron
• Artificial neural networks come in several “flavors”.
– Most are based on a simplified model of a neuron.
• A set of (many) inputs.
• One output.
• The output is a function of the sum of the weighted inputs.
Today’s Lecture
• Tangential questions (warm up)
• Administrative Details
• Learning
– Decision trees: cleanup & details
– Sub-symbolic learning
• Neural networks
• Why do I use an Apple Macintosh?– Who cares
– It’s elegant
– I respect Apple’s innovation.
• On learning...
Administrativia
• Please signup AGAIN using the web page.– There was a bug in the CGI/HTML connection.
• Note that the normal late policy will notapply to the project.– You **must** submit the electronic
(executable) on time, or it may not beevaluated!
– It must run on LINUX. Be certain to compileand test it on one of the linux machines in the
ID3
Considered the information implicit in a query about a set of examples.
– This provides the total amount of information implicit in a decision tree.
– Each question along the tree provides some fraction of this total information.
• How much??
• Consider the information gain for an attribute A:
– Gain(A : X) = E(X) − ∑_{v=1}^{i} (|X_v| / |X|) E(X_v)
– The info needed to complete the tree is the weighted sum of that of the subtrees.
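Information gain — the entropy of a set minus the weighted entropy of the subsets after a split — can be sketched as follows, reusing the office-size examples (helper names are illustrative):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    """examples: list of (attribute_dict, label) pairs."""
    parent = entropy([lab for _, lab in examples])
    groups = defaultdict(list)
    for attrs, lab in examples:            # partition on the attribute
        groups[attrs[attr]].append(lab)
    weighted = sum(len(g) / len(examples) * entropy(g)
                   for g in groups.values())
    return parent - weighted

office = [({'size': 'large', 'dept': 'cs'}, 'yes'),
          ({'size': 'large', 'dept': 'ee'}, 'no'),
          ({'size': 'large', 'dept': 'cs'}, 'yes'),
          ({'size': 'small', 'dept': 'cs'}, 'no'),
          ({'size': 'small', 'dept': 'cs'}, 'no')]
# splitting on size is more informative than splitting on dept here
```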
Is there a fixed circuit/network topology that can be used to represent
a family of functions?
Yes! Neural-like networks (a.k.a. artificial neural networks) allow us this flexibility and more; we can represent arbitrary families of continuous functions using a fixed topology.
Neural Networks?
Artificial Neural Nets
a.k.a. Connectionist Nets (connectionist learning)
a.k.a. Sub-symbolic learning
a.k.a. Perceptron learning (a special case)
The idealized neuron
• Artificial neural networks come in several “flavors”.
– Most are based on a simplified model of a neuron.
• A set of (many) inputs.
• One output.
• The output is a function of the sum of the weighted inputs.
Why neural nets?
• Motives:
– We wish to create systems with abilities akin to those of the human mind.
• The mind is usually assumed to be a direct consequence of the structure of the brain.
– Let’s mimic the structure of the brain!
– By using simple computing elements, we obtain a system that might scale up easily.
Not in text
Real and fake neurons
• Signals in neurons are coded by “spike rate”.
• In ANNs, inputs can be either:
– 0 or 1 (binary)
– [0,1]
– [−1,1]
– R (real)
• Each input Ii has an associated weight.
Inductive bias?
• Where’s the inductive bias?
– In the topology and architecture of the network.
– In the learning rules.
– In the input and output representation.
– In the initial weights.
Not in text
Simple neural models
• The oldest ANN model is the McCulloch-Pitts neuron [1943].
– Inputs are +1 or −1, with real-valued weights.
– If the sum of the weighted inputs is > 0, then the neuron “fires” and gives +1 as output.
– Showed you can compute logical functions.
– The relation to learning was proposed (later!) by Donald Hebb [1949].
• Perceptron model [Rosenblatt, 1958].
– Single-layer network with the same kind of neuron.
Not in text
Perceptron nets
[Figure: a perceptron network (input units Ij, weights Wj,i, output units Oi) and a single perceptron (input units Ij, weights Wj, one output unit O)]
Perceptron learning
• Perceptron learning:
– Have a set of training examples (TS) encoded as input values (i.e. in the form of binary vectors).
– Have a set of desired output values associated with these inputs.
• This is supervised learning.
– Problem: how to adjust the weights to make the actual outputs match the training examples.
• NOTE: we do not allow the topology to change! [You should be thinking of a question here.]
Learning algorithm
• Desired output: Ti. Actual output: Oi.
• Weight update formula (weight from unit j to unit i):

Wj,i = Wj,i + k · xj · (Ti − Oi)

where k is the learning rate.
• If the examples can be learned (encoded), then the perceptron learning rule will find the weights.
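The rule can be sketched end-to-end on a small linearly separable problem (AND); the learning rate, epoch count, and bias encoding are illustrative choices, not from the lecture.

```python
def train_perceptron(examples, n_inputs, k=0.1, epochs=50):
    """Train a single step-function perceptron with the rule
    w_j <- w_j + k * x_j * (t - o)."""
    w = [0.0] * (n_inputs + 1)            # last entry is a bias weight
    for _ in range(epochs):
        for x, t in examples:
            xs = list(x) + [1.0]          # append a constant bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            for j in range(len(w)):       # the update rule above
                w[j] += k * xs[j] * (t - o)
    return w

def predict(w, x):
    xs = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND, 2)
# AND is linearly separable, so the rule converges on it
```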
Perceptrons: what can they learn?
• Only linearly separable functions [Minsky & Papert 1969].

[Figure: I1–I2 plots of (a) AND, (b) OR, (c) XOR — a single line separates the 0s from the 1s for AND and OR, but no line can do so for XOR]
More general networks
• Generalize in 3 ways:
– Allow continuous output values [0,1].
– Allow multiple layers.
• This is key to learning a larger class of functions.
– Allow a more complicated function than thresholded summation. [why??]
Generalize the learning rule to accommodate this: let’s see how it works.
The threshold
• The key variant:
– Change the threshold into a differentiable function.
– The sigmoid, known as a “soft non-linearity” (silly): Sigmoid(M) = 1 / (1 + e^(−M)), where

M = ∑ xi wi
Today’s Lecture
• Neural networks
– Training
• Backpropagation of error (backprop)
– Example
– Radial basis functions
Recall: training
For a single input-output layer, we could adjust the weights to get linear classification.
– The perceptron computes a hyperplane over the space defined by the inputs.
• This is known as a linear classifier.
• By stacking layers, we can compute a wider range of functions.
[Figure: a network with inputs, a hidden layer, and an output; the error derivative is computed at the output]
• “Train” the weights to correctly classify a set of examples (TS: the training set).
• Started with the perceptron, which used summing and a step function, and binary inputs and outputs.
• Embellished by allowing continuous activations and a more complex “threshold”.
The Gaussian
• Another continuous, differentiable function that is commonly used is the Gaussian function:

Gaussian(x) = e^(−x² / (2σ²))

• where σ is the width of the Gaussian.
• The Gaussian is a continuous, differentiable version of the step function.
What is learning?
• For a fixed set of weights w1,...,wn,
f(x1,...,xn) = Sigma(x1 w1 + ... + xn wn)
represents a particular scalar function of n variables.
• If we allow the weights to vary, then we can represent a family of scalar functions of n variables:
F(x1,...,xn,w1,...,wn) = Sigma(x1 w1 + ... + xn wn)
Basis functions
• Here is another family of functions. In this case, the family is defined by a linear combination of basis functions g1, g2, ..., gn.
The input x could be scalar or vector valued.
F(x,w1,...,wn) = w1 g1(x) + ... + wn gn(x)
Combining basis functions
We can build a network as follows:

g1(x) --- w1 ---\
g2(x) --- w2 ----\
 ...              ∑ --- f(x)
 ...             /
gn(x) --- wn ---/

E.g. from the basis {1, x, x²} we can build quadratics:
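A minimal sketch of this family-of-functions idea with the basis {1, x, x²}; picking the weights picks one quadratic out of the family (names are illustrative):

```python
def make_f(weights, basis):
    """Linear combination of basis functions: f(x) = sum_i w_i * g_i(x)."""
    return lambda x: sum(w * g(x) for w, g in zip(weights, basis))

basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]

# weights (3, 0, 2) give f(x) = 3 + 2x^2
f = make_f([3.0, 0.0, 2.0], basis)
print(f(2.0))   # -> 11.0
```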
Receptive Field
• It can be generalized to an arbitrary vector space (e.g., Rn).
• Often used to model what are called “localized receptive fields” in biological learning theory.
– Such receptive fields are specially designed to represent the output of a learned function on a small portion of the input space.
– How would you approximate an arbitrary continuous function using a sum of Gaussians?
Backprop
• Consider sigmoid activation functions.
• We can examine the output of the net as a function of the weights.
– How does the output change with changes in the weights?
– Linear analysis: consider the partial derivative of the output with respect to the weight(s).
• We saw this last lecture.
– If we have multiple layers, consider the effect on each layer as a function of the preceding layer.
Backprop observations
• We can do gradient descent in weight space.
• What is the dimensionality of this space?
– Very high: each weight is a free variable.
• There are as many dimensions as weights.
• A “typical” net might have hundreds of weights.
• Can we find the minimum?
– It turns out that for multi-layer networks, the error space (often called the “energy” of the network) is NOT CONVEX. [so?]
– Commonest approach: multiple-restart gradient descent.
Success? Stopping?
• We have a training algorithm (backprop).
• We might like to ask:
– 1. Have we done enough training (yet)?
– 2. How good is our network at solving the problem?
– 3. Should we try again to learn the problem (from the beginning)?
• The first 2 problems have standard answers:
– Can’t just look at energy. Why not?
• Because we want to GENERALIZE across examples. “I understand multiplication: I know 3*6=18, 5*4=20.”
What can we learn?
• For any mapping from input to output units, we can learn it if we have enough hidden units with the right weights!
• In practice, many weights means difficulty.
• The right representation is critical!
• Generalization depends on bias.
– The hidden units form an internal representation of the problem.
Representation
• Much learning can be equated with selecting a good problem representation.
– If we have the right hidden layer, things become easy.
• Consider the problem of face recognition from photographs. Or fingerprints.
– Digitized photos: a big array (256x256 or 512x512) of intensities.
– How do we match one array to another?
Faces (an example)
• What is an important property to measure for faces?
– Eye distance?
– Average intensity? • BAD!
– Nose width?
– Forehead height?
• These measurements form the basis functions for describing faces.
BUT NOT NECESSARILY photographs!!!
Radial basis functions
• Use “blobs” summed together to create an arbitrary function.
– A good kind of blob is a Gaussian: circular, variable width, can be easily generalized to 2D, 3D, ....
Topology changes
• Can we get by with fewer connections?
• When every neuron in one layer is connected to every neuron in the next layer, we call the network fully connected.
• What if we allow signals to flow backwards to a preceding layer? Recurrent networks.
Today’s Lecture
• Neural networks
– Backprop example
• Clustering & classification: case study
– Sound classification: the tapper
• Recurrent nets
• Nettalk
• Glovetalk
– video
• Radial basis functions
• Unsupervised learning
Not in text
Backprop demo
• Consider the problem of learning to recognize handwritten digits.
– Each digit is a small picture.
– Sample the locations in the picture: these are pixels (picture elements).
– A grid of pixels can be used as input to a network.
– Each digit is a training example.
– Use multiple output units, one to classify each possible digit.
Clustering & recognition
• How do we recognize things?
– By learning salient features.
• That’s what the hidden layer is doing.
• Another aspect of this is clustering together data that is associated.
1. What do you observe (what features to extract)?
2. How do you measure similarity in “feature space”?
3. What do you measure with respect to?
Not in text
The tapper: a case study
See overheads…..
Temporal features?
• Question: how do you deal with time-varying inputs?
• We can represent time explicitly or implicitly.
• The tapper represented time explicitly by recording a signal as a function of time.
– Analysis is then static, on a signal f(t) where t is time.
Not in text
Difficulties of explicit time
• It is storage intensive (more units; more signal must be stored and manipulated).
• It is “weight intensive”: more units, more weights.
• You must determine a priori how large the temporal sampling window must be!
So what? More weights -> harder training (less bias).
Topology changes
• Can we get by with fewer connections?
• When every neuron in one layer is connected to every neuron in the next layer, we call the network fully connected.
• What if we allow signals to flow backwards to a preceding layer? Recurrent networks.
Recurrent nets
• Simplest instance:
– Allow signals to flow in various directions between layers.
• Can now trivially compute
f(t) = x(t) + 0.5 f(t−1)
Time is implicit in the topology and weights.
Difficult to avoid blurring in time, however.
Example: nettalk
• Learn to read English: output is audio speech.
[Audio examples: first-grade text, before & after training; dictionary text, getting progressively better]
Today’s Lecture
• Administrative:
– Midterm
– Assignment 4
• Recurrent nets
• Nettalk (continued)
• Glovetalk
– video
• Radial basis functions (RBF’s)
• Unsupervised learning
– Kohonen nets
Example: nettalk
• Learn to read English: output is audio speech.
[Audio examples: first-grade text, before & after training; dictionary text, getting progressively better]
NETtalk details
Reference: Sejnowski & Rosenberg, 1986, Cognitive Science, 14, pp. 179-211.
Mapped English text to phonemes. The phonemes were then converted to sound by a separate engine.
• 3-layer network trained using backprop.
– 203-80-26 MLP (multi-layer perceptron)
• 203 input units
– Encoded 7 consecutive characters.
• A window in time before and after the current sound.
Example: glovetalk
• Input: gestures.
• Output: speech.
• Akin to reading sign language, but HIGHLY simplified.
– Input is encoded as electrical signals from, essentially, a Nintendo Power Glove.
• Various joint angles as a function of time.
Today’s Lecture
• Learning (wrap-up)
– Reinforcement learning
– Unsupervised learning
• Kohonen nets
• Planning
Reinforcement Learning
• So far, we had a well-defined set of training examples.
What if the feedback is not so clear?
E.g., when playing a game, only after many actions do we see the final result: win, loss, or draw.
• In general: an agent exploring an environment.
DAA pp. 231
• Issue: delayed rewards / feedback.
– Exacerbates the credit assignment problem. ASK NOW if you don’t recall what this is!
• Field: reinforcement learning.
• Main success: Tesauro's backgammon player (TD-Gammon).
Illustration
Imagine an agent wandering around in an environment.
• How does it learn the utility values of each state?
• (i.e., what are good / bad states? avoid bad ones...)
Reinforcement learning
• Compare: in the backgammon game, states = boards; the only clear feedback is in the final states (win/loss). (We will assume, for now, that we haven’t cooked up a good heuristic evaluation function.)
We want to know the utility of the other states.
• Intuitively: utility = chance of winning.
Policy
• The key thing we want to learn is a policy.
• Jargon for the “table” that associates actions with states.
– It can be non-deterministic.
• Learning the expected utility of a specific state is a sub-problem.
How: strategies
• Three strategies:
• (a) “Sampling” (Naive updating)
• (b) “Calculation” / “Equation solving” (Adaptive dynamic programming)
• (c) “In between (a) and (b)” (Temporal Difference Learning --- TD learning)
Naive updating
• Comes from adaptive control theory.
• (a) “Sampling” --- the agent makes random runs through the environment and collects statistics on the final payoff for each state (e.g. when at (2,3), how often do you reach +1 vs. −1?).
• The learning algorithm keeps a running average for each state. Provably converges to the true expected values.
Example: Reinforcement
[Figure: (a) a 4x3 grid world with a START state and a +1 terminal state; (b) the same world annotated with transition probabilities (.5, .33) on the arcs]
Stochastic transition network
[Figure: the stochastic transition network for the 4x3 world — (a) the grid with START and the +1 terminal state, (b) transition probabilities (.5, .33, 1.0) on the arcs, (c) learned utility estimates for the states, e.g. 0.0380, 0.0886, 0.1646, 0.2911]
Stochastic transition network
[Figure: the same stochastic transition network, now with (c) the full set of utility estimates: 0.0380, 0.0380, 0.0886, 0.2152, 0.1646, 0.2911, 0.4430, 0.5443, 0.7722]
Issues
• Main drawback: slow convergence. See next figure.
• In a relatively small world, it takes the agent over 1000 sequences to get a reasonably small (< 0.1) root-mean-square error compared with the true expected values.
• Question: Is sampling necessary?
• Can we do something completely different?
Temporal difference learning
• Combine “sampling” with “calculation”; stated differently, it uses a sampling approach to solve the set of equations.
• Consider the transitions observed by a wandering agent.
• Use an observed transition to adjust the utilities of the observed states to bring them closer together.
Temporal difference learning
• U(i): utility of state i
• R(i): reward in state i
• When observing a transition from i to j, bring the U(i) value closer to that of U(j).
• Use the update rule:

U(i) = U(i) + k(R(i) + U(j) − U(i))

• k is the learning rate parameter.
• The rule is called the temporal difference or TD equation.
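The TD update can be sketched on a toy chain of states leading to a terminal reward; the states, rewards, and learning rate here are illustrative.

```python
def td_episode(utility, trajectory, reward, k=0.1):
    """Apply U(i) <- U(i) + k*(R(i) + U(j) - U(i)) along one trajectory."""
    for i, j in zip(trajectory, trajectory[1:]):
        utility[i] += k * (reward.get(i, 0.0) + utility[j] - utility[i])
    return utility

states = ['a', 'b', 'c', 'goal']
U = {s: 0.0 for s in states}
R = {'goal': 1.0}
U['goal'] = 1.0                  # terminal utility fixed at its reward

for _ in range(100):             # repeated walks down the chain
    td_episode(U, states, R)
# utilities back up from the goal: U['c'] > U['b'] > U['a']
```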
Learning by exploration
• Issue raised in the text: What if we need to (somehow) move in state space to acquire data?
• Only then can we learn.
We have 3 coupled problems:
– 1. Where are we (in state space)?
– 2. Where should we go next?
– 3. How do we get there?
Skipping #2 & #3 gives “passive learning” as opposed to “active learning”.
Mobile Robotics Research
• Mobile robotics exemplifies these issues.
• The canonical problems:
Where am I (position estimation)
How do I get there (path planning)
Mapping and exploration
Where am I (position estimation) — pose estimation: vision, sonar, GPS
How do I get there (path planning) — dynamic programming, A*
Mapping and exploration (occupancy and uncertainty) — geometry, occupancy grids, graphs, uncertainty
Mobile Robotics Research

Unsupervised learning
• Given a set of input examples, can we categorize them into meaningful groups?
• This appears to be a common aspect of human intelligence.
– A precursor to recognition, hypothesis formation, etc.
• Several mechanisms can be used.
Unsupervised learning byclustering
Two important ideas:
1. Self-organization: let the learning system re-arrange its topology or coverage of the function space.
– We saw some of this in the tapper.
– We saw some of this with backprop.
2. Competitive learning: aspects of the representation compete to see which gives the best explanation.
Not in text
Kohonen nets
• AKA Kohonen feature maps.
• A graph G that encodes learning.
– Think of the vertices as embedded in a metric space, e.g. the plane.
– Inputs are represented as points in an N-dimensional space.
• Based on the number of degrees of freedom (number of parameters).
• Every node is connected to all inputs.
Objective
• We want the nodes of our graph to distribute themselves to “explain” the data.
– Nodes that are close to each other are associated with related examples.
– Nodes that are distant from one another are related to unrelated examples.
• This leads to an associative memory.
– A storage device that allows us to find items that are related to one another.
Kohonen: intuition
• Two key ideas:
1. Let nodes represent points in “example space”.
– Let them learn a way of covering/explaining the data.
2. Let nodes cover the space “suitably”,
using both competition with each other and cooperation with their neighbors.
Kohonen learning
We are given a set of nodes n_i and a set of examples s_i.
For each training example s_i:
  Compute the distance d between s_i and each node n_j, described by weights w_j:
    d = sqrt( (s_i − w_j)ᵀ (s_i − w_j) )
  Find the node with minimum distance.
  For each node close to the winning node, move its weights toward s_i.
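The update loop can be sketched as follows: find the winning node by Euclidean distance, then pull the winner and its near neighbours toward the example. The threshold K and learning rate c follow the parameters listed on the next slide, but their values here (and the two-cluster data) are illustrative.

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kohonen_step(nodes, example, K=0.5, c=0.2):
    winner = min(range(len(nodes)), key=lambda j: dist(example, nodes[j]))
    for j, w in enumerate(nodes):
        if dist(nodes[winner], w) <= K:      # winner and nearby nodes
            nodes[j] = [wi + c * (xi - wi) for wi, xi in zip(w, example)]
    return winner

random.seed(0)
nodes = [[random.random(), random.random()] for _ in range(4)]
for _ in range(200):
    # examples drawn from two clusters, near (0,0) and (1,1)
    target = random.choice([(0.0, 0.0), (1.0, 1.0)])
    kohonen_step(nodes, [t + random.gauss(0, 0.05) for t in target])
# nodes drift toward the clusters of examples
```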
Kohonen parameters
Need two functions:
• (Weighted) distance from nodes to inputs.
• Distance between nodes.
Two constants:
– A threshold K on inter-node distance.
– A learning rate c.
Kohonen example
• Here’s the demo you’ll use for assignment 4
Kohonen: issues?
• Exploits both the distribution of examples and their frequency to establish the mapping.
• Accomplishes clustering and recognition.
• May serve as a model of biological associative memory.
• Can project a high-dimensional space to a lower-dimensional comparison space: dimensionality reduction.
Planning: the idea [DAA:ch 7]
A systematic approach to achieving goals.
Examples:
– Getting home
– Surviving the term
– Fixing your car
Optimizing vs. Satisficing.
Planning & search
• Many planning problems can be viewed as search:
– 1.
• States are world configurations.
• Operations are actions that can be taken.
• Find a set of intermediate states from the current to the goal world configuration.
– 2.
• States are plans.
• Operations are changes to the sequence of possible actions.
Planning differs from search
• Planning has several attributes that distinguish it from basic search.
• Conditional plans: plans that explicitly indicate observations that can be made, and actions that depend on those observations.
– If the front of MC is locked, go to the Chem building.
– If she takes my rook, then capture the Queen.
• Situated activity: respond to things as they develop.
Planning
Like problem solving (search) but with key differences:
• Can decompose a big problem into potentially independent sub-parts that can be solved separately.
• May have incomplete state descriptions at times.
– E.g. when to tie one's shoes while walking.
– Tie-shoes should be done whenever they are untied, without regard to other aspects of the state description.
• Direct connection between actions and components of the current state description.
Not intext
Planning: general approach
• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.
• Have a goal state specification and an initial state.
• Use a special-purpose planner to search for a solution.
Basic formalism
• Basic logical formalism derived fromSTRIPS.
• State variables determine what actions can or should be taken: in this context they are conditions.
– Door_open(MC)
Today’s Lecture
• Planning
Planning: general approach
• Use a (restrictive) formal language to describe problems and goals.
– Why restrictive? More precision and fewer states to search.
• Have a goal state specification and an initial state.
• Use a special-purpose planner to search for a solution.
Basic formalism
• Basic logical formalism derived fromSTRIPS.
• State variables determine what actions can or should be taken: in this context they are conditions.
– Door_open(MC)
Going forwards
• All state variables are true or false, but some may not be defined at a certain point in our plan.
State Progression. A planner based on this is a progression planner.
Idea: In a state S, we can apply an operator X = (P, A, D) (preconditions, additions, deletions), leading to a new state T.
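The progression step can be sketched in a few lines. This is a minimal illustration, not the course's actual formalism: an operator is assumed to be a triple (P, A, D) of condition sets, and the new state is T = (S − D) ∪ A, matching the slide's additions/deletions description. The condition names are borrowed from the slide's examples.

```python
# Minimal sketch of forward (progression) planning.
# An operator X = (P, A, D): preconditions, add-list, delete-list.

def apply_operator(state, op):
    """Apply op = (P, A, D) to state S; return the new state T = (S - D) | A."""
    P, A, D = op
    if not P <= state:                 # operator applies only if P holds in S
        raise ValueError("preconditions not satisfied")
    return (state - D) | A

# Example using the slide's condition names:
S = {"Shoe_untied()", "Door_open(MC)"}
tie_shoe = ({"Shoe_untied()"}, {"Shoe_tied()"}, {"Shoe_untied()"})
T = apply_operator(S, tie_shoe)
# T now contains Shoe_tied() and still contains Door_open(MC).
```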
Constancy
• Important caveat.
• When we go from one state to another, we assume that the only changes were those that resulted explicitly from the Additions and Deletions.
Aside: FOL with time
• One approach is a variation of first-order logic called situation calculus [McCarthy].
– Events take place at specific times.
– Some predicates are fluents and only apply for certain ranges in time.
– A situation is a temporal interval over which all the predicates remain fixed.
– This material from Ch. 6 will largely be skipped. We will cover, or have covered, 6.6 on in class.
Going backwards
• Remember backwards chaining?
• Start at the goal G.
• Assuming the deletions aren't there for some operator X.
– Why?
• Can chain backwards by adding what would have been deleted and removing what would have been added.
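The backwards (regression) step can be sketched the same way. This is a hedged illustration of standard goal regression, not code from the course: regressing a goal set G through X = (P, A, D) yields (G − A) ∪ P, and the step is only usable if X deletes nothing in G and achieves some part of it.

```python
def regress(goal, op):
    """Regress goal set G through op = (P, A, D).

    Usable only if op achieves part of G and deletes none of it;
    the new subgoal is (G - A) | P.
    """
    P, A, D = op
    if D & goal:               # op would clobber part of the goal
        return None
    if not (A & goal):         # op achieves nothing we need
        return None
    return (goal - A) | P

G = {"Shoe_tied()", "Door_open(MC)"}
tie_shoe = ({"Shoe_untied()"}, {"Shoe_tied()"}, {"Shoe_untied()"})
subgoal = regress(G, tie_shoe)
# The subgoal re-introduces the deleted condition Shoe_untied().
```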
Means/ends analysis
• How can we get from initial to final?
– Assume the states and operators are given.
– What's the right path? How do we measure distance?
• Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.
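The "distance" the slide mentions can be made concrete as a difference count. A minimal sketch, with hypothetical state contents; the real GPS-style systems did considerably more:

```python
def difference(state, goal):
    """Means/ends heuristic: the goal conditions not yet satisfied."""
    return goal - state

state = {"At(office)", "Have(Cash)"}
goal = {"At(Home)", "Have(Cash)", "Have(Video)"}
diff = difference(state, goal)
# Two differences remain; the planner picks operators that reduce them.
```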
STRIPS
• STRIPS is an old planning language.
• STanford Research Institute Problem Solver.
– Less expressive than situation calculus.
– Initial state:
At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)
– Goal state:
At(Home) & Have(Video) & Have(Cooked-Popcorn)
Schemas
• Basic operators assume a complete specification of the state in which they are applied.
• This can be tedious.
– An operator schema is a "generic" operator that has variables in it.
• Related to axiom schemas.
• Related to unification in logic (e.g. Prolog).
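A schema can be pictured as a function from variable bindings to concrete operators. The Go(x, y) schema below is a hypothetical example (it is not from the slides), using the (P, A, D) operator form:

```python
def go_schema(x, y):
    """Operator schema Go(x, y): one concrete (P, A, D) operator
    per binding of the variables x and y."""
    P = {f"At({x})"}      # must be at x to go
    A = {f"At({y})"}      # afterwards, at y
    D = {f"At({x})"}      # ... and no longer at x
    return (P, A, D)

# Binding x=office, y=Home instantiates one concrete operator:
P, A, D = go_schema("office", "Home")
```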
Least Commitment Planning
• When we formulate a plan intuitively, we often think of doing things in a specific sequence even when the sequencing is arbitrary.
– This may not be wise.
• This can lead to re-shuffling actions... which is undesirable.
Partially ordered plan
[diagram: steps A–G linked by ordering constraints]
Terminology
• Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan.
For a plan:
• Sound
– Plan steps obey constraints on sequencing.
– Successful.
• Systematic
– Doesn't "waste" effort.
• Complete
– Generates a plan if one exists.
Links & Conflicts
[diagram: producer → consumer links between plan steps, with a clobberer threatening a link]
A conflict involves a link, & a step that messes it up.
Refinement
Fix conflicts by creating a new plan from an old one.
– Keep old structures (links, producers, consumers, constraints) but add new constraints.
• If there are conflicts, resolve them by adding constraints: move a clobberer before or after the link it's hitting.
– (if you can).
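Moving a clobberer "before or after" the link can be sketched as trying the two ordering constraints (demotion, then promotion) and keeping whichever leaves the plan's orderings acyclic. A hypothetical sketch; the step names and naive cycle check are illustrative only:

```python
# Ordering constraints are pairs (a, b) meaning "step a before step b".

def consistent(orderings):
    """True iff the ordering constraints are acyclic (naive transitive closure)."""
    reach = set(orderings)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in list(reach):
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return all(a != b for (a, b) in reach)

def resolve_conflict(orderings, clobberer, producer, consumer):
    """Protect the link producer -> consumer: try demotion (clobberer
    before producer), then promotion (consumer before clobberer)."""
    for constraint in ((clobberer, producer), (consumer, clobberer)):
        candidate = orderings | {constraint}
        if consistent(candidate):
            return candidate
    return None          # neither works: the planner must backtrack

# Hypothetical plan: P produces a condition consumed by C; K clobbers it.
plan = {("Start", "P"), ("P", "C"), ("C", "Finish"),
        ("Start", "K"), ("K", "Finish")}
fixed = resolve_conflict(plan, "K", "P", "C")
```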
Applications of planning
• Planning for Shakey the robot
– Climb boxes
– Push things
– Move around
• Blocks world
– Moving blocks
– Piling them onto one another
– Clearing the tops of chosen blocks
Configuration Space Planning Issues
Today’s Lecture
• Computational Vision
– Images
– Image formation in brief
– Image processing: filtering
• Linear filters
• Non-linear operations
• Signatures
• Edges
– Image interpretation
• Edge extraction
• Grouping
– Scene recovery
Items in blue will (may) be covered later.
What’s an image?
• The ideal case:
– We have a continuous world (at macroscopic scales).
– We have continuous images of that world.
– Images are 2-dimensional projections of a three-dimensional world.
• In addition, there are other key factors in the world that determine the image:
– Object reflectance properties (white or gray shirt?)
– Light source position
» Alters intensities (day/night, shading & chiaroscuro)
» Alters shadows
Digital images: synopsis
• 2D continuous image a(x,y) divided into N rows and M columns.
– The intersection of a row and a column is termed a pixel.
– The value assigned to integer coordinates [m,n] with m=0,1,2,...,M-1 and n=0,1,2,...,N-1 is a[m,n].
• In fact, in most cases a(x,y) is a function of many variables including depth z, color λ, and time t.
• I.e. we really have a(x,y,z,λ,t).
• Most vision deals with the case of 2D, monochromatic (“black and white”), static images.
Image formation
Key processes:
1. Light comes from source.
2. Hits objects in scene.
3. Is reflected towards
A) other objects [return to step 2]
B) the camera lens.
4. Passes through the lens, being focussed on the imaging surface.
5. Interacts with transducers on the surface to produce electrical signals.
In the eye, the lens is that of the eye, the imaging surface is the retina, and the transducers are rods and cones.
The Human Visual System
• Digression on the biology of the early human visual system.
(On blackboard: no notes available.)
Vision, image processing, and AI
• Image processing:
– Image -> image
– Often characterized as data reduction, filtering
– Signal-level transformation
• Traditional AI:
– predicates -> predicates
– What’s hidden in the data: inference, “data mining”
– Symbol-level transformation
• (“Non-traditional” AI: neural nets, uncertainty, etc.)
Vision problems
A) Images -> scenes
Known as “shape-from” or “shape from X”. Examples:
• Recovery of scene structure from a sequence of pictures: shape-from-motion
• Recovery of scene structure from shading: shape-from-shading
• Recovery of scene structure from how shadows are cast: shape-from-shadows (actually called “shape-from-darkness”)
• Recovery of shape from changes in texture: shape-from-texture
B) Images -> predicates
Several variations, generally less mature.
Object recognition, functional interpretation, support relations.
What we want/need is (B), but (A) seems “easier” or is a natural prerequisite.
What is vision?
In general, vision involves the recovery of all those things that determine the world:
– Material properties, shadows, etc.
as well as the functional and categorical relationships between or pertaining to objects!
Image Processing
• Better understood than vision.
• Produce new images or arrays of data without worrying (much) about “interpretations.”
[diagram: input image → image processing “operator” (often a form of filtering) → output image]
Filtering
• Given an input signal f(t) we can compute a transformed description g(t).
– Key requirement: the dimensionality of the domain and range is the same.
• This transformed signal is derived from f(t) by the application of either linear (multiplication/addition) or non-linear operators. E.g.
– g(t) = f(t) + f(t-1) + f(t+1) [linear]
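The linear example above can be sketched directly. A minimal illustration, assuming the signal is a finite list and is treated as zero outside its support:

```python
def smooth(f):
    """Linear filter from the slide: g(t) = f(t-1) + f(t) + f(t+1),
    with the signal assumed zero outside its support."""
    n = len(f)
    return [(f[t - 1] if t > 0 else 0) + f[t] + (f[t + 1] if t < n - 1 else 0)
            for t in range(n)]

g = smooth([0, 0, 3, 0, 0])
# The single spike spreads to its neighbours: [0, 3, 3, 3, 0]
```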
Filtering in 2D
• Note that filtering applies in essentially the same way in
– 1-D signals,
– 2-D images,
– or even higher dimensional spaces.
Filters: more flavors
• An additional key characterization is the degree of locality of the filter.
– Does it look
• At a single point?
• At a region... and is the region symmetric?
• At everything?
Convolution
• For vision and image processing, the most important class of filtering operation is convolution.
– This is almost the same as correlation.
– Convolution for 2 signals:
– c(t) = a(t) * b(t)
– c(x,y) = a(x,y) * b(x,y)
For discrete signals: c[n] = Σ_k a[k] b[n−k]
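The discrete case can be sketched in a few lines. A minimal 1-D implementation of c[n] = Σ_k a[k]·b[n−k] (full output length, no library dependencies):

```python
def convolve(a, b):
    """Discrete 1-D convolution: c[n] = sum_k a[k] * b[n - k]."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

# Convolving with an impulse [1] returns the signal unchanged:
identity = convolve([2, 5, 3], [1])
# A two-tap averaging kernel blurs the signal:
blurred = convolve([2, 4, 6], [0.5, 0.5])
```
Commutativity is easy to check: `convolve(a, b) == convolve(b, a)` for any two signals.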
Convolution: specifics
• Typically we convolve a signal with a kernel.
• Note that convolution is distributive, associative, and commutative.
[figure: convolution with an impulse]
Sample Image
[figure: sample image]
E.g. Blurring
• Convolve the input with a kernel that combines information from a range of spatial locations.
• What is the precise shape of the kernel?
• Why might this be useful?
[figure: blurred picture]
Edges
• Boundaries are thought to be critical to image interpretation.
– Why do cartoons look as reasonable as they do?
– Idea: detect the boundaries of objects in images.
The Sobel Edge operator
• Filter for horizontal and vertical edges, combine.
Edges
Edge linking
Image Noise
• Images usually are corrupted by severaltypes of “noise”
• Digital noise
• Shadows
• Shiny spots (specularities)
• Camera irregularities
• Bad assumptions about what’s being computed (“model noise”).
[figures: examples of Gaussian noise, dots, and lines]
E.g.: sample noise
Edge detection: trickier than it seems
Example: Median Filtering
• A classical non-linear filter.
• Over a window, compute the median valueof the signal.
• This is the value of the filter.
• This can be considered a non-linear form of averaging.
– Note it never produces values that weren’t in the original signal.
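The median filter is simple enough to sketch directly. A minimal 1-D version using the standard library, with the window clipped at the signal's edges:

```python
import statistics

def median_filter(signal, radius=1):
    """Non-linear filter: replace each sample by the median of the
    window of the given radius around it (clipped at the edges)."""
    n = len(signal)
    return [statistics.median(signal[max(0, i - radius): i + radius + 1])
            for i in range(n)]

# An isolated noise spike is removed entirely, unlike with linear averaging:
cleaned = median_filter([5, 5, 99, 5, 5], radius=1)
```
Note the output values all already occurred in the input, illustrating the slide's point that the median never invents new values.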
[figures: median filter output at radius 1, 2, 4, 8, and 11]
Today’s Lecture
• Computational Vision
– Images
– Image formation in brief (+reading)
– Image processing: filtering
• Linear filters
• Non-linear operations
• Signatures
• Edges
– Image interpretation
• Edge extraction
• Grouping
– Scene recovery
Color code: • Done • Today • Next class (or not at all)
Scene “Reconstruction”
• Solve the inverse problem: find the scene that produced the image.
• Things to account for:
– The camera
– The geometry of the scene
– The interaction of light with objects in the scene.
Recovery
• Typically simplified models of illumination and reflectance are employed.
– One light source.
– All objects have the same matte reflectance.
• Sometimes: don’t worry about occlusion.
• Sometimes: don’t worry about shading.
• Recovery of geometry is known as scene reconstruction.
Edge extraction
• Edges: places where the intensity changes; hence there is a large derivative.
• Each edge has an amplitude and orientation that can be expressed as a combination of orthogonal components in the x and y directions.
Sobel detector specifics
• Sobel edge detector
– Convolve the image with a pair of operators, S and S':
S*I and S'*I
– The edge map is the Pythagorean sum of the two convolutions:
E = sqrt((S*I)^2 + (S'*I)^2)
The Sobel kernel, as an example, can be broken down into two parts:
1. A smoothing operation (discussed last class): reduces the effect of noise and sets the scale (i.e. size).
2. A derivative operation (d/dx).
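The Sobel computation can be sketched end-to-end. A pure-Python illustration using the standard 3x3 Sobel kernels; cross-correlation is used here, which the slides note is essentially the same as convolution for this purpose:

```python
def filter2d(image, kernel):
    """3x3 cross-correlation over the 'valid' region of a
    list-of-lists image."""
    h, w = len(image), len(image[0])
    return [[sum(kernel[j][i] * image[y + j][x + i]
                 for j in range(3) for i in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]

Sx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
Sy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def sobel(image):
    """Edge magnitude E = sqrt((Sx*I)^2 + (Sy*I)^2) at each pixel."""
    gx, gy = filter2d(image, Sx), filter2d(image, Sy)
    return [[(a * a + b * b) ** 0.5 for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]

# A vertical step edge (dark left half, bright right half):
img = [[0, 0, 9, 9]] * 4
E = sobel(img)
# Every valid pixel straddles the step, so every response is strong.
```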
Other edge operators
• Older edge detectors neglected the smoothing step:
– Larry Roberts developed one of the first (the 2x2 Roberts operator).
– The Prewitt operator was based on assuming we could fit a smooth surface to the data and then differentiate.
– Hueckel operator: use least-squares line fitting.
• More recent work involves smoothing with a two-dimensional Gaussian:
• The Gaussian provides localization in space, as well as frequency.
• Rather than explicitly compute

∂/∂t (G * I),    where G(x,y) = 1/(2πσ²) · e^(−(x²+y²)/(2σ²)),

use the fact that

∂/∂t (G * I) = (∂G/∂t) * I

and (∂G/∂t) can be precomputed.
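That precomputation is easy to sketch. A hypothetical helper that builds a sampled 1-D derivative-of-Gaussian kernel dG/dx (σ and the kernel radius are free parameters), which can then be convolved with the image once:

```python
import math

def gaussian_deriv_kernel(sigma, radius):
    """Sampled 1-D derivative-of-Gaussian kernel dG/dx.
    Since d/dx (G * I) = (dG/dx) * I, this kernel can be
    precomputed once and reused for every image."""
    return [-x / (sigma ** 2)
            * math.exp(-x * x / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma)
            for x in range(-radius, radius + 1)]

k = gaussian_deriv_kernel(sigma=1.0, radius=3)
# The kernel is odd-symmetric and zero at the centre, as a
# derivative operator should be.
```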
Edges: modern methods
• Instead of looking for peaks in the derivative, look for zero-crossings in the second derivative.
• A recent operator developed by Canny uses multiple scales (sizes) to improve edge detection, based on optimizing the notion of what an edge is. Wanted:
– Detection: detect an edge iff it’s there
– Uniqueness: detect each edge just once
– Localization: detect it at the right place
Edges
• An UNREALISTICALLY simple example:– Edges from the Canny Operator.
Input image
• Consider: extract the edge elements and group them into contours.
Sobel detector output
Canny operator output
• Note the effect of non-maximum suppression.
Human Edge Detection
• Overheads
Today’s Lecture
• Computational Vision
– Biological vision with emphasis on grouping
– Scene recovery
– Recognition
Mostly not on computer
Shape from Shading
• [on blackboard]
• Intensity i = f(e, g, n)
– Reflected intensity depends on viewing position and light source position (assumed known) AND the surface normal.
– Given the knowns we can estimate the surface normal, although there is usually a “circular” ambiguity that is resolved by assuming something about surface structure.
Review