8/3/2019 Stanford Intro AI Class Notes
Videos: http://www.wonderwhy-er.com/ai-class/
Unit 1 Theory: Welcome to AI
Purposes
- Teach the basics of AI
- Excite you
Structure
- Videos, quizzes, answer videos
- Homework (assignments), exams
AI program = Intelligent Agent
Agent function = it maps any given percept sequence to action = abstract math description
Agent program = a concrete implementation of the agent function
Rational agent: one that does the right thing, i.e. for each possible percept sequence, a rational agent should select an action that is expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.
Terminology
Environment types:
- Fully vs. Partially Observable
- Deterministic (e.g. chess) vs. Stochastic (e.g. dice games)
- Discrete vs. Continuous
- Benign vs. Adversarial
AI as uncertainty management
Reasons for uncertainty
- Sensor limits
- Adversaries
- Stochastic environment
- Laziness
- Ignorance
Unit 2 Problem Solving
Definition of a problem
- Initial state
- A function ACTIONS(s) → {a1, a2, a3, …}
  o s is a state
  o a1, a2, a3 are the possible actions from state s
- A function RESULT(s, a) → s'
  o s is a state
  o a is an action applied to that state
  o s' is the new state
- A function GOALTEST(s) → T|F
  o s is a state
  o GOALTEST tests whether s is the destination (final state)
- A function PATHCOST(s → s' → … → s_final) → n
  o n is the cost to move from a state s, through intermediate states, to the final state
  o It is mostly additive, so it is the sum of many STEPCOST(s, a, s') values
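The components above can be sketched as a tiny Python interface. The 3-city road map below is a made-up illustration (not from the lecture), just to make the four functions concrete:

```python
# Sketch of the problem definition: ACTIONS, RESULT, GOALTEST, STEPCOST/PATHCOST.
# The 3-city map is a hypothetical example, not the lecture's.

ROADS = {  # city -> {neighbour: step cost}
    "A": {"B": 1, "C": 4},
    "B": {"C": 1},
    "C": {},
}

def actions(s):
    """ACTIONS(s): possible actions from state s (here: which city to drive to)."""
    return list(ROADS[s])

def result(s, a):
    """RESULT(s, a) -> s'. Driving toward city a puts us in city a."""
    return a

def goal_test(s):
    """GOALTEST(s) -> T|F."""
    return s == "C"

def step_cost(s, a, s2):
    """STEPCOST(s, a, s')."""
    return ROADS[s][a]

def path_cost(path):
    """PATHCOST as the sum of step costs along a path [s0, s1, ..., sn]."""
    return sum(step_cost(s, s2, s2) for s, s2 in zip(path, path[1:]))

print(path_cost(["A", "B", "C"]))  # 2
```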
Route finding problem
3 regions
- Explored
- Frontier
- Unexplored
Base algorithm
1. Take a state from the Frontier (by some criterion)
2. GoalTest it; if YES, terminate here
3. Expand it (the new states are added to the Frontier)
4. Remove it from the Frontier (move it to the Explored set)
(Generic) Tree-Search
Tree-Search applied to path-finder problem
Graph-Search (like Tree-Search, but it remembers what has already been explored, so when the frontier is expanded it will not re-add already-explored states)
The key point of the base algorithm is the criterion used in step 1. It leads to a few concrete algorithms:
- Breadth-First (aka shortest-first): expand the shallowest (shortest) path on the frontier first
- Uniform-Cost (aka cheapest-first): expand the cheapest path on the frontier first
- Depth-First: expand the longest (deepest) path on the frontier first
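The base algorithm with a pluggable step-1 criterion can be sketched as follows; the toy graph is hypothetical, and a FIFO frontier gives Breadth-First while a LIFO frontier gives Depth-First:

```python
from collections import deque

# Graph-Search sketch: like Tree-Search, but an Explored set prevents revisiting
# states. The frontier discipline (step 1) selects the concrete algorithm:
# FIFO pop -> Breadth-First, LIFO pop -> Depth-First.

GRAPH = {"S": ["A", "B"], "A": ["G"], "B": ["G"], "G": []}  # made-up example

def graph_search(start, goal, lifo=False):
    frontier = deque([[start]])  # the frontier holds whole paths
    explored = set()
    while frontier:
        path = frontier.pop() if lifo else frontier.popleft()  # step 1: pick
        s = path[-1]
        if s == goal:                        # step 2: goal test
            return path
        if s in explored:
            continue
        explored.add(s)                      # step 4: move to Explored
        for s2 in GRAPH[s]:                  # step 3: expand
            if s2 not in explored:
                frontier.append(path + [s2])
    return None

print(graph_search("S", "G"))             # ['S', 'A', 'G']  (breadth-first)
print(graph_search("S", "G", lifo=True))  # ['S', 'B', 'G']  (depth-first)
```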
A* algorithm
It is proven that the algorithm improves if we know some extra information, e.g. the distance from the current state (the one about to be expanded) to the goal. This is the A* algorithm.
h is called the heuristic function. A* will always find the lowest-cost path only if h(s) ≤ true cost; in other words, h never over-estimates (h is said to be optimistic, or admissible).
A* works well if we can come up with a good heuristic, but that needs our intelligence. h, however, can be generated by relaxing conditions, as below.
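A rough sketch of A* as best-first search ordered by f = g + h; the weighted graph and heuristic values are made up for illustration, with h chosen admissible (it never over-estimates):

```python
import heapq

# A* sketch: pick from the frontier the path minimising f = g + h, where
# g = path cost so far and h = heuristic estimate of the remaining cost.
# h must never over-estimate the true cost (admissible / optimistic).

GRAPH = {"S": {"A": 1, "B": 4}, "A": {"B": 2, "G": 5}, "B": {"G": 1}, "G": {}}
H = {"S": 4, "A": 3, "B": 1, "G": 0}  # made-up admissible heuristic

def astar(start, goal):
    frontier = [(H[start], 0, [start])]  # (f, g, path)
    explored = set()
    while frontier:
        f, g, path = heapq.heappop(frontier)  # cheapest f first
        s = path[-1]
        if s == goal:
            return g, path
        if s in explored:
            continue
        explored.add(s)
        for s2, cost in GRAPH[s].items():
            if s2 not in explored:
                heapq.heappush(frontier, (g + cost + H[s2], g + cost, path + [s2]))
    return None

print(astar("S", "G"))  # (4, ['S', 'A', 'B', 'G'])
```

Here S→A→B→G (cost 4) beats the greedy-looking S→B→G (cost 5), which is exactly what the f = g + h ordering buys over plain greedy search.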
When it works
Problem solving technique like above works when the problem is
- Fully observable
- Known
- Discrete
- Deterministic
- Static
Unit 3 Probability in AI
Key things to remember
- Joint probability (see definition in the table below)
- Conditional probability (see definition in the table below)
- Total probability formula (see in the table below)
Event Probability
- A: P(A)
- not A (written ¬A): P(¬A) = 1 − P(A)
  This applies to conditional probability as well: 1 = P(A|B) + P(¬A|B)
  But be careful when negating on the condition side:
  o Wrong: 1 = P(A|B) + P(A|¬B)
  o Wrong: P(A) = P(A|B) + P(A|¬B)
  o Right: P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B) (total probability; see also the next row)
- Total probability: P(A) = Σ_b P(A|B=b)·P(B=b), where b spans the whole probability space, i.e. Σ_b P(B=b) = 1.
  In particular, if B has only 2 values, 1 or 0, then P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B).
  This formula applies to conditional probability as well:
  P(A|M) = P(A|B,M)·P(B|M) + P(A|¬B,M)·P(¬B|M)
- A or B: P(A∪B) = P(A) + P(B) − P(A∩B)
- A and B (joint probability): P(A,B) = P(A∩B)
- A given B (conditional probability): P(A|B) = P(A∩B) / P(B)
  Proof:
  o B and A are sets in some space
  o Given B, the space is now limited to be only B
  o The probability of A given B is the part of A∩B measured within this new space B, so P(A|B) = P(A∩B)/P(B)
  Q.E.D.
Bayes Rule
From the formula above we have
P(A∩B) = P(A|B)·P(B) = P(B|A)·P(A)
So we have
P(A|B) = P(B|A)·P(A) / P(B)
To calculate with Bayes rule we usually just calculate the numerators of P(A|B) and P(¬A|B) and then normalise (take the proportion). See the picture below.
For multiple variables in the condition: P(A|B,C) = P(B|A,C)·P(A|C) / P(B|C)

Variables and probability distributions
Example of variables
1. We have seen the example of the uncertain event a = "Spurs win the FA Cup in the year 2011".
   a. We can think of this event as just one state of the variable A, which represents "FA Cup winners in 2011".
   b. In this case A has many states, one for each team entering the FA Cup.
   c. We write this as A = {a1, a2, ..., an} where a1 = "Spurs", a2 = "Chelsea", a3 = "West Ham", etc.
   d. Since in this case the set A is finite, we say that A is a finite discrete variable.
2. As another example, suppose we are interested in the number of critical faults in our control system.
   a. The uncertain event is A = "Number of critical faults". Again it is best to think of A as a variable which can take on any of the discrete values 0, 1, 2, 3, ... thus A = {0, 1, 2, 3, ...}.
   b. In this case we say that A is an infinite discrete variable.
   c. Let us define a1 as the event "A=0", and a2 as the event "A=1".
   d. Clearly the events a1 and a2 are mutually exclusive and so P(a1 or a2) = P(a1) + P(a2). However, we cannot say that P(a1 or a2) = 1, because a1 and a2 are not exhaustive. That is, they do not form a complete partition of A.
   e. However, if we define a3 as the event "A>1", then a1, a2, and a3 are exhaustive and mutually exclusive, and in this case P(a1) + P(a2) + P(a3) = 1.
   f. In general, if A is a variable with states a1, a2, ..., an:
   g. The probability distribution of A, written P(A), is simply the set of values {P(a1), P(a2), ..., P(an)}.
Key thing to remember, summary from reddit
1) P(A): Concept of Probability
Like intelligence, probability is about trying to predict something about the future, but probability is a prediction with only one number.
That number is in the range from 0 to 1, where 0 means the given event will occur 0% of the time, and 1 means it would occur 100% of the time.
Example 1: P(A) = 0.3 means event A will occur 30% of the time (if you get 100 samples, you predict that 30 will be "of the A type").
Example 2: Dice. P("get a 5") = 1/6 ≈ 0.1667
How to calculate that result:
- The die has a total of 6 possible outcomes. Sample Space = {1, 2, 3, 4, 5, 6}
- Every outcome is as probable as the others (so-called "equiprobable"): you don't think the die is loaded, so P("1") = P("2") = ... = P("5") = P("6")
- Every outcome is disjoint from the others (if the result is 1, it cannot be 2, nor 3, ..., nor 6)
With those conditions, you can imagine:
P("get 1 or 2 or 3 or 4 or 5 or 6") = 1
(by the definition of probability that means: 100% of the time you will get 1 or 2 or 3 ... or 6)
Since they are disjoint: P(Union(Ai)) = sum(P(Ai))
Source: http://www.eecs.qmul.ac.uk/~norman/BBNs/Bayesian_approach_to_probability.htm
P("1") + P("2") + P("3") + P("4") + P("5") + P("6") = 1
Since equiprobable: 6·P("1") = 1
Then P("1") = 1/6
Then P("5") = P("1") = 1/6
In general, if you have a discrete set as Sample Space with those conditions:
- Equiprobable elements
- Disjoint events
then probability can be calculated as:
P(A) = "number of A outcomes" / "total number of outcomes"
Example: P("get an even number on the die") = "number of even outcomes" / "total die outcomes" = 3/6 = 0.5
Probability = number of favourable outcomes / total number of outcomes
2) P(A,B): Joint Probability
P(A,B) is the same as P("A intersection B") = P(A∩B)
P(A,B) = P("we get something that is A and ALSO B") = "# of outcomes which are A and B" / "total"
Example:
L4 = "get a number lower than 4 on the die"
O = "get an odd number on the die"
P(O,L4) = "number of odd outcomes which are lower than 4" / total = number of {"1","3"} / 6 = 2/6 = 1/3
3) P(A|B): Conditional Probability
P(A|B) (reads "probability of A given B"): we consider only the B outcomes, and ask what "percentage" OF THEM are also A.
Don't confuse it with the joint. Here you are talking about those outcomes which are B and ALSO A, but relative to the B outcomes (as if B were a new Sample Space of another experiment where you only get B samples).
P(A|B) = "# of outcomes which are A and B" / "# of B outcomes"
From the concept, we can derive the conditional probability formula: if P(A,B) and P(A|B) are different, how can we relate them?
P(A,B) = "# of outcomes which are A and B" / total
P(A|B) = "# of outcomes which are A and B" / "# of B outcomes"
If we divide: P(A,B) / P(A|B) = "# of B outcomes" / total
But we know that concept! "# of B outcomes" / total is what we call P(B)!
So:
**P(A,B) = P(A|B) * P(B)**
4) Bayes Theorem / Bayes Rule / Bayes Law
What is the relation between P(B|A) and P(A|B)?
Well, P(A,B) = P(B,A)
So, from the conditional probability formula: P(A|B) * P(B) = P(B|A) * P(A)
Dividing by P(B) we get the Bayes formula:
P(A|B) = P(B|A) * P(A) / P(B)
Translated into intuitive counts:
"# of A and B" / "# of B" = ("# of A and B" / "# of A") * ("# of A" / total) / ("# of B" / total)
Another version (swapping A and B):
P(B|A) = P(A|B) * P(B) / P(A)
5) A⊥B: Concept of independent events
In conditional probability we talked about the "probability of A given B", but what if the "given B" doesn't matter for A? That is, what if the "probability of A given B" is the same as the "probability of A" alone? Seen another way: taking only samples which have the B property has the same effect (for the calculation of P(A)) as taking samples which do not have the B property, and the same as taking any sample (whether B or not B).
That is, the B property "doesn't affect" the A property.
In this case, A and B are said to be independent (written A⊥B). It is the same as saying:
A⊥B ⟺ P(A|B) = P(A)
Is it commutative? Is "A and B are independent" the same as "B and A are independent"?
P(B|A) = P(A|B) * P(B) / P(A)
Given "A and B are independent":
P(B|A) = P(A|B) * P(B) / P(A) = P(A) * P(B) / P(A) = P(B)
Then "B and A are independent".
By symmetry, if "B and A are independent" then "A and B are independent".
And so, YES, it is the same (as the language "they are independent" would suggest, since you are not stating any order).
From the combination of independence and the concept of conditional probability a new concept arises:
Conditional Independence
A⊥B | C ⟺ P(A|B,C) = P(A|C)
This is very important for Bayes networks: it is related to the concept of D-separation, and for solving exercises it is important to know when you can apply this formula.
(If there is D-separation between A and B given C, then you are sure A and B are conditionally independent given C, and then you can use that formula.)
6) Total Probability
Imagine you have 2 disjoint subsets (that is, their intersection is the empty set; they have no elements in common). What is the number of elements of the union of both? In that case, the number is the sum.
In general, #(A ∪ B) = #(A) + #(B) − #(A∩B)
One simple example of disjoint sets: A and "not A". If an element (or sample) belongs to A it cannot belong to "not A".
"not A" is written as ¬A.
P(A ∪ ¬A) = P(A) + P(¬A) − P(A ∩ ¬A) = P(A) + P(¬A) = 1
Another example of disjoint sets: "A∩B" and "A∩¬B"
P("A∩B" ∪ "A∩¬B") = P(A∩B) + P(A∩¬B)
We know "A∩B" ∪ "A∩¬B" ... it is simply A!
And we know P("A∩B"): it is what we called the joint.
So: P(A) = P(A,B) + P(A,¬B)
And we can express that in terms of conditionals:
P(A,B) = P(A|B) * P(B)
So:
P(A) = P(A|B) * P(B) + P(A|¬B) * P(¬B)
This is called the Total Probability formula.
Which can be seen in counts as:
"# of A" / total = ("# of A and B" / "# of B") * ("# of B" / total) + ("# of A and not B" / "# of not B") * ("# of not B" / total)
which is in fact like saying:
"# of A" / total = ("# of A and B" + "# of A and not B") / total
Typical problem #1
Consider the Bayes network on the left.
- C is the cancer event, with prior probability P(C) = 0.01
- T is a test for cancer
  o Probability of a positive result given C: P(+|C) = 0.9
  o Probability of a positive result given ¬C: P(+|¬C) = 0.2
- T1 and T2 are 2 attempts of test T
- Calculate the probability of cancer if test T1 is negative and T2 is positive: P(C|−,+)?
Solution
Use Bayes rule to express P(C|−,+) and P(¬C|−,+):
- P(C|−,+) = P(−,+|C) · P(C) / P(−,+)
- P(¬C|−,+) = P(−,+|¬C) · P(¬C) / P(−,+)
P(−,+), the joint probability that T1 is negative and T2 is positive, is not easy to find. We however know that 1 = P(C|−,+) + P(¬C|−,+), so we just calculate the numerators and take the proportion (T1 and T2 are independent given C):
- P(−,+|C) · P(C) = P(−|C) · P(+|C) · P(C) = (1−0.9) · 0.9 · 0.01 = 0.0009
- P(−,+|¬C) · P(¬C) = P(−|¬C) · P(+|¬C) · P(¬C) = (1−0.2) · 0.2 · (1−0.01) = 0.1584
Taking the proportion, we find the answer: P(C|−,+) = 0.0009 / (0.0009 + 0.1584) ≈ 0.0056 = 0.56%
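The numerator-then-normalise computation can be checked in a few lines of Python:

```python
# Verify Typical problem #1: P(C | T1=-, T2=+) via un-normalised numerators.
p_c = 0.01       # prior P(C)
p_pos_c = 0.9    # P(+|C)
p_pos_nc = 0.2   # P(+|notC)

num_c  = (1 - p_pos_c)  * p_pos_c  * p_c        # P(-|C) P(+|C) P(C)
num_nc = (1 - p_pos_nc) * p_pos_nc * (1 - p_c)  # P(-|notC) P(+|notC) P(notC)

posterior = num_c / (num_c + num_nc)            # normalise
print(round(posterior, 4))  # 0.0056
```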
Typical problem #2
The conditions are as in typical problem #1. Find P(T1|T2): the probability that test 1 is positive given that test 2 is positive.
Solution
To solve this we need conditional independence (read the next chapter first). Steps:
- Apply the total probability formula: P(T1|T2) = P(T1|T2,C)·P(C|T2) + P(T1|T2,¬C)·P(¬C|T2)
- Because T1 and T2 are independent given C:
  o P(T1|T2,C) = P(T1|C)
  o P(T1|T2,¬C) = P(T1|¬C)
- So P(T1|T2) = P(T1|C)·P(C|T2) + P(T1|¬C)·P(¬C|T2) = P(+|C)·P(C|+) + P(+|¬C)·P(¬C|+) (simplified by writing T1 and T2 as +)
- Apply Bayes rule:
  o P(C|+) = P(+|C)·P(C)/P(+)
  o P(¬C|+) = P(+|¬C)·P(¬C)/P(+)
(Bayes network: C is the hidden cause; T1 and T2 are the two tests.)
- So P(T1|T2) = P(+|C)·P(+|C)·P(C)/P(+) + P(+|¬C)·P(+|¬C)·P(¬C)/P(+) = (0.9·0.9·0.01 + 0.2·0.2·0.99) / P(+) = 0.0477 / P(+)
- Apply the total probability formula to calculate P(+) = P(+|C)·P(C) + P(+|¬C)·P(¬C) = 0.207. We finally find P(T1|T2) = 0.0477/0.207 ≈ 0.2304.
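Checking the same derivation numerically (0.9²·0.01 + 0.2²·0.99 = 0.0477):

```python
# Verify Typical problem #2: P(T1=+ | T2=+) using conditional independence.
p_c, p_pos_c, p_pos_nc = 0.01, 0.9, 0.2

p_pos = p_pos_c * p_c + p_pos_nc * (1 - p_c)                  # total probability
p_t1_t2 = (p_pos_c**2 * p_c + p_pos_nc**2 * (1 - p_c)) / p_pos

print(round(p_pos, 3), round(p_t1_t2, 4))  # 0.207 0.2304
```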
Typical problem #3 Confounding Cause
We have seen one type of Bayes network in typical problems #1 and #2: one single hidden cause producing 2 different measurements.
Confounding Cause is another type of Bayes network, where 2 hidden causes get confounded within a single observed variable.
Explaining Away, or the problem of Happiness when Sunny and a Raise of salary
It is a typical confounding-cause Bayes network.
(Diagrams: Cause → Measure1, Measure2 for the common-cause network; Cause1, Cause2 → Measure for the confounding-cause network.)
a) Find P(R|S)
R and S are independent if H is not given, so P(R|S) = P(R) = 0.01
b) Explaining Away question 1: find P(R|H,S)
Use Bayes rule (multiple-variables-in-condition case): P(R|H,S) = P(H|R,S) · P(R|S) / P(H|S)
- P(H|R,S) = 1
- P(R|S) = 0.01, as calculated in (a) above
- Use total probability: P(H|S) = P(H|S,R)·P(R) + P(H|S,¬R)·P(¬R) = 1·0.01 + 0.7·0.99 = 0.703
So P(R|H,S) = 1 · 0.01 / 0.703 ≈ 0.0142
c) Explaining Away question 2: find P(R|H)
Use Bayes rule: P(R|H) = P(H|R) · P(R) / P(H)
- Use total probability: P(H|R) = P(H|R,S)·P(S) + P(H|R,¬S)·P(¬S) = 1·0.7 + 0.9·0.3 = 0.97
- P(R) = 0.01
- Use total probability across all the cases:
  P(H) = P(H|S,R)·P(S,R) + P(H|¬S,R)·P(¬S,R) + P(H|S,¬R)·P(S,¬R) + P(H|¬S,¬R)·P(¬S,¬R)
  P(H) = 0.5245 (remember R and S are independent, so P(R,S) = P(R)·P(S), and similarly for the others)
So P(R|H) = 0.97 · 0.01 / 0.5245 ≈ 0.0185
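The numbers can be verified numerically. The notes only use three of the four entries of P(H|S,R); the value P(H|¬S,¬R) = 0.1 below is an assumption taken from the usual version of this example (it is the value that reproduces P(H) = 0.5245):

```python
# Verify the Explaining Away numbers. P(H | not S, not R) = 0.1 is assumed
# (not stated in the notes); it reproduces the P(H) = 0.5245 used above.
p_s, p_r = 0.7, 0.01
p_h = {(1, 1): 1.0, (0, 1): 0.9, (1, 0): 0.7, (0, 0): 0.1}  # P(H | S=s, R=r)

# b) P(R | H, S)
p_h_given_s = p_h[1, 1] * p_r + p_h[1, 0] * (1 - p_r)
print(round(p_h[1, 1] * p_r / p_h_given_s, 4))   # 0.0142

# c) P(R | H) -- S and R are independent, so P(S,R) = P(S) P(R)
p_h_total = sum(p_h[s, r] * (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r)
                for s in (0, 1) for r in (0, 1))
p_h_given_r = p_h[1, 1] * p_s + p_h[0, 1] * (1 - p_s)
print(round(p_h_given_r * p_r / p_h_total, 4))   # 0.0185
```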
Conditional Independence
By definition, B and C are independent given A if P(B|A) = P(B|A,C)
D-separation (aka reachability)
It is used to find out whether 2 variables are independent:
- Active triplets: the variables are dependent
- Inactive triplets: the variables are independent
- (shading = given, or known, state)
Bayes Networks
Bayes networks define probability distributions over graph of random variables
Simplest Bayes network with 2 variables
To specify this network we need 3 parameters: P(A) and 2 others, P(B|A) and P(B|¬A)
Example of a Bayes network with 5 variables
The compactness of a Bayes network significantly reduces the number of parameters. On the graph above of 5 variables it is reduced from 31 (2^5 − 1) to 10 (1+1+4+2+2, see picture above), thanks to the formula P(A,B,C,D,E) = P(A)·P(B)·P(C|A,B)·P(D|C)·P(E|C).
The formula is written as a product of probabilities; each factor is the probability of a variable written as a conditional probability given the variables it depends on.
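The parameter-count argument generalises: a binary node with k parents needs 2^k parameters (one P(X=1 | parent values) per parent assignment). A quick sketch for the graph above:

```python
# Parameter count of a Bayes network over binary variables: each node needs
# 2^(number of parents) parameters. Parent sets match the 5-variable graph above.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}

n_params = sum(2 ** len(ps) for ps in parents.values())
print(n_params)               # 10  (= 1 + 1 + 4 + 2 + 2)

# Versus the full joint distribution over 5 binary variables:
print(2 ** len(parents) - 1)  # 31
```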
Unit 4 Probability Inference (how to answer probability questions using Bayes networks)
Enumeration
Given conditional probabilities
Then we can calculate P(+b,+j,+m) by enumerating over the hidden variables e and a
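A sketch of the enumeration P(+b,+j,+m) = Σ_e Σ_a P(+b)·P(e)·P(a|+b,e)·P(+j|a)·P(+m|a). The CPT values below are assumptions (the table in the notes is an image); they are the standard burglary-alarm numbers used in the class:

```python
# Enumerate over the hidden variables e (earthquake) and a (alarm).
# CPT values assumed from the standard burglary-alarm example.
P_b = 0.001                 # P(+b)
P_e = 0.002                 # P(+e)
P_a = {1: 0.95, 0: 0.94}    # P(+a | +b, e) for e = 1, 0
P_j = {1: 0.90, 0: 0.05}    # P(+j | a)
P_m = {1: 0.70, 0: 0.01}    # P(+m | a)

total = sum(
    P_b
    * (P_e if e else 1 - P_e)
    * (P_a[e] if a else 1 - P_a[e])
    * P_j[a] * P_m[a]
    for e in (0, 1) for a in (0, 1)
)
print(round(total, 7))  # 0.0005922
```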
Speedup Techniques for Enumeration
1. Pull out terms
2. Maximise independence (it is a good idea to order variables in the causal direction)
3. Variable elimination
Unit 5 Machine Learning
Supervised Learning
Feature vectors X, target labels Y.
SL tries to predict labels given the input vectors (i.e. to find the function f; see the picture above).
Occam's Razor
Quiz Spam filter
Here we use a Naïve Bayes filter to detect spam
P(SPAM|M) = P(M|SPAM)*P(SPAM)/P(M) = P(today,is,secret|SPAM)*P(SPAM)/P(M)
Using the normalised Bayes rule, as in "Key things to remember, summary from reddit" in Unit 3
(remember that today, is, secret are conditionally independent given the class):
- P(today,is,secret|SPAM)·P(SPAM) = P(today|SPAM)·P(is|SPAM)·P(secret|SPAM)·P(SPAM) = 0
- P(today,is,secret|HAM)·P(HAM) = P(today|HAM)·P(is|HAM)·P(secret|HAM)·P(HAM) = 2/15 · 1/15 · 1/15 · 5/8 ≈ 0.000037
So P(SPAM|M) = 0/(0 + 0.000037) = 0!
This is not good: just because of the single word "today" we can't detect the spam (OVERFITTING!). Overfitting is a common problem when maximum likelihood is used!
One solution is to use Laplace Smoothing to define the probability of words (in this case the word is "today"):
P_LS(x) = (count(x) + k) / (N + k·|x|)
- ML = maximum likelihood, LS = Laplace smoothing
- x is a variable (in this case a word)
- count(x) is the number of occurrences of this value (e.g. today) of the variable x
- |x| is the number of all possible values that the variable x can take
- k is a smoothing parameter
- N is the total number of occurrences of x (the variable, not the value) in the sample space
So apply Laplace smoothing with k = 1 to the quiz (assuming the dictionary is 12 words for both SPAM and HAM; 9 is the total number of words on the SPAM side, 15 on the HAM side):
- P(today,is,secret|SPAM)·P(SPAM) = (0+1)/(9+12) · (1+1)/(9+12) · (3+1)/(9+12) · 0.4 ≈ 0.00035
- P(today,is,secret|HAM)·P(HAM) = (2+1)/(15+12) · (1+1)/(15+12) · (1+1)/(15+12) · 0.6 ≈ 0.00037
Normalising: P(SPAM|M) = 0.00035/(0.00035 + 0.00037) ≈ 0.49
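The whole quiz computation as a sketch (counts as given in the notes; the vocabulary size of 12 is the assumption stated above):

```python
# Naive Bayes with Laplace smoothing (k = 1) for the message "today is secret".
# 9 words total on the SPAM side, 15 on the HAM side, vocabulary of 12 words.
def p_ls(count, n_total, vocab=12, k=1):
    """Laplace-smoothed estimate: (count(x) + k) / (N + k * |x|)."""
    return (count + k) / (n_total + k * vocab)

spam = 0.4 * p_ls(0, 9) * p_ls(1, 9) * p_ls(3, 9)     # today, is, secret | SPAM
ham  = 0.6 * p_ls(2, 15) * p_ls(1, 15) * p_ls(1, 15)  # today, is, secret | HAM

print(round(spam / (spam + ham), 3))  # 0.486
```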
Overfitting Prevention
Types of Supervised Learning
- Classification: values of the target are discrete (binary in the picture above)
- Regression: values of the target are continuous
- Parametric: these methods have parameters, and the number of them is constant, independent of the training-set size
- Non-parametric: the number of parameters can grow significantly

K-nearest neighbours
K-nearest neighbours is a non-parametric supervised learning method. It has 2 steps:
- Learning step: memorise all the data
- Labelling a new example:
  o Find the K nearest neighbours
  o Return the majority class label
Linear regression
M data points; y is continuous. Linear regression tries to find the (linear!) function f, as shown below.
2 types of f:
- f(x) = w1·x + w0, where w1 and w0 are scalars
- f(x) = w·x + w0, where w is a vector
To find f we define a quadratic loss function and try to minimise the loss (M is the number of training samples).
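For the scalar case f(x) = w1·x + w0, minimising the quadratic loss Σ_i (y_i − w1·x_i − w0)² has a closed-form solution. A sketch on made-up data:

```python
# Closed-form least squares for f(x) = w1*x + w0:
#   w1 = (M*sum(xy) - sum(x)*sum(y)) / (M*sum(x^2) - sum(x)^2),  w0 = mean(y) - w1*mean(x)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1 (hypothetical data)

m = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
w0 = (sy - w1 * sx) / m
print(w1, w0)  # 2.0 1.0
```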
Unit 6 Unsupervised Learning
Unlike supervised learning, there is only an input vector, no label. The goal is to find the structure (pattern) of this input.
Clustering algorithms
k-means
Algorithm
- Select k cluster centres at random
- Repeat until no move can be made:
  o Assign each data point to the nearest cluster centre
  o For each cluster: move the cluster centre to the mean (average point) of its assigned data points
  o If a cluster becomes empty: restart it at random
- This algorithm is proven to converge, though only to a local minimum
Problems of k-means clustering algorithm
- Need to know k
- Local minima
- High dimensionality
- Lack of mathematical basis
Expectation maximisation (EM)
- A generalisation of k-means (but first we need to learn the Gaussian distribution)
- Gaussian distribution
  o μ: mean (average)
  o σ²: variance (quadratic deviation)
  o M: number of data points
- Gaussian learning
- Maximum likelihood
- EM as a probabilistic generalisation of k-means
- Choose k
Linear dimensionality reduction
Spectral (affinity-based) clustering
Unit 7 Representation with Logic
Propositional logic
Truth table
Given a space of states, a model has its own truth table.
- A sentence is valid if it is true in every model
- A sentence is satisfiable if it is true in some model but false in another
- A sentence is unsatisfiable if it is false in all models
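These three notions can be checked by enumerating every model, i.e. by building the truth table. The sentences below are hypothetical examples, written as Python boolean functions:

```python
from itertools import product

# Classify a sentence as valid / satisfiable / unsatisfiable by enumerating
# all models (all assignments of True/False to its variables).
def classify(sentence, n_vars):
    vals = [sentence(*model) for model in product([False, True], repeat=n_vars)]
    if all(vals):
        return "valid"
    if any(vals):
        return "satisfiable"
    return "unsatisfiable"

print(classify(lambda p: p or not p, 1))      # valid
print(classify(lambda p, q: p and not q, 2))  # satisfiable
print(classify(lambda p: p and not p, 1))     # unsatisfiable
```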
Limitations of propositional logic
- Can handle only TRUE or FALSE; no uncertainty
- Can't talk about object properties, nor relationships between objects
- No shortcuts
Overcoming the limitations of propositional logic: first-order logic and probability. We focus on first-order logic, which overcomes the last 2 limitations.
3 types of representations
- Atomic (e.g. problem solving)
- Factored (e.g. propositional logic)
- Structured (e.g. programming languages)
First order logic
Model
- Set of objects
- Set of constants
- Set of functions
- Set of relations (unary, binary, etc.)
Why it is called first-order: because its operators work on objects only; there are no operations on the relationships between objects (that would be higher-order logic).
Syntax
- Sentence = relation
- Operators: operate on sentences
- Terms: can be constants, variables or functions
- Quantifiers: unique to and important for first-order logic
  o 2 quantifiers: "for all" (∀) and "there exists" (∃)
  o If the quantifier is omitted, we assume the "for all" quantifier
  o Although all variations are allowed, normally "for all" structures have the form ∀x P(x) ⇒ Q(x), and "there exists" structures have the form ∃x P(x) ∧ Q(x)
Unit 8 Planning
Why Plan? (or Planning vs. Problem Solving)
Problem solving: find a solution up front and then execute it. Although we have a solution, we are not always able to execute it, due to:
- A changing (partially observable) environment; and/or
- An unpredictable (stochastic) environment; and/or
- Multiple agents
The solution is PLANNING, i.e. before doing the next action we observe what happened after the previous action and then decide. With planning we move from the world of actual states to the world of belief states. See the example below with a vacuum cleaner (one belief state consists of one or more world states).
More details about plans, actions, and observations
3 types of vacuum cleaners (VC)
- Sensorless: the VC doesn't have any sensor, so there is no observation
- Partially observable: the VC can see its location, and whether that location is clean, but it can't see the other place
- Stochastic: the VC can attempt to move left or right, but the move may or may not succeed
A few things to note:
- Sensorless vacuum example: even though we can't observe, when we perform actions we know more about the world
- Partially observable vacuum example: when we perform actions *and also observe*, we know even more
- Stochastic vacuum example: we may need branching (and loops): do an action, observe the result, and based on the result go a different way. This branching is not the same as branching in Problem Solving!
In general, actions may increase uncertainty, while observations always reduce it. See the diagram below:
2 types of plans
- Bounded (finite number of steps)
- Unbounded (an infinite number of steps is allowed)
Plans are usually specified in 2 ways
- Linear (a list of steps in order); or
- Tree (when we have branches in the plan; branching is usually done by observation!)
Specify plans mathematically
- A = set of actions, S = set of states, F = final states (goals)
- First equation: exact-state world
- Second equation: belief-state world (predict-observe-update cycle)
  o Problem: some belief states can become very large
  o Solution: instead of describing a belief state as a list of exact states, we use variables
Classical Planning: a representation language to describe plans
- It uses propositional logic
- Variables, not states, are used to describe things
- To describe states:
  o Variables
  o State space
  o World state: a complete assignment of all variables
  o Belief state
- To describe actions:
  o Actions are described by an action schema: a group of many possible actions similar to each other
  o An action schema is described by specifying PRE(CONDITION), where the action schema is possible, and EFF(ECT) of the action schema
- 2 ways to find a plan:
  o Search in state space
    - Progression (forward) search: normal problem-solving in state space
    - Regression (backward) search: backward search from the goal state
  o Search in the plan space
Situation calculus
SC = a first order logic with set of conventions how to represent states and actions. Comparing with
classical planning where propositional logic is usedSC has advantage to describe with for each
and there exists flexibility
- 2 types of objects:
o Actions: normally they are functions, e.g. Fly(p,x,y) = flight of plane p from x to y
o Situations: normally they are paths (of actions in state-space search), not states
 Initial situation S0
 A function S' = Result(S,a), where S is a situation and a is an action; S' is another situation
- Using a predicate to specify the set of possible actions given a situation: Poss(a,S)
o It usually has the form SomePrecond(S) => Poss(a,S). Poss() is called the possibility axiom
for the action a.
o Example of the possibility axiom for the action Fly(p,x,y):
 p is a plane
 x and y are 2 locations
 s is a situation
Fluent = a predicate (i.e. a function or a relation) that can change from one situation to another. For
example, the predicate At(P,X,S) - a plane P is at the airport X in situation S - is a fluent.
- Convention: the situation S is given as the last argument of the predicate
- A true fluent is a fluent that is true in situation s
In Classical Planning we use schemas to describe what happens when each action is executed. In
Situation Calculus we use successor-state axioms to describe what holds in the situation that is the
successor of executing an action. One s-s axiom per fluent!
- In general, an s-s axiom has the form
o a is an action
o s is a situation
o It says: if it is possible to execute a in situation s, then the fluent is TRUE iff action a
makes it true, or it was already true and action a doesn't undo it
- Example of the s-s axiom for the fluent (predicate) "some cargo is in some plane":
o c is a cargo
o p is a plane
o a is an action
o s is a situation
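The axiom itself did not survive extraction; the following is a reconstruction of the cargo example in the standard textbook notation (the exact symbols on the lecture slide may differ):

```latex
Poss(a,s) \Rightarrow \Big( In\big(c,p,\mathrm{Result}(a,s)\big) \Leftrightarrow
  a = \mathrm{Load}(c,p,x) \;\lor\; \big( In(c,p,s) \land a \neq \mathrm{Unload}(c,p,x) \big) \Big)
```

i.e. after a possible action a, the cargo c is in the plane p iff a loaded it, or it was already in the plane and a did not unload it.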
A great thing about SC is that we already have solvers for first-order logic, i.e. once a problem is described
in the language of SC, a solver can automatically come up with a solution (the path from the initial state
to the goal state)!
Unit 9 Planning Under Uncertainty
(RL reinforcement learning)
Planning in different environments:
                       Deterministic                    Stochastic
Fully Observable       A*, Depth-First, Breadth-First   MDP (Markov Decision Process)
Partially Observable                                    POMDP (Partially Observable MDP)
MDP
What is Markov process?
A Finite State Machine where the outcome is not certain but probabilistic
(action a1 moves the system from state S1 to state S2 with a probability of, say, 50%) is a Markov process.
- States S1, ..., SN
- Actions a1, ..., aN
- State-transition matrix T(S,a,S') = P(S'|a,S) (T is called the transition function)
- Reward function R(S,a,S'), or sometimes simply R(S)
- A policy assigns an action to each state
- We try to find a policy π(S) that maximises the discounted total reward
Problem under study Grid World
A command to move North could lead to a move North (80%), East (10%) or West (10%). Conventional planning won't
work, so we need a policy π(S)→A for each state. The task is to find the optimal policy.
Value function
- E[] is the expectation of a stochastic process
- t is the time moment
- γ is the discount factor
- Planning = calculate value functions!
Recursive algorithm (so called Value Iteration) to determine the Value Function
Policy is made based on value function
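The value-iteration loop and the policy read-off can be sketched as follows. This is a minimal sketch on a hypothetical 1-D grid world (states 0..3, terminal value +1 at state 3, step cost -0.04, actions succeed with probability 0.8); all numbers are illustrative assumptions, not lecture data.

```python
# A minimal value-iteration sketch on a hypothetical 1-D grid world:
# states 0..3, state 3 is terminal with value +1, every step costs -0.04,
# and an action succeeds with probability 0.8 (otherwise the robot stays put).

GAMMA = 0.9          # discount factor
STATES = range(4)
TERMINAL = {3: 1.0}  # terminal state -> its value
STEP_REWARD = -0.04

def transitions(s, a):
    """Stochastic motion model: [(probability, next_state)] for a in {-1, +1}."""
    intended = min(max(s + a, 0), 3)
    return [(0.8, intended), (0.2, s)]

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                v_new = TERMINAL[s]
            else:
                # Bellman backup: best expected discounted value over actions
                v_new = max(sum(p * (STEP_REWARD + GAMMA * V[s2])
                                for p, s2 in transitions(s, a))
                            for a in (-1, +1))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:       # stop once the values have converged
            return V

def greedy_policy(V):
    """The policy is read off the value function by one-step lookahead."""
    return {s: max((-1, +1),
                   key=lambda a: sum(p * (STEP_REWARD + GAMMA * V[s2])
                                     for p, s2 in transitions(s, a)))
            for s in STATES if s not in TERMINAL}

V = value_iteration()
pi = greedy_policy(V)   # on this chain, the optimal action everywhere is +1
```

Note how the policy is not stored during iteration; it is recovered from the converged value function, exactly as the notes say.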
Conclusion
POMDP
Information Space = Belief Space
Unit 10 Reinforcement Learning (RL)
3 forms of learning
- Supervised
- Unsupervised
- Reinforcement
o A sequence of state, action, state, action, etc.
o Rewards associated with the sequence
o We try to learn what to do to maximise rewards
Agents of RL
(or what to do if P() and/or R() are not known)
Agent                Knows    Learns               Uses
Utility-based agent  P        R, then U (utility)  U
Q-learning agent              Q(S,a)               Q
Reflex agent                  π(S)                 π
- Passive RL agents: stick to a fixed policy
o Example: Temporal Difference (TD) learning
- Active RL agents: change the policy as we learn
o Example: greedy learning - recalculate the policy after a certain number of iterations.
Problem: not enough exploration; because it is greedy, once it finds some local
optimum, it sticks with it
o Solution for greedy: more exploration is needed (at some point we don't take the
optimal policy, in order to explore more). BUT more exploration means more cost, so
balancing is needed!
Q Learning
- Many varieties, but the common point is to find Q(s,a), not the utility function U and not the transition matrix
- Policy may change in Q-learning
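A minimal tabular Q-learning sketch on a toy chain world (states 0..3, goal at state 3 with reward +1, step cost -0.04); the environment and all constants are illustrative assumptions, not taken from the lecture. Note that the update needs neither a transition matrix T nor a utility function U:

```python
import random

# Tabular Q-learning on a toy chain world (illustrative assumptions).
random.seed(0)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = (-1, +1)

def step(s, a):
    """Environment: returns (next_state, reward, done)."""
    s2 = min(max(s + a, 0), 3)
    if s2 == 3:
        return s2, 1.0, True
    return s2, -0.04, False

Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}

def greedy(s):
    return max(ACTIONS, key=lambda act: Q[(s, act)])

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit, sometimes explore (the balance
        # between exploration and its cost discussed above)
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + (0.0 if done else GAMMA * max(Q[(s2, act)] for act in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# The policy changes as Q changes; reading it off here should give "go right".
learned_policy = {s: greedy(s) for s in range(3)}
```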
Unit 11 Hidden Markov Model and Filters
HMM
Used to analyse time series; applicable in
- Robotics
- Medical
- Finance
- Speech
- Etc.
A Markov Chain is a simple Bayes network where each state depends only on the previous state:
s1→s2→...→sN; each state also emits a so-called measurement.
A Hidden Markov Model (HMM) is a Markov chain where the states s1, s2, etc. (the prior) are
hidden (not observable); instead we can observe only the measurements (giving a posterior over states).
Using HMM it is possible to do 2 things: prediction and state estimation
- Prediction (of next state and/or next measurement)
o Bayes' rule (see the picture above) is used for prediction (usually we calculate only the
numerators and then normalise)
- State estimation (computing the probability of hidden or internal states given the
measurements)
o The total probability formula is used to predict the next state
- The 2 equations above, plus the distribution of the initial state P(s0), form the math of an HMM
Stationary Distribution
SD = the probabilities in an HMM as time approaches infinity
Transition Probabilities
Observe a sequence of days, e.g. R-R-S-S-R-S
Then find the maximum-likelihood estimate (or use Laplace smoothing)
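The counting can be sketched directly from the R-R-S-S-R-S sequence in the notes; the k=1 Laplace-smoothing constant below is an assumption for illustration:

```python
from collections import Counter

# Estimate Markov-chain transition probabilities from one observed sequence
# of days (R = rain, S = sun), with optional Laplace smoothing.
def transition_probs(seq, k=0, states=("R", "S")):
    pairs = Counter(zip(seq, seq[1:]))   # count observed transitions a -> b
    outgoing = Counter(seq[:-1])         # number of transitions leaving each state
    # ML estimate when k=0; Laplace-smoothed estimate when k>0
    return {(a, b): (pairs[(a, b)] + k) / (outgoing[a] + k * len(states))
            for a in states for b in states}

observed = "RRSSRS"
ml = transition_probs(observed)             # e.g. P(R|R) = 1/3, P(S|R) = 2/3
smoothed = transition_probs(observed, k=1)  # e.g. P(R|R) = (1+1)/(3+2) = 2/5
```

Smoothing matters for pairs never observed: the ML estimate would make them exactly 0%.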
Example
HMM from observed measures (HMM happy-grumpy problem)
Use Bayes formula to calculate HMM from observed measures
Particle Filter
Example: a robot in a maze
- The robot can move freely in the maze; it has a sonar to measure the distance to objects / walls
around it
- The belief space is represented by a collection of points or particles
- Each point / particle represents a possible state
- Each point / particle is a 3-dimensional vector (x coordinate, y coordinate, and heading
direction)
- Particle filters approximate a posterior by many guesses
- The density of the guesses represents the posterior probability of being in a certain location
- The more consistent a particle is with the measurement - the better the sonar measurement
fits the place where the particle says the robot is - the better its chance to survive
Particle Filter Algorithm
Given
- S: a set of n particles with associated importance weights
- U: a control
- z: a measurement vector

Aim: construct a new particle set S'
Algorithm
PARTICLE_FILTER(S, U, z)
    S' = {}                             // the new particle set
    η = 0                               // auxiliary parameter (for normalisation)
    for i = 1..n                        // loop through all n particles in S
        sample j ~ {w} with replacement // pick an index j according to the distribution
                                        // defined by the importance weights: particle s_j
        x' ~ P(x' | U, s_j)             // sample a possible successor state x' from the
                                        // state-transition probability, given control U
                                        // and particle s_j
        w' = P(z | x')                  // importance weight w' = measurement probability
                                        // for the new particle
        S' = S' + {<x', w'>}            // add the particle to S'
        η = η + w'                      // η accumulates the sum of the weights, used for
                                        // normalisation at the end
    end for
    for i = 1..n                        // normalise the weights
        w_i = w_i / η
    end for
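The pseudocode above can be sketched as a runnable 1-D particle filter. The motion/measurement noise levels, the world size and the true position are illustrative assumptions:

```python
import math
import random

# Minimal 1-D particle filter: resample by importance weight, propagate
# through a noisy motion model, then reweight by measurement likelihood.
random.seed(1)

def gaussian(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used as the measurement model P(z|x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def particle_filter(particles, weights, u, z, motion_noise=0.2, meas_noise=0.5):
    n = len(particles)
    # sample j ~ {w} with replacement (resampling step)
    chosen = random.choices(particles, weights=weights, k=n)
    new_particles, new_weights = [], []
    for s in chosen:
        x = s + u + random.gauss(0.0, motion_noise)     # x' ~ P(x' | U, s_j)
        new_particles.append(x)
        new_weights.append(gaussian(z, x, meas_noise))  # w' = P(z | x')
    eta = sum(new_weights)                              # normaliser
    return new_particles, [w / eta for w in new_weights]

# Robot starts anywhere in [0, 10), moves +1, then measures its position
# as 6.0; the belief should concentrate near 6.
particles = [random.uniform(0, 10) for _ in range(1000)]
weights = [1.0 / 1000] * 1000
particles, weights = particle_filter(particles, weights, u=1.0, z=6.0)
estimate = sum(p * w for p, w in zip(particles, weights))
```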
Better explanation (also see the wiki page on the kidnapped robot problem):
- http://www.aiqus.com/questions/18339/the-kidnapped-robot
- https://secure.wikimedia.org/wikipedia/en/wiki/Kidnapped_robot_problem

Suppose I kidnap your robot and put it back in your house at random.
It knows
- it's in your house
- the layout of your house
But it doesn't know where in your house it is.
Observation step
1. It generates 100 locations at random to use as estimates of where it might be. (If your house
is big it might need 1,000 or 10,000 locations instead of 100.)
2. Since they are random, each of these locations (x,y) is given an equal likelihood of 1%.
3. Each triad (x,y,%) is a state, so we have 100 states.
4. The robot now takes a measurement of its surroundings and sees that it's in a 4-way
junction.
5. With perfect sensing, we could eliminate states which aren't in a 4-way junction, but our
sensors are a bit flaky. The North-pointing sensor is only correct 80% of the time, so there's a
chance that we're not really in a 4-way junction after all. The other sensors (E, W, S) are also
flaky.
6. So we adjust the probability of all states according to Bayes' Rule. It's possible that two or
more sensors are incorrect, but that's less likely than just one sensor being incorrect. When
we're done, the states describing 4-way junctions have a higher probability (since that's
most likely), and the rest have lower.
7. Perhaps 30 of our states - the ones which describe 4-way junctions - have a weight of 2%,
the rest have weights smaller than 1%, and the total of all weights is 100%.
Resample step
1. Now we move East. Just like the vacuum with slippery wheels, we could end up 1 position
East, but there's also a smaller chance that we could end up NE or SE.
2. How do we update the list of states?
a. We generate 100 new states from the existing states.
b. We choose a state and duplicate it, then apply the movement and randomly choose
the expected outcome. If we are using the robot from the gridworld lectures
(80%/10%/10%) then there's an 80% chance that the new position is East of our
original position, a 10% chance that we generate a new state which is NE, and 10%
SE.
c. We choose states to duplicate using a weighted average of the existing states. Since
30 of the states (the ones which describe a 4-way junction) are 2% likely and the rest
are
and the observation second. This is the same algorithm with a different implementation. In his
model the system makes a move and presents the list of original states, the move taken, and the
measurement made after the move. The essential concepts are the same.
Unit 12 MDP Review
Unit 13 Games

Key point: games can be solved by search (depth-first, breadth-first, A*, etc.)
Deterministic single-player games
- Set of states S (including the start state S0)
- Set of players P (in this case it contains a single player)
- A function Actions(s,p) that gives the possible actions at state s for player p
- A transition function Result(s,a)→s' that gives the result of action a at state s
- A terminal test Terminal(s)→TRUE or FALSE that tells us if the game has ended
- Terminal utilities U(s,p) that give, for a given state s and a given player p, a number
which is the value of the game for that player
Deterministic 2-player (turn taking) zero-sum games
- Deterministic: there is a single result of any action
- 2-player: 2 players, MAX and MIN

Minimax routine
o ▲ (upward triangle) is a move by MAX
o ▼ (downward triangle) is a move by MIN
o a square is a terminal state
o The value function is defined as
o MAX tries to maximise the value function. The algorithm is

The complexity of the algorithm for the tree below:
- Computational (time) complexity = O(b^m)
- Space complexity = O(b·m)

o MIN tries to minimise the value function: similar to the above but opposite
- Zero-sum: the sum of the utilities of the 2 players is 0
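The MAX/MIN recursion can be sketched over an explicit game tree. The tree below is made up for illustration; leaves carry terminal utilities:

```python
# Minimax: MAX nodes take the max of their children's values, MIN nodes
# the min, and leaves return their terminal utilities. Time complexity
# is O(b^m) for branching factor b and depth m.

def minimax(node, maximizing):
    if isinstance(node, (int, float)):  # terminal state: return its utility
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# MAX moves at the root; each inner tuple is a MIN node over three leaves.
tree = ((3, 12, 8), (2, 4, 6), (14, 5, 2))
root_value = minimax(tree, maximizing=True)  # MIN values are 3, 2, 2 -> MAX picks 3
```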
Reduce complexity: 3 approaches
- Reduce b the breadth of the treeo - pruning technique can reduce (bm) to (bm/2)
- Reduce m the depth of the tree: e.g. cut-off at some level and use an evaluation function
- Combination of reducing b and reducing m (alpha-beta pruning)

o This algorithm also uses a new definition of maxValue() as below

- Convert the tree into a graph: e.g. in chess we have opening books, ending books, midgame
knowledge
- Only in the reduce-m approach is information lost
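The maxValue()/minValue() pair with pruning can be sketched as follows (a standard formulation; the lecture's exact code is not in the notes). Alpha is the best value MAX can already guarantee, beta the best MIN can guarantee; branches outside (alpha, beta) are pruned:

```python
# Alpha-beta pruning over an explicit game tree (leaves = utilities).

def max_value(node, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, (int, float)):
        return node
    v = float("-inf")
    for child in node:
        v = max(v, min_value(child, alpha, beta))
        if v >= beta:            # MIN would never allow this branch: prune
            return v
        alpha = max(alpha, v)
    return v

def min_value(node, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, (int, float)):
        return node
    v = float("inf")
    for child in node:
        v = min(v, max_value(child, alpha, beta))
        if v <= alpha:           # MAX would never allow this branch: prune
            return v
        beta = min(beta, v)
    return v

tree = ((3, 12, 8), (2, 4, 6), (14, 5, 2))  # made-up example tree
```

Pruning does not change the root value; it only skips branches that cannot affect it.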
Stochastic games
- ? is a chance node, where we take the expected value
- Expected value in probability is calculated this way:
o Assume we have N possibilities with values a1, a2, ..., aN and corresponding
probabilities p1, p2, ..., pN (Σ pi = 1)
o Expected value = Σ ai·pi

Unit 14 Game Theory
2 objectives
- Agent design: given a game, find the optimal policy
- Mechanism design: design game rules to attract players and benefit the game owner. More formally:
given a utility function and assuming the agents act rationally, find a mechanism to maximise
global utilities
Key definitions (on the example of the Prisoner's dilemma)

Dominant strategy: a strategy with which a player does better than with any other strategy
- For A: testify- For B: testify
Pareto optimal outcome: no other outcomes that all players prefer
- The outcome A=-1, B=-1 is Pareto optimal

Equilibrium: an outcome where no player can benefit from switching to a different strategy, assuming
the other player stays the same
- The outcome A=-5, B=-5 is an equilibrium

Two Finger Morra (zero-sum)
2 players Even (E) and Odd (O) showing their finger at the same time
Difficulty: no dominant strategy, no Pareto optimum
Solution 1: move from the matrix form to a tree form, assuming that one player must go first:
- Left: MAX goes first
- Right: MIN goes first
- Utility -3 ≤ UE ≤ 2 - not good; a very big discrepancy, because we handicap (ask to reveal) the
first player too much. Solution 2 will ask less

Solution 2: like solution 1, but we assume the first player only needs to reveal his strategy
- The probability that the first player selects his move is [p: one, (1-p): two]
Unit 15 Advanced Planning

Advanced planning is like normal planning, taking into account also the following:
- Time
- Resources
- Active perception
- Hierarchical plans

Scheduling
- Network of tasks
- S = start, F = finish
- Each task has ES (earliest start) and LS (latest start) as defined above. Below are ES (left box)
and LS (right box) for each state
Extending planning
Problem of classical planning: it can't handle resources, so it may need to check many combinations.
So it is natural to add resources to the language of classical planning; below is an example.
- New type: Resources (highlighted in red above)
- 2 new attributes of Actions to deal with resources:
o USE (highlighted in green above): to use a resource; after use, the resource still exists
o CONSUME (highlighted in green above): to consume a resource; after consumption, the
resource vanishes
Hierarchical Planning
Aim: close abstraction gap
- Group actions into abstract actions
- Do the planning with the bigger abstract actions
- Then do refinement to find concrete actions for each abstract action
HTN = hierarchical task network
How do we know we reach solution?
- A hierarchical task network achieves the goal if, for every abstract action, at least
one of its refinements achieves the goal

Reachable states (by an abstract action)

Approximate reachable states: lower and upper bounds of the states we can reach by an abstract
action.
Conformant vs. Sensory Planning
Conformant plan = a plan without perception. Sensory planning extends classical planning to
allow active perception, to deal with partial observability.
- New type: Percept (highlighted in red above), to express that we sense something
Unit 16 Computer Vision I
Image formation
(the way to capture image)
Pinhole camera
Perspective Projection formula (for one dimension, but it also applies to the other dimension)
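The formula itself did not survive extraction; from the standard pinhole model it is presumably (notation assumed): for a scene point at lateral offset X and depth Z, with focal length f, the image coordinate is

```latex
x = f \cdot \frac{X}{Z}
```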
Vanishing points: parallel lines converge in perspective to vanishing points
Lens: eliminates the drawback of the pinhole, which is that only one ray reaches the image. The restriction of
a lens is that the image must be at a certain distance (the lens law).
Computer vision
- Classify objects
- 3D reconstruction
- Motion analysis
Invariance is a key concept in Object Recognition: there are natural variations of an image that don't
affect the nature of the object itself. We try to design recognition algorithms invariant to, say,
scale, illumination, rotation, deformation, occlusion (the object is shaded by other objects) and
viewpoint.
Grey scale images
- More used than colour ones in image recognition
- Typically a grey scale image is represented by a matrix (e.g. 700x700) with values from 0 to 255 in each
cell (0 = black, 255 = white)
Extract features: using (kernel) masks
Linear filter
Gradient kernels (filters)
- Horizontal filter
- Vertical filter
Horizontal filters find the vertical edges and vice versa!
Gradient images
To find all the edges (both horizontal and vertical) we need to combine the horizontal and vertical filters
into gradient images (gradient magnitude).
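The combination can be sketched with plain lists and simple [-1, 0, +1] difference kernels (the actual lecture kernels may differ, e.g. Sobel):

```python
import math

# Horizontal/vertical gradient filtering and the gradient magnitude,
# using simple central-difference kernels on a 2-D list of intensities.

def gradient_x(img):
    """Horizontal gradient: responds to vertical edges."""
    h, w = len(img), len(img[0])
    return [[img[r][c + 1] - img[r][c - 1] if 0 < c < w - 1 else 0
             for c in range(w)] for r in range(h)]

def gradient_y(img):
    """Vertical gradient: responds to horizontal edges."""
    h, w = len(img), len(img[0])
    return [[img[r + 1][c] - img[r - 1][c] if 0 < r < h - 1 else 0
             for c in range(w)] for r in range(h)]

def gradient_magnitude(img):
    """Combine both filters: magnitude = sqrt(gx^2 + gy^2) per pixel."""
    gx, gy = gradient_x(img), gradient_y(img)
    return [[math.hypot(gx[r][c], gy[r][c]) for c in range(len(img[0]))]
            for r in range(len(img))]

# A dark/bright vertical edge: gx fires along it while gy stays zero.
image = [[0, 0, 255, 255]] * 4
mag = gradient_magnitude(image)
```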
Canny edge detector (by professor Canny) improves gradient images significantly.
There are other masks like Prewitt, Gaussian kernel (to blur images), etc.
Harris corner detector
- Corners are where there exist a lot of horizontal edges and vertical edges (top figure below)
- Sometimes we may need to rotate the image (bottom figure below); the trick is to use
eigenvalues
Modern feature detectors
- Localisable- As
HOG = Histogram of Oriented Gradient
SIFT = Scale Invariant Feature Transform
Unit 17 Computer Vision II (3D)
Stereo
Task: sensing range (distance) with cameras
- With one camera we can sometimes recover the 3D (i.e. the distance to the object), but not all the
time
- Stereo vision with 2 cameras: easier, but again not all the time, e.g. in the case of the
aperture effect
Stereo Rig = 2 pinhole cameras, usually with the same focal length. Below is how we solve for the depth Z (of an
object at P) from the images of the 2 pinhole cameras:
- f = focal length
- Baseline B = distance between the 2 cameras
- x1 = projected image via pinhole 1
- x2 = projected image via pinhole 2
- Displacement (aka parallax) = x1 - x2
- Optical axes = the axes drawn through the pinholes orthogonally to the image planes
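The depth formula itself did not survive extraction; from similar triangles in this rig it is presumably the standard one (symbols as in the list above):

```latex
Z = \frac{f \cdot B}{x_1 - x_2}
```

i.e. depth is inversely proportional to the parallax x1 - x2.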
Correspondence in stereo

We have images of 2 points P1, P2 (each point has 2 images, one from each camera)
- If we mistakenly mix up the images we may end up with phantom points P1' and P2'
- So finding the correspondence (data association) is important

Take for example 2 cameras and an object that projects to point P in camera 1.
The question is how to find its projection in camera 2:
- Not the whole square (2D)
- Not yet able to pinpoint it (0D)
- The right answer is: along some line (1D). The line is the projection of the line connecting the
real object and P!
Search along the line
- How can we find (pinpoint) the image on camera 2 along the line? 2 ways:
o Matching a small image patch; or
o Matching features (like edges) using a linear filter (see unit 16)
- The SSD (sum of squared differences) minimisation algorithm is usually used
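SSD matching can be sketched in a few lines: slide a small template along a scan line and pick the offset with minimal SSD. The toy 1-D intensity profiles below are made up for illustration:

```python
# Sum-of-squared-differences (SSD) matching along a scan line.

def ssd(a, b):
    """Sum of squared differences between two equal-length patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_match(line, template):
    """Return the offset along `line` where the template fits best (min SSD)."""
    n, m = len(line), len(template)
    return min(range(n - m + 1),
               key=lambda off: ssd(line[off:off + m], template))

scan_line = [10, 10, 10, 80, 90, 80, 10, 10]  # a bright feature at offset 3
template = [80, 90, 80]
offset = best_match(scan_line, template)
```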
Example
We try to correspond 2 patterns below
To do it we try to minimise the cost function (see above).
Dynamic programming is usually used to find the best alignment (similar to MVP)
- Define V(i,j) of each point in the grid to be the best value of getting there
- Start point: top-left; end point: bottom-right. Calculate V(i,j) for each point in the grid; in the
end we get the V(i,j) of the end point (20) and also find the path
Example of dynamic programming
- An alignment grid with columns labelled B B R R R B and rows labelled B R R R R B (the cell values are in the figure)
Unit 18 Computer Vision III
SFM - structure from motion
Motion here means we move a camera around, capture images of an object, and recover the object's
structure (the 3D world).
SFM = a non-linear least squares problem; minimisation is done through
- Gradient descent
- Conjugate gradient
- Gauss-Newton
- Levenberg-Marquardt (the common method!)
- Singular Value Decomposition (affine, orthographic)
Unit 19 Robotics I
2 key tasks
- Find out where you are (aka localisation)
- Find the path to the goal state (aka planning)
2 types of state
- Kinematic state: state of an object in space
- Dynamic state = kinematic state + velocity
Localisation: how to find your position in space given a map. For robotic cars
- Could use GPS, but the error is ~5m
- A particle filter gives an error of 10cm!
Monte-Carlo localisation
(on example of differential-drive robot)
Deterministic case
Add noise (probability): after being given the command MOVE, the robot could be in a few possible places,
each with some probability. This is the PREDICTION step of the Particle Filter.
MEASUREMENT step
Unit 20 Robotic II
Robotic Path Planning vs. normal Planning
- The robotic one is in continuous state space
- The normal one is in discrete state space
A* in continuous space
A* is discrete. It can find a path to the goal as in the picture below, but the path has many sharp
turns, not suitable for robots like a self-driving car.
In continuous space A* becomes Hybrid A*
Hybrid A* lacks completeness (it may not find a path), but it guarantees correctness (if it finds a
path, the path is correct).
Unit 21 Natural Language Processing
2 language models
- Word-based; probabilistic; learned from data
o Probability P(word1, word2, ...)
- Tree-based; logical; hand-coded
o Set of sentences (= a language): {S1, S2, ...}
Probabilistic models

We talk about the probability that a sequence of words makes a sentence, or for short P(w1, ..., wn) = P(w_{1:n})
2 important assumptions
- Markov assumption (of order k): locality of the probabilities, i.e.
P(wi | w_{1:i-1}) ≈ P(wi | w_{i-k:i-1}). Specifically, when k=1 we have
P(wi | w_{1:i-1}) ≈ P(wi | w_{i-1})
- Stationary assumption: the probabilities are the same across the sequence, i.e.
P(wi | w_{i-1}) = P(wj | w_{j-1})
We look at the data and try to find the probability of one word following another; very often we
need smoothing or other techniques, otherwise the probability = 0%.
We also want to go beyond words (augmented models) by extending to non-word components
n-gram models
- Bag of words (e.g. all Shakespeare text)
- Build an n-gram model and sample from that model (i.e. generate random sentences that
come from the probability distribution defined by that model)
Unigram model: sample words according to their frequency in the corpus (of Shakespeare text),
without taking into account any relationship between adjacent words

Bigram model: sample from the probability of a word given the previous word
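The bigram counting and sampling can be sketched as follows; the tiny corpus is a made-up stand-in for the Shakespeare text used in the lecture:

```python
import random
from collections import Counter, defaultdict

# Build a bigram model from adjacent word pairs, then sample each word
# conditioned on the previous one (Markov assumption of order 1).
random.seed(0)
corpus = "the fed raises interest rates as the fed rates rise".split()

bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def sample_next(word):
    """Sample the next word from P(next | word), estimated by counts."""
    nxt = bigrams[word]
    return random.choices(list(nxt), weights=nxt.values())[0]

def generate(start, n=5):
    out = [start]
    for _ in range(n):
        if out[-1] not in bigrams:  # dead end: no observed successor
            break
        out.append(sample_next(out[-1]))
    return out

sentence = generate("the")  # every adjacent pair comes from the corpus
```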
Classification
Common tasks
- Classify words into categories
- Detect the language of a text
Can be word-based or character-based
Methods
- Naive Bayes
- K-nearest neighbour
- Support Vector Machine (SVM)
- Logistic regression
- gzip compression utility (Unix)
Segmentation
Given a sequence of words (characters), find where the spaces are (as in Chinese)
Probabilistic model of segmentation: the best segmentation S* is the one that maximises the
probability of the segmentation given the string: S* = argmax_S P(S | string).
Approximation can be done with the Markov assumption, naive Bayes, etc.
- In the case of naive Bayes we just try to maximise the product of the probabilities of the
individual words: S* ≈ argmax_S Π P(word_i)
- Equivalently, we can find the argmax over all possible segmentations of the string s into a 1st
word f and the rest of the words r: S* = argmax_{f,r} P(f) · S*(r)
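The (first word, rest) recursion can be sketched directly; the tiny word-probability table is a made-up stand-in for counts from a real corpus:

```python
from functools import lru_cache

# Naive-Bayes segmentation: score a segmentation by the product of the
# individual word probabilities, recursing over (first word, rest) splits.

P = {"now": 0.1, "is": 0.1, "the": 0.2, "time": 0.05,
     "no": 0.02, "wis": 0.0001, "t": 0.0001, "he": 0.01}

def Pword(w):
    return P.get(w, 1e-10)  # unseen words get a tiny smoothed probability

@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, words) of the best segmentation of `text`."""
    if not text:
        return 1.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, words = segment(rest)
        candidates.append((Pword(first) * p_rest, [first] + words))
    return max(candidates, key=lambda c: c[0])

prob, words = segment("nowisthetime")
```

Memoisation (`lru_cache`) keeps this polynomial; without it the recursion would revisit the same suffixes exponentially often.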
Spelling correction
Given a misspelled word w, find the correct word c.

Probabilistic model of spelling correction: find the best correction c* = argmax_c P(c | w)

Apply Bayes' rule (ignoring the denominator as it is the same for all c): c* = argmax_c P(c) · P(w | c)
- P(c) comes from data counts
- P(w|c) comes from spelling-correction data
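The argmax over P(c)·P(w|c) can be sketched as follows; both probability tables below are tiny made-up stand-ins for real corpus counts and character-level error counts:

```python
# Noisy-channel spelling correction: pick argmax over candidates c of
# P(c) * P(w|c). All numbers are illustrative assumptions.

P_word = {"pulse": 2e-5, "plus": 1e-4, "please": 3e-4}  # language model P(c)

# Error model P(typed | intended), e.g. from character-level confusion counts
P_error = {("pluse", "pulse"): 0.6,     # transposition ul -> lu: common
           ("pluse", "plus"): 0.05,     # insertion of a trailing e
           ("pluse", "please"): 0.001}  # unlikely: two edits away

def correct(w, candidates):
    """Return the candidate maximising P(c) * P(w|c)."""
    return max(candidates,
               key=lambda c: P_word.get(c, 0.0) * P_error.get((w, c), 0.0))

best = correct("pluse", ["pulse", "plus", "please"])
```

Note how the less frequent word "pulse" can still win if the error model says the observed typo is a likely way to mistype it.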
Example: "pulse" misspelled as "pluse"
- There is usually not enough data to estimate P(pluse|pulse) directly
- So instead we work at the character level - define the misspelling type ul→lu (a transposition)
Unit 22 Natural Language Processing II

Tree model
Need a grammar e.g.
Grammar
S → NP VP
NP → N | D N | N N | N N N
VP → V | V NP | V NP NP
N → interest | Fed | rates | raises
V → interest | rates | raises
D → the | a
Where:
- S = sentence
- N = noun
- V = verb
- NP = noun phrase
- VP = verb phrase
- D = determiner (e.g. a or the)
This type of grammar is called Context Free Grammar (CFG)
Problems with grammar
- Easy to omit good parses
- Easy to include bad parses by accident
- Not a problem: trees being unobservable

Solutions
- Add a probability to each tree
- Add word associations, like a Markov assumption
- Not a possible solution: making the grammar unambiguous
Probabilistic Context Free Grammar (PCFG)
Add probabilities to the CFG grammar we know so far
Example
Lexicons
How are the probabilities defined? People are trained and paid to parse real-life texts.
Ambiguity
- I saw (a man with telescope)
- I saw (a man) with telescope

Lexicalised PCFG (LPCFG)
Normal PCFG
- The probability is given with regard to the category of the left-hand side
- Example: P(VP→V NP NP | lhs=VP) = 0.2; lhs = left-hand side

Lexicalised PCFG
- The probability is given for a specific word
- Example: P(VP→V NP NP | V=gave) = 0.25; "gave" is the word
How do we build the grammar tree? Use search.