Learning
The holy grail of AI. If we can build systems that learn,
then we can begin with minimal information and
high-level strategies and have the systems better
themselves, avoiding the “knowledge engineering
bottleneck” where everything must be hand-coded.
Effective learning is very difficult.
Goal
any change in a system that allows it to perform
better the second time on repetition of the same task
or on another task drawn from the same population
(Herbert Simon, 1983).
Machine Learning
Symbol-based approach: A set of symbols represents the
entities and relationships of a problem domain.
Infer useful generalizations of concepts.
Connectionist approach: Knowledge is represented
by patterns in a network of small, simple processing
units. Recognize invariant patterns in data and
represent them in the structure.
Machine Learning (cont'd)
Genetic algorithms: A population of candidate
solutions which mutate, combine with one another,
and are selected according to a fitness measure.
Stochastic methods: New results are based on both
the knower's expectation and the data (Bayes' rule).
Often implemented using Markov processes.
Types of Learning
Supervised learning: Training examples, both
positive and negative, are classified by a teacher for
use by the learning algorithm.
Unsupervised learning: No classified training data are
used. Category formation, or conceptual clustering, is an
example.
Reinforcement learning: The agent receives feedback
from the environment.
Categorization: Symbol-based
What is the data?
What are the goals?
How is knowledge represented?
What is the concept space?
What operations may be performed on concepts?
How is the concept space searched (heuristics)?
Example – Arch recognition
Problem: How to recognize the concept of 'arch' from
building blocks (Winston).
Symbolist
Supervised learning
Both positive and negative examples (near misses)
KR is by semantic networks
Operations: graph modification, node generalization
Search is data-driven
Example (cont'd)
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, brick)
supports(x, z), supports(y,z)
Example (cont'd)
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, pyramid)
supports(x, z), supports(y,z)
Example (cont'd)
Background knowledge: isa(brick, polygon),
isa(pyramid, polygon)
Generalization:
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, polygon)
supports(x, z), supports(y,z)
Negative Example: Near Miss
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, brick)
supports(x, z), supports(y,z)
touches(x,y), touches(y,x)
Generalization
part(arch, x), part(arch, y), part(arch, z)
type(x, brick), type(y, brick), type(z, brick)
supports(x, z), supports(y,z)
~touches(x,y), ~touches(y,x)
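The positive-example generalization above can be sketched in Python: each example is a set of ground facts, and differing type facts are lifted to their common ancestor using the isa background knowledge. The helper name `generalize_types` is my own; this is an illustrative sketch, not Winston's actual program.

```python
# Background knowledge from the slides: isa(brick, polygon), isa(pyramid, polygon)
ISA = {"brick": "polygon", "pyramid": "polygon"}

common = {("part", "arch", v) for v in "xyz"} | {("supports", "x", "z"),
                                                 ("supports", "y", "z")}
pos1 = common | {("type", v, "brick") for v in "xyz"}          # all bricks
pos2 = common | {("type", "x", "brick"), ("type", "y", "brick"),
                 ("type", "z", "pyramid")}                     # pyramid on top

def generalize_types(h1, h2):
    """Keep shared facts; lift differing type facts to their isa parent."""
    out = set()
    for fact in h1:
        if fact in h2:
            out.add(fact)
        elif fact[0] == "type":
            for other in h2:
                if (other[0] == "type" and other[1] == fact[1]
                        and ISA.get(fact[2]) is not None
                        and ISA.get(fact[2]) == ISA.get(other[2])):
                    out.add(("type", fact[1], ISA[fact[2]]))
    return out

assert ("type", "z", "polygon") in generalize_types(pos1, pos2)
```

Handling the near miss would add the negated `~touches` constraint; that step is omitted here to keep the sketch short.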
Version Space Search (Mitchell)
The problem is to find a general concept (or set of
concepts) that includes the positive examples and
excludes the negative ones.
Symbolist
Supervised learning
Both positive and negative examples
KR is predicate calculus
Generalization operations
Search is data-driven
Generalization Operators
Replace constant with variable:
color(ball, red) -> color(X,red)
Drop conjuncts:
shape(X,round) ^ size(X,small) ^ color(X,red) -> shape(X,round) ^ color(X,red)
Add disjunct:
shape(X,round) ^ color(X,red) -> shape(X,round) ^ (color(X,red) v color(X,blue))
Replace property by more general property:
color(X,red) -> color(X, primary_color)
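The four operators can be sketched in Python, with a concept represented as a set of (predicate, args...) literals and "X" standing for a variable. The helper names (`variabilize`, `drop_conjunct`, and so on) and the hierarchy table are my own illustrations, not from the text.

```python
concept = {("shape", "X", "round"), ("size", "X", "small"), ("color", "X", "red")}

# 1. Replace constant with variable: color(ball, red) -> color(X, red)
def variabilize(literal, const, var="X"):
    return tuple(var if a == const else a for a in literal)

# 2. Drop a conjunct: remove one literal from the conjunction
def drop_conjunct(conc, literal):
    return conc - {literal}

# 3. Add a disjunct: widen one literal to a pair of alternatives (read as OR)
def add_disjunct(literal, alternative):
    return (literal, alternative)

# 4. Replace a property by a more general one, climbing a type hierarchy
HIERARCHY = {"red": "primary_color", "blue": "primary_color"}
def climb(literal):
    *head, value = literal
    return tuple(head) + (HIERARCHY.get(value, value),)

assert variabilize(("color", "ball", "red"), "ball") == ("color", "X", "red")
assert climb(("color", "X", "red")) == ("color", "X", "primary_color")
```

Each operator strictly enlarges the set of instances the concept covers, which is what makes them generalization operators.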
More General Concept
Concept p is more general than concept q (or p
covers q) if the set of elements that satisfy p is a
superset of the set of elements that satisfy q. If p(x)
and q(x) are descriptions that classify objects as
positive examples, then
p(x) -> positive(x) |= q(x) -> positive(x).
Version Space
Version space is the set of all concept descriptions
that are consistent with the training examples.
Mitchell created three algorithms for finding the
version space: specific to general search, general to
specific search, and the candidate elimination
algorithm which works in both directions.
Specific to General Search
S = {first positive training instance};
N = {}; // Set of all negative instances seen so far
for each positive instance p {
for every s ∊ S, if s doesn't match p, replace s in S with its most
specific generalization that matches p;
Delete from S all hypotheses more general than others in S;
Delete from S all hypotheses that match any n ∊ N;
}
For every negative instance n {
Delete all hypotheses from S that match n;
N = N u {n};
}
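The pseudocode above can be made runnable over a simplified concept language in which a hypothesis is a tuple of attribute values and "?" matches anything. This attribute-vector encoding and the helper names are my own simplification of the predicate-calculus version in the text; since minimal generalizations are unique in this language, S stays a singleton and the pruning of overly general members of S is omitted.

```python
def matches(h, instance):
    """A hypothesis matches an instance if every slot agrees or is '?'."""
    return all(hv == "?" or hv == iv for hv, iv in zip(h, instance))

def min_generalization(h, instance):
    """Most specific generalization of h that matches the instance."""
    return tuple(hv if hv == iv else "?" for hv, iv in zip(h, instance))

def specific_to_general(examples):
    pos = [x for x, label in examples if label]
    neg = [x for x, label in examples if not label]
    S = {pos[0]}                       # first positive training instance
    N = []                             # negative instances seen so far
    for p in pos[1:]:
        S = {min_generalization(s, p) if not matches(s, p) else s for s in S}
        S = {s for s in S if not any(matches(s, n) for n in N)}
    for n in neg:
        S = {s for s in S if not matches(s, n)}
        N.append(n)
    return S

examples = [
    (("small", "red", "ball"), True),
    (("small", "blue", "ball"), True),
    (("large", "red", "cube"), False),
]
print(specific_to_general(examples))   # {('small', '?', 'ball')}
```

The two positives force the color slot to "?", while the negative instance never matches the resulting hypothesis, so it survives.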
General to Specific Search
G = {most general concept in the concept space};
P = {}; // Set of all positive instances seen so far
for each negative instance n {
for every g ∊ G, if g matches n, replace g in G with its most general
specialization that doesn't match n;
Delete from G all hypotheses more specific than others in G;
Delete from G all hypotheses that fail to match some p ∊ P;
}
for every positive instance p {
Delete all hypotheses from G that fail to match p;
P = P u {p};
}
Candidate Elimination Algorithm
G = {most general concept in the concept space};
S = {first positive training instance};
for each new positive instance p {
Delete from G all hypotheses that fail to match p;
for every s ∊ S, if s doesn't match p, replace s in S with its most
specific generalization that matches p;
Delete from S all hypotheses more general than others in S;
Delete from S all hypotheses more general than some hypothesis in G;
}
CAE (cont'd)
for each negative instance n {
Delete from S all hypotheses that match n;
for every g ∊ G, if g matches n, replace g in G with its most general
specialization that doesn't match n;
Delete from G all hypotheses more specific than others in G;
Delete from G all hypotheses more specific than some hypothesis in S;
}
If G == S and both are singletons, the algorithm has found a single
concept that is consistent with the data and the algorithm halts.
If G and S become empty, there is no concept that satisfies the data.
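The full bidirectional search can be sketched in the same simplified attribute-vector language ("?" matches anything); a minimal specialization replaces one "?" with a value drawn from that attribute's domain. The encoding and consistency pruning here are my own simplification of the version in the text.

```python
def matches(h, x):
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    """h1 is at least as general as h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    S = {next(x for x, label in examples if label)}   # first positive instance
    G = {tuple("?" for _ in domains)}                 # most general concept
    for x, label in examples:
        if label:
            G = {g for g in G if matches(g, x)}
            S = {tuple(sv if sv == xv else "?" for sv, xv in zip(s, x))
                 if not matches(s, x) else s for s in S}
            # keep only members of S still covered by some member of G
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not matches(s, x)}
            newG = set()
            for g in G:
                if not matches(g, x):
                    newG.add(g)
                    continue
                # minimal specializations: fill one "?" with a value != x[i]
                for i, dom in enumerate(domains):
                    if g[i] == "?":
                        for v in dom:
                            if v != x[i]:
                                spec = g[:i] + (v,) + g[i + 1:]
                                if any(more_general(spec, s) for s in S):
                                    newG.add(spec)
            # drop members of G more specific than another member of G
            G = {g for g in newG
                 if not any(g2 != g and more_general(g2, g) for g2 in newG)}
    return S, G

domains = [("small", "large"), ("red", "blue"), ("ball", "cube")]
examples = [
    (("small", "red", "ball"), True),
    (("large", "red", "cube"), False),
    (("small", "blue", "ball"), True),
]
print(candidate_elimination(examples, domains))
```

On this data S converges to ('small', '?', 'ball') while G still contains two incomparable generalizations, illustrating a version space that has not yet collapsed to a single concept.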
Candidate Elimination Algorithm
G should always be a superset of S, and the
concepts that lie between them satisfy the data.
Incremental in nature – can process one training
example at a time and form a usable, though
incomplete, generalization.
Sensitive to noise and inconsistency in the set of
training data.
Essentially breadth-first search – heuristics can be
used to trim the search space.
LEX: Integrating Algebraic Exprs.
LEX (Mitchell, et al.) integrates algebraic expressions
by starting with an initial expression and then
searching the space of expressions until it finds an
equivalent expression with no integral signs. The
system induces heuristics that improve its
performance based on data obtained from its
problem solver.
LEX (cont'd)
The operations are the rules of expression
transformation:
OP1: ∫ r f(x) dx -> r ∫ f(x) dx
OP2: ∫ u dv -> uv - ∫ v du
OP3: 1 * f(x) -> f(x)
OP4: ∫ (f1(x) + f2(x)) dx -> ∫ f1(x) dx + ∫ f2(x) dx
Heuristics
Heuristics are of the form:
If the current problem state matches P then apply
operator O with bindings B.
Example:
If a problem state matches ∫ transcendental(x) dx,
then apply OP2 with bindings
u = x
dv = transcendental(x) dx
LEX Architecture
LEX consists of four components:
A generalizer that uses the Candidate Elimination
Algorithm to find heuristics,
A problem solver that produces traces of problem
solutions,
A critic that produces positive and negative
instances from the problem trace, and
A problem generator that produces new candidate
problems.
How it works
LEX maintains a version space for each operator.
Each version space represents the partially learned
heuristic for that operator. The version space is
updated from the positive and negative examples
generated by the critic.
The problem solver builds a tree of the space
searched in solving an integration problem. It does
best-first search using the partial heuristics.
How it works (cont'd)
Deciding whether an example is positive or negative is an
instance of the credit assignment problem. After
solving a problem, LEX finds the shortest path from
the input to the solution. Those operators on the
shortest path are classified as positive, and those
that are not are classified as negative. Since the
search is not admissible, the path may not actually
be the shortest one.
ID3 Decision Tree Algorithm
A different approach to machine learning is to
construct decision trees. At each node we test one
property of the object and proceed to the proper
child node, until reaching a leaf, at which point we
can classify the object. We try to construct the best
decision tree, the one with the fewest nodes
(decisions). Here there may be many categories,
not just positive and negative.
ID3
Problem: Classify a set of instances based on their
values of given properties.
Symbolist
Supervised learning
Each instance is classified to a finite type
KR is the tree and the operations are tree creation
All instances must be known in advance (non-iterative)
Simple Tree Formation
Choose a property.
The property divides the set of examples into
subsets depending on their value of that property.
Recursively create a sub-tree for each subset.
Make all the sub-trees children of the root, which
tests the given property.
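The recursion above can be sketched as follows; properties are taken in a fixed, given order here (choosing the most informative one first is ID3's refinement, discussed below). The data and helper name are illustrative, not from the text.

```python
from collections import defaultdict

def build_tree(examples, properties):
    """examples: list of (dict_of_property_values, category) pairs."""
    categories = {cat for _, cat in examples}
    if len(categories) == 1 or not properties:
        return categories.pop()   # leaf: a category (ties broken arbitrarily)
    prop, rest = properties[0], properties[1:]
    partitions = defaultdict(list)
    for ex, cat in examples:      # split examples by their value of prop
        partitions[ex[prop]].append((ex, cat))
    return {prop: {value: build_tree(subset, rest)
                   for value, subset in partitions.items()}}

data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain", "windy": "no"}, "stay"),
]
tree = build_tree(data, ["outlook", "windy"])
```

The resulting nested dictionary is read as: test the property at each level, follow the branch matching the instance's value, and stop at a leaf category.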
Caveat
The tree that is formed is highly dependent on the
order in which the properties are chosen. The idea is
to choose the most informative property first, and use
that to sub-divide the space of examples. This leads
to the best (smallest) tree.
Information Theory
The amount of information in a message (Shannon)
is a function of the probability of occurrence p of
each possible message, namely -log2(p). Given a
universe of messages M = {m1, m2, ..., mn} and a
probability p(mi) for the occurrence of each
message, the expected information content of a
message from M is:
I[M] = ∑i=1..n -p(mi) log2(p(mi)) = E[-log2 p(mi)]
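Shannon's formula translates directly into a few lines of Python; for example, a fair coin carries one bit per toss.

```python
import math

def information(probabilities):
    """I[M] = sum over messages of -p(m) * log2 p(m)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))   # 1.0 — a fair coin is one bit per toss
print(information([1.0]))        # 0.0 — a certain message carries no information
```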
Choosing the Property
The information gain provided by choosing property
A at the root of the tree is equal to the total
information of the tree minus the amount of
information needed to complete the classification of
the tree. The amount of information needed to
complete the tree is defined as the weighted average
of the information in all its subtrees.
Choosing the Property (cont'd)
Assuming a set of training instances C, if we make
property P with n values the root of the tree, then C
will be partitioned into subsets {C1, C2, ..., Cn}. The
expected value of the information needed to
complete the tree is:
E[P] = ∑i=1..n (|Ci| / |C|) * I[Ci]
and the information gain from choosing P is:
gain(P) = I[C] - E[P].
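The two formulas can be computed directly for a tiny training set; the attribute names and data below are illustrative, assuming a single binary property that perfectly splits the categories.

```python
import math
from collections import Counter

def information(labels):
    """I[C]: expected information of a set of category labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_info(examples, prop):
    """E[P]: weighted average information of the partitions induced by prop."""
    partitions = {}
    for ex, label in examples:
        partitions.setdefault(ex[prop], []).append(label)
    n = len(examples)
    return sum(len(part) / n * information(part)
               for part in partitions.values())

def gain(examples, prop):
    """gain(P) = I[C] - E[P]."""
    return (information([label for _, label in examples])
            - expected_info(examples, prop))

data = [
    ({"outlook": "sunny"}, "stay"),
    ({"outlook": "sunny"}, "stay"),
    ({"outlook": "rain"}, "play"),
    ({"outlook": "rain"}, "play"),
]
print(gain(data, "outlook"))   # 1.0: outlook completely determines the category
```

ID3 evaluates gain(P) for every candidate property and places the one with the highest gain at the root, then recurses on each partition.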