Post on 29-Jul-2018
Lecture 6: Inductive Logic Programming
Cognitive Systems II - Machine Learning, SS 2005
Part II: Special Aspects of Concept Learning
FOIL, Inverted Resolution, Sequential Covering
Lecture 6: Inductive Logic Programming – p. 1
Motivation
it is useful to learn the target function as a set of if-then-rules
one of the most expressive and human-readable representations
e.g. decision trees
Inductive Logic Programming (ILP):
rules are learned directly
designed to learn first-order rules (i.e. including variables)
sequential covering to incrementally grow the final set of rules
PROLOG programs are sets of first-order rules
⇒ a general-purpose method capable of learning such rule sets may be viewed as an algorithm for automatically inferring PROLOG programs
Sequential Covering
basic strategy: learn one rule, remove the data it covers, then iterate this process
one of the most widespread approaches to learn a disjunctive set of rules (each rule itself is conjunctive)
subroutine LEARN-ONE-RULE
accepts a set of positive and negative examples as input and outputs a single rule that covers many of the positive and few of the negative examples
high accuracy: predictions should be correct
low coverage: not necessarily predictions for each example
performs a greedy search without backtracking
⇒ no guarantee to find the smallest or best set of rules
Sequential Covering Algorithm
SEQUENTIAL-COVERING (Target_attribute, Attributes, Examples, Threshold)
  Learned_rules ← {}
  Rule ← LEARN-ONE-RULE (Target_attribute, Attributes, Examples)
  While PERFORMANCE (Rule, Examples) > Threshold, Do
    Learned_rules ← Learned_rules + Rule
    Examples ← Examples − {examples correctly classified by Rule}
    Rule ← LEARN-ONE-RULE (Target_attribute, Attributes, Examples)
  Learned_rules ← sort Learned_rules according to PERFORMANCE over Examples
  Return Learned_rules
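The outer loop above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: PERFORMANCE is assumed to be the accuracy of a rule on the examples it covers, LEARN-ONE-RULE is replaced by a simple hill-climbing stand-in, the final sort is assumed to use the full training set, and the four-example data set is made up. Threshold is assumed to be non-negative, otherwise the loop need not terminate.

```python
from collections import Counter

def matches(preconds, example):
    """True if the example satisfies every attribute test in preconds."""
    return all(example[a] == v for a, v in preconds.items())

def performance(rule, examples, target):
    """Assumed metric: accuracy of the rule on the examples it covers."""
    preconds, prediction = rule
    covered = [e for e in examples if matches(preconds, e)]
    if not covered:
        return 0.0
    return sum(e[target] == prediction for e in covered) / len(covered)

def majority(examples, target):
    """Most frequent target value, or None for an empty list."""
    if not examples:
        return None
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def learn_one_rule(target, attributes, examples):
    """Hill-climbing stand-in: add the single best test until none helps."""
    rule = ({}, majority(examples, target))
    while True:
        best, best_perf = rule, performance(rule, examples, target)
        covered = [e for e in examples if matches(rule[0], e)]
        for a in attributes:
            if a in rule[0]:
                continue
            for v in sorted({e[a] for e in covered}):
                subset = [e for e in covered if e[a] == v]
                cand = ({**rule[0], a: v}, majority(subset, target))
                perf = performance(cand, examples, target)
                if perf > best_perf:
                    best, best_perf = cand, perf
        if best is rule:          # no specialization improved the rule
            return rule
        rule = best

def sequential_covering(target, attributes, examples, threshold):
    learned = []
    remaining = list(examples)
    rule = learn_one_rule(target, attributes, remaining)
    while remaining and performance(rule, remaining, target) > threshold:
        learned.append(rule)
        # remove the examples the new rule classifies correctly
        remaining = [e for e in remaining
                     if not (matches(rule[0], e) and e[target] == rule[1])]
        rule = learn_one_rule(target, attributes, remaining)
    # assumption: rank the rules on the full training set
    learned.sort(key=lambda r: performance(r, examples, target), reverse=True)
    return learned

# hypothetical toy data
examples = [
    {"Humidity": "normal", "Wind": "weak", "PlayTennis": "yes"},
    {"Humidity": "normal", "Wind": "strong", "PlayTennis": "yes"},
    {"Humidity": "high", "Wind": "weak", "PlayTennis": "no"},
    {"Humidity": "high", "Wind": "strong", "PlayTennis": "no"},
]
rules = sequential_covering("PlayTennis", ["Humidity", "Wind"], examples, 0.9)
```

On this toy data the first rule learned is "IF Humidity=high THEN PlayTennis=no"; once its examples are removed, the remaining data is covered by the empty precondition predicting yes, so the result reads as an ordered rule list.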
General to Specific Beam Search
question: How shall LEARN-ONE-RULE be designed to meet the needs of the sequential covering algorithm?
organize the search through H analogously to ID3
but follow only the most promising branch in the tree at each step
begin by considering the most general rule precondition (i.e. the empty test)
then greedily add the attribute test that most improves rule performance over the training examples
unlike ID3, this implementation follows only a single descendant at each search step rather than growing a subtree that covers all possible values of the selected attribute
General to Specific Beam Search
IF THEN PlayTennis=yes
  IF Wind=weak THEN PlayTennis=yes
  IF Wind=strong THEN PlayTennis=no
  IF Humidity=normal THEN PlayTennis=yes
    IF Humidity=normal ∧ Wind=weak THEN PlayTennis=yes
    IF Humidity=normal ∧ Wind=strong THEN PlayTennis=yes
    IF Humidity=normal ∧ Outlook=sunny THEN PlayTennis=yes
    IF Humidity=normal ∧ Outlook=rain THEN PlayTennis=yes
    ...
  IF Humidity=high THEN PlayTennis=no
  ...
General to Specific Beam Search
so far a local greedy search (analogous to hill-climbing) is employed
danger of suboptimal results
susceptible to the typical hill-climbing problems
⇒ extension to beam search
algorithm maintains a list of the k best candidates at each step
at each step, descendants are generated for each of the k candidates and the resulting set is again reduced to the k most promising candidates
LEARN-ONE-RULE
LEARN-ONE-RULE (Target_attribute, Attributes, Examples, k)
Returns a single rule that covers some of the Examples. Conducts a general-to-specific greedy beam search for the best rule, guided by the PERFORMANCE metric.
  Initialize Best_hypothesis to the most general hypothesis ∅
  Initialize Candidate_hypotheses to the set {Best_hypothesis}
  While Candidate_hypotheses is not empty, Do
    1. Generate the next more specific Candidate_hypotheses
       New_candidate_hypotheses ← newly generated and specialized candidates
    2. Update Best_hypothesis
       Best_hypothesis ← h with best PERFORMANCE
    3. Update Candidate_hypotheses
       Candidate_hypotheses ← the k best members of New_candidate_hypotheses
  Return a rule of the form “IF Best_hypothesis THEN prediction”, where prediction is the most frequent value of Target_attribute among those Examples that match Best_hypothesis.
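A compact Python sketch of this beam search is given below. Several details are assumptions made for illustration: a hypothesis is a frozenset of (attribute, value) tests, PERFORMANCE is taken to be the share of covered examples that carry the majority target value, ties are broken alphabetically so the result is reproducible, and the five training examples are invented.

```python
from collections import Counter

def matches(hypothesis, example):
    """A hypothesis is a frozenset of (attribute, value) tests."""
    return all(example[a] == v for a, v in hypothesis)

def performance(hypothesis, examples, target):
    """Assumed metric: share of covered examples with the majority target value."""
    covered = [e[target] for e in examples if matches(hypothesis, e)]
    if not covered:
        return 0.0
    return Counter(covered).most_common(1)[0][1] / len(covered)

def learn_one_rule(target, attributes, examples, k):
    best = frozenset()                    # most general hypothesis: no tests
    candidates = [best]
    while candidates:
        # 1. generate all one-test specializations of the current candidates
        new_candidates = set()
        for h in candidates:
            used = {a for a, _ in h}
            for a in attributes:
                if a in used:
                    continue
                for v in {e[a] for e in examples}:
                    new_candidates.add(h | {(a, v)})
        # discard hypotheses that cover no example at all
        new_candidates = {h for h in new_candidates
                          if any(matches(h, e) for e in examples)}
        ranked = sorted(
            new_candidates,
            key=lambda h: (-performance(h, examples, target), sorted(h)))
        # 2. update the best hypothesis seen so far
        if ranked and (performance(ranked[0], examples, target)
                       > performance(best, examples, target)):
            best = ranked[0]
        # 3. keep only the k most promising candidates
        candidates = ranked[:k]
    covered = [e[target] for e in examples if matches(best, e)]
    prediction = Counter(covered).most_common(1)[0][0] if covered else None
    return best, prediction

# hypothetical toy data
examples = [
    {"Outlook": "sunny", "Humidity": "high", "Wind": "weak", "PlayTennis": "no"},
    {"Outlook": "sunny", "Humidity": "normal", "Wind": "strong", "PlayTennis": "yes"},
    {"Outlook": "rain", "Humidity": "normal", "Wind": "weak", "PlayTennis": "yes"},
    {"Outlook": "rain", "Humidity": "high", "Wind": "strong", "PlayTennis": "no"},
    {"Outlook": "overcast", "Humidity": "high", "Wind": "weak", "PlayTennis": "yes"},
]
hypothesis, prediction = learn_one_rule(
    "PlayTennis", ["Outlook", "Humidity", "Wind"], examples, k=2)
```

The loop ends on its own because every step adds one attribute test and each attribute may be tested only once per hypothesis. On the toy data the returned rule is "IF Humidity=normal THEN PlayTennis=yes".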
Sequential vs. Simultaneous Covering
sequential covering:
learn just one rule at a time, remove the covered examples and repeat the process on the remaining examples
many search steps, making independent decisions to select each precondition for each rule
simultaneous covering:
ID3 learns the entire set of disjunct rules simultaneously as part of a single search for a decision tree
fewer search steps, because each choice influences the preconditions of all rules
⇒ choice depends on how much data is available
plentiful → sequential covering (more steps supported)
scarce → simultaneous covering (decision sharing effective)
Differences in Search
generate-then-test:
search through all syntactically legal hypotheses
generation of the successor hypotheses is only based on the syntax of the hypothesis representation
training data is considered after generation to choose among the candidate hypotheses
each training example is considered several times
⇒ impact of noisy data is minimized
example driven:
individual training examples constrain the generation of hypotheses
e.g. FIND-S, CANDIDATE ELIMINATION
⇒ search can easily be misled
Learning First-Order Rules
propositional expressions do not contain variables and are therefore less expressive than first-order expressions
no general way to describe essential relations among the values of attributes
now we consider learning first-order rules (Horn Theories)
a Horn clause is a clause containing at most one positive literal
expression of the form: H ∨ ¬L1 ∨ ¬L2 ∨ ... ∨ ¬Ln
⇐⇒ H ← (L1 ∧ L2 ∧ ... ∧ Ln)
⇐⇒ IF (L1 ∧ L2 ∧ ... ∧ Ln) THEN H
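The equivalence of the clause form and the rule form can be checked mechanically. The short sketch below does so by truth table for a body of three literals; the function names are hypothetical and the body length of three is an arbitrary choice.

```python
from itertools import product

def clause_form(h, body):
    """H ∨ ¬L1 ∨ ... ∨ ¬Ln"""
    return h or any(not literal for literal in body)

def rule_form(h, body):
    """(L1 ∧ ... ∧ Ln) → H, written as material implication."""
    return (not all(body)) or h

# compare both forms on every truth assignment for H, L1, L2, L3
equivalent = all(
    clause_form(h, body) == rule_form(h, body)
    for h, *body in product([False, True], repeat=4)
)
```

The two forms agree on all sixteen assignments, which is just De Morgan's law applied to the negated body.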
FOL terminology see CogSysI
FOIL (Quinlan, 1990)
natural extension of SEQUENTIAL-COVERING and LEARN-ONE-RULE
outputs sets of first-order rules similar to Horn clauses with two exceptions
1. more restricted, because literals are not permitted to contain function symbols
2. more expressive, because literals in the body can be negated
differences between FOIL and earlier algorithms:
seeks only rules that predict when the target literal is True
conducts a simple hill-climbing search instead of beam search
FOIL algorithm
FOIL (Target_predicate, Predicates, Examples)
  Pos ← those Examples for which the Target_predicate is True
  Neg ← those Examples for which the Target_predicate is False
  Learned_rules ← {}
  While Pos is not empty, Do
    NewRule ← the rule that predicts Target_predicate with no preconditions
    NewRuleNeg ← Neg
    While NewRuleNeg is not empty, Do
      Candidate_literals ← generate new literals for NewRule, based on Predicates
      Best_literal ← argmax over L ∈ Candidate_literals of FoilGain(L, NewRule)
      add Best_literal to the preconditions of NewRule
      NewRuleNeg ← subset of NewRuleNeg that satisfies NewRule preconditions
    Learned_rules ← Learned_rules + NewRule
    Pos ← Pos − {members of Pos covered by NewRule}
  Return Learned_rules
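The slides use FoilGain without defining it. A sketch under the usual definition from Quinlan's FOIL is FoilGain(L, R) = t · (log2(p1/(p1+n1)) − log2(p0/(p0+n0))), where p0 and n0 count the positive and negative bindings of rule R, p1 and n1 count those of R with L added, and t is the number of positive bindings of R still covered afterwards. The counts in the example are made up, and returning minus infinity for a literal that covers no positive binding is a convention chosen here.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FoilGain(L, R) = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    if p0 <= 0:
        raise ValueError("the rule must cover at least one positive binding")
    if p1 == 0:
        # assumption: a literal that covers no positive binding is never chosen
        return float("-inf")
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# hypothetical counts: the current rule covers 4 positive and 4 negative bindings
p0, n0 = 4, 4
candidate_stats = {            # literal -> (p1, n1, t), invented for illustration
    "Father(x, z)": (4, 2, 4),
    "Female(y)": (2, 0, 2),
}
gains = {lit: foil_gain(p0, n0, p1, n1, t)
         for lit, (p1, n1, t) in candidate_stats.items()}
best_literal = max(gains, key=gains.get)
```

With these counts Female(y) wins: it keeps fewer positive bindings but excludes every negative one, and the gain weighs the purity improvement by the number of positives retained.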
FOIL Hypothesis Space
outer loop (set of rules):
specific-to-general search
initially, there are no rules, so that each example will be classified negative (most specific)
each new rule raises the number of examples classified as positive (more general)
disjunctive connection of rules
inner loop (preconditions for one rule):
general-to-specific search
initially, there are no preconditions, so that every example satisfies the rule (most general)
each new precondition raises the number of examples classified as negative (more specific)
conjunctive connection of preconditions
Generating Candidate Specializations
current rule:
P(x1, x2, ..., xk) ← L1...Ln where L1...Ln are the preconditions and P(x1, x2, ..., xk) is the head of the rule
FOIL generates candidate specializations by considering new literals Ln+1 that fit one of the following forms:
Q(v1, ..., vr) where Q ∈ Predicates and the vi are new or already present variables (at least one vi must already be present)
Equal(xj, xk) where xj and xk are already present in the rule
the negation of either of the above forms
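The three forms listed above can be enumerated directly. The following sketch is one possible reading, with assumptions stated up front: predicates are given as a name-to-arity mapping, fresh variables are named _v1, _v2, ... (hypothetical names), and literals that differ only by renaming fresh variables are generated once.

```python
from itertools import combinations, product

def candidate_literals(rule_vars, predicates):
    """Return (negated, name, args) triples for one specialization step.

    rule_vars:  variables already present in the rule
    predicates: dict mapping predicate name to arity
    """
    positive = []
    for name, arity in sorted(predicates.items()):
        fresh = [f"_v{i}" for i in range(1, arity)]   # at most arity-1 new variables
        for args in product(list(rule_vars) + fresh, repeat=arity):
            if not any(a in rule_vars for a in args):
                continue                   # at least one variable must already be present
            used_fresh = [a for a in dict.fromkeys(args) if a in fresh]
            if used_fresh != fresh[:len(used_fresh)]:
                continue                   # skip mere renamings of fresh variables
            positive.append((name, args))
    for a, b in combinations(rule_vars, 2):
        positive.append(("Equal", (a, b)))
    # every literal is offered both plain and negated
    return [(neg, name, args) for name, args in positive for neg in (False, True)]

literals = candidate_literals(["x", "y"], {"Father": 2})
```

For a rule with variables x and y and the single binary predicate Father, this yields eight Father literals (such as Father(x, _v1) and Father(y, x)), one Equal literal, and the negation of each, eighteen candidates in total.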
Induction as Inverted Deduction
observation: induction is just the inverse of deduction
in general, machine learning involves building theories that explain the observed data
Given some data D and some background knowledge B, learning can be described as generating a hypothesis h that, together with B, explains D.
(∀⟨xi, f(xi)⟩ ∈ D) (B ∧ h ∧ xi) ⊢ f(xi)
the above expression casts the learning problem in the framework of deductive inference and formal logic
Induction as Inverted Deduction
features of inverted deduction:
subsumes the common definition of learning as finding some general concept
background knowledge allows a richer definition of when a hypothesis h is said to “fit” the data
practical difficulties:
noisy data makes the logical framework completely lose the ability to distinguish between truth and falsehood
search is intractable
background knowledge often increases the complexity of H
Inverting Resolution
resolution is a general method for automated deduction
complete and sound method for deductive inference
see CogSys1
Inverse Resolution Operator (propositional form):
1. Given initial clauses C1 and C, find a literal L that occurs in C1 but not in clause C.
2. Form the second clause C2 by including the following literals: C2 = (C − (C1 − {L})) ∪ {¬L}
inverse resolution is not deterministic
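The propositional operator, C2 = (C − (C1 − {L})) ∪ {¬L}, fits in a few lines of Python. In this sketch clauses are frozensets of strings, a leading "~" marks negation, and the proposition names are illustrative only; the forward resolution step is included to check that the constructed C2 really resolves with C1 to C.

```python
def negate(literal):
    """Flip the sign of a propositional literal written as 'P' or '~P'."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolve(c1, c2, literal):
    """Resolvent of c1 and c2 on a literal of c1 whose negation is in c2."""
    return (c1 - {literal}) | (c2 - {negate(literal)})

def inverse_resolve(c, c1):
    """One candidate C2 for each literal L that occurs in C1 but not in C."""
    return [(c - (c1 - {lit})) | {negate(lit)} for lit in sorted(c1 - c)]

# illustrative clauses
c = frozenset({"PassExam", "~Study"})            # resolvent
c1 = frozenset({"PassExam", "~KnowMaterial"})    # first parent clause
candidates = inverse_resolve(c, c1)
```

Here the only literal of C1 missing from C is ¬KnowMaterial, so the single candidate is C2 = KnowMaterial ∨ ¬Study. The operator remains non-deterministic in general, because C1 may offer several such literals and C2 may additionally contain literals of C1 without changing the resolvent.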
Inverting Resolution
Inverse Resolution Operator (first-order form):
resolution rule:
1. Find a literal L1 from clause C1, a literal L2 from clause C2, and a substitution θ such that L1θ = ¬L2θ
2. Form the resolvent C by including all literals from C1θ and C2θ, except for L1θ and L2θ. That is,
C = (C1 − {L1})θ ∪ (C2 − {L2})θ
analytical derivation of the inverse resolution rule:
C = (C1 − {L1})θ1 ∪ (C2 − {L2})θ2 where θ = θ1θ2
C − (C1 − {L1})θ1 = (C2 − {L2})θ2 where L2 = ¬L1θ1θ2⁻¹
⇒ C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹}
Inverting Resolution
D = {GrandChild(Bob, Shannon)}
B = {Father(Shannon, Tom), Father(Tom, Bob)}
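One possible derivation from this D and B applies the first-order inverse resolution rule, C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹}, twice. The sketch below encodes literals as (sign, predicate, arguments) triples. The inverse substitutions are chosen by hand, which is an assumption (other choices give other hypotheses), and for simplicity an inverse substitution here replaces every occurrence of a constant.

```python
def substitute(literal, theta):
    """Apply a substitution (or inverse substitution) to the arguments."""
    sign, pred, args = literal
    return (sign, pred, tuple(theta.get(a, a) for a in args))

def negate(literal):
    sign, pred, args = literal
    return (not sign, pred, args)

def inverse_resolve(c, c1, l1, theta1, theta2_inv):
    """C2 = (C - (C1 - {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹}"""
    c1_rest = {substitute(lit, theta1) for lit in c1 - {l1}}
    kept = {substitute(lit, theta2_inv) for lit in c - c1_rest}
    new_literal = substitute(negate(substitute(l1, theta1)), theta2_inv)
    return frozenset(kept | {new_literal})

grandchild = (True, "GrandChild", ("Bob", "Shannon"))    # the observation in D
father_st = (True, "Father", ("Shannon", "Tom"))         # background facts in B
father_tb = (True, "Father", ("Tom", "Bob"))

# step 1: invert with Father(Shannon, Tom), generalizing Shannon to x
step1 = inverse_resolve(frozenset({grandchild}), frozenset({father_st}),
                        father_st, {}, {"Shannon": "x"})
# step 2: invert with Father(Tom, Bob), generalizing Bob to y and Tom to z
step2 = inverse_resolve(step1, frozenset({father_tb}),
                        father_tb, {}, {"Bob": "y", "Tom": "z"})
```

The first step yields GrandChild(Bob, x) ∨ ¬Father(x, Tom), and the second yields GrandChild(y, x) ∨ ¬Father(x, z) ∨ ¬Father(z, y), that is, the rule GrandChild(y, x) ← Father(x, z) ∧ Father(z, y).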
Generalization, θ-Subsumption, Entailment
interesting to consider the relationship between the more_general_than relation and inverse entailment
more_general_than: hj ≥g hk iff (∀x ∈ X)[hk(x) → hj(x)]. A hypothesis can also be expressed as c(x) ← h(x).
θ-subsumption: Consider two clauses Cj and Ck, both of the form H ∨ L1 ∨ ... ∨ Ln, where H is a positive literal and the Li are arbitrary literals. Clause Cj is said to θ-subsume clause Ck iff (∃θ)[Cjθ ⊆ Ck].
Entailment: Consider two clauses Cj and Ck. Clause Cj is said to entail clause Ck (written Cj ⊢ Ck) iff Ck follows deductively from Cj.
Generalization, θ-Subsumption, Entailment
if h1 ≥g h2 then C1 : c(x) ← h1(x) θ-subsumes C2 : c(x) ← h2(x)
furthermore, θ-subsumption can hold even when the clauses have different heads
A : Mother(x, y) ← Father(x, z) ∧ Spouse(z, y)
B : Mother(x, L) ← Father(x, B) ∧ Spouse(B, L) ∧ Female(x)
Aθ ⊆ B if we choose θ = {y/L, z/B}
θ-subsumption is a special case of entailment
A : Elephant(father_of(x)) ← Elephant(x)
B : Elephant(father_of(father_of(y))) ← Elephant(y)
A ⊢ B, but ¬∃θ[Aθ ⊆ B]
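For clauses without function symbols, θ-subsumption can be tested by brute force: try every mapping of the variables of A to terms occurring in B and check Aθ ⊆ B. The sketch below makes several assumptions: literals are (sign, predicate, arguments) triples, lowercase names are variables and capitalized names are constants (as in the Mother example), and B is written with Spouse(B, L) so that the substitution θ = {y/L, z/B} maps every literal of A into B. The Elephant clauses cannot be expressed in this flat encoding.

```python
from itertools import product

def is_var(term):
    """Assumed convention: lowercase names are variables, capitalized ones constants."""
    return term[0].islower()

def apply(literal, theta):
    sign, pred, args = literal
    return (sign, pred, tuple(theta.get(a, a) for a in args))

def theta_subsumes(a, b):
    """Return a substitution θ with Aθ ⊆ B, or None if there is none."""
    variables = sorted({t for _, _, args in a for t in args if is_var(t)})
    terms = sorted({t for _, _, args in b for t in args})
    for image in product(terms, repeat=len(variables)):
        theta = dict(zip(variables, image))
        if all(apply(lit, theta) in b for lit in a):
            return theta
    return None

# clause bodies appear as negative literals: H ← L1 ∧ L2 is H ∨ ¬L1 ∨ ¬L2
clause_a = frozenset({(True, "Mother", ("x", "y")),
                      (False, "Father", ("x", "z")),
                      (False, "Spouse", ("z", "y"))})
clause_b = frozenset({(True, "Mother", ("x", "L")),
                      (False, "Father", ("x", "B")),
                      (False, "Spouse", ("B", "L")),
                      (False, "Female", ("x",))})
theta = theta_subsumes(clause_a, clause_b)
```

The search is exponential in the number of variables of A, which is acceptable for small clauses. It finds x/x, y/L, z/B for the pair above and finds no substitution in the opposite direction, so A is strictly more general than B under θ-subsumption.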
Summary
ILP learns sets of first-order rules directly
sequential covering algorithms learn just one rule at a time and perform many search steps
hence, applicable if data is plentiful
FOIL is a sequential covering algorithm
a specific-to-general search is performed to form the result set
a general-to-specific search is performed to form each new rule
Induction can be viewed as the inverse of deduction
hence, an inverse resolution operator can be found