Knowledge-Based Discovery: Using Semantics in Machine Learning
Bruce Buchanan    Joe Phillips
University of Pittsburgh
buchanan@cs.pitt.edu    josephp@cs.pitt.edu
Intelligent Systems Laboratory
• Faculty: Bruce Buchanan, P.I., John Aronis
• Collaborators: John Rosenberg (Biol.Sci.), Greg Cooper (Medicine), Bob Ferrell (Genetics), Janyce Wiebe (CS), Lou Penrod (Rehab.Med.), Rich Simpson (Rehab.Sci.), Russ Altman (Stanford MIS)
• Research Associates: Joe Phillips, Paul Hodor, Vanathi Gopalakrishnan, Wendy Chapman
• Ph.D. Students: Gary Livingston, Dan Hennessy, Venkat Kolluri, Will Bridewell, Lili Ma
• M.S. Students: Karl Gossett
GOALS
(A) Learn understandable & interesting rules from data
(B) Construct an understandable & coherent model from rules
METHOD: Use background knowledge to search for:
• simple rules with familiar predicates
• interesting and novel rules
• coherent models
Rules or Models: Understandable | Interesting
• Familiar Syntax (conditional rules)
• Syntactically Simple
• Semantically Simple
• Familiar Predicates
• Accurate Predictions
• Meaningful Rules
• Relevant to Question
• Novel
• Cost-Effective
• Coherent Model
The RL Program
[Diagram: Training Examples, an Explicit Bias, and a Partial Domain Model feed RL, which outputs RULES; a Performance Program applies the RULES to New Cases to produce Predictions; HAMB assembles the RULES into a MODEL]
(A) Individual Rules
• J. Phillips
• Rehabilitation Medicine Data
Simple single rules
• Syntactic Simplicity
  – Fewer terms on the LHS
• Explicitly stated constraints (rules with no more than N terms)
• Tagged attributes (e.g. must have at least one control attribute to be interesting)
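The explicit-bias constraints above can be sketched as a simple rule filter. This is a minimal illustration, not the actual RL implementation; the attribute names and rule format are made up:

```python
# Illustrative sketch of explicit syntactic bias (not the actual RL code).
# A rule's LHS is a list of (attribute, operator, value) terms.

MAX_TERMS = 3                                   # "no more than N terms"
CONTROL_ATTRS = {"admit", "general_condition"}  # attributes tagged "control"

def acceptable(lhs):
    """Keep a rule only if it is short and mentions a control attribute."""
    if len(lhs) > MAX_TERMS:
        return False
    return any(attr in CONTROL_ATTRS for attr, _, _ in lhs)

candidates = [
    [("age", ">", 65), ("race", "=", "X")],        # rejected: no control attribute
    [("admit", "=", "stroke"), ("age", ">", 65)],  # kept
    [("age", ">", 65), ("sex", "=", "F"),
     ("race", "=", "X"), ("time", ">", 5)],        # rejected: too many terms
]
kept = [lhs for lhs in candidates if acceptable(lhs)]
```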
Simple sets of rules
• Syntactic simplicity
  – Fewer rules:
    • independent rules
    • E.g. in physics:
      U(x) = Ugravity(x) + Uelectronic(x) + Umagnetic(x)
• HAMB removes highly similar terms from feature set
  – Less independence when there's feedback
    • e.g. medicine
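HAMB's removal of highly similar terms could look roughly like the sketch below. The pairwise-agreement test and the feature names are assumptions for illustration, not HAMB's actual algorithm:

```python
# Sketch of HAMB-style redundancy removal (assumed behavior): drop an
# attribute whose values nearly duplicate an already-kept attribute.

def prune_similar(features, threshold=0.9):
    """features: dict name -> list of 0/1 values.  Returns kept names."""
    kept = []
    for name, col in features.items():
        similar = False
        for other in kept:
            agree = sum(a == b for a, b in zip(col, features[other])) / len(col)
            if agree >= threshold:
                similar = True
                break
        if not similar:
            kept.append(name)
    return kept

feats = {
    "charge":       [1, 1, 0, 1, 0],
    "charge_w_his": [1, 1, 0, 1, 0],   # duplicates "charge" -> pruned
    "hydrophobic":  [0, 1, 1, 0, 0],
}
```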
Interestingness:
• Given, controlled and observed
  – explicitly state observed attributes as the interesting target
• Temporal
  – future (or distant past) predictions are interesting
• Influence diagram (e.g. Bayes net)
  – strong but more indirect influences are interesting
Using typed attribute background knowledge
• Organize terms into "given", "controlled" and "observed"
  – E.g. in the medical domain: "demographics", "intervention" and "outcome"
• Benefits:
  – Categorization of rules by whether they use givens (default), controls (controllable) or both (conditionally controllable)
Typed attribute example
• Rehab. (RL; Phillips, Buchanan, Penrod)
• > 2000 records
given          controlled           observed
demographic    medical              temporal medical
age            admit                time
race           general_condition    rate
sex            specific_condition   (absolute / normalized)
Example interestingness:
• Group rules by whether they predict by medical, demographic or both:
  – by medical:
    • Left_Body_Stroke => poor improvement (interesting, expected)
  – by demographic:
    • High_age => poor improvement (interesting, expected)
    • (Race=X) => poor improvement (interesting, NOT expected)
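The categorization by attribute type can be sketched directly. Attribute names here are illustrative, not the actual rehabilitation schema:

```python
# Sketch: categorize a rule by the types of its LHS attributes
# ("given" vs. "controlled"), as described for the rehab data.

TYPES = {"age": "given", "race": "given", "sex": "given",
         "admit": "controlled", "general_condition": "controlled"}

def category(lhs_attrs):
    """default = givens only; controllable = controls only; else both."""
    used = {TYPES[a] for a in lhs_attrs}
    if used == {"given"}:
        return "default"
    if used == {"controlled"}:
        return "controllable"
    return "conditionally controllable"
```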
Using temporal background knowledge
• Organize data by time
  – Utility may or may not extend to other metric spaces (e.g. space, mass)
• Benefits:
  – Predictions parameterized by time: f(t)
    • Future or distant past may be interesting
  – Cyclical patterns
Temporal example
• Geophysics (Scienceomatic; Phillips 2000)
  – Subduction zone discoveries of the form:
    d(q_after) = d(q_main) + m·[t(q_after) − t(q_main)] + b
  – NOTE: This is not an accurate prediction!
  – Interesting: generally quakes can't be predicted
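Fitting the slope m and intercept b of such a linear pattern is ordinary least squares; a minimal sketch, with made-up aftershock times and distances:

```python
# Least-squares fit of the aftershock pattern
#   d(q_after) - d(q_main) = m * [t(q_after) - t(q_main)] + b
# Event data below are invented for illustration.

def fit_line(xs, ys):
    """Return slope m and intercept b minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx

t_main, d_main = 0.0, 10.0
aftershocks = [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0)]   # (t, d) pairs
xs = [t - t_main for t, _ in aftershocks]
ys = [d - d_main for _, d in aftershocks]
m, b = fit_line(xs, ys)
```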
Using influence diagram background knowledge
• This is future work!
• Organize terms to follow a pre-existing influence diagram
  – E.g. Bayesian nets, but conditional probabilities are not needed
• Benefits:
  – Suggest hidden variables, new influences
    • f(x) => f'(x, y)
Interestingness summary
• How different types of background knowledge help us achieve interestingness:
  – Explicitly stated: "observed" attributes
  – Implicitly stated: parameterized equations with "interesting" parameters
  – Learned: "new" influence factors
(B) Coherent Models
• B.Buchanan
• Protein Data
EXAMPLE: Predicting Ca++ Binding Sites
(G.Livingston)
Given: 3-D descriptions of 16 sites in proteins that bind calcium ions & 100 other sites that do not
Find: a model that allows predicting whether a proposed new site will bind Ca++ [in terms of a subset of 63 attributes]
Ca++ binding sites in proteins: SOME ATTRIBUTES
ATOM-NAME-IS-C ATOM-NAME-IS-O CHARGE CHARGE-WITH-HIS HYDROPHOBICITY MOBILITY RESIDUE-CLASS1-IS-CHARGED RESIDUE-CLASS1-IS-HYDROPHOBIC RESIDUE-CLASS2-IS-ACIDIC RESIDUE-CLASS2-IS-NONPOLAR RESIDUE-CLASS2-IS-UNKNOWN
RESIDUE-NAME-IS-ASP RESIDUE-NAME-IS-GLU RESIDUE-NAME-IS-HOH RESIDUE-NAME-IS-LEU RESIDUE-NAME-IS-VAL RING-SYSTEM SECONDARY-STRUCTURE1-IS-4-HELIX SECONDARY-STRUCTURE1-IS-BEND SECONDARY-STRUCTURE1-IS-HET SECONDARY-STRUCTURE1-IS-TURN SECONDARY-STRUCTURE2-IS-BETA SECONDARY-STRUCTURE2-IS-HET VDW-VOLUME
Predicting Ca++ Binding Sites
Semantic types of attributes, e.g.:
• Physical: solvent accessibility, charge, VDW volume
• Chemical: heteroatom, oxygen, carbonyl, ASN
• Structural: helix, beta-turn, ring-system, mobility
Coherent Model = subset of locally acceptable rules that
• explains as much of the data as possible
• uses entrenched predicates [Goodman]
• uses predicates of the same semantic type
• uses predicates of the same grain size
• uses classes AND their complements
• avoids rules that are "too similar": identical; subsuming; semantically close
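One simple way to realize these criteria is a greedy cover that skips near-duplicate rules. The sketch below uses identical-LHS-attributes as a (much simplified) similarity test; the rule format is invented for illustration:

```python
# Sketch of assembling a coherent model: greedily keep rules that add
# coverage, skipping rules "too similar" to ones already chosen.

def build_model(rules, examples):
    """rules: list of (lhs_attrs, predicate).  Greedy set cover."""
    chosen, covered = [], set()
    for lhs, pred in rules:
        if any(lhs == l for l, _ in chosen):   # too similar: identical LHS
            continue
        newly = {i for i, ex in enumerate(examples)
                 if i not in covered and pred(ex)}
        if newly:                              # must explain more of the data
            chosen.append((lhs, pred))
            covered |= newly
    return chosen, covered

examples = [{"oxy": 7}, {"oxy": 3}, {"oxy": 9}]
rules = [({"oxy"}, lambda ex: ex["oxy"] > 6.5),
         ({"oxy"}, lambda ex: ex["oxy"] > 8.0)]   # redundant with the first
chosen, covered = build_model(rules, examples)
```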
EXAMPLE: predict Ca++ binding sites in proteins
158 rules found independently. E.g.,
R1: IF a site's (a) charge > 18.5 AND (b) no. of C=O > 18.75 THEN it binds calcium
R2: IF a site's (a) charge > 18.5 AND (b) no. of ASN > 15 THEN it binds calcium
Predicting Ca++ Binding Sites
Semantic network of attributes:
Heteroatoms
  Sulfur: SH (CYS)
  Oxygen: "Hydroxyl" OH (SER, THR, TYR); Carbonyl (ASP, GLU)
  Nitrogen: Amide (ASN, GLN); Amine (... PRO)
Ca++ binding sites in proteins
58 rules above threshold (threshold = at least 80% TP AND no more than 20% FP):
  42 rules predict SITE
  16 rules predict NON-SITE
Average accuracy over five 5-fold cross-validations = 100% for the redundant model with 58 rules
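The threshold test can be written directly; a sketch over invented site data, with the 80%/20% cutoffs from the slide:

```python
# Sketch of the rule-quality threshold: keep a rule only if it fires on
# at least 80% of the positives and at most 20% of the negatives.

def above_threshold(rule, positives, negatives, tp_min=0.80, fp_max=0.20):
    tp = sum(rule(ex) for ex in positives) / len(positives)
    fp = sum(rule(ex) for ex in negatives) / len(negatives)
    return tp >= tp_min and fp <= fp_max

positives = [{"oxy": 8}, {"oxy": 9}, {"oxy": 7}, {"oxy": 10}, {"oxy": 8}]
negatives = [{"oxy": 3}] * 9 + [{"oxy": 7}]
rule = lambda ex: ex["oxy"] > 6.5   # TP rate 1.0, FP rate 0.1 -> kept
```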
Predicting Ca++ Binding Sites
Prefer complementary rules -- e.g.,
R59:  IF, within 5 Å of a site, # oxygens > 6.5  THEN it binds calcium
R101: IF, within 5 Å of a site, # oxygens <= 6.5 THEN it does NOT bind calcium
5 Å Radius Model
Five perfect rules* (*100% of TPs and 0 FPs):
R1. #Oxygen LE 6.5 --> NON-SITE
R2. Hydrophobicity GT -8.429 --> NON-SITE
R3. #Oxygen GT 6.5 --> SITE
R4. Hydrophobicity LE -8.429 --> SITE
R5. #Carbonyl GT 4.5 & #Peptide LE 10.5 --> SITE
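The five rules transcribe directly into predicates. The slide reports each rule is individually perfect on the data, so rule order does not matter there; this sketch simply returns the label of the first rule that fires (feature names are paraphrased from the slide):

```python
# The 5 Å radius model's five rules, transcribed as Python predicates.

def predict(site):
    """Return 'SITE' or 'NON-SITE' by the first rule that fires."""
    rules = [
        (lambda s: s["oxygen"] <= 6.5,                            "NON-SITE"),  # R1
        (lambda s: s["hydrophobicity"] > -8.429,                  "NON-SITE"),  # R2
        (lambda s: s["oxygen"] > 6.5,                             "SITE"),      # R3
        (lambda s: s["hydrophobicity"] <= -8.429,                 "SITE"),      # R4
        (lambda s: s["carbonyl"] > 4.5 and s["peptide"] <= 10.5,  "SITE"),      # R5
    ]
    for cond, label in rules:
        if cond(site):
            return label
```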
Final Result: Ca++ binding sites in proteins
Model with 5 rules:
• same accuracy
• no unique predicates
• no subsumed or very similar rules
• more general rules for SITES (prior prob. < 0.01)
• more specific rules for NON-SITES (prior prob. > 0.99)
Predicting Ca++ Binding Sites
Attribute Hierarchies
RESIDUE CLASS 1
  POLAR (ASN, CYS, GLN, HIS, SER, THR, TYR, TRP, GLY)
  CHARGED (ARG, ASP, GLU, LYS)
  HYDROPHOBIC (ALA, ILE, LEU, MET, PHE, PRO, VAL)
Attribute Hierarchies
RESIDUE CLASS 2
  POLAR (ASN, CYS, GLN, HIS, SER, THR, TYR, TRP, GLY)
  CHARGED
    ACIDIC (ASP, GLU)
    BASIC (ARG, LYS)
  NONPOLAR (ALA, ILE, LEU, MET, PHE, PRO, VAL)
  TRP
  HIS
CONCLUSION
Induction systems can be augmented with semantic criteria to provide:
(A) interesting & understandable rules
  • syntactically simple
  • meaningful
(B) coherent models
  • equally predictive
  • closer to a theory
CONCLUSION
• We have shown
  – how specific types of background knowledge might be incorporated in the rule discovery process
  – possible benefits of incorporating those types of knowledge:
    • more coherent models
    • more understandable models
    • more accurate models