1/24
Learning to Extract Genic Interactions
Using Gleaner
LLL05 Workshop, 7 August 2005, ICML 2005, Bonn, Germany
Mark Goadrich, Louis Oliphant and Jude Shavlik
Department of Computer Sciences
University of Wisconsin – Madison, USA
2/24
Learning Language in Logic (LLL) Biomedical Information Extraction Challenge
- Two tasks: with and without co-reference
- 80 sentences for training, 40 sentences for testing
Our approach: Gleaner (ILP '04)
- Fast ensemble ILP algorithm
- Focused on recall and precision evaluation
3/24
A Sample Positive Example
Given: Medical journal abstracts tagged with genic interaction relations
Do: Construct a system to extract genic interaction phrases from unseen text
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
4/24
What is a Negative Example?
All unlabeled word pairings? Wastes time with irrelevant words
We know the testset will include a dictionary
Use only unlabeled pairings of words in the dictionary (sketch below)
- 106 positive, 414 negative without co-reference
- 59 positive, 261 negative with co-reference
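To make the construction concrete, here is a minimal Python sketch of this negative-generation scheme, assuming the dictionary is a set of entity names and the positives are the labeled (agent, target) pairs for the sentence; all identifiers here are hypothetical, not from the actual LLL tooling:

```python
from itertools import product

def candidate_negatives(sentence_words, dictionary, positive_pairs):
    """Sketch: every unlabeled ordered (agent, target) pairing of
    dictionary words occurring in one sentence becomes a negative."""
    entities = [w for w in sentence_words if w in dictionary]
    return [(agent, target)
            for agent, target in product(entities, repeat=2)
            if agent != target and (agent, target) not in positive_pairs]
```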
5/24
Tagging and Parsing
[Figure: parse of "ykuD was transcribed by SigK RNA …", showing part-of-speech tags (noun, verb, prep) and the phrase structure (sentence → noun phrase, verb phrase, prep phrase, noun phrase)]
6/24
Some Additional Predicates
High-scoring words in agent phrases: depend, bind, protein, …
High-scoring words in target phrases: gene, promote, product, …
High-scoring words BETWEEN agent & target: negative, regulate, transcribe, …
Medical Subject Headings (MeSH): a canonical scheme for indexing biomedical articles, e.g. in_mesh(RNA), in_mesh(gene)
7/24
Even More Predicates
Lexical Predicates (sketch below)
- internal_caps(Word), alphanumeric(Word)
Look-ahead Phrase Predicates
- few_POS_in_phrase(Phrase, POS)
- phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
- phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
- agent_before_target(ExampleID)
- word_pair_in_between_target_phrases(ExampleID, W1, W2)
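The lexical predicates are Prolog background knowledge in the actual system; the sketch below gives an assumed Python reading of two of them, based only on their names:

```python
def internal_caps(word: str) -> bool:
    """Sketch: the word has a capital letter after its first character,
    as in 'ykuD' or 'SigK' (assumed reading of the slide's predicate)."""
    return any(c.isupper() for c in word[1:])

def alphanumeric(word: str) -> bool:
    """Sketch: the word mixes letters and digits, as in 'T4'
    (another assumed reading)."""
    return (any(c.isalpha() for c in word)
            and any(c.isdigit() for c in word))
```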
8/24
Enriched Data From Committee
Link Parser (CMU) creates a parse tree
Root lemma of each word (not used)
27 Syntactic Information Predicates
- complement_of_N_N(Word, Word)
- modifier_ADV_V(Word, Word)
- object_V_Passive_N(Word, Word)
9/24
Gleaner
Definition of Gleaner: one who gathers grain left behind by reapers
Key Ideas of Gleaner
- Use Aleph as the underlying ILP clause engine
- Keep the wide range of clauses usually discarded
- Create separate theories for different recall ranges
10/24
Aleph - Background
Seed Example: a positive example that our clause must cover
Bottom Clause: all predicates which are true about the seed example
[Figure: a seed example of agent_target(A, T, S) with the bottom clause built around it]
11/24
Aleph - Learning
Aleph learns theories of clauses (Srinivasan, v4, 2003)
- Pick a positive seed example, find its bottom clause
- Use heuristic search to find the best clause
- Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered (sketch below)
A theory produces one recall-precision point
- Learning complete theories is time-consuming
- Can produce a ranking with ensembles
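A simplified Python sketch of this covering loop (not Aleph's actual implementation; search_best_clause and covers are assumed interfaces):

```python
import random

def covering_loop(positives, negatives, search_best_clause, threshold=0.95):
    """Sketch of the covering loop: repeatedly saturate a seed, search for
    the best clause below its bottom clause, and remove covered positives.
    Terminates because each learned clause covers at least its own seed."""
    theory = []
    uncovered = list(positives)
    while len(uncovered) > (1 - threshold) * len(positives):
        seed = random.choice(uncovered)            # pick a positive seed
        clause = search_best_clause(seed, positives, negatives)
        theory.append(clause)
        uncovered = [p for p in uncovered if not clause.covers(p)]
    return theory
```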
12/24
Gleaner - Background
Rapid Random Restart (Zelezny et al., ILP 2002)
- Stochastic selection of an initial clause
- Time-limited local heuristic search
- Randomly choose a new initial clause and repeat (sketch below)
[Figure: two random initial clauses and their local searches within one seed's search space]
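A minimal Python sketch of the restart loop, with random_clause, neighbors, and score as assumed interfaces standing in for the real clause-space operations:

```python
import time

def rapid_random_restart(random_clause, neighbors, score,
                         restarts=10, budget_s=1.0):   # illustrative defaults
    """Sketch of Rapid Random Restart: draw a stochastic initial clause,
    run time-limited local hill-climbing, then restart from a new clause."""
    best = None
    for _ in range(restarts):
        clause = random_clause()
        deadline = time.monotonic() + budget_s          # time-limited search
        while time.monotonic() < deadline:
            step = max(neighbors(clause), key=score, default=None)
            if step is None or score(step) <= score(clause):
                break                                   # local optimum reached
            clause = step
        if best is None or score(clause) > score(best):
            best = clause
    return best
```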
13/24
Gleaner - Learning
- Create B bins
- Generate clauses
- Record the best clause per bin
- Repeat for K seeds (sketch below)
[Figure: recall-precision plot of generated clauses, partitioned into recall bins with the best clause kept in each]
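A Python sketch of this bookkeeping, assuming search_clauses(seed) yields every clause considered during a seed's search and evaluate(clause) returns training-set (recall, precision); run over K seeds it leaves up to K clauses in each bin:

```python
def gleaner_bins(seeds, search_clauses, evaluate, B=20):
    """Sketch of Gleaner's binning: for each seed, keep the best clause
    (by precision) seen in each of B recall bins, accumulating up to
    K clauses per bin across the K seeds."""
    bins = [[] for _ in range(B)]
    for seed in seeds:
        best = [None] * B                   # best (precision, clause) this seed
        for clause in search_clauses(seed):
            recall, precision = evaluate(clause)
            b = min(int(recall * B), B - 1)
            if best[b] is None or precision > best[b][0]:
                best[b] = (precision, clause)
        for b in range(B):
            if best[b] is not None:
                bins[b].append(best[b][1])
    return bins
```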
14/24
Gleaner - Combining
Combine the K clauses per bin: if at least L of the K clauses match an example, call it positive (sketch below)
How to choose L?
- L = 1: high recall, low precision
- L = K: low recall, high precision
We want a collection of high-precision theories spanning the space of recall levels
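The combining rule itself is small; a sketch, with covers as an assumed interface:

```python
def bin_predicts_positive(example, bin_clauses, L):
    """Gleaner's combining rule for one recall bin: positive iff at
    least L of the bin's K clauses match the example."""
    matches = sum(clause.covers(example) for clause in bin_clauses)
    return matches >= L
```

Sweeping L from K down to 1 within a bin trades precision for recall, so each bin contributes its own short recall-precision curve.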
15/24
Gleaner - Overlap
Take the topmost curve of the overlapping theories (sketch below)
[Figure: recall-precision curves of the per-bin theories overlapping; the topmost curve is kept]
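One simple reading of "topmost curve", assuming each theory is summarized as a list of (recall, precision) points, is a pointwise maximum over precision:

```python
def topmost_curve(theory_curves):
    """Sketch: at each recall value, keep the best precision that any
    bin's theory achieves there."""
    best = {}
    for curve in theory_curves:
        for recall, precision in curve:
            best[recall] = max(precision, best.get(recall, 0.0))
    return sorted(best.items())
```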
16/24
Gleaner - Practical Use
- Generate the recall-precision curve
- User selects a recall bin
- Return classifications with L-of-K confidence (sketch below)
[Figure: recall-precision curve with a user-selected operating point at Recall = 0.50, Precision = 0.70]
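Putting the pieces together, a hypothetical end-to-end call for this practical-use step (bins as produced by the binning sketch earlier; covers assumed; B = 20 matches the recall-bin count used in the experiments):

```python
def classify_at_recall(examples, bins, chosen_recall, L, B=20):
    """Sketch: map the user's chosen recall level to its bin, then
    label each test example by that bin's at-least-L-of-K vote."""
    b = min(int(chosen_recall * B), B - 1)      # e.g. recall 0.50 -> bin 10
    return [ex for ex in examples
            if sum(c.covers(ex) for c in bins[b]) >= L]
```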
17/24
Sample Extraction Clause
agent_target(Agent, Target, Sentence) :-
    several_phrases_in_sentence(Sentence),
    some_wordPOS_in_sentence(Sentence, novelword),
    n(Agent),
    alphabetic(Agent),
    word_parent(Agent, F),
    phrase_contains_internal_cap_word(F, noun, _),
    few_POS_in_phrase(F, novelword),
    in_between_target_phrases(Agent, Target, _),
    n(Target).
0.14 recall, 0.93 precision on the without-co-reference training set
18/24
Sample Extraction Clause
agent_target(Agent, Target, Sentence) :-
    avg_length_sentence(Sentence),
    n(Agent),
    word_previous(Target, _),
    in_between_target_phrases(Agent, Target, _).
0.76 recall, 0.49 precision on the without-co-reference training set
19/24
Experimental Methodology
- Used the other task's trainset as the tuneset in both cases
- Testset unlabeled, but a dictionary was provided
- Included sentences with no positives: 936 total testset examples generated
Parameter Settings
- Gleaner (20 recall bins): seeds = 100, clauses = 25,000
- Aleph (0.75 minimum accuracy): nodes = {1K, 25K}
20/24
LLL Without Co-reference Results
[Figure: testset recall-precision curves, both axes 0.0 to 1.0, comparing Gleaner Basic, Gleaner Enriched, and Aleph Basic 1K]
21/24
LLL With Co-reference Results
[Figure: testset recall-precision curves, both axes 0.0 to 1.0, comparing Gleaner Basic, Gleaner Enriched, and Aleph Basic 1K]
22/24
We Need More Datasets
The LLL Challenge task is small
- Would prefer to do cross-validation
- Need labels for the testset
Our ILP '04 dataset is open to the community:
ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Other biomedical information-extraction tasks:
- Genetic Disorder (Ray and Craven 2001)
- Genia
- BioCreAtiVe
23/24
Conclusions
Contributions
- Developed a large amount of background knowledge
- Exploited normally discarded clauses
- Visually presented the precision-recall trade-off
Proposed Work
- Achieve gains in high-recall areas
- Reduce overfitting when using enriched data
- Increase the diversity of learned clauses
24/24
Acknowledgements
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- UW Condor Group
- David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jessie Davis