1/24
Learning to Extract Genic Interactions
Using Gleaner
LLL05 Workshop, 7 August 2005, ICML 2005, Bonn, Germany
Mark Goadrich, Louis Oliphant and Jude Shavlik
Department of Computer Sciences
University of Wisconsin – Madison, USA
2/24
Learning Language in Logic (LLL) Biomedical Information Extraction Challenge
- Two tasks: with and without co-reference
- 80 sentences for training, 40 sentences for testing
Our approach: Gleaner (ILP '04)
- Fast ensemble ILP algorithm
- Focused on recall and precision evaluation
3/24
A Sample Positive Example
Given: Medical journal abstracts tagged with genic interaction relations
Do: Construct a system to extract genic interaction phrases from unseen text
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
4/24
What is a Negative Example?
All unlabeled word pairings? Wastes time with irrelevant words
We know the testset will include a dictionary
Use only unlabeled pairings of words in the dictionary (sketch below)
- 106 positive, 414 negative without co-reference
- 59 positive, 261 negative with co-reference
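To make the construction concrete, here is a minimal Python sketch of this negative-generation scheme, assuming the dictionary is a set of entity names and the positives are the labeled (agent, target) pairs for the sentence; all identifiers here are hypothetical, not from the actual LLL tooling:

```python
from itertools import product

def candidate_negatives(sentence_words, dictionary, positive_pairs):
    """Sketch: every unlabeled ordered (agent, target) pairing of
    dictionary words occurring in one sentence becomes a negative."""
    entities = [w for w in sentence_words if w in dictionary]
    return [(agent, target)
            for agent, target in product(entities, repeat=2)
            if agent != target and (agent, target) not in positive_pairs]
```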
5/24
Tagging and Parsing
[Figure: parse of "ykuD was transcribed by SigK RNA …", showing part-of-speech tags (noun, verb, prep) and the phrase structure (sentence → noun phrase, verb phrase, prep phrase, noun phrase)]
6/24
Some Additional Predicates
High-scoring words in agent phrases: depend, bind, protein, …
High-scoring words in target phrases: gene, promote, product, …
High-scoring words BETWEEN agent & target: negative, regulate, transcribe, …
Medical Subject Headings (MeSH): a canonical scheme for indexing biomedical articles, e.g. in_mesh(RNA), in_mesh(gene)
7/24
Even More Predicates
Lexical Predicates (sketch below)
- internal_caps(Word), alphanumeric(Word)
Look-ahead Phrase Predicates
- few_POS_in_phrase(Phrase, POS)
- phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
- phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
- agent_before_target(ExampleID)
- word_pair_in_between_target_phrases(ExampleID, W1, W2)
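The lexical predicates are Prolog background knowledge in the actual system; the sketch below gives an assumed Python reading of two of them, based only on their names:

```python
def internal_caps(word: str) -> bool:
    """Sketch: the word has a capital letter after its first character,
    as in 'ykuD' or 'SigK' (assumed reading of the slide's predicate)."""
    return any(c.isupper() for c in word[1:])

def alphanumeric(word: str) -> bool:
    """Sketch: the word mixes letters and digits, as in 'T4'
    (another assumed reading)."""
    return (any(c.isalpha() for c in word)
            and any(c.isdigit() for c in word))
```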
8/24
Enriched Data From Committee
Link Parser (CMU) creates a parse tree
Root lemma of each word (not used)
27 Syntactic Information Predicates
- complement_of_N_N(Word, Word)
- modifier_ADV_V(Word, Word)
- object_V_Passive_N(Word, Word)
9/24
Gleaner
Definition of Gleaner: one who gathers grain left behind by reapers
Key Ideas of Gleaner
- Use Aleph as the underlying ILP clause engine
- Keep the wide range of clauses usually discarded
- Create separate theories for different recall ranges
10/24
Aleph - Background
Seed Example: a positive example that our clause must cover
Bottom Clause: all predicates which are true about the seed example
[Figure: a seed example of agent_target(A, T, S) with the bottom clause built around it]
11/24
Aleph - Learning
Aleph learns theories of clauses (Srinivasan, v4, 2003)
- Pick a positive seed example, find its bottom clause
- Use heuristic search to find the best clause
- Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered (sketch below)
A theory produces one recall-precision point
- Learning complete theories is time-consuming
- Can produce a ranking with ensembles
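A simplified Python sketch of this covering loop (not Aleph's actual implementation; search_best_clause and covers are assumed interfaces):

```python
import random

def covering_loop(positives, negatives, search_best_clause, threshold=0.95):
    """Sketch of the covering loop: repeatedly saturate a seed, search for
    the best clause below its bottom clause, and remove covered positives.
    Terminates because each learned clause covers at least its own seed."""
    theory = []
    uncovered = list(positives)
    while len(uncovered) > (1 - threshold) * len(positives):
        seed = random.choice(uncovered)            # pick a positive seed
        clause = search_best_clause(seed, positives, negatives)
        theory.append(clause)
        uncovered = [p for p in uncovered if not clause.covers(p)]
    return theory
```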
12/24
Gleaner - Background
Rapid Random Restart (Zelezny et al., ILP 2002)
- Stochastic selection of an initial clause
- Time-limited local heuristic search
- Randomly choose a new initial clause and repeat (sketch below)
[Figure: two random initial clauses and their local searches within one seed's search space]
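A minimal Python sketch of the restart loop, with random_clause, neighbors, and score as assumed interfaces standing in for the real clause-space operations:

```python
import time

def rapid_random_restart(random_clause, neighbors, score,
                         restarts=10, budget_s=1.0):   # illustrative defaults
    """Sketch of Rapid Random Restart: draw a stochastic initial clause,
    run time-limited local hill-climbing, then restart from a new clause."""
    best = None
    for _ in range(restarts):
        clause = random_clause()
        deadline = time.monotonic() + budget_s          # time-limited search
        while time.monotonic() < deadline:
            step = max(neighbors(clause), key=score, default=None)
            if step is None or score(step) <= score(clause):
                break                                   # local optimum reached
            clause = step
        if best is None or score(clause) > score(best):
            best = clause
    return best
```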
13/24
Gleaner - Learning
- Create B bins
- Generate clauses
- Record the best clause per bin
- Repeat for K seeds (sketch below)
[Figure: recall-precision plot of generated clauses, partitioned into recall bins with the best clause kept in each]
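A Python sketch of this bookkeeping, assuming search_clauses(seed) yields every clause considered during a seed's search and evaluate(clause) returns training-set (recall, precision); run over K seeds it leaves up to K clauses in each bin:

```python
def gleaner_bins(seeds, search_clauses, evaluate, B=20):
    """Sketch of Gleaner's binning: for each seed, keep the best clause
    (by precision) seen in each of B recall bins, accumulating up to
    K clauses per bin across the K seeds."""
    bins = [[] for _ in range(B)]
    for seed in seeds:
        best = [None] * B                   # best (precision, clause) this seed
        for clause in search_clauses(seed):
            recall, precision = evaluate(clause)
            b = min(int(recall * B), B - 1)
            if best[b] is None or precision > best[b][0]:
                best[b] = (precision, clause)
        for b in range(B):
            if best[b] is not None:
                bins[b].append(best[b][1])
    return bins
```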
14/24
Gleaner - Combining
Combine the K clauses per bin: if at least L of the K clauses match an example, call it positive (sketch below)
How to choose L?
- L = 1: high recall, low precision
- L = K: low recall, high precision
We want a collection of high-precision theories spanning the space of recall levels
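The combining rule itself is small; a sketch, with covers as an assumed interface:

```python
def bin_predicts_positive(example, bin_clauses, L):
    """Gleaner's combining rule for one recall bin: positive iff at
    least L of the bin's K clauses match the example."""
    matches = sum(clause.covers(example) for clause in bin_clauses)
    return matches >= L
```

Sweeping L from K down to 1 within a bin trades precision for recall, so each bin contributes its own short recall-precision curve.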
15/24
Gleaner - Overlap
Take the topmost curve of the overlapping theories (sketch below)
[Figure: recall-precision curves of the per-bin theories overlapping; the topmost curve is kept]
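One simple reading of "topmost curve", assuming each theory is summarized as a list of (recall, precision) points, is a pointwise maximum over precision:

```python
def topmost_curve(theory_curves):
    """Sketch: at each recall value, keep the best precision that any
    bin's theory achieves there."""
    best = {}
    for curve in theory_curves:
        for recall, precision in curve:
            best[recall] = max(precision, best.get(recall, 0.0))
    return sorted(best.items())
```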
16/24
Gleaner - Practical Use
- Generate the recall-precision curve
- User selects a recall bin
- Return classifications with L-of-K confidence (sketch below)
[Figure: recall-precision curve with a user-selected operating point at Recall = 0.50, Precision = 0.70]
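Putting the pieces together, a hypothetical end-to-end call for this practical-use step (bins as produced by the binning sketch earlier; covers assumed; B = 20 matches the recall-bin count used in the experiments):

```python
def classify_at_recall(examples, bins, chosen_recall, L, B=20):
    """Sketch: map the user's chosen recall level to its bin, then
    label each test example by that bin's at-least-L-of-K vote."""
    b = min(int(chosen_recall * B), B - 1)      # e.g. recall 0.50 -> bin 10
    return [ex for ex in examples
            if sum(c.covers(ex) for c in bins[b]) >= L]
```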
17/24
Sample Extraction Clause
agent_target(Agent, Target, Sentence) :-
    several_phrases_in_sentence(Sentence),
    some_wordPOS_in_sentence(Sentence, novelword),
    n(Agent),
    alphabetic(Agent),
    word_parent(Agent, F),
    phrase_contains_internal_cap_word(F, noun, _),
    few_POS_in_phrase(F, novelword),
    in_between_target_phrases(Agent, Target, _),
    n(Target).
0.14 recall, 0.93 precision on the without-co-reference training set
18/24
Sample Extraction Clause
agent_target(Agent, Target, Sentence) :-
    avg_length_sentence(Sentence),
    n(Agent),
    word_previous(Target, _),
    in_between_target_phrases(Agent, Target, _).
0.76 recall, 0.49 precision on the without-co-reference training set
19/24
Experimental Methodology
- Used the other task's trainset as the tuneset in both cases
- Testset unlabeled, but a dictionary was provided
- Included sentences with no positives: 936 total testset examples generated
Parameter Settings
- Gleaner (20 recall bins): seeds = 100, clauses = 25,000
- Aleph (0.75 minimum accuracy): nodes = {1K, 25K}
20/24
LLL Without Co-reference Results
[Figure: testset recall-precision curves, both axes 0.0 to 1.0, comparing Gleaner Basic, Gleaner Enriched, and Aleph Basic 1K]
21/24
LLL With Co-reference Results
[Figure: testset recall-precision curves, both axes 0.0 to 1.0, comparing Gleaner Basic, Gleaner Enriched, and Aleph Basic 1K]
22/24
We Need More Datasets
The LLL Challenge task is small
- Would prefer to do cross-validation
- Need labels for the testset
Our ILP '04 dataset is open to the community:
ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Other biomedical information-extraction tasks:
- Genetic Disorder (Ray and Craven 2001)
- Genia
- BioCreAtiVe
23/24
Conclusions
Contributions
- Developed a large amount of background knowledge
- Exploited normally discarded clauses
- Visually presented the precision-recall trade-off
Proposed Work
- Achieve gains in high-recall areas
- Reduce overfitting when using enriched data
- Increase the diversity of learned clauses
24/24
Acknowledgements
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- UW Condor Group
- David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jessie Davis