
Relational Representations

Daniel Lowd University of Oregon

April 20, 2015

Caveats

•  The purpose of this talk is to inspire meaningful discussion.

•  I may be completely wrong.

My background: Markov logic networks, probabilistic graphical models

Q: Why relational representations?

A: To model relational data.

Relational Data

•  A relation is a set of n-tuples:

Friends: {(Anna, Bob), (Bob, Anna), (Bob, Chris)}
Smokes: {(Bob), (Chris)}
Grade: {(Anna, CS612, Fall2012, “A+”), …}

•  Relations can be visualized as tables:

Friends
Anna   Bob
Bob    Anna
Bob    Chris

Smokes
Bob
Chris

•  Typically make closed world assumption: all tuples not listed are false.
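As a minimal sketch of the closed world assumption, the Friends and Smokes relations can be stored as Python sets of tuples. The helper name `holds` is an assumption introduced here for illustration; it is not part of any relational learning library.

```python
# Relations from the slide, stored as sets of tuples.
friends = {("Anna", "Bob"), ("Bob", "Anna"), ("Bob", "Chris")}
smokes = {("Bob",), ("Chris",)}


def holds(relation, *args):
    """Closed world assumption: a tuple is true only if it is listed."""
    return tuple(args) in relation


print(holds(friends, "Anna", "Bob"))   # listed, so true
print(holds(friends, "Chris", "Bob"))  # not listed, so false
print(holds(smokes, "Anna"))           # not listed, so false
```

Under an open world reading, the last two queries would be "unknown" rather than false.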

Relational Knowledge

•  First-order logic
•  Description logic
•  Logic programs

General form: A set of rules of the form “For every tuple of objects (x1, x2, …, xk), certain relationships hold.”

e.g., For every pair of objects (x,y), if Friends(x,y) is true then Friends(y,x) is true.
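One way to read such a rule is as a check over every grounding, that is, every pair of objects. The sketch below tests the symmetry rule against the example Friends relation; the function name `symmetry_violations` is an illustrative assumption.

```python
from itertools import product

friends = {("Anna", "Bob"), ("Bob", "Anna"), ("Bob", "Chris")}
# The domain is every object mentioned in the relation.
objects = sorted({name for pair in friends for name in pair})


def symmetry_violations(relation, domain):
    """Return pairs (x, y) where Friends(x, y) holds but Friends(y, x) does not."""
    return [
        (x, y)
        for x, y in product(domain, repeat=2)
        if (x, y) in relation and (y, x) not in relation
    ]


print(symmetry_violations(friends, objects))  # [('Bob', 'Chris')]
```

With a hard rule, the single violating grounding makes this set of tuples inconsistent with the knowledge base.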

Statistical Relational Knowledge

•  First-order logic
•  Description logic
•  Logic programs
•  Bayesian networks
•  Markov networks
•  Dependency networks

General form: A set of rules of the form “For every tuple of objects (x1, x2, …, xk), certain relationships probably hold.” (Parametrized factors or “parfactors”.)

e.g., For every pair of objects (x,y), if Friends(x,y) is true then Friends(y,x) is more likely.
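To illustrate "more likely", here is a rough sketch in the style of a Markov logic network, where a world's unnormalized score is exp(weight × number of satisfied groundings of the rule). The weight value 1.5 and the function names are assumptions chosen for illustration only.

```python
import math
from itertools import product


def count_satisfied(friends, domain):
    """Count groundings (x, y) for which Friends(x, y) => Friends(y, x) is true."""
    return sum(
        1
        for x, y in product(domain, repeat=2)
        if (x, y) not in friends or (y, x) in friends
    )


def unnormalized_score(friends, domain, weight):
    """Markov-logic-style potential for a single weighted formula."""
    return math.exp(weight * count_satisfied(friends, domain))


domain = ["Anna", "Bob", "Chris"]
asymmetric = {("Anna", "Bob"), ("Bob", "Anna"), ("Bob", "Chris")}
symmetric = asymmetric | {("Chris", "Bob")}
weight = 1.5  # hypothetical weight, not taken from the talk

ratio = unnormalized_score(symmetric, domain, weight) / unnormalized_score(
    asymmetric, domain, weight
)
print(count_satisfied(asymmetric, domain), count_satisfied(symmetric, domain))
print(ratio)  # equals exp(weight), since one extra grounding is satisfied
```

The asymmetric world is not ruled out; it is only less probable by a factor of exp(weight) under this one formula.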

Applications and Datasets

What are the “killer apps” of relational learning?

They must be relational.

Graph or Network Data

•  Many kinds of networks:
  –  Social networks
  –  Interaction networks
  –  Citation networks
  –  Road networks
  –  Cellular pathways
  –  Computer networks
  –  Webgraph

Graph Mining

•  Well-established field within data mining
•  Representation: nodes are objects, edges are relations
•  Many problems and methods
  –  Frequent subgraph mining
  –  Generative models to explain degree distribution and graph evolution over time
  –  Community discovery
  –  Collective classification
  –  Link prediction
  –  Clustering
•  What’s the difference between graph mining and relational learning?

Social Network Analysis

Specialized vs. General Representations

In many domains, the best results come from more restricted, “specialized” representations and algorithms.

•  Specialized representations and algorithms
  –  May represent key domain properties better
  –  Typically much more efficient
  –  E.g., stochastic block model, label propagation, HITS
•  General representations
  –  Can be applied to new and unusual domains
  –  Easier to define complex models
  –  Easier to modify and extend
  –  E.g., MLNs, PRMs, HL-MRFs, ProbLog, RBNs, PRISM, etc.

Specializing and Unifying Representations

Many representations have been proposed over the years, each with its own advantages and disadvantages.

•  How many do we need?
•  Which comes first, representational power or algorithmic convenience?
•  What are the right unifying frameworks?
•  When should we resort to domain-specific representations?
•  Which domain-specific ideas actually generalize to other domains?

Applications and Datasets

What are the “killer apps” of general relational learning?

They must be relational.

They should probably be complex.

BioNLP Shared Task Workshop

In 2009, Riedel et al. won with a Markov logic network!

•  They claim Markov logic contributed to their success: “Furthermore, the declarative nature of Markov Logic helped us to achieve these results with a moderate amount of engineering. In particular, we were able to tackle task 2 by copying the local formulae for event prediction and adding three global formulae.”

•  However, converting this problem to an MLN was non-trivial: “In future work we will therefore investigate means to extend Markov Logic (interpreter) in order to directly model event structure.”

Task: Extract biomedical information from text.

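$\mathrm{event}(i) \Rightarrow \exists t.\, \mathrm{eventType}(i, t)$

$\mathrm{eventType}(i, t) \Rightarrow \mathrm{event}(i)$

$\mathrm{eventType}(i, t) \wedge t \neq o \Rightarrow \neg \mathrm{eventType}(i, o)$

$\neg \mathrm{site}(i) \vee \neg \mathrm{event}(i)$

$\mathrm{role}(i, j, r) \Rightarrow \mathrm{event}(i)$

$\mathrm{role}(i, j, r_1) \wedge r_1 \neq r_2 \Rightarrow \neg \mathrm{role}(i, j, r_2)$

$\mathrm{eventType}(e, t) \wedge \mathrm{role}(e, a, r) \wedge \mathrm{event}(a) \Rightarrow \mathrm{regType}(t)$

$\mathrm{role}(i, j, r) \wedge \mathrm{taskOne}(r) \Rightarrow \mathrm{event}(j) \vee \mathrm{protein}(j)$

$\mathrm{role}(i, j, r) \wedge \mathrm{taskTwo}(r) \Rightarrow \mathrm{site}(j)$

$\mathrm{site}(j) \Rightarrow \exists i, r.\, \mathrm{role}(i, j, r) \wedge \mathrm{taskTwo}(r)$

$\mathrm{event}(i) \Rightarrow \exists j.\, \mathrm{role}(i, j, \,)$

$\mathrm{eventType}(i, t) \wedge \neg \mathrm{allowed}(t, r) \Rightarrow \neg \mathrm{role}(i, j, r)$

$\mathrm{role}(i, j, r_1) \wedge k \neq i \Rightarrow \neg \mathrm{role}(k, j, r_2)$

$j < k \wedge i < j \wedge \mathrm{role}(i, j, r_1) \Rightarrow \neg \mathrm{role}(i, k, r_2)$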
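To make the global formulae more concrete, the sketch below checks four of them as hard constraints on one candidate extraction. It is not the authors' MLN: the function name, the set-based encoding, and the token indices (loosely following the TRAF2 example sentence) are all assumptions for illustration.

```python
def violated_constraints(event, event_type, role):
    """Check four of the formulas on one candidate extraction.

    event:      set of token indices i with event(i)
    event_type: set of (i, t) pairs with eventType(i, t)
    role:       set of (i, j, r) triples with role(i, j, r)
    """
    problems = []
    typed = {i for i, _ in event_type}
    # event(i) => exists t. eventType(i, t)
    problems += [("event without type", i) for i in sorted(event - typed)]
    # eventType(i, t) => event(i)
    problems += [("type without event", i) for i in sorted(typed - event)]
    # eventType(i, t) and t != o => not eventType(i, o)
    for i in sorted(typed):
        if len({t for k, t in event_type if k == i}) > 1:
            problems.append(("multiple types", i))
    # role(i, j, r) => event(i)
    heads = {i for i, _, _ in role}
    problems += [("role without event", i) for i in sorted(heads - event)]
    return problems


# Hypothetical token indices: 2 = phosphorylation, 4 = TRAF2,
# 5 = inhibits, 6 = binding, 9 = CD40.
event = {2, 5, 6}
event_type = {(2, "Phosphorylation"), (5, "Regulation"), (6, "Binding")}
role = {(2, 4, "Theme"), (5, 2, "Cause"), (5, 6, "Theme"),
        (6, 4, "Theme"), (6, 9, "Theme")}

print(violated_constraints(event, event_type, role))            # []
print(violated_constraints(event - {5}, event_type, role))
```

In an MLN these formulae would be grounded and handled by a general inference engine; writing them by hand like this is exactly the kind of custom engineering the declarative approach aims to avoid.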

BioNLP Shared Task Workshop

For 2011, Riedel and McCallum produced a more accurate model as a factor graph.

Is this a victory or a loss for relational learning?

[Figure 1 example sentence: “... phosphorylation of TRAF2 inhibits binding to the CD40 ...”, annotated with the event types Phosphorylation, Regulation, and Binding, the edge labels Theme and Cause, and Same Binding variables.]

Figure 1: (a) sentence with target event structure; (b) projection to labelled graph.

sentence, as seen figure 1b).

We will first present some basic notation to simplify our exposition. For each sentence $x$ we have a set of candidate trigger words $\mathrm{Trig}(x)$, and a set of candidate proteins $\mathrm{Prot}(x)$. We will generally use the indices $i$ and $l$ to denote members of $\mathrm{Trig}(x)$, the indices $p, q$ for members of $\mathrm{Prot}(x)$ and the index $j$ for members of $\mathrm{Cand}(x) \stackrel{\text{def}}{=} \mathrm{Trig}(x) \cup \mathrm{Prot}(x)$.

We label each candidate trigger $i$ with an event type $t \in T$ (with $\mathrm{None} \in T$), and use the binary variable $e_{i,t}$ to indicate this labeling. We use binary variables $a_{i,l,r}$ to indicate that between $i$ and $l$ there is an edge labelled $r \in R$ (with $\mathrm{None} \in R$).

The representation so far has been used in previous work (Riedel et al., 2009; Björne et al., 2009). Its shortcoming is that it does not capture whether two proteins are arguments of the same binding event, or arguments of two binding events with the same trigger. To overcome this problem, we introduce binary “same Binding” variables $b_{p,q}$ that are active whenever there is a binding event that has both $p$ and $q$ as arguments. Our inference algorithm will also need, for each trigger $i$ and protein pair $p, q$, a binary variable $t_{i,p,q}$ that indicates that at $i$ there is a binding event with arguments $p$ and $q$. All $t_{i,p,q}$ are summarized in $t$.

Constructing events from solutions $(e, a, b)$ can be done almost exactly as described by Björne et al. (2009). However, while Björne et al. (2009) group arguments according to ad-hoc rules based on dependency paths from trigger to argument, we simply query the variables $b_{p,q}$.

3 Model

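We use the following objective to score the structures we like to extract:

$$s(e, a, b) \stackrel{\text{def}}{=} \sum_{e_{i,t}=1} s_T(i, t) + \sum_{a_{i,j,r}=1} s_R(i, j, r) + \sum_{b_{p,q}=1} s_B(p, q)$$

with local scoring functions $s_T(i, t) \stackrel{\text{def}}{=} \langle w_T, f_T(i, t) \rangle$, $s_R(i, j, r) \stackrel{\text{def}}{=} \langle w_R, f_R(i, j, r) \rangle$ and $s_B(p, q) \stackrel{\text{def}}{=} \langle w_B, f_B(p, q) \rangle$.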
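The objective is a sum of inner products over the active variables, which can be sketched in a few lines. The sparse dictionary encoding, the toy feature functions, and every weight value below are assumptions for illustration; they are not the features or weights from the paper.

```python
def dot(weights, features):
    """Sparse inner product <w, f> over feature-name -> value dicts."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def score(e, a, b, w_T, w_R, w_B, f_T, f_R, f_B):
    """s(e, a, b): sum local scores over active trigger, edge and binding variables."""
    total = sum(dot(w_T, f_T(i, t)) for (i, t), on in e.items() if on)
    total += sum(dot(w_R, f_R(i, j, r)) for (i, j, r), on in a.items() if on)
    total += sum(dot(w_B, f_B(p, q)) for (p, q), on in b.items() if on)
    return total


# Toy (hypothetical) feature functions: one indicator feature each.
f_T = lambda i, t: {f"type={t}": 1.0}
f_R = lambda i, j, r: {f"role={r}": 1.0}
f_B = lambda p, q: {"same_binding": 1.0}

# Hypothetical weights.
w_T = {"type=Phosphorylation": 1.0, "type=Binding": 0.5, "type=Regulation": 2.0}
w_R = {"role=Theme": 0.25}
w_B = {"same_binding": -0.5}

e = {(2, "Phosphorylation"): 1, (6, "Binding"): 1, (5, "Regulation"): 0}
a = {(2, 4, "Theme"): 1}
b = {(4, 9): 1}

print(score(e, a, b, w_T, w_R, w_B, f_T, f_R, f_B))  # 1.0 + 0.5 + 0.25 - 0.5
```

The inactive Regulation variable contributes nothing, which mirrors the sums ranging only over variables set to 1.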

Our model scores all parts of the structure in isolation. It is a joint model due to the three types of constraints we enforce. The first type acts on trigger labels and their outgoing edges. It includes constraints such as “an active label at trigger $i$ requires at least one active outgoing Theme argument”. The second type enforces consistency between trigger labels and their incoming edges. That is, if an incoming edge has a label that is not None, the trigger must not be labelled None either. The third type of constraints ensures that when two proteins $p$ and $q$ are part of the same binding (as indicated by $b_{p,q} = 1$), there needs to be a binding event at some trigger $i$ that has $p$ and $q$ as arguments. We will denote the set of structures $(e, a, b)$ that satisfy all above constraints as $Y$.

To learn $w$ we choose the passive-aggressive online learning algorithm (Crammer and Singer, 2003). As loss function we apply a weighted sum of false positives and false negative labels and edges. The weighting scheme penalizes false negatives 3.8 times more than false positives.
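The sketch below shows a textbook-style passive-aggressive step with a cost-weighted loss. It is not necessarily the exact variant used in the paper: the function names, the sparse dictionary encoding, and the toy features are assumptions, and only the 3.8 false-negative weighting comes from the text.

```python
def weighted_loss(gold, predicted, fn_cost=3.8, fp_cost=1.0):
    """Weighted count of false negative and false positive labels/edges."""
    false_neg = len(gold - predicted)
    false_pos = len(predicted - gold)
    return fn_cost * false_neg + fp_cost * false_pos


def pa_update(w, feats_gold, feats_pred, loss):
    """One passive-aggressive step: change w just enough to reach margin `loss`."""
    keys = set(feats_gold) | set(feats_pred)
    delta = {k: feats_gold.get(k, 0.0) - feats_pred.get(k, 0.0) for k in keys}
    norm_sq = sum(v * v for v in delta.values())
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())
    hinge = max(0.0, loss - margin)
    if hinge == 0.0 or norm_sq == 0.0:
        return dict(w)  # passive: the margin constraint already holds
    tau = hinge / norm_sq  # aggressive: smallest step that fixes the violation
    updated = dict(w)
    for k, v in delta.items():
        updated[k] = updated.get(k, 0.0) + tau * v
    return updated


# Hypothetical gold and predicted structures sharing one label, differing in one edge.
gold = {("e", 2, "Phos"), ("a", 2, 4, "Theme")}
pred = {("e", 2, "Phos"), ("a", 2, 9, "Theme")}
loss = weighted_loss(gold, pred)  # one false negative plus one false positive

w = pa_update({}, {"edge:2-4:Theme": 1.0}, {"edge:2-9:Theme": 1.0}, loss)
print(loss, w)
```

After the step, the gold structure outscores the prediction by exactly the loss, so costlier mistakes (here, false negatives) force larger updates.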

3.1 Features

For feature vector $f_T(i, t)$ we use a collection of representations for the token $i$: word-form, lemma, POS tag, syntactic heads, syntactic children; membership in two dictionaries used by Riedel et al. (2009). For $f_R(a; i, j, r)$ we use representations of the token pair $(i, j)$ inspired by Miwa et al. (2010). They contain: labelled and unlabeled n-gram dependency paths; edge and vertex walk features (Miwa et al., 2010), argument and trigger modifiers and heads, words in between (for close distance $i$ and $j$). For $f_B(b; p, q)$ we use a small subset of the token pair representations in $f_R$.


Other NLP Tasks?

Hoifung Poon and Pedro Domingos obtained great NLP results with MLNs:
•  “Joint Unsupervised Coreference Resolution with Markov Logic,” ACL 2008.
•  “Unsupervised Semantic Parsing,” EMNLP 2009. Best Paper Award.
•  “Unsupervised Ontology Induction from Text,” ACL 2010.

…but Hoifung hasn’t used Markov logic in any of his follow-up work:
•  “Probabilistic Frame Induction,” NAACL 2013. (with Jackie Cheung and Lucy Vanderwende)
•  “Grounded Unsupervised Semantic Parsing,” ACL 2013.
•  “Grounded Semantic Parsing for Complex Knowledge Extraction,” NAACL 2015. (with Ankur P. Parikh and Kristina Toutanova)

MLNs were successfully used to obtain state-of-the-art results on several NLP tasks. Why were they abandoned? Because it was easier to hand-code a custom solution as a log-linear model.

Software

•  There are many good machine learning toolkits
  –  Classification: scikit-learn, Weka
  –  SVMs: SVM-Light, LibSVM, LIBLINEAR
  –  Graphical models: BNT, FACTORIE
  –  Deep learning: Torch, Pylearn2, Theano
•  What’s the state of software for relational learning and inference?
  –  Frustrating.
  –  Are the implementations too primitive?
  –  Are the algorithms immature?
  –  Are the problems just inherently harder?

Hopeful Analogy: Neural Networks

•  In computer vision, specialized feature models (e.g., SIFT) outperformed general feature models (neural networks) for a long time.

•  Recently, convolutional nets are best and are used everywhere for image recognition.

•  What changed? More processing power and more data.

Specialized relational models are widely used. Is there a revolution in general relational learning waiting to happen?

Conclusion

•  Many kinds of relational data and models
  –  Specialized relational models are clearly effective.
  –  General relational models have potential, but they haven’t taken off.
•  Questions:
  –  When can effective specialized representations become more general?
  –  What advances do we need for general-purpose methods to succeed?
  –  What “killer apps” should we be working on?