
Announcements

• Friday, April 23, 2004, 2:30pm, 4623 Wean Hall, Marti Hearst, UCB

• Abstract: New methods and tools are needed to improve how bioscience researchers search for and synthesize information from the bioscience literature. Towards this end, we are building a flexible, efficient, platform-independent database system infrastructure specifically geared  towards supporting the advanced and particular search needs of bioscience researchers. We are using this infrastructure to support the development and deployment of statistical approaches to natural language processing  which identify entities and relations between them in the bioscience literature. The results of the text analysis will be accessed via an intuitive, appealing search user interface that will be developed using the appropriate human-centered design methods. The resulting system will support new ways of asking scientific questions, and new tools for assembling the pieces of biosciences puzzles.

Hardening Soft Databases

Cohen, Kautz, McAllester

A familiar problem: clustering names

• Given a group of strings that refer to some external entities, cluster them so that strings in the same cluster refer to the same entity.

Example strings (from the slide figure):

• Bart Selmann
• B. Selman
• Bart Selman
• Cornell Univ.
• Critical behavior in satisfiability
• Critical behavior for satisfiability
• BLACKBOX theorem proving
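A toy sketch of the clustering task above, using a generic string-similarity measure from the Python standard library. The similarity function and threshold here are illustrative assumptions, not the method the talk describes:

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Illustrative similarity test; a real system would use a learned or domain-specific measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster_names(names):
    """Greedily place each string into the first cluster containing a similar string."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similar(name, member) for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["Bart Selmann", "B. Selman", "Bart Selman", "Cornell Univ."]
print(cluster_names(names))
# The three Selman variants land in one cluster; "Cornell Univ." stands alone.
```

Greedy first-fit clustering like this is order-dependent; the deck's point is precisely that such local decisions interact, which motivates the global "hardening" objective below.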

A variation of this problem: clustering Tuples from Databases

Database S1 (extracted from paper 1’s title page):

Database S2 (extracted from paper 2’s bibliography):

Assumption: identical strings from the same source are co-referent (one sense per discourse?)

A variation of this problem: clustering Tuples from Databases

So this gives some known matches, which might interact with proposed matches: e.g. here we deduce...

Example: end result of clustering

“Soft database” from IE:

“Hard database” suitable for Oracle, MySQL, etc.
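The soft-to-hard step can be sketched with union-find over accepted matches. The tuples and matches below are illustrative, not the slide's actual example data:

```python
class UnionFind:
    """Tracks which references have been merged; find() returns a canonical representative."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Soft tuples wrote_paper(author, title), with variant strings from IE.
soft = [("Bart Selmann", "Critical behavior in satisfiability"),
        ("B. Selman", "Critical behavior for satisfiability"),
        ("Bart Selman", "BLACKBOX theorem proving")]

uf = UnionFind()
matches = [("Bart Selmann", "Bart Selman"),
           ("B. Selman", "Bart Selman"),
           ("Critical behavior in satisfiability", "Critical behavior for satisfiability")]
for a, b in matches:
    uf.union(a, b)

# Hardened database: map every reference to its canonical form, discard duplicates.
hard = {tuple(uf.find(r) for r in t) for t in soft}
print(sorted(hard))
```

Three soft tuples collapse to two hard ones, which is the kind of deduplication a conventional relational store expects.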

Overview of Hardening Paper

• Definition of “hardening”:
– Find an “interpretation” (mapping variant -> name) that produces a compact version of database S.
• Probabilistic interpretation of hardening:
– The original “soft” data S is a version of latent “hard” data H.
– Hardening finds the maximum-likelihood H.
• Hardness result:
– Optimal hardening is NP-hard.
• Greedy algorithm:
– The naive implementation is quadratic in |S|.
– Clever data structures make it O(n log n), where n = |S|d.

Definition of Hardening

• Soft DB contains tuples of references of the form R(r1,...,rk). (The max arity of relations is k.)

• A reference is a string naming an entity, tagged with the source from which it came.

Definition of Hardening

• Possible matches are given as a set of weighted arcs Ipot between references (think –logP).

• We’d like to assume there are not too many of these: e.g., that at most d touch any reference.

[Figure: example potential-match arcs with weights 0.1, 0.01, and 200.]

Definition of Hardening

• An interpretation is an acyclic subset I of Ipot such that for any r, there is at most one arc r -> r' in I (“r' is the hard version of r”).

• Interpretation arcs are chained, and I(r) is the final interpretation of r:

• Chained interpretation arcs form an “evolutionary model”: each reference evolves, arc by arc, into its final interpretation I(r).

[Figure: two candidate interpretations over references r0, r1, r2, r3, r4, drawn as arc chains over steps t = 0, 1, 2, 3, with total weights w(I)=4 and w(I)=7.]
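Because a well-formed I is acyclic with at most one outgoing arc per reference, chaining terminates and can be sketched directly (the names here are illustrative):

```python
def interpret(I: dict, r: str) -> str:
    """Follow interpretation arcs r -> I[r] -> ... to a fixed point; the endpoint is I(r)."""
    while r in I:
        r = I[r]
    return r

# Arcs chain: "Bart Selmann" -> "B. Selman" -> "Bart Selman"
I = {"Bart Selmann": "B. Selman", "B. Selman": "Bart Selman"}
print(interpret(I, "Bart Selmann"))  # "Bart Selman"
```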

Definition of Hardening

• An interpretation is an acyclic subset I of Ipot such that for any r, there is at most one arc r -> r' in I (“r' is the hard version of r”).
• Interpretation arcs are chained.
• If S is a soft database, I(S) is the hard version of S: apply I to each tuple and discard duplicates.
• w(I) is the sum of the weights of the arcs in I.
• The cost of I is a linear combination of |I|, |I(S)|, and w(I).
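Putting the pieces together, a cost of the form c(I) = a·|I(S)| + b·|I| + w(I) can be computed directly; the coefficients and data here are illustrative:

```python
def interpret(I, r):
    """Follow interpretation arcs to the final interpretation I(r)."""
    while r in I:
        r = I[r]
    return r

def hardening_cost(soft, I, weights, a=1.0, b=1.0):
    """Cost of interpretation I: a*|I(S)| + b*|I| + w(I)."""
    hardened = {tuple(interpret(I, r) for r in t) for t in soft}  # I(S), duplicates discarded
    w_I = sum(weights[arc] for arc in I.items())                  # total arc weight
    return a * len(hardened) + b * len(I) + w_I

soft = [("Bart Selmann", "P1"), ("Bart Selman", "P1")]
I = {"Bart Selmann": "Bart Selman"}
weights = {("Bart Selmann", "Bart Selman"): 0.5}
print(hardening_cost(soft, I, weights))  # 1 tuple + 1 arc + weight 0.5 -> 2.5
```

Note the trade-off the cost encodes: adding an arc pays b plus its weight, but may shrink |I(S)| by collapsing duplicate tuples.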

Definition of Hardening

• I is a (well-formed) subset of Ipot.
– Interpretation arcs are chained.
• I(S) is the hardened version of S.
• Goal of hardening: given S and Ipot, choose I to minimize the cost

c(I) = a·|I(S)| + b·|I| + w(I)

Probabilistic motivation:Hardening is ‘almost’ finding ML hard DB

• Assume a joint distribution Pr(H,I,S) over the hard DB H, the “corruption process” I, and the resulting soft DB S.
• Natural goal: pick H, I to maximize Pr(H,I|S).
• Assume a particular generative model:
– U is a fixed set of (hard, real-world) entities => a fixed number N of possible hard tuples.
– H is generated by picking possible hard tuples uniformly at random, stopping with fixed probability eH.

Probabilistic motivation

• Assume:
– I is generated by picking possible arcs, each with probability exponential in its weight w, stopping with fixed probability eI.
– Also assume an ill-formed I is discarded at the end of the process (this rescales by the probability of well-formedness).

Probabilistic motivation

• Assume:
– Pr(H,I,S) = Pr(S|I,H) · Pr(I) · Pr(H)
– S is generated by picking tuples at random from I⁻¹(H), until we decide to stop.

Probabilistic motivation

• Pick H, I to maximize Pr(H,I|S).
• Equivalently, pick H, I to maximize the joint Pr(H,I,S).
• Given S and I, the best H is H = I(S), so it suffices to pick I to maximize Pr(I(S), I, S), i.e. to minimize c′(I) = −lg Pr(I(S), I, S).
• This objective matches the hardening cost up to a nonlinear term.
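The slide's algebra can be reconstructed from the surrounding definitions as a rough sketch; the exact constants depend on eH, eI, and N, which the slides leave implicit:

```latex
c'(I) = -\lg \Pr(I(S), I, S)
      = \underbrace{-\lg \Pr(H)}_{\approx\, a\,|I(S)|}
      + \underbrace{-\lg \Pr(I)}_{\approx\, b\,|I| + w(I)}
      + \underbrace{-\lg \Pr(S \mid I, H)}_{\text{nonlinear term}},
      \qquad H = I(S)
```

The first two terms are linear in |I(S)|, |I|, and w(I), matching the cost c(I); only the Pr(S|I,H) term is nonlinear, which is why hardening is “almost” maximum-likelihood inference.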

Why not use probabilistic inference?

Proof sketch: by reduction from vertex cover (a subset of the vertices of G that covers all edges).

• Each edge (x,y) => a soft tuple R(e); Ipot contains e -> x and e -> y.
• A well-formed I yields a vertex cover; the minimum one is NP-hard to find (so “well-formed” is significant!).
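The reduction can be made concrete on a tiny graph. This brute-force version only illustrates the correspondence (it does not, of course, escape NP-hardness):

```python
import itertools

def reduction(edges):
    """Per the sketch: each edge (x, y) yields one soft tuple R(e) and
    potential arcs e -> x, e -> y."""
    refs = [f"e{i}" for i in range(len(edges))]
    Ipot = {f"e{i}": (x, y) for i, (x, y) in enumerate(edges)}
    return refs, Ipot

def min_hardening_cover(edges):
    """Every well-formed I picks one endpoint per edge-reference, so I(S) is a
    vertex cover; minimizing |I(S)| finds a minimum cover (by brute force)."""
    refs, Ipot = reduction(edges)
    best = None
    for choice in itertools.product(*(Ipot[e] for e in refs)):
        cover = set(choice)
        if best is None or len(cover) < len(best):
            best = cover
    return best

edges = [("a", "b"), ("b", "c"), ("c", "d")]
cover = min_hardening_cover(edges)
print(cover, len(cover))  # a minimum vertex cover of size 2
```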

Fast Greedy Hardening

• Simple greedy algorithm:
– Start with empty I.
– For t = 1, ..., T: add to I the single arc in Ipot that reduces cost most.
• Intuitions:
– Similar to (fast) greedy agglomerative clustering.
– Perform low-risk merges first.
– A reduction in database size generally means similar names appear in many similar contexts.
– Merge to balance similarity of merged items against reduction in database size.
• Problem: the naive implementation is O(|S|²), because the value of an arc r -> r' changes over the course of the algorithm.
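The naive O(|S|²) greedy loop can be sketched as follows. The example arcs, weights, and coefficients are illustrative, and this is the slow version the slide warns about, before any clever data structures:

```python
def interpret(I, r):
    """Follow interpretation arcs to the final interpretation I(r)."""
    while r in I:
        r = I[r]
    return r

def cost(soft, I, weights, a, b):
    """c(I) = a*|I(S)| + b*|I| + w(I)."""
    hardened = {tuple(interpret(I, r) for r in t) for t in soft}
    return a * len(hardened) + b * len(I) + sum(weights[arc] for arc in I.items())

def greedy_harden(soft, Ipot, weights, a=1.0, b=0.0):
    """Start with empty I; repeatedly add the single arc that reduces cost most."""
    I = {}
    while True:
        best_arc, best_cost = None, cost(soft, I, weights, a, b)
        for (r, r2) in Ipot:
            # Keep I well-formed: at most one outgoing arc per r, no cycles.
            if r in I or interpret(I, r2) == r:
                continue
            trial = dict(I)
            trial[r] = r2
            c = cost(soft, trial, weights, a, b)
            if c < best_cost:
                best_arc, best_cost = (r, r2), c
        if best_arc is None:
            return I
        I[best_arc[0]] = best_arc[1]

soft = [("Bart Selmann", "P1"), ("B. Selman", "P1"), ("Bart Selman", "P1")]
Ipot = [("Bart Selmann", "Bart Selman"), ("B. Selman", "Bart Selman")]
weights = {("Bart Selmann", "Bart Selman"): 0.1, ("B. Selman", "Bart Selman"): 0.2}
print(greedy_harden(soft, Ipot, weights))
# Both variants map to "Bart Selman"; the three tuples collapse to one.
```

Each iteration rescans every candidate arc and recomputes the full cost, which is exactly the quadratic behavior the priority-update machinery on the next slide avoids.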

Fast Greedy Hardening

Updating priorities: main idea

• Key idea: track tuples that will be collapsed together as the result of incorporating an arc.

[Equations garbled in extraction. For a candidate arc r -> r', the slide defines range(r, r'), the set of hardened tuples affected by adding the arc, and count(r, r'), the number of tuples that collapse, and claims the change in |I(S)| from adding r -> r' can be computed from count(r, r').]
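The bookkeeping idea can be illustrated loosely. This is a reconstruction of the flavor of the data structure, not the paper's exact definitions of range and count: cache, per candidate arc, the set of hardened tuples it would touch, and after applying an arc, recompute priorities only for candidates whose cached ranges overlap it.

```python
def affected_candidates(candidates, applied, ranges):
    """After applying one arc, only candidates whose cached tuple ranges overlap
    the applied arc's range need their cost deltas recomputed."""
    touched = ranges[applied]
    return [arc for arc in candidates if arc != applied and ranges[arc] & touched]

# Hypothetical ranges: ids of the tuples each candidate arc would collapse.
ranges = {("a", "b"): {1, 2}, ("c", "b"): {2, 3}, ("d", "e"): {4}}
print(affected_candidates(list(ranges), ("a", "b"), ranges))  # [('c', 'b')]
```

When each reference touches at most d candidate arcs, such localized updates are what bring the per-step work down from rescanning all of Ipot.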

Summary of Hardening Paper

• Hardening:
– Given S, find low-cost, small, latent H and I.
– New inferences become possible in H.
– I is an evolutionary model.
– Minimizing a|H| + b|I| + w(I) is “almost like” MAP inference of H, I.
– Optimal hardening is NP-hard.
• Greedy algorithm:
– The naive implementation is quadratic in |S|.
– Possible in O(n k³ log n), where n = |S|kd.

Further work...

• Experimental validation?
• Other generative models?
• Type constraints?
• “Real” Bayesian MAP inference?
– An MCMC approach, similar to greedy search.

