
Web-Scale Knowledge Inference Using Markov Logic Networks

Yang Chen [email protected] Daisy Zhe Wang [email protected]

University of Florida, Dept of Computer Science, Gainesville, FL 32611 USA

Abstract

In this paper, we present our on-going work on ProbKB, a PROBabilistic Knowledge Base constructed from web-scale extracted entities, facts, and rules represented as a Markov logic network (MLN). We aim at web-scale MLN inference by designing a novel relational model to represent MLNs and algorithms that apply rules in batches. Errors are handled in a principled and elegant manner to avoid error propagation and unnecessary resource consumption. MLNs infer from the input a factor graph that encodes a probability distribution over extracted and inferred facts. We run parallel Gibbs sampling algorithms on GraphLab to query this distribution. Initial experiment results show promising scalability of our approach.

1. Introduction

With the exponential growth in machine learning, statistical inference techniques, and big-data analytics frameworks, recent years have seen tremendous research interest in automatic information extraction and knowledge base construction. A knowledge base stores entities, facts, and their relationships in a machine-readable form so as to help machines understand information from humans.

Currently, the most popular techniques used to acquire knowledge include automatic information extraction (IE) from text corpora (Carlson et al., 2010; Poon et al., 2010; Schoenmackers et al., 2010) and massive human collaboration (Wikipedia, Freebase, etc.). Though these approaches have proven successful in a broad range of applications, there is still much improvement to be made by performing inference. For example, if Wikipedia pages state that Kale is very high in calcium and that calcium helps prevent Osteoporosis, then we can infer that Kale helps prevent Osteoporosis. However, this information is stated on neither page and can only be discovered by inference.

Existing IE systems extract entities, facts, and rules automatically from the web (Fader et al., 2011; Schoenmackers et al., 2010), but due to corpus noise and the inherent probabilistic nature of the learning algorithms, most of these extractions are uncertain. Our goal is to facilitate inference over such noisy, uncertain extractions in a principled, probabilistic, and scalable manner. To achieve this, we use Markov logic networks (MLNs) (Richardson & Domingos, 2006), an extension of first-order logic that augments each clause with a weight. Clauses with finite weight are allowed to be violated, but with a penalty determined by the weight. Together with all extracted entities and facts, the MLN defines a Markov network (factor graph) that encodes a probability distribution over all inferred facts. Probabilistic inference is thus supported by querying this distribution.
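To make the log-linear semantics concrete, here is a minimal, illustrative sketch (ours, not ProbKB code) of how an MLN scores a possible world: a world's unnormalized probability is exp of the sum of the weights of its satisfied ground clauses, and marginals follow by normalizing. The predicates, weights, and brute-force enumeration below are toy assumptions; real engines never enumerate worlds.

```python
import itertools
import math

def mln_score(world, weighted_clauses):
    """Unnormalized log-probability of a truth assignment under an MLN.

    world: dict mapping ground atoms to True/False.
    weighted_clauses: list of (weight, clause), where a clause is a list of
    (atom, positive) literals and is satisfied if any literal holds.
    """
    total = 0.0
    for weight, clause in weighted_clauses:
        if any(world[atom] == positive for atom, positive in clause):
            total += weight
    return total

def mln_marginal(query_atom, atoms, weighted_clauses):
    """Exact marginal P(query_atom = True) by enumerating all worlds
    (feasible only for tiny toy examples)."""
    z = z_true = 0.0
    for bits in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        w = math.exp(mln_score(world, weighted_clauses))
        z += w
        if world[query_atom]:
            z_true += w
    return z_true / z

# Hypothetical toy MLN: strong evidence for capital_of(Paris, France), plus a
# soft rule located_in(x, y) <- capital_of(x, y) written as a weighted clause.
cap = ("capital_of", "Paris", "France")
loc = ("located_in", "Paris", "France")
clauses = [
    (5.0, [(cap, True)]),               # evidence: the extracted fact
    (2.0, [(cap, False), (loc, True)])  # the soft rule: located_in OR NOT capital_of
]
marginal = mln_marginal(loc, [cap, loc], clauses)
```

Because the rule has finite weight, the inferred fact gets a high but not certain probability, which is exactly the behavior the penalty semantics describes.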

The main challenge related to MLNs is scalability. The state-of-the-art implementations, Tuffy (Niu et al., 2011) and Felix (Niu et al., 2012), partially solve this problem by using relational databases and task specialization. However, these systems work well only on a small set of relations and hand-crafted MLNs, and are not able to scale up to web-scale datasets like ReVerb and Sherlock (see Section 3). To gain this scalability, we design a relational model that pushes all facts and rules into the database. In this way, grounding is reduced to a few joins among these tables, and the rules are thus applied in batches. This helps achieve much greater scalability than Tuffy and Felix.

ProbKB: Managing Web-Scale Knowledge

The second challenge stems from extraction errors. Most IE systems assign a score to each of their extractions to indicate fidelity, but they typically perform no further cleaning. This poses a significant challenge to inference engines: without constraints, errors propagate rapidly and violently. Figure 1 shows an example of such propagation: starting from a single error¹ stating that Paris is the capital of Japan, a whole batch of incorrect results is produced.

Errors come from diverse sources: incorrectly extracted entities and facts, wrong rules, ambiguity, inference, etc. In each rule application with a single erroneous fact participating, the result is likely to be erroneous. Worse, even if the extractions are correct, errors may arise due to word ambiguity. Removing errors early is a crucial task in knowledge base construction; it improves knowledge quality and saves computation resources for high-quality inferences.

This paper introduces the ProbKB system that aims at tackling these challenges and presents our initial results over a web-scale extraction dataset.

2. The ProbKB System

This section presents our on-going work on ProbKB. We designed a relational model to represent the extractions and the MLN, and designed grounding algorithms in SQL that apply MLN clauses in batches. We implemented it on Greenplum, a massively parallel processing (MPP) database system. Inference is done by a parallel Gibbs sampler (Gonzalez et al., 2011) on GraphLab (Low et al., 2010). An overview of the system architecture is shown in Figure 2.

Figure 2. ProbKB architecture. (Components: entities, facts, and rules stored in a relational database; the MLN; the factor graph; and the inference engine.)

The remainder of this section is structured as follows: Section 2.1 justifies an important assumption regarding Horn clauses. Section 2.2 introduces our relational model for MLNs. Section 2.3 presents the grounding algorithms in terms of our relational model. Section 2.4 describes our initial work on maintaining knowledge integrity. Section 2.5 describes our inference engine built on GraphLab. Experiment results show that ProbKB scales much better than the state of the art.

¹ See http://en.wikipedia.org/w/index.php?title=Statement_(logic)&diff=545338546&oldid=269607611

2.1. First-Order Horn Clauses

A first-order Horn clause is a clause with at most one positive literal². In this paper, we focus on this class of clauses only, though in general Markov logic supports arbitrary forms. This restriction is reasonable in our context since our goal is to discover implicit knowledge from explicit statements in the text corpus; this type of inference is mostly expressed as "if... then..." statements in human language, corresponding to the set of Horn clauses. Moreover, due to their simple structure, Horn clauses are easier to learn and to represent in a structured form than general clauses. We use a set of Horn clauses extracted by Sherlock (Schoenmackers et al., 2010).
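The defining property is mechanical to check. A small sketch, using our own (predicate, is_positive) literal encoding — an illustrative convention, not a format used by ProbKB or Sherlock:

```python
def is_horn(clause):
    """A clause is Horn if it contains at most one positive literal.

    clause: list of (predicate, is_positive) pairs. The implication
    p(x,y) <- q(x,z), r(z,y) is the disjunction p OR (NOT q) OR (NOT r).
    """
    return sum(1 for _, is_positive in clause if is_positive) <= 1

implication = [("p", True), ("q", False), ("r", False)]  # p <- q, r: Horn
disjunction = [("p", True), ("q", True)]                 # p OR q: not Horn
```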

2.2. The Relational MLN Model

We represent the knowledge base as a relational model. Though previous approaches also used relational databases to perform grounding (Niu et al., 2011; 2012), their MLN model and associated weights are stored in an external file. At runtime, a SQL query is constructed for each individual rule using a host language. This approach is inefficient when there are a large number of rules. Our approach, on the contrary, stores the MLN in the database, so that the rules are applied in batches using joins. The only component that the database does not efficiently support is the probabilistic inference engine, which needs many random accesses to the input data. This component is discussed in Section 2.5.

Based on the assumption made in Section 2.1, we consider Horn clauses only. We classified the Sherlock rules according to their sizes and argument orders, and identified six rule patterns in the dataset:

² http://en.wikipedia.org/wiki/Horn_clause


Figure 1. Error propagation: how a single error source generates multiple errors and how they propagate further. Errors come from different sources, including incorrect extractions, rules, and propagated errors. The fact that Paris is the capital of Japan is extracted from a Wikipedia page describing logical statements. All base and derived errors are shown in red.

p(x, y)← q(x, y) (1)

p(x, y)← q(y, x) (2)

p(x, y)← q(x, z), r(y, z) (3)

p(x, y)← q(x, z), r(z, y) (4)

p(x, y)← q(z, x), r(y, z) (5)

p(x, y)← q(z, x), r(z, y) (6)

where p, q, r are predicates and x, y, z are variables. Each rule type i has a table Mi recording the predicates involved in the rules of that type. For each rule

p(x, y)← q(x, y) (1)

of type 1, we have a tuple (p, q) in M1. For each rule

p(x, y)← q(x, z), r(y, z) (3)

of type 3, we have a tuple (p, q, r) in M3. We construct M2, M4, M5, and M6 similarly. These tables record the predicates only; the argument orders are implied by the rule types.

Longer rules may cause a problem, since the number of rule patterns grows exponentially with the rule size, making it impractical to create a table for each of them. We leave this extension as future work, but our intuition is to record the arguments explicitly and use UDFs to construct the SQL queries.
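One way to realize that intuition, sketched below under our own encoding assumptions: instead of one table per argument order, store each body atom's argument slots as variable names, and apply any two-atom Horn rule with a single generic routine (a nested-loop join here for clarity; ProbKB itself uses SQL joins).

```python
def apply_rule(head_pred, body_preds, arg_pattern, facts):
    """Derive head facts from a two-atom Horn rule over a set of facts.

    facts: set of (pred, arg1, arg2) tuples (the R table).
    arg_pattern: variable names ('x', 'y', 'z') filling the two body atoms'
    argument slots, e.g. type 3 p(x,y) <- q(x,z), r(y,z) is (("x","z"),("y","z")).
    """
    q_pred, r_pred = body_preds
    (qa, qb), (ra, rb) = arg_pattern
    derived = set()
    for p1, u1, u2 in facts:
        if p1 != q_pred:
            continue
        for p2, v1, v2 in facts:
            if p2 != r_pred:
                continue
            binding = {}
            ok = True
            # Unify each slot's variable with its value; reject on conflict.
            for var, val in ((qa, u1), (qb, u2), (ra, v1), (rb, v2)):
                if binding.setdefault(var, val) != val:
                    ok = False
                    break
            if ok:
                derived.add((head_pred, binding["x"], binding["y"]))
    return derived

# The Kale example as a type-4 rule p(x,y) <- q(x,z), r(z,y), with a
# hypothetical head predicate:
facts = {("contains", "Kale", "calcium"),
         ("helps_prevent", "calcium", "Osteoporosis")}
derived = apply_rule("may_prevent", ("contains", "helps_prevent"),
                     (("x", "z"), ("z", "y")), facts)
```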

We have another table R for relationships. For each relationship p(x, y) that is stated in the text corpus, we have a tuple (p, x, y) in R. The grounding algorithm is then easily expressed as equi-joins between the R table and the Mi tables, as discussed next.

2.3. Grounding

We use type 3 rules to illustrate the grounding algorithm in SQL.

p(x, y)← q(x, z), r(y, z) (3)

Assume these rules are stored in table M3(p, q, r) and relationships p(x, y) are stored in R(p, x, y); then the following SQL query infers new facts given a set of extracted and already-inferred facts:

SELECT DISTINCT M3.p AS p, R1.x AS x, R2.y AS y
FROM M3
JOIN R R1 ON M3.q = R1.p
JOIN R R2 ON M3.r = R2.p
WHERE R1.y = R2.y;

This process is repeated until convergence (i.e., no more facts can be inferred). The following SQL query then generates the factors:

SELECT DISTINCT R1.id AS id1, R2.id AS id2, R.id AS id3
FROM M3
JOIN R R  ON M3.p = R.p
JOIN R R1 ON M3.q = R1.p
JOIN R R2 ON M3.r = R2.p
WHERE R.x = R1.x AND R.y = R2.x AND R1.y = R2.y;

To see why this is more efficient than Tuffy and Alchemy, consider the first query. Suppose we first compute RM3 := M3 ⋈ R on M3.q = R.p. Since the Mi tables are often small, we use a one-pass join algorithm and hash M3 on q. Then, as each tuple (p, x, y) in R is read, it is matched against all rules in M3 whose first body predicate is p. For the second join, RM3 ⋈ R on RM3.r = R.p AND RM3.y = R.y, since RM3 is typically much larger than M3, we assume a two-pass hash-join algorithm, which starts by hashing RM3 and R into buckets using keys (r, y) and (p, y), respectively. Then, for any tuple (p, x, y) in R, all possible results can be formed in one pass by considering the tuples from RM3 in the corresponding bucket. As a result, each tuple is read into memory at most 3 times, and the rules are applied simultaneously. This is in sharp contrast with Tuffy, where tuples have to be read into main memory as many times as the relation appears in the rules.
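The bucketed join described above can be sketched as follows (our own simplification: an in-memory partitioned hash join; a real two-pass join writes its buckets to disk first):

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key, n_buckets=8):
    """Partitioned hash join: hash both inputs into buckets on the join
    key, then join bucket by bucket, so each tuple is touched a constant
    number of times regardless of how many rules it matches."""
    lbuckets = defaultdict(list)
    rbuckets = defaultdict(list)
    for t in left:
        lbuckets[hash(left_key(t)) % n_buckets].append(t)
    for t in right:
        rbuckets[hash(right_key(t)) % n_buckets].append(t)
    for b in range(n_buckets):
        index = defaultdict(list)
        for t in lbuckets[b]:            # build an in-memory table per bucket
            index[left_key(t)].append(t)
        for t in rbuckets[b]:            # probe with the matching bucket
            for match in index[right_key(t)]:
                yield match, t

# Joining M3-style rules to facts on M3.q = R.p, as in the first join:
rules = [("colleagues", "works_at", "works_at")]
facts = [("works_at", "alice", "uf"), ("knows", "alice", "bob")]
joined = list(hash_join(rules, facts,
                        left_key=lambda rule: rule[1],
                        right_key=lambda fact: fact[0]))
```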

Our performance gain over Tuffy is owed also to the simple syntax of Horn clauses. In order to support general first-order clauses, Tuffy materializes each possible grounding of all clauses to identify a set of active clauses (Singla & Domingos, 2006)³. This process quickly uses up disk space when trying to ground a dataset with large numbers of entities, relations, and clauses like Sherlock. Our algorithms avoid this time- and space-consuming process by taking advantage of the simplicity of Horn clauses.

The result of grounding is a factor graph (Markov network). This graph encodes a probability distribution over its variable nodes, which can be used to answer user queries. However, marginal inference in Markov networks is #P-complete (Roth, 1996), so we turn to approximate inference methods. The state-of-the-art marginal inference algorithm is MC-SAT (Poon & Domingos, 2006), but given the absence of deterministic rules and our access to efficient parallel implementations of the widely adopted Gibbs sampler, we stick with the latter and present an initial evaluation in Section 3.2.
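A sequential toy version of the Gibbs sampler (ours; the paper runs GraphLab's parallel implementation) makes the querying step concrete: resample each boolean node from its conditional given its neighbors, then estimate marginals from sample counts. The two-atom factor graph at the end is a made-up example.

```python
import math
import random

def gibbs_marginals(variables, factors, n_samples=2000, burn_in=200, seed=0):
    """Sequential Gibbs sampling over boolean variables.

    factors: list of (weight, clause), where clause is a list of
    (variable, positive) literals; a satisfied clause contributes its
    weight, matching the MLN log-linear model."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in variables}
    # Index factors by variable: flipping v only rescores factors touching v.
    touching = {v: [f for f in factors if any(a == v for a, _ in f[1])]
                for v in variables}
    counts = dict.fromkeys(variables, 0)
    for sweep in range(n_samples + burn_in):
        for v in variables:
            log_odds = 0.0
            for weight, clause in touching[v]:
                for value in (True, False):
                    state[v] = value
                    sat = any(state[a] == pos for a, pos in clause)
                    log_odds += weight * sat if value else -(weight * sat)
            p_true = 1.0 / (1.0 + math.exp(-log_odds))
            state[v] = rng.random() < p_true
        if sweep >= burn_in:
            for v in variables:
                counts[v] += state[v]
    return {v: counts[v] / n_samples for v in variables}

# Hypothetical grounded graph: evidence for atom `cap`, soft rule loc <- cap.
factors = [(4.0, [("cap", True)]),
           (2.0, [("cap", False), ("loc", True)])]
marginals = gibbs_marginals(["cap", "loc"], factors)
```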

2.4. Knowledge Integrity

(Schoenmackers et al., 2010) observed a few common error patterns (ambiguity, meaningless relations, etc.) that might affect learning quality and tried to remove these errors to obtain a cleaner corpus to work on. This manual pruning strategy is useful, but not enough, for an inference engine, since errors often arise and propagate in unexpected manners that are hard to describe systematically. For example, the fact "Paris is the capital of Japan" shown in Figure 1 is accidentally extracted from a Wikipedia page that describes logical statements, and no heuristic in (Schoenmackers et al., 2010) is able to filter it out. Unfortunately, even a single, infrequent error like this propagates rapidly along the inference chain. These errors hamper knowledge quality, waste computation resources, and are hard to catch.

Instead of trying to enumerate common error patterns, we observe that it is much easier to identify correct inferences: facts that are extracted from multiple sources, or inferred from facts from multiple sources, are more likely to be correct than others. New errors are less likely to arise when we apply rules to this subset of facts. Thus, we split the facts table, moving the qualified facts to a table called beliefs; we call the remaining facts candidates. Candidates are promoted to beliefs when we become confident about their fidelity. The terminology is borrowed from Nell (Carlson et al., 2010), but we are solving a different problem: we are trying to identify a computationally safe subset of our knowledge base so that we can safely apply the rules with no errors propagating.

³ For illustration, consider a non-Horn clause ∀x∀y(p(x, y) ∨ q(x, y)), or ∃x p(x). These clauses can be made unsatisfied only by considering all possible assignments for x and y.

To save computation even further, we adapt this workflow to a semi-naive query evaluation algorithm, which we call robust semi-naive evaluation to emphasize the fact that inferences only occur among the most correct facts and errors are unlikely to arise. Semi-naive query evaluation originates from the Datalog literature (Balbin & Ramamohanarao, 1987); the basic idea is to avoid repeated rule applications by restricting one operand to the delta records between two iterations.

Algorithm 1 Robust Semi-Naive Evaluation

candidates = all facts
beliefs = promote(∅, candidates)
delta = ∅
repeat
    promoters = infer(beliefs, delta)
    beliefs = beliefs ∪ delta
    delta = promote(promoters, candidates)
until delta = ∅

In Algorithm 1, infer is almost the same as discussed in Section 2.3, except that the operands are replaced by beliefs and delta, which are potentially much smaller than the original R. The promote algorithm is what we are still working on. It takes a new set of inference results and uses them to promote candidates to beliefs. Our intuition is to exploit lineage and promote facts implied by multiple external sources, or to learn a set of constraints to detect erroneous rules, facts, and join keys.
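Algorithm 1 can be rendered as the following skeleton. The toy infer and promote below are stand-ins (the real promote is still open work in the paper); here promotion simply means "the new inferences support this candidate":

```python
def robust_semi_naive(all_facts, infer, promote):
    """Skeleton of Algorithm 1: infer from beliefs plus the latest delta,
    then promote supported candidates, until nothing new is promoted."""
    candidates = set(all_facts)
    beliefs = promote(set(), candidates)   # bootstrap the trusted set
    candidates -= beliefs
    delta = set()
    while True:
        promoters = infer(beliefs, delta)
        beliefs |= delta
        delta = promote(promoters, candidates)
        candidates -= delta
        if not delta:
            return beliefs

# Toy stand-ins: "a" and "b" are trusted (say, multi-source) extractions,
# and the rule base can derive "x" once both are believed.
def toy_promote(promoters, candidates):
    if not promoters:                      # bootstrap round
        return {f for f in candidates if f in {"a", "b"}}
    return promoters & candidates

def toy_infer(beliefs, delta):
    return {"x"} if {"a", "b"} <= (beliefs | delta) else set()

result = robust_semi_naive({"a", "b", "x"}, toy_infer, toy_promote)
```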

2.5. The GraphLab Inference Engine

The state-of-the-art MLN inference algorithm is MC-SAT (Poon & Domingos, 2006). For an initial evaluation and access to existing libraries, though, our prototype uses a parallel Gibbs sampler (Gonzalez et al., 2011) implemented on the GraphLab (Low et al., 2010; 2012) framework. GraphLab is a distributed framework that improves upon abstractions like MapReduce for asynchronous iterative algorithms with sparse computational dependencies, while ensuring data consistency and achieving a high degree of parallel performance. We ran the parallel Gibbs algorithm on our grounded factor graph and got better results than Tuffy.

Though the performance looks good, one concern we have is coordinating the two systems. Our initial goal is to build a coherent system that works on the MLN inference problem. Getting two independent systems to work together is hard and error-prone. Synchronization becomes especially troublesome: to query even a single atom, we need to get output from GraphLab, write the results to the database, and perform a database query to get the result. Motivated by this, we are trying to work out a shared-memory model that pushes external operators directly into the database. In Section 3.2, we only present results from GraphLab.

3. Experiments

We used extracted entities and facts from ReVerb (Fader et al., 2011) and applied Sherlock rules (Schoenmackers et al., 2010) to discover implicit knowledge. Sherlock learned its rules from TextRunner extractions, an older version of ReVerb, so there is a schema mismatch between the two. We are working on resolving this issue by either mapping the ReVerb schema to Sherlock's or implementing our own rule learner for ReVerb; for now, we use 50K out of 400K extracted facts to present the initial results. The statistics of our dataset are shown in Table 1.

#entities    480K
#relations   10K
#facts       50K
#rules       31K

Table 1. ReVerb-Sherlock data statistics.

Experiment Setup We ran all the experiments on a 32-core machine at 1400MHz with 64GB of RAM running Red Hat Linux 4. ProbKB, Tuffy, and GraphLab are implemented in SQL, SQL (using Java as a host language), and C++, respectively. The database system we use is Greenplum 4.2.

3.1. Grounding

We used Tuffy as a baseline for comparison. Before grounding, we remove some ambiguous mentions, including common first or last names and general class references (Schoenmackers et al., 2010). The resulting dataset has 7K relationships. We ran ProbKB and Tuffy on this cleaner dataset and learned 100K new facts. Performance results are shown in Table 2.

System   Time/s
ProbKB   85
Tuffy    Crash

Table 2. Grounding time for ProbKB and Tuffy.

As discussed in Section 2.3, the reasons for our performance improvement are the Horn-clause assumption and batch rule application. In order to support general MLN inference, Tuffy materializes all ground atoms for all predicates. This makes the active-clause algorithm efficient but consumes too much disk space. For a dataset with large numbers of entities and relations like ReVerb-Sherlock, the disk space is used up even before the grounding starts.

3.2. Inference

This section reports ProbKB's inference performance using GraphLab. Again, we used Tuffy for comparison. Since Tuffy cannot ground the whole ReVerb-Sherlock dataset, we sampled a subset with 700 facts and compared the time spent generating 200 joint samples for ProbKB and Tuffy. The results are shown in Table 3.

System   Time/min
ProbKB   0.47
Tuffy    55

Table 3. Time to generate 200 joint samples.

The performance boost is due mostly to the GraphLab parallel engine.

The experimental results presented in this section are preliminary, but they show the need for better grounding and inference systems to achieve web scale, and the promise of our proposed techniques in solving the scalability and integrity challenges.

4. Conclusion and Future Work

This paper presents our on-going work on ProbKB. We built an initial prototype that stores MLNs in a relational form and designed an efficient grounding algorithm that applies rules in batches. We maintain a computationally safe set of "beliefs" to which rules are applied with minimal errors occurring. We connect to GraphLab for a parallel Gibbs sampling inference engine. Future work is discussed throughout the paper and summarized below:

• Develop techniques for the robust semi-naive algorithm to maintain knowledge integrity.

• Tightly integrate grounding and inference over MLNs using a shared-memory model.

• Port the SQL implementation of grounding to other frameworks, such as Hive and Shark.


References

Balbin, Isaac and Ramamohanarao, Kotagiri. A generalization of the differential approach to recursive query evaluation. The Journal of Logic Programming, 4(3):259–262, 1987.

Carlson, Andrew, Betteridge, Justin, Kisiel, Bryan, Settles, Burr, Hruschka Jr., Estevam R., and Mitchell, Tom M. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2010), volume 2, pp. 3–3, 2010.

Fader, Anthony, Soderland, Stephen, and Etzioni, Oren. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545. Association for Computational Linguistics, 2011.

Gonzalez, Joseph, Low, Yucheng, Gretton, Arthur, and Guestrin, Carlos. Parallel Gibbs sampling: from colored fields to thin junction trees. Journal of Machine Learning Research, 2011.

Low, Yucheng, Gonzalez, Joseph, Kyrola, Aapo, Bickson, Danny, Guestrin, Carlos, and Hellerstein, Joseph M. GraphLab: a new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

Low, Yucheng, Bickson, Danny, Gonzalez, Joseph, Guestrin, Carlos, Kyrola, Aapo, and Hellerstein, Joseph M. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.

Niu, Feng, Ré, Christopher, Doan, AnHai, and Shavlik, Jude. Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. Proceedings of the VLDB Endowment, 4(6):373–384, 2011.

Niu, Feng, Zhang, Ce, Ré, Christopher, and Shavlik, Jude. Scaling inference for Markov logic via dual decomposition. In 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 1032–1037. IEEE, 2012.

Poon, Hoifung and Domingos, Pedro. Sound and efficient inference with probabilistic and deterministic dependencies. In Proceedings of the National Conference on Artificial Intelligence, volume 21, pp. 458. AAAI Press, 2006.

Poon, Hoifung, Christensen, Janara, Domingos, Pedro, Etzioni, Oren, Hoffmann, Raphael, Kiddon, Chloe, Lin, Thomas, Ling, Xiao, Ritter, Alan, Schoenmackers, Stefan, et al. Machine reading at the University of Washington. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pp. 87–95. Association for Computational Linguistics, 2010.

Richardson, Matthew and Domingos, Pedro. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

Roth, Dan. On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273–302, 1996.

Schoenmackers, Stefan, Etzioni, Oren, Weld, Daniel S., and Davis, Jesse. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1088–1098. Association for Computational Linguistics, 2010.

Singla, Parag and Domingos, Pedro. Memory-efficient inference in relational domains. In Proceedings of the National Conference on Artificial Intelligence, volume 21, pp. 488. AAAI Press, 2006.

