
Building Discerning Knowledge Bases from Multiple Source Documents,

with Novel Fact Filtering

Jason Hale1, Sumali Conlon1, Tim McCready1, Susan Lukose2, Anil Vinjamur2

1 Department of Management Information Systems, University of Mississippi, University, MS 38677

2 Department of Computer and Information Science, University of Mississippi, University, MS 38677

Novel Facts

[Architecture diagram: Web Articles (WSJ “Article LMN” 1/2/05; Reuters “OPQ” 1/3/05) → Information Extraction Agent → XML Facts → Novelty Fact Filtering Agent → Knowledge Base. The knowledge base holds FACTS (1. ABC, 2. HI, 3. X!Y), Sources (1. WSJ, 2. Reuters), Articles (1. LMN, 2. OPQ), and CONFLICTS (1. X!Y, 2. XY). Annotations: BC complements AB; HI duplicates HI; XY conflicts with X!Y.]

Presentation Outline

Background
Motivation
Research Goals
Systems Architecture
Method of Approach
Future Research


Yesterday

• Scarce
• Expensive
• Printed text
• Slow moving
• Stale upon arrival
• Hoarded by experts
• Manually processed
• Trusted, but not always correct

Today

• Overabundance
• Cheap
• Electronic text
• Electric speed
• Fresh mixed with stale
• Communicable
• Semi-automatic
• Mix of correct/incorrect, trusted/untrusted

Business Information

• Repetitive information in multiple packages
• No time to read them all
• You want just the facts you need
  - from all (and just) the relevant docs → Information Retrieval (IR)
  - maybe without reading any articles → Information Extraction (IE)
  - definitely without redundant reading → Novelty Filtering
• Impossible to keep up with, manually

Looking for Information on the Web

Advancing Information Extraction Methods
• Extracting financial information from online documents (Reuters, Wall Street Journal)
• via the FIRST system (Lukose et al., AMCIS 2004)

Making business information available on the web more processable

• Converting the extracted facts into XML.

• FIRST Quarter (Vinjamur et al., AMCIS 2005)

Ongoing Research Goals of UM Team

Making web business information more manageable

• Adding a novelty-filtering layer to the evolving FIRST Quarter IE system

• Storing novel facts extracted by FIRST Quarter in a knowledge base

Liberating facts from their sources
• Multiple sourcing (Wall Street Journal, Reuters)

• Fact trustworthiness

Ongoing Research of Our Team – Goals Addressed in this Paper

Flexible Information extRaction SysTem (FIRST)

Extracted info from the Wall Street Journal only
- corporate earnings facts and predictions

Human text-pattern based rule creation

Used natural language processing
- with WordNet to enhance recall
- with a KWIC index to enhance precision

Output facts in semi-structured text

Organization Name: SANYO ELECTRIC CO

Organization Description: One of Japan’s biggest makers of electrical and electronic products

Fact / Prediction: Fact (has happened)

Financial Item: Earnings

Financial Item Status: Fell

Financial Item % Change: 94%

Financial Item Change Description: From $235.4 million to $13.7 million

Sales Status: Fell

Sales % Change: 21%

Sales Change Description: from $9.76 billion to $7.68 billion

Sample Corporate Report Table
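To make the XML conversion concrete, here is a minimal Python sketch of how such a report might be serialized as an XML fact. The element and attribute names are illustrative assumptions, not the actual FIRST Quarter tag set.

# Minimal sketch: turning the sample Sanyo report fields into one XML "fact".
# Element/attribute names are illustrative only; the real tag set may differ.
import xml.etree.ElementTree as ET

report = {
    "organization_name": "SANYO ELECTRIC CO",
    "fact_or_prediction": "Fact (has happened)",
    "financial_item": "Earnings",
    "financial_item_status": "Fell",
    "financial_item_pct_change": "94%",
    "financial_item_change_desc": "From $235.4 million to $13.7 million",
    "sales_status": "Fell",
    "sales_pct_change": "21%",
    "sales_change_desc": "from $9.76 billion to $7.68 billion",
}

fact = ET.Element("fact", source="WSJ")   # source attribute is illustrative metadata
for field, value in report.items():
    ET.SubElement(fact, field).text = value

print(ET.tostring(fact, encoding="unicode"))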

FIRST Quarter Enhancements

- Extracting from multiple sources: WSJ, Reuters, etc.
  - multi-sourced facts
  - requires humans to add more rules
- Extracting time and date information
- Extracting more-structured facts

[Architecture diagram: Information Retrieval Agent → Web Articles (WSJ “Article LMN” 1/2/05; Reuters “OPQ” 1/3/05) → Information Extraction Agent → Novelty Fact Filtering Agent → Knowledge Base]

A theoretical IR agent retrieves relevant, text-based corporate earnings reports from multiple web sources… and feeds them to an IE agent, such as FIRST Quarter.

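Read together, these slides describe a pipeline of cooperating agents. A minimal conceptual sketch follows, assuming hypothetical function names; the real agents are far more involved and are described in the slides below.

# Conceptual pipeline sketch: IR agent -> IE agent -> novelty filter -> knowledge base.
# All function bodies are placeholders, not the actual FIRST Quarter implementation.

def retrieve_articles(sources):
    """IR agent: fetch relevant earnings articles from each web source."""
    return [{"source": s, "article_id": f"{s}-001", "text": "..."} for s in sources]

def extract_facts(article):
    """IE agent (e.g., FIRST Quarter): pull discrete facts out of one article."""
    return [{"fact": "...", "source": article["source"], "article_id": article["article_id"]}]

def is_novel(fact, knowledge_base):
    """Novelty filter: admit only facts not already known (detailed in later slides)."""
    return fact not in knowledge_base

knowledge_base = []
for article in retrieve_articles(["WSJ", "Reuters"]):
    for fact in extract_facts(article):
        if is_novel(fact, knowledge_base):
            knowledge_base.append(fact)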

Example WSJ Article Fed Into FIRST Quarter


Information is extracted from the text, producing discrete XML facts. This pool of XML facts is funneled into a novelty filter.


Each XML fact is packaged with meta-data identifying its respective source.


Tasks of the FIRST Quarter Novelty Filter

- Weed out duplicate facts
- Fold in complementary facts
  - facts of differing precision
- Detect and manage conflicting facts
  - corrected facts

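One way to picture these tasks is as a per-fact comparison against what the knowledge base already holds. The sketch below is illustrative only: the field names ("org", "item", A, B, C) and the matching rules are assumptions, not the system's actual logic.

# Illustrative classification of an incoming fact against one already-known fact.
# A fact is modeled as a dict of field -> value; None means the field is unknown.

def classify(new, known):
    """Return 'duplicate', 'complementary', 'conflicting', or 'novel'."""
    # Facts about different organizations or financial items never match.
    if new.get("org") != known.get("org") or new.get("item") != known.get("item"):
        return "novel"
    shared = [f for f in new if f in known and new[f] is not None and known[f] is not None]
    if any(new[f] != known[f] for f in shared):
        return "conflicting"                      # e.g., XY vs. X!Y
    new_fields = {f for f, v in new.items() if v is not None}
    known_fields = {f for f, v in known.items() if v is not None}
    if new_fields <= known_fields:
        return "duplicate"                        # e.g., HI arriving a second time
    return "complementary"                        # e.g., AB and BC both describing B

# Toy facts echoing the slides: AB (from WSJ "LMN") and BC (from Reuters "OPQ").
ab = {"org": "XCorp", "item": "earnings", "A": 1, "B": 2, "C": None}
bc = {"org": "XCorp", "item": "earnings", "A": None, "B": 2, "C": 3}
print(classify(bc, ab))                           # -> complementary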

XML Fact Extracted by FIRST Quarter


In concept… each partial fact is interrogated in isolation… partial facts are detected in the novelty filter… and joined into complete facts before entering the knowledge base.


In practice… each fact is made to reveal its source… then admitted to the knowledge base. AB and BC provide complementary info about B.


As each subsequent fact is digested… does it match a fact already learned?

Match Types

- Complementing Facts
- Duplicate Facts
- Facts of Differing Precision
- Conflicting Facts


…so rather than inserting another partial fact, we augment (update) the existing fact.
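A minimal sketch of that fold-in step, assuming facts are simple field-to-value records; the field names echo the toy facts AB and BC from the slides, not real system fields.

# Fold a complementary partial fact into the existing fact rather than adding a new row.
def augment(existing, partial):
    """Fill in any fields the existing fact is missing; leave known values untouched."""
    for field, value in partial.items():
        if value is not None and existing.get(field) is None:
            existing[field] = value
    return existing

ab = {"A": 1, "B": 2, "C": None}       # partial fact AB, from WSJ article "LMN"
bc = {"A": None, "B": 2, "C": 3}       # partial fact BC, from Reuters article "OPQ"
print(augment(ab, bc))                 # -> {'A': 1, 'B': 2, 'C': 3}: the complete fact ABC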


Novel fact HI is detected… from a familiar source… HI enters the knowledge base… and remembers its sole source.


Novel fact XY is detected… and digested as a sole-sourced fact.


Duplicate fact HI is found to have come from a 2nd source… We remember the new source… but discard the duplicate fact. HI is now linked to multiple sources.
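In terms of the knowledge-base schema shown later, handling a duplicate amounts to adding one linking row in ARTICLE_FACTS rather than a new FACTS row. Here is a sketch using SQLite as a stand-in for the actual database; the column sets and source codes are simplified and illustrative.

# When fact HI arrives again from a second article, keep the single FACTS row
# and just link the new article to it through ARTICLE_FACTS.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE FACTS (FACT_ID INTEGER PRIMARY KEY, SUMMARY TEXT);        -- simplified
CREATE TABLE ARTICLES (ARTICLE_ID INTEGER PRIMARY KEY, TITLE TEXT, SOURCE_CD TEXT);
CREATE TABLE ARTICLE_FACTS (ARTICLE_ID INTEGER, FACT_ID INTEGER);
""")
db.execute("INSERT INTO FACTS VALUES (1, 'HI')")
db.execute("INSERT INTO ARTICLES VALUES (1, 'LMN', 'WSJ'), (2, 'OPQ', 'REU')")  # codes made up
db.execute("INSERT INTO ARTICLE_FACTS VALUES (1, 1)")   # HI first seen in WSJ 'LMN'

# Duplicate HI extracted from Reuters 'OPQ': discard the copy, remember the source.
db.execute("INSERT INTO ARTICLE_FACTS VALUES (2, 1)")

print(db.execute("""SELECT a.SOURCE_CD, a.TITLE FROM ARTICLE_FACTS af
                    JOIN ARTICLES a ON a.ARTICLE_ID = af.ARTICLE_ID
                    WHERE af.FACT_ID = 1""").fetchall())
# e.g. [('WSJ', 'LMN'), ('REU', 'OPQ')] -- HI is now linked to multiple sources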



Fact X!Y is matched against known facts…


…and found to conflict with XY. Both facts are moved to a Conflicts table.


X!Y is later extracted from a 3rd source (WSJ “ZZZ” 1/4/05)… and matched against known facts and conflicts. Since it matches an existing conflict, X!Y is vindicated, while XY is disavowed. X!Y is now a dual-sourced fact.
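A sketch of that vindication rule, assuming each conflict is stored as a pair of competing facts with the sources backing each side; the data structures here are illustrative, not the system's.

# When a new extraction matches one side of a stored conflict, count it as another
# vote for that side; once one side has support from more sources, vindicate it.
conflict = {
    "XY":  {"sources": {("WSJ", "LMN")}},
    "X!Y": {"sources": {("Reuters", "OPQ")}},
}

def add_support(conflict, fact, source, article):
    conflict[fact]["sources"].add((source, article))
    backing = {f: len(side["sources"]) for f, side in conflict.items()}
    winner = max(backing, key=backing.get)
    loser = min(backing, key=backing.get)
    if backing[winner] > backing[loser]:
        return winner, loser          # vindicated, disavowed
    return None, None                 # still unresolved

# X!Y shows up a third time, in WSJ article "ZZZ" (1/4/05).
vindicated, disavowed = add_support(conflict, "X!Y", "WSJ", "ZZZ")
print(vindicated, disavowed)          # -> X!Y XY : X!Y becomes a dual-sourced fact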

Knowledge Base Schema

Relationships (foreign-key joins): SOURCE_CD = SOURCE_CD; ARTICLE_ID = ARTICLE_ID; FACT_ID = FACT_ID; ORG_CD = ORG_CD; INTERVAL_CD = TO_INTERVAL_CD; ACTION_CD = ACTION_CD; ELEMENT_CD = ELEMENT_CD

ACTIONS
- ACTION_CD VARCHAR2(3) <pk>
- DESCRIPTION VARCHAR2(50)

ARTICLES
- ARTICLE_ID NUMBER <pk>
- TITLE VARCHAR2(50)
- URL VARCHAR2(500)
- DATE_STAMP DATE
- SOURCE_CD VARCHAR2(3) <fk>

ARTICLE_FACTS
- ARTICLE_ID NUMBER <fk1>
- FACT_ID NUMBER <fk2>

CONFLICTS
- CONFLICT_ID NUMBER
- FACT_ID NUMBER
- FACT_TYPE_CD VARCHAR2(1)
- ELEMENT_CD VARCHAR2(2)
- CHANGE_PERCENT NUMBER
- FROM_AMT NUMBER
- TO_AMT NUMBER
- ORG_CD VARCHAR2(6)
- TO_INTERVAL_CD VARCHAR2(6)
- ACTION_CD VARCHAR2(3)
- FROM_INTERVAL_CD VARCHAR2(6)
- CHANGE_TO_PERCENT NUMBER

ELEMENTS
- ELEMENT_CD VARCHAR2(2) <pk>
- DESCRIPTION VARCHAR2(30)

FACTS
- FACT_ID NUMBER <pk>
- FACT_TYPE_CD VARCHAR2(1)
- ELEMENT_CD VARCHAR2(2) <fk>
- CHANGE_PERCENT NUMBER
- FROM_AMT NUMBER
- TO_AMT NUMBER
- ORG_CD VARCHAR2(6) <fk>
- TO_INTERVAL_CD VARCHAR2(6) <fk>
- ACTION_CD VARCHAR2(3) <fk>
- FROM_INTERVAL_CD VARCHAR2(6)
- CHANGE_TO_PERCENT NUMBER

ORGANIZATIONS
- ORG_CD VARCHAR2(10) <pk>
- DESCRIPTION VARCHAR2(50)

SOURCES
- SOURCE_CD VARCHAR2(3) <pk>
- DESCRIPTION VARCHAR2(50)

TIME_INTERVALS
- INTERVAL_CD VARCHAR2(6) <pk>
- DESCRIPTION VARCHAR2(50)
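Illustrative only: the schema above uses Oracle column types, and the short SQLite sketch below merely highlights one design point visible in it, namely that CONFLICTS mirrors FACTS column-for-column, plus its own CONFLICT_ID.

# Sketch: CONFLICTS reuses the FACTS column list, so a fact can be "moved" between
# the two tables without losing any of its fields. Types adapted from Oracle to SQLite.
import sqlite3

FACT_COLUMNS = """
    FACT_ID           INTEGER,
    FACT_TYPE_CD      TEXT,
    ELEMENT_CD        TEXT,
    CHANGE_PERCENT    NUMERIC,
    FROM_AMT          NUMERIC,
    TO_AMT            NUMERIC,
    ORG_CD            TEXT,
    TO_INTERVAL_CD    TEXT,
    ACTION_CD         TEXT,
    FROM_INTERVAL_CD  TEXT,
    CHANGE_TO_PERCENT NUMERIC
"""

db = sqlite3.connect(":memory:")
db.executescript(f"""
CREATE TABLE FACTS ({FACT_COLUMNS});
CREATE TABLE CONFLICTS (CONFLICT_ID INTEGER, {FACT_COLUMNS});
""")
print(db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
# -> [('FACTS',), ('CONFLICTS',)]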

Method of Approach

1. Find a pair of related earnings reports from WSJ and Reuters.

2. Manually extract all targeted facts from the articles.

3. For each document in the pair, count the number of:
- Facts to be extracted
- Items to be extracted
- Duplicate facts
- Complementing facts
- Conflicting facts

Method of Approach (cont.)

4. Feed the document pair into the FIRST Quarter system.

5. At the end, look in the database and compare the results with the manually extracted facts.

6. If all facts were not processed correctly, then:
• Manually update the rule base
• Re-process the pair of source documents
• Back up and wipe out the database
• Re-process the corpus of test documents, and compare with the backup database to compute the new scores

Method of Approach

7. We will be finished with FIRST Quarter when the last X pairs of new documents processed do not result in improved accuracy over the previous X, in spite of rule updates. [WE STOP IMPROVING]

Measures of Effectiveness

• Fact-level Recall/Precision
• Item-level Recall/Precision
• Duplicate Fact Recall/Precision
• Complementing Fact Recall/Precision
• Conflicting Fact Recall/Precision

FIRST Results to Date

Precision = (number of items tagged correctly) / (number of items tagged)
FIRST’s precision = 90%

Recall = (number of items tagged by the system) / (number of possible items that experts would tag)
FIRST’s recall = 85%

F = 2PR / (P + R)
FIRST’s F value = 87.43%
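The reported F value follows directly from those two numbers; a quick check:

# F-measure check from the reported precision and recall.
precision, recall = 0.90, 0.85
f = 2 * precision * recall / (precision + recall)
print(round(f * 100, 2))   # -> 87.43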

Future Research Goals of UM Team

- Incorporate Machine Learning Techniques to improve FIRST Quarter IE precision and recall

- Build tools to mark up / weed out copies of processed source docs:
  - to reflect which facts were extracted
  - to weed out redundant information

- Add an IR agent to feed documents to the FIRST Quarter system, so the knowledge base is built automatically from the web

- Add web services built on the knowledge base.

Novel Facts (summary)

[Architecture diagram repeated from the overview: Web Articles → Information Extraction Agent → XML Facts → Novelty Fact Filtering Agent → Knowledge Base, now holding FACTS (ABC, HI, X!Y), Sources (WSJ, Reuters), Articles (LMN, OPQ), and CONFLICTS (X!Y, XY).]

Questions?

