Building Discerning Knowledge Bases from Multiple Source Documents,
with Novel Fact Filtering
Jason Hale (1), Sumali Conlon (1), Tim McCready (1), Susan Lukose (2), Anil Vinjamur (2)
(1) Department of Management Information Systems, University of Mississippi, University, MS 38677
(2) Department of Computer and Information Science, University of Mississippi, University, MS 38677
Outline
[Figure: system overview. Web articles (WSJ “Article LMN”, 1/2/05; Reuters “OPQ”, 1/3/05) feed the Information Extraction Agent, which emits XML facts. The Novelty Fact Filtering Agent passes novel facts to the Knowledge Base (FACTS, CONFLICTS, Sources, Articles). Legend: BC complements AB; HI duplicates HI; XY conflicts with X!Y.]
Presentation Outline
• Background
• Motivation
• Research Goals
• Systems Architecture
• Method of Approach
• Future Research
Yesterday
• Scarce
• Expensive
• Printed text
• Slow moving
• Stale upon arrival
• Hoarded by experts
• Manually processed
• Trusted, but not always correct
Today
• Overabundant
• Cheap
• Electronic text
• Electric speed
• Fresh mixed with stale
• Communicable
• Semi-automatic
• Mix of correct/incorrect, trusted/untrusted
Business Information
• Repetitive information in multiple packages
• No time to read them all
• You want just the facts you need
  - From all (and just) the relevant docs: Information Retrieval (IR)
  - Maybe without reading any articles: Information Extraction (IE)
  - Definitely without redundant reading: Novelty Filtering
• Impossible to keep up with manually
Looking for Information on the Web
Advancing Information Extraction Methods
• Extracting financial information from online documents (Reuters, Wall Street Journal)
• via the FIRST system (Lukose et al., AMCIS 2004)
Making business information available on the web more processable
• Converting the extracted facts into XML
• FIRST Quarter (Vinjamur et al., AMCIS 2005)
Ongoing Research Goals of UM Team
Making web business information more manageable
• Adding a Novelty Filtering Layer to the evolving FIRST Quarter IE system
• Storing novel facts extracted by FIRST Quarter in a Knowledge Base
Liberating facts from their sources
• Multiple sourcing (Wall Street Journal, Reuters)
• Fact trustworthiness
Ongoing Research of Our Team – Goals Addressed in this Paper
Flexible Information extRaction SysTem (FIRST)
• Extracted information from the Wall Street Journal only
  - corporate earnings facts and predictions
• Human text-pattern-based rule creation
• Used natural language processing
  - with WordNet to enhance recall
  - with a KWIC index to enhance precision
• Output facts in semi-structured text
Organization Name: SANYO ELECTRIC CO
Organization Description: One of Japan’s biggest makers of electrical and electronic products
Fact / Prediction: Fact (Has Happened)
Financial Item: Earnings
Financial Item Status: Fell
Financial Item % Change: 94%
Financial Item Change Description: From $235.4 million to $13.7 million
Sales Status: Fell
Sales % Change: 21%
Sales Change Description: from $9.76 billion to $7.68 billion
Sample Corporate Report Table
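The step from semi-structured text to XML can be illustrated with a minimal sketch. It assumes that each record line has a "Label: value" shape and that XML tag names are derived from the labels; the function name `record_to_xml`, the `<fact>` element, and the `source` and `article` attributes are hypothetical and are not taken from FIRST or FIRST Quarter.

```python
import xml.etree.ElementTree as ET

# A few lines from the sample record above.
RECORD = """\
Organization Name: SANYO ELECTRIC CO
Financial Item: Earnings
Financial Item Status: Fell
Financial Item % Change: 94%
"""

def record_to_xml(text, source, article):
    """Turn 'Label: value' lines into one <fact> element (hypothetical layout)."""
    fact = ET.Element("fact", {"source": source, "article": article})
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Label: value'
        # Derive a tag name: keep letters and digits, join words with '_'.
        words = "".join(c if c.isalnum() else " " for c in label).split()
        ET.SubElement(fact, "_".join(words).lower()).text = value.strip()
    return fact

fact = record_to_xml(RECORD, source="WSJ", article="LMN")
print(ET.tostring(fact, encoding="unicode"))
```

Carrying the source and article on the fact element is one simple way to keep a fact traceable once it leaves the text it came from.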
FIRST Quarter Enhancements
- Extracting from multiple sources: WSJ, Reuters, etc.
  - multi-sourced facts
  - requires humans adding more rules
- Extracting time and date information
- Extracting more-structured facts
A theoretical IR agent retrieves relevant, text-based corporate earnings reports from multiple web sources and feeds them to an IE agent, such as FIRST Quarter.
Information is extracted from the text, producing discrete XML facts. This pool of XML facts is funneled into a novelty filter.
Each XML fact is packaged with metadata identifying its respective source.
Tasks of the FIRST Quarter Novelty Filter
• Weed out duplicate facts
• Fold in complementary facts
  - facts of differing precision
• Detect and manage conflicting facts
  - corrected facts
In concept, partial facts are detected in the novelty filter and joined into complete facts before entering the knowledge base.
In practice, each partial fact is interrogated in isolation, made to reveal its source, then admitted to the knowledge base.
AB and BC provide complementary information about B.
As each subsequent fact is digested, the novelty filter asks: does it match a fact already learned?
Match Types
• Complementing Facts
• Duplicate Facts
• Facts of Differing Precision
• Conflicting Facts
BC complements AB, so rather than inserting another partial fact, we augment (update) the existing fact, giving ABC.
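The four match types, and the update-in-place handling of complementing facts, can be sketched as follows. This is an illustration under stated assumptions, not the FIRST Quarter rule set: facts are modeled as dictionaries, the identifying slots in `KEY_SLOTS` and the interval value "Q1" are hypothetical, and "differing precision" is taken to mean that two numbers agree once the finer one is rounded to the coarser one.

```python
from decimal import Decimal

KEY_SLOTS = ("org", "element", "interval")  # hypothetical identifying slots

def same_up_to_precision(a, b):
    """True when two numbers agree once the finer one is rounded to the coarser."""
    if not (isinstance(a, Decimal) and isinstance(b, Decimal)):
        return False
    if a.as_tuple().exponent >= b.as_tuple().exponent:
        coarse, fine = a, b
    else:
        coarse, fine = b, a
    return fine.quantize(coarse) == coarse

def classify(known, new):
    """Return the match type of `new` against one already-learned fact."""
    if any(known.get(k) != new.get(k) for k in KEY_SLOTS):
        return "novel"  # different subject: no match at all
    shared = [s for s in new if s in known and s not in KEY_SLOTS]
    differing = [s for s in shared if known[s] != new[s]]
    if differing:
        if all(same_up_to_precision(known[s], new[s]) for s in differing):
            return "differing precision"
        return "conflicting"
    if any(s not in known for s in new):
        return "complementing"  # agrees where it overlaps, and adds slots
    return "duplicate"

def augment(known, new):
    """Fold a complementing fact into the existing one instead of inserting it."""
    merged = dict(known)
    merged.update(new)
    return merged

key = {"org": "SANYO ELECTRIC CO", "element": "earnings", "interval": "Q1"}
ab = {**key, "status": "fell"}
bc = {**key, "change_percent": Decimal("94")}

print(classify(ab, bc))  # complementing
abc = augment(ab, bc)
print(classify(abc, bc))  # duplicate
print(classify(abc, {**key, "change_percent": Decimal("94.2")}))  # differing precision
print(classify(abc, {**key, "status": "rose"}))  # conflicting
```

A fact about a different organization, element, or interval falls through as "novel" and would simply be inserted.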
Novel fact HI is detected from a familiar source. HI enters the knowledge base and remembers its sole source.
Novel fact XY is detected and digested as a sole-sourced fact.
Duplicate fact HI is found to have come from a second source. We remember the new source but discard the duplicate fact. HI is now linked to multiple sources.
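Duplicate handling can be sketched with a toy in-memory knowledge base. The class and method names below (`KnowledgeBase`, `digest`, `sources_of`) are hypothetical; the only idea carried over from the slides is that a duplicate fact adds a fact-to-article link without adding a second fact.

```python
class KnowledgeBase:
    """Toy in-memory stand-in for the facts, articles and fact-article links."""

    def __init__(self, article_sources):
        self.article_sources = dict(article_sources)  # article id -> source name
        self.facts = []             # a fact's id is its index in this list
        self.article_facts = set()  # (article_id, fact_id) links

    def digest(self, fact, article_id):
        """Store a novel fact, or only record the new source for a duplicate."""
        if fact in self.facts:
            fact_id = self.facts.index(fact)  # duplicate: discard the copy
        else:
            self.facts.append(fact)           # novel: admit to the knowledge base
            fact_id = len(self.facts) - 1
        self.article_facts.add((article_id, fact_id))
        return fact_id

    def sources_of(self, fact):
        """Return the set of source names whose articles reported this fact."""
        fact_id = self.facts.index(fact)
        return {self.article_sources[a] for a, f in self.article_facts if f == fact_id}

kb = KnowledgeBase({"LMN": "WSJ", "OPQ": "Reuters"})
kb.digest("HI", "LMN")  # novel, sole-sourced
kb.digest("HI", "OPQ")  # duplicate from a second source
print(len(kb.facts), sorted(kb.sources_of("HI")))  # 1 ['Reuters', 'WSJ']
```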
Fact X!Y is matched against known facts and found to conflict with XY. Both facts are moved to a Conflicts table.
X!Y is later extracted from a third source (WSJ “ZZZ”, 1/4/05) and matched against known facts and conflicts. Since it matches an existing conflict, X!Y is vindicated, while XY is disavowed. X!Y is now a dual-sourced fact.
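The conflict walk-through above can be sketched as a small state machine. The data layout and the function name `digest` are hypothetical, and the resolution rule (the first corroborated claim wins) is only what the example depicts; a real filter would probably also weigh source trustworthiness. Dropping the resolved conflict record is an assumption of this sketch, since the slides do not say whether it is retained.

```python
def digest(kb, subject, value, article):
    """Route one extracted claim to the facts or conflicts store."""
    facts, conflicts = kb["facts"], kb["conflicts"]
    if subject in conflicts:
        rivals = conflicts[subject]  # value -> set of reporting articles
        if value in rivals:
            # Corroborated by a new source: vindicate this value.
            facts[subject] = (value, rivals[value] | {article})
            del conflicts[subject]   # assumption: the rival claim is dropped
            return "vindicated"
        rivals[value] = {article}
        return "conflict"
    if subject in facts:
        known_value, sources = facts[subject]
        if known_value == value:
            sources.add(article)     # duplicate: just remember the source
            return "duplicate"
        # Contradiction: move both claims to the conflicts store.
        del facts[subject]
        conflicts[subject] = {known_value: sources, value: {article}}
        return "conflict"
    facts[subject] = (value, {article})
    return "novel"

kb = {"facts": {}, "conflicts": {}}
steps = [
    digest(kb, "X", "Y", "LMN"),   # XY arrives first
    digest(kb, "X", "!Y", "OPQ"),  # X!Y contradicts it
    digest(kb, "X", "!Y", "ZZZ"),  # a third article repeats X!Y
]
print(steps)  # ['novel', 'conflict', 'vindicated']
```

After the third step, X maps to "!Y" with two supporting articles, which corresponds to the dual-sourced fact in the example.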
Knowledge Base Schema

ACTIONS
  ACTION_CD          VARCHAR2(3)    <pk>
  DESCRIPTION        VARCHAR2(50)

ARTICLES
  ARTICLE_ID         NUMBER         <pk>
  TITLE              VARCHAR2(50)
  URL                VARCHAR2(500)
  DATE_STAMP         DATE
  SOURCE_CD          VARCHAR2(3)    <fk>

ARTICLE_FACTS
  ARTICLE_ID         NUMBER         <fk1>
  FACT_ID            NUMBER         <fk2>

CONFLICTS
  CONFLICT_ID        NUMBER
  FACT_ID            NUMBER
  FACT_TYPE_CD       VARCHAR2(1)
  ELEMENT_CD         VARCHAR2(2)
  CHANGE_PERCENT     NUMBER
  FROM_AMT           NUMBER
  TO_AMT             NUMBER
  ORG_CD             VARCHAR2(6)
  TO_INTERVAL_CD     VARCHAR2(6)
  ACTION_CD          VARCHAR2(3)
  FROM_INTERVAL_CD   VARCHAR2(6)
  CHANGE_TO_PERCENT  NUMBER

ELEMENTS
  ELEMENT_CD         VARCHAR2(2)    <pk>
  DESCRIPTION        VARCHAR2(30)

FACTS
  FACT_ID            NUMBER         <pk>
  FACT_TYPE_CD       VARCHAR2(1)
  ELEMENT_CD         VARCHAR2(2)    <fk4>
  CHANGE_PERCENT     NUMBER
  FROM_AMT           NUMBER
  TO_AMT             NUMBER
  ORG_CD             VARCHAR2(6)    <fk1>
  TO_INTERVAL_CD     VARCHAR2(6)    <fk2>
  ACTION_CD          VARCHAR2(3)    <fk3>
  FROM_INTERVAL_CD   VARCHAR2(6)
  CHANGE_TO_PERCENT  NUMBER

ORGANIZATIONS
  ORG_CD             VARCHAR2(10)   <pk>
  DESCRIPTION        VARCHAR2(50)

SOURCES
  SOURCE_CD          VARCHAR2(3)    <pk>
  DESCRIPTION        VARCHAR2(50)

TIME_INTERVALS
  INTERVAL_CD        VARCHAR2(6)    <pk>
  DESCRIPTION        VARCHAR2(50)

Relationships (join columns)
  SOURCE_CD   = SOURCE_CD
  ARTICLE_ID  = ARTICLE_ID
  FACT_ID     = FACT_ID
  ORG_CD      = ORG_CD
  INTERVAL_CD = TO_INTERVAL_CD
  ACTION_CD   = ACTION_CD
  ELEMENT_CD  = ELEMENT_CD
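To show how the link table supports multi-sourced facts, here is a minimal sketch of part of this schema in SQLite via Python's `sqlite3` module. It is not the authors' database: the column types are SQLite stand-ins for the VARCHAR2/NUMBER types shown, only a subset of the FACTS columns is kept, the composite key on ARTICLE_FACTS is an assumption, and the code values ('REU', 'EA', 'FEL') are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (source_cd TEXT PRIMARY KEY, description TEXT);
CREATE TABLE articles (
    article_id INTEGER PRIMARY KEY,
    title      TEXT,
    url        TEXT,
    date_stamp TEXT,
    source_cd  TEXT REFERENCES sources(source_cd)
);
CREATE TABLE facts (
    fact_id        INTEGER PRIMARY KEY,
    org_cd         TEXT,
    element_cd     TEXT,
    action_cd      TEXT,
    change_percent REAL
);
CREATE TABLE article_facts (
    article_id INTEGER REFERENCES articles(article_id),
    fact_id    INTEGER REFERENCES facts(fact_id),
    PRIMARY KEY (article_id, fact_id)  -- assumption: one link per pair
);
""")
conn.executemany("INSERT INTO sources VALUES (?, ?)",
                 [("WSJ", "Wall Street Journal"), ("REU", "Reuters")])
conn.executemany(
    "INSERT INTO articles (article_id, title, source_cd) VALUES (?, ?, ?)",
    [(1, "LMN", "WSJ"), (2, "OPQ", "REU")])
conn.execute("INSERT INTO facts VALUES (1, 'SANYO', 'EA', 'FEL', 94)")
# The same fact is linked to both articles: one fact row, two sources.
conn.executemany("INSERT INTO article_facts VALUES (?, ?)", [(1, 1), (2, 1)])

rows = conn.execute("""
    SELECT f.fact_id, COUNT(DISTINCT a.source_cd) AS n_sources
    FROM facts f
    JOIN article_facts af ON af.fact_id = f.fact_id
    JOIN articles a ON a.article_id = af.article_id
    GROUP BY f.fact_id
""").fetchall()
print(rows)  # [(1, 2)]
```

Keeping the fact-to-article links in their own table is what lets a duplicate fact add a source without adding a second fact row.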
Method of Approach
1. Find a pair of related earnings reports from WSJ and Reuters.
2. Manually extract all targeted facts from the articles.
3. For each document in the pair, count the number of:
   - Facts to be extracted
   - Items to be extracted
   - Duplicate facts
   - Complementing facts
   - Conflicting facts
Method of Approach (cont.)
4. Feed the document pair into the FIRST Quarter system.
5. At the end, look in the database and compare the results with the manually extracted facts.
6. If not all facts were processed correctly, then:
   • Manually update the rule base
   • Re-process the pair of source documents
   • Back up and wipe out the database
   • Re-process the corpus of test documents, and compare with the backup database to compute the new scores
Method of Approach
7. We will be finished with FIRST Quarter when the last X pairs of new documents processed do not result in improved accuracy over the previous X pairs, in spite of rule updates (that is, when we stop improving).
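The stopping rule in step 7 can be sketched as a comparison of two adjacent windows of scores. The slides do not say how accuracy over a window is aggregated; using the mean, and the function name `should_stop`, are assumptions of this sketch.

```python
def should_stop(scores, x):
    """Stop when the last x document pairs score no better than the x before them.

    `scores` holds one accuracy figure per processed document pair, in order.
    """
    if len(scores) < 2 * x:
        return False  # not enough history to compare two windows
    recent = scores[-x:]
    previous = scores[-2 * x:-x]
    return sum(recent) / x <= sum(previous) / x

# Scores in whole percent, one per document pair (made-up numbers).
print(should_stop([70, 80, 85, 85], 2))  # False: still improving
print(should_stop([80, 86, 83, 83], 2))  # True: no gain despite rule updates
```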
Measures of Effectiveness
• Fact-level Recall/Precision
• Item-level Recall/Precision
• Duplicate Fact Recall/Precision
• Complementing Fact Recall/Precision
• Conflicting Fact Recall/Precision
FIRST Results to Date
$\text{Precision} = \dfrac{\text{number of items tagged correctly}}{\text{number of items tagged}}$
FIRST's Precision = 90%
$\text{Recall} = \dfrac{\text{number of items tagged correctly by the system}}{\text{number of possible items that experts would tag}}$
FIRST's Recall = 85%
$F = \dfrac{2PR}{P + R}$
FIRST's F value = 87.43%
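As a quick check of the arithmetic, the reported F value follows from the reported precision and recall. The function name `f_measure` is just for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# Reported FIRST results: P = 90%, R = 85%.
print(f"{f_measure(0.90, 0.85):.2%}")  # 87.43%
```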
Future Research Goals of UM Team
- Incorporate machine learning techniques to improve FIRST Quarter IE precision and recall
- Build tools to mark up or weed out copies of processed source docs
  - to reflect which facts were extracted
  - to weed out redundant information
- Add an IR agent to feed docs to the FIRST Quarter system, building the knowledge base automatically from the web
- Add web services built on the knowledge base
Questions?