
Entity-oriented filtering of large streams

Page 1: Entity-oriented filtering  of large streams

Entity-oriented filtering of large streams

John R. Frank [email protected]

Ian Soboroff ian.soboroff@nist.gov

Max Kleiman-Weiner [email protected]

Dan A. Roberts [email protected]

http://trec-kba.org

Page 2: Entity-oriented filtering  of large streams

Date: Tue, 13 Mar 2012 02:45:40 +0000

From: Google Alerts <[email protected]>

Subject: Google Alert - "John R. Frank"

=== Web - 2 new results for ["John R. Frank"] ===

John R. Frank

SPOKANE, Wash. - John R. Frank, 55, died March 4, 2012, in Coeur d'Alene, Idaho. Survivors include: his wife, Miki; daughter, Patricia Frank; ...

<http://www.hutchnews.com/obituaries/Frank--John-CP>

In Memory of John R Frank

Biography. John R. Frank, age 55, passed away at Sacred Heart Medical Center in Spokane, WA, on March 4, 2012. John was born in Hutchison, KS, ...

<http://www.englishfuneralchapel.com/sitemaker/sites/Englis1/obit.cgi?user=583335Frank>

Page 3: Entity-oriented filtering  of large streams

2012 Task: Filtering to Recommend Citations

1) Initialize with a target WP entity
• state of WP from Jan 2012

2) Iterate over stream of text items
• Oct-Dec 2011: train on labels

3) For each, output relevance between 0 and 1
• Jan-Apr 2012: labels hidden
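The three steps above can be sketched as a simple loop. This is a minimal illustration only, not the organizers' or any participant's system: the `StreamItem` type, the `score_item` function, and the name-matching scoring rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, Tuple


@dataclass
class StreamItem:
    """Hypothetical stand-in for one text item from the stream."""
    doc_id: str
    text: str


def score_item(entity_names: Sequence[str], item: StreamItem) -> float:
    """Toy relevance in [0, 1]: fraction of the entity's names found in the text."""
    if not entity_names:
        return 0.0
    text = item.text.lower()
    hits = sum(1 for name in entity_names if name.lower() in text)
    return hits / len(entity_names)


def filter_stream(
    entity_names: Sequence[str], stream: Iterable[StreamItem]
) -> Iterator[Tuple[str, float]]:
    # Step 1: the target entity (here, just its name strings) is fixed up front.
    # Step 2: iterate over the items in stream order.
    for item in stream:
        # Step 3: emit a relevance score between 0 and 1 for each item.
        yield item.doc_id, score_item(entity_names, item)


# Made-up example items; the texts echo the rating example in these slides.
example_names = ("Masaru Emoto", "Emoto")
example_stream = [
    StreamItem("doc-1", "Masaru Emoto is a Japanese photographer."),
    StreamItem("doc-2", "Water covers 70% of our planet."),
]
for doc_id, score in filter_stream(example_names, example_stream):
    print(doc_id, score)
```

A real system would replace `score_item` with a trained model; the loop structure is the only part taken from the task description.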

Content Stream
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening
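As a quick sanity check on the corpus figures, the snippet below divides the total text count by the number of hourly chunks. The only inputs are the numbers on the slide; the variable names are illustrative.

```python
total_texts = 462_000_000   # "462M texts" from the slide
hourly_chunks = 4_973       # "4,973 hourly chunks" from the slide
english_fraction = 0.40     # "40% English" from the slide

docs_per_hour = total_texts / hourly_chunks
stream_days = hourly_chunks / 24
english_texts = total_texts * english_fraction

print(f"{docs_per_hour:,.0f} docs/hour on average")
print(f"{stream_days:.0f} days of stream")
print(f"{english_texts / 1e6:.0f}M English texts")
```

The average comes out a little under 10^5 docs per hour, and the chunk count corresponds to roughly seven months of stream, which is consistent with the October 2011 to April 2012 range given in the task description.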

Your KBA System

Entities in Wikipedia or another Knowledge Base

Automatically recommend new edits

Page 4: Entity-oriented filtering  of large streams

s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/

Page 5: Entity-oriented filtering  of large streams
Page 6: Entity-oriented filtering  of large streams

Accelerate?

rate of assimilation << stream size

# editors << # entities << # mentions

(definition of a “large” KB)

Page 7: Entity-oriented filtering  of large streams

How many days must a news article wait before being cited in Wikipedia?

Page 8: Entity-oriented filtering  of large streams

Complex entity with many relationships and attributes.

Page 9: Entity-oriented filtering  of large streams

Has many interests, including trying to take over UK soccer teams.

His empire includes many entities…

Note: Usmanov not mentioned in this text!

Citation #18

Elaborate link trails…

Page 10: Entity-oriented filtering  of large streams

Example KBA Rating Task

Published: March 31, 2012
Impact of Thoughts on Water
By Denis Gorce-Bourge

Water covers 70% of our Blue planet and our body is made of about 70% water.

Masaru Emoto is a Japanese Photographer and scientist. He is known over the world for his remarkable work on water and its deep connection with individual and collective consciousness.

For decades, Masaru took pictures of frozen crystals of water and tested the direct influence of the environment on the quality of those crystals.

Pollution has a direct impact on the beauty of a frozen crystal but as well words, music and thoughts. He tested the quality of water crystals by exposing it to various conditions: to written words like hate and violence and Love and gratitude. The results were just astonishing. The crystal exposed to Love and gratitude was beautiful and perfectly formed where the other one was severely degraded. He demonstrated as well the impact of Heavy Metal music versus Mozart or Beethoven and how the vibration of music impacts water.

The very shape of water crystals is modified by violence, aggression, and negative words.

Page 11: Entity-oriented filtering  of large streams


Page 12: Entity-oriented filtering  of large streams
Page 13: Entity-oriented filtering  of large streams

Interannotator Agreement

97.6% +/- 1.4% (N=5365) coref
69.5% +/- 2.7% (N=1352) central
70.9% +/- 2.0% (N=2403) relevant
58.4% +/- 3.4% (N=884) neutral
84.9% +/- 2.0% (N=2599) garbage
82.6% +/- 1.8% (N=3200) central relevant
89.0% +/- 1.7% (N=3551) central relevant neutral

Page 14: Entity-oriented filtering  of large streams

TRECing the continental divide between NLP and IR

IR:
• User task centric
• Variation in interpretation
• Scores cascading lists
• Constructionist, emergence

NLP:
• Data parsing centric
• Universal annotation
• Scores probabilities
• Reductionist

Page 15: Entity-oriented filtering  of large streams
Page 16: Entity-oriented filtering  of large streams

string matching task generator: 91% recall, 15% precision, 26% F1
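The F1 figure follows from the reported precision and recall as their harmonic mean. The short check below uses only the two percentages from the slide; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = f1_score(precision=0.15, recall=0.91)
print(round(baseline_f1, 2))  # 0.26
```

Because the harmonic mean is dominated by the smaller value, the high recall does little to lift the score: F1 stays close to twice the precision at most.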

Page 17: Entity-oriented filtering  of large streams
Page 18: Entity-oriented filtering  of large streams

KBA 2013
More entity types with an emphasis on temporality in the stream.

Target Entities: People and Organizations
• KB: Wikipedia or maybe Freebase
• Centrally Relevant: Citation worthy
• Training Data: Judgments from early stream
• Annotation: High recall on all mentioning docs.

Target Entities: Pharmaceutical Compounds
• KB: Merck KB?
• Centrally Relevant: Reporting of Adverse Drug Reaction (ADR) (an event)
• Training Data: (same)
• Annotation: Focus recall on first person reporting & negative reactions?

Target Entities: Event-type Entities, defined by a cluster of entities and possibly a Type-of-Event from a taxonomy
• KB: WP/FB for cluster of entities, possibly also event itself.
• Centrally Relevant: Provides causality info
• Training Data: Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? Use citations in Category:Current_events?
• Annotation: Judge post-hoc?

Page 19: Entity-oriented filtering  of large streams

KBX

Cold Start queries focused on: nil entities related to target cluster and/or causality of event

Pool top-K filtered docs, or use each KBA run as separate KBP input.

(1000x filter)
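The two options on this slide, pooling the top-K filtered documents or feeding each KBA run to KBP separately, can be sketched as follows. This is an illustration under assumptions: the `Run` representation (a list of document id and score pairs), the function names, and the example run names are hypothetical and do not describe an actual KBA or KBP file format.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation of one KBA run: (doc_id, score) pairs.
Run = List[Tuple[str, float]]


def top_k(run: Run, k: int) -> List[str]:
    """Return the ids of the k highest-scoring documents in one run."""
    ranked = sorted(run, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def pool_top_k(runs: Dict[str, Run], k: int) -> Set[str]:
    """Option 1: union of every run's top-k documents, as one pooled input."""
    pooled: Set[str] = set()
    for run in runs.values():
        pooled.update(top_k(run, k))
    return pooled


def separate_inputs(runs: Dict[str, Run], k: int) -> Dict[str, List[str]]:
    """Option 2: keep each run's top-k documents as its own input."""
    return {name: top_k(run, k) for name, run in runs.items()}


example_runs = {
    "run_a": [("d1", 0.9), ("d2", 0.4), ("d3", 0.1)],
    "run_b": [("d3", 0.8), ("d4", 0.7), ("d1", 0.2)],
}
print(sorted(pool_top_k(example_runs, k=2)))
print(separate_inputs(example_runs, k=2))
```

Pooling gives one larger document set per target, while separate inputs keep the downstream results attributable to a single filtering run.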

Must coordinate choice of KBA target entities with desired content of KBs for Cold Start queries.

KBA Stream Corpus 2012 (or the new Stream Corpus 2013)
• 462M texts, 40% English
• 4,973 hourly chunks of ~10^5 docs/hour
• News, blogs, forums, and link shortening

Output KB

Clusters of related entities and/or event-type entities

KBP

KBA

Page 20: Entity-oriented filtering  of large streams

Sponsors: Thank You.

Diffeo

Page 21: Entity-oriented filtering  of large streams

Thanks for your time.

John R. Frank [email protected]

http://trec-kba.org

Page 22: Entity-oriented filtering  of large streams

Lessons Learned (in progress)

• 97% coref, but 70% rating agreement hard
– one-in-twenty WP citations non-mentioning
– Definition of “citable” varies across entities
– Tension between IR “rating” and NLP “labeling”

• Lost >1 teams from challenges of crunching big data
– Learn from Kaggle: score a run every day
– AWS is really useful.

• KB feature mining & ML beat string matching
– Too much training data?
– Must exercise temporality in the stream… spikes & events.

Page 23: Entity-oriented filtering  of large streams

NLP derives logical structures from messages. Might leverage O(n^2) and more to explore the problem space. Limits corpus size to ~10^6 docs.

IR filters relevant messages from large streams. Limit algorithmic complexity to O(n) and simpler by forcing large corpora, 10^8 docs and higher.
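To make the scale argument concrete, the snippet below counts operations for a pairwise, O(n^2)-style method and a single-pass, O(n) method, reading the slide's corpus sizes as 10^6 and 10^8 documents. The function names and the choice of "all document pairs" as the quadratic workload are assumptions for illustration.

```python
def pairwise_comparisons(n: int) -> int:
    """Work for a quadratic method that compares every pair of documents."""
    return n * (n - 1) // 2


def single_pass(n: int) -> int:
    """Work for a linear method that touches each document once."""
    return n


nlp_corpus = 10**6   # corpus size the slide associates with NLP
ir_corpus = 10**8    # corpus size the slide associates with IR

print(f"pairwise on 10^6 docs: {pairwise_comparisons(nlp_corpus):.1e}")
print(f"single pass on 10^8 docs: {single_pass(ir_corpus):.1e}")
print(f"pairwise on 10^8 docs: {pairwise_comparisons(ir_corpus):.1e}")
```

Growing the corpus by a factor of 100 multiplies the pairwise workload by about 10,000, which is why the larger corpus sizes push methods toward linear or simpler complexity.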

KBP

KBA

Future of Large Knowledge Bases

volume

depth

As NLP and IR converge, relevance concepts are gaining structure and logical inference is spreading across documents.

Page 24: Entity-oriented filtering  of large streams

Users query the tagged text and occasionally create more structure

Traditional Doc DB

pipeline of NLP taggers adds metadata to doc DB

create structure once

docs

one-shot NLP pipeline (traditional approach)

Page 25: Entity-oriented filtering  of large streams

Doc DB

docs

End users and adaptive tagging algorithms access the same content in the DB

Paradigm Shift: persistent, adaptive NLP in the database

Page 26: Entity-oriented filtering  of large streams

Methods in the Madness

[Plot: mean edit interval (days) vs. mean mention interval (hours)]

Page 27: Entity-oriented filtering  of large streams

Has many interests, including trying to take over UK soccer teams.

His empire includes many entities…

Elaborate link trails…

