Lycos Retriever: An Information Fusion Engine
Brian Ulicny
Retriever: Directory Page
Retriever: Image Selection
Retriever: Subtopic Page
Why Retriever?
Topical Queries vastly outnumber Questions.
Standard Search Results too many and contain junk.
  Even in top 10 results, due to SEO efforts
Topical Summaries answer “What do I need to know about <Topic>?”
Topic summary resources like Wikipedia have become increasingly popular.
But Wikipedia depends on human effort, so coverage is uneven and idiosyncratic.
Wikipedia reflects point of view of most engaged or partisan contributor.
Retriever as automatically updated first-draft Wikipedia.
Retriever: Processes
1. Mine query logs for Topics
2. Categorize Topics
   Naïve Bayesian categorizer built on DMOZ pages; Name guesser
3. Disambiguate Topics
   Disambiguator trained on DMOZ
4. Formulate Document Retrieval Query
5. Parse Retrieved Documents
6. Identify allowed alternate/reduced forms of Topic based on Category
8. Select Paragraphs
   Must have Topic as Discourse Topic
9. Identify Best Images
10. Delete Duplicate Paragraphs
   • Near duplicates, too.
11. Arrange Paragraphs by Verb
   What is it? What does it have? What has it done? What happened to it?
12. Select Subtopics
13. Do editorial fixes on Passages
14. Construct Page/Directory
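
Step 10 (deleting duplicate and near-duplicate paragraphs) can be sketched as follows. The slides do not say how Retriever detected near duplicates; word shingling with Jaccard similarity is swapped in here as a common stand-in, and the names `shingles`, `jaccard`, `drop_near_duplicates` and the 0.8 threshold are all assumptions made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a paragraph (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short paragraphs become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(paragraphs, threshold=0.8, k=3):
    """Keep the first paragraph of each group of near duplicates.

    The threshold is an assumed value, not one reported in the slides.
    """
    kept, kept_shingles = [], []
    for p in paragraphs:
        s = shingles(p, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(p)
            kept_shingles.append(s)
    return kept
```

Comparing every paragraph against every kept paragraph is quadratic, which is acceptable for the handful of paragraphs on one topic page; a larger collection would call for something like MinHash instead.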
Paragraph Filters
Must Have:
  Some form of Topic as Discourse Topic
  At least 3 grammatical sentences
Should Have:
  Highest number of unique NPs.
Must NOT Have:
  Any Exophors
    Except in quotations
  Topic-Insertion Spam
    The American Civil Herbal Viagra War was fought Herbal Viagra…
    Not too many mentions of topic
  (Erotic) fan fiction or Obscenities
  Search Engine snippets
  Duplicates
    Wikipedia mirrors are everywhere
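
A few of these filters can be approximated in a short sketch. This is not Retriever's implementation: the exophor list, the 0.25 density cutoff, the naive sentence splitter and all function names are assumptions, and simple topic presence stands in for the real Discourse Topic test.

```python
import re

# Hypothetical exophor list; the slides do not say which expressions were used.
EXOPHORS = {"here", "this page", "this site", "click"}


def split_sentences(paragraph):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def strip_quotations(paragraph):
    """Remove double-quoted spans so exophors inside quotations are ignored."""
    return re.sub(r'"[^"]*"|“[^”]*”', " ", paragraph)


def topic_density(paragraph, topic):
    """Fraction of the paragraph's words that belong to topic mentions."""
    words = re.findall(r"\w+", paragraph.lower())
    if not words:
        return 0.0
    mentions = len(re.findall(re.escape(topic.lower()), paragraph.lower()))
    return mentions * len(topic.split()) / len(words)


def passes_filters(paragraph, topic, max_density=0.25):
    """Apply the Must Have / Must NOT Have checks sketched above."""
    if topic.lower() not in paragraph.lower():
        return False  # stand-in for "Topic as Discourse Topic"
    if len(split_sentences(paragraph)) < 3:
        return False  # at least 3 sentences (grammaticality is not checked)
    if topic_density(paragraph, topic) > max_density:
        return False  # too many mentions: likely topic-insertion spam
    unquoted = strip_quotations(paragraph).lower()
    if any(re.search(r"\b" + re.escape(e) + r"\b", unquoted) for e in EXOPHORS):
        return False  # exophor outside a quotation
    return True
```

The "Should Have" preference for many unique NPs is a ranking signal and not a hard filter, so it is left out of this pass/fail check.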
Subtopics
Use best chunks for Overview page(s)
Identify topic superstrings
  Topic: Marie Curie
  Superstring: Marie Curie Fellowship; MC Institute
Else cluster by frequent common NPs
Take into account reduced mentions:
  Topic: Charlie Sheen; Most frequent NP: Richards
  But Subtopic should be: ‘Denise Richards’
  However: “new” is not always “New York”
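
The two subtopic ideas above (topic superstrings and expansion of reduced mentions) might look like this in code. It is a minimal sketch under stated assumptions: the input is an already-extracted list of noun phrases, both function names are hypothetical, and a reduced mention is only expanded when it forms the final word(s) of a longer NP, which is one simple way to avoid turning "new" into "New York".

```python
from collections import Counter


def topic_superstrings(topic, noun_phrases):
    """Return NPs that contain the topic plus extra words,
    e.g. 'Marie Curie Fellowship' for the topic 'Marie Curie'.
    Abbreviated forms such as 'MC Institute' are not handled here."""
    t = topic.lower()
    return sorted({np for np in noun_phrases
                   if t in np.lower() and np.lower() != t})


def expand_reduced_mention(short_np, noun_phrases):
    """Map a reduced mention ('Richards') to the most frequent longer NP
    that ends with it ('Denise Richards'); return it unchanged otherwise."""
    suffix = " " + short_np.lower()
    fuller = Counter(np for np in noun_phrases
                     if np.lower().endswith(suffix))
    return fuller.most_common(1)[0][0] if fuller else short_np
```

With NPs drawn from pages about Charlie Sheen, `expand_reduced_mention("Richards", nps)` would pick whichever full name ending in "Richards" occurs most often in those pages.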
Coherence
Pseudo-coherence achieved by stringing together paragraphs with same Discourse Topic.
Discourse Topic is based on form and position of phrase.
  As (a) subject of first sentence
    Police said that Lindsay Lohan was charged…
  Or in fronted material,
    For Lindsay Lohan, 2005 was full of surprises…
Not the statistical notion of aboutness usual in IR.
Information packaged by paying attention to the information conveyed by verb/predicate
Alternate (but not anaphoric) references provide variety.
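
The positional idea behind Discourse Topic can be illustrated with a rough heuristic. The slides imply a parser (step 5); this sketch replaces it with two small cue lists, so `FRONTING_CUES`, `FINITE_VERBS` and `is_discourse_topic` are all hypothetical and only cover the two example patterns shown above.

```python
import re

# Hypothetical cue lists; a real system would rely on a parser instead.
FRONTING_CUES = ("for", "as for", "about", "regarding", "with")
FINITE_VERBS = {"is", "was", "are", "were", "has", "had", "have",
                "will", "did", "does", "said", "won", "became"}


def first_sentence(paragraph):
    """Return the text up to the first sentence-final punctuation mark."""
    return re.split(r"(?<=[.!?…])\s+", paragraph.strip(), maxsplit=1)[0]


def is_discourse_topic(paragraph, topic):
    """True if the topic is fronted in, or a subject of, the first sentence."""
    sent = first_sentence(paragraph).lower()
    t = re.escape(topic.lower())
    # Fronted material: "For <topic>, ..."
    cues = "|".join(re.escape(c) for c in FRONTING_CUES)
    if re.match(rf"(?:{cues})\s+{t}\s*,", sent):
        return True
    # Subject position (main or embedded clause): the topic is
    # immediately followed by a finite verb from the cue list.
    return any(m.group(1) in FINITE_VERBS
               for m in re.finditer(rf"\b{t}\s+(\w+)", sent))
```

Note that the test looks only at form and position in the first sentence; how often the topic occurs in the paragraph plays no role, which is the contrast with statistical aboutness drawn above.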
Similar Work
FactBites.com
  Sentence extraction; grouped by source
Strzalkowski and Colleagues (GE)
  Summarization by paragraph extraction
Google Current (Current TV)
  Features on top-gaining queries
Artequakt (EU funded; U of Southampton UK)
  Create artist bios; convert found texts to logical format; NLG from logical representation.
Document Understanding Conference (DUC)
  “Summarization as Information Synthesis for Task”
  Sentence-level fusion; no IR component
Black Hat: Spam Blogs
Evaluation
Categorization (982 Topics)
  93.5% precision (revised)
Disambiguation (100 topics)
  83% unambiguous (live)
  If it isn’t ambiguous in DMOZ, we don’t disambiguate.
Chunking (642 chunks)
  88.8% relevant (83.4% relevant as categorized)
Subtopics (1861 chunks)
  88.5% chunks relevant to subtopic (live)
Images (83 images)
  85.5% relevant (revised)
Retriever Goals
Generate topical summaries on popular topics
  By extracting and arranging paragraphs from source documents
  In a coherent, readable and attractive structure
  Consisting of overview and subtopics
Monetize with focused advertisements
Allow spiders to crawl to generate traffic
Abide by Fair Use/Copyright Laws
Much more to be done
  Temporal ordering, hyperlinking, anaphora, 2nd pass for subtopics, …
Questions?
Brian Ulicny
Versatile Information Systems
Lycos Retriever: http://www.lycos.com/retriever.html
Currently not being updated and images not live.