LBSC 796/INFM 718R: Week 6
Representing the Meaning of Documents
Jimmy Lin
College of Information Studies
University of Maryland
Monday, March 6, 2006
Muddy Points
Binary trees vs. binary search
Document presentation
Algorithm running times: logarithmic, linear, polynomial, exponential
The Central Problem in IR
Information Seeker -> Concepts -> Query Terms
Authors -> Concepts -> Document Terms
Do these represent the same concepts?
Today’s Class
[Diagram: Query -> Representation Function -> Query Representation; Documents -> Representation Function -> Document Representation -> Index; Query Representation and Index -> Comparison Function -> Hits]
Outline
How do we represent the meaning of text?
What are the problems?
What are the possible solutions?
How well do they work?
Why is IR hard?
IR is hard because natural language is so rich (among other reasons)
What are the issues? Encoding, tokenization, morphological variation, synonymy, polysemy, paraphrase, ambiguity, anaphora
Possible Solutions
Vary the unit of indexing: strings and segments, tokens and words, phrases and entities, senses and concepts
Manipulate queries and results: term expansion, post-processing of results
Representing Electronic Texts
A character set specifies the unit of composition
  Characters are the smallest units of text
  Abstract entities, separate from how they are stored
A font specifies the printed representation
  What each character will look like on the page
  Different characters might be depicted identically
An encoding is the electronic representation
  What each character will look like in a file
  One character may have several representations
An input method is a keyboard representation
The Character ‘A’
ASCII = American Standard Code for Information Interchange
7 bits used per character
  Number of representable characters = 128
  Some character codes used for non-visible characters
0 1 0 0 0 0 0 1 = 65 DEC = ‘A’
0 1 0 0 0 0 1 0 = 66 DEC = ‘B’
…
The visible characters:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
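The mapping between characters and their ASCII codes can be checked directly. The short Python sketch below is an illustration only; the helper name to_bits is made up for this example and is not part of the original slides.

```python
def to_bits(ch):
    """Return the code of a single character as an 8-digit binary string."""
    return format(ord(ch), "08b")

for ch in "AB":
    # ord() gives the numeric code; to_bits() shows the same number in binary
    print(ch, ord(ch), to_bits(ch))
```

Every character of a plain English string has a code below 128, which is why 7 bits suffice for ASCII.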
The Latin-1 Character Set
ISO 8859-1: 8-bit characters for Western Europe
  French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
[Character charts: printable characters, 7-bit ASCII; additional defined characters, ISO 8859-1]
What about these languages?
天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
日米連合で台頭中国に対処…アーミテージ前副長官提言
조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 '' 건설안에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부 언론의 보도를 부인했다 .
Tokenization
What’s a word? First try: words are separated by spaces
What about clitics?
What about languages without spaces?
Same problem with speech!
I’m not saying that I don’t want John’s input on this.
The cat on the mat. -> the, cat, on, the, mat
天主教教宗若望保祿二世因感冒再度住進醫院。 -> 天主教 教宗 若望保祿二世 因 感冒 再度 住進 醫院。
Where are the spaces?
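A minimal Python sketch of the "first try" shows where splitting on spaces breaks down. Both function names are hypothetical, and the letter-run tokenizer is just one crude way to handle punctuation, not a recommended design.

```python
import re

def whitespace_tokenize(text):
    """First try: words are whatever is separated by spaces."""
    return text.split()

def letter_run_tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())

print(letter_run_tokenize("The cat on the mat."))
# Clitics are split into fragments such as 'don' and 't'
print(letter_run_tokenize("I don't want John's input on this."))
# Text without spaces comes back as one giant "word"
print(whitespace_tokenize("天主教教宗若望保祿二世因感冒再度住進醫院。"))
```

Neither function handles clitics well, and the whitespace version returns the whole Chinese sentence as a single token.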
Word-Level Issues
Morphological variation
= different forms of the same concept
  Inflectional morphology: same part of speech
    break, broke, broken; sing, sang, sung; etc.
  Derivational morphology: different parts of speech
    destroy, destruction; invent, invention, reinvention; etc.
Synonymy
= different words, same meaning
  {dog, canine, doggy, puppy, etc.} -> concept of dog
Polysemy
= same word, different meanings
  Bank: financial institution or side of a river?
  Crane: bird or construction equipment?
  Is: depends on what the meaning of “is” is!
Paraphrase
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.
Language provides different ways of saying the same thing
Ambiguity
What exactly do you mean?
Why don’t we have problems (most of the time)?
I saw the man on the hill with the telescope.
  Who has the telescope?
Time flies like an arrow.
  Say what?
Visiting relatives can be annoying.
  Who’s visiting?
Ambiguity in Action
Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
Anaphora
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.
Language provides different ways of referring to the same entity
More Anaphora
Terminology
  Anaphor = an expression that refers to another
  Anaphora = the phenomenon
Other types of referring expressions:
  Fujitsu and NEC said they were still investigating, and that knowledge of more such bids could emerge... Other major Japanese computer companies contacted yesterday said they have never made such bids.
  The hotel recently went through a $200 million restoration… original artworks include an impressive collection of Greek statues in the lobby.
Anaphora resolution can be hard!
  The city council denied the demonstrators a permit because…
    they feared violence.
    they advocated violence.
What can we do?
Here are some of the problems:
  Encoding, tokenization
  Morphological variation, synonymy, polysemy
  Paraphrase, ambiguity
  Anaphora
General approaches:
  Vary the unit of indexing
  Manipulate queries and results
The Encoding Problem
天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
日米連合で台頭中国に対処…アーミテージ前副長官提言
조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 '' 건설안에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부 언론의 보도를 부인했다 .
East Asian Character Sets
More than 128 characters are needed!
  Two-byte encoding schemes are used
Several countries have unique character sets
  GB in People’s Republic of China
  BIG5 in Taiwan
  JIS in Japan
  KS in Korea
  TCVN in Vietnam
Many characters appear in several languages
Unicode
Goal is to unify the world’s character sets
  ISO Standard 10646
Limitations:
  Can produce much larger files than Latin-1, depending on the encoding
  Fonts are hard to obtain for many characters
  Some characters have multiple representations, e.g., accents can be part of a character or separate
  Some characters look identical when printed, but they come from unrelated languages
  The sort order may not be appropriate
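Several of these limitations are easy to observe in practice. The Python sketch below uses only the standard unicodedata module; the variable names are chosen for this illustration.

```python
import unicodedata

composed = "\u00e9"      # 'é' stored as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by a separate combining acute accent

# Same printed character, different stored strings
print(composed == decomposed)
# Normalizing to a single form (here NFC) makes them comparable
print(unicodedata.normalize("NFC", decomposed) == composed)

latin_a, cyrillic_a = "A", "\u0410"   # look alike, unrelated scripts
print(latin_a == cyrillic_a)

# File size depends on the encoding chosen for the same character
print(len("A".encode("latin-1")), len("A".encode("utf-16-be")))
```

For indexing, this suggests normalizing both documents and queries to one form before comparing strings.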
What do we index?
In information retrieval, we are after the concepts represented in the documents
… but we can only index strings
So what’s the best unit of indexing?
The Tokenization Problem
In many languages, words are not separated by spaces…
Tokenization = separating a string into “words”
Simple greedy approach:
  Start with a list of every possible term (e.g., from a dictionary)
  Look for the longest word in the unsegmented string
  Take longest matching term as the next word and repeat
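The greedy approach can be sketched in a few lines of Python. This version assumes a left-to-right scan and falls back to a single character when nothing in the dictionary matches; both choices, the function name greedy_segment, and the toy dictionary are assumptions made for this illustration.

```python
def greedy_segment(text, dictionary):
    """Left-to-right longest-match segmentation against a term list."""
    max_len = max(len(term) for term in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the scan keeps moving
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dictionary = {"天主教", "教宗", "若望保祿二世", "因", "感冒", "再度", "住進", "醫院"}
print(greedy_segment("天主教教宗若望保祿二世因感冒再度住進醫院。", toy_dictionary))
```

Greedy matching commits to the longest term even when a shorter one would lead to a better overall split: with the toy English dictionary {the, them, at, mat}, the string "themat" comes out as "them" + "at".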
Probabilistic Segmentation
For an input word: c1 c2 c3 … cn
Try all possible partitions
  [Figure: the string c1 c2 c3 c4 … cn split at different boundaries]
Choose the highest probability partition
  E.g., compute P(c1 c2 c3) using a language model
Challenges: search, probability estimation
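One way to handle the search challenge is dynamic programming over split points. The Python sketch below assumes a unigram language model, so the probability of a partition is the product of its word probabilities. That model choice, the function name segment, and the toy probabilities are assumptions for illustration, not something the slides specify.

```python
import math

def segment(text, word_prob, max_len=10):
    """Return the most probable split of text into known words, or None."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start] > -math.inf:
                score = best[start] + math.log(word_prob[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None                # no partition uses only known words
    words = []
    end = n
    while end > 0:                 # follow back-pointers from the end
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

# Hypothetical probabilities, chosen only to make the example concrete
toy_probs = {"the": 0.3, "cat": 0.2, "on": 0.2, "them": 0.1, "at": 0.1, "mat": 0.1}
print(segment("themat", toy_probs))
```

With these toy numbers "the" + "mat" (0.03) beats "them" + "at" (0.01). Estimating realistic probabilities is the other challenge the slide names.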
Indexing N-Grams
Consider a Chinese document: c1 c2 c3 … cn
Don’t segment (you could be wrong!)
Instead, treat every character bigram as a term
Break up queries the same way
Works at least as well as trying to segment correctly!
c1 c2 c3 c4 c5 … cn
-> (c1 c2) (c2 c3) (c3 c4) (c4 c5) … (cn-1 cn)
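Overlapping character bigrams are simple to generate. In this Python sketch the function names are made up for the example, and the overlap count is only a stand-in for a real scoring function.

```python
def char_bigrams(text):
    """Every pair of adjacent characters, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bigram_overlap(query, document):
    """Count the query bigrams that also occur in the document."""
    doc_terms = set(char_bigrams(document))
    return sum(1 for term in char_bigrams(query) if term in doc_terms)

print(char_bigrams("天主教教宗"))
print(bigram_overlap("教宗", "天主教教宗若望保祿二世"))
```

Because queries are broken up the same way, a query for 教宗 matches the document without any segmentation decision being made.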
Morphological Variation
Handling morphology: related concepts have different forms
  Inflectional morphology: same part of speech
    dogs = dog + PLURAL
    broke = break + PAST
  Derivational morphology: different parts of speech
    destruction = destroy + ion
    researcher = research + er
Different morphological processes: prefixing, suffixing, infixing, reduplication
Stemming
Dealing with morphological variation: index stems instead of words
  Stem: a word equivalence class that preserves the central concept
How much to stem?
  organization -> organize -> organ?
  resubmission -> resubmit/submission -> submit?
  reconstructionism?
Stemmers
Porter stemmer is a commonly used stemmer
  Strips off common affixes
  Not perfect!
    Errors of commission (incorrectly lumps unrelated terms together): doe/doing, execute/executive, ignore/ignorant
    Errors of omission (fails to lump related terms together): create/creation, europe/european, cylinder/cylindrical
Many other stemming algorithms available
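Both kinds of error show up even in a very small suffix stripper. The Python sketch below is a toy written for this illustration; it is not the Porter algorithm, and its suffix list and minimum stem length are arbitrary assumptions.

```python
# Toy rule list, checked in order; NOT the Porter stemmer's rules
SUFFIXES = ["ation", "ing", "ion", "ive", "ed", "es", "s", "e"]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Error of commission: unrelated words end up with the same stem
print(toy_stem("execute"), toy_stem("executive"))
# Error of omission: related words end up with different stems
print(toy_stem("create"), toy_stem("creation"))
```

Adding or reordering rules moves errors from one category to the other rather than removing them, which is one reason stemmers are evaluated experimentally.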
Does Stemming Work?
Generally, yes! (in English) Helps more for longer queries Lots of work done in this area
Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
And others…
Stemming in Other Languages
Arabic makes frequent use of infixes
  maktab (office), kitaab (book), kutub (books), kataba (he wrote), naktubu (we write), etc. -> the root ktb
What’s the most effective stemming strategy in Arabic? Open research question…
Words = wrong indexing unit!
Synonymy
= different words, same meaning
  {dog, canine, doggy, puppy, etc.} -> concept of dog
Polysemy
= same word, different meanings
  Bank: financial institution or side of a river?
  Crane: bird or construction equipment?
It’d be nice if we could index concepts!
  Word sense: a coherent cluster in semantic space
  Indexing word senses achieves the effect of conceptual indexing
Indexing Word Senses
How does indexing word senses solve the synonym/polysemy problem?
  {dog, canine, doggy, puppy, etc.} -> concept 112986
  I deposited my check in the bank. bank -> concept 76529
  I saw the sailboat from the bank. bank -> concept 53107
Okay, so where do we get the word senses?
  WordNet: a lexical database for English (http://wordnet.princeton.edu/)
  Automatically find “clusters” of words that describe the same concepts
  Other methods also have been tried…
Word Sense Disambiguation
Given a word in context, automatically determine the sense (concept)
  This is the Word Sense Disambiguation (WSD) problem
Context is the key:
  For each ambiguous word, note the surrounding words
    bank {river, sailboat, water, etc.} -> side of a river
    bank {check, money, account, etc.} -> financial institution
  “Learn” a classifier from a collection of examples
  Use the classifier to determine the senses of words in the documents
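The role of context can be illustrated with a very small overlap-based classifier in Python. This sketch hard-codes the context words from the slide instead of learning them from examples, so it is a simplification of the approach described; the names SENSE_CONTEXTS and disambiguate are hypothetical.

```python
import re

# Context words per sense of "bank", taken from the slide's example
SENSE_CONTEXTS = {
    "side of a river": {"river", "sailboat", "water"},
    "financial institution": {"check", "money", "account"},
}

def disambiguate(sentence, sense_contexts=SENSE_CONTEXTS):
    """Pick the sense whose context words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # On a tie, max() returns the first sense in the dictionary
    return max(sense_contexts, key=lambda sense: len(words & sense_contexts[sense]))

print(disambiguate("I saw the sailboat from the bank."))
print(disambiguate("I deposited my check in the bank."))
```

A learned classifier would replace the hand-written word sets with statistics gathered from sense-annotated text.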
Does it work?
Nope!
  Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
  Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
  And others…
Examples of limited success…
  Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
  Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of ACL 2000 Workshop on Recent Advances in NLP and IR.
Why Disambiguation Hurts
Bag-of-words techniques already disambiguate
  Context for each term is established in the query
WSD is hard!
  Many words are highly polysemous, e.g., interest
  Granularity of senses is often domain/application specific
WSD tries to improve precision
  But incorrect sense assignments would hurt recall
  Slight gains in precision do not offset large drops in recall
An Alternate Approach
Indexing word senses “freezes” concepts at index time
What if we expanded query terms at query time instead?
Two approaches
  Manual thesaurus, e.g., WordNet, UMLS, etc.
  Automatically-derived thesaurus, e.g., co-occurrence statistics
dog AND cat -> ( dog OR canine ) AND ( cat OR feline )
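Query-time expansion of a Boolean query can be sketched as follows in Python. The thesaurus here is a hand-written dictionary standing in for WordNet or a co-occurrence-derived resource, and expand_query is a name invented for this example.

```python
def expand_query(terms, thesaurus):
    """Turn a conjunctive query into an AND of OR-groups of synonyms."""
    clauses = []
    for term in terms:
        alternatives = [term] + thesaurus.get(term, [])
        if len(alternatives) > 1:
            clauses.append("( " + " OR ".join(alternatives) + " )")
        else:
            clauses.append(term)   # no known synonyms: keep the term as is
    return " AND ".join(clauses)

toy_thesaurus = {"dog": ["canine"], "cat": ["feline"]}   # hypothetical entries
print(expand_query(["dog", "cat"], toy_thesaurus))
```

Because the index itself is unchanged, the expansion can be shown to the user and edited before the query runs.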
Does it work?
Yes… if done “carefully”
User should be involved in the process
  Otherwise, poor choice of terms can hurt performance
Handling Anaphora
Anaphora resolution: finding what the anaphor refers to (called the antecedent)
Most common example: pronominal anaphora resolution
  Simplest method works pretty well: find previous noun phrase matching in gender and number
John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.
He = John Wilkes Booth
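The "previous matching noun phrase" heuristic is sketched below in Python. It assumes noun phrases have already been found and tagged with gender and number, which is the hard part in practice; the tags, the pronoun table, and the function name resolve are all assumptions made for the example.

```python
# Gender and number for a few pronouns; None means any gender is accepted
PRONOUNS = {
    "he": ("male", "singular"),
    "she": ("female", "singular"),
    "it": ("neuter", "singular"),
    "they": (None, "plural"),
}

def resolve(pronoun, preceding_noun_phrases):
    """Return the nearest preceding noun phrase agreeing in gender and number."""
    gender, number = PRONOUNS[pronoun.lower()]
    for phrase, np_gender, np_number in reversed(preceding_noun_phrases):
        if np_number == number and (gender is None or np_gender == gender):
            return phrase
    return None

# Hand-annotated noun phrases from the first sentence, in order of appearance
noun_phrases = [
    ("John Wilkes Booth", "male", "singular"),
    ("history", "neuter", "singular"),
    ("a bullet", "neuter", "singular"),
]
print(resolve("He", noun_phrases))
```

The nearer phrases "a bullet" and "history" are skipped because they do not agree with "He" in gender.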
Expanding Anaphors
When indexing, replace anaphors with their antecedents
Does it work?
  Somewhat
  … but can be computationally expensive
  … helps more if you want to retrieve sub-document segments
Beyond Word-Level Indexing
Words are the wrong unit to index…
Many multi-word combinations identify entities
  Persons: George W. Bush, Dr. Jones
  Organizations: Red Cross, United Way
  Corporations: Hewlett Packard, Kraft Foods
  Locations: Easter Island, New York City
Entities often have finer-grained structures
  Professor Stephen W. Hawking = title + first name + middle initial + last name
  Cambridge, Massachusetts = city + state
Indexing Named Entities
Why would we want to index named entities?
Index named entities as special tokens
And treat special tokens like query terms
Works pretty well for question answering
In reality, at the time of Edison’s [PERSON] 1879 [DATE] patent, the light bulb had been in existence for some five decades ….
Who patented the light bulb? -> patent light bulb PERSON
When was the light bulb patented? -> patent light bulb DATE
John Prager, Eric Brown, and Anni Coden. (2000) Question-Answering by Predictive Annotation. Proceedings of SIGIR 2000.
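The idea of indexing entity types as special tokens can be sketched in Python as follows. The one-name gazetteer and the four-digit rule for dates are toy stand-ins for a real named entity recognizer, and all names in the code are hypothetical; this is not the implementation from the cited paper.

```python
import re

PERSON_NAMES = {"edison"}          # hypothetical one-entry gazetteer
YEAR = re.compile(r"\d{4}")        # crude rule: four digits count as a date

def index_terms(text):
    """Lowercased word tokens plus a special token per recognized entity."""
    terms = set()
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        terms.add(token.lower())
        if token.lower() in PERSON_NAMES:
            terms.add("PERSON")    # uppercase, so it cannot clash with a word
        elif YEAR.fullmatch(token):
            terms.add("DATE")
    return terms

def matches(query_terms, text):
    """True if every query term, special tokens included, is indexed."""
    return set(query_terms) <= index_terms(text)

sentence = ("In reality, at the time of Edison's 1879 patent, the light bulb "
            "had been in existence for some five decades.")
print(matches(["patent", "light", "bulb", "PERSON"], sentence))
print(matches(["patent", "light", "bulb", "DATE"], sentence))
```

A passage that mentions the patent but no person fails the PERSON query, which is how the special token narrows retrieval to likely answers.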
But First…
We have to recognize named entities
Before that, we have to first define a hierarchy
  Influenced by text genres of interest… mostly news
Decent algorithms based on pattern matching
Best algorithms based on supervised learning
  Annotate a corpus identifying entities and types
  “Train” a probabilistic model
  Apply the model to new text
Indexing Phrases
Two types of phrases
  Those that make sense, e.g., “school bus”, “hot dog”
  Those that don’t, e.g., bigrams in Chinese
Treat multi-word tokens as index terms
Three sources of evidence:
  Dictionary lookup
  Linguistic analysis
  Statistical analysis (e.g., co-occurrence)
Known Phrases
Compile a term list that includes phrases
  Technical terminology can be very helpful
Index any phrase that occurs in the list
Most effective in a limited domain
  Otherwise hard to capture most useful phrases
Syntactic Phrases
Parsing = automatically assign structure to a sentence
“Walk” the tree and extract phrases
  Index all noun phrases
  Index subjects and verbs
  Index verbs and objects
  etc.
[Parse tree of “The quick brown fox jumped over the lazy black dog”: Sentence = Noun Phrase (Det Adj Adj Noun) + Verb + Prepositional Phrase (Prep + Noun Phrase (Det Adj Adj Noun))]
Syntactic Variations
What does linguistic analysis buy?
  Coordinations: lung and breast cancer -> lung cancer, breast cancer
  Substitutions: inflammatory sinonasal disease -> inflammatory disease, sinonasal disease
  Permutations: addition of calcium -> calcium addition
Statistical Analysis
Automatically discover phrases based on co-occurrence probabilities
If terms are not independent, they may form a phrase
Use this method to automatically learn a phrase dictionary
P(“kick the bucket”) = P(“kick”) × P(“the”) × P(“bucket”)?
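The independence test can be sketched for word pairs by comparing the observed bigram probability with the product of the unigram probabilities. The Python sketch below scores pairs with pointwise mutual information, which is one common choice and an assumption here; the slides do not name a specific measure, and the tiny corpus is made up.

```python
from collections import Counter
import math

def pmi_scores(tokens):
    """log2 of P(w1 w2) / (P(w1) * P(w2)) for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_joint / p_independent)
    return scores

toy_corpus = "the hot dog stand sold the hot dog to the man".split()
scores = pmi_scores(toy_corpus)
print(scores[("hot", "dog")], scores[("the", "hot")])
```

A score well above zero means the pair occurs together more often than independence would predict, making it a candidate for the phrase dictionary. On a corpus this small the numbers are only illustrative.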
Does Phrasal Indexing Work?
Yes…
But the gains are so small they’re not worth the cost
Primary drawback: too slow!
What about ambiguity?
Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
Indexing Relations
Instead of terms, index syntactic relations between entities in the text
Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
< frogs subject-of eat >
< insects object-of eat >
< animals object-of eat >
< adult modifies frogs >
< small modifies animals >
Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
< alligators subject-of eat >
< kinds object-of animals >
< small modifies animals >
From the relations, it is clear who’s eating whom!
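Matching on relations instead of keywords can be sketched with triples in Python. The relation sets below are copied by hand from the two example passages; producing them automatically would require a parser, which this sketch does not include. The function name matches and the wildcard convention are assumptions.

```python
def matches(pattern, relations):
    """True if some relation agrees with the pattern; None acts as a wildcard."""
    return any(
        all(p is None or p == r for p, r in zip(pattern, relation))
        for relation in relations
    )

frog_passage = {
    ("frogs", "subject-of", "eat"),
    ("insects", "object-of", "eat"),
    ("animals", "object-of", "eat"),
    ("adult", "modifies", "frogs"),
    ("small", "modifies", "animals"),
}
alligator_passage = {
    ("alligators", "subject-of", "eat"),
    ("kinds", "object-of", "animals"),
    ("small", "modifies", "animals"),
}

# "What do frogs eat?" asks for passages where frogs are doing the eating
query = ("frogs", "subject-of", "eat")
print(matches(query, frog_passage), matches(query, alligator_passage))
```

Both passages contain the keywords "frogs" and "eat", but only the first contains the relation the question asks about.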
Are syntactic relations enough?
Consider this example:
  John broke the window.
  The window broke.
  < John subject-of break >
  < window subject-of break >
“John” and “window” are both subjects…
But John is the person doing the breaking (or “agent”), and the window is the thing being broken (or “theme”)
Syntax sometimes isn’t enough… we need semantics (or meaning)!
Semantics, for example, allows us to relate the following two fragments:
  The barbarians destroyed the city…
  The destruction of the city by the barbarians…
  event: destroy
  agent: barbarians
  theme: city
Semantic Roles
Semantic roles are invariant with respect to syntactic expression
The idea:
  Identify semantic roles
  Index “frame structures” with filled slots
  Retrieve answers based on semantic-level matching
Mary loaded the truck with hay.
Hay was loaded onto the truck by Mary.
  event: load
  agent: Mary
  material: hay
  destination: truck
Does it work?
No, not really…
Why not?
  Syntactic and semantic analysis is difficult: errors offset whatever gain is gotten
  As with WSD, these techniques are precision-enhancers… recall usually takes a dive
  It’s slow!
Alternative Approach
Sophisticated linguistic analysis is slow!
  Unnecessary processing can be avoided by query time analysis
Two-stage retrieval
  Use standard document retrieval techniques to fetch a candidate set of documents
  Use passage retrieval techniques to choose a few promising passages (e.g., paragraphs)
  Apply sophisticated linguistic techniques to pinpoint the answer
Passage retrieval
  Find “good” passages within documents
  Key Idea: locate areas where lots of query terms appear close together
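The key idea behind passage retrieval can be sketched as a sliding window that counts distinct query terms. In this Python sketch the window size, the scoring rule, and the function name best_passage are illustrative assumptions rather than a specific published method.

```python
def best_passage(tokens, query_terms, window=10):
    """Return the first window with the most distinct query terms, and its score."""
    query = {term.lower() for term in query_terms}
    best_start, best_score = 0, -1
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        score = len({token.lower() for token in span} & query)
        if score > best_score:
            best_start, best_score = start, score
    return tokens[best_start:best_start + window], best_score

document = ("Alligators live near water . Adult frogs eat mainly insects "
            "and other small animals .").split()
passage, score = best_passage(document, ["frogs", "eat"], window=4)
print(passage, score)
```

Only the passages selected this way would then be handed to the slower linguistic analysis.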
Key Ideas
IR is hard because language is rich and complex (among other reasons)
Two general approaches to the problem
  Attempt to find the best unit of indexing
  Try to fix things at query time
It is hard to predict a priori what techniques work
  Questions must be answered experimentally
Words are really the wrong thing to index
  But there isn’t really a better alternative…