Representing the Meaning of Documents LBSC 796/CMSC 838o Session 2, February 2, 2004 Philip Resnik.

Representing the Meaning of Documents

LBSC 796/CMSC 838o

Session 2, February 2, 2004

Philip Resnik

Agenda

• The structure of interactive IR systems

• Character sets

• Terms as units of meaning
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts

• A few words about the course

What Do We Mean by “Information”?

• How is it different from “data”?
– Information is data in context
• Databases contain data and produce information
• IR systems contain and provide information

• How is it different from “knowledge”?
– Knowledge is a basis for making decisions
• Many “knowledge bases” contain decision rules

What Do We Mean by “Retrieval?”

• Find something that you want
– The information need may or may not be explicit

• Known item search
– Find the class home page

• Answer seeking
– Is Lexington or Louisville the capital of Kentucky?

• Directed exploration
– Who makes videoconferencing systems?

[Figure: Global Internet User Population by language, 2000 and 2005; labeled segments include English and Chinese. Source: Global Reach]

Supporting the Search Process

[Diagram of the search process. Stages: Source Selection, Query Formulation, Search (by the IR System), Ranked List, Selection, Document, Examination, Document, Delivery. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Annotations: Nominate, Choose, Predict.]

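Supporting the Search Process

[Diagram of the search process, repeated with system-side components added. Stages: Source Selection, Query Formulation, Search (by the IR System), Ranked List, Selection, Document, Examination, Document, Delivery. Added components: Acquisition and Collection; Indexing and Index.]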

Representing Electronic Texts

• A character set specifies semantic units
– Characters are the smallest units of meaning
– Abstract entities, separate from their representation

• A font specifies the printed representation
– What each character will look like on the page
– Different characters might be depicted identically

• An encoding is the electronic representation
– What each character will look like in a file
– One character may have several representations

• An input method is a keyboard representation

Agenda

• The structure of interactive IR systems

• Character sets

The character ‘A’

• ASCII encoding: 7 bits used per character
0 1 0 0 0 0 0 1 = 65 (decimal)

• Number of representable characters: 2^7 = 128 distinct characters, including 0 (NUL)

• Some character codes are used for non-visible characters, e.g., 7 = control-G = BEL

• Widely used in the U.S.
– American Standard Code for Information Interchange
– ANSI X3.4-1968

| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |
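As a rough illustration of the table above, the following Python sketch looks up the code for ‘A’ and prints its bit pattern. The helper name ascii_bits is an assumption made for this example and does not come from the lecture.

```python
def ascii_bits(ch: str) -> str:
    """Return the code of a 7-bit ASCII character as an 8-digit bit string."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return format(code, "08b")


print(ord("A"))         # 65
print(ascii_bits("A"))  # 01000001
print(2 ** 7)           # 128 codes fit in 7 bits
print(repr(chr(7)))     # '\x07', the non-visible BEL character
```

The leading bit is always 0 for ASCII, which is why 7 bits are enough.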

Geeky Joke for the Day

• Why do computer geeks confuse Halloween and Christmas?

• Because 31 OCT = 25 DEC!

• 031 OCT = 0*8^2 + 3*8^1 + 1*8^0 (octal)
= 0*10^2 + 2*10^1 + 5*10^0 (decimal) = 25
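The arithmetic behind the joke can be checked with a short sketch. The helper positional_value is a hypothetical name introduced here; Python's built-in int(text, base) performs the same conversion.

```python
def positional_value(digits: str, base: int) -> int:
    """Evaluate a digit string in the given base, most significant digit first."""
    value = 0
    for d in digits:
        value = value * base + int(d)
    return value


print(positional_value("031", 8))   # 0*8^2 + 3*8^1 + 1*8^0 = 25
print(positional_value("025", 10))  # 0*10^2 + 2*10^1 + 5*10^0 = 25
print(int("31", 8) == 25)           # True
```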

The Latin-1 Character Set

• ISO 8859-1: 8-bit characters for Western Europe
– French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English

[Chart: the printable characters of 7-bit ASCII and the additional defined characters of ISO 8859-1]

Other ISO-8859 Character Sets

East Asian Character Sets

• More than 256 characters are needed
– Two-byte encoding schemes (e.g., EUC) are used

• Several countries have unique character sets
– GB in the People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam

• Many characters appear in several languages
– Research Libraries Group developed EACC
• Unified “CJK” character set for USMARC records

Unicode

• Goal is to unify the world’s character sets
– ISO Standard 10646

• Character set and encoding scheme separated
– Full “code space” is used by character codes
• Extends Latin-1
– UTF-7 encoding will pass through email
• Originally designed for 64 printable ASCII characters
– UTF-8 encoding works with disk file systems

Limitations of Unicode

• Can produce much larger files than Latin-1 (up to twice the size with 16-bit encodings)

• Fonts are hard to obtain for many characters

• Some characters have multiple representations
– e.g., accents can be part of a character or separate

• Some characters look identical when printed
– But they come from unrelated languages

• The sort order may not be appropriate
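Several of these limitations can be observed directly. The sketch below uses Python's standard unicodedata module; the sample strings are chosen for illustration only and are not from the lecture.

```python
import unicodedata

# File size: the same text under three encodings
text = "café"
sizes = {enc: len(text.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16-le")}
print(sizes)  # {'latin-1': 4, 'utf-8': 5, 'utf-16-le': 8}

# Multiple representations: accent as part of the character, or separate
composed = "\u00e9"     # e with acute accent as one code point
decomposed = "e\u0301"  # plain e followed by a combining acute accent
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Look-alikes: three capital letters that are usually drawn identically
for ch in ("\u0041", "\u0410", "\u0391"):
    print(unicodedata.name(ch))

# Sort order: plain code point order puts the accented word last
print(sorted(["zebra", "école", "apple"]))  # ['apple', 'zebra', 'école']
```

Normalizing text to one form (NFC here) before indexing is one common way to make the two representations of an accented character match.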

Agenda

• Character sets

• Terms as units of meaning
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts

Strings and Segments

• Retrieval is (often) a search for concepts
– But what we index are character strings

• What strings best represent concepts?
– In English, words are often a good choice
• But well-chosen phrases can be even better
– In German, compounds may need to be split
• Otherwise queries using constituent words would fail
– In Chinese, word boundaries are not marked
• Thissegmentationproblemissimilartothatofspeech
• This segmentation problem is similar to that of speech

Longest Substring Segmentation

• A greedy segmentation algorithm
– Based solely on lexical information

• Start with a list of every possible term
– Dictionaries are a handy source for term lists

• For each unsegmented string
– Remove the longest single substring in the list
– Repeat until no substrings are found in the list

• Can be extended to explore alternatives

Longest Substring Example

• Possible German compound term:
– washington

• List of German words:
– ach, hin, hing, sei, ton, was, wasch

• Longest substring segmentation
– was-hing-ton

• A language model might see this as bad
– Roughly translates to “What tone is attached?”
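The greedy procedure and the “washington” example can be sketched as follows. This is one possible reading of the algorithm, assuming that the text on either side of a removed term is segmented recursively and that unmatched leftovers are kept as they are; the function name is hypothetical.

```python
def longest_substring_segment(text, terms):
    """Greedily split text around the longest listed term it contains."""
    if not text:
        return []
    # Terms from the list that occur somewhere in the text
    found = [t for t in terms if t and t in text]
    if not found:
        return [text]  # nothing in the list matches; keep the residue as is
    # Longest term wins; sorting first makes ties deterministic
    best = max(sorted(found), key=len)
    start = text.index(best)
    left, right = text[:start], text[start + len(best):]
    return (longest_substring_segment(left, terms)
            + [best]
            + longest_substring_segment(right, terms))


german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(longest_substring_segment("washington", german_words)))
# was-hing-ton
```

Here “hing” is removed first because it is the longest listed word found in the string, which leaves “was” and “ton” to be matched on the next pass.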

Probabilistic Segmentation

• For an input word c1 c2 c3 … cn

• Try all possible partitions into w1 w2 w3 …
– c1 | c2 c3 … cn
– c1 c2 | c3 … cn
– c1 | c2 | c3 … cn etc.

• Choose the highest probability partition
– E.g., compute Pr(w1 w2 w3) using a language model

• Challenges: search, probability estimation
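A minimal sketch of this idea follows, assuming a unigram language model (each segment scored independently) and dynamic programming to avoid enumerating every partition explicitly. The probabilities in HYPOTHETICAL_PROBS are invented for illustration, and the per-character penalty for unlisted pieces is also an assumption.

```python
import math
from functools import lru_cache

# Invented unigram probabilities, for illustration only
HYPOTHETICAL_PROBS = {"was": 0.02, "hing": 0.0001, "ton": 0.001,
                      "washington": 0.0005}


def best_partition(word, unigram_prob, unknown_prob=1e-8):
    """Return (log probability, segments) of the most probable partition."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return 0.0, ()
        best = (-math.inf, ())
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in unigram_prob:
                log_p = math.log(unigram_prob[piece])
            else:
                # Unlisted pieces pay a penalty for every character
                log_p = len(piece) * math.log(unknown_prob)
            rest_score, rest_segments = solve(end)
            if log_p + rest_score > best[0]:
                best = (log_p + rest_score, (piece,) + rest_segments)
        return best

    score, segments = solve(0)
    return score, list(segments)


print(best_partition("washington", HYPOTHETICAL_PROBS)[1])
# ['washington']
```

With these made-up numbers the unsplit word is more probable than was-hing-ton, so the model avoids the bad greedy segmentation. Both challenges named on the slide show up here: the search needs memoization to stay tractable, and the result depends entirely on how the probabilities are estimated.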

Non-Segmentation: N-gram Indexing

• Consider a Chinese document c1 c2 c3 … cn

• Don’t segment (you could be wrong!)

• Instead, treat every character bigram as a term
– c1 c2, c2 c3, c3 c4, …, cn-1 cn

• Break up queries the same way
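Overlapping character bigrams are simple to generate. In the sketch below the sample document and query are illustrative choices, and char_bigrams is a hypothetical helper name.

```python
def char_bigrams(text):
    """Return the overlapping character bigrams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


document = "信息检索系统"  # "information retrieval system"
query = "检索"             # "retrieval"

doc_terms = set(char_bigrams(document))
print(char_bigrams(document))  # ['信息', '息检', '检索', '索系', '系统']
print(all(term in doc_terms for term in char_bigrams(query)))  # True
```

Because the query is broken up the same way, it matches without any decision about where the word boundaries are. The cost is that bigrams spanning a boundary, such as 息检, are indexed as well.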

Tokens and Words

• What is a word?
– Kindergarten
– Aux armes!
– Doug’s running
– Realistic review resubmit

• Morphology:
– How morphemes combine to make words
– Morphemes are units of meaning
– Remember antidisestablishmentarianism?
– Anti (disestablishmentarian) ism

Morphemes and Roots

• Inflectional morphology
– Preserves part of speech
– Destructions = Destruction+PLURAL
– Destroyed = Destroy+PAST

• Derivational morphology
– Relates parts of speech
– Destructor = AGENTIVE(destroy)

• Can help IR performance, but expensive

• Getting derivational morphology right is hard
– {peninsula,insulate}:insula (Lat. “island”) ???

Stemming

• Stem: in IR, a word equivalence class that preserves the main concept.
– Often obtained by affix-stripping (Porter, 1980)
– {destroy, destroyed, destruction}: destr

• Inexpensive to compute

• Usually helps IR performance

• Can make mistakes! (over-/understemming)
– {centennial,century,center}: cent
– {acquire,acquiring,acquired}: acquir, but {acquisition}: acquis
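To make affix-stripping concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm; the suffix list and the minimum stem length are assumptions chosen so that the behavior is easy to follow.

```python
from collections import defaultdict

# Hypothetical suffix list, tried in this order (longest first)
SUFFIXES = ("ition", "ing", "ed", "es", "e", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def stem_classes(words):
    """Group words into equivalence classes keyed by their toy stem."""
    classes = defaultdict(list)
    for w in words:
        classes[toy_stem(w)].append(w)
    return dict(classes)


print(stem_classes(["acquire", "acquiring", "acquired", "acquisition"]))
# {'acquir': ['acquire', 'acquiring', 'acquired'], 'acquis': ['acquisition']}
print(stem_classes(["caring", "cars"]))
# {'car': ['caring', 'cars']}
```

The first call shows understemming (acquisition lands in its own class), and the second shows overstemming (caring and cars are merged although they are unrelated).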

Roots and Stems: beyond English

• Arabic: alselam
– Stem: selam
– Root: SLM (peace)

• Semantic families: altaliban
– Stem: taliban (student)
– Root: TLB (question)

• Current research on best level of analysis

Phrases and Entities

• Multi-word combinations identify entities
– The president, Dubya, George W. Bush

• Can also identify relationships of interest
– Derek Jones, CEO of SadAndBankrupt.com,…
– Entity roles, filling slots in templates

Named Entity Identification

• Major categories of named entities
– Influenced by text genres of interest… mostly news
– Person, organization, location, date, money, …

• Decent algorithms based on finite automata

• Best algorithms based on supervised learning
– Annotate a corpus identifying entities and types
– Train a probabilistic model
– Apply the model to new text

Example: Predictive Annotation for Question Answering

Text: “In reality, at the time of Edison’s 1879 patent, the light bulb had been in existence for some five decades ….”

Entity annotations in the text: PERSON, TIME

Questions and the queries derived from them:
– Who patented the light bulb? → patent light bulb PERSON
– When was the light bulb patented? → patent light bulb TIME

Other questions:
– What did Thomas Edison patent?
– In what year was the light bulb patented?

General Phrase Identification

• Two types of phrases
– Compositional: meaning derived from parts
– Noncompositional: idiomatic expressions
• e.g., “kick the bucket” or “buy the farm”

• Three sources of evidence
– Dictionary lookup
– Parsing
– Co-occurrence

Known Phrases

• Same idea as longest substring match
– But look for word (not character) sequences

• Compile a term list that includes phrases
– Technical terminology can be very helpful

• Index any phrase that occurs in the list

• Most effective in a limited domain
– Otherwise hard to capture most useful phrases
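One simple way to apply a phrase list is a left-to-right scan that prefers the longest listed phrase at each position. This is a variant of longest-match chosen for brevity, not necessarily the procedure used in any particular system; the phrase list and sentence are illustrative.

```python
def index_known_phrases(tokens, phrase_list):
    """Emit the longest listed phrase at each position, else the single word."""
    phrases = {tuple(p.lower().split()) for p in phrase_list}
    max_len = max((len(p) for p in phrases), default=1)
    terms, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, down to two-word phrases
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrases:
                match = candidate
                break
        if match:
            terms.append(" ".join(match))
            i += len(match)
        else:
            terms.append(tokens[i].lower())
            i += 1
    return terms


phrase_list = ["lung cancer", "inflammatory disease",
               "inflammatory sinonasal disease"]
sentence = "Treatment of inflammatory sinonasal disease and lung cancer"
print(index_known_phrases(sentence.split(), phrase_list))
# ['treatment', 'of', 'inflammatory sinonasal disease', 'and', 'lung cancer']
```

This sketch emits only the phrase when one matches; a real index would typically also keep the constituent words.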

Syntactic Phrases

• Automatically construct sentence diagrams
– Fairly good parsers are available

• Index the noun phrases
– Assumes that queries will focus on objects

[Parse diagram of the sentence “The quick brown fox jumped over the lazy dog’s back”: part-of-speech tags (Det, Adj, Noun, Verb) on the words, with Noun Phrase, Prepositional Phrase, and Sentence nodes above them]

Syntactic Variations

• The “paraphrase problem”
– Prof. Douglas Oard studies information access patterns.
– Doug studies patterns of user access to different kinds of information.

• Transformational variants (Jacquemin)
– Coordinations
• lung and breast cancer → lung cancer
– Substitutions
• inflammatory sinonasal disease → inflammatory disease
– Permutations
• addition of calcium → calcium addition

Phrase Discovery: Collocations

• Compute observed occurrence probability
– For each single word and each word n-gram
• “buy” 10 times in 1000 words yields 0.01
• “the” 100 times in 1000 words yields 0.10
• “farm” 5 times in 1000 words yields 0.005
• “buy the farm” 4 times in 1000 words yields 0.004

• Compute n-gram probability if truly independent
– 0.01*0.10*0.005=0.000005

• Compare with observed probability
– Record phrases that occur more often than expected
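The comparison can be worked through with the counts from this slide. The function name independence_ratio is hypothetical, and using the plain ratio of observed to expected probability as the score is an assumption; other association measures are possible.

```python
from math import prod


def independence_ratio(ngram_count, word_counts, total_words):
    """Return observed probability, expected probability, and their ratio."""
    observed = ngram_count / total_words
    # Expected probability if the words occurred independently of each other
    expected = prod(count / total_words for count in word_counts)
    return observed, expected, observed / expected


# "buy the farm": 4 occurrences; buy = 10, the = 100, farm = 5; 1000 words
observed, expected, ratio = independence_ratio(4, [10, 100, 5], 1000)
print(observed)      # 0.004
print(round(ratio))  # 800
```

The phrase occurs about 800 times more often than independence would predict (0.004 against 0.000005), so it would be recorded as a candidate phrase.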

Phrase Indexing Lessons

• Poorly chosen phrases hurt effectiveness
– And some techniques can be slow (e.g., parsing)

• Better to index phrases and words
– Want to find constituents of compositional phrases

• Better weighting schemes → less benefit from phrases
– Negligible improvement in some TREC systems

• Very helpful for cross-language retrieval
– Noncompositional translation, reduced ambiguity

Cross-Language IR and Phrases

• Poser: quite ambiguous (Langenscheidt)
– Place, put (a question, a motion)
– Lay down (a principle)
– Hang (curtains)
– Set (a problem)

• Poser une question: meaning is clear!
– Ask a question

• In this case, better to use the phrase

• But is this really about phrases?

Senses and Concepts

• What is a word sense?
– Entry in a dictionary or thesaurus
– Position or cluster in a semantic space

• What is word sense disambiguation?
– Identifying intended sense(s) from context

• Goal for IR
– Match on the intended concept, not just the words

Problems With Word Matching

• Word matching suffers from two problems
– Synonymy: paper vs. article
– Homonymy: bank (river) vs. bank (financial)

• Disambiguation in IR: seek to resolve homonymy
– Index word senses rather than words

• Synonymy usually addressed by
– Thesaurus-based query expansion
– Latent semantic indexing

Word Sense Disambiguation

• Context provides clues to word meaning
– “The doctor removed the appendix.”

• For each occurrence, note surrounding words
– Typically +/- 5 non-stopwords

• Group similar contexts into clusters
– Based on overlaps in the words that they contain

• Separate clusters represent different senses

Disambiguation Example

• Consider four example sentences
– The doctor removed the appendix
– The appendix was incomprehensible
– The doctor examined the appendix
– The appendix was removed

• What clusters can you find?

• Can you find enough word senses this way?

• Might you find too many word senses?
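One way to explore these questions is to run a very simple context clustering over the four sentences. The sketch below assumes a tiny stopword list and single-link grouping (any shared context word joins two occurrences); both are simplifications chosen for illustration, and the function names are hypothetical.

```python
STOPWORDS = {"the", "was", "a", "an", "of"}  # tiny illustrative list


def context(sentence, target, window=5):
    """Return the non-stopwords within +/- window positions of the target."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    content = [w for w in words if w not in STOPWORDS]
    pos = content.index(target)
    nearby = content[max(0, pos - window):pos] + content[pos + 1:pos + 1 + window]
    return set(nearby)


def cluster_contexts(contexts):
    """Group occurrences whose contexts share at least one word (single link)."""
    clusters = []  # each entry: (set of occurrence indices, set of words)
    for i, ctx in enumerate(contexts):
        merged_ids, merged_words = {i}, set(ctx)
        remaining = []
        for ids, words in clusters:
            if words & merged_words:
                merged_ids |= ids
                merged_words |= words
            else:
                remaining.append((ids, words))
        clusters = remaining + [(merged_ids, merged_words)]
    return sorted(sorted(ids) for ids, _ in clusters)


sentences = [
    "The doctor removed the appendix",
    "The appendix was incomprehensible",
    "The doctor examined the appendix",
    "The appendix was removed",
]
contexts = [context(s, "appendix") for s in sentences]
print(cluster_contexts(contexts))  # [[0, 2, 3], [1]]
```

Under these assumptions the first, third, and fourth sentences fall into one cluster through the shared words “doctor” and “removed,” while the second stands alone. The fourth sentence is grouped with the anatomical uses only because of “removed,” which hints at how fragile overlap-based clustering can be.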

Why Disambiguation Hurts

• Bag-of-words techniques already disambiguate
– When more words are present, documents rank higher
– So a context for each term is established in the query

• Formal disambiguation tries to improve precision
– But incorrect sense assignments would hurt recall
– Hard to distinguish homonymy from fine-grained polysemy

• Average precision balances recall and precision
– But the possible precision gains are small
– And current techniques substantially hurt recall

Where Could Disambiguation Help?

• Categorization of whole documents
– Identifying location(s) in a topic hierarchy

• Visualization
– People are good at seeing signal amidst noise

• Probabilistic models
– Combining different sources of evidence
– (Requires n-best rather than 1-best responses)

Summary

• The goal is to index the right meaning units

• Start by finding fundamental features
– Characters or shape codes (for OCR) etc.

• Combine them into easily recognized units
– Words where possible, character n-grams otherwise
– Consider alternatives to splitting or forming phrases
• But stemming is generally a good idea

• Usually best to match those units directly
– Disambiguation strategies hurt more than they help

Agenda

• Character sets

• A few words about the course

Course Goals

• Appreciate IR system capabilities and limitations

• Understand IR system design & implementation
– For a broad range of applications and media

• Evaluate IR system performance

• Identify current IR research problems

Course Design

• Text/readings provide background and detail
– At least one recommended reading is required

• Class provides organization and direction
– We will not cover every important detail

• Assignments and project provide experience
– The TA can help CLIS students with the project

• Final exam helps focus your effort

Grading

• Assignments (15%)
– Mastery of concepts and experience using tools
– 796: “homework,” 828o: “programming”

• Term project (796: 50%, 828o: 30%)
– Options are described on course Web page

• Final exam (796: 35%, 828o: 55%)
– Two different in-class exams

Handy Things to Know

• Classes will be videotaped
– Available in the CLIS library if you miss class

• Office hours are by appointment
– Send an email, or ask after class

• Everything is on the Web
– At http://www.glue.umd.edu/~oard/teaching.html

• Doug is most easily reached by email
– oard@umd.edu

Some Things to Do This Week

• At least skim the readings before class
– Don’t fall behind!

• Look at assignment 1
– Due in 2 weeks!

• Explore the Web site
– Start thinking about the term project