Representing the Meaning of Documents LBSC 796/CMSC 838o Session 2, February 2, 2004 Philip Resnik.

Page 1: Representing the Meaning of Documents LBSC 796/CMSC 838o Session 2, February 2, 2004 Philip Resnik.

Representing the Meaning of Documents

LBSC 796/CMSC 838o

Session 2, February 2, 2004

Philip Resnik

Page 2

Agenda

• The structure of interactive IR systems

• Character sets

• Terms as units of meaning
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts

• A few words about the course

Page 3

What Do We Mean by “Information?”

• How is it different from “data”?
– Information is data in context

• Databases contain data and produce information

• IR systems contain and provide information

• How is it different from “knowledge”?
– Knowledge is a basis for making decisions

• Many “knowledge bases” contain decision rules

Page 4

What Do We Mean by “Retrieval?”

• Find something that you want
– The information need may or may not be explicit

• Known item search
– Find the class home page

• Answer seeking
– Is Lexington or Louisville the capital of Kentucky?

• Directed exploration
– Who makes videoconferencing systems?

Page 5

[Figure: Global Internet User Population, 2000 and 2005; English and Chinese are labeled. Source: Global Reach]

Page 6

Supporting the Search Process

[Figure: flow diagram of the search process. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate products: Query, Ranked List, Document. Feedback paths: Query Reformulation and Relevance Feedback, Source Reselection. Other labels: IR System, Nominate, Choose, Predict.]

Page 7

Supporting the Search Process

[Figure: flow diagram of the search process. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate products: Query, Ranked List, Document. Other labels: IR System, Indexing, Index, Acquisition, Collection.]

Page 8

Representing Electronic Texts

• A character set specifies semantic units
– Characters are the smallest units of meaning
– Abstract entities, separate from their representation

• A font specifies the printed representation
– What each character will look like on the page
– Different characters might be depicted identically

• An encoding is the electronic representation
– What each character will look like in a file
– One character may have several representations

• An input method is a keyboard representation
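As a rough illustration of the difference between a character and its encoding, the following Python sketch (not part of the original slides) encodes one abstract character under two encodings. The choice of "é" and of Latin-1 and UTF-8 is an assumption made for the example.

```python
# One abstract character, two different byte-level representations.
char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = char.encode("latin-1")  # a single byte in Latin-1
utf8_bytes = char.encode("utf-8")      # two bytes in UTF-8

print(latin1_bytes.hex(), utf8_bytes.hex())

# Reading UTF-8 bytes as if they were Latin-1 silently yields other characters.
misread = utf8_bytes.decode("latin-1")
print(len(char), len(misread))
```

The last two lines show why an IR system needs to know the encoding of each file before indexing it: the same bytes decode to different characters under different encodings.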

Page 9

Agenda

• The structure of interactive IR systems

• Character sets

• Terms as units of meaning
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts

• A few words about the course

Page 10

The character ‘A’

• ASCII encoding: 7 bits used per character

0 1 0 0 0 0 0 1 = 65 DEC (decimal)

• Number of representable characters: 2^7 = 128 distinct characters including 0 (NUL)

• Some character codes used for non-visible characters, e.g. 7 = control-G = BEL
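The numbers on this slide can be checked with a few lines of Python. This is only a sketch added for illustration; the variable names are ours.

```python
# Code point of 'A' and its bit pattern, padded to one byte.
code = ord("A")
bits = format(code, "08b")
print(code, bits)

# Seven bits give 2**7 distinct codes, numbered 0 (NUL) through 127 (DEL).
num_codes = 2 ** 7
print(num_codes)

# Code 7 is the non-visible BEL control character (control-G).
bel = chr(7)
print(repr(bel))
```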

Page 11

ASCII

• Widely used in the U.S.
– American Standard Code for Information Interchange
– ANSI X3.4-1968

| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |

Page 12

Geeky Joke for the Day

• Why do computer geeks confuse Halloween and Christmas?

• Because 31 OCT = 25 DEC!

• 031 OCT = 0*8^2 + 3*8^1 + 1*8^0 (octal)
= 0*10^2 + 2*10^1 + 5*10^0 (decimal)
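The base conversion behind the joke can be verified directly. This is a small sketch added for illustration, not part of the original slides.

```python
# Interpret the digits "31" in base 8, then expand the positional sum by hand.
octal_value = int("31", 8)
by_hand = 0 * 8**2 + 3 * 8**1 + 1 * 8**0
print(octal_value, by_hand)

# Going the other way: decimal 25 written in octal.
back_to_octal = oct(25)
print(back_to_octal)
```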

Page 13

The Latin-1 Character Set

• ISO 8859-1 8-bit characters for Western Europe
– French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English

[Figure: two character tables, titled “Printable Characters, 7-bit ASCII” and “Additional Defined Characters, ISO 8859-1”]

Page 14

Other ISO-8859 Character Sets

[Figure: character tables labeled -2, -3, -4, -5, -6, -7, -8, and -9, i.e., ISO 8859-2 through ISO 8859-9]

Page 15

East Asian Character Sets

• More than 256 characters are needed
– Two-byte encoding schemes (e.g., EUC) are used

• Several countries have unique character sets
– GB in the People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam

• Many characters appear in several languages
– Research Libraries Group developed EACC
• Unified “CJK” character set for USMARC records

Page 16

Unicode

• Goal is to unify the world’s character sets
– ISO Standard 10646

• Character set and encoding scheme separated
– Full “code space” is used by character codes
• Extends Latin-1
– UTF-7 encoding will pass through email
• Originally designed for 64 printable ASCII characters
– UTF-8 encoding works with disk file systems

Page 17

Limitations of Unicode

• Produces much larger files than Latin-1

• Fonts are hard to obtain for many characters

• Some characters have multiple representations
– e.g., accents can be part of a character or separate

• Some characters look identical when printed
– But they come from unrelated languages

• The sort order may not be appropriate
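Several of these limitations can be observed with Python's standard `unicodedata` module. The sketch below is an illustration added here, not part of the original slides; the sample strings are our own choices.

```python
import unicodedata

# Multiple representations: precomposed "é" versus "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(equal_raw, equal_nfc)

# Look-alikes: three capital letters that are usually drawn identically.
lookalikes = ["A", "\u0391", "\u0410"]
names = [unicodedata.name(c) for c in lookalikes]
print(names)

# File size: the same six-character word under three encodings.
word = "r\u00e9sum\u00e9"
sizes = {enc: len(word.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16-le")}
print(sizes)
```

For retrieval, the practical consequence is that documents and queries should be normalized to the same form (NFC is used here as one option) before terms are compared.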

Page 18

Agenda

• The structure of interactive IR systems

• Character sets

• Terms as units of meaning
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts

• A few words about the course

Page 19

Strings and Segments

• Retrieval is (often) a search for concepts
– But what we index are character strings

• What strings best represent concepts?
– In English, words are often a good choice
• But well chosen phrases can be even better
– In German, compounds may need to be split
• Otherwise queries using constituent words would fail
– In Chinese, word boundaries are not marked
• Thissegmentationproblemissimilartothatofspeech

• This segmentation problem is similar to that of speech

Page 20

Longest Substring Segmentation

• A greedy segmentation algorithm
– Based solely on lexical information

• Start with a list of every possible term
– Dictionaries are a handy source for term lists

• For each unsegmented string
– Remove the longest single substring in the list
– Repeat until no substrings are found in the list

• Can be extended to explore alternatives

Page 21

Longest Substring Example

• Possible German compound term:
– washington

• List of German words:
– ach, hin, hing, sei, ton, was, wasch

• Longest substring segmentation
– was-hing-ton

• A language model might see this as bad
– Roughly translates to “What tone is attached?”
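The greedy procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecturer's implementation: the function name is ours, ties between equally long terms are broken by leftmost position, and leftover text that matches no term is kept as a single piece.

```python
def longest_substring_segment(text, terms):
    """Greedily split text around the longest listed term it contains."""
    if not text:
        return []
    # Candidates are the listed terms that occur somewhere in the text.
    candidates = [t for t in terms if t and t in text]
    if not candidates:
        return [text]  # nothing recognizable: keep the leftover as one piece
    # Longest term first; ties go to the leftmost occurrence (an assumption).
    best = max(candidates, key=lambda t: (len(t), -text.find(t)))
    start = text.find(best)
    left, right = text[:start], text[start + len(best):]
    # Remove the chosen term and repeat on what remains on each side.
    return (longest_substring_segment(left, terms)
            + [best]
            + longest_substring_segment(right, terms))


german_terms = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
segments = longest_substring_segment("washington", german_terms)
print("-".join(segments))
```

On the slide's word list the sketch picks "hing" first because it is the longest listed substring, which is exactly how the greedy approach arrives at the implausible split.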

Page 22

Probabilistic Segmentation

• For an input word c1 c2 c3 … cn

• Try all possible partitions into w1 w2 w3 …
– c1 | c2 c3 … cn
– c1 c2 | c3 … cn
– c1 c2 c3 | … cn etc.

• Choose the highest probability partition
– E.g., compute Pr(w1 w2 w3) using a language model

• Challenges: search, probability estimation
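One common way to address both challenges is sketched below, under loudly stated assumptions: a unigram language model (words treated as independent), invented toy probabilities, and an assumed per-character penalty for unknown strings. Dynamic programming is used so that the 2^(n-1) possible partitions of an n-character string are never enumerated explicitly. The function and parameter names are hypothetical.

```python
import math


def best_segmentation(text, word_prob, max_len=20, unknown_char_prob=1e-8):
    """Return the most probable split of text under a unigram model."""
    # best[i] holds (log probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob:
                log_p = math.log(word_prob[word])
            else:
                # Assumed penalty: unknown strings cost a tiny probability per character.
                log_p = len(word) * math.log(unknown_char_prob)
            score = best[start][0] + log_p
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(text)][1]


# Invented probabilities, for illustration only.
toy_probs = {"the": 0.05, "table": 0.001, "tabled": 0.00001,
             "down": 0.002, "own": 0.001, "there": 0.003}
result = best_segmentation("thetabledownthere", toy_probs)
print(result)
```

With these made-up numbers the split the / table / down / there scores higher than the / tabled / own / there, because "tabled" is assigned a much lower probability than "table". A real system would estimate the probabilities from a corpus.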

Page 23

Non-Segmentation: N-gram Indexing

• Consider a Chinese document c1 c2 c3 … cn

• Don’t segment (you could be wrong!)

• Instead, treat every character bigram as a term
– c1 c2, c2 c3, c3 c4, …, cn-1 cn

• Break up queries the same way
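A minimal sketch of overlapping character bigrams, applied identically to a document and a query. The sample strings (Chinese for "information retrieval system" and "retrieval") and the function name are ours, chosen for illustration.

```python
def char_bigrams(text):
    """Return every overlapping two-character substring of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


document = "信息检索系统"  # "information retrieval system"
query = "检索"            # "retrieval"

doc_terms = char_bigrams(document)
query_terms = char_bigrams(query)

# A query bigram matches if the same bigram was indexed for the document.
matches = [t for t in query_terms if t in set(doc_terms)]
print(doc_terms, matches)
```

No dictionary or segmenter is involved, which is the point of the approach: a wrong segmentation decision can never hide a match, at the cost of indexing some bigrams that straddle word boundaries.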

Page 24

Tokens and Words

• What is a word?
– Kindergarten
– Aux armes!
– Doug’s running
– Realistic review resubmit

• Morphology:
– How morphemes combine to make words
– Morphemes are units of meaning
– Remember antidisestablishmentarianism?
– Anti (disestablishmentarian) ism

Page 25

Morphemes and Roots

• Inflectional morphology
– Preserves part of speech
– Destructions = Destruction+PLURAL
– Destroyed = Destroy+PAST

• Derivational morphology
– Relates parts of speech
– Destructor = AGENTIVE(destroy)

• Can help IR performance, but expensive

• Getting derivational morphology right is hard
– {peninsula, insulate}: insula (Lat. “island”) ???

Page 26

Stemming

• Stem: in IR, a word equivalence class that preserves the main concept.
– Often obtained by affix-stripping (Porter, 1980)
– {destroy, destroyed, destruction}: destr

• Inexpensive to compute

• Usually helps IR performance

• Can make mistakes! (over-/understemming)
– {centennial, century, center}: cent
– {acquire, acquiring, acquired}: acquir; {acquisition}: acquis
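To make affix-stripping concrete, here is a deliberately naive suffix stripper. It is a toy written for this illustration and is not the Porter algorithm; the suffix list, the minimum stem length, and the function name are all assumptions.

```python
# Toy suffix list, tried in order (longer suffixes first). Not Porter's rules.
SUFFIXES = ["ition", "ing", "ed", "es", "e", "s"]


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


# Group words into equivalence classes keyed by their stem.
classes = {}
for w in ["acquire", "acquiring", "acquired", "acquisition"]:
    classes.setdefault(toy_stem(w), []).append(w)
print(classes)
```

Even this toy reproduces the understemming case on the slide: the three verb forms share the stem "acquir", while "acquisition" lands in a separate class under "acquis".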

Page 27

Roots and Stems: beyond English

• Arabic: alselam
– Stem: selam
– Root: SLM (peace)

• Semantic families: altaliban
– Stem: taliban (student)
– Root: TLB (question)

• Current research on best level of analysis

Page 28

Phrases and Entities

• Multi-word combinations identify entities
– The president, Dubya, George W. Bush

• Can also identify relationships of interest
– Derek Jones, CEO of SadAndBankrupt.com,…
– Entity roles, filling slots in templates

Page 29

Named Entity Identification

• Major categories of named entities
– Influenced by text genres of interest… mostly news
– Person, organization, location, date, money, …

• Decent algorithms based on finite automata

• Best algorithms based on supervised learning
– Annotate a corpus identifying entities and types
– Train a probabilistic model
– Apply the model to new text
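Regular expressions are one convenient way to write down finite automata, so the "decent" approach can be sketched with Python's `re` module. The two patterns below (dollar amounts and four-digit years) and the sample sentence are invented for illustration and cover only a small fraction of real cases.

```python
import re

# Hypothetical patterns; each one compiles to a finite-state matcher.
PATTERNS = {
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "YEAR": re.compile(r"\b(?:1[5-9]|20)\d{2}\b"),
}


def tag_entities(text):
    """Return (matched string, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(string, label) for _, string, label in sorted(found)]


# Made-up sentence, used only to exercise the patterns.
tags = tag_entities("In 1879 the lamp sold for $2.50.")
print(tags)
```

Hand-written patterns like these work for regular categories such as dates and money; persons, organizations, and locations are where the supervised approach on the slide tends to be needed.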

Page 30

Example: Predictive Annotation for Question Answering

In reality, at the time of Edison’s 1879 patent, the light bulb had been in existence for some five decades ….

[Annotation labels shown on the passage: TIME, PERSON]

• Who patented the light bulb?
• When was the light bulb patented?
• What did Thomas Edison patent?

Queries: patent light bulb PERSON; patent light bulb TIME; ???

• In what year was the light bulb patented?

Page 31

General Phrase Identification

• Two types of phrases
– Compositional: meaning derived from parts
– Noncompositional: idiomatic expressions
• e.g., “kick the bucket” or “buy the farm”

• Three sources of evidence
– Dictionary lookup
– Parsing
– Co-occurrence

Page 32

Known Phrases

• Same idea as longest substring match
– But look for word (not character) sequences

• Compile a term list that includes phrases
– Technical terminology can be very helpful

• Index any phrase that occurs in the list

• Most effective in a limited domain
– Otherwise hard to capture most useful phrases
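A minimal sketch of known-phrase lookup over word sequences, assuming a left-to-right scan that prefers the longest listed phrase at each position. The phrase list, the `max_words` limit, and the function name are hypothetical.

```python
def index_known_phrases(tokens, phrases, max_words=4):
    """Scan left to right, emitting the longest listed phrase at each position."""
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    found, i = [], 0
    while i < len(tokens):
        # Try the longest window first; single words are not treated as phrases.
        for n in range(min(max_words, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_set:
                found.append(" ".join(candidate))
                i += n
                break
        else:
            i += 1  # no listed phrase starts here
    return found


term_list = ["latent semantic indexing", "semantic indexing", "information retrieval"]
text = "Latent semantic indexing helps cross language information retrieval"
phrases_found = index_known_phrases(text.split(), term_list)
print(phrases_found)
```

In line with the lessons later in the deck, a real indexer would usually index the individual words as well as the phrases found this way.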

Page 33

Syntactic Phrases

• Automatically construct sentence diagrams
– Fairly good parsers are available

• Index the noun phrases
– Assumes that queries will focus on objects

[Figure: sentence diagram of “The quick brown fox jumped over the lazy dog’s back”, with words tagged Det, Adj, Noun, Verb, and Prep, and constituents labeled Sentence, Noun Phrase, and Prepositional Phrase]

Page 34

Syntactic Variations

• The “paraphrase problem”
– Prof. Douglas Oard studies information access patterns.
– Doug studies patterns of user access to different kinds of information.

• Transformational variants (Jacquemin)
– Coordinations
• lung and breast cancer → lung cancer
– Substitutions
• inflammatory sinonasal disease → inflammatory disease
– Permutations
• addition of calcium → calcium addition

Page 35

Phrase Discovery: Collocations

• Compute observed occurrence probability
– For each single word and each word n-gram
• “buy” 10 times in 1000 words yields 0.01
• “the” 100 times in 1000 words yields 0.10
• “farm” 5 times in 1000 words yields 0.005
• “buy the farm” 4 times in 1000 words yields 0.004

• Compute n-gram probability if truly independent
– 0.01*0.10*0.005=0.000005

• Compare with observed probability
– Record phrases that occur more often than expected
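The arithmetic on this slide, written out as a short Python sketch. The counts are the slide's own; the variable names and the use of a simple observed-to-expected ratio as the comparison are our assumptions.

```python
# Counts from the slide: occurrences per 1000 words of text.
total_words = 1000
word_counts = {"buy": 10, "the": 100, "farm": 5}
phrase_count = 4  # occurrences of "buy the farm"

# Observed probability of the trigram.
p_observed = phrase_count / total_words

# Expected probability if the three words occurred independently.
p_expected = 1.0
for word in ("buy", "the", "farm"):
    p_expected *= word_counts[word] / total_words

# How many times more often the phrase occurs than independence predicts.
ratio = p_observed / p_expected
print(p_observed, p_expected, ratio)
```

The phrase occurs roughly 800 times more often than independence would predict, which is the kind of gap that marks it as a candidate collocation.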

Page 36

Phrase Indexing Lessons

• Poorly chosen phrases hurt effectiveness
– And some techniques can be slow (e.g., parsing)

• Better to index phrases and words
– Want to find constituents of compositional phrases

• With better weighting schemes, phrases yield less benefit
– Negligible improvement in some TREC systems

• Very helpful for cross-language retrieval
– Noncompositional translation, reduced ambiguity

Page 37

Cross-Language IR and Phrases

• Poser: quite ambiguous (Langenscheidt)
– Place, put (a question, a motion)
– Lay down (a principle)
– Hang (curtains)
– Set (a problem)

• Poser une question: meaning is clear!
– Ask a question

• In this case, better to use the phrase

• But is this really about phrases?

Page 38

Senses and Concepts

• What is a word sense?
– Entry in a dictionary or thesaurus
– Position or cluster in a semantic space

• What is word sense disambiguation?
– Identifying intended sense(s) from context

• Goal for IR
– Match on the intended concept, not just the words

Page 39

Problems With Word Matching

• Word matching suffers from two problems
– Synonymy: paper vs. article
– Homonymy: bank (river) vs. bank (financial)

• Disambiguation in IR: seek to resolve homonymy
– Index word senses rather than words

• Synonymy usually addressed by
– Thesaurus-based query expansion
– Latent semantic indexing

Page 40

Word Sense Disambiguation

• Context provides clues to word meaning
– “The doctor removed the appendix.”

• For each occurrence, note surrounding words
– Typically +/- 5 non-stopwords

• Group similar contexts into clusters
– Based on overlaps in the words that they contain

• Separate clusters represent different senses

Page 41

Disambiguation Example

• Consider four example sentences
– The doctor removed the appendix
– The appendix was incomprehensible
– The doctor examined the appendix
– The appendix was removed

• What clusters can you find?

• Can you find enough word senses this way?

• Might you find too many word senses?
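To experiment with these questions, the context-collection and overlap steps can be sketched as below. The stopword list, the window handling, and the function name are assumptions made for the illustration; the sketch reports shared context words and leaves the actual clustering decision to the reader.

```python
STOPWORDS = {"the", "was", "a", "an", "of"}  # tiny illustrative list
TARGET = "appendix"


def context_words(sentence, window=5):
    """Collect up to `window` non-stopwords on each side of the target word."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    content = [w for w in words if w not in STOPWORDS]
    pos = content.index(TARGET)
    nearby = content[max(0, pos - window):pos] + content[pos + 1:pos + 1 + window]
    return set(nearby)


sentences = [
    "The doctor removed the appendix",
    "The appendix was incomprehensible",
    "The doctor examined the appendix",
    "The appendix was removed",
]
contexts = [context_words(s) for s in sentences]

# Report which pairs of occurrences share context words.
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        shared = contexts[i] & contexts[j]
        print(i + 1, j + 1, sorted(shared))
```

Note that the second sentence shares no context word with any other, and the fourth shares one only with the first, which bears on the slide's questions about finding too few or too many senses.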

Page 42

Why Disambiguation Hurts

• Bag-of-words techniques already disambiguate
– When more words are present, documents rank higher
– So a context for each term is established in the query

• Formal disambiguation tries to improve precision
– But incorrect sense assignments would hurt recall
– Hard to distinguish homonymy from fine-grained polysemy

• Average precision balances recall and precision
– But the possible precision gains are small
– And current techniques substantially hurt recall

Page 43

Where Could Disambiguation Help?

• Categorization of whole documents
– Identifying location(s) in a topic hierarchy

• Visualization
– People are good at seeing signal amidst noise

• Probabilistic models
– Combining different sources of evidence
– (Requires n-best rather than 1-best responses)

Page 44

Summary

• The goal is to index the right meaning units

• Start by finding fundamental features
– Characters or shape codes (for OCR) etc.

• Combine them into easily recognized units
– Words where possible, character n-grams otherwise
– Consider alternatives to splitting or forming phrases
• But stemming is generally a good idea

• Usually best to match those units directly
– Disambiguation strategies hurt more than they help

Page 45

Agenda

• The structure of interactive IR systems

• Character sets

• Terms as units of meaning
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts

• A few words about the course

Page 46

Course Goals

• Appreciate IR system capabilities and limitations

• Understand IR system design & implementation
– For a broad range of applications and media

• Evaluate IR system performance

• Identify current IR research problems

Page 47

Course Design

• Text/readings provide background and detail
– At least one recommended reading is required

• Class provides organization and direction
– We will not cover every important detail

• Assignments and project provide experience
– The TA can help CLIS students with the project

• Final exam helps focus your effort

Page 48

Grading

• Assignments (15%)
– Mastery of concepts and experience using tools
– 796: “homework,” 828o: “programming”

• Term project (796: 50%, 828o: 30%)
– Options are described on course Web page

• Final exam (796: 35%, 828o: 55%)
– Two different in-class exams

Page 49

Handy Things to Know

• Classes will be videotaped
– Available in the CLIS library if you miss class

• Office hours are by appointment
– Send an email, or ask after class

• Everything is on the Web
– At http://www.glue.umd.edu/~oard/teaching.html

• Doug is most easily reached by email
– [email protected]

Page 50

Some Things to Do This Week

• At least skim the readings before class
– Don’t fall behind!

• Look at assignment 1
– Due in 2 weeks!

• Explore the Web site
– Start thinking about the term project

