Page 1: Modern Information Retrieval Chapter 7: Text Operations

Modern Information Retrieval

Chapter 7: Text Operations

Ricardo Baeza-Yates, Berthier Ribeiro-Neto

Page 2: Modern Information Retrieval Chapter 7: Text Operations

Document Preprocessing

- Lexical analysis of the text
- Elimination of stopwords
- Stemming
- Selection of index terms
- Construction of term categorization structures

Page 3: Modern Information Retrieval Chapter 7: Text Operations

Lexical Analysis of the Text

Word separators:
- space
- digits
- hyphens
- punctuation marks
- the case of the letters
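A minimal sketch of such a lexical analyzer, assuming one simple rule per separator class; the function name and the exact rules are illustrative, not from the book:

```python
import re

def tokenize(text):
    # Split on whitespace and punctuation, treat hyphens as separators,
    # drop standalone digit strings, and normalize case.
    tokens = []
    for raw in re.split(r"[^\w-]+", text):
        for part in raw.split("-"):
            if part and not part.isdigit():
                tokens.append(part.lower())
    return tokens

print(tokenize("State-of-the-art IR systems, ca. 1999, index text."))
# -> ['state', 'of', 'the', 'art', 'ir', 'systems', 'ca', 'index', 'text']
```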

Page 4: Modern Information Retrieval Chapter 7: Text Operations

Elimination of Stopwords

A list of stopwords:
- words that are too frequent among the documents
- articles, prepositions, conjunctions, etc.

Can reduce the size of the indexing structure considerably

Problem: a search for "to be or not to be"?
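A small sketch of stoplist filtering (the stopword set is an illustrative sample only), which also shows why the query above is problematic:

```python
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "be", "not", "in", "for"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# -> [] : the query "to be or not to be" disappears entirely
```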

Page 5: Modern Information Retrieval Chapter 7: Text Operations

Stemming

Examples:
- connect, connected, connecting, connection, connections
- effectiveness --> effective --> effect
- picnicking --> picnic
- king -/-> k

Removal strategies:
- affix removal: intuitive, simple
- table lookup
- successor variety
- n-gram
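A toy affix-removal stemmer along the lines of the first strategy; the suffix list and the minimum-stem-length guard are invented for illustration and are far cruder than a real Porter-style stemmer:

```python
# A few longest-match suffix rules with a minimum-stem-length guard,
# so "connecting" -> "connect" while the short word "king" is left alone.
SUFFIXES = ["ation", "ness", "ions", "ing", "ion", "ive", "ed", "s"]

def stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections", "king"]:
    print(w, "->", stem(w))
```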

Page 6: Modern Information Retrieval Chapter 7: Text Operations

Index Terms Selection

Motivation
- A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
- Most of the semantics is carried by the nouns.

Identification of noun groups
- A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (a sketch follows below).
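A sketch of the noun-group definition above, assuming the tokens have already been part-of-speech tagged by some external tool; the tag names and the distance threshold are illustrative:

```python
def noun_groups(tagged_tokens, max_distance=2):
    """tagged_tokens: list of (word, tag) pairs from a hypothetical tagger."""
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag == "NOUN":
            # Start a new group if this noun is too far from the previous one.
            if last_pos is not None and pos - last_pos > max_distance:
                groups.append(current)
                current = []
            current.append(word)
            last_pos = pos
    if current:
        groups.append(current)
    return groups

tagged = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
          ("department", "NOUN"), ("is", "VERB"), ("hiring", "VERB"),
          ("a", "DET"), ("professor", "NOUN")]
print(noun_groups(tagged))  # -> [['computer', 'science', 'department'], ['professor']]
```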

Page 7: Modern Information Retrieval Chapter 7: Text Operations

Thesauri

Peter Roget, 1988. Example:

cowardly adj. Ignobly lacking in courage: cowardly turncoats.
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

A controlled vocabulary for indexing and searching

Page 8: Modern Information Retrieval Chapter 7: Text Operations

The Purpose of a Thesaurus

To provide a standard vocabulary for indexing and searching

To assist users with locating terms for proper query formulation

To provide classified hierarchies that allow the broadening and narrowing of the current query request

Page 9: Modern Information Retrieval Chapter 7: Text Operations

Thesaurus Term Relationships

- BT: broader term
- NT: narrower term
- RT: related term (non-hierarchical, but related)
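A minimal sketch of how BT/NT/RT links might be stored and followed to broaden or narrow a query term; the entries and function names are invented for illustration:

```python
# The entries below are invented; a real thesaurus would be far larger.
THESAURUS = {
    "cat":    {"BT": ["feline"], "NT": ["siamese", "persian"], "RT": ["pet"]},
    "feline": {"BT": ["mammal"], "NT": ["cat", "lion"],        "RT": []},
}

def expand(term, relation):
    """Follow BT (broader), NT (narrower), or RT (related) links for a term."""
    return THESAURUS.get(term, {}).get(relation, [])

print(expand("cat", "BT"))  # broaden the query  -> ['feline']
print(expand("cat", "NT"))  # narrow the query   -> ['siamese', 'persian']
```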

Page 10: Modern Information Retrieval Chapter 7: Text Operations

Term Selection

Automatic Text Processing, by G. Salton, Chapter 9, Addison-Wesley, 1989.

Page 11: Modern Information Retrieval Chapter 7: Text Operations

Automatic Indexing

Indexing: assign identifiers (index terms) to text documents.

Identifiers:
- single-term vs. term phrase
- controlled vs. uncontrolled vocabularies (e.g., instruction manuals, terminological schedules, …)
- objective vs. nonobjective text identifiers (cataloging rules define, e.g., author names, publisher names, dates of publication, …)

Page 12: Modern Information Retrieval Chapter 7: Text Operations

Two Issues

Issue 1: indexing exhaustivity
- exhaustive: assign a large number of terms
- nonexhaustive

Issue 2: term specificity
- broad terms (generic): cannot distinguish relevant from nonrelevant documents
- narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant

Page 13: Modern Information Retrieval Chapter 7: Text Operations

Parameters of retrieval effectiveness

- Recall
- Precision

Goal: high recall and high precision

P = number of relevant items retrieved / total number of items retrieved

R = number of relevant items retrieved / total number of relevant items in the collection

Page 14: Modern Information Retrieval Chapter 7: Text Operations

[Figure: retrieved part of the document space vs. relevant and nonrelevant items]

                 Relevant items    Nonrelevant items
Retrieved             a                  b
Not retrieved         d                  c

Precision = a / (a + b)

Recall = a / (a + d)

Page 15: Modern Information Retrieval Chapter 7: Text Operations

A Joint Measure

F-score

F = ((β² + 1) · P · R) / (β² · P + R)

β is a parameter that encodes the relative importance of recall and precision:
- β = 1: equal weight
- β < 1: precision is more important
- β > 1: recall is more important
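A short sketch computing precision, recall, and the joint F measure defined above; the counts are invented:

```python
def precision_recall(relevant_retrieved, retrieved, relevant_total):
    return relevant_retrieved / retrieved, relevant_retrieved / relevant_total

def f_score(p, r, beta=1.0):
    # F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

p, r = precision_recall(relevant_retrieved=30, retrieved=50, relevant_total=100)
print(round(p, 2), round(r, 2), round(f_score(p, r), 2))  # -> 0.6 0.3 0.4
print(round(f_score(p, r, beta=2.0), 2))                  # recall weighted more heavily
```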

Page 16: Modern Information Retrieval Chapter 7: Text Operations

Choices of Recall and Precision

Both recall and precision vary from 0 to 1. Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall.

In many circumstances, recall and precision values between 0.5 and 0.6 are more satisfactory for the average user.

Page 17: Modern Information Retrieval Chapter 7: Text Operations

Term-Frequency Consideration

Function words
- for example, "and", "or", "of", "but", …
- the frequencies of these words are high in all texts

Content words
- words that actually relate to document content
- varying frequencies in the different texts of a collection indicate term importance for content

Page 18: Modern Information Retrieval Chapter 7: Text Operations

A Frequency-Based Indexing Method

1. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words.

2. Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.

3. Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
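A sketch of the three steps above on a toy document; the stoplist and threshold are illustrative:

```python
from collections import Counter

STOPLIST = {"the", "of", "a", "and", "to", "in", "on"}

def index_terms(document_tokens, threshold=1):
    # Step 1: drop stoplist words; step 2: count tf; step 3: keep terms with tf > T.
    tf = Counter(t for t in document_tokens if t not in STOPLIST)
    return {term: freq for term, freq in tf.items() if freq > threshold}

doc = "the cat sat on the mat the cat purred".split()
print(index_terms(doc))  # -> {'cat': 2}; all other content words have tf <= 1
```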

Page 19: Modern Information Retrieval Chapter 7: Text Operations

Inverse Document Frequency

Inverse document frequency (IDF) for term Tj:

idfj = log(N / dfj)

where dfj (the document frequency of term Tj) is the number of documents in which Tj occurs.

Terms that serve both recall and precision occur frequently in individual documents but rarely in the remainder of the collection.

Page 20: Modern Information Retrieval Chapter 7: Text Operations

TFxIDF

Weight wij of a term Tj in a document Di:

wij = tfij × log(N / dfj)

- Eliminate common function words
- Compute the value of wij for each term Tj in each document Di
- Assign to the documents of a collection all terms with sufficiently high tf × idf factors
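A sketch of the tf × idf weighting on a toy three-document collection; the documents and terms are invented:

```python
import math
from collections import Counter

docs = [["information", "retrieval", "text"],
        ["text", "operations", "text"],
        ["retrieval", "models"]]

N = len(docs)
df = Counter(term for d in docs for term in set(d))  # document frequency of each term

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}  # w_ij = tf_ij * log(N / df_j)

print(tfidf(docs[1]))  # 'text': 2 * log(3/2), 'operations': 1 * log(3/1)
```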

Page 21: Modern Information Retrieval Chapter 7: Text Operations

Term-discrimination Value

Useful index terms distinguish the documents of a collection from each other.

Document space
- Two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together.
- When a high-frequency term without discrimination power is assigned, it increases the document space density.

Page 22: Modern Information Retrieval Chapter 7: Text Operations

[Figure: A Virtual Document Space: original state, after assignment of a good discriminator, after assignment of a poor discriminator]

Page 23: Modern Information Retrieval Chapter 7: Text Operations

Good Term Assignment

When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection.

This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.

Page 24: Modern Information Retrieval Chapter 7: Text Operations

Poor Term Assignment

A high-frequency term is assigned that does not discriminate between the objects of a collection. Its assignment renders the documents more similar to each other.

This is reflected in an increase in document space density.

Page 25: Modern Information Retrieval Chapter 7: Text Operations

Term Discrimination Value

Definition: dvj = Q − Qj

where Q and Qj are the space densities before and after the assignment of term Tj, with

Q = (1 / (N(N − 1))) Σ_{i=1..N} Σ_{k=1..N, k≠i} sim(Di, Dk)

dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term.

Page 26: Modern Information Retrieval Chapter 7: Text Operations

Variations of Term-Discrimination Value with Document Frequency:

Document frequency    Term-discrimination value
Low                   dvj = 0
Medium                dvj > 0
High                  dvj < 0

Page 27: Modern Information Retrieval Chapter 7: Text Operations

TFij × dvj

wij = tfij × dvj

Compared with wij = tfij × log(N / dfj):
- N / dfj decreases steadily with increasing document frequency
- dvj increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger

Page 28: Modern Information Retrieval Chapter 7: Text Operations

Document Centroid

Issue: efficiency problem (N(N − 1) pairwise similarities)

Document centroid C = (c1, c2, c3, ..., ct), where

cj = Σ_{i=1..N} wij

and wij is the weight of the j-th term in document i.

Space density:

Q = (1 / N) Σ_{i=1..N} sim(C, Di)
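A sketch tying Pages 21 to 28 together: compute the centroid-based space density Q, then dvj = Q − Qj, where Q is the density before term Tj is assigned (the term removed) and Qj the density after. Cosine similarity is assumed for sim(), and the tiny term-document matrix is invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def space_density(docs):
    # Q = (1/N) * sum_i sim(C, D_i), with centroid components c_j = sum_i w_ij
    centroid = [sum(col) for col in zip(*docs)]
    return sum(cosine(centroid, d) for d in docs) / len(docs)

def discrimination_value(docs, j):
    # dv_j = Q - Q_j: density with term j removed minus density with term j assigned
    without_j = [[w for k, w in enumerate(d) if k != j] for d in docs]
    return space_density(without_j) - space_density(docs)

# Rows are documents, columns are term weights. Term 0 occurs uniformly in every
# document; terms 1 and 2 each occur in only half of the collection.
docs = [[5, 3, 0], [5, 3, 0], [5, 0, 2], [5, 0, 2]]
for j in range(3):
    print(f"dv_{j} = {discrimination_value(docs, j):+.3f}")
# dv_0 < 0 (high-frequency, poor discriminator); dv_1 and dv_2 > 0
```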

Page 29: Modern Information Retrieval Chapter 7: Text Operations

Probabilistic Term Weighting

Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection

Definition: given a user query q, and the ideal answer set of the relevant documents

From decision theory, the best ranking algorithm for a document D:

g(D) = log( Pr(D|rel) / Pr(D|nonrel) ) + log( Pr(rel) / Pr(nonrel) )

Page 30: Modern Information Retrieval Chapter 7: Text Operations

Probabilistic Term Weighting

Pr(rel), Pr(nonrel): the document's a priori probabilities of relevance and nonrelevance

Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets

Page 31: Modern Information Retrieval Chapter 7: Text Operations

Assumption: terms occur independently in documents

Pr(D|rel) = Π_{i=1..t} Pr(xi|rel)

Pr(D|nonrel) = Π_{i=1..t} Pr(xi|nonrel)

Page 32: Modern Information Retrieval Chapter 7: Text Operations

Derivation Process

g(D) = log( Pr(D|rel) / Pr(D|nonrel) ) + log( Pr(rel) / Pr(nonrel) )

     = log( Π_{i=1..t} Pr(xi|rel) / Π_{i=1..t} Pr(xi|nonrel) ) + constants

     = Σ_{i=1..t} log( Pr(xi|rel) / Pr(xi|nonrel) ) + constants

Page 33: Modern Information Retrieval Chapter 7: Text Operations

Given a document D=(d1, d2, …, dt)

Assume di is either 0 (absent) or 1 (present).

Pr(xi = 1 | rel) = pi        Pr(xi = 0 | rel) = 1 − pi
Pr(xi = 1 | nonrel) = qi     Pr(xi = 0 | nonrel) = 1 − qi

so that

Pr(xi = di | rel) = pi^di (1 − pi)^(1 − di)

Pr(xi = di | nonrel) = qi^di (1 − qi)^(1 − di)

For a specific document D:

g(D) = Σ_{i=1..t} log( Pr(xi = di | rel) / Pr(xi = di | nonrel) ) + constants

Page 34: Modern Information Retrieval Chapter 7: Text Operations

g(D) = Σ_{i=1..t} log( Pr(xi = di | rel) / Pr(xi = di | nonrel) ) + constants

     = Σ_{i=1..t} log( pi^di (1 − pi)^(1 − di) / ( qi^di (1 − qi)^(1 − di) ) ) + constants

     = Σ_{i=1..t} log( (pi (1 − qi))^di (1 − pi) / ( (qi (1 − pi))^di (1 − qi) ) ) + constants

Page 35: Modern Information Retrieval Chapter 7: Text Operations

Term Relevance Weight

trj = log( pj (1 − qj) / ( qj (1 − pj) ) )

g(D) = Σ_{i=1..t} di log( pi (1 − qi) / ( qi (1 − pi) ) ) + Σ_{i=1..t} log( (1 − pi) / (1 − qi) ) + constants

Page 36: Modern Information Retrieval Chapter 7: Text Operations

Issue

How to compute pj and qj?

pj = rj / R
qj = (dfj − rj) / (N − R)

- rj: the number of relevant documents containing term Tj
- R: the total number of relevant documents
- N: the total number of documents
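A sketch of the term-relevance weight trj computed from these estimates; the counts are invented, and the degenerate cases (rj = 0 or rj = R) are not handled:

```python
import math

def term_relevance(r_j, df_j, R, N):
    p_j = r_j / R                  # P(term present | relevant)
    q_j = (df_j - r_j) / (N - R)   # P(term present | nonrelevant)
    return math.log(p_j * (1 - q_j) / (q_j * (1 - p_j)))

# e.g. 8 of 10 known relevant documents contain the term,
# which occurs in 50 of 1000 documents overall.
print(round(term_relevance(r_j=8, df_j=50, R=10, N=1000), 2))
```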

Page 37: Modern Information Retrieval Chapter 7: Text Operations

Estimation of Term-Relevance

The occurrence probability of a term in the nonrelevant documents, qj, is approximated by the occurrence probability of the term in the entire document collection:

qj = dfj / N

The occurrence probabilities of the terms in the small number of relevant documents are approximated by a constant value, pj = 0.5 for all j.

Page 38: Modern Information Retrieval Chapter 7: Text Operations

Comparison

trj = log( pj (1 − qj) / ( qj (1 − pj) ) )
    = log( 0.5 × (1 − dfj / N) / ( (dfj / N) × 0.5 ) )
    = log( (N − dfj) / dfj )

When N is sufficiently large, N − dfj ≈ N, so

trj = log( (N − dfj) / dfj ) ≈ log( N / dfj ) = idfj
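A quick numeric check of the approximation: log((N − dfj) / dfj) approaches log(N / dfj) as dfj becomes small relative to N. The collection sizes below are invented:

```python
import math

N = 100_000
for df in (10, 100, 1_000, 10_000):
    tr = math.log((N - df) / df)   # term-relevance weight with the estimates above
    idf = math.log(N / df)         # inverse document frequency
    print(df, round(tr, 3), round(idf, 3))
```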

Page 39: Modern Information Retrieval Chapter 7: Text Operations

Estimation of Term-Relevance

Estimate the number of relevant documents rj in the collection that contain term Tj as a function of the known document frequency dfj of the term Tj.

pj = rj / R
qj = (dfj − rj) / (N − R)

R: an estimate of the total number of relevant documents in the collection.

Page 40: Modern Information Retrieval Chapter 7: Text Operations

Summary

- Inverse document frequency, idfj: tfij × idfj (TFxIDF)
- Term discrimination value, dvj: tfij × dvj
- Probabilistic term weighting, trj: tfij × trj

All are based on global properties of terms in a document collection.

