5 June 2006 Polettini Nicola 1
Term Weighting in Information Retrieval
Polettini Nicola
Monday, June 5, 2006
Web Information Retrieval
Contents
1. Introduction to the Vector Space Model: vocabulary & terms, documents & queries, similarity measures
2. Term weighting: binary weights, SMART Retrieval System
3. Salton: "Term Precision Model" paper analysis
4. New weighting schemas: Web documents
5. Conclusions
6. References
The Vector Space Model
1. Vocabulary
2. Terms
3. Documents & Queries
4. Vector representation
5. Similarity measures
6. Cosine Similarity
Vocabulary
• Documents are represented as vectors in term space (all terms = vocabulary).
• Queries represented the same as documents.
• Query and Document weights are based on length and direction of their vector.
• A vector distance measure between the query and documents is used to rank retrieved documents.
Terms
• Documents are represented by binary or weighted vectors of terms.
• Terms are usually stems.
• Terms can also be n-grams: "Computer Science" = bigram, "World Wide Web" = trigram.
Documents & Queries Vectors
• Documents and queries are represented as “bags of words” (BOW).
• Represented as vectors:
– A vector is like an array of floating-point numbers.
– It has direction and magnitude.
– Each vector holds a place for every term in the collection.
– Therefore, most vectors are sparse.
Vector representation
• Documents and Queries are represented as vectors.
• Vocabulary = n terms.
• Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.

D_i = (w_di1, w_di2, ..., w_din)
Q = (w_q1, w_q2, ..., w_qn)

w = 0 if a term is absent.
Similarity Measures

Simple matching:       |Q ∩ D|
Dice's Coefficient:    2 |Q ∩ D| / (|Q| + |D|)
Jaccard's Coefficient: |Q ∩ D| / |Q ∪ D|
Cosine Coefficient:    |Q ∩ D| / (|Q|^1/2 · |D|^1/2)
Overlap Coefficient:   |Q ∩ D| / min(|Q|, |D|)
Cosine Similarity
The similarity of a query and a document is:
• This is called the cosine similarity.
• The normalization can be done when weighting the terms; otherwise, normalization and similarity are combined in one step.
• Cosine similarity sorts documents by degree of similarity.
sim(Q, D_i) = Σ_j=1..n (w_qj · w_dij) / ( sqrt(Σ_j=1..n w_qj²) · sqrt(Σ_j=1..n w_dij²) )

With pre-normalized weights this reduces to:

sim(Q, D_i) = Σ_j=1..n w_qj · w_dij
Example: Computing Cosine Similarity
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
Q = (0.4, 0.8)

cos α1 ≈ 0.74
cos α2 ≈ 0.98

[Figure: D1, D2 and Q plotted in two-dimensional term space (axes from 0 to 1.0); α1 and α2 are the angles between Q and D1, D2 respectively.]
Example: Computing Cosine Similarity (2)
Say we have query vector Q = (0.4, 0.8).
Also, document D2 = (0.2, 0.7).
What does their similarity comparison yield?

sim(Q, D2) = (0.4 · 0.2 + 0.8 · 0.7) / sqrt[(0.4² + 0.8²) · (0.2² + 0.7²)]
           = 0.64 / sqrt(0.42)
           ≈ 0.98
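The worked example can be reproduced with a short sketch (plain Python; the function name is illustrative, and the vector values are the ones on the slides, so small rounding differences are possible):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

Q  = (0.4, 0.8)   # query vector from the slide
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_similarity(Q, D2), 2))   # 0.98, as on the slide
print(round(cosine_similarity(Q, D1), 2))   # comes out near 0.73 (the slide rounds to 0.74)
```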
Term Weighting
1. Binary weights
2. SMART Retrieval System: local formulas, global formulas, normalization formulas
3. TFIDF
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.

docs t1 t2 t3
D1    1  0  1
D2    1  0  0
D3    0  1  1
D4    1  0  0
D5    1  1  1
D6    1  1  0
D7    0  1  0
D8    0  1  0
D9    0  0  1
D10   0  1  1
D11   1  0  1
Binary Weights Formula
w_dk = 1 if freq_dk > 0
       0 if freq_dk = 0
Binary formula gives every word that appears in a document equal relevance.
It can be useful when frequency is not important.
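Binary weighting can be sketched in a few lines (illustrative Python; the example row mirrors the toy table above):

```python
# Binary weighting: a term's weight is 1 if it occurs in the document,
# 0 otherwise. Function name is illustrative.
def binary_vector(doc_terms, vocabulary):
    return [1 if term in doc_terms else 0 for term in vocabulary]

vocab = ["t1", "t2", "t3"]
print(binary_vector({"t1", "t3"}, vocab))   # D1 row: [1, 0, 1]
```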
Why use term weighting?
• Binary weights are too limiting: terms are either present or absent.
• Non-binary weights allow us to model partial matching: retrieval of documents that only approximate the query.
• Ranking of retrieved documents by best match: term weighting improves the quality of the answer set.
Smart Retrieval System
• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
• Designed for laboratory experiments in IR –Easy to mix and match different weighting methods.
Paper: Salton, “The Smart Retrieval System – Experiments in Automatic Document Processing”, 1971
Smart Retrieval System (2)
• In SMART weights are decomposed into three factors:
w_dk = local_dk × global_k × norm_d
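The three-factor decomposition might be sketched as follows, with one common choice per factor (log tf, precomputed idf, cosine normalization); the function name and the factor choices are assumptions for illustration, not SMART's one fixed scheme:

```python
import math

# SMART-style weight = local * global * normalization, computed for one
# document's row of raw term frequencies and the collection's idf values.
def smart_weights(tf_row, idf_row):
    local = [math.log(tf) + 1.0 if tf > 0 else 0.0 for tf in tf_row]
    unnorm = [l * g for l, g in zip(local, idf_row)]
    norm = 1.0 / math.sqrt(sum(w * w for w in unnorm)) if any(unnorm) else 0.0
    return [w * norm for w in unnorm]

weights = smart_weights([2, 0, 3], [0.301, 0.602, 0.125])   # one document row
```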
Local term-weighting formulas
local_dk:

Binary:               1 if freq_dk > 0, else 0
Frequency:            freq_dk
Maxnorm:              freq_dk / max_k(freq_dk)
Augmented Normalized: 0.5 + 0.5 · freq_dk / max_k(freq_dk)   (if freq_dk > 0)
Alternate Log:        ln(freq_dk) + 1   (if freq_dk > 0)
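The local formulas translate directly into code (a sketch; the function names are illustrative, where freq is the raw count of term k in document d and max_freq the largest count in d):

```python
import math

# Local (within-document) weighting formulas.
def binary(freq):
    return 1.0 if freq > 0 else 0.0

def frequency(freq):
    return float(freq)

def maxnorm(freq, max_freq):
    return freq / max_freq

def augmented_normalized(freq, max_freq):
    return 0.5 + 0.5 * (freq / max_freq) if freq > 0 else 0.0

def alternate_log(freq):
    return math.log(freq) + 1.0 if freq > 0 else 0.0
```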
Term frequency
• TF (term frequency): count of times a term occurs in a document.

docs t1 t2 t3
D1    2  0  3
D2    1  0  0
D3    0  4  7
D4    3  0  0
D5    1  6  3
D6    3  5  0
D7    0  8  0
D8    0 10  0
D9    0  0  1
D10   0  3  5
D11   4  0  1
Term frequency (2)
• The more times a term t occurs in document d the more likely it is that t is relevant to the document.
• Used alone, favors common words, long documents.
• Too much credit to words that appears more frequently.
• Tipically used for query weighting.
Augmented Normalized Term Frequency
w_dk = K + (1 − K) · freq_dk / max_k(freq_dk)

• This formula was proposed by Croft.
• Usually K = 0.5.
• K < 0.5 for large documents.
• K = 0.5 for shorter documents.
• The output varies between 0.5 and 1 for terms that appear in the document. It's a "weak" form of normalization.
Logarithmic Term Frequency
• Logarithms are a way to de-emphasize the effect of frequency.
• Logarithmic formula decreases the effects of large differences in term frequencies.
w_dk = ln(freq_dk) + 1   (if freq_dk > 0)
Global term-weighting formulas
global_k:

Inverse:        log(NDoc / Doc_k)
Squared:        [log(NDoc / Doc_k)]²
Probabilistic:  log((NDoc − Doc_k) / Doc_k)
Frequency:      NDoc / Doc_k

NDoc = total number of documents in the collection; Doc_k = number of documents containing term k.
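A sketch of the global formulas (function names are illustrative; base-10 logarithms are assumed here because they match the worked idf numbers in the examples that follow):

```python
import math

# Global (collection-level) formulas; N = total documents in the
# collection, df = number of documents containing the term.
def idf(N, df):
    return math.log10(N / df)

def squared_idf(N, df):
    return math.log10(N / df) ** 2

def probabilistic_idf(N, df):
    return math.log10((N - df) / df)

# The idf examples for a 10000-document collection:
for df in (1, 20, 5000, 10000):
    print(df, round(idf(10000, df), 3))
```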
Document Frequency
• DF (document frequency): count of the documents in the whole collection that contain the term.
• The less frequently a term appears in the whole collection, the more discriminating it is.
Inverse Document Frequency
• Measures the rarity of the term in the collection.
• Inverts the document frequency.
• It's the most used global formula.
• The weight is higher when the term occurs in fewer documents:
– full weight for terms that occur in only one document;
– lowest weight for terms that occur in all documents.

idf_k = log(NDoc / Doc_k)
Inverse Document Frequency (2)
• IDF provides high values for rare words and low values for common words.
Examples for a collection of 10000 documents (N = 10000):

log(10000 / 1)     = 4
log(10000 / 20)    = 2.698
log(10000 / 5000)  = 0.301
log(10000 / 10000) = 0
Other IDF Schemes
• Squared IDF: used rarely as a variant of IDF.
• Probabilistic IDF:
– It assigns weights ranging from −∞, for a term that appears in every document, to log(n − 1) for a term that appears in only one document.
– Weights are negative for terms appearing in more than half of the documents.
Normalization formulas
norm (over the document's weights w_j, j = 1 to n):

Sum of weights: 1 / Σ_j w_j
Cosine:         1 / sqrt(Σ_j w_j²)
Fourth:         1 / Σ_j w_j⁴
Max:            1 / max_j w_j
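The normalization factors as code: each maps a document's weight vector to a single multiplier applied to every weight in it. This is a sketch; in particular, the "fourth" variant is reconstructed as the inverse sum of fourth powers, which is an assumption, since accounts of SMART differ on whether a fourth root is taken.

```python
import math

def sum_norm(ws):
    return 1.0 / sum(ws)

def cosine_norm(ws):
    return 1.0 / math.sqrt(sum(w * w for w in ws))

def fourth_norm(ws):
    return 1.0 / sum(w ** 4 for w in ws)

def max_norm(ws):
    return 1.0 / max(ws)

ws = [0.5, 0.63, 0.9, 1.2]
normalized = [w * cosine_norm(ws) for w in ws]   # unit-length vector
```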
Document Normalization
• Long documents have an unfair advantage:– They use a lot of terms
• So they get more matches than short documents– And they use the same words repeatedly
• So they have much higher term frequencies
• Normalization seeks to remove these effects:– Related somehow to maximum term frequency.– But also sensitive to the number of terms.
• If we don’t normalize, short documents may not be recognized as relevant.
Cosine Normalization
• It's the most widely used.
• Normalize the term weights so that longer documents are not unfairly given more weight.
• If the weights are normalized, the cosine similarity reduces to:

sim(D_i, D_j) = Σ_k=1..t w_ik · w_jk
Other normalizations
• Sum-of-weights and fourth normalization are rarely used variants of cosine normalization.
• Max Weight Normalization: assigns weights between 0 and 1, but doesn't take into account the distribution of terms over documents. It gives high importance to the most heavily weighted terms within a document (used in CiteSeer).
TFIDF Term-weighting
w_ik = tf_ik · log(N / n_k)

where:
T_k   = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C
N     = total number of documents in collection C
n_k   = number of documents in C that contain T_k
idf_k = log(N / n_k)
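The formula fits in one line of code (a sketch; a base-10 log is assumed, matching the worked tables):

```python
import math

# w_ik = tf_ik * log(N / n_k): raw term frequency times inverse
# document frequency.
def tfidf(tf, N, n_k):
    return tf * math.log10(N / n_k)

# A term occurring 3 times in a document and in 5000 of 10000 documents:
w = tfidf(3, 10000, 5000)   # 3 * 0.301
```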
TFIDF Example
• It's the most used term-weighting scheme.

[Table: raw term frequencies (tf) for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval across documents 1-4; their idf values (0.301, 0.125, 0.125, 0.125, 0.602, 0.301, 0.000, 0.602); and the resulting weights W_i,j = tf · idf.]
Normalization example

[Table: the same tf, idf and W_i,j values for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval, now with the document vector lengths (1.70, 0.97, 2.67, 0.87) and the cosine-normalized weights W'_i,j = W_i,j / length.]
Retrieval Example

Query: contaminated retrieval (query weight 1 for each term).

[Table: the normalized weights W'_i,j from the previous slide, with the query vector alongside.]

Cosine similarity scores: 0.29, 0.9, 0.19, 0.57 for documents 1-4.
Ranked list: Doc 2, Doc 4, Doc 1, Doc 3.
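With unit query weights, the retrieval step reduces to summing each document's normalized weights over the query terms. A compact sketch (the document weights here are invented for illustration, not the slide's exact table):

```python
import math

# Cosine-normalize a sparse document vector, then score it by summing the
# normalized weights of the query terms.
def normalize(vec):
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()}

def score(query_terms, doc_vec):
    return sum(doc_vec.get(t, 0.0) for t in query_terms)

docs = {
    "Doc1": {"contaminated": 0.50, "nuclear": 0.90},
    "Doc2": {"contaminated": 0.13, "retrieval": 0.75},
}
query = {"contaminated", "retrieval"}
ranked = sorted(((name, score(query, normalize(vec))) for name, vec in docs.items()),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```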
Gerard Salton paper: "The term precision model"
1. Weighting schema proposed
2. Cosine similarity
3. Density formula
4. Discrimination Value formulas
5. Term Precision formulas
6. Conclusions
Gerard Salton paper: "Weighting schema proposed"
1. Use of tf·idf formulas.
2. Underlines the importance of term weighting.
3. Use of cosine similarity.
Gerard Salton paper: “Density formula”
s = (1 / (N(N − 1))) · Σ_i=1..N Σ_j=1..N, j≠i sim(D_i, D_j)
• Density = the average pairwise cosine similarity between distinct document pairs.
• N = total number of documents.
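The density can be checked with a direct double loop (a sketch assuming the vectors are already cosine-normalized, so a plain dot product equals the cosine similarity):

```python
# Density s = 1/(N(N-1)) * sum over distinct pairs of sim(D_i, D_j).
def density(vectors):
    N = len(vectors)
    total = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                total += sum(a * b for a, b in zip(vectors[i], vectors[j]))
    return total / (N * (N - 1))
```

Two identical unit vectors give density 1.0; two orthogonal ones give 0.0.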
Gerard Salton paper: "Discrimination Value formulas"

DV_k = s_k − s
w_ik = tf_ik · DV_k
DV = Discrimination Value.
• It’s the difference between the two average densities where sk is the density for document pairs from which term k has been removed.
• If k is useful DV is positive.
Gerard Salton paper: "Discrimination Value formulas" (2)
• Terms with a high document frequency increase the total density, so their DV is negative.
• Terms with a low document frequency leave the density unchanged and DV is near zero value.
• Terms with medium document frequency decrease the total density and DV is positive.
Gerard Salton paper: “Term Precision formulas”
w = log( (r / (R − r)) / (s / (I − s)) )

• N = total documents.
• R = relevant documents with respect to a query.
• I = N − R = non-relevant documents.
• r = relevant documents in which the term appears.
• s = non-relevant documents in which the term appears (df = r + s).
• w increases in 0<df<R and decreases in R<df<N
• The maximum value of w is reached at df = R.
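A sketch of a term precision weight of the log-odds form w = log((r/(R−r))/(s/(I−s))). The smoothing constant eps is an added assumption (not in the paper as presented here) so the weight stays finite when r = R or s = 0:

```python
import math

# r = relevant documents containing the term, R = all relevant documents,
# s = non-relevant documents containing the term, I = all non-relevant.
def term_precision_weight(r, R, s, I, eps=0.5):
    return math.log(((r + eps) / (R - r + eps)) / ((s + eps) / (I - s + eps)))
```

A term concentrated in the relevant set gets a large positive weight; one concentrated in the non-relevant set gets a negative weight.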
Gerard Salton paper: “Conclusions”
Precision weights are difficult to compute in practice because the required relevance assessments of documents with respect to queries are not normally available in real retrieval situations.
New Weighting Schemas
1. Web problems
2. Document Structure
3. Hyperlinks
4. Different weighting schemas
New Weighting Schemas (2)
• Weight tokens under particular HTML tags more heavily:– <TITLE> tokens (Google seems to like title matches)
– <H1>,<H2>… tokens– <META> keyword tokens
• Parse page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.
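Tag-based boosting can be sketched as follows (illustrative Python; the multiplier values are invented, and real systems tune them empirically):

```python
# Tokens under "important" HTML tags get a boosted count.
TAG_BOOST = {"title": 5.0, "h1": 3.0, "h2": 2.0, "meta": 2.0, "body": 1.0}

def weighted_counts(sections):
    """sections: list of (tag, tokens) pairs -> {token: boosted count}."""
    counts = {}
    for tag, tokens in sections:
        boost = TAG_BOOST.get(tag, 1.0)
        for tok in tokens:
            counts[tok] = counts.get(tok, 0.0) + boost
    return counts

page = [("title", ["vector", "model"]), ("body", ["vector", "space"])]
print(weighted_counts(page))
```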
References
Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, November 1975.
Gerard Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971.
References (2)
Erica Chisholm and Tamara G. Kolda. New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.
W. B. Croft. Experiments with representation in a document retrieval system. Information Technology: Research and Development, 2:1-21, 1983.
Ray Larson and Marc Davis. SIMS 202: Information Organization and Retrieval. UC Berkeley SIMS, Lecture 18: Vector Representation, 2002.
Kishore Papineni. Why Inverse Document Frequency? IBM T. J. Watson Research Center, Yorktown Heights, New York, USA, 2001.
Chris Buckley. The importance of proper weighting methods. In M. Bates, editor, Human Language Technology. Morgan Kaufmann, 1993.
Questions?