Date posted: 20-Jan-2016
Uploaded by: shanna-gray
Basic Issues in a Retrieval Model
- How to represent text objects?
- What similarity function should be used?
- How to refine the query according to users' feedback?
Basic Issues in IR
- How to represent queries?
- How to represent documents?
- How to compute the similarity between documents and queries?
- How to utilize users' feedback to enhance retrieval performance?
IR: Formal Formulation
- Vocabulary V = {w1, w2, ..., wn} of a language
- Query q = q1, ..., qm, where qi ∈ V
- Collection C = {d1, ..., dk}
  - Document di = (di,1, ..., di,mi), where di,j ∈ V
- Set of relevant documents R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query is a "hint" on which documents are in R(q)
- Task: compute R'(q), an approximation of R(q)
Computing R(q)
Strategy 1: Document selection
- Classification function f(d, q) ∈ {0, 1}
  - Outputs 1 for relevant, 0 for irrelevant
- R(q) is determined as the set {d ∈ C | f(d, q) = 1}
- The system must decide whether each document is relevant ("absolute relevance")
- Example: Boolean retrieval
Computing R(q)
Strategy 2: Document ranking
- Similarity function f(d, q)
  - Outputs a similarity score between document d and query q
- Cutoff θ: the minimum similarity for a document to be considered relevant
- R(q) is determined as the set {d ∈ C | f(d, q) > θ}
- The system must decide whether one document is more likely to be relevant than another ("relative relevance")
Document Selection vs. Ranking
[Figure: a collection of relevant (+) and non-relevant (-) documents. Document ranking with f(d, q) orders them by score:
  0.98 d1 (+), 0.95 d2 (+), 0.83 d3 (-), 0.80 d4 (+), 0.76 d5 (-), 0.56 d6 (-), 0.34 d7 (-), 0.21 d8 (+), 0.21 d9 (-)
R'(q) is a top-ranked prefix of this list, compared against the true R(q).]
Document Selection vs. Ranking
[Figure: the same ranked list, contrasted with document selection: a binary classifier f(d, q) ∈ {0, 1} labels each document, and R'(q) is the set labeled 1, again compared against the true R(q).]
Ranking Is Often Preferred
- A similarity function is more general than a classification function
- A classifier is unlikely to be accurate
  - Ambiguous information needs, short queries
- Relevance is a subjective concept
- Absolute relevance vs. relative relevance
Probability Ranking Principle
- As stated by Cooper: ranking documents by decreasing probability of usefulness maximizes the utility of IR systems
- "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
Vector Space Model
- Any text object can be represented by a term vector
  - Examples: documents, queries, sentences, ...
  - A query is viewed as a short document
- Similarity is determined by the relationship between two vectors
  - e.g., the cosine of the angle between the vectors, or the distance between the vectors
- The SMART system:
  - Developed at Cornell University, 1960-1999
  - Still widely used
Vector Space Model: Illustration

           Java   Starbucks   Microsoft
  D1        1        1           0
  D2        0        1           1
  D3        1        0           1
  D4        1        1           1
  Query     1        0.1         1
Vector Space Model: Similarity
- Represent both documents and queries by word-histogram vectors
  - n: the number of unique words
- A query q = (q1, q2, ..., qn)
  - qi: occurrence count of the i-th word in the query
- A document dk = (dk,1, dk,2, ..., dk,n)
  - dk,i: occurrence count of the i-th word in the document
- Similarity of a query q to a document dk

Some Background in Linear Algebra
- Dot product (scalar product):
  q · dk = q1·dk,1 + q2·dk,2 + ... + qn·dk,n
- Example (measuring similarity by dot product):
  q = [1, 2, 5], dk = [1, 3, 4]:  q · dk = 1·1 + 2·3 + 5·4 = 26
  q = [1, 2, 5], dk = [4, 1, 0]:  q · dk = 1·4 + 2·1 + 5·0 = 6
Some Background in Linear Algebra
- Length (norm) of a vector:
  |q| = √(q1² + q2² + ... + qn²),  |dk| = √(dk,1² + dk,2² + ... + dk,n²)
- Angle θ(q, dk) between two vectors:
  cos(θ(q, dk)) = (q · dk) / (|q| · |dk|)
                = (q1·dk,1 + q2·dk,2 + ... + qn·dk,n) / (√(q1² + ... + qn²) · √(dk,1² + ... + dk,n²))
Some Background in Linear Algebra
- Example (measuring similarity by the angle between vectors):
  q = [1, 2, 5], dk = [1, 3, 4]:
    cos(θ(q, dk)) = (1·1 + 2·3 + 5·4) / (√(1² + 2² + 5²) · √(1² + 3² + 4²)) = 26/(√30 · √26) ≈ 0.93
  q = [1, 2, 5], dk = [4, 1, 0]:
    cos(θ(q, dk)) = (1·4 + 2·1 + 5·0) / (√(1² + 2² + 5²) · √(4² + 1² + 0²)) = 6/(√30 · √17) ≈ 0.27
Vector Space Model: Similarity
- Given:
  - A query q = (q1, q2, ..., qn), where qi is the occurrence count of the i-th word in the query
  - A document dk = (dk,1, dk,2, ..., dk,n), where dk,i is the occurrence count of the i-th word in the document
- Similarity of a query q to a document dk (dot product):
  sim(q, dk) = q · dk = q1·dk,1 + q2·dk,2 + ... + qn·dk,n
- Normalized similarity (cosine of the angle θ(q, dk)):
  sim'(q, dk) = cos(θ(q, dk)) = (q1·dk,1 + q2·dk,2 + ... + qn·dk,n) / (√(q1² + q2² + ... + qn²) · √(dk,1² + dk,2² + ... + dk,n²))
Vector Space Model: Similarity
- Example (dot product):
  q = [1, 2, 5], dk = [1, 3, 4]:  q · dk = 1·1 + 2·3 + 5·4 = 26
  q = [1, 2, 5], dk = [0, 0, 8]:  q · dk = 1·0 + 2·0 + 5·8 = 40
- Example (cosine):
  q = [1, 2, 5], dk = [1, 3, 4]:
    cos(θ(q, dk)) = 26 / (√(1² + 2² + 5²) · √(1² + 3² + 4²)) = 26/(√30 · √26) ≈ 0.93
  q = [1, 2, 5], dk = [0, 0, 8]:
    cos(θ(q, dk)) = 40 / (√(1² + 2² + 5²) · √(0² + 0² + 8²)) = 40/(8 · √30) ≈ 0.913
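As a sketch tying this back to the earlier Java/Starbucks/Microsoft illustration, cosine similarity ranks those four documents against the query vector (vectors taken from that table; helper names are ours):

```python
import math

def cosine(q, d):
    """Cosine of the angle between two term vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

# Rows of the illustration table: (Java, Starbucks, Microsoft) counts.
docs = {"D1": [1, 1, 0], "D2": [0, 1, 1], "D3": [1, 0, 1], "D4": [1, 1, 1]}
query = [1, 0.1, 1]

ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranking:  # D3 ranks first: it matches both high-weight query terms
    print(name, round(cosine(query, docs[name]), 3))
```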
Term Weighting
- wk,i: the importance of the i-th word for document dk
- Why weighting? Some query terms carry more information than others
- TF.IDF weighting
  - TF (Term Frequency) = within-document frequency
  - IDF (Inverse Document Frequency)
  - TF normalization: avoid biasing toward long documents
- Unweighted similarity:
  sim(q, dk) = q1·dk,1 + q2·dk,2 + ... + qn·dk,n
- Weighted similarity:
  sim(q, dk) = q1·wk,1·dk,1 + q2·wk,2·dk,2 + ... + qn·wk,n·dk,n
TF Weighting
- A term is important if it occurs frequently in a document
- Formulas (f(t, d): occurrence count of word t in document d)
- Maximum-frequency normalization:
  TF(t, d) = 0.5 + 0.5 · f(t, d) / MaxFreq(d)
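The maximum-frequency normalization can be sketched as follows (a minimal illustration; the helper name and the zero-weight convention for absent terms are our assumptions):

```python
from collections import Counter

def tf_maxfreq(term, doc_tokens):
    """0.5 + 0.5 * f(t,d) / MaxFreq(d): maximum-frequency TF normalization."""
    counts = Counter(doc_tokens)
    if term not in counts:
        return 0.0  # assumed convention: absent terms get zero weight
    return 0.5 + 0.5 * counts[term] / max(counts.values())

doc = "to be or not to be".split()
print(tf_maxfreq("to", doc))  # most frequent term -> 0.5 + 0.5*2/2 = 1.0
print(tf_maxfreq("or", doc))  # 0.5 + 0.5*1/2 = 0.75
```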
TF Weighting
- A term is important if it occurs frequently in a document
- Formulas (f(t, d): occurrence count of word t in document d)
- "Okapi/BM25 TF":
  TF(t, d) = (k + 1) · f(t, d) / (f(t, d) + k · (1 − b + b · doclen(d)/avg_doclen))
  - doclen(d): the length of document d
  - avg_doclen: the average document length in the collection
  - k, b: predefined constants
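A sketch of the Okapi/BM25 TF formula above (the k and b values are illustrative defaults, not taken from the slides):

```python
def bm25_tf(f_td, doclen, avg_doclen, k=1.2, b=0.75):
    """Okapi/BM25 TF: (k+1)*f / (f + k*(1 - b + b*doclen/avg_doclen))."""
    return (k + 1) * f_td / (f_td + k * (1 - b + b * doclen / avg_doclen))

# Saturates with raw frequency: doubling f far less than doubles the weight.
print(bm25_tf(1, doclen=100, avg_doclen=100))
print(bm25_tf(10, doclen=100, avg_doclen=100))
# Longer documents are penalized for the same raw frequency.
print(bm25_tf(5, doclen=300, avg_doclen=100))
```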
TF Normalization
- Why?
  - Document length variation
  - "Repeated occurrences" are less informative than the "first occurrence"
- Two views of document length:
  - A document is long because it uses more words
  - A document is long because it has more content
- Generally penalize long documents, but avoid over-penalizing (pivoted normalization)
TF Normalization
[Figure: normalized TF plotted against raw TF for different normalization settings]
- "Pivoted normalization":
  TF(t, d) = (k + 1) · f(t, d) / (f(t, d) + k · (1 − b + b · doclen(d)/avg_doclen))
IDF Weighting
- A term is discriminative if it occurs in only a few documents
- Formula: IDF(t) = 1 + log(n/m)
  - n: total number of documents
  - m: number of documents containing term t (document frequency)
- Can be interpreted as mutual information
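The IDF formula in code (a sketch; the natural logarithm is assumed, since the slides do not specify a base):

```python
import math

def idf(n_docs, doc_freq):
    """IDF(t) = 1 + log(n/m): n total docs, m docs containing the term."""
    return 1 + math.log(n_docs / doc_freq)

print(idf(1000, 1000))  # term in every doc: 1 + log(1) = 1.0
print(idf(1000, 10))    # rare term gets a much higher weight
```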
TF-IDF Weighting
- TF-IDF weighting: the importance of a term t to a document d
  weight(t, d) = TF(t, d) · IDF(t)
- Frequent in the document → high TF → high weight
- Rare in the collection → high IDF → high weight
- Weighted similarity:
  sim(q, dk) = q1·wk,1·dk,1 + q2·wk,2·dk,2 + ... + qn·wk,n·dk,n
TF-IDF Weighting
- In the weighted similarity
  sim(q, dk) = q1·wk,1·dk,1 + q2·wk,2·dk,2 + ... + qn·wk,n·dk,n,
  both qi and dk,i are binary values, i.e., the presence or absence of a word in the query and the document.
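Putting TF and IDF together over a toy collection (a sketch using raw within-document frequency for TF and the 1 + log(n/m) IDF; the documents are invented for illustration):

```python
import math
from collections import Counter

docs = [
    "java coffee starbucks".split(),
    "java microsoft windows".split(),
    "microsoft windows office".split(),
]

def tf_idf(term, doc_tokens, collection):
    """weight(t,d) = TF(t,d) * IDF(t), with raw TF and IDF(t) = 1 + log(n/m)."""
    tf = Counter(doc_tokens)[term]           # raw within-doc frequency
    m = sum(term in d for d in collection)   # document frequency
    if tf == 0 or m == 0:
        return 0.0
    idf = 1 + math.log(len(collection) / m)
    return tf * idf

print(tf_idf("starbucks", docs[0], docs))  # rare term: high weight
print(tf_idf("java", docs[0], docs))       # common term: lower weight
```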
Problems with the Vector Space Model
- Still limited to word-based matching
  - A document will never be retrieved if it contains none of the query words
- How can the vector space model be modified to address this?
Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a vector in the concept space, i.e., a mixture of concepts

         c1   c2   c3   c4   c5   m1   m2   m3   m4
  A1      1    1    1    1    1    0    0    0    0
  A2      0    0    0    0    0    1    1    1    1
Choosing Bases for VSM
- How should the "basic concepts" be defined or selected?
  - In the VS model, each term is viewed as an independent concept
Linear Algebra Basics: Eigen-Analysis
- Eigenvectors (for a square m×m matrix S): S v = λ v
  - λ: eigenvalue; v: (right) eigenvector
- Example:
Linear Algebra Basics: Eigen-Analysis
- Example: S = [[2, 1], [1, 2]]
  - The first eigenvalue is 3, with eigenvector v1 = [1/√2, 1/√2]
  - The second eigenvalue is 1, with eigenvector v2 = [1/√2, −1/√2]
Linear Algebra Basics: Eigen-Decomposition
- S = U Λ Uᵀ, with U = [v1 v2] and Λ = diag(3, 1):
  [[2, 1],    [[1/√2,  1/√2],    [[3, 0],    [[1/√2,  1/√2],
   [1, 2]] =   [1/√2, −1/√2]]  ·  [0, 1]]  ·  [1/√2, −1/√2]]
Linear Algebra Basics: Eigen-Decomposition
- S = U Λ Uᵀ
- This holds generally for any symmetric square matrix
- The columns of U are the eigenvectors of S
- The diagonal elements of Λ are the eigenvalues of S
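The eigen-decomposition above can be verified numerically (a sketch assuming NumPy is available; `eigh` is the routine for symmetric matrices and returns eigenvalues in ascending order):

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, U = np.linalg.eigh(S)  # eigh: symmetric matrices, ascending eigenvalues
print(eigvals)                  # [1. 3.]

# Reconstruct S = U * Lambda * U^T from the eigenvectors and eigenvalues.
S_rebuilt = U @ np.diag(eigvals) @ U.T
print(np.allclose(S, S_rebuilt))  # True
```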
Singular Value Decomposition
- For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition, SVD):
  A = U Σ Vᵀ, where U is m×m, Σ is m×n, and V is n×n
- The columns of U are the left singular vectors
- The columns of V are the right singular vectors
- Σ is a diagonal matrix of singular values
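The factorization A = U Σ Vᵀ with NumPy (a sketch; note that `np.linalg.svd` returns Vᵀ directly, and the singular values as a 1-D array that must be embedded in an m×n Σ):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])        # a 2x3 matrix of rank 2
U, s, Vt = np.linalg.svd(A)            # U: 2x2, s: singular values, Vt: 3x3
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)             # embed the singular values in a 2x3 Sigma
print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U * Sigma * V^T
```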
Latent Semantic Indexing (LSI)
- Computation: use singular value decomposition (SVD), keeping the first m largest singular values and the corresponding singular vectors, where m is the number of concepts
- U: representation of the concepts in term space
- Vᵀ: representation of the concepts in document space
SVD: Properties
- rank(S): the maximum number of row (or column) vectors of a matrix S that are linearly independent
- SVD produces the best low-rank approximation
  - Example: X with rank(X) = 9 is approximated by X' with rank(X') = 2, keeping only the largest singular values (3.34 and 2.54)
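The best-low-rank-approximation property can be illustrated with a truncated SVD (a sketch; the small matrix below is invented for illustration, not the rank-9 X from the slide):

```python
import numpy as np

# A small term-document-style matrix; its best rank-2 approximation
# keeps only the two largest singular values, as in LSI.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # truncated SVD: rank-2 X'
print(np.linalg.matrix_rank(X2))            # 2
# Among all rank-2 matrices, X2 minimizes the Frobenius-norm error to X.
print(np.linalg.norm(X - X2))
```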