Hugo Zaragoza (Yahoo! Research). CLEF 2008 1
Exploiting Semantics with Structured Queries
Jose Ramón Pérez-Agüera (U. Complutense de Madrid) & Hugo Zaragoza (Yahoo! Research, Barcelona)
Query expansion makes term independence a big issue…
we are double counting “meanings” !!!
Term independence assumption gets worse with query expansion… (example 1)

Verde que te quiero verde.
Verde viento. Verdes ramas.
El barco sobre la mar
y el caballo en la montaña.
Con la sombra en la cintura
ella sueña en su baranda
verde carne, pelo verde,
con ojos de fría plata.
Bajo la luna gitana, las cosas
la están mirando y ella
no puede mirarlas.
[…]

verde3 que te quiero verde2.
verde3 viento. verde1 ramas.
El barco sobre la mar
y el caballo en la montaña.
Con la sombra en la cintura
ella sueña en su baranda
verde5 carne, pelo verde1,
con ojos de fría plata.
Bajo la luna gitana, las cosas
la están mirando y ella
no puede mirarlas.
[…]

q: verde pelo
q1: verde1 pelo
q2: verde1 verde2 pelo
q3: verde1 verde2 verde3 verde4 verde5 pelo
[CLEF EFE94, 2001 Spanish topics]
Term independence assumption gets worse with query expansion… (example 2)
[CLEF EFE94, 2001 Spanish topics] [Pérez-Agüera, Zaragoza and Araujo, NLDB 2008]
-46% !!!
Term independence assumption gets worse with query expansion… (example 3)
• BM25 dependence model: $w = \frac{tf}{k + tf}$
[Figure: w plotted for tf = 1, 2, 3, 4, …, 10, with a worked example for k = 1]
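The double counting can be illustrated with a small numeric sketch. It assumes the saturation form w = tf / (k + tf) with k = 1 and ignores IDF and length normalisation; the function name and the numbers are illustrative and are not taken from the slides.

```python
def saturation(tf: float, k: float = 1.0) -> float:
    """BM25-style term-frequency saturation: w = tf / (k + tf)."""
    return tf / (k + tf)


# Four occurrences counted as ONE term: the weight saturates below 1.
single_term = saturation(4)  # 4 / (1 + 4)

# The same four occurrences spread over four expansion variants that are
# scored as independent terms: every variant contributes a fresh 1 / (1 + 1).
split_terms = sum(saturation(1) for _ in range(4))

print(single_term, split_terms)
```

Under these assumptions the split version scores 2.5 times higher than the single-term version for the same evidence, which is the effect the slide calls double counting.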
Query Expansion (example of state of the art)
• Term Selection:
– Divergence From Randomness (DFR) expansion model, Bo1 [8,6]
• Term Weighting:
– Rocchio [9]
• Perf. Prediction:
– AvICTF [5] (cheap): > 9.0
$$\mathrm{AvICTF}(q, C) = \frac{1}{ql} \sum_{t \in q} \log_2 \frac{n}{n_t}$$
[Annotations on the formulas: tf in top x = 1 document; top 40 terms; P(term); 0.3]
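A minimal sketch of the AvICTF predictor, assuming the form AvICTF(q, C) = (1/ql) Σ log2(n / n_t), where n is taken to be the number of tokens in the collection and n_t the collection frequency of term t. These symbol meanings, the function name, and the toy counts are assumptions for illustration only.

```python
import math


def avictf(query_terms, coll_freq, n_tokens):
    """Average inverse collection term frequency of a query.

    query_terms: list of query terms (its length is ql)
    coll_freq:   hypothetical map term -> occurrences in the collection (n_t)
    n_tokens:    assumed total number of tokens in the collection (n)
    """
    ql = len(query_terms)
    return sum(math.log2(n_tokens / coll_freq[t]) for t in query_terms) / ql


# Toy collection statistics, chosen as powers of two so the logs are exact.
toy_freq = {"verde": 2 ** 10, "pelo": 2 ** 8}
value = avictf(["verde", "pelo"], toy_freq, n_tokens=2 ** 20)
print(value)  # (10 + 12) / 2
```

The predictor needs only collection statistics, with no retrieval pass, which is consistent with the slide labelling it as cheap.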
Results in CLEF 2008 Robust-WSD Task:
• Standard Query Expansion:
• 3rd team in CLEF Robust out of 8. 1st team well ahead of everyone.
– It seems no one improved GMAP so they reported MAP
Query expansion makes term independence a big issue…
we are double counting “meanings” !!!
Query Clauses Idea:
“Cheap Barcelona Italian Restaurants”
{cheap, barcelona, italian, restaurant}
Expansion:
{cheap, barcelona, italian, restaurant, inexpensive, affordable, Sagrada Familia, Ramblas, Gràcia, Barceloneta, pizzeria, trattoria, café}
Structure: collect related meanings in clauses
{
c1: {cheap, inexpensive, affordable},
c2: {Barcelona, Sagrada Familia, Ramblas, Gràcia, Barceloneta, …},
c3: {Italian_restaurant, pizzeria, trattoria, café}
}
Clause independence, not term independence
Query Clauses Idea
[Figure: original query terms (term 1, term 2) and an expanded term (term e1)]
Query Clauses Idea
[Figure: original terms (term1, term2) and expanded terms (term e1, term e2, term e3, term e4) grouped into clauses c1, c2, c3]
(same idea as BM25-F on fields [10])
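Query Clauses Model
Bag of words:
$q = \{t_0, t_1, \ldots, t_{|q|}\}$
$score(d) = \sum_{t \in q} W_1(tf_d(t), l_d) * W_2(t, C)$
Query clauses (bag of bags of weighted words):
$c = \{(t_0, w_0), (t_1, w_1), \ldots, (t_{|c|}, w_{|c|})\}$
$q = \{c_0, c_1, \ldots, c_{|q|}\}$
$score(d) = \sum_{c \in q} W_1\left(\sum_{t \in c} tf_d(t) \cdot w_t,\; l_d\right) * W_2(c, C)$
Matrix notation: let […], then redefine each document as […]
Example: […]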
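The difference between the two scoring schemes can be sketched as follows. This is a simplified illustration rather than the authors' implementation: W1 is assumed to be the BM25 saturation with length normalisation (k1 = 1.2, b = 0.75), W2 is a plain lookup of an IDF-like weight, and all names, weights, and counts are hypothetical.

```python
def w1(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Assumed W1: BM25 saturation of a (possibly weighted) frequency."""
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return tf / (norm + tf)


def score_bag_of_words(doc_tf, doc_len, avg_len, terms, idf):
    # Every term is saturated on its own: term independence.
    return sum(w1(doc_tf.get(t, 0), doc_len, avg_len) * idf[t] for t in terms)


def score_clauses(doc_tf, doc_len, avg_len, clauses, clause_idf):
    # Weighted frequencies are summed inside a clause BEFORE saturation,
    # so related terms share one saturation curve: clause independence.
    total = 0.0
    for name, clause in clauses.items():
        ctf = sum(doc_tf.get(t, 0) * w for t, w in clause.items())
        total += w1(ctf, doc_len, avg_len) * clause_idf[name]
    return total


doc_tf = {"cheap": 1, "inexpensive": 1, "affordable": 1, "pizzeria": 1}
terms = list(doc_tf)
clauses = {
    "price": {"cheap": 1.0, "inexpensive": 1.0, "affordable": 1.0},
    "food": {"pizzeria": 1.0},
}
bow = score_bag_of_words(doc_tf, 100, 100, terms, {t: 1.0 for t in terms})
cl = score_clauses(doc_tf, 100, 100, clauses, {"price": 1.0, "food": 1.0})
print(bow, cl)
```

With these toy values the three price synonyms contribute three separate saturated weights in the bag-of-words score, but only one in the clause score, so the clause score is lower.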
Query Clauses Implementation of W1 and W2
clause term frequency:
clause collection frequency:
clause document likelihood:
clause collection likelihood:
In general, the projection is query-dependent and needs to be done online:
Query Clauses Implementation of W1 and W2
IDF is not straightforward: a clause $c = \{(t_0, w_0), (t_1, w_1), \ldots, (t_{|c|}, w_{|c|})\}$ has one IDF per term, $idf(t_0), idf(t_1), \ldots, idf(t_{|c|})$.
Some possibilities:
– min, max, avg (leads to inconsistent situations for small weights)
– expected clause idf:
$$icf(c, d) = \frac{\sum_{t \in c} idf(t) \cdot tf_d(t) \cdot w_t}{\sum_{t \in c} tf_d(t) \cdot w_t}$$
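The expected clause idf can be sketched as a weighted average of the term IDFs, where each term's share is its weighted frequency tf_d(t) · w_t in the document. The function name, the fallback of 0.0 when no clause term occurs in the document, and the toy numbers are assumptions for illustration.

```python
def expected_clause_idf(clause, doc_tf, idf):
    """icf(c, d): IDF averaged over clause terms, weighted by tf_d(t) * w_t.

    clause: map term -> weight w_t
    doc_tf: map term -> frequency of the term in document d
    idf:    map term -> idf(t)
    """
    num = 0.0
    den = 0.0
    for term, weight in clause.items():
        mass = doc_tf.get(term, 0) * weight
        num += idf[term] * mass
        den += mass
    # Assumption: a clause that is absent from the document gets icf = 0.
    return num / den if den > 0 else 0.0


clause = {"cheap": 1.0, "inexpensive": 0.5}
doc_tf = {"cheap": 2, "inexpensive": 2}
idf = {"cheap": 1.0, "inexpensive": 4.0}
icf = expected_clause_idf(clause, doc_tf, idf)
print(icf)  # (1.0 * 2 + 4.0 * 1) / (2 + 1)
```

Unlike min, max, or avg, a term with a very small weight or no occurrences barely moves this value, which may be why the slide prefers it.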
How can we construct the clauses?
• Idea: use WordNet to expand each term in the query as a clause.
• Idea: use statistical methods to expand each term in the query.
• Idea: use query expansion to find terms, use statistical methods to group them into clauses.
• Idea: use query expansion to find terms, use WordNet to group them into clauses.
– There exist several semantic similarity measures based on WordNet [11]: WN(s1,s2)
– We construct a clause for every original query term, and we add to it expanded terms with: WN(s1,s2) < k
– To be conservative, all terms not in an original clause are added together to a new “Other” clause.
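The last idea can be sketched as below. The WordNet measure is replaced by a hypothetical lookup table so the example is self-contained; the `wn` callable, the weight 1.0 for original terms, the threshold value, and all scores are assumptions, and the `< k` test simply follows the condition as written on the slide.

```python
def build_clauses(query_terms, expanded, wn, k):
    """Group expanded terms into one clause per original query term.

    query_terms: original query terms
    expanded:    map expanded term -> expansion weight
    wn:          callable wn(s1, s2) standing in for a WordNet-based measure
    k:           threshold; a term joins a clause when wn(...) < k
    """
    # Assumption: original terms enter their own clause with weight 1.0.
    clauses = {t: {t: 1.0} for t in query_terms}
    other = {}
    for term, weight in expanded.items():
        if term in clauses:
            continue
        homes = [t for t in query_terms if wn(t, term) < k]
        for t in homes:
            clauses[t][term] = weight
        if not homes:
            other[term] = weight
    if other:
        clauses["Other"] = other
    return clauses


# Hypothetical scores; pairs not listed default to 1.0.
TOY_WN = {
    ("cheap", "inexpensive"): 0.1,
    ("cheap", "affordable"): 0.2,
    ("restaurant", "pizzeria"): 0.3,
}


def toy_wn(s1, s2):
    return TOY_WN.get((s1, s2), 1.0)


expanded = {"inexpensive": 0.8, "affordable": 0.6, "pizzeria": 0.7, "ramblas": 0.5}
clauses = build_clauses(["cheap", "restaurant"], expanded, toy_wn, k=0.5)
print(clauses)
```

Here "ramblas" matches no original term under the toy scores, so it lands in the conservative "Other" clause.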
Results in CLEF 2008 Robust-WSD Task:
• Implementation:
– DFR Expansion: 40 new terms extracted for each query: $c = \{(t_0, w_0), (t_1, w_1), \ldots, (t_{|c|}, w_{|c|})\}$
– Query Clauses: WordNet Similarity groups the terms into clauses: $q = \{c_0, c_1, \ldots, c_{|q|}\}$
– Ranking: BM25 with standard params, on clauses, using $ctf_d(c) = \sum_{t \in c} tf_d(t) \cdot w_t$ and $icf$
Results in CLEF 2008 Robust-WSD Task:
(overall results)
Query clauses: 4% rel. impr.
• 2nd team in CLEF Robust, 1st team well ahead without use of WSD.
Biblio
[10] H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. Robertson. Microsoft Cambridge at TREC 13: Web and hard tracks. In Text REtrieval Conference (TREC-13), 2004.
[11] Z. Wu and M. Palmer. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics (ACL), 1994.