Information Retrieval
Retrieval strategies
• Vector Space Model
• Latent Semantic Indexing
• Probabilistic Retrieval Strategies
• Language Models
• Inference Networks
• Extended Boolean Retrieval
• Neural Networks
• Genetic Algorithms
• Fuzzy Set Retrieval
Vector space model
Text retrieval
Analysis
Tokenization
Stop-words
Stemming
Lemmatization
http://tartarus.org/~martin/PorterStemmer/
Document
Term
Term frequency
Inverse document frequency
Preliminary draft (c) 2007 Cambridge UP
6.2 Term frequency and weighting
Word        cf     df
ferrari     10422  17
insurance   10440  3997

Figure 6.3 Collection frequency (cf) and document frequency (df) behave differently.

term        df_t       idf_t
calpurnia   1          6
animal      100        4
sunday      1000       3
fly         10,000     2
under       100,000    1
the         1,000,000  0

Figure 6.4 Example of idf values. Here we give the idf's of terms with various frequencies in a corpus of 1,000,000 documents.
a term t. The reason to prefer df to cf is illustrated in Figure 6.3, where a simple example shows that collection frequency (cf) and document frequency (df) can behave rather differently. In particular, the cf values for both ferrari and insurance are roughly equal, but their df values differ significantly. This suggests that the few documents that do contain ferrari mention this term frequently, so that its cf is high but the df is not. Intuitively, we want such terms to be treated differently: the few documents that contain ferrari should get a significantly higher boost for a query on ferrari than the many documents containing insurance get from a query on insurance.
How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a corpus by N, we define the inverse document frequency (idf) of a term t as follows:

idf_t = log(N / df_t).    (6.1)
Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.4 gives an example of idf's in a corpus of 1,000,000 documents; in this example logarithms are to the base 10.
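The idf computation in (6.1) can be sketched in a few lines of Python (a sketch, not from the text; base-10 logarithm as in Figure 6.4):

```python
import math

def idf(df, N):
    # (6.1): inverse document frequency with a base-10 logarithm
    return math.log10(N / df)

# Reproduce the Figure 6.4 values for a corpus of N = 1,000,000 documents
N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {idf(df, N):.0f}")
```

A term occurring in a single document gets the maximum weight (6 here), while a term occurring in every document gets weight 0.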
Exercise 6.2
Why is the idf of a term always finite?
Exercise 6.3
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.
6.2.2 Tf-idf weighting
We now combine the above expressions for term frequency and inverse document frequency, to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by
tf-idf_t,d = tf_t,d × idf_t.    (6.2)
In other words, tf-idf_t,d assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

3. lowest when the term occurs in virtually all documents.
At this point, we may view each document as a vector with one component corresponding to each term, together with a weight for each component that is given by (6.2). This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Chapter 7. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.
Score(q, d) = Σ_{t ∈ q} tf-idf_t,d.    (6.3)
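Equations (6.2) and (6.3) can be sketched together in code (a sketch using raw term counts for tf and a base-10 logarithm, as in the surrounding text):

```python
import math

def tf_idf(tf, df, N):
    # (6.2): term frequency times inverse document frequency
    return tf * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    # (6.3): sum the tf-idf weights of the query terms that occur in d
    return sum(tf_idf(doc_tf[t], df[t], N)
               for t in query_terms if t in doc_tf)

# Tiny example: N = 10 documents, one query term with df = 1 and tf = 3 in d
print(score(["rare"], {"rare": 3}, {"rare": 1}, 10))  # 3 * log10(10) = 3.0
```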
Exercise 6.4
Can the tf-idf weight of a term in a document exceed 1?
Exercise 6.5
How does the base of the logarithm in (6.1) affect the score calculation in (6.3)? How does the base of the logarithm affect the relative scores of two documents on a given query?
Exercise 6.6
If the logarithm in (6.1) is computed base 2, suggest a simple approximation to the idf of a term.
6.3 Variants in weighting functions
A number of alternative schemes to tf and tf-idf have been considered; we discuss some of the principal ones here.
Search
7 Vector space retrieval
Figure 7.1 Cosine similarity illustrated. (Vector diagram of the query and document vectors omitted.)
The pair of sentences "Mary is quicker than John" and "John is quicker than Mary" are identical in such a bag of words representation.
How do we quantify the similarity between two documents in this vector space? A first attempt might consider the magnitude of the vector difference between two document vectors. This measure suffers from a drawback: two documents with very similar term distributions can have a significant vector difference simply because one is much longer than the other. Thus the relative distributions of terms may be identical in the two documents, but the absolute term frequencies of one may be far larger.
To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):
sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|),    (7.1)
where the numerator represents the inner product (also known as the dot product) of the vectors V(d1) and V(d2), while the denominator is the product of their lengths. The effect of the denominator is to normalize the vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)| and v(d2) = V(d2)/|V(d2)|. We can then rewrite (7.1) as
sim(d1, d2) = v(d1) · v(d2).    (7.2)
Thus, (7.2) can be viewed as the inner product of the normalized versions of the two document vectors. What use is the similarity measure sim(d1, d2)? Given a document d (potentially one of the d_i in the collection), consider …
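The computation in (7.1)–(7.2) can be sketched directly: take the dot product of two term-weight vectors and divide by the product of their lengths (a minimal sketch; the function name is illustrative):

```python
import math

def cosine_sim(u, v):
    # (7.1): dot product over the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Two vectors with the same term distribution but different lengths:
# the cosine is 1.0, even though their vector difference is large.
print(cosine_sim([1, 2, 0], [2, 4, 0]))
```

This is exactly the length-compensation argument above: scaling a document vector does not change its cosine similarity to any other vector.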
Q: “gold silver truck”
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
a arrived damaged delivery fire gold in of shipment silver truck
D1 1 0 1 0 1 1 1 1 1 0 0
D2 1 1 0 1 0 0 1 1 0 2 0
D3 1 1 0 0 0 1 1 1 1 0 1
Q 0 0 0 0 0 1 0 0 0 1 1
IDF
• a log 3/3 = 0
• arrived log 3/2 = 0.176
• damaged log 3/1 = 0.477
• delivery log 3/1 = 0.477
• fire log 3/1 = 0.477
• in log 3/3 = 0
• of log 3/3 = 0
• silver log 3/1 = 0.477
• shipment log 3/2 = 0.176
• truck log 3/2 = 0.176
• gold log 3/2 = 0.176
a arrived damaged delivery fire gold in of shipment silver truck
D1 0 0 0.477 0 0.477 0.176 0 0 0.176 0 0
D2 0 0.176 0 0.477 0 0 0 0 0 0.954 0.176
D3 0 0.176 0 0 0 0.176 0 0 0.176 0 0.176
Q 0 0 0 0 0 0.176 0 0 0 0.477 0.176
SC(Q,D1) = (0)(0)+(0)(0)+(0)(0.477)+(0)(0)+(0)(0.477)+(0.176)(0.176)+(0)(0)+(0)(0)+(0)(0.176)+(0.477)(0)+(0.176)(0) = (0.176)(0.176) ≈ 0.031
SC(Q,D2) = (0.954)(0.477)+(0.176)(0.176) ≈ 0.486
SC(Q,D3) = (0.176)(0.176)+(0.176)(0.176) ≈ 0.062
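The scores above can be reproduced in a short script (a sketch: N = 3 documents, raw counts as tf, base-10 idf, and the un-normalized inner product used in the slides):

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
N = len(docs)

# document frequency of each term
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def weights(text):
    # tf-idf weight for each term; skip terms that never occur in the collection
    tf = Counter(text.split())
    return {t: tf[t] * math.log10(N / df[t]) for t in tf if df[t]}

qw = weights(query)
scores = {name: sum(qw[t] * weights(text).get(t, 0.0) for t in qw)
          for name, text in docs.items()}
print({name: round(s, 3) for name, s in scores.items()})
```

Note the doubled weight for "silver" in D2 (tf = 2), which is why D2 dominates the ranking.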
Inverted index
term-1 → (dn,1) (d10,1)
term-2 → (dn,5) (dn,3)
term-3 → (d2,11) (d10,1)
term-4 → (dn,1) (d2,1)
term-5 → (dn,2) (d4,3)
term-n → (d6,1) (d7,3)
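The structure sketched above maps each term to a postings list of (document, frequency) pairs; a minimal sketch (document names are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map every term to a postings list of (doc_id, term_frequency) pairs.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)

index = build_inverted_index({"d1": "gold gold truck", "d2": "silver truck"})
print(index["truck"])  # [('d1', 1), ('d2', 1)]
```

Query evaluation then only touches the postings lists of the query terms instead of scanning every document.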
Lucene
Analysis
when you order the filtering process. Consider an analyzer that removes stop words and also injects synonyms into the token stream: it would be more efficient to remove the stop words first so that the synonym injection filter would have fewer terms to consider (see section 4.6 for a detailed example).
4.3 Using the built-in analyzers
Lucene includes several built-in analyzers. The primary ones are shown in table 4.2. We'll leave discussion of the two language-specific analyzers, RussianAnalyzer and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper, PerFieldAnalyzerWrapper, to section 4.4.
The built-in analyzers we discuss in this section—WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer—are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don't cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-trivial effects.
4.3.1 StopAnalyzer

StopAnalyzer, beyond doing basic word splitting and lowercasing, also removes stop words. Embedded in StopAnalyzer is a list of common English stop words; this list is used unless otherwise specified:
public static final String[] ENGLISH_STOP_WORDS = { "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it",
Table 4.2 Primary analyzers available in Lucene
Analyzer Steps taken
WhitespaceAnalyzer Splits tokens at whitespace
SimpleAnalyzer Divides text at nonletter characters and lowercases
StopAnalyzer Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
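The behaviour of the simpler analyzers in table 4.2 (which are Java classes in Lucene) can be sketched in a few lines; this is an illustrative re-implementation, not Lucene code, and the stop list is just the truncated head of ENGLISH_STOP_WORDS as quoted above:

```python
import re

# Head of Lucene's ENGLISH_STOP_WORDS list, as quoted above (truncated)
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "if", "in", "into", "is", "it"}

def whitespace_analyzer(text):
    # WhitespaceAnalyzer: split tokens at whitespace only
    return text.split()

def simple_analyzer(text):
    # SimpleAnalyzer: divide at non-letter characters and lowercase
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def stop_analyzer(text):
    # StopAnalyzer: SimpleAnalyzer behaviour plus stop-word removal
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

print(stop_analyzer("A Quick-Brown Fox is in it"))
```

StandardAnalyzer is not sketched here; its grammar-based tokenization (e-mail addresses, acronyms, CJK characters) is far more involved.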
Index
• IndexWriter
• Directory
• Analyzer
• Document
• Field
store
Index options: :store

:no          Don't store field.
:yes         Store field in its original format. Use this value if you want to highlight matches or print match excerpts a la Google search.
:compressed  Store field in compressed format.
Ruby Day Kraków: Full Text Search with Ferret
index
Index options: :index

:no                      Do not make this field searchable.
:yes                     Make this field searchable and tokenize its contents.
:untokenized             Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by.
:omit_norms              Same as :yes except omit the norms file. The norms file can be omitted if you don't boost any fields and you don't need scoring based on field length.
:untokenized_omit_norms  Same as :untokenized except omit the norms file.
term_vector
Index options: :term_vector

:no                      Don't store term-vectors.
:yes                     Store term-vectors without storing positions or offsets.
:with_positions          Store term-vectors with positions.
:with_offsets            Store term-vectors with offsets.
:with_positions_offsets  Store term-vectors with positions and offsets.
Search
• IndexSearcher
• Term
• Query
• Hits
Query
• API
• new TermQuery(new Term("name", "Tomek"));
• Lucene QueryParser
• queryParser.parse("name:Tomek");
TermQuery
name:Tomek
BooleanQuery
rambo OR ninja
+rambo +ninja -name:rocky
PhraseQuery
"ninja java" -name:rocky
SloppyPhraseQuery
“red-faced politicians”~3
RangeQuery
releaseDate:[2000 TO 2007]
WildcardQuery
sup?r, su*r, super*
FuzzyQuery
color~
colour, collor, colro
http://en.wikipedia.org/wiki/Levenshtein_distance
color colour - 1
colour coller - 2
Equation 1. Levenshtein Distance Score
This means that an exact match will have a score of 1.0, whereas terms with no corresponding letters will have a score of 0.0. Since FuzzyQuery has a limit to the number of matching terms it can use, the lowest scoring matches get discarded if the FuzzyQuery becomes full.
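The edit-distance computation behind FuzzyQuery can be sketched as follows (a sketch: the normalization by the longer string's length is one plausible choice for a 0-to-1 score; the exact formula Ferret/Lucene use may differ):

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(s, t):
    # Similarity in [0, 1]: 1.0 for an exact match, lower as terms diverge.
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

print(levenshtein("color", "colour"))   # 1
print(levenshtein("colour", "coller"))  # 2
```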
Due to the way FuzzyQuery is implemented, it needs to enumerate every single term in its field's index to find all valid similar terms in the dictionary. This can take a long time if you have a large index. One way to prevent performance problems is to set a minimum prefix length, by setting the :min_prefix_len parameter when creating the FuzzyQuery. This parameter is set to 0 by default, which is why it otherwise needs to enumerate every term in the index.
To minimize the expense of finding matching terms, we could set the minimum prefix length of the example query to 3. This would greatly reduce the number of terms that need to be enumerated, and "color" would still match "colour," although "cloor" would no longer match.

# FQL: 'content:color~' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color", :max_terms => 1024, :min_prefix_length => 3)
You can also set a cut-off score for matching terms by setting the :min_similarity parameter. This will not affect how many terms are enumerated, but it will affect how many terms are added to the internal MultiTermQuery, which can also help improve performance.

# FQL: 'content:color~0.8' => no way to set :min_prefix_length in FQL
query = FuzzyQuery.new(:content, "color", :max_terms => 1024, :min_similarity => 0.8, :min_prefix_length => 3)
In some cases, you may want to change the default values for :min_prefix_len and :min_similarity, particularly for use in the Ferret QueryParser. Simply set the class variables in FuzzyQuery.

FuzzyQuery.default_min_similarity = 0.8
FuzzyQuery.default_prefix_length = 3
Boost
title:Spring^10