transcript
- Slide 1
- Salton2-1 Automatic Indexing Hsin-Hsi Chen
- Slide 2
- Salton2-2 Indexing. Indexing: assign identifiers to text items. Assign: manual vs. automatic indexing. Identifiers: objective vs. nonobjective text identifiers (cataloging rules define objective identifiers, e.g., author names, publisher names, dates of publication); controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules); single terms vs. term phrases.
- Slide 3
- Salton2-3 Two Issues. Issue 1: indexing exhaustivity. Exhaustive indexing assigns a large number of terms; nonexhaustive indexing assigns fewer. Issue 2: term specificity. Broad (generic) terms cannot distinguish relevant from nonrelevant items; narrow (specific) terms retrieve relatively fewer items, but most of them are relevant.
- Slide 4
- Salton2-4 Parameters of retrieval effectiveness: recall and precision. Goal: high recall and high precision.
- Slide 5
- Salton2-5 Retrieval contingency table:
                    Relevant items   Nonrelevant items
    Retrieved part       a                  b
    Not retrieved        c                  d
  Recall = a / (a + c); Precision = a / (a + b)
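The contingency table on this slide can be turned into the two effectiveness parameters directly; a minimal sketch (the function name is illustrative, not from the slides):

```python
def precision_recall(a, b, c, d):
    """Precision and recall from the retrieval contingency table:
    a = relevant retrieved, b = nonrelevant retrieved,
    c = relevant not retrieved, d = nonrelevant not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    return precision, recall
```

For example, with a = 30, b = 10, c = 20 the system has precision 0.75 but recall only 0.6.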
- Slide 6
- Salton2-6 A Joint Measure. The F-score combines recall and precision, with a parameter β that encodes their relative importance: F_β = ((1 + β²) · P · R) / (P + β² · R). β = 1: equal weight; β > 1: precision is more important; β < 1: recall is more important.
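A sketch of a weighted F computation consistent with the slide's stated convention (β = 1 equal weight, β > 1 precision more important); note that many texts use the transposed form F = (1 + β²)PR / (β²P + R), in which β > 1 favors recall instead:

```python
def f_score(precision, recall, beta=1.0):
    """Weighted combination of precision (P) and recall (R):
    F = (1 + beta^2) * P * R / (P + beta^2 * R).
    beta = 1 weights both equally; beta > 1 weights precision more
    (in the limit beta -> infinity the value tends to P alone)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (precision + b2 * recall)
```

With β = 1 this is the ordinary harmonic mean of precision and recall.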
- Slide 10
- Salton2-10 A Frequency-Based Indexing Method
  1. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words.
  2. Compute the term frequency tf_ij for all remaining terms T_j in each document D_i, specifying the number of occurrences of T_j in D_i.
  3. Choose a threshold frequency T, and assign to each document D_i all terms T_j for which tf_ij > T.
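The three steps can be sketched as follows (the stop list here is a tiny illustrative sample, and whitespace tokenization is a simplification):

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "to", "in", "is"}  # illustrative sample

def frequency_index(documents, threshold):
    """Step 1: drop stop words; step 2: count tf_ij for the remaining
    terms; step 3: keep only terms whose frequency exceeds threshold T."""
    index = []
    for doc in documents:
        tf = Counter(w for w in doc.lower().split() if w not in STOP_LIST)
        index.append({term for term, freq in tf.items() if freq > threshold})
    return index
```

With threshold 1, a document must use a term at least twice for the term to be assigned.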
- Slide 11
- Salton2-11 Discussions. High-frequency terms favor recall. High precision requires the ability to distinguish individual documents from each other. A high-frequency term is good for precision only when its term frequency is not equally high in all documents.
- Slide 12
- Salton2-12 Inverse Document Frequency. The inverse document frequency (IDF) for term T_j: idf_j = log(N / df_j), where df_j (the document frequency of term T_j) is the number of documents in which T_j occurs and N is the number of documents in the collection. Terms that occur frequently in individual documents but rarely in the remainder of the collection fulfil both the recall and the precision requirements.
- Slide 13
- Salton2-13 New Term Importance Indicator. The weight w_ij of a term T_j in a document D_i: w_ij = tf_ij × idf_j. Procedure: eliminate common function words; compute the value of w_ij for each term T_j in each document D_i; assign to the documents of a collection all terms with sufficiently high (tf × idf) factors.
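A minimal sketch of the w_ij = tf_ij · log(N / df_j) computation over a small collection (whitespace tokenization and the natural logarithm are simplifying assumptions):

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Compute w_ij = tf_ij * log(N / df_j) for every term in every document."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # document frequency df_j
    return [{t: f * math.log(n / df[t]) for t, f in Counter(tokens).items()}
            for tokens in tokenized]
```

A term occurring in every document gets weight 0, matching the discussion above: such a term cannot distinguish documents from each other.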
- Slide 14
- Salton2-14 Term-Discrimination Value. Useful index terms distinguish the documents of a collection from each other. Document space: two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together. When a high-frequency term without discriminating power is assigned, it increases the document space density.
- Slide 15
- Salton2-15 A Virtual Document Space [figure showing three states: the original state, the space after assignment of a good discriminator, and the space after assignment of a poor discriminator]
- Slide 16
- Salton2-16 Good Term Assignment When a term is assigned to the
documents of a collection, the few items to which the term is
assigned will be distinguished from the rest of the collection.
This should increase the average distance between the items in the
collection and hence produce a document space less dense than
before.
- Slide 17
- Salton2-17 Poor Term Assignment. A high-frequency term is assigned that does not discriminate between the items of a collection. Its assignment will render the documents more similar to each other. This is reflected in an increase in document space density.
- Slide 18
- Salton2-18 Term Discrimination Value. Definition: dv_j = Q − Q_j, where Q and Q_j are the space densities before and after the assignment of term T_j. dv_j > 0: T_j is a good term; dv_j < 0: T_j is a poor term.
- Slide 19
- Salton2-19 Document Frequency vs. discrimination value: low frequency, dv_j ≈ 0; medium frequency, dv_j > 0; high frequency, dv_j < 0.
- Slide 41
- Salton2-41 Term-Phrase Formation. A term phrase is a sequence of related text words that carries a more specific meaning than the single terms, e.g., "computer science" vs. "computer".
- Slide 52
- Salton2-52 Thesaurus-Group Generation. Thesaurus transformation broadens index terms whose scope is too narrow to be useful in retrieval. A thesaurus must assemble groups of related specific terms under more general, higher-level class indicators.
- Slide 57
- Salton2-57 Word Stemming. effectiveness --> effective --> effect; picnicking --> picnic; king -/-> k (stemming must not apply).
- Slide 58
- Salton2-58 Some Morphological Rules. Restore a silent e after suffix removal from certain words, to produce "hope" from "hoping" rather than "hop". Delete certain doubled consonants after suffix removal, so as to generate "hop" from "hopping" rather than "hopp". Use a final y for an i in forms such as "easier", so as to generate "easy" instead of "easi".
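A toy suffix stripper illustrating these three rules (deliberately minimal, not a full stemming algorithm; the length guard that protects short words like "king" is an added assumption):

```python
VOWELS = "aeiou"

def simple_stem(word):
    """Apply the slide's three cleanup rules to -ier and -ing forms."""
    if word.endswith("ier"):                       # easier -> easy
        return word[:-3] + "y"
    if word.endswith("ing") and len(word) > 5:     # guard: king stays king
        stem = word[:-3]
        if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            return stem[:-1]                       # hopping -> hop, not hopp
        if stem[-1] not in VOWELS and stem[-2] in VOWELS:
            return stem + "e"                      # hoping -> hope, not hop
        return stem
    return word
```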
- Slide 59
- Salton2-59 The Indexing Prescription (2)
  1. Identify individual text words.
  2. Use a stop list to delete common function words.
  3. Use automatic suffix stripping to produce word stems.
  4. Compute the term-discrimination value for all word stems.
  5. Use thesaurus-class replacement for all low-frequency terms with discrimination values near zero.
  6. Use the phrase-formation process for all high-frequency terms with negative discrimination values.
  7. Compute weighting factors for complex indexing units.
  8. Assign to each document the single terms, term phrases, and thesaurus classes, with weights.
- Slide 60
- Salton2-60 Query vs. Document Differences. Query texts are short. Fewer terms are assigned to queries. The occurrence frequency of query terms rarely exceeds 1. Q = (w_q1, w_q2, ..., w_qt), where w_qj is the inverse document frequency; D_i = (d_i1, d_i2, ..., d_it), where d_ij is term frequency × inverse document frequency.
- Slide 61
- Salton2-61 Query vs. Document. When non-normalized documents are used, the longer documents, with more assigned terms, have a greater chance of matching particular query terms than do the shorter document vectors. The match can be computed as the inner product sim(Q, D_i) = Σ_j w_qj · d_ij, or, with length normalization, as the cosine of the angle between the query and document vectors.
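The length advantage of long documents can be removed by cosine normalization; a minimal sketch over sparse vectors (dicts of term to weight, names illustrative):

```python
import math

def inner_product(q, d):
    """Non-normalized match score: systematically favors longer documents."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine_sim(q, d):
    """Length-normalized similarity: a document twice as long, with the
    same term proportions, scores the same as the shorter one."""
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return inner_product(q, d) / (nq * nd) if nq and nd else 0.0
```

Doubling every weight in a document doubles its inner-product score but leaves its cosine score unchanged.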
- Slide 62
- Salton2-62 Relevance Feedback. Terms present in previously retrieved documents that have been identified as relevant to the user's query are added to the original formulation. The weights of the original query terms are altered by replacing the inverse document frequency portion of the weights with term-relevance weights obtained from the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.
- Slide 63
- Salton2-63 Relevance Feedback. Q = (w_q1, w_q2, ..., w_qt); D_i = (d_i1, d_i2, ..., d_it). The new query may take the form Q' = (w_q1, w_q2, ..., w_qt) + (w_q(t+1), w_q(t+2), ..., w_q(t+m)). The weights of the newly added terms T_(t+1) to T_(t+m) may consist of a combined term-frequency and term-relevance weight.
- Slide 71
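The expansion step can be sketched as a simple vector merge (how the term-relevance weights themselves are estimated from the judged documents is beyond this sketch; names are illustrative):

```python
def expand_query(query, feedback_weights):
    """Add terms drawn from documents judged relevant; where a term
    already appears in the query, its weight is increased."""
    expanded = dict(query)
    for term, w in feedback_weights.items():
        expanded[term] = expanded.get(term, 0.0) + w
    return expanded
```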
- Salton2-71 Experiment Design CLARIT commercial retrieval system
{original document set} ----> CLARIT NP Extractor ----> {Raw
Noun Phrases} ----> Statistical NP Parser, Phrase Extractor
----> {Indexing Term Set} ----> CLARIT Retrieval Engine
- Slide 72
- Salton2-72 Different Indexing Units. Example: [[[heavy construction] industry] group] (WSJ90). Single words: heavy, construction, industry, group. Head-modifier pairs: heavy construction, construction industry, industry group. Full noun phrase: heavy construction industry group.
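Given the bracketing above encoded as nested [modifier, head] pairs (this representation is an assumption for illustration), the three kinds of indexing units can be derived recursively:

```python
def head(np):
    """Head word of a noun phrase: the rightmost (head) branch."""
    return np if isinstance(np, str) else head(np[1])

def words(np):
    """All single words, left to right."""
    return [np] if isinstance(np, str) else words(np[0]) + words(np[1])

def head_modifier_pairs(np):
    """One (modifier-head, head) pair per bracketing level."""
    if isinstance(np, str):
        return []
    return (head_modifier_pairs(np[0]) + head_modifier_pairs(np[1])
            + [(head(np[0]), head(np[1]))])

np = [[["heavy", "construction"], "industry"], "group"]
single_words = words(np)
pairs = head_modifier_pairs(np)
full_np = " ".join(words(np))
```

For the WSJ90 example this yields exactly the units listed on the slide: four single words, three head-modifier pairs, and one full noun phrase.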
- Slide 73
- Salton2-73 Different Indexing Units (Continued). WD-SET: single words only (no phrases; baseline). WD-HM-SET: single words + head-modifier pairs. WD-NP-SET: single words + full NPs. WD-HM-NP-SET: single words + head-modifier pairs + full NPs.
- Slide 74
- Salton2-74 Result Analysis. Collection: TIPSTER Disk 2 (250 MB). Queries: TREC-5 ad hoc topics (251-300). Relevance feedback: top 10 documents returned from the initial retrieval. Evaluation measures: total number of relevant documents retrieved; highest level of precision over all points of recall; average precision.
- Slide 75
- Salton2-75 Effects of phrases with feedback and TREC-5
- Slide 76
- Salton2-76 Summary When only one kind of phrase is used to
supplement the single words, each can lead to a great improvement
in precision. When we combine the two kinds of phrases, the effect
is a greater improvement in recall rather than precision. How to
combine and weight different phrases effectively becomes an
important issue.
- Slide 77
- Salton2-77 A Corpus-Based Statistical Approach to Automatic
Book Indexing Jyun-Sheng Chang, Tsung-Yih Tseng, Ying Cheng,
Huey-Chyun Chen, Shun-Der Cheng, Sur-Jin Ker, and John S. Liu
(ANLP92, pp. 147-151)
- Slide 78
- Salton2-78 Generating Indices: word segmentation, part-of-speech tagging, finding noun phrases.
- Slide 79
- Salton2-79 Example of Problem Description: segmentation, tagging, and noun-phrase finding. [The Chinese example sentences did not survive transcription; only their part-of-speech sequences remain:] P/Q/CL/LOC/CTM/NC/NC; P/D/Q/CL/NC/LOC/LOC/LOC/V/CTM/NC; NP/ADV/V/NC/CTM/NC; P/NC/CTM/NC/CTM/NC/NC
- Slide 80
- Salton2-80 Word Segmentation. Given a Chinese sentence, segment the sentence into words. [Chinese example sentence not preserved in transcription]
- Slide 81
- Salton2-81 Segmentation as a Constraint Satisfaction Problem. Given a sequence of Chinese characters C_1, C_2, ..., C_n, assign break/continue to each place X_i between two adjacent characters C_i and C_(i+1) (break: >, continue: =). Example labeling: > = = > > = > = > > > =
- Slide 82
- Salton2-82 Detail Specification. For each sequence of characters C_i, ..., C_j which is a Chinese word in the dictionary or a surname-name: if j = i, then put (>, >) in K_(i-1),i. If j > i, then put (>, =) in K_(i-1),i, (=, =) in K_i,(i+1), ..., K_(j-2),(j-1), and (=, >) in K_(j-1),j.
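The constraint-generation rule can be sketched as follows. The reading of the K notation here is a reconstruction (the transcript is garbled at this point): position k stands for the gap X_k, '>' means break and '=' means continue, and each dictionary word contributes one candidate label pair per position it touches.

```python
def word_constraints(i, j):
    """Candidate (left, right) label pairs contributed by a dictionary
    word covering characters C_i..C_j (1-based), keyed by gap position."""
    if i == j:                       # one-character word: breaks on both sides
        return {i - 1: (">", ">")}
    labels = {i - 1: (">", "=")}     # break before the word, continue after C_i
    for k in range(i, j - 1):        # interior gaps: continue on both sides
        labels[k] = ("=", "=")
    labels[j - 1] = ("=", ">")       # continue into C_j, break after the word
    return labels
```

Collecting these candidate pairs for every dictionary word found in the sentence produces tables like the one on the next slide, from which a consistent break/continue assignment is then chosen.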
- Slide 83
- Salton2-83 Example. Character positions 0-13; assume [Chinese words not preserved in transcription] are words by dictionary lookup. Candidate pairs per position: 0: (>,>); 1: (>,=); 2: (=,>), (=,=); 3: (=,>); 4: (>,>); 5: (>,>); 6: (>,>), (>,=); 7: (>,>), (=,>), (>,=); 8: (>,>), (=,>), (>,=); 9: (>,>), (=,>), (>,=); 10: (>,>), (=,>), (>,=); 11: (>,>); 12: (>,>), (>,=); 13: (=,>)
- Slide 84
- Salton2-84 Differences from English IR. Data analysis issues. Media: syllable structure in speech data. Code & character: GB and BIG5 conversion; rich semantics in individual characters. Word: simple in word stemming and spelling, hard in word segmentation and proper-noun identification. (Dr. L.F. Chien, 1996)
- Slide 85
- Salton2-85 Differences from English IR (Continued). Interface issues. Input: eager for speech and OCR input. Query: need searching for approximate terms; rich information in natural-language queries (NLQ); hard to find proper nouns in NLQ.
- Slide 86
- Salton2-86 Differences from English IR (Continued). Indexing and searching issues. Index: hard to use a word-level, complete index such as an inverted file; workable to use a character-level, filtering index such as signatures (Chien, 1995). Searching: need multiple-stage searching; need best match at the term level.
- Slide 87
- Salton2-87 Segmentation Problem Segmentation is a serious
problem in processing Chinese sentences (Hsin-Hsi Chen, 1996)
- Slide 88
- Salton2-88 Strategies. Dictionary lookup supplemented by other special strategies: the longest word first; the number of words ...