Date post: | 02-Jan-2016 |
Category: |
Documents |
Upload: | alexandrina-kennedy |
View: | 217 times |
Download: | 0 times |
3
Elimination of StopwordsWords with high frequency are not good
discriminators. (in fact a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval ). They are frequently referred to as stopwords filtered out.
Examples:Articles: a, an, the,…Prepositions: on , in ,over,…Conjunctions: and ,or.
4
Elimination of Stopwords (derived from Brown corpus): 425 words:
a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...
5
Elimination of StopwordsElimination of stopwords reduces the size
of the indexing structure (a bout 40% compression in the size of the indexing structure ).
Unfortunately, sometimes elimination of stopwords could eliminate words that have a profound impact on the retrieved documents. Ex:’to be or not to be’ be (is only left).
Solution full text index.
6
StemmingA stem is the portion of the word which is
left after the removal of the affixes (prefixes and suffixes).
They are thought to be useful for improving retrieval performance (reduce the variants of the same root to a common concept)
connect connected, connecting, connection, connections.
7
StemmingFrakes distinguish 4 types of stemming
strategies:Affix removalTable lookup: simple, but needs dataSuccessor variety.N-gram.
8
Stemming• Affix removal: intuitive, simple and can be
implemented efficiently • Table lookup: looking for the stem of a word in a
table( simple ,but needs data for the whole language and considerable storage space).
• Successor variety: based on the determination of the morpheme boundaries , uses knowledge from structural Linguistic (complex, expensive maintenance).
• N-gram: based on the identification of digrams and trigrams, and it is more clustering procedure than a stemming one .(no data, but imprecise).
10
Successor VarietyDefinition (successor variety of a
string)the number of different characters that follow it in words in some body of text.
11
Successor Variety (Continued)Idea
The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase.
ExampleTest word: READABLECorpus: ABLE, BEATABLE, FIXABLE, READ,READS
READABLE, READING, RED, ROPE, RIPEPrefix Successor Variety LettersR 3 E, O, IRE 2 A, DREA 1 DREAD 3 A, I, SREADA 1 BREADAB 1 LREADABL 1 EREADABLE 1 blank
12
Affix Removal Stemmersprocedure
Remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem. E.g., Porter’s algorithm (Eng Lang.) Porter algorithm. Martin Porter. Ready code in the
web. Substitution rules: sses s, s stresses stress.
13
Affix Removal StemmersExample: plural forms
If a word ends in “ies” but not “eies” or “aies”then “ies” --> “y” (surgeries--> surgery).
If a word ends in “es” but not “aes”, “ees”, or “oes” then
“es” --> “e” (houses house).If a word ends in “s”, but not “us” or “ss” then “s” --
> NULL. (doors--> door).
14
Index Terms SelectionA sentence in natural language text is
usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
Most of the semantics meaning is carried by the noun words .so it is a promising strategy to use the nouns in the text (by eliminating the others like verbs ,etc,..).
15
Index Terms SelectionSince it is common to combine two or three
nouns in a single component (ex. Computer science) it makes sense to cluster nouns which appear nearby in the text into a single indexing component (concept).
Thus instead of simply using nouns as index terms, we adopt noun groups)a set of nouns whose syntactic distance in the text and does not exceed a predefined threshold (for instance,3)(