1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

1

University of Palestine

Topics In CISITBS 3202

Ms. Eman Alajrami

2nd Semester 2008-2009

2

Chapter 7Text Operations

Part2

3

Elimination of StopwordsWords with high frequency are not good

discriminators. (in fact a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval ). They are frequently referred to as stopwords filtered out.

Examples:Articles: a, an, the,…Prepositions: on , in ,over,…Conjunctions: and ,or.

4

Elimination of Stopwords (derived from Brown corpus): 425 words:

a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...

5

Elimination of StopwordsElimination of stopwords reduces the size

of the indexing structure (a bout 40% compression in the size of the indexing structure ).

Unfortunately, sometimes elimination of stopwords could eliminate words that have a profound impact on the retrieved documents. Ex:’to be or not to be’ be (is only left).

Solution full text index.

6

StemmingA stem is the portion of the word which is

left after the removal of the affixes (prefixes and suffixes).

They are thought to be useful for improving retrieval performance (reduce the variants of the same root to a common concept)

connect connected, connecting, connection, connections.

7

StemmingFrakes distinguish 4 types of stemming

strategies:Affix removalTable lookup: simple, but needs dataSuccessor variety.N-gram.

8

Stemming• Affix removal: intuitive, simple and can be

implemented efficiently • Table lookup: looking for the stem of a word in a

table( simple ,but needs data for the whole language and considerable storage space).

• Successor variety: based on the determination of the morpheme boundaries , uses knowledge from structural Linguistic (complex, expensive maintenance).

• N-gram: based on the identification of digrams and trigrams, and it is more clustering procedure than a stemming one .(no data, but imprecise).

9

Stemming

Term Stemengineering engineerengineered engineerengineer engineer

Table lookup

10

Successor VarietyDefinition (successor variety of a

string)the number of different characters that follow it in words in some body of text.

11

Successor Variety (Continued)Idea

The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase.

ExampleTest word: READABLECorpus: ABLE, BEATABLE, FIXABLE, READ,READS

READABLE, READING, RED, ROPE, RIPEPrefix Successor Variety LettersR 3 E, O, IRE 2 A, DREA 1 DREAD 3 A, I, SREADA 1 BREADAB 1 LREADABL 1 EREADABLE 1 blank

12

Affix Removal Stemmersprocedure

Remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem. E.g., Porter’s algorithm (Eng Lang.) Porter algorithm. Martin Porter. Ready code in the

web. Substitution rules: sses s, s stresses stress.

13

Affix Removal StemmersExample: plural forms

If a word ends in “ies” but not “eies” or “aies”then “ies” --> “y” (surgeries--> surgery).

If a word ends in “es” but not “aes”, “ees”, or “oes” then

“es” --> “e” (houses house).If a word ends in “s”, but not “us” or “ss” then “s” --

> NULL. (doors--> door).

14

Index Terms SelectionA sentence in natural language text is

usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.

Most of the semantics meaning is carried by the noun words .so it is a promising strategy to use the nouns in the text (by eliminating the others like verbs ,etc,..).

15

Index Terms SelectionSince it is common to combine two or three

nouns in a single component (ex. Computer science) it makes sense to cluster nouns which appear nearby in the text into a single indexing component (concept).

Thus instead of simply using nouns as index terms, we adopt noun groups)a set of nouns whose syntactic distance in the text and does not exceed a predefined threshold (for instance,3)(

Date post:	02-Jan-2016
Category:	Documents
Upload:	alexandrina-kennedy
View:	217 times
Download:	0 times

1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

Documents