OCUG 7th Annual Meeting
25 September 2002

Uses of Stemmer Algorithms, Substitutions, and interMedia in TMS Search Object Design

Donna Caruso, Amgen, Inc.
Sunil G. Singh, DBMS Consulting

Acknowledgements
• Thanks to the OCUG for the opportunity to present.
• Thanks to Andy Alasso, Kim Renjdrup & Dominique Farinaux-Dumas for their technical insights into the content of this presentation.
• Special thanks to Amgen, Inc. for their sponsorship, as this functionality was utilized to develop the Amgen Auto-Encoder for TMS.

Goals
• Gain an understanding of the tools used in search object design.
• Review research on stemming algorithms’ performance in information retrieval.
• Amgen’s Case Study for application of the tools.
• Concluding observations based on research & practical application within the TMS environment.

Definitions
• TMS Search Objects:
  • Procedures containing algorithms for searching TMS dictionaries
  • Integrated with TMS through search object definition
  • Executed from TMS API calls
• Information retrieval in the context of TMS search objects:
  • The ability to retrieve & match verbatim terms (VTs) to dictionary terms by using search algorithms.

Definitions (2)
• Retrieval tools used in search algorithms:
  • Stemmer Algorithms:
    • Porter Stemmer
    • Oracle interMedia (Xerox Corporation’s iMT stemmer)
  • Substitutions:
    • Full words
    • Partial words
• Candidate Terms
  • List of dictionary terms retrieved in the search algorithm that are suggested dictionary matches used in manual classification.

Definitions (3)
• Morphological variants (word variations)
  • Unrecognizable in exact term-matching algorithms (cramp, cramps, cramping).
  • Similar semantic interpretations and can be treated as equivalents in information retrieval (cramps, cramping -> cramp).

Why Use Stemmers?
• Stemmers have been created for information retrieval to reduce terms to their root form for improved recognition by term-matching procedures.

Unstemmed Word    Stem
Blurry            Blur
Blurred           Blur
Blurring          Blur
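
As a rough illustration of reducing variants to a shared root, the sketch below strips one of a handful of suffixes and then undoubles a trailing consonant. It is a toy written for this transcript, not the Porter or Xerox stemmer; the suffix list and the name `naive_stem` are assumptions chosen only to cover the example words on these slides.

```python
# Toy suffix stripper (NOT the Porter algorithm); the rules are illustrative assumptions.
SUFFIXES = ("ing", "ed", "s", "y")

def naive_stem(word: str) -> str:
    """Lower-case a word and strip at most one known suffix."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "bed" are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[: -len(suffix)]
            # Undouble a trailing consonant exposed by the strip: "blurr" -> "blur".
            # "l" and "s" are excluded so that "falls" keeps the stem "fall".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return w

for term in ("Blurry", "Blurred", "Blurring", "cramps", "cramping"):
    print(term, "->", naive_stem(term))
```

A rule set this small over- and under-stems many real words, which is why published algorithms carry far more rules and conditions.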

Stemmer Scope
1. Traditional approach based on suffix removal:
  • Focus on the Porter Stemmer
2. Linguistic methods based on the Xerox Stemmer
  • Focus on Oracle interMedia using default English lexer (lexicon)
    • Search & retrieval capability for text
    • Concept searching
    • Theme analysis

Porter Stemmer
• The Porter stemming algorithm is a process for removing morphological variants & inflexional endings (suffixes) from words in English.
• It is mainly used as part of a term normalization process during information retrieval.
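
To make the suffix-removal idea concrete, here is a minimal sketch of only the first rule group (step 1a, plural handling) from Porter's 1980 paper: SSES -> SS, IES -> I, SS -> SS, S -> (nothing). The remaining steps of the algorithm are omitted, so this is not a complete Porter stemmer, and the function name `porter_step_1a` is chosen here for illustration.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of Porter (1980): the plural-suffix rules."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]   # SSES -> SS   (caresses -> caress)
    if w.endswith("ies"):
        return w[:-2]   # IES  -> I    (ponies -> poni)
    if w.endswith("ss"):
        return w        # SS   -> SS   (caress -> caress)
    if w.endswith("s"):
        return w[:-1]   # S    -> ""   (cats -> cat)
    return w

for term in ("caresses", "ponies", "caress", "cats"):
    print(term, "->", porter_step_1a(term))
```

The output "poni" shows how a suffix-removal stem need not be an actual word.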

Xerox Stemmer
• Xerox’s English lexical database can linguistically identify 77,000 base forms of 500,000 variant words with the following morphological tools:
  • Inflectional stemmer
  • Derivational stemmer

Xerox Stemmer (2)
• Inflectional Stemmer:
  • Identifies changes in word form due to case, gender, number, tense, person, mood, voice.
    • Nouns: children -> child
    • Verbs: understood -> understand
    • Adjectives: best -> good
    • Pronouns: whom -> who
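
A lexicon-driven stemmer can be pictured as a lookup from variant to base form. The sketch below is only a toy: the `LEXICON` table holds just the four examples from this slide, and nothing here reflects how the Xerox stemmer is actually implemented, which this presentation does not describe.

```python
# Toy lexicon holding only the four example pairs from the slide (an assumption
# for illustration; the real lexical database is far larger).
LEXICON = {
    "children": "child",
    "understood": "understand",
    "best": "good",
    "whom": "who",
}

def lexicon_stem(word: str) -> str:
    """Return the base form if the word is in the lexicon, else the word unchanged."""
    w = word.lower()
    return LEXICON.get(w, w)

print(lexicon_stem("Children"))  # found in the lexicon
print(lexicon_stem("cramps"))    # not in this toy lexicon, so returned as-is
```

Irregular forms such as best -> good are out of reach for suffix stripping, while any word missing from the lexicon passes through unstemmed.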

Xerox Stemmer (3)
• Derivational Stemmer:
  • Reduces variant words to their derived form using suffix and prefix removal
  • Must preserve original meaning

Stemmer Analysis
• Impacts of Stemming:
  • Only a small improvement to retrieval performance
  • Although it does not hurt retrieval performance
  • Traditional approach & linguistic methods perform equally well.

Stemmer Analysis (2)
• Down side to suffix removal stemmer:
  • Lumps “general, generous, generation, generic” into “gener” root.
  • Does not find a common root for “recognize, recognition”.
  • Creates roots that are not actual words, making dictionary information retrieval difficult: “genetic, genetically, geneticist, genetics” into “genet” root.

Research¹ Observations
• Some form of Stemming is beneficial; the average absolute improvement due to stemming ranges from 1-3%.
• Plural removal is very effective with small queries.
• No difference in average performance of Stemmers.
• Rule-based suffix removal is beneficial in some cases, but not ideal in all cases.

¹ Researchers from Rank Xerox Research Centre, France used the SMART text retrieval system developed at Cornell University to examine the performance of 5 different stemming algorithms.

Research Observations (2)
• Linguistic methods are limited based on the content of the lexicon; unable to correctly stem words which are not contained in the lexicon.
• Linguistic root words are not always optimal for information retrieval.
• “English” based lexicon is most effective for “English” words and their definitions.

Amgen Case Study
• VTO Creation
• Coding Workflow
• Review Workflow

Business Opportunities
• Improve the process of manually classifying verbatim terms to dictionary terms.
• Improve accuracy & consistency in the dictionary coding process.

Directives
• Utilize existing TMS functionality to define & execute custom algorithms (no additional GUIs/Forms).
• Utilize complex search procedures to create a list of candidate terms to assist, not change, the existing dictionary coding and peer review workflow.

Directives (2)
• Optimize the search procedure performance by executing during TMS batch validation, not during the dictionary coding process; leverage machine time vs. person time.
• Utilize the existing TMS Classify VT Omissions form to display the list of candidate terms in “best match” sort order.
• Utilize the English lexicon, even though interMedia can support many languages.

Define Search Objects

TMS Search Objects
• autoencode
  • Runs automatically during the TMS procedure in batch validation.
• candidate
  • Displays a list of suggested dictionary matches in Classify VT Omissions. Provides the ability to filter the search criteria to display a subset of the candidate terms.
• extsearch
  • Runs on-the-fly during the auto-encoder search in Extended Search.

autoencode & candidate
• Autoencoded Terms
• Candidate List

Apply Candidate Filter
• Search for a subset of candidate terms in the candidate list that contain the word “LEG”.

Candidate Filter Results
• The Candidate filter retrieves a subset of candidate terms containing “LEG”.
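
The effect of the candidate filter can be sketched as a simple list filter. The candidate terms below are invented sample data, and whole-word, case-insensitive matching is an assumption: the slides do not say whether the TMS filter matches whole words or substrings.

```python
def filter_candidates(candidates, word):
    """Keep candidate terms that contain `word` as a whole word (case-insensitive)."""
    target = word.upper()
    return [term for term in candidates if target in term.upper().split()]

# Hypothetical candidate list, for illustration only.
candidates = ["LEG PAIN", "PAIN IN LEG", "ARM PAIN", "LEGIONELLA INFECTION"]
print(filter_candidates(candidates, "leg"))
```

With whole-word matching, "LEGIONELLA INFECTION" is excluded even though it starts with the letters "LEG"; a substring match would keep it.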

extsearch
• Autoencode any type of term on-the-fly
• Autoencoder searches all levels of the dictionary

Autoencoding Algorithm
• Breaks up a Multi-word Term into individual words.
• Executes procedures against individual words in the order defined in the reference codelist.
  • Full Word Substitutions
    • Remove stop words (“an, nd, st, of” to blank)
    • Create substitution synonym list (TYLENOL to ACETAMINOPHEN)
    • Remove frequent terms
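
The word-splitting and full word substitution steps might look like the sketch below. The stop words and the single synonym pair come from the slide; the function name, the use of upper case, and the order of the two steps are assumptions, and the "remove frequent terms" step is left out because the slides give no list for it.

```python
# Values taken from the slide; everything else in this sketch is an assumption.
STOP_WORDS = {"AN", "ND", "ST", "OF"}
SYNONYMS = {"TYLENOL": "ACETAMINOPHEN"}

def full_word_substitutions(term: str) -> list:
    """Split a term into words, drop stop words, then apply whole-word synonyms."""
    words = term.upper().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return [SYNONYMS.get(w, w) for w in kept]

print(full_word_substitutions("overdose of Tylenol"))
```

In TMS the order of these procedures is driven by a reference codelist, so a real implementation would read that order from configuration and not hard-code it.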

Autoencoding Algorithm (2)
• Partial Word Substitutions
  • Remove punctuation & symbols (“; *” to blank)
  • Remove numeric values (“0 – 9” to blank)
• Porter Stemmer (TOOTH ABSCESSES to Tooth abscess) or (FALLS to Fall)
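
A minimal sketch of the partial word substitutions, assuming that anything other than an ASCII letter or whitespace counts as punctuation, a symbol, or a numeric value. The function name and the single-pass regular expression are choices made for this example; the actual TMS procedures are not shown in the slides.

```python
import re

def partial_word_substitutions(term: str) -> str:
    """Blank out punctuation, symbols and digits, then collapse repeated spaces."""
    # Every character that is not an ASCII letter or whitespace becomes a blank.
    blanked = re.sub(r"[^A-Za-z\s]", " ", term)
    return " ".join(blanked.split())

print(partial_word_substitutions("TOOTH; ABSCESSES *2"))
```

Replacing with a blank, as the slide describes, means a hyphenated word is split into two words; deleting the character would join them.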

Autoencoding Algorithm (3)
• Reorders individual words with all possible permutations of a Multi-word Term (with limits).
• Searches the dictionary at the classification and verbatim term levels for matches and assigns a ranking value used to order the candidate list.
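
The reordering step can be sketched with the standard library. The two limits shown (`max_words` and `max_orderings`) are hypothetical: the slides say only that limits exist, not what they are. They matter because a term of n words has n! orderings.

```python
from itertools import islice, permutations

def word_orderings(words, max_words=4, max_orderings=24):
    """Return reorderings of a multi-word term, capped by two hypothetical limits."""
    if len(words) > max_words:
        # Too many words: fall back to the original order only.
        return [" ".join(words)]
    return [" ".join(p) for p in islice(permutations(words), max_orderings)]

print(word_orderings(["PAIN", "LEG", "LEFT"]))
```

Each ordering would then be looked up in the dictionary; how the ranking value is assigned to a match is not specified in the slides, so it is not modeled here.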

Autoencoding Algorithm (4)
• Executes interMedia Logic and assigns a ranking value used to order the candidate list.
  • The interMedia Lexicon is English.
  • interMedia Indexing is used to perform the ‘CONTAINS’/‘ABOUT’ searches.
  • A default set of stop words is used in interMedia searches.
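
For orientation, the sketch below builds the kind of query text an interMedia search might use: the `$` stem operator, the `about()` operator, and a `CONTAINS ... SCORE` statement. The table name `dictionary_terms`, the column `dict_term`, and the choice to join words with AND are assumptions; the queries actually used by the Amgen Auto-Encoder are not shown in the slides. Words are restricted to letters here, since reserved words and special characters would need escaping in a real query.

```python
def stem_query(words):
    """Query expression that stem-expands each word, e.g. '$leg AND $cramps'."""
    safe = [w for w in words if w.isalpha()]  # skip tokens that would need escaping
    return " AND ".join("$" + w for w in safe)

def about_query(phrase):
    """Query expression for a theme (ABOUT) search on a phrase."""
    return "about(" + phrase + ")"

# Hypothetical table and column names; :query is a bind variable for the expression.
SQL = (
    "SELECT SCORE(1) AS rank_value, dict_term "
    "FROM dictionary_terms "
    "WHERE CONTAINS(dict_term, :query, 1) > 0 "
    "ORDER BY SCORE(1) DESC"
)

print(stem_query(["leg", "cramps"]))
print(about_query("leg cramps"))
print(SQL)
```

Ordering by the score is one plausible way to obtain a "best match" candidate list; the ranking values used in the case study may be computed differently.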

Retrieval Tool Metrics - AEs

Note: 3-week sampling of VTs autoencoded. Stemmer & Substitution % are based on selected candidates that are approved VTAs.

Retrieval Tool Metrics - Meds

Note: 3-week sampling of VTs autoencoded. Stemmer & Substitution % are based on selected candidates that are approved VTAs.

Amgen’s Observations
• The most effective term-matching is a combination of substitutions & interMedia.
  • 68% for AEs
  • 44% for Meds
• “English” based lexicon is most effective for AEs but not as strong for Meds, supporting existing research.
  • 71% for AEs
  • 45% for Meds

Amgen’s Observations (2)
• Porter Stemmer retrieval performs within the expected range of 1-3%, supporting existing research.
  • 2% for AEs
  • 3% for Meds
• A combination of Porter Stemmer & interMedia retrieval does not significantly increase term-matching.
  • 3% for AEs
  • 1% for Meds

Amgen’s Observations (3)
• The benefit of having the source code for the Porter Stemmer is being able to control the algorithm for more predictable results.
• Since source code is not available for the Xerox Stemmer, a strict algorithm definition is not available for interMedia.

Effectiveness Metrics

Conclusion
• Efficiency improvements of 39% gained when selecting candidates within the first 20 terms in the candidate list.
• Effective results of 70% are gained through auto matching (equal match) & manually selecting within the first 20 terms in the candidate list.

References
• M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980. http://www.tartarus.org/~martin/PorterStemmer/index.html
• David A. Hull, Gregory Grefenstette. A Detailed Analysis of English Stemming Algorithms. January 31, 1996.
• Metalink. Oracle 8i interMedia Text 8.1.7 Technical Overview. May 19, 2002.
• Oracle 8i interMedia Text Reference, Release 2 (8.1.6), December 1999.

Q&A

