ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task
Avinash Yadav, Robins Yadav, Sukomal Pal
Department of Computer Science & Engineering
Indian School of Mines, Dhanbad, India

Contents
- Introduction
- Adhoc retrieval task participation
- Morpheme Extraction Task participation
- Conclusion

Introduction
- Stemmer
- ISMstemmer
- Evaluation

Stemmer
- Attempts to reduce word variants to their stem or root form
- Example: education, educating and educative will all reduce to educat

Approaches for Stemming
- Language-based approach
- Statistical approach

ISMstemmer
- statistical stemmer
- based on suffix extraction
- suffix frequency algorithm

Data Preprocessing
Convert the corpus into a single file:
File 1, File 2, …, File n -> Single File

Cleaning of data
Before: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
After: John asked a girl with an apple of Kashmir do you have the time she said yes
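
The cleaning step can be sketched as follows, assuming that "cleaning" means dropping punctuation and symbols and collapsing whitespace. The function name `clean_text` is hypothetical, and case handling is left out because the slide does not state the rule (it lowercases "She" but keeps "John" and "Kashmir").

```python
import unicodedata


def clean_text(text: str) -> str:
    """Replace punctuation and symbol characters with spaces, then collapse whitespace.

    Unicode categories starting with 'P' (punctuation) or 'S' (symbol) are dropped,
    so combining vowel signs in Devanagari text are left untouched.
    """
    kept = [
        " " if unicodedata.category(ch)[0] in "PS" else ch
        for ch in text
    ]
    # split() with no argument also removes newlines and repeated spaces.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    raw = "John asked a girl with an apple of Kashmir, “ do you\nhave the time”. She said,\n“yes”."
    print(clean_text(raw))
```

Filtering by Unicode category rather than by an ASCII punctuation list is a design choice made here so that the same sketch can be applied to both the English and the Hindi corpus.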

Removing Stop Words
Before: John asked a girl with an apple of Kashmir do you have the time she said yes
After: John asked girl with apple Kashmir you time she said yes

Convert file into Single Column
Before: John asked girl with apple Kashmir you time she said yes
After (one word per line):
John
asked
girl
with
apple
Kashmir
you
time
she
said
yes
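
A minimal sketch of these two steps. The stop-word list below is an assumption: it contains only the six words that disappear in the slide's example, not the list actually used for the runs, and both function names are hypothetical.

```python
# Illustrative stop-word list inferred from the slide's example only (assumption).
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that are not stop words (case-insensitive match)."""
    return [word for word in text.split() if word.lower() not in stop_words]


def to_single_column(words):
    """Join words with newlines so that each line holds exactly one word."""
    return "\n".join(words)


if __name__ == "__main__":
    cleaned = ("John asked a girl with an apple of Kashmir "
               "do you have the time she said yes")
    print(to_single_column(remove_stop_words(cleaned)))
```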

Data preprocessing (contd.)
Unique words extracted:
- Hindi: 4,90,391
- English: 7,95,144

Find valid suffixes
Reverse the words of the single-column file:
Original: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
Reversed: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna

Sort the reversed list:
Before: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna
After: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
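
Reversing and sorting can be sketched in a few lines. Sorting the reversed words places words that share an ending next to each other, which is what makes the later suffix counting straightforward. The function name `reverse_and_sort` is hypothetical, and de-duplicating with a set is an assumption based on the "unique words extracted" step.

```python
def reverse_and_sort(words):
    """Reverse every unique word and return the reversed forms in sorted order."""
    return sorted({word[::-1] for word in words})


if __name__ == "__main__":
    sample = ("aborning absolution absorption abuilding acquisition activation "
              "added addition admiration admitted admitting agreed agreeing "
              "allotted allotting ambling angling").split()
    print(" ".join(reverse_and_sort(sample)))
```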

Find suffixes according to threshold:
Sorted reversed list: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
Candidate reversed suffixes in the figure: de, gni, noit (frequency labels shown: 17%, 40%)
Threshold used:
- English: 0.01 - 0.1%
- Hindi: 0.1 - 1.0%
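
The slides do not spell out how suffix frequency is computed, so the following is only one plausible reading: count how many unique words share each ending and keep the endings whose share of the vocabulary reaches the threshold. The function name, the `min_len` and `max_len` bounds, and the interpretation of the threshold as a fraction of unique words are all assumptions.

```python
from collections import Counter


def candidate_suffixes(words, threshold, min_len=2, max_len=4):
    """Return {suffix: share of unique words} for suffixes with share >= threshold.

    `threshold` is a fraction, so 0.1% on the slide would be passed as 0.001.
    Counting is done on prefixes of the reversed words, mirroring the slides.
    """
    reversed_words = sorted(word[::-1] for word in set(words))
    counts = Counter()
    for rev in reversed_words:
        # Stop one character short of the full word so a stem always remains.
        for n in range(min_len, min(max_len, len(rev) - 1) + 1):
            counts[rev[:n]] += 1
    total = len(reversed_words)
    return {
        rev_suffix[::-1]: count / total
        for rev_suffix, count in counts.items()
        if count / total >= threshold
    }


if __name__ == "__main__":
    sample = ("aborning absolution absorption abuilding acquisition activation "
              "added addition admiration admitted admitting agreed agreeing "
              "allotted allotting ambling angling").split()
    # A high threshold is used here only because the sample has 17 words.
    for suffix, share in sorted(candidate_suffixes(sample, 0.2).items()):
        print(f"{suffix}: {share:.0%}")
```

On a vocabulary of several hundred thousand unique words, thresholds as small as those listed on the slide still correspond to dozens or hundreds of supporting words per suffix.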

Stemming of corpus
Stem the reversed words with the reversed valid suffixes:
Before: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
After: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba

Reverse the stemmed words to restore the original letter order:
Before: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
After: add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu

Note: If fewer than 3 letters would remain after stemming, the word is not stemmed.
Example: aging and king would reduce to ag and k, so they are left unchanged.
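
Suffix stripping with the minimum-length rule can be sketched as below. The sketch works on the word in its normal orientation, which gives the same result as stripping the reversed suffix from the reversed word and reversing back. The function name `stem`, the longest-suffix-first order, and the three-suffix set are assumptions for illustration.

```python
def stem(word, suffixes, min_stem_len=3):
    """Strip the longest matching suffix unless the remaining stem is too short."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            # Per the note on the slide: a too-short result means no stemming at all.
            return candidate if len(candidate) >= min_stem_len else word
    return word


if __name__ == "__main__":
    valid_suffixes = {"ed", "ing", "tion"}  # illustrative set only
    for w in ["added", "agreeing", "absolution", "aging", "king"]:
        print(w, "->", stem(w, valid_suffixes))
```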

Evaluation of ISMstemmer
To evaluate ISMstemmer, we participated in:
1. Monolingual Adhoc retrieval task in English and Hindi
2. Morpheme Extraction Task (MET) of FIRE-2012

Adhoc Retrieval Task (ART) Participation
Monolingual task
Languages chosen:
- English: Approach, Results
- Hindi: Approach, Results

ART: English
Approach:
- Indexing: search engine used: Indri (IndriBuildIndex)
- Retrieval: search engine used: Lemur (RetEval)
Data provided:
- Corpus from The Telegraph and BD News
- Set of 50 queries

ART: English (contd.)
Results:
Run id                  No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
EE.ism.unstemmed        50              50000           3539                  2503                   0.2264
EE.ism.krovetzstemmer   50              50000           3539                  2504                   0.2255
EE.ism.ismstemmer       50              50000           3539                  2415                   0.2096
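
For readers unfamiliar with the MAP column, here is a minimal sketch of mean average precision in its standard definition: the mean, over queries, of the precision values at each relevant retrieved document divided by the total number of relevant documents. The function names and the toy document ids are hypothetical; the reported figures came from the task's official evaluation, not from this code.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against the set of relevant doc ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all queries listed in `qrels`."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    toy_runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9"]}
    toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
    print(round(mean_average_precision(toy_runs, toy_qrels), 4))
```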

ART: Hindi
Approach:
- Indexing: search engine used: Indri (IndriBuildIndex)
- Retrieval: search engine used: Indri (IndriRunQuery)
Data provided:
- Corpus from Navbharat Times and Amar Ujala
- Set of 50 queries

ART: Hindi (contd.)
Results:
Run id                            No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
HH.ism.unstemmed.indri            50              50000           2309                  222                    0.0173
HH.stemmmedcorpus.unstemmedquery  50              50000           2309                  98                     0.0026
HH.stemmmedcorpus.stemmedquery    50              50000           2309                  209                    0.0137

Morpheme Extraction Task Participation
- Tool submitted
- Results

MET Tool Submission
- ISMstemmer submitted
- evaluated at IR Labs, DAIICT, Gujarat
- tested on 6 languages of South Asian origin
- gave good results (MAP above the baseline) for 3 languages

MET Results:
1. BENGALI
Institute     Language  MAP Obtained
Baseline      Bengali   0.2740
JU            Bengali   0.3307
DCU           Bengali   0.3300
IIT-KGP       Bengali   0.3225
CVPR-Team1    Bengali   0.3159
ISM           Bengali   0.3103
CVPR-Team2+   Bengali   NA

MET Results (contd.)
2. GUJARATI
Institute     Language  MAP Obtained
Baseline      Gujarati  0.2677
ISM           Gujarati  0.2824

3. MARATHI
Institute     Language  MAP Obtained
Baseline      Marathi   0.2320
ISM           Marathi   0.2797
IIT-B         Marathi   0.2684

MET Results (contd.)
4. ODIA
Institute     Language  MAP Obtained
Baseline      Odia      0.1537
IIIT-Bh       Odia      0.1537
ISM           Odia      0.1537

5. HINDI
Institute     Language  MAP Obtained
Baseline      Hindi     0.2821
DCU           Hindi     0.2963
ISM           Hindi     0.2793

MET Results (contd.)
6. TAMIL
Institute     Language  MAP Obtained
Baseline      Tamil     NA
AUCEG         Tamil     NA
ISM           Tamil     NA
NA: results are not available due to non-availability of qrels

Reasons for Underperformance with Hindi
- overstemming
- undesired stemming of proper nouns

Overstemming
This refers to words that are grouped together by stemming although they should not be.
Example:
1. accent, accentual, accentuate: stem word accent
2. accept, acceptant, acceptor: stem word accept
3. access, accessible, accession: stem word access
Due to overstemming, all of these words may be grouped under the wrong stem acce.
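
One way to make overstemming measurable is to count word pairs that share a stem even though they belong to different root groups. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the two truncation functions are stand-ins for an aggressive and a conservative stemmer, not the ISMstemmer itself.

```python
from collections import defaultdict
from itertools import combinations


def wrongly_merged_pairs(gold_groups, stem_fn):
    """Return word pairs that get the same stem but have different gold roots."""
    gold_root = {w: root for root, words in gold_groups.items() for w in words}
    by_stem = defaultdict(list)
    for word in gold_root:
        by_stem[stem_fn(word)].append(word)
    return [
        (a, b)
        for group in by_stem.values()
        for a, b in combinations(group, 2)
        if gold_root[a] != gold_root[b]
    ]


if __name__ == "__main__":
    gold = {
        "accent": ["accent", "accentual", "accentuate"],
        "accept": ["accept", "acceptant", "acceptor"],
        "access": ["access", "accessible", "accession"],
    }
    # Stand-in stemmers (assumption): keep only the first 4 or the first 6 letters.
    print(len(wrongly_merged_pairs(gold, lambda w: w[:4])))
    print(len(wrongly_merged_pairs(gold, lambda w: w[:6])))
```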

Undesired stemming of proper nouns
Proper nouns should not be stemmed, as they are not inflected.
Example: Beijing will get stemmed to Beij.

Conclusion
ART:
- English: not satisfactory
- Hindi: poor
- Reasons: overstemming, undesired stemming of proper nouns
MET:
- performed efficiently with Bengali, Gujarati and Marathi
- matched the baseline with Odia
- underperformed with Hindi
THANK YOU!!