ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task

Page 1: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task

Avinash Yadav
Robins Yadav
Sukomal Pal

Department of Computer Science & Engineering
Indian School of Mines
Dhanbad, India

Page 2: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Contents
Introduction
Adhoc retrieval task participation
Morpheme Extraction Task participation
Conclusion

Page 3: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Introduction
Stemmer
ISMstemmer
Evaluation

Page 4: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Stemmer
Attempts to reduce word variants to their stem or root form
Example: education, educating, educative will all reduce to educat

Approaches for Stemming
Language based approach
Statistical approach

Page 5: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

ISMstemmer
statistical stemmer
based on suffix extraction
suffix frequency
algorithm

Page 6: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Data Preprocessing

Convert the corpus into a single file
File 1, File 2, ..., File n -> Single File

Cleaning of data
Before: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
After: John asked a girl with an apple of Kashmir do you have the time she said yes

Removing stop words
Before: John asked a girl with an apple of Kashmir do you have the time she said yes
After: John asked girl with apple Kashmir you time she said yes

Convert file into a single column
Before: John asked girl with apple Kashmir you time she said yes
After:
John
asked
girl
with
apple
Kashmir
you
time
she
said
yes
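The preprocessing steps on this slide can be sketched roughly as follows. This is only an illustration for the English example sentence: the stop-word list is a hypothetical one picked to reproduce the slide's output (the list the authors used is not given), the function names are invented here, and the sketch keeps the original letter case, whereas the slide shows "she" in lower case.

```python
import re

# Hypothetical stop-word list chosen to reproduce the slide's example;
# the list actually used by the authors is not given in the source.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}


def clean(text):
    """Drop punctuation and collapse whitespace, keeping only word characters."""
    return " ".join(re.findall(r"\w+", text))


def remove_stop_words(text):
    """Remove stop words (case-insensitive match against STOP_WORDS)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def to_single_column(text):
    """Put one word on each line."""
    return "\n".join(text.split())


raw = 'John asked a girl with an apple of Kashmir, "do you have the time". She said, "yes".'
cleaned = clean(raw)
filtered = remove_stop_words(cleaned)
column = to_single_column(filtered)
print(column)
```

The `\w+` pattern is adequate for the English sample only; Hindi text would need a tokenizer that keeps Devanagari combining marks attached to their base letters.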

Page 7: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Data preprocessing (contd.)
Unique words extracted
Hindi: 4,90,391
English: 7,95,144

Page 8: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Find valid suffixes

Reverse the words of the single-column file
Before: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
After: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna

Sort the reversed list
After: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba

Find suffixes according to the threshold
Reversed suffixes marked in the sorted list: de, gni, noit
Frequencies shown in the figure: 17%, 40%
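A minimal Python sketch of the reverse-and-count idea, using the 17 sample words from this slide. The `max_len` cap, the definition of frequency as a share of the distinct words, and all function names are assumptions made for illustration; the slides do not describe how nested candidates such as "g", "gn" and "gni" are told apart, so this sketch simply returns every ending that clears the threshold.

```python
from collections import Counter

WORDS = ("aborning absolution absorption abuilding acquisition activation added "
         "addition admiration admitted admitting agreed agreeing allotted "
         "allotting ambling angling").split()


def suffix_frequencies(words, max_len=4):
    """Map each reversed ending (up to max_len letters) to the share of words having it.

    Reversing a word turns its suffixes into prefixes, which is why sorting
    the reversed list places words with a common ending next to each other.
    """
    counts = Counter()
    for word in words:
        rev = word[::-1]
        # Leave at least one letter for the stem.
        for n in range(1, min(max_len, len(rev) - 1) + 1):
            counts[rev[:n]] += 1
    return {rev_suffix: c / len(words) for rev_suffix, c in counts.items()}


def valid_suffixes(words, threshold, max_len=4):
    """Return the reversed endings whose share of words is at least threshold."""
    freqs = suffix_frequencies(words, max_len)
    return {s for s, f in freqs.items() if f >= threshold}


freqs = suffix_frequencies(WORDS)
for rev_suffix in ("de", "gni", "noit"):
    print(rev_suffix[::-1], round(100 * freqs[rev_suffix], 1), "%")
```

On this small sample, 7 of the 17 words end in -ing, 6 in -tion and 4 in -ed. A real corpus has hundreds of thousands of unique words, which is presumably why the thresholds reported for the actual runs are far smaller than the percentages in the figure.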

Page 9: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Threshold used
English: 0.01 - 0.1%
Hindi: 0.1 - 1.0%

Page 10: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Stemming of corpus

Stem the reversed words with the reversed valid suffixes
Before: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
After: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba

Reverse the stemmed words to restore the original letter order
Before: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
After: add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu
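The strip-and-reverse step can be sketched in Python as below. The three reversed suffixes are the ones visible in the slide's example; preferring the longest matching suffix and the function name `stem` are assumptions for illustration, not details confirmed by the slides.

```python
# Reversed forms of -tion, -ing and -ed, as they appear in the slide's example.
REVERSED_SUFFIXES = ("noit", "gni", "de")


def stem(word, reversed_suffixes=REVERSED_SUFFIXES):
    """Strip one suffix by removing its reversed form from the front of the reversed word."""
    rev = word[::-1]
    # Assumption: when several suffixes match, prefer the longest one.
    for suffix in sorted(reversed_suffixes, key=len, reverse=True):
        if rev.startswith(suffix):
            rev = rev[len(suffix):]
            break
    # Reverse again to restore the normal letter order.
    return rev[::-1]


words = "added agreed admitted allotted abuilding agreeing ambling admiration".split()
stems = [stem(w) for w in words]
print(stems)
```

For these inputs the sketch yields the same stems as the slide (add, agre, admitt, allott, abuild, agree, ambl, admira).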

Page 11: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Note: If a word would be shorter than 3 letters after stemming, it is not stemmed.

Example: aging and king are left unchanged, because stripping the suffix would leave only ag and k.
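A small sketch of this length check, written on unreversed words for brevity. The suffix list and the name `stem_with_guard` are hypothetical; only the 3-letter minimum comes from the slide.

```python
MIN_STEM_LENGTH = 3  # from the slide: shorter results are rejected


def stem_with_guard(word, suffixes=("tion", "ing", "ed")):
    """Strip the longest matching suffix unless the stem would be too short."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            # Keep the original word when the candidate stem is under the minimum.
            return candidate if len(candidate) >= MIN_STEM_LENGTH else word
    return word


for w in ("aging", "king", "agreeing"):
    print(w, "->", stem_with_guard(w))
```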

Page 12: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Evaluation of ISMstemmer

For evaluation of ISMstemmer we participated in:

1. Monolingual Adhoc retrieval task in English and Hindi Languages

2. Morpheme Extraction Task (MET) of FIRE-2012

Page 13: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Adhoc Retrieval Task (ART) Participation
Monolingual task
Languages chosen:
English
  Approach
  Results
Hindi
  Approach
  Results

Page 14: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

ART: English
Approach:
Indexing:
  Search engine used: Indri (IndriBuildIndex)
Retrieval:
  Search engine used: Lemur (RetEval)
Data provided:
  Corpus from The Telegraph and BD News
  50 query set

Page 15: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

ART: English (contd.)
Results:

Run id                  No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
EE.ism.unstemmed        50              50000           3539                  2503                   0.2264
EE.ism.krovetzstemmer   50              50000           3539                  2504                   0.2255
EE.ism.ismstemmer       50              50000           3539                  2415                   0.2096
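For readers unfamiliar with the MAP column, the following sketch shows how mean average precision is commonly computed, assuming the table uses the standard definition. The query and document IDs are toy values invented for illustration and have nothing to do with the FIRE data.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Dividing by all relevant documents penalises relevant ones never retrieved.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all queries in the run."""
    return sum(average_precision(runs[q], qrels.get(q, ())) for q in runs) / len(runs)


# Toy data (hypothetical): two queries with short ranked lists.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
map_value = mean_average_precision(runs, qrels)
print(round(map_value, 4))
```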

Page 16: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

ART: Hindi
Approach:
Indexing:
  Search engine used: Indri (IndriBuildIndex)
Retrieval:
  Search engine used: Indri (IndriRunQuery)
Data provided:
  Corpus from Navbharat Times and Amar Ujala
  50 query set

Page 17: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

ART: Hindi (contd.)
Results:

Run id                            No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
HH.ism.unstemmed.indri            50              50000           2309                  222                    0.0173
HH.stemmmedcorpus.unstemmedquery  50              50000           2309                  98                     0.0026
HH.stemmmedcorpus.stemmedquery    50              50000           2309                  209                    0.0137

Page 18: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Morpheme Extraction Task Participation

Tool submitted
Results

Page 19: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

MET Tool Submission
ISMstemmer submitted
Evaluated at IR Labs: DAIICT, Gujarat
Tested on 6 languages of South Asian origin
Scored above the baseline MAP for 3 languages

Page 20: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

MET Results:

1. BENGALI

Institute     Language  MAP Obtained
Baseline      Bengali   0.2740
JU            Bengali   0.3307
DCU           Bengali   0.3300
IIT-KGP       Bengali   0.3225
CVPR-Team1    Bengali   0.3159
ISM           Bengali   0.3103
CVPR-Team2+   Bengali   NA

Page 21: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

MET Results (contd.)

2. GUJARATI

Institute  Language  MAP Obtained
Baseline   Gujarati  0.2677
ISM        Gujarati  0.2824

3. MARATHI

Institute  Language  MAP Obtained
Baseline   Marathi   0.2320
ISM        Marathi   0.2797
IIT-B      Marathi   0.2684

Page 22: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

MET Results (contd.)

4. ODIA

Institute  Language  MAP Obtained
Baseline   Odia      0.1537
IIIT-Bh    Odia      0.1537
ISM        Odia      0.1537

5. HINDI

Institute  Language  MAP Obtained
Baseline   Hindi     0.2821
DCU        Hindi     0.2963
ISM        Hindi     0.2793

Page 23: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

MET Results (contd.)

6. TAMIL

Institute  Language  MAP Obtained
Baseline   Tamil     NA
AUCEG      Tamil     NA
ISM        Tamil     NA

NA: results are not available, due to non-availability of qrels

Page 24: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Reasons for Underperformance with Hindi

overstemming
undesired stemming of proper nouns

Page 25: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Overstemming
This refers to words that should not be grouped together by stemming, but are.

Example:

1. accent, accentual, accentuate
   Stem word: accent

2. accept, acceptant, acceptor
   Stem word: accept

3. access, accessible, accession
   Stem word: access

With overstemming, all of these words may be grouped under one wrong stem: acce
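The effect can be reproduced with a deliberately aggressive toy stemmer. Truncating to four letters is not how ISMstemmer works; it is used here only as an assumption-free way to show three unrelated word families collapsing into the single stem "acce".

```python
from collections import defaultdict


def truncate_stem(word, length=4):
    """Toy over-aggressive stemmer: keep only the first `length` letters."""
    return word[:length]


words = ["accent", "accentual", "accentuate",
         "accept", "acceptant", "acceptor",
         "access", "accessible", "accession"]

# Group the words by the stem they are reduced to.
groups = defaultdict(list)
for w in words:
    groups[truncate_stem(w)].append(w)

for stem_word, members in groups.items():
    print(stem_word, "<-", ", ".join(members))
```

A retrieval system indexing these stems would treat a query for "accent" as matching documents about "access" or "accept", which lowers precision.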

Page 26: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Undesired stemming of proper nouns
Proper nouns should not be stemmed, as they are not inflected.

Example: Beijing
It will get stemmed to Beij

Page 27: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

Conclusion

ART:
  English: not satisfactory
  Hindi: poor
  Reasons:
    overstemming
    undesired stemming of proper nouns

MET:
  performed well with Bengali, Gujarati and Marathi (MAP above the baseline)
  performed on par with the baseline for Odia
  underperformed with Hindi


Page 30: ISM@FIRE 2012:  Adhoc  Retrieval Task & Morpheme Extraction Task

THANK YOU!!

