Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India.

Date post: 24-Dec-2015
Search in Transliterated Space

Shared Task Proposal, FIRE 2012

Monojit Choudhury
Microsoft Research Lab India

A Transliterated World Wide Web

Song Lyrics
Reviews and Forums
Facebook and Twitter
And a lot more

Beyond Indic languages

Many languages that use non-Roman script:

Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
Persian
Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)

Aspects of Transliterated Text

Code Mixing

Transliteration

Errors, Contraction

IR Scenario - I

Mono-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni suhanee
Results: Only Roman transliterated documents
Challenge: Spelling variations (tandee hawa ye chandny soohaany)
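
One way to illustrate the spelling-variation challenge is to reduce each Roman-transliterated word to a coarse key so that variant spellings collide. The sketch below is not the method proposed in the slides. The rewrite rules (treating v/w alike, dropping `h` after a consonant, stripping vowels) and the names `skeleton` and `same_skeleton` are assumptions chosen only to cover the example query above.

```python
import re

def skeleton(word: str) -> str:
    """Crude spelling-insensitive key for a Roman-transliterated word.

    The rules are illustrative assumptions, not a standard algorithm.
    """
    w = word.lower().replace("w", "v")                 # treat v and w as interchangeable
    w = re.sub(r"(?<=[bcdfgjklmnpqrstvxz])h", "", w)   # drop 'h' after a consonant: th -> t
    w = re.sub(r"h$", "", w)                           # drop a word-final 'h': yeh -> ye
    if not w:
        return ""
    # Keep the first letter, strip vowels (and y) from the rest.
    key = w[0] + re.sub(r"[aeiouy]", "", w[1:])
    return re.sub(r"(.)\1+", r"\1", key)               # collapse repeated letters

def same_skeleton(a: str, b: str) -> bool:
    """True if two queries have identical per-word keys, in order."""
    return [skeleton(t) for t in a.split()] == [skeleton(t) for t in b.split()]

query = "thandee hava yeh chandni suhanee"
variant = "tandee hawa ye chandny soohaany"
print(same_skeleton(query, variant))
```

Such a key could serve as an index term alongside the original spelling. Hand-written rules like these over-merge unrelated words, so a real system would more likely learn the equivalences from data.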

IR Scenario - II

Cross-script and Multi-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
Results: Documents in Roman transliteration or in native script
Challenge: Transliteration
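
A multi-script system first has to know which script a query is written in. The following minimal sketch checks characters against the Unicode Devanagari block (U+0900 to U+097F) and ASCII letters. The function name `script_of` and the four return labels are assumptions made for illustration, and only Hindi in Devanagari is covered.

```python
def script_of(text: str) -> str:
    """Classify a query as 'roman', 'devanagari', 'mixed', or 'unknown'."""
    # Devanagari block: U+0900..U+097F
    has_deva = any("\u0900" <= ch <= "\u097f" for ch in text)
    # Roman here means plain ASCII letters only (an assumption).
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_deva and has_latin:
        return "mixed"
    if has_deva:
        return "devanagari"
    if has_latin:
        return "roman"
    return "unknown"

print(script_of("thandee hava yeh chandni"))
print(script_of("ठंडी हवा ये चाँदनी"))
```

A query classified as `roman` would then need backward transliteration before matching native-script documents, which is the challenge named above.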

IR Scenario - III

Cross-script and Cross-lingual IR

Query: death of mareech and subahoo
Document: Hindi (Transliterated and Devanagari) and English documents

Shared Task on Retrieval

Mono-script Monolingual IR: Transliterated query in Roman; transliterated documents in Roman

Cross-script Monolingual IR: Transliterated query in Roman; transliterated documents in native script

Multi-script Monolingual IR: Query in Roman or native script; documents in Roman and native scripts

Shared Sub-Tasks

Language identification of transliterated queries, documents, code-mixed text

kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML

Transliteration

Forward: കഴിക്കാൻ → kazhikkan
Backward: kazhikkan → കഴിക്കാൻ
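
To make the word-level language tags in the code-mixed example concrete, the sketch below pairs each token with its tag and groups consecutive tokens of the same language into spans. It only processes annotations that already exist and does not predict them. The helper name `language_spans` is hypothetical.

```python
from itertools import groupby

# Tokens and tags from the code-mixed example (ML = Malayalam, EN = English).
tokens = "kooda kazhikkan oru urgan split pea soup undaki".split()
tags = "ML ML ML ML EN EN EN ML".split()

def language_spans(tokens, tags):
    """Group consecutive tokens that share a language tag into (tag, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("each token needs exactly one tag")
    pairs = zip(tokens, tags)
    return [
        (tag, " ".join(tok for tok, _ in group))
        for tag, group in groupby(pairs, key=lambda p: p[1])
    ]

for tag, text in language_spans(tokens, tags):
    print(tag, text)
```

Spans like these show where the language switches, which is where a retrieval system might apply a different transliteration or normalization step.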

Available Data

20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)

35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics

More data is under preparation from Facebook, covering a mixture of various languages.

Looking for partners to extend!
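
Native-Roman word pairs of the kind listed above could be indexed in both directions for forward and backward lookup. The sketch below assumes in-memory pairs, since the actual file format of the data sets is not described here. The four sample pairs are inferred from the example query and its spelling variants in the IR scenarios, and `build_tables` is a hypothetical helper.

```python
from collections import defaultdict

# Sample (native, Roman) pairs inferred from the example query; not real data-set rows.
pairs = [
    ("ठंडी", "thandee"),
    ("ठंडी", "tandee"),
    ("हवा", "hava"),
    ("हवा", "hawa"),
]

def build_tables(pairs):
    """Return (forward, backward) lookup tables built from (native, roman) pairs."""
    forward = defaultdict(set)   # native word -> observed Roman spellings
    backward = defaultdict(set)  # Roman spelling -> candidate native words
    for native, roman in pairs:
        roman = roman.lower()
        forward[native].add(roman)
        backward[roman].add(native)
    return forward, backward

forward, backward = build_tables(pairs)
print(sorted(forward["ठंडी"]))
print(backward["hawa"])
```

Sets are used on both sides because one native word typically has several Roman spellings, and one Roman spelling may correspond to more than one native word.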

Available Data

Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics

Looking for partners to extend it to other (Indian) languages

Other domains?

Thank you! [email protected]

Other resources

Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers

Anything else?

Concluding Remarks

We have built a multi-script Bollywood song search and are working on transliteration and code-mixing

These are just some initial ideas that came up from our experiences

If you are interested, please let me know
