Date post: | 22-Oct-2014 |
Category: |
Technology |
Digitizing and Retrieving Printed Arabic Documents
Kareem Darwish
Senior Scientist
Qatar Computing Research Institute
Overview
Some Magic
Search results
Scanning
Scanning
http://en.wikipedia.org/wiki/Book_scanning
Result of Scanning
Courtesy of the Library of Alexandria
http://www.colophon.com
Magic: Optical Character Recognition
Courtesy of the Library of Alexandria
والنيران المراقبة ناحيتى من . الساحلى السهل على
تسلكها التى االقتراب وطرقناحية من عربية قوات أى
طرق دى ر تنح الشرتى : التالية الثالثة أهمها خمسة
ا- وهو ألول ا الطريق ا - هـ بغداد من " 2233ألقصر
إلى. النحراف ا أو ، المفرتىقبل دمشق 3أ3الرطبة
األردن،إلى العرا من الرابى ا محاور
ألردد وا سوريابغداد- " 2 من االثانى الطريق
- - - ا دمشق بالميرا كمال أبوألردد.
األطول - 3 وهو الثالسث الطريق - - - الزور دير الموصل داد بغ من
- دمشق- حملروألردن " 68ا
OCR output (Sakhr)
Arabic OCR is Hard
• Letters change shape depending on position in word, with dots distinguishing them from each other
  – تـ ، ـتـ ، ـت
  – قـ ، ـقـ ، ـق ، ق
• Diacritics are optional
  – ق ، قَ ، قُ ، قِ ، قْ ، قّ
• Some letter combinations have special shapes (ligatures):
  – ل + ا = لا
• Letter elongations (Kashida) are often used
  – قبـــــــــــــــــــــــــــل = قبل
• Letters are connected
Isolated End Middle Start
ت ـت ـتـ تـ
ي ـي ـيـ يـ
ق ـق ـقـ قـ
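The shape variation, optional diacritics, ligatures, and kashida listed above are usually folded away before indexing or comparing text. The following is a minimal sketch of such a normalization step, not a description of any particular OCR or retrieval system. The function name `normalize` is hypothetical, and the choice to rely on Unicode NFKC folding for positional forms and ligatures is an assumption.

```python
import re
import unicodedata

# Arabic marks U+064B..U+0652 cover tanween, fatha, damma, kasra,
# shadda, and sukun. U+0640 is the kashida (tatweel) elongation character.
DIACRITICS = re.compile("[\u064B-\u0652]")
KASHIDA = "\u0640"


def normalize(text: str) -> str:
    """Fold positional forms and ligatures, then drop diacritics and kashida."""
    # NFKC maps presentation forms (e.g. initial teh U+FE97, the lam-alef
    # ligature U+FEFB) back to the base letters.
    text = unicodedata.normalize("NFKC", text)
    text = DIACRITICS.sub("", text)
    return text.replace(KASHIDA, "")
```

With this sketch, an elongated and an unelongated spelling of the same word, or a diacritized and an undiacritized one, compare as equal after normalization.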
Arabic OCR is Hard
Diacritics and dots are easily confused with one another. If the manuscript is old, they can also be confused with speckle on the page.
Word error rate is typically greater than 20%!
Arabic OCR is Hard
الخليقة تقاكظ سوق الجنة والنار وبها وتام• فهى واألبا إر رالفجار رالكفارإلى المؤمنين
منشأ الخلق واألمر والثواب والعقاب ،وهى رغها رعن له الخليقة خطقت الذى اهدن
والحسابحقرقها السمؤال
Typical OCR output
Arabic Morphology Challenges
• Arabic uses complex derivational morphology:
  – Root (ex. ktb)
  – Stem: root in a template (ex. mkAtbp)
  – Word: stem with optional determiner, preposition, coordinating conjunctions, plural suffix, etc. (ex. w+Al+mkAtbp+At → wAlmkAtbAt)
  – Estimated number of possible words: 60 billion
• Morphology dictates diacritics, which change meaning
  – Ex. Elm (Eelm, Ealam, Eolem: knowledge, flag, acknowledge)
• No specific writing standard is prevalent:
  – Ex. The trailing letters in Ely (Ali) and ElY (on) are often interchanged
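One common way to cope with the interchanged trailing letters is to conflate them before matching. The sketch below shows that idea only. The function name `normalize_final_ya` is hypothetical, and mapping a trailing alef maqsura (ى, Buckwalter Y) to ya (ي, Buckwalter y) is an assumed policy, not something the slides prescribe.

```python
ALEF_MAQSURA = "\u0649"  # ى (Buckwalter Y)
YA = "\u064A"            # ي (Buckwalter y)


def normalize_final_ya(word: str) -> str:
    """Map a trailing alef maqsura to ya so both spellings match."""
    if word.endswith(ALEF_MAQSURA):
        return word[:-1] + YA
    return word
```

The trade-off is visible in the example from the slide: after this step, Ely (Ali) and ElY (on) become indistinguishable, which helps recall but costs precision.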
Arabic Morphology
• For regular Arabic search, morphological analysis is typically used:
  – Full morphological analysis:
    • Sebawai, Buckwalter, IBM Lee, AMIRA
  – Light stemming: remove common prefixes and suffixes
    • Al-Stem or Light-10
• On OCR output, these techniques fail
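Light stemming, as described above, can be sketched in a few lines. This is an illustration under stated assumptions: the affix lists are illustrative and are not claimed to match Al-Stem or Light-10 exactly, and the function name `light_stem` and the `min_len` threshold are hypothetical.

```python
# Illustrative affix lists (assumed, not the exact Al-Stem or Light-10 lists).
# Longer affixes come first so that they are tried before shorter ones.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

The sketch also shows why stemming breaks on OCR output. The correct word والفجار loses its prefix and becomes فجار, while the misrecognized رالفجار, where the leading و was read as ر, matches no prefix and is left unchanged, so the two no longer conflate.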
OCR Error Handling
• Error correction:
  – Word level techniques:
    • Dictionary lookup (Jurafsky & Martin, 2000)
      – Character level model uses confusion matrix
      – Typically font dependent
    • Character n-gram model:
      – Some character sequences are more common than others
      – Presence of a rare character sequence indicates position of error

$$ \arg\max_{Word_{Org}} P(Word_{Org} \mid Word_{OCR}) = \arg\max_{Word_{Org}} P(Word_{OCR} \mid Word_{Org}) \, P(Word_{Org}) $$

Here P(Word_OCR | Word_Org) is the character level model and P(Word_Org) is the word level model.
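The product of a character level model and a word level model can be illustrated with a toy corrector. This is a sketch under simplifying assumptions: the slides call for a confusion matrix, whereas here the character level model is approximated by a fixed per-edit probability `p_edit` raised to the edit distance, and the word level model is a unigram estimate from a small count table. The names `edit_distance`, `correct`, and `p_edit` are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def correct(ocr_word: str, word_counts: dict, p_edit: float = 0.01) -> str:
    """Return the dictionary word maximizing P(ocr | word) * P(word)."""
    total = sum(word_counts.values())

    def score(candidate: str) -> float:
        # Stand-in character level model: each edit costs log(p_edit).
        log_channel = edit_distance(ocr_word, candidate) * math.log(p_edit)
        # Word level model: unigram relative frequency.
        log_prior = math.log(word_counts[candidate] / total)
        return log_channel + log_prior

    return max(word_counts, key=score)
```

A font-specific confusion matrix would replace the uniform `p_edit` with per-character substitution probabilities, which is where the font dependence noted above comes from.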
OCR Error Handling
• Error correction:
  – Passage level/context sensitive techniques:
    • Using language modeling (bi or trigram LM):

$$ P(Word_{Org} \mid Word_{OCR}) \propto P(Word_{OCR} \mid Word_{Org}) \, P(Word_{Org}) \, P(Word_{Org} \mid Word_{Org-1}) $$

    • Clustering words in passage:
      – Assumes salient terms appear more than once
      – Ex. Kennedy; Kemedy; Kennody; etc.
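The clustering idea, in which a salient term occurs several times and its misrecognized variants sit close to it, can be sketched as a greedy grouping by edit distance. This is only an illustration: the function name `cluster_corrections`, the greedy strategy, and the `max_dist` threshold are assumptions, not the method used in any cited system.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster_corrections(tokens, max_dist: int = 2) -> dict:
    """Map each distinct token to the most frequent nearby form in the passage."""
    counts = Counter(tokens)
    heads = []    # cluster representatives, most frequent first
    mapping = {}
    for word, _ in counts.most_common():
        for head in heads:
            if edit_distance(word, head) <= max_dist:
                mapping[word] = head   # treat as a variant of a frequent form
                break
        else:
            heads.append(word)         # no close head: start a new cluster
            mapping[word] = word
    return mapping
```

For a passage containing "kennedy" three times plus "kemedy" and "kennody" once each, both variants map to "kennedy". A fixed distance threshold will over-merge short words, so a real system would likely scale it with word length.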
OCR Error Handling
• Multi-source fusion:
  – Uses language modeling to fuse the output of multiple OCR systems
• Query garbling:
  – Use a character level model to generate multiple degraded versions of a query
    • Ex.: cement => cement, cornent, cernont, etc.
  – Set degraded versions of a term as synonyms
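Query garbling can be illustrated by expanding a term through a table of likely character confusions. The sketch below is a toy: the `CONFUSIONS` table is hypothetical and contains only the two confusions needed to reproduce the "cement" example, and the names `garble` and `max_variants` are assumptions. A real character level model would attach probabilities and keep only the most likely variants.

```python
from itertools import product

# Hypothetical confusion table: original character -> possible OCR renderings.
CONFUSIONS = {"m": ["rn"], "e": ["o"]}


def garble(term: str, confusions=CONFUSIONS, max_variants: int = 20) -> list:
    """Generate degraded versions of a term, starting with the term itself."""
    # For each character, list the character itself followed by its confusions.
    options = [[ch] + confusions.get(ch, []) for ch in term]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break
    return variants
```

The resulting variants would then be issued together as synonyms of the original query term, so that a document containing "cornent" can still match a query for "cement".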
Arabic OCR Text Retrieval
• Without error handling: use character n-grams (3- and 4-grams)
الخليقة تقاكظ سوق الجنة والنار وبها وتام فهى واألبا إر رالفجار رالكفارإلى المؤمنين
منشأ الخلق واألمر والثواب والعقاب ،وهى رغها رعن له الخليقة خطقت الذى اهدن
والحسابحقرقها السمؤال
OCR output رالفجار compared with the correct word والفجار:
  – رالفجار: رال ، الف ، فجا ، جار
  – والفجار: وال ، الف ، فجا ، جار
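The n-gram comparison above can be reproduced with a short sketch. The function names `char_ngrams` and `overlap` are hypothetical, and using Jaccard overlap as the similarity measure is an assumption chosen for illustration. A retrieval system would instead index the n-grams as terms.

```python
def char_ngrams(word: str, sizes=(3, 4)) -> list:
    """Return all character n-grams of the given sizes, in order."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    union = grams_a | grams_b
    if not union:
        return 0.0  # both words are shorter than the smallest n-gram size
    return len(grams_a & grams_b) / len(union)
```

Because a single misrecognized character only disturbs the n-grams that contain it, the OCR output رالفجار and the intended والفجار still share most of their 3- and 4-grams, which is what lets retrieval work without explicit error correction.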
Presenting Results
• Presenting OCR output to users is not an option
• What would a ranked list of images look like?
  – How would we generate image snippets?
  – How do we highlight salient terms in these images?
Presenting Results
• What is the unit of search?
  – Is it a book, a chapter, or a page?
Concluding Remarks
• Scanning is a fairly mature technology
• Arabic OCR still has a long way to go
• Quality of search is tied to the quality of OCR
• Presentation issues persist