Multilingual OCR Research and Applications: An Overview

Xujun Peng
Raytheon BBN Technologies
Cambridge, MA, USA
[email protected]

Huaigu Cao
Raytheon BBN Technologies
Cambridge, MA, USA
[email protected]

Srirangaraj Setlur
CUBS, University at Buffalo
Buffalo, NY, USA
[email protected]

Venu Govindaraju
CUBS, University at Buffalo
Buffalo, NY, USA
[email protected]

Prem Natarajan
Information Sciences Institute, Univ. of Southern California
Marina del Rey, CA, USA
[email protected]

ABSTRACT

This paper offers an overview of the current approaches to research in the field of off-line multilingual OCR. Typically, off-line OCR systems are designed for a particular script or language. However, the ideal approach to multilingual OCR would likely be to develop a system that can, with the use of language-specific training data, be re-targeted to process different languages with minimal modifications. This is still an open area of research with plenty of challenges. This is particularly true for multilingual handwriting recognition due to the added complexity of variations in writing styles even within the same script. Challenges for multilingual OCR in preprocessing, feature extraction, script identification and recognition modeling, along with a brief survey of research in these areas, are presented in the paper. Ideas for future research in multilingual OCR are outlined.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; I.7.0 [Computing Methodologies]: Document and Text Processing—General

General Terms

Application

Keywords

Multilingual OCR, System modeling, Document processing, Text Recognition

1. INTRODUCTION

As a major application of pattern recognition and machine learning, optical character recognition (OCR) is widely used for converting text from scanned documents into digitally editable text to facilitate document indexing and management for search and information retrieval. The origin of OCR can be traced back to early 19th century attempts to develop devices to aid the blind in reading [48]. Decades of research and development have resulted in OCR systems (commercial as well as open-source) that can recognize printed documents and well-constrained handwritten documents in applications such as forms processing and mail automation [53] with acceptable accuracies.

However, despite these successes, most OCR engines can only handle a particular language or a small set of languages, because researchers tend to use language-specific features and models to simplify the problem. For instance, although numerous successful implementations of OCR for Latin-like scripts have been reported in the literature, the direct use of these engines on other scripts, such as Asian scripts, results in a steep decline in recognition accuracy because the features are script/language-specific. Further difficulties are introduced if multiple scripts/languages appear within the same document image, as shown in Fig. 1, where Fig. 1(a) contains both Hindi and English and Fig. 1(b) contains both Arabic and English.

The proliferation of affordable cameras and mobile devices such as smart phones and tablets with cameras has led to significant interest in the location and recognition of scene text as an enabling technology for a variety of mobile applications, and this has become an active area of research [40, 43]. Processing of multiple scripts/languages is also a challenge for scene text recognition. In Fig. 2(a), Korean and English appear in the same street view image, and in Fig. 2(b), the street sign contains both French and English. Such unconstrained scene images with complex backgrounds present a very different set of challenges for multilingual OCR than conventional documents, for pre-processing and text location as well as recognition.

A typical multilingual OCR system pipeline is shown in Fig. 3 and consists of the following processing steps: (i) preprocessing, including binarization, layout analysis and page segmentation, followed by line finding and optionally word and character segmentation; (ii) feature extraction; and (iii) classification or recognition. Documents containing text in multiple scripts/languages also require a script identification module, which involves its own preprocessing, feature extraction, and recognition modeling steps.
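
As an illustration of the pipeline just described, the following Python sketch wires the stages of Fig. 3 together. Every callable here (preprocess, identify_script, extract_features, the per-script recognizers and their decode method) is a hypothetical placeholder introduced for this sketch, not an interface from any of the systems surveyed in this paper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class TextLine:
    image: Any                  # normalized line image (e.g., a numpy array)
    script: str = "unknown"

def run_ocr(page_image: Any,
            preprocess,          # page image -> list of TextLine
            identify_script,     # TextLine -> script name
            extract_features,    # TextLine -> frame/feature sequence
            recognizers: Dict[str, Any]) -> List[str]:
    """Glue code for the pipeline of Fig. 3; every argument is a placeholder."""
    hypotheses: List[str] = []
    for line in preprocess(page_image):         # binarization, layout, line finding
        line.script = identify_script(line)     # needed for mixed-script pages
        feats = extract_features(line)          # script-independent or script-aware
        model = recognizers[line.script]        # script-specific glyph + language model
        hypotheses.append(model.decode(feats))  # recognition / decoding
    return hypotheses
```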

(a) Mixture of Hindi and English

(b) Mixture of Arabic and English

Figure 1: Examples of mixed documents that contain multiple languages within the same page.

While document image preprocessing methods that eliminate noise and extract and normalize text lines for recognition are typically script-independent, it is sometimes necessary to be cognizant of script-specific characteristics, such as diacritical marks in Arabic and the shirorekha or head-line in Devanagari, which might influence the way preprocessing is done. As knowledge of the script and language of the text is necessary to use the right resources for multilingual OCR, script identification is a vital component, especially for mixed documents that contain multiple scripts. Recognition is the process of identifying text by classifying each character/word. Identification of discriminative features for recognition is one of the most challenging problems for multilingual OCR due to the differing characteristics of each script. Generally, text recognition approaches for conventional documents can be divided into segmentation-based and segmentation-free methods, based on whether text lines are segmented into finer units prior to recognition. Segmentation-based methods divide a text line into a sequence of smaller units, such as characters or sub-characters, and compare combinations of these units against potential lexicon words to find the best match. In contrast, segmentation-free OCR methods implicitly incorporate character segmentation in producing a globally optimized character/word sequence as the recognition result, using methods such as the left-to-right hidden Markov model (HMM). Depending on the choice of a segmentation-based or segmentation-free approach, different recognition modeling methods can be used.

(a) Mixture of Korean and English in a street view image

(b) Mixture of French and English in a sign image

Figure 2: Examples of scene images with multiple languages.

2. PREPROCESSING

The input to an OCR system is typically a color or grayscale document image containing text. However, the document page might often contain other elements such as figures, logos, tables, and pictures. As the first step of an OCR system, the goal of preprocessing is to identify the text within the document, segment it into text lines, and generate noise-free, normalized line or word images for further processing. While preprocessing methods for binarization, noise removal, page segmentation, line finding, etc. are typically script-independent, script-specific characteristics can necessitate specialized treatment; e.g., the shirorekha in Devanagari can be a useful aid for text line detection and word segmentation, whereas the size threshold below which components are treated as noise and eliminated when processing documents in Latin script could significantly impact recognition in Arabic and Indic documents by eliminating crucial diacritics.
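
The point about script-aware noise thresholds can be made concrete with a small sketch: connected components below an area threshold are discarded, but the threshold is chosen per script so that Arabic or Devanagari diacritics survive. The threshold values below are illustrative assumptions, not published settings.

```python
import numpy as np
from scipy.ndimage import label

# Illustrative area thresholds (in pixels); real values would be tuned per
# script and image resolution. Arabic/Devanagari keep much smaller components
# so that diacritics and matras are not discarded as noise.
NOISE_AREA = {"latin": 15, "arabic": 3, "devanagari": 3}

def remove_small_components(binary, script="latin"):
    """Drop connected components smaller than a script-dependent area
    (text pixels = 1); a minimal sketch of script-aware noise removal."""
    labels, _ = label(binary)
    sizes = np.bincount(labels.ravel())
    keep = sizes >= NOISE_AREA.get(script, 15)
    keep[0] = False                       # label 0 is the background
    return keep[labels].astype(binary.dtype)
```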

Figure 3: Typical components of a multilingual OCR system: preprocessing (binarization, layout analysis, line finding and word/character segmentation), script identification, feature extraction, and recognition modeling for segmentation-based or segmentation-free OCR, mapping an input text image to an output hypothesis.

2.1 Binarization

Most OCR systems are designed to work with binarized images of documents, and good binarization is crucial for reliable performance. There has been a great deal of work on binarization algorithms, from initial approaches such as Otsu's, which assume a bimodal distribution of pixel intensities within a document, to the more sophisticated local adaptive binarization approaches that have become the norm [18]. However, a satisfactory choice of binarization algorithm appears to depend on the application domain and on experimentation with relevant data sets.
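
For concreteness, the sketch below contrasts a global Otsu threshold with a Sauvola-style local adaptive threshold on a grayscale numpy array. The parameter values (window size, k, R) are common defaults assumed for illustration, not recommendations from the cited work.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def otsu_threshold(gray):
    """Global Otsu threshold: pick t maximizing the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    cum_w = np.cumsum(prob)
    cum_mean = np.cumsum(prob * np.arange(256))
    total_mean = cum_mean[-1]
    best_t, best_var = 0, -1.0
    for t in range(1, 255):
        w0, w1 = cum_w[t], 1.0 - cum_w[t]
        if w0 < 1e-9 or w1 < 1e-9:
            continue
        mu0 = cum_mean[t] / w0
        mu1 = (total_mean - cum_mean[t]) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def sauvola_binarize(gray, window=25, k=0.2, R=128.0):
    """Sauvola-style local threshold: T = m * (1 + k * (s / R - 1))."""
    g = gray.astype(float)
    mean = uniform_filter(g, window)
    mean_sq = uniform_filter(g * g, window)
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    thresh = mean * (1.0 + k * (std / R - 1.0))
    return (g <= thresh).astype(np.uint8)   # 1 = text (dark on light background)

# Global vs. local on the same uint8 grayscale page image `gray`:
# text_global = (gray <= otsu_threshold(gray)).astype(np.uint8)
# text_local  = sauvola_binarize(gray)
```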

Several approaches have been proposed for binarization of multilingual documents. By using a small representative set of color pixels from text contours, Kasar et al. [28] designed a script-independent adaptive binarization approach which outperforms script-dependent approaches on a multi-script data set. In [46], Rangoni et al. proposed a script-guided binarization approach, where the optimal threshold is obtained by training a Gaussian mixture model for a particular language; this framework is claimed to be script-independent. Binarization of camera-based scene text images is particularly complex and is still a very active area of research.

2.2 Page segmentation

To identify regions of text in a document image, document layout analysis or page segmentation is required, in which meaningful components are separated and the corresponding positions and attributes (such as text blocks, images, tables, etc.) are identified. This information is then fed into line finding and recognition systems.

Existing document layout analysis algorithms can be divided into two main categories: bottom-up and top-down approaches. Bottom-up methods start with local connected components and recursively merge neighboring regions that have similar texture structure or features until some criterion, such as a spatial constraint, is met. These approaches include Voronoi-diagram based methods [31, 37] and the Docstrum method [41]. Top-down methods divide the document image into smaller regions recursively. The recursive X-Y cut of projection profiles [21] and background analysis [5] can be considered typical top-down segmentation methods. Other approaches use a combination of bottom-up and top-down methods [8].
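
A minimal recursive X-Y cut, in the spirit of [21] but not a reproduction of it, can be sketched as follows; the gap-width and region-size parameters are arbitrary illustrative values.

```python
import numpy as np

def find_widest_gap(profile, min_gap):
    """Return (start, end) of the widest interior run of empty rows/columns, or None."""
    gaps, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            gaps.append((start, i)); start = None
    if start is not None:
        gaps.append((start, len(profile)))
    # gaps touching the border do not separate two blocks
    gaps = [g for g in gaps if g[0] > 0 and g[1] < len(profile) and g[1] - g[0] >= min_gap]
    return max(gaps, key=lambda g: g[1] - g[0]) if gaps else None

def xy_cut(binary, min_gap=10, min_size=20):
    """Recursively split a binary page (text = 1) into block bounding boxes."""
    boxes = []
    def recurse(y0, y1, x0, x1):
        region = binary[y0:y1, x0:x1]
        if region.sum() == 0:
            return                            # drop empty regions
        h_proj = region.sum(axis=1)           # ink per row
        v_proj = region.sum(axis=0)           # ink per column
        cut = find_widest_gap(h_proj, min_gap)
        if cut is not None and (y1 - y0) > min_size:
            recurse(y0, y0 + cut[0], x0, x1)  # block above the gap
            recurse(y0 + cut[1], y1, x0, x1)  # block below the gap
            return
        cut = find_widest_gap(v_proj, min_gap)
        if cut is not None and (x1 - x0) > min_size:
            recurse(y0, y1, x0, x0 + cut[0])
            recurse(y0, y1, x0 + cut[1], x1)
            return
        boxes.append((y0, y1, x0, x1))        # no further cut: keep as a block
    recurse(0, binary.shape[0], 0, binary.shape[1])
    return boxes
```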

By incorporating prior knowledge of script/language into layout analysis systems, researchers have developed several page segmentation algorithms that are applicable to a wide range of scripts and layout structures. Early attempts at language-free layout analysis include Ittner and Baird's work, which uses a method of greedy white covers [23]. In [49], Kumar et al. analyze why general layout analysis systems do not work well on particular scripts and suggest that script information should be taken into consideration during the design of layout analysis systems. By combining statistical models and global geometric features, Breuel proposed a script-independent page segmentation approach [6].

2.3 Text line finding

After identification of text regions, line finding algorithms can be applied to extract text lines from the document image. The methods applied for line finding usually assume that text is roughly horizontally oriented. In machine-printed documents, individual text lines can be extracted fairly easily by using projection-profile based methods [34]. Handwritten text lines, which are often curved and skewed, are more complex, and graph-based methods [32] or local grouping and smearing based methods [50] can provide more reliable results than projection-based approaches.
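
A projection-profile line finder of the kind used for machine-printed pages fits in a few lines; the min_fill fraction below is an assumed tuning knob, not a value from the cited work.

```python
import numpy as np

def find_text_lines(binary, min_fill=0.02):
    """Projection-profile line finding for roughly horizontal printed text
    (text pixels = 1). Rows whose ink count exceeds a small fraction of the
    peak are grouped into bands; returns a list of (top, bottom) row pairs."""
    proj = binary.sum(axis=1).astype(float)
    on = proj > min_fill * proj.max()
    lines, start = [], None
    for y, flag in enumerate(on):
        if flag and start is None:
            start = y
        elif not flag and start is not None:
            lines.append((start, y)); start = None
    if start is not None:
        lines.append((start, len(on)))
    return lines
```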

To support line finding in multilingual OCR systems, Bukhari et al. proposed a script-independent text line segmentation approach based on active contours [7]. Instead of using text line height to segment machine-printed text lines, Kim and Oh described an interline-space based approach for text line finding which is also script-independent [30]. Unlike other connected-component based methods, Li et al. designed a script-free text line finding method by estimating the probability of each pixel belonging to a text line [33]. The experimental results in [33] show that this approach outperforms other methods on Chinese, Korean, Arabic and Hindi data sets. Handwritten line segmentation competition reports [19] describe more line finding algorithms which can be extended to multilingual applications. The performance of line segmentation algorithms can be improved by incorporating script-specific enhancements, such as specialized grouping for Arabic diacritics.

2.4 Word/Character segmentation

Depending on the recognition modeling scheme, segmentation-based OCR systems further divide text lines into smaller units, e.g., words, characters or sub-characters, for classification. One potential disadvantage of word or character segmentation for multilingual OCR systems is that these methods rely on the glyph characteristics of a particular script. For example, most Latin or Chinese character segmentation approaches are based on the assumption of gaps between characters, where vertical projection histogram analysis or connected component analysis can be used [54]. But this type of approach cannot easily be adopted for Arabic or Hindi scripts, where connected characters and allographs are expected [2]. Thus, more effort is needed to design effective word segmentation algorithms for a segmentation-based multilingual OCR system.
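
The gap assumption mentioned above translates directly into code: candidate cut points are the centers of empty columns in the vertical projection. This works for well-separated Latin or CJK print but, as noted, not for connected Arabic or Devanagari text; the sketch is illustrative only.

```python
import numpy as np

def candidate_cuts(line_img, min_gap=1):
    """Gap-based segmentation for scripts where characters are separated by
    white space: columns with no ink that form a run of at least `min_gap`
    pixels become candidate cut points (text pixels = 1)."""
    proj = line_img.sum(axis=0)
    cuts, run_start = [], None
    for x, v in enumerate(proj):
        if v == 0 and run_start is None:
            run_start = x
        elif v != 0 and run_start is not None:
            if x - run_start >= min_gap:
                cuts.append((run_start + x) // 2)   # cut in the middle of the gap
            run_start = None
    return cuts
```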

In order to adapt Tesseract for multilingual OCR, Smith et al. adjust the number of segmentation points for different languages. For example, Tesseract tends to reduce segmentation points for Chinese and Japanese because characters from these languages have similar aspect ratios and are either full-pitched or half-pitched [52]. Markov-model based OCR systems have now matured, and many researchers apply this technique for implicit word/character segmentation because it provides a suitable framework for processing a text line as a sequence of signals. For instance, Jiang et al. successfully used an HMM to segment touching Chinese characters [24], and in [4], an HMM-based Thai word segmentation approach was proposed for text in which word boundaries are not apparent.

Prior to word/character segmentation, text lines are generally de-skewed, slant-corrected and rescaled to ensure a normalized appearance [17]. As summarized by Plotz and Fink [45], no standard methodology exists for these three operations, and the methods used are somewhat script-dependent for multilingual OCR.

3. SCRIPT IDENTIFICATION

Most OCR systems assume that the script and language of the document are known beforehand. Preprocessing can also benefit from prior knowledge of the script/language, as shown in Fig. 3 and described in the preceding sections. Hence, an OCR system's performance on a document with unknown script can be improved if the script/language can be identified. Although in most application scenarios one can explicitly identify a document's script by examining the document image and then utilize the appropriate engines, it is infeasible to manually identify scripts in a multilingual environment that deals with mixed documents containing multiple scripts within the same page, as shown in Fig. 1. Thus, automatic script identification is necessary for many OCR systems to select an appropriate script-specific preprocessing method and OCR engine.

In [25], Joshi et al. provided a generalised framework for script identification, where text is treated as texture and a hierarchical approach is used. In [42], the properties of different scripts were analyzed and a system was proposed to identify text lines of different languages, such as English, Chinese, Arabic and Bangla. Based on a template clustering method, the system described in [22] can identify thirteen scripts with high accuracy. In order to separate English text from Thai documents, Chanda et al. proposed an SVM-based method that identifies English and Thai words in mixed documents [9]. By extracting extreme points from text lines and using properties such as the average angle of the segments and the smoothness of lines, Phan et al. proposed a multi-script identification method that can separate English, Chinese and Tamil in video scene images [44].
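
None of the cited systems is reproduced here, but the general flavor of texture/shape-based script identification can be sketched with a few crude line-level features and a nearest-class-mean classifier; both the features and the classifier are illustrative assumptions only.

```python
import numpy as np

def script_features(line_img):
    """Toy features for a binarized text-line image (text = 1): ink density,
    vertical center of mass, and the share of ink in the top/middle/bottom
    thirds - crude cues that differ across scripts (e.g., the Devanagari
    head-line concentrates ink near the top of the line)."""
    h, w = line_img.shape
    ink = line_img.sum() + 1e-9
    rows = line_img.sum(axis=1)
    center = (rows * np.arange(h)).sum() / ink / h
    thirds = [rows[:h // 3].sum() / ink,
              rows[h // 3:2 * h // 3].sum() / ink,
              rows[2 * h // 3:].sum() / ink]
    return np.array([ink / (h * w), center] + thirds)

class NearestMeanScriptID:
    """Classify a text line by distance to per-script mean feature vectors."""
    def fit(self, lines, labels):
        feats = np.stack([script_features(l) for l in lines])
        labels = np.array(labels)
        self.means = {s: feats[labels == s].mean(axis=0) for s in set(labels)}
        return self
    def predict(self, line_img):
        f = script_features(line_img)
        return min(self.means, key=lambda s: np.linalg.norm(f - self.means[s]))
```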

4. FEATURE EXTRACTION

In a segmentation-free HMM-based OCR system [38, 39], a number of script-independent features such as intensity percentiles, local stroke angle, correlation and total energy are extracted from each sliding window of a text line image. Not relying only on local features, El-Hajj et al. used baseline-based features, which are also script-independent, in an HMM-based OCR [13]. Other features, such as Gabor features [10] and GSC (Gradient, Structure, and Concavity) features [16], are also widely used in script-independent OCR systems.
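
A sliding-window feature extractor of this kind might look like the sketch below, which computes only intensity percentiles per frame; the systems in [38, 39] also use stroke-angle, correlation and energy features, and the frame width and overlap values here are assumptions.

```python
import numpy as np

def frame_features(line_img, frame_width=3, overlap=1, n_percentiles=5):
    """Slide a narrow window across a normalized text-line image and compute
    simple per-frame features (here: intensity percentiles of each frame),
    turning the 2-D image into a 1-D sequence of feature vectors."""
    h, w = line_img.shape
    step = frame_width - overlap
    frames = []
    for x in range(0, w - frame_width + 1, step):
        frame = line_img[:, x:x + frame_width].astype(float)
        frames.append(np.percentile(frame, np.linspace(0, 100, n_percentiles)))
    return np.stack(frames) if frames else np.zeros((0, n_percentiles))
```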

In segmentation-based OCR systems, features are usually extracted from each word/character image. In [35], a chaincode contour feature is computed for each word, from which local contour extrema are extracted. To overcome the problem of recognizing damaged characters, [51] describes a scheme in which the segments of a polygonal approximation are used as features for training; in the recognition phase, small, fixed-size features are used and multiple short features are matched against each prototype in the training set [51]. By applying both global features (e.g., loops, ascenders, and descenders) for each word and contour transition histogram features for each segment, El-Yacoubi et al. proposed a word-level HMM-based multilingual OCR [14]. In Alaei et al.'s work, multiple features (under-sampled bitmap features, chain code features, gradient features, shadow features and key point features) are used to recognize Persian/Arabic handwritten characters [1].

Besides exploring new features, some researchers use feature transform approaches to improve the performance of OCR. For example, Chen et al. used a region-dependent non-linear feature transform (RDT) for handwriting recognition [11], which decreases the word error rate (WER) of an HMM-based multilingual handwriting recognition system. Other researchers avoid explicit feature extraction for word/character images by using neural networks (NNs). In [56], Yuan et al. proposed a convolutional neural network OCR to recognize Latin script, where the input is the raw character image and separate feature extraction is not needed. Similarly, Elagouni et al. applied an NN-based recognizer to the problem of scene text OCR, using multi-scale windows to extract raw character images from the scene image [15].

5. RECOGNITION MODELING

Many efforts have been made to extend traditional OCR to multilingual applications. As discussed in the previous sections, multilingual OCR can be grouped into two categories, which are analyzed below.

5.1 Segmentation-based recognition

Segmentation-based OCR systems, such as the open-source Tesseract engine, perform character segmentation explicitly and refine it in a recognition-driven framework. Tesseract uses a two-step classification process for character recognition. In the first step, a short list of candidate characters is identified for each unknown character based on a look-up table matching and counting approach. In the second step, more accurate similarity measures between the candidates and the unknown character are calculated. The final word recognition hypothesis is obtained by using simple linguistic analysis. Attempts have been made to adapt the Tesseract engine to the desired multilingual OCR framework, namely to enable generic multilingual operation such that negligible customization is required for a new language beyond providing a corpus of text [52]. The accuracy of Tesseract on handwritten text is relatively low due to weaknesses in the classifier design and the small size of the training data.

A segmentation-based OCR method requires considerable time to reconfigure the segmentation algorithm for optimized performance on new languages. Instead of relying on pre-trained character models, an alternative approach is to use unsupervised learning on the fly. In [26], Kae et al. proposed a font-free approach toward multilingual OCR in which the frequency of similar symbols is calculated for an individual document and a document-specific model is trained iteratively. Although the comparison on English and Greek reported in [26] shows that the proposed approach cannot beat existing commercial OCR engines, it remains a promising research direction that can be used to adapt existing OCR engines for multilingual use.

5.2 Segmentation-free recognition

Unlike segmentation-based OCR, segmentation-free OCR analyzes each text line sequentially. Hidden Markov models (HMMs) can be integrated into multilingual OCR systems using a sliding-window approach. The idea of applying HMMs to OCR was inspired by the success of HMMs in speech recognition: a sliding window is used, local features are computed from a sequence of overlapping frames, and the two-dimensional text line image is thereby converted into a one-dimensional frame sequence.

In [38, 39], a hidden Markov model (HMM) based framework was proposed to recognize machine-printed and handwritten text, and it was incorporated into the Byblos OCR system. In the Byblos OCR, given a sequence of observed feature vectors X, the aim is to find the sequence of characters C that maximizes the posterior probability P(C|X). By applying Bayes' rule, P(C|X) can be written as:

\hat{C} = \arg\max_C P(C \mid X) = \arg\max_C \frac{P(C)\,P(X \mid C)}{P(X)} = \arg\max_C P(C)\,P(X \mid C) \qquad (1)

where P(X|C) is the so-called glyph model and P(C) is the n-gram language model, which are two separate components of the Byblos OCR.
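
In practice the two components are combined in the log domain, for instance when re-scoring an N-best list of hypotheses produced by the search. The sketch below assumes hypothetical glyph_logprob and lm_logprob scoring functions; it is not the Byblos decoder itself.

```python
def score_hypothesis(chars, glyph_logprob, lm_logprob):
    """Log of Eq. (1) with the constant P(X) dropped: log P(X|C) + log P(C)."""
    return glyph_logprob(chars) + lm_logprob(chars)

def rescore_nbest(candidates, glyph_logprob, lm_logprob):
    """Pick the candidate character sequence with the best combined score."""
    return max(candidates,
               key=lambda c: score_hypothesis(c, glyph_logprob, lm_logprob))
```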

The glyph model P(X|C) uses an HMM to represent local features captured from the text line image; the HMM has two layers, a hidden layer and an observed layer. Fig. 4 shows a 5-state HMM, with transition probabilities in the hidden layer and an output probability distribution associated with each of the five states in the observed layer. The probability distributions are defined over the feature vector X, which has 15 dimensions in the Byblos OCR. In the training phase, the transition probabilities between states, the weights of the Gaussian mixture models, and the means and covariances of the Gaussian distributions within each state are estimated using the Baum-Welch algorithm.

Unlike segmentation-based OCR, an HMM can model not only a single character: by concatenating character HMMs, it can easily be extended to word HMMs and text line HMMs. A detailed algorithm for HMM concatenation is given in [45]. In addition, to achieve robust parameter estimation, different tied-mixture models are employed in the Byblos OCR depending on the amount of training data. In the Character Tied Mixture (CTM) mode, the states of each character share one codebook of Gaussians. In the State Tied Mixture (STM) mode, each state of every HMM has its own codebook of Gaussians [39].
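
Concatenating character HMMs into a word HMM amounts to chaining their transition matrices. The sketch below assumes each per-character matrix leaves some probability mass in each row as its exit probability, which is one common convention for this construction rather than the specific formulation of [45].

```python
import numpy as np

def concat_char_hmms(char_trans_list):
    """Build a word-level transition matrix by chaining per-character
    transition matrices: the leftover row mass (exit probability) of each
    character feeds the first state of the next character. The last
    character's exit mass simply leaves the word model."""
    sizes = [t.shape[0] for t in char_trans_list]
    total = sum(sizes)
    T = np.zeros((total, total))
    offset = 0
    for i, trans in enumerate(char_trans_list):
        s = trans.shape[0]
        T[offset:offset + s, offset:offset + s] = trans
        exit_prob = 1.0 - trans.sum(axis=1)          # probability of leaving this character
        if i + 1 < len(char_trans_list):
            T[offset:offset + s, offset + s] += exit_prob  # jump to next character's entry state
        offset += s
    return T
```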

The n-gram language model P(C) in the Byblos OCR system uses a Markov chain to statistically describe the probability of a sequence of characters: it computes P(C) by multiplying the probabilities of consecutive groups of n characters or words.

Formally, C can be expressed as a sequence of characters C = c_1, c_2, \ldots, c_T. By the chain rule of probability, P(C) is given by:

P(C) = P(c_1)\,P(c_2 \mid c_1) \cdots P(c_T \mid c_1, \ldots, c_{T-1}) = \prod_{t=1}^{T} P(c_t \mid c_1, \ldots, c_{t-1}) \qquad (2)

Considering the context dependency and the Markov property, P(C) can be rewritten using only a short "history" of the character sequence:

P(C) \approx \prod_{t=1}^{T} P(c_t \mid c_{t-n+1}, \ldots, c_{t-1}) \qquad (3)
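
Eq. (3) corresponds to an ordinary character n-gram model estimated from counts. The toy class below uses add-one smoothing purely for illustration; production systems use far more careful smoothing and back-off schemes.

```python
from collections import Counter

class CharNgramLM:
    """Tiny character n-gram model illustrating Eq. (3):
    P(c_t | c_{t-n+1}..c_{t-1}) estimated from counts with add-one smoothing."""
    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()

    def train(self, text):
        padded = "^" * (self.n - 1) + text        # "^" marks the start of a line
        self.vocab.update(padded)
        for i in range(self.n - 1, len(padded)):
            ctx, c = padded[i - self.n + 1:i], padded[i]
            self.ngrams[(ctx, c)] += 1
            self.contexts[ctx] += 1

    def prob(self, ctx, c):
        V = len(self.vocab)
        return (self.ngrams[(ctx, c)] + 1) / (self.contexts[ctx] + V)

    def sequence_prob(self, text):
        padded = "^" * (self.n - 1) + text
        p = 1.0
        for i in range(self.n - 1, len(padded)):
            p *= self.prob(padded[i - self.n + 1:i], padded[i])
        return p

# lm = CharNgramLM(n=3); lm.train("multilingual ocr"); lm.sequence_prob("ocr")
```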

During the recognition phase, the Byblos OCR system uses a two-pass beam search [3] to find an optimal state sequence, where a tri-gram language model is used to guide the search. In principle, an HMM-based OCR can be used as a multilingual OCR with little additional effort. In [39], the authors demonstrated the capability of the Byblos OCR system for multilingual OCR, where three different scripts (Arabic, Chinese and English) were tested.

Currently, the Markov-model based framework (glyph model plus language model) has become the predominant approach for multilingual OCR systems. Other attempts at using the HMM framework for multilingual OCR include the ESMERALDA OCR engine [55], which uses a framework similar to BBN's Byblos OCR system and is used for video text recognition, and the CENPARMI OCR engine [14], which is designed for handwriting recognition and can also be used for hybrid recognition. In [29], a multilingual OCR system is described that operates independently of the nature of the script and was tested on Arabic and Latin scripts. By changing the feature sequences, Schambach et al. successfully adapted Siemens' HMM-based Latin-script OCR to Arabic script [47].

Recently, other advanced classification approaches have been explored for multilingual OCR. For example, in [20], Graves and Schmidhuber proposed a recurrent neural network based approach that combines multidimensional recurrent neural networks (RNNs) with connectionist temporal classification for multilingual OCR tasks. This framework ostensibly does not rely on any knowledge of the script for preprocessing and can be used for multiple languages without modification. By combining an RNN with an HMM, Menasri et al. proposed a hybrid OCR which achieved high accuracy for OCR of Latin-like scripts [36]. Table 1 summarizes several state-of-the-art OCR systems used for multilingual tasks.

Figure 4: An example of a 5-state, left-to-right HMM with the Bakis topology.

Table 1: Overview of state-of-the-art multilingual OCR systems

SYSTEM             Technology   Language Applied
BBN Byblos [39]    HMM          Multilingual
SIEMENS [27, 47]   HMM          Multilingual
TUM [20]           RNN          Latin script, Arabic
A2iA [36]          HMM & RNN    Latin script, Arabic
UOB-ENST [13]      HMM          Arabic
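
As a rough idea of how an RNN-plus-CTC line recognizer is wired up, in the spirit of [20] and the hybrid systems above but not their actual code, here is a PyTorch sketch; the feature dimension, network size, alphabet size and the random tensors are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class LineRecognizer(nn.Module):
    """Illustrative BLSTM + CTC line recognizer: the recurrent network reads
    the per-frame feature sequence and CTC training removes the need for
    explicit character segmentation."""
    def __init__(self, feat_dim=15, hidden=128, num_classes=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_classes)   # class 0 = CTC blank

    def forward(self, frames):                 # frames: (T, N, feat_dim)
        out, _ = self.rnn(frames)
        return self.proj(out).log_softmax(dim=2)   # (T, N, num_classes)

# Toy training step on random data, just to show the CTC plumbing.
model = LineRecognizer()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
frames = torch.randn(200, 4, 15)               # 200 frames, batch of 4 lines
targets = torch.randint(1, 100, (4, 30))       # padded label sequences (no blanks)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.randint(10, 31, (4,))
loss = ctc(model(frames), targets, input_lengths, target_lengths)
loss.backward()
```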

6. SUMMARY

After decades of development, the state of the art in multilingual OCR has advanced from primitive schemes for machine-printed single-language/script recognition to the application of sophisticated techniques for recognition of multilingual handwritten scripts. The performance of handwriting recognition systems indicates that acceptable recognition accuracy can be obtained on standard documents in languages that are resource-rich, where corpora exist from which effective language models can be built and domain knowledge can be exploited. However, a fully re-targetable multilingual OCR system that can recognize any script after simply re-training on data in that script is not easy to achieve. Further research on techniques such as deep learning [12] can perhaps help tackle this challenging problem. Recognition of multilingual scene text also presents many interesting research problems that are very different from those of conventional documents in all aspects of the recognition pipeline, from binarization to recognition, and will likely require much greater use of context and reasoning to produce useful results.

7. REFERENCES

[1] A. Alaei, U. Pal, and P. Nagabhushan. A comparative study of Persian/Arabic handwritten character recognition. In Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition, pages 123–128, 2012.

[2] Y. Alginahi. A survey on Arabic character segmentation. International Journal on Document Analysis and Recognition (IJDAR), 16(2):105–126, 2013.

[3] S. Austin, R. Schwartz, and P. Placeway. The forward-backward search algorithm, 1991.

[4] P. Bheganan, R. Nayak, and Y. Xu. Thai word segmentation with hidden Markov model and decision tree. In Advances in Knowledge Discovery and Data Mining, volume 5476, pages 74–85, 2009.

[5] T. M. Breuel. Two geometric algorithms for layout analysis. In Document Analysis Systems V, pages 188–199. Springer Berlin Heidelberg, 2002.

[6] T. M. Breuel. High performance document layout analysis, 2003.

[7] S. Bukhari, F. Shafait, and T. Breuel. Script-independent handwritten textlines segmentation using active contours. In Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on, pages 446–450, 2009.

[8] H. Cao, R. Prasad, P. Natarajan, and E. MacRostie. Robust page segmentation based on smearing and error correction unifying top-down and bottom-up approaches. In Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, volume 1, pages 392–396, 2007.

[9] S. Chanda, O. Terrades, and U. Pal. SVM based scheme for Thai and English script identification. In Document Analysis and Recognition, Ninth International Conference on, volume 1, pages 551–555, 2007.

[10] J. Chen, H. Cao, R. Prasad, A. Bhardwaj, and P. Natarajan. Gabor features for offline Arabic handwriting recognition. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 53–58, 2010.

[11] J. Chen, B. Zhang, H. Cao, R. Prasad, and P. Natarajan. Applying discriminatively optimized feature transform for HMM-based off-line handwriting recognition. In Frontiers in Handwriting Recognition (ICFHR), International Conference on, pages 219–224, 2012.

[12] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[13] R. El-Hajj, L. Likforman-Sulem, and C. Mokbel. Arabic handwriting recognition using baseline dependant features and hidden Markov modeling. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, volume 2, pages 893–897, 2005.

[14] A. El-Yacoubi, M. Gilloux, R. Sabourin, and C. Suen. An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(8):752–760, 1999.

[15] K. Elagouni, C. Garcia, F. Mamalet, and P. Sebillot. Combining multi-scale character recognition and linguistic knowledge for natural scene text OCR. In Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, pages 120–124, 2012.

[16] F. Favata and G. Srikantan. A multiple feature/resolution approach to handprinted digit and character recognition. International Journal of Image Systems and Technology, 17(4):304–311, 1998.

[17] H. Fujisawa. Robustness design of industrial strength recognition systems. In Digital Document Processing, Advances in Pattern Recognition, pages 185–212. Springer London, 2007.

[18] B. Gatos, I. Pratikakis, and S. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39(3):317–327, 2006.

[19] B. Gatos, N. Stamatopoulos, and G. Louloudis. ICDAR 2009 handwriting segmentation contest. In Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on, pages 1393–1397, 2009.

[20] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545–552, 2008.

[21] J. Ha, R. Haralick, and I. Phillips. Recursive X-Y cut using bounding boxes of connected components. In Document Analysis and Recognition, 1995. Proceedings of the Third International Conference on, volume 2, pages 952–955, 1995.

[22] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns. Automatic script identification from document images using cluster-based templates. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(2):176–181, 1997.

[23] D. Ittner and H. Baird. Language-free layout analysis. In Document Analysis and Recognition, 1993. Proceedings of the Second International Conference on, pages 336–340, 1993.

[24] Z. Jiang, X. Ding, C. Liu, and Y. Wang. A novel short merged off-line handwritten Chinese character string segmentation algorithm using hidden Markov model. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 668–672, 2011.

[25] G. D. Joshi, S. Garg, and J. Sivaswamy. A generalised framework for script identification. Int. J. Doc. Anal. Recognit., 10(2):55–68, 2007.

[26] A. Kae and E. Learned-Miller. Learning on the fly: Font-free approaches to difficult OCR problems. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 571–575, 2009.

[27] A. Kaltenmeier, T. Caesar, J. Gloger, and E. Mandler. Sophisticated topology of hidden Markov models for cursive script recognition. In Document Analysis and Recognition, Proceedings of the Second International Conference on, pages 139–142, 1993.

[28] T. Kasar and A. Ramakrishnan. COCOCLUST: Contour-based color clustering for robust binarization of colored text. In Proceedings of the International Workshop on Camera-based Document Analysis and Recognition, pages 11–17, 2009.

[29] Y. Kessentini, T. Paquet, and A.-M. Ben Hamadou. A multi-lingual recognition system for Arabic and Latin handwriting. In Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on, pages 1196–1200, 2009.

[30] M. Kim and I.-S. Oh. Script-free text line segmentation using interline space model for printed document images. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 1354–1358, 2011.

[31] K. Kise, A. Sato, and M. Iwata. Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding, 70(3):370–382, 1998.

[32] J. Kumar, W. Abd-Almageed, L. Kang, and D. Doermann. Handwritten Arabic text line segmentation using affinity propagation. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 135–142, 2010.

[33] Y. Li, Y. Zheng, D. Doermann, and S. Jaeger. Script-independent text line segmentation in freestyle handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1313–1329, 2008.

[34] Z. Lu, R. Schwartz, and C. Raphael. Script-independent, HMM-based text line finding for OCR. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 4, pages 551–554, 2000.

[35] S. Madhvanath, G. Kim, and V. Govindaraju. Chaincode contour processing for handwritten word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21:928–932, 1999.

[36] F. Menasri, J. Louradour, A. Bianne-Bernard, and C. Kermorvant. The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition. In Proceedings of SPIE, volume 8297, 2012.

[37] M. Agrawal and D. Doermann. Context-aware and content-based dynamic Voronoi page segmentation. In The Ninth IAPR International Workshop on Document Analysis Systems, pages 73–80, 2010.

[38] P. Natarajan, Z. Lu, R. Schwartz, I. Bazzi, and J. Makhoul. Multilingual machine printed OCR. International Journal of Pattern Recognition and Artificial Intelligence, 15(01):43–63, 2001.

[39] P. Natarajan, S. Saleem, R. Prasad, E. MacRostie, and K. Subramanian. Multi-lingual offline handwriting recognition using hidden Markov models: A script-independent approach. In Arabic and Chinese Handwriting Recognition, pages 231–250. Springer Berlin Heidelberg, 2008.

[40] L. Neumann and J. Matas. Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 3538–3545, 2012.

[41] L. O'Gorman. The document spectrum for page layout analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 15(11):1162–1173, 1993.

[42] U. Pal and B. Chaudhuri. Automatic identification of English, Chinese, Arabic, Devnagari and Bangla script line. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, pages 790–794, 2001.

[43] X. Peng, H. Cao, R. Prasad, and P. Natarajan. Text extraction from video using conditional random fields. In Document Analysis and Recognition (ICDAR), International Conference on, pages 1029–1033, 2011.

[44] T. Q. Phan, P. Shivakumara, Z. Ding, S. Lu, and C. Tan. Video script identification based on text lines. In Document Analysis and Recognition (ICDAR), International Conference on, pages 1240–1244, 2011.

[45] T. Plotz and G. Fink. Markov models for offline handwriting recognition: a survey. International Journal on Document Analysis and Recognition (IJDAR), 12(4):269–298, 2009.

[46] Y. Rangoni, J. van Beusekom, and T. M. Breuel. Language independent thresholding optimization using a Gaussian mixture modelling of the character shapes. In Proceedings of the International Workshop on Multilingual OCR, pages 1–9, 2009.

[47] M.-P. Schambach, J. Rottland, and T. Alary. How to convert a Latin handwriting recognition system to Arabic. In International Conference on Frontiers in Handwriting Recognition, 2008.

[48] H. F. Schantz. Recognition Technologies Users Association, 1982.

[49] K. Sesh Kumar, S. Kumar, and C. Jawahar. On segmentation of documents in complex scripts. In Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, volume 2, pages 1243–1247, 2007.

[50] Z. Shi, S. Setlur, and V. Govindaraju. A steerable directional local profile technique for extraction of handwritten Arabic text lines. In Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on, pages 176–180, 2009.

[51] R. Smith. An overview of the Tesseract OCR engine. In Proceedings of the International Conference on Document Analysis and Recognition, volume 2, pages 629–633, 2007.

[52] R. Smith, D. Antonova, and D.-S. Lee. Adapting the Tesseract open source OCR engine for multilingual OCR. In Proceedings of the International Workshop on Multilingual OCR, pages 1–8, 2009.

[53] S. Srihari, V. Govindaraju, and A. Shekhawat. Interpretation of handwritten addresses in US mailstream. In Document Analysis and Recognition, 1993. Proceedings of the Second International Conference on, pages 291–294, 1993.

[54] X. Wei, S. Ma, and Y. Jin. Segmentation of connected Chinese characters based on genetic algorithm. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 645–649, 2005.

[55] M. Wienecke, G. Fink, and G. Sagerer. Toward automatic video-based whiteboard reading. International Journal of Document Analysis and Recognition (IJDAR), 7(2-3):188–200, 2005.

[56] A. Yuan, G. Bai, L. Jiao, and Y. Liu. Offline handwritten English character recognition based on convolutional neural network. In Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, pages 125–129, 2012.

