
IJDAR (2012) 15:71–83 DOI 10.1007/s10032-011-0148-6

ORIGINAL PAPER

CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image

Ram Sarkar · Nibaran Das · Subhadip Basu · Mahantapas Kundu · Mita Nasipuri · Dipak Kumar Basu

Received: 3 June 2010 / Revised: 18 August 2010 / Accepted: 24 January 2011 / Published online: 24 February 2011 © Springer-Verlag 2011

Abstract In this paper, we describe the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and Bangla text mixed with English words. This is the first handwritten database in this area available as an open source document. As India is a multi-lingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 pages are written purely in Bangla script and the remaining 50 pages are written in Bangla text mixed with English words. The handwritten pages were collected from different data sources. After collection, all the documents were preprocessed and distributed into two groups, i.e., CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. Finally, we also provide ground truth images for line segmentation. To generate the ground truth images, we first labeled each line in a document page automatically by applying one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and then corrected any errors using our tool GT Gen 1.1. Line extraction accuracies of 90.6 and 92.38% are achieved on the two databases, respectively, using our algorithm. Both databases, along with the ground truth annotations and the ground truth generating tool, are freely available at http://code.google.com/p/cmaterdb.

Electronic supplementary material The online version of this article (doi:10.1007/s10032-011-0148-6) contains supplementary material, which is available to authorized users.

R. Sarkar · N. Das · S. Basu · M. Kundu · M. Nasipuri (B)
Computer Science and Engineering Department, Jadavpur University, Kolkata 700032, India
e-mail: [email protected]

R. Sarkar
e-mail: [email protected]

N. Das
e-mail: [email protected]

S. Basu
e-mail: [email protected]

M. Kundu
e-mail: [email protected]

D. K. Basu
A.I.C.T.E. Emeritus Fellow, Computer Science and Engineering Department, Jadavpur University, Kolkata 700032, India
e-mail: [email protected]

Keywords Unconstrained handwritten document image database · Text line extraction · Ground truth preparation · OCR of multi-script document

1 Introduction

OCR involves computer recognition of characters from digitized images of optically scanned document pages. The characters thus recognized are coded in the American Standard Code for Information Interchange (ASCII) or some other standard code and stored in a file, which can then be edited like any other file created with word processing software or a text editor. A scanner with an OCR facility thus allows the contents of document pages to be edited after they are scanned optically.

Identification of text lines is the first and most important step in the process of OCR of handwritten/printed document images. If text line identification is not accurate, then none of the words and characters in the constituent text lines can be identified correctly. Such errors are unacceptable for large-scale recognition of documents.

The problem of text line identification is more difficult for handwritten documents than for printed documents. For example, all the text lines in a handwritten document may be skewed with respect to the horizontal axis, or individual lines may be non-parallel to one another. A text line may also be curvy, or lines may be written close to one another, making line segmentation difficult. Adjacent text lines may even touch one another at multiple points. All such cases make line segmentation from handwritten digitized documents a challenging research problem.

The next step in an OCR system for handwritten documents is word identification. Words in a text line must be identified accurately for the constituent characters to make grammatical sense. Word identification poses its own set of problems. In many of the techniques developed for word extraction so far, words are identified after identification of the text lines, so the performance of the word extraction technique depends completely on that of the text line identification technique. Word identification also becomes very difficult for text lines with varying skew or slant.

Despite the importance of line segmentation for the successful development of any OCR system, especially for handwritten text, an acceptable degree of accuracy has not yet been achieved for unconstrained handwritten documents, and hence it remains an open research problem.

Research on extraction/segmentation of unconstrained handwritten text lines from digitized document pages is limited in the literature [7,10–17,21]. In one of our earlier works [7], we reviewed contemporary research contributions related to extraction of handwritten/printed text lines from digitized document pages. Current research contributions related to extraction of unconstrained handwritten text lines from digitized document pages may be classified into several categories, viz., Hough Transformation based techniques [10,11,16], statistical approaches using Minimal Spanning Tree, Probability Distribution Function (PDF), etc. [12–14,17], and typical morphological approaches using run length encoding, water flow technique, analysis of neighborhood components, etc. [7,15,21]. Some word extraction methodologies are described in [10,16,24,25].

1.1 Need for standardization of experimental data

Handwritten OCR is a challenging and open problem in Pattern Recognition, on which work started in the early 1960s. Generally, the majority of researchers in this field have prepared their own databases for their respective experiments, making uniform assessment of any methodology a difficult task. At present, few public domain databases are available for handwritten OCR. The most popularly used databases in this field are NIST [1], IAM-DB [2], CENPARMI [3], CEDAR [26], etc. Most of these databases [1,26] are not freely available. The IAM-DB database [2] consists of handwritten English script. The CEDAR database [26] consists of city names, state names, ZIP codes, and alphanumeric characters. Moreover, some handwritten numeral databases [1] aim at specialized applications, such as recognition of postal codes. In addition to the English databases, there are databases for other languages [4,5]: Korean and JIS Chinese scripts were used for the databases in [4] and [5], respectively. Databases are also available for Arabic script [27] and Japanese script [28]. The ICDAR 2009 Handwriting Segmentation Contest [6] also provided a set of handwritten document pages written in Latin script for English, Greek, French, and German languages. A database is also available in [9] for some of the handwritten Indic scripts, viz., Bangla, Devanagari, and Oriya, but it covers only isolated digits, characters, modifiers, and compound characters. The authors of [23] also provided two databases of handwritten numerals for two Indian scripts, viz., Devanagari and Bangla.

However, there is no public domain database available for unconstrained handwritten document pages written in any of the popular Indic scripts. It may be worth mentioning at this point that Indic, Arabic, and Chinese scripts require special techniques to implement handwritten OCR algorithms on digitized document pages. Another typical attribute of Indic script documents is the presence of Latin script words in many unconstrained documents, which makes the OCR process more complex. More than 960 million people use Indic scripts [19] worldwide, and that large population is itself a motivating factor in developing benchmark datasets for such scripts. Previous research on Indic script recognition systems was reported on the basis of databases collected in the laboratory. Future research in this domain, however, requires standard databases fulfilling certain criteria depending on the application domain.

For the aforementioned reasons, we have been motivated to prepare two handwritten document databases. The first one contains only Bangla words, and the other contains both Bangla and English words. The second category of document pages is more common, and more difficult to segment and recognize because of the presence of two contrasting types of scripts in it. The following section discusses the origin of mixed script documents in India and some basic characteristics of Bangla script. The line/word extraction techniques for both these categories of documents may arguably perform in a similar fashion (despite sharp differences in the writing styles of the two scripts under consideration), but character segmentation from Matra-based Bangla word images (a Matra is a horizontal line touching the upper part of a basic or compound character) is different from that of Latin word images. Therefore, intelligent classification techniques are required to distinguish words of two different scripts in a mixed script document. In one of our earlier works [30], an effort was made in this direction. However, benchmark databases of mixed script document images are required to extend research initiatives in that direction. This has been one of our primary motivations for including the mixed script document image dataset (along with the Bangla script document image dataset) in our current work.

1.2 Origin of mixed script documents in India

With nearly 207 million total speakers [18], Bangla is one of the most spoken languages (ranking 5th or 6th) [19] in the world. Bangla, the official language of Bangladesh, is the primary language spoken there and is the second most spoken language in India. In India, Bangla is mostly used in West Bengal, Tripura, and Assam. Moreover, Bangla script is also used for two other Indian languages, viz., Assamese and Manipuri.

In any Indic script, including Bangla, region-specific minor variations in the shape of the scripts written by individuals are sometimes observed. Another interesting point regarding handwritten documents in Indic scripts is that people often write one or more words in English. The possible reasons for this are:

a. India is a multi-lingual country.
b. India has a colonial past.
c. English is widely used for official purposes.
d. English is usually taught in schools.
e. Most of the books followed in higher studies are either written in English or in regional languages with English words for universally used terms.

We observed that while writing scientific or technical information (on subjects such as physics, chemistry, mathematics, computer science, etc.) in an Indic script, the writer might casually enter one or more English words, generating a mixed script document.

1.3 Characteristics of Bangla script

Characters of Bangla script can be grouped into five categories, viz., vowel, consonant, modified shape, compound character, and punctuation symbol. Of these, vowels and consonants, which constitute the Bangla alphabet, are called basic characters. There are 11 vowels and 39 consonants in the Bangla alphabet. There is no concept of upper and lower case characters in Bangla script. Characters in Bangla script are written from left to right. A vowel following a consonant in a word takes a modified shape in Bangla script. Such shapes of all vowels are termed modified shapes. It is noteworthy that some modified shapes attached to a consonant have two isolated parts appearing on two opposite sides of the consonant. Some modified shapes may appear just below the consonant, and some may reach its top from one of its sides with a curved or partly curved segment. So, characters in Bangla script may not always appear in non-overlapping consecutive positions. Depending on the mode of pronunciation, a Bangla consonant followed by one or two consonants takes a complex shape, which is called a compound character. There are in all 280 compound characters in Bangla script. Apart from the basic characters, the modified shapes, and the compound characters, Bangla script also includes 10 digit patterns. An important feature of Bangla characters is the Matra or head line. Except for a few, all basic and compound characters of Bangla script have this feature. The width of a Matra is nearly the same as the width of the character it touches. All the Matras of consecutive characters appearing in a Bangla word are joined to form a common Matra of the characters appearing in the word.

The rest of the paper describes the database nomenclature, the data collection methodology, the data processing techniques employed, and the detailed composition of the database. The description and availability of the database and of the ground truth generating software developed by us are also discussed, along with detailed reports of the benchmark performance of our line extraction methodology on the newly developed database.

2 Detailed dataset description

We have named our database CMATERdb1, where CMATER stands for Center for Microprocessor Application for Training Education and Research, a research laboratory at the Computer Science and Engineering department of Jadavpur University, India, where the current research activity took place. db stands for database, and the numeric value 1 denotes the handwritten line segmentation database prepared for the current work. Currently, we have developed two variations of CMATERdb1, viz., CMATERdb1.1, a database of handwritten document pages containing Bangla words only, and CMATERdb1.2, a database of handwritten document pages containing both Bangla and English words. The first versions of these databases are released as CMATERdb1.1.1 and CMATERdb1.2.1, respectively. The database is freely available on the CMATER website (www.cmaterju.org) and at http://code.google.com/p/cmaterdb.

2.1 Data collection methodologies

The materials of the handwritten document pages for the proposed databases have been collected from three different types of sources, viz., class notes of students of different age-groups, handwritten manuscripts of a popular Bangla monthly magazine “Computer Jagat” [29], and the document pages written by different persons, on request, under our supervision.

Fig. 1 Sample document images from CMATERdb1. a A handwritten document page from the database CMATERdb1.1.1. b A handwritten document page from CMATERdb1.2.1

The document pages written under our supervision were collected from various persons, with textual contents taken from newspaper articles and Bangla textbooks containing both Bangla and English vocabulary. The writers were asked to use a black or blue ink pen and to write inside A4 size pages. No other restrictions were imposed regarding the kind of pen they used or the style of writing chosen. Special attention was paid to ensuring data collection from writers of different age-groups and educational levels. Moreover, we collected the pages from different places (home, office, school, etc.) in order to include different styles of writing. In total, 25 men and 15 women participated in this data collection drive. The main characteristics of our database are as follows:

• 95% of the writers were native Bengali.
• Places of data collection: 40% in schools/colleges, 40% in writers’ homes, and 20% in public places.
• Educational level of the writers: 20% 10th standard school, 40% high school, and 40% college and university.
• Writers’ age: 40% between 15 and 25 years, 30% between 25 and 35 years, and 30% between 35 and 55 years.

2.2 Data processing techniques

All the document pages were scanned using a flatbed scanner with 300 dpi gray scale image resolution. Each page, meant for the databases CMATERdb1.1.1 and CMATERdb1.2.1, is stored in bitmap file format with the naming convention B###.bmp and BE###.bmp, respectively. ### is a unique integer number given to the file name to maintain sequence, and B or BE refers to the document type, i.e., Bangla or Bangla–English, respectively. Some sample images from both these databases are shown in Fig. 1a, b.

After scanning, the documents were binarized by a simple adaptive thresholding technique, where the threshold was chosen as the mean of the maximum and minimum gray level values in each document image. All the binarized images were archived in DAT format, where the foreground and background pixels were represented as ‘0’ and ‘1’, respectively. Then, the documents were preprocessed in order to remove the remaining salt-and-pepper noise and artifacts such as long lines in the border zone(s). To remove discontinuity at the pixel level, we have used erosion and dilation [8], two popularly used morphological operators in image processing. Figure 2b shows the output of the preprocessing technique for a sample image, shown in Fig. 2a, taken from the database. All the binarized images are finally used to prepare the ground truth annotations for the database.

Fig. 2 Application of preprocessing technique on a sample image taken from CMATERdb1.2.1. a Before preprocessing technique. b After preprocessing technique
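The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `binarize` is hypothetical, the page is assumed to be already loaded as rows of 8-bit gray levels, and pixels exactly at the threshold are treated as foreground because the text does not say how ties are handled. The morphological clean-up step is not shown.

```python
def binarize(gray):
    """Binarize a gray scale page given as a list of rows of gray levels.

    The threshold is the mean of the minimum and maximum gray levels found
    in the image. Following the convention in the text, foreground (ink) is
    stored as 0 and background as 1.
    """
    lowest = min(min(row) for row in gray)
    highest = max(max(row) for row in gray)
    threshold = (lowest + highest) / 2
    # Assumption: dark pixels are ink; ties at the threshold count as ink.
    return [[0 if pixel <= threshold else 1 for pixel in row] for row in gray]


if __name__ == "__main__":
    page = [[250, 240, 30],
            [20, 245, 255]]
    for row in binarize(page):
        print(row)
```

Because the threshold is derived from each image separately, a page scanned with faint ink and a page scanned with dark ink get different thresholds, which seems to be the sense in which the text calls the method adaptive.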

2.3 Composition of the database

CMATERdb1.1, the Bangla script handwritten document database, contains 100 pages in its first version. CMATERdb1.2 contains 50 handwritten document pages containing both Bangla and English words. Each document page of the databases is described with the help of attributes such as page dimensions, i.e., height, width, and aspect ratio, counts of the number of lines and words, and statistical estimates of the horizontal and vertical stroke widths. Detailed descriptions of all the document pages of the two databases are uploaded as supplementary files on the database website [http://code.google.com/p/cmaterdb]. Descriptions of some of the sample pages, and the averages and standard deviations of all the attributes over all the document pages of the two databases, are shown in Tables 1 and 2, respectively.
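The text does not define the aspect ratio explicitly, but the tabulated values appear consistent with height divided by width. A minimal check in Python, where `aspect_ratio` is a hypothetical helper name and the inputs are the height and width reported for pages B001 (Table 1) and BE001 (Table 2):

```python
def aspect_ratio(height, width):
    """Height divided by width, rounded to four decimals as in the tables."""
    return round(height / width, 4)


if __name__ == "__main__":
    print(aspect_ratio(3199, 2575))  # page B001
    print(aspect_ratio(2573, 2493))  # page BE001
```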

Table 1 Detailed description of some document images taken from the database CMATERdb1.1.1

| Page number | Height | Width | Aspect ratio | Number of lines | Number of words | Average horizontal stroke width | Average vertical stroke width | Standard deviation of horizontal stroke width | Standard deviation of vertical stroke width |
|---|---|---|---|---|---|---|---|---|---|
| B001 | 3199 | 2575 | 1.2423 | 21 | 144 | 3.9412 | 3.9757 | 1.2722 | 1.3044 |
| B002 | 1024 | 766 | 1.3368 | 26 | 216 | 1.9338 | 2.3708 | 0.5295 | 0.9492 |
| B008 | 3295 | 2491 | 1.3228 | 25 | 205 | 3.1842 | 3.2761 | 0.7162 | 0.7430 |
| B015 | 3162 | 2664 | 1.1869 | 18 | 114 | 11.6803 | 12.0958 | 6.2258 | 6.2620 |
| B027 | 3906 | 2796 | 1.3970 | 21 | 130 | 6.7254 | 6.5931 | 3.2529 | 2.8776 |
| B028 | 3283 | 1459 | 2.2502 | 18 | 86 | 7.0089 | 6.9808 | 2.5598 | 2.5169 |
| B038 | 3361 | 2497 | 1.3460 | 13 | 54 | 2.4946 | 2.7321 | 1.1327 | 1.6892 |
| B040 | 2976 | 2862 | 1.03998 | 18 | 143 | 8.1749 | 8.2079 | 2.6822 | 2.6599 |
| B046 | 2041 | 2503 | 0.8154 | 11 | 67 | 9.1240 | 9.0114 | 2.5383 | 2.3183 |
| B063 | 3417 | 2474 | 1.3812 | 33 | 246 | 5.0053 | 5.1989 | 1.2356 | 1.3030 |
| B069 | 1088 | 1380 | 0.7884 | 12 | 99 | 4.2643 | 4.7042 | 1.5903 | 1.4606 |
| B070 | 1021 | 768 | 1.3294 | 24 | 202 | 2.2025 | 2.7876 | 0.6175 | 1.0189 |
| B087 | 2984 | 2354 | 1.2676 | 28 | 267 | 6.8421 | 6.9198 | 1.4678 | 1.4204 |
| B092 | 1024 | 728 | 1.4066 | 24 | 179 | 2.2191 | 3.6843 | 0.6628 | 1.7699 |
| B100 | 1024 | 745 | 1.3745 | 25 | 186 | 2.5057 | 3.2399 | 4.8907 | 5.6640 |
| Avg. | 2601.66 | 1987.53 | 1.3142 | 21.63 | 157.3 | 5.5600 | 5.8145 | 1.7767 | 1.9337 |
| Std. Dev. | 893.07 | 621.51 | 0.2377 | 4.8920 | 48.3855 | 2.1880 | 2.0173 | 0.9517 | 0.9442 |

Table 2 Detailed description of some document images taken from the database CMATERdb1.2.1

| Page number | Height | Width | Aspect ratio | Number of lines | Number of words | Average horizontal stroke width | Average vertical stroke width | Standard deviation of horizontal stroke width | Standard deviation of vertical stroke width |
|---|---|---|---|---|---|---|---|---|---|
| BE001 | 2573 | 2493 | 1.0321 | 23 | 199 | 5.4737 | 5.4773 | 0.9941 | 0.9863 |
| BE003 | 2298 | 2556 | 0.8991 | 21 | 183 | 5.1947 | 5.0890 | 1.2968 | 1.2147 |
| BE006 | 3052 | 1852 | 1.6479 | 29 | 236 | 5.6277 | 5.7914 | 1.4511 | 1.5320 |
| BE007 | 3465 | 2549 | 1.3594 | 23 | 188 | 9.5661 | 9.6637 | 2.6812 | 2.4579 |
| BE008 | 2338 | 1476 | 1.5840 | 31 | 251 | 4.1351 | 5.5435 | 3.4899 | 4.4915 |
| BE015 | 3844 | 2656 | 1.4473 | 24 | 121 | 7.3274 | 7.5296 | 3.7894 | 3.7558 |
| BE018 | 1728 | 1411 | 1.2247 | 16 | 96 | 2.7794 | 3.6061 | 0.7700 | 1.7475 |
| BE020 | 3287 | 2308 | 1.4242 | 24 | 160 | 4.5359 | 5.1433 | 1.4930 | 2.0447 |
| BE022 | 2142 | 1505 | 1.4233 | 34 | 277 | 4.2108 | 4.3968 | 0.7328 | 0.7179 |
| BE025 | 3936 | 2797 | 1.4072 | 28 | 191 | 5.3964 | 5.2904 | 2.6035 | 2.5045 |
| BE034 | 1547 | 1558 | 0.9929 | 16 | 130 | 4.4867 | 4.7674 | 1.7473 | 1.5756 |
| BE035 | 2110 | 1565 | 1.3482 | 30 | 270 | 3.1953 | 3.7877 | 0.6802 | 0.9728 |
| BE036 | 2207 | 1530 | 1.4425 | 19 | 119 | 2.7633 | 4.0085 | 1.0466 | 1.9920 |
| BE040 | 1902 | 1272 | 1.4953 | 23 | 140 | 4.6024 | 4.6932 | 2.6635 | 2.6968 |
| BE050 | 3761 | 2637 | 1.4262 | 29 | 237 | 7.6946 | 7.7805 | 2.9026 | 3.0493 |
| Avg. | 3082.56 | 2218.54 | 1.3897 | 24.64 | 177.58 | 5.8535 | 6.1001 | 2.0175 | 2.1669 |
| Std. Dev. | 714.36 | 466.51 | 0.1441 | 4.1441 | 44.7186 | 1.5858 | 1.4232 | 0.8420 | 0.8811 |

Document attributes related to page dimensions are based on the scanned region of the images. In most cases, we have attempted to preserve the original physical page dimensions, but in some cases they may be compromised by misalignment during scanning or by cropping of torn page boundaries (especially relevant for manuscripts collected from external sources). The number of lines and words in each document image was counted manually at the CMATER research laboratory. These attributes are necessary for designing effective line/word extraction algorithms, as they give an estimate of the average line/word spacing in document images. These counts are also essential for performance evaluation of line/word extraction techniques. The stroke width in any binarized document image is estimated as the run of black pixels in a given direction (horizontal/vertical). Unlike the other features, these two features are computed programmatically and are particularly useful in estimating an important characteristic of the writer, namely the connectedness of the writing style. Such characteristics play key roles in line/word segmentation algorithms. Popularly used run-length-based features are particularly sensitive to the stroke width of any unconstrained handwritten document image. Run-length-based horizontalness and verticalness attributes in document/word images are widely used for character segmentation [31–34] from document images. To get the average horizontal stroke width, we have estimated the mean of all the continuous runs of black pixels along the rows. The average vertical stroke width is computed in a similar fashion from the column-wise runs of black pixels. The computation of these two features is illustrated in Fig. 3. To show the variability in writing styles of individuals, we have also provided, for each document page, the standard deviations of the horizontal and vertical stroke widths both in the supplementary material [http://code.google.com/p/cmaterdb] and in Tables 1 and 2.

Fig. 3 An illustration of horizontal and vertical stroke widths
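The run-length definition of stroke width given above can be made concrete with a short Python sketch. It is an illustration only: the function names are hypothetical, the image is assumed to be a list of rows with ink stored as 0 (the convention stated for the archived binarized images), and the population standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import mean, pstdev


def run_lengths(line, ink=0):
    """Lengths of the maximal runs of ink pixels in one row or column."""
    runs, count = [], 0
    for pixel in line:
        if pixel == ink:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def stroke_width_stats(binary, ink=0):
    """Mean and standard deviation of horizontal and vertical ink runs.

    Raises statistics.StatisticsError if the image contains no ink at all.
    """
    columns = [list(column) for column in zip(*binary)]
    horizontal = [r for row in binary for r in run_lengths(row, ink)]
    vertical = [r for column in columns for r in run_lengths(column, ink)]
    return {
        "h_mean": mean(horizontal),
        "h_std": pstdev(horizontal),  # assumption: population std. dev.
        "v_mean": mean(vertical),
        "v_std": pstdev(vertical),
    }


if __name__ == "__main__":
    ring = [[1, 0, 0, 0, 1],
            [1, 0, 1, 0, 1],
            [1, 0, 0, 0, 1]]
    print(stroke_width_stats(ring))
```

In the small ring-shaped example, the top and bottom rows each contribute one run of length 3 and the middle row two runs of length 1, so thin strokes crossed at right angles pull the average down while strokes running along the scan direction pull it up.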

The orientation/skew of text lines is estimated as the hypothetical angle the handwritten text lines make with the horizontal axis. Presence of multi-oriented text is also considered an important characteristic of unconstrained handwritten document images. Since the objective of the current work is to accommodate normal writing styles from a variety of users, extreme multi-oriented writing is not included in the current dataset collection. Some of the document pages, taken from our dataset, with limited multi-oriented characteristics of the text lines are shown in Fig. 4a, b.

3 Ground truth of our databases

Fig. 4 Sample document images, with moderate multi-orientations in text lines. a From CMATERdb1.1.1. b From CMATERdb1.2.1

Generation of appropriate ground truth data has always been a challenging and tiresome task for the kind of problem under consideration. Availability of ground truth information, however, makes any database more useful, enabling researchers to evaluate a technique properly by comparing its output with the corresponding ground truth. In this work, we have prepared ground truth images for all the images of our databases, viz., CMATERdb1.1.1 and CMATERdb1.2.1, for the line segmentation application. For each of the two handwritten databases, we have generated the ground truth information, which has been archived as CMATERgt1.1.1 and CMATERgt1.2.1, respectively. We have prepared these ground truth images in a semi-automatic way. More specifically, we have employed our previously developed technique [22] to identify individual line segments in each document image. The errors that might have been generated in the automated line extraction are corrected using a software tool called GT Gen version 1.1, which we have developed for this project. Basically, we have used GT Gen to recolor the lines, or parts of lines, that were erroneously labeled by the technique of [22]. It may be noted that all the ground truth images are stored in bitmap (bmp) file format, where the background is labeled in white and individual text lines are marked in different colors. All the files in CMATERgt1.1.1 and CMATERgt1.2.1 are named GTB###.bmp and GTBE###.bmp, respectively. Figure 5a, b shows sample ground truth images from the two databases, respectively, prepared for the line extraction application.

Fig. 5 Sample ground truth images. a From CMATERgt1.1.1. b From CMATERgt1.2.1
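As a rough illustration of how such ground truth images might be consumed, the Python sketch below maps a page file name to its ground truth file name and tallies the ink pixels of each color label. Several points are assumptions rather than facts from the paper: the helper names are hypothetical, the ### number of a ground truth file is assumed to match that of its page, the image is assumed to be already decoded into rows of (R, G, B) tuples by some image library (not shown), and no color is assumed to be reused for two lines on the same page.

```python
WHITE = (255, 255, 255)  # background label, as stated in the text


def ground_truth_name(page_name):
    """Map e.g. 'B001.bmp' to 'GTB001.bmp' and 'BE001.bmp' to 'GTBE001.bmp'.

    Assumption: page and ground truth files share the same ### number.
    """
    return "GT" + page_name


def line_labels(gt_pixels):
    """Return {color: pixel count} for every non-white color in the image."""
    counts = {}
    for row in gt_pixels:
        for color in row:
            if color != WHITE:
                counts[color] = counts.get(color, 0) + 1
    return counts


if __name__ == "__main__":
    red, blue = (255, 0, 0), (0, 0, 255)
    tiny = [[WHITE, red, red],
            [WHITE, WHITE, WHITE],
            [blue, blue, WHITE]]
    labels = line_labels(tiny)
    print(ground_truth_name("BE007.bmp"), len(labels), "labeled lines")
```

Under the one-color-per-line assumption, the number of keys in the returned dictionary can be compared with the manually counted number of lines for the same page as a quick consistency check.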

3.1 GTGen: the ground truth generating software

GT Gen version 1.1 is a software tool, developed in Visual Basic .NET at the CMATER research laboratory, that can label text in any chosen color. GT Gen reads images of document pages with a white background. One can select any color from a color panel and use it to recolor the text by selecting the intended region with a mouse. Using this technique, we can easily correct the errors of our line extraction algorithm [22] to generate ground truth data. We can even use this tool to label text lines from scratch (without assistance from [22]) or to generate ground truths for word and character segmentation algorithms. The software setup is freely available and can be downloaded from http://code.google.com/p/cmaterdb. A screenshot of the GT Gen 1.1 software is shown in Fig. 6.


Fig. 6 A screenshot of the developed GT Gen 1.1 software

4 Benchmark evaluation of the databases by our line extraction technique

Our previously published works in this area may be found in [7,21,22]. We also participated in the ICDAR-2009 Handwriting Segmentation Contest [20]. The line extraction algorithm discussed in [22] is applied on both of the databases prepared for the current work. A brief description of the algorithm and the results achieved on both databases are given in the subsequent subsections.

4.1 Line extraction algorithm used

In one of our earlier works [22], an effective technique for identifying text lines in digitized handwritten document images has been presented. The terms component and segment are used interchangeably to denote an 8-connected set of black pixels in any binarized digital document page. A connected component-labeling (CCL) algorithm [8] is implemented to identify the basic segments in the text document as unique objects. During preprocessing, the components are categorized into one of 4 types, viz., Type #1, Type #2, Type #3, and Type #4, according to their respective dimensional characteristics (related to their heights and widths). Type #1 components are the small dot-like segments, Type #2 components are the long lines, Type #3 components consist of large segments, which may or may not be connected, and Type #4 components comprise the rest of the segments.

Type #1 and Type #2 components are initially treated as noise and are therefore ignored. Type #3 components may arise from the overlap of two or more words belonging to adjacent text lines, or from characters that have ligatures or elongated strokes extending upward or downward because of an individual's writing style. Such components are also ignored during the initial phase of text line identification. Our algorithm therefore considers only Type #4 components during the first phase of processing.
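
The categorization into four types can be sketched as below. The text above does not state the actual thresholds, so the rules and the default values of `dot_ratio`, `line_aspect`, and `large_ratio` are purely hypothetical placeholders, expressed relative to the median component height; only the overall idea (classify by height and width, keep Type #4 for the first phase) follows the description.

```python
from statistics import median
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def classify_components(boxes: List[Box], dot_ratio: float = 0.3,
                        line_aspect: float = 10.0,
                        large_ratio: float = 2.0) -> List[int]:
    """Assign a type number (1 to 4) to each bounding box.

    All thresholds are illustrative assumptions, not values from [22].
    """
    if not boxes:
        return []
    heights = [b - t + 1 for t, _, b, _ in boxes]
    ref_h = median(heights)  # reference size of a typical component
    types = []
    for (t, l, b, r), h in zip(boxes, heights):
        w = r - l + 1
        if h < dot_ratio * ref_h and w < dot_ratio * ref_h:
            types.append(1)  # small dot-like segment
        elif w >= line_aspect * h:
            types.append(2)  # long, thin line
        elif h > large_ratio * ref_h:
            types.append(3)  # large segment, possibly touching lines
        else:
            types.append(4)  # ordinary component used in the first phase
    return types
```

In this sketch, the first phase would then operate only on the boxes whose type is 4.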

Fig. 7 Illustration of the 4 different types of components in a sample document page taken from CMATERdb1.2.1. Unbounded black components are Type #1 components. The component bounded by a dotted rectangle is a Type #2 component. The black component bounded by a dotted circle is a Type #3 component. The rest of the bounded light gray shaded components are Type #4 components

Different dimensional features of these Type #4 components are used to identify individual text lines. The post-processing step includes possible reconsideration of the Type #1 components ignored in the first phase. Some of these may actually be small handwritten parts of the text, and such components are allocated to suitable text lines. Finally, there may be a few cases in which words of adjacent text lines have merged (Type #3 components). Such touching text components are carefully selected and split to form the actual text lines. The other Type #3 components are included in the text line with which they have maximum overlap. The 4 different types of components are illustrated on a sample document image in Fig. 7.
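
The maximum-overlap rule for the remaining Type #3 components can be sketched as follows. The description above does not define how overlap is measured; this sketch assumes overlap of vertical extents (row ranges), and the names `vertical_overlap` and `assign_to_line` are hypothetical.

```python
from typing import List, Optional, Tuple

Extent = Tuple[int, int]  # (top row, bottom row), inclusive


def vertical_overlap(a: Extent, b: Extent) -> int:
    """Number of pixel rows shared by two vertical extents."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def assign_to_line(component: Extent, lines: List[Extent]) -> Optional[int]:
    """Return the index of the text line with maximum overlap.

    Returns None when the component overlaps no line at all.
    Ties are resolved in favour of the earlier line.
    """
    best_index: Optional[int] = None
    best_overlap = 0
    for i, line in enumerate(lines):
        overlap = vertical_overlap(component, line)
        if overlap > best_overlap:
            best_index, best_overlap = i, overlap
    return best_index
```

For example, with lines spanning rows 0 to 20, 30 to 50, and 60 to 80, a component spanning rows 15 to 40 shares 6 rows with the first line and 11 rows with the second, so it is assigned to the second line.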

4.2 Results of the line extraction algorithm on both the databases

Performance of the line extraction algorithm is evaluated by careful visual inspection of the identified lines on the set of handwritten documents, as we do not have a tool for estimating performance automatically. For manual estimation of the success rate of the text line extraction technique, we consider two types of errors, viz. under-segmented text lines and over-segmented text lines. If two or more text lines are identified as a single text line, it is counted as an under-segmentation error, and all the text lines involved are treated as wrongly extracted. Similarly, if the components of a single text line are erroneously allocated to two or more text lines, that text line is also treated as wrongly extracted, due to over-segmentation. The total number of under-segmented and over-segmented text lines is reflected in the estimate of the success rate (SR) of the text line extraction technique. More specifically,

SR = (T − (U + O))/T, (1)

where

U = number of under-segmented text lines
O = number of over-segmented text lines
T = number of actual text lines present in the document page

Table 3 Detailed description of the experimental results of the text line extraction algorithm on both the databases

Database          Languages/scripts used   Number of pages   Average performance estimate (SR)
CMATERdb1.1.1     Bangla                   100               90.60%
CMATERdb1.2.1     Bangla and English       50                92.38%

Fig. 8 Outputs of our text line extraction algorithm on sample document images. a From CMATERdb1.1.1. b From CMATERdb1.2.1

By applying equation (1), the SRs achieved on the databases CMATERdb1.1.1 and CMATERdb1.2.1 are 90.60 and 92.38%, respectively. Table 3 summarizes the performance of the line extraction technique [22] on both databases. Figure 8a, b shows line extraction results on two sample document images, one from each database. Figure 9 illustrates different issues of the present line extraction technique: Fig. 9a shows an output image with cases of under-segmentation and over-segmentation, and Fig. 9b shows an output image with successful separation of touching text lines.
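
Equation (1) can be evaluated with a few lines of Python, shown here as a minimal sketch. The function names are hypothetical, and the per-page averaging in `average_success_rate` is an assumption: the text reports an average performance estimate but does not state whether it is averaged per page or pooled over all text lines. The page counts in the usage note are invented for illustration and are not taken from the databases.

```python
from typing import Iterable, Tuple


def success_rate(total_lines: int, under_segmented: int,
                 over_segmented: int) -> float:
    """SR = (T - (U + O)) / T for one document page, as a fraction."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    errors = under_segmented + over_segmented
    if under_segmented < 0 or over_segmented < 0 or errors > total_lines:
        raise ValueError("error counts must lie between 0 and total_lines")
    return (total_lines - errors) / total_lines


def average_success_rate(pages: Iterable[Tuple[int, int, int]]) -> float:
    """Mean of per-page SR values; each page is given as (T, U, O).

    Per-page averaging is an assumption of this sketch.
    """
    rates = [success_rate(t, u, o) for t, u, o in pages]
    if not rates:
        raise ValueError("at least one page is required")
    return sum(rates) / len(rates)
```

For a hypothetical page with T = 25, U = 2, and O = 1, `success_rate(25, 2, 1)` gives 22/25 = 0.88, i.e., an SR of 88%.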

5 Conclusion

In this paper, we have discussed the steps involved in generating a benchmark database of unconstrained handwritten document pages containing Bangla and Bangla–English mixed script words. This database is the first of its kind in this domain of application, i.e., OCR of handwritten Indic script. Each document contains characters, text, digits, and other symbols written by different writers. In the current database, the document pages written under our supervision were collected from 40 writers of different age groups, sexes, and educational levels. As an extension of the current project, more handwritten samples will be collected (under our supervision) from a broader section of society to incorporate wider variation in handwriting styles. We also plan to collect more unconstrained handwritten document pages, analyze both the supervised and unconstrained documents, prepare the ground truth, and make the benchmark results ready for the next database release. As discussed before, Bangla is a complex Indic script used by more than 207 million people worldwide. Unlike English, Bangla script uses more than 300 character shapes, many modifiers, and 10 digit patterns. Therefore, extraction of text lines from such documents is a challenging task. Despite many research efforts on this problem domain, the availability of standard datasets for Bangla script is limited. The current CMATERdb1 database is the first effort to develop such a repository, not only for unconstrained handwritten document pages containing Bangla script but also for mixed script document pages containing both Bangla and English words. In future releases of our database, we may include further scripts such as Devanagari and collect document pages containing both Devanagari and Latin script words. Extreme variations in writing styles, such as slanted words and multi-oriented text lines, may also be incorporated in future work. Such databases would make it possible to test the versatility of line/word/character segmentation algorithms to their limits. It may also be noted that all the current document pages are digitized using a flatbed scanner with a given resolution. Analysis of camera-captured document images is also gaining popularity and may open up a new research dimension. We have already reserved our CMATERdb4 release for such images; so far it includes only camera-captured business card images [http://code.google.com/p/cmaterdb]. Such images often suffer from uneven illumination, perspective distortion, improper focus, shadow, etc. Extending the CMATERdb4 release to camera-captured handwritten/printed document images is another future scope of this work.

Fig. 9 Outputs of our text line extraction algorithm. a Case of under-segmentation (two text lines shown with same color) and case of over-segmentation (single text line shown with two colors). b Successful separation of touching text lines (marked with circle)

We have also generated ground truth images for both databases for the evaluation of line extraction algorithms and have made the ground truth generation software GT Gen 1.1 freely available in the public domain. Benchmark text line extraction accuracies on these handwritten pages are also reported in the current work. In future, our aim is to increase the size of the database and to generate unconstrained handwritten page databases for other Indic scripts. We are also currently working to generate word-level and character-level databases for handwritten Bangla word images. Improving the ground truth generation software by including the line extraction routines and performance evaluation metrics is also part of our future research plans in this domain.

In a nutshell, we have attempted to provide a benchmark evaluation database for researchers interested in a challenging problem domain related to OCR of unconstrained handwritten document pages containing Bangla and Bangla mixed with English words.

Acknowledgments Many people helped us complete the database successfully. The authors are grateful to everyone who contributed data to this project. The authors also thank the editor of Computer Jagat magazine for providing their manuscript for the completion of the database. The work reported here has been partially funded by DST, Govt. of India, PURSE (Promotion of University Research and Scientific Excellence) Programme. Research of S. Basu is partially supported by BOYSCAST Fellowship (SR/BY/E-15/09) from DST, Government of India.

References

1. Wilkinson, R.A., Geist, J., Janet, S., Grother, P., Burges, C.J.C., Creecy, R., Hammond, B., Hull, J., Larsen, N.J., Vogl, T.P., Wilson, C.L.: The first census optical character recognition systems conference. Technical Report NISTIR 4912, The U.S. Bureau of Census and the National Institute of Standards and Technology, Gaithersburg (July 1992)

2. Marti, U., Bunke, H.: A full English sentence database for off-line handwriting recognition. In: Proceedings of fifth international conference on document analysis and recognition, pp. 705–708. Bangalore (1999)

3. Suen, C.Y., Nadal, C., Legault, R., Mai, T.A., Lam, L.: Computer recognition of unconstrained handwritten numerals. Proc. IEEE 80(7), 1162–1180 (1992)

4. Kim, D.H., Hwang, Y.S., Park, S.T., Kim, E.J., Paek, S.H., Bang, S.Y.: Handwritten Korean character image database PE92. In: Proceedings of the Second International Conference on Document Analysis and Recognition, pp. 470–473 (1993)

5. Saito, T., Yamada, H., Yamamoto, K.: On the database ELT9 of Handprinted characters in JIS Chinese characters and Its Analysis (in Japanese). IECEJ Trans. Vol. J. 68-D(4), 757–764 (1985)

6. http://users.iit.demokritos.gr/~bgat/HandSegmCont2009/, "ICDAR2009 Handwriting Segmentation Contest"

7. Basu, S., Chaudhuri, C., Kundu, M., Nasipuri, M., Basu, D.K.: Text line extraction from multi-skewed handwritten documents. Pattern Recognit. 40(6), 1825–1839 (2007)

8. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. 1st edn. Prentice-Hall, India (1992)

9. http://www.isical.ac.in/~ujjwal/download/database.html

10. Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Line and word segmentation of handwritten documents. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 247–252. August 19–21, Canada (2008)

11. Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Text line detection in handwritten documents. Pattern Recognit. 41(12), 3758–3772 (2008)

12. Yin, F., Liu, C.: Handwritten text line segmentation by clustering with distance metric learning. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 229–234. August 19–21, Canada (2008)

13. Du, X., Pan, W., Bui, T.D.: Text line segmentation in handwritten documents using Mumford-Shah model. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 253–258. August 19–21, Canada (2008)

14. Li, Y., Zheng, Y., Doermann, D.: Script-independent text line segmentation in freestyle handwritten documents. IEEE Trans. PAMI 30(8), 1313–1329 (2008)

15. Roy, P.P., Pal, U., Llados, J.: Morphology based handwritten Line segmentation using foreground and background information. In: Proceedings of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), pp. 241–246. August 19–21, Canada (2008)

16. Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Text line and word segmentation of handwritten documents. Pattern Recognit. 42(12), 3169–3183 (2009)

17. Yin, F., Liu, C.: Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognit. 42(12), 3146–3157 (2009)

18. Statistical Summaries, Ethnologue, 2005. Retrieved 2007-03-03

19. Languages spoken by more than 10 million people. Encarta Encyclopedia. Retrieved 2007-03-03 (2007)

20. Gatos, B., Stamatopoulos, N., Louloudis, G.: ICDAR 2009 Handwriting Segmentation Contest. In: ICDAR 2009, pp. 1393–1397

21. Sarkar, R., Basu, S., Das, N., Mollah, A.F., Kundu, M., Nasipuri, M.: Line extraction from unconstrained handwritten document pages using piece-wise water-flow technique. In: Proceedings (CD) of 4th Indian International Conference on Artificial Intelligence (IICAI), pp. 1861–1872. Tumkur, India, 16–18 Dec (2009)

22. Khandelwal, A., Choudhury, P., Sarkar, R., Basu, S., Nasipuri, M., Das, N.: Text line segmentation for unconstrained handwritten document images using neighborhood connected component analysis. In: Proceedings of International Conference on PreMI, pp. 369–374, Dec (2009)

23. Bhattacharya, U., Chaudhuri, B.B.: Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 444–457 (2009)

24. Luthy, F., Varga, T., Bunke, H.: Using hidden Markov models as a tool for handwritten text line segmentation. In: Proceedings of Ninth International Conference on Document Analysis and Recognition, pp. 8–12. Curitiba, Brazil (2007)

25. Huang, C., Srihari, S.: Word segmentation of off-line handwritten documents. In: Proceedings of the Document Recognition and Retrieval (DRR) XV, IST/SPIE Annual Symposium, San Jose, CA, USA, January (2008)

26. Hull, J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)

27. Al-Ohali, Y., Cheriet, M., Suen, C.: Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 36, 111–121 (2003)

28. Noumi, T., Matsui, T., Yamashita, I., Wakahara, T., Tsutsumida, T.: Tegaki Suji Database 'IPTP CD-ROM1' no ichi bunseki (in Japanese). Autumn Meeting of IEICE, Vol. D-309, Sept. (1994)

29. http://www.computerjagat.org/, "Computer Jagat"

30. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: Word level script identification from Bangla and Devanagri handwritten texts mixed with Roman script. J. Comput. 2(2), 103–108, Feb, ISSN 2151-9617 (2010)

31. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: A two-stage approach for segmentation of handwritten Bangla word images. In: Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 403–408. Montreal, Canada (2008)

32. Ikeda, H., Ogawa, Y., Koga, M., Nishimura, H., Sako, H., Fujisawa, H.: A Recognition Method for Touching Japanese Handwritten Characters. In: Proceedings of ICDAR, pp. 641–644. Bangalore, India (1999)

33. Yi-Kai, C., Jhing-Fa, W.: Segmentation of single or multiple-touching handwritten numeral string using background and foreground analysis. IEEE Trans. PAMI 22(11), 1304–1317 (2000)

34. Lu, Y.: Machine printed character segmentation-an overview. Pattern Recognit. 28(7), 67–80 (1995)
