7/8/2009
1
Arabic
Character Recognition
Professor Mohammed Zeki Khedher
Jordan University
19th May 2003
Contents
• Types of Documents
• Signature verification
• Language Classification
• On Line and Off line OCR
• Latin Character Recognition
• Printed and Handwritten
• Preprocessing:
– Line segmentation
– Word segmentation
– Thinning
• Segmentation
• Feature Extraction
• Neural Networks
Arabic Character Recognition using Approximate Stroke Sequence
• Arabic Optical Character Recognition
• Previous Work in Arabic Character Recognition
• Main characteristics of Arabic Writing
• Approximate Stroke Sequence String Matching
• Conclusions
Types of Documents
A page containing text, image and a table
Schema for Document Image Analysis
Levels of processing (low to high) by document type (mostly-text, mostly-graphics)
• Pixels
– Mostly-text: Preprocessing: representation; noise reduction; binarization; skew detection; zoning; character segmentation; script, language & font recognition; character scaling
– Mostly-graphics: Preprocessing: representation; noise reduction; binarization; thinning; vectorization
• Primitives
– Mostly-text: Glyph recognition: connected components; strokes; characters, diacritics, punctuation; words
– Mostly-graphics: Primitive recognition: straight-line & curve segments; junctions and nodes; loops; characters
• Structures
– Mostly-text: Text recognition: word segmentation; text line reconstruction; table analysis; morphological context; lexical context; syntax, semantics
– Mostly-graphics: Structures recognition: text field; legends; label attribution; dimensions; graphics symbols; aerial and texture features; beautification (constraints)
• Documents
– Mostly-text: Page layout analysis: text versus non-text; physical components analysis; logical components analysis; functional components (content tags); compression
– Mostly-graphics: Interpretation: components recognition; connectivity analysis; CAD/GIS layer separation; database attribute extraction; compression
• Information retrieval (mostly-text) and database, CAD, GIS interface (mostly-graphics)
– Mostly-text: Information retrieval: document classification retrieval; search; security, authentication, privacy
– Mostly-graphics: Database, CAD, GIS interface: validation; search; update
A check: the labeled fields are the codeline, the amount and account fields, the signature, and the postcode
Signature verification
Block diagram: the signature from the check and the reference signature from the reference database each pass through feature extraction; the two feature vectors feed a distance measure, which outputs a distance.
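The signature verification diagram can be read as a comparison of feature vectors. The following is a minimal sketch, assuming the features are already extracted as numeric vectors, that the distance measure is Euclidean, and that a fixed threshold decides acceptance. The function names, the sample vectors, and the threshold are hypothetical and do not come from the slides.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def verify_signature(candidate, references, threshold):
    """Compare a candidate feature vector with every reference vector.

    Returns (accepted, best_distance). The acceptance rule (closest
    reference within a fixed threshold) is an assumption for illustration.
    """
    best = min(euclidean_distance(candidate, ref) for ref in references)
    return best <= threshold, best


# Hypothetical reference database for one account holder.
references = [[0.50, 1.20, 3.00], [0.55, 1.10, 2.90]]
accepted, distance = verify_signature([0.52, 1.15, 2.95], references, threshold=0.2)
```

In practice the threshold would be tuned per writer from genuine samples; a single global value is used here only to keep the sketch short.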
Language Classification
Classical OCR Systems
Document image → Format Analysis → Character group image → Character Segmentation → Character image → Feature Extraction → Character properties → Classification → Character ID
Base Line Extraction
Handwritten sentence recognition
Word Segmentation Algorithm
• Input: glyph sequence within a line
• Compute initial grouping probabilities, giving glyph pairs with linking probabilities
• Group adjacent glyphs into word partitions
• Compute the joint probability when each glyph pair is split or merged
• Make the adjustment that most improves the joint probability
• Update the linking probabilities
• Output: labeled words
Two Samples From The set of 1000 images
An assignment strategy used in a postal delivery system
Reference lines separating three zones
upper zone
middle zone
lower zone
Total recognition scheme
Block diagram labels: classifiers A, B, C and D; combination; discrimination; nonlinear normalization; feature extraction; canonical variates; common difference principal components; difference principal components
Segmentation techniques: training phase and testing phase
Segmentation problems in machine printed text
Sample A’s (20 samples, numbered 1 to 20)
Some pre-processing operations
• (a) The original image
• (b) Image after thinning
• (c) Image after dilation and scale normalization
Steps of the decomposition process
• (a) The original bitmap
• (b) The thinned image
• (c) The corrected polygonal approximation
• (d) The decomposition into circular arcs
Algorithm of Hole Detection
STEP 1: Let C(i) be the number of strokes crossed by the horizontal scanning line Y = i. Scan from top to bottom until i1 such that C(i1) = 1 and C(i1+1) >= 1.
STEP 2: Continue scanning until i2 such that C(i2) >= 2 and C(i2+1) = 1. To guard against a broken stroke, continue scanning to check that C(i2+2) = 1, ..., C(i2+k) = 1, where k is a small integer (about 2).
STEP 3: Given i1 and i2, confirm the hole. Assume the internal white region at Y = i1+1 is [l1, l2], and let B(i1+1) = l2 - l1. Let D(i1) be the length of the internal black region at Y = i1. Similarly, we can get B(i2) and D(i2+1).
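Steps 1 and 2 of the hole detection algorithm can be sketched on a binary image stored as a list of rows (1 = black). This is an illustrative reading, not the original implementation: it assumes the hole starts where the crossing count rises from 1 to at least 2 (the slide writes C(i1+1) >= 1, which is read here as a likely typo), and it omits the width-based confirmation of Step 3. The names `count_crossings` and `find_hole_rows` are hypothetical.

```python
def count_crossings(row):
    """C(i): number of runs of black (1) pixels along one scan line."""
    runs, previous = 0, 0
    for pixel in row:
        if pixel and not previous:
            runs += 1
        previous = pixel
    return runs


def find_hole_rows(image, k=2):
    """Return (i1, i2) bounding a candidate hole, or None if none is found."""
    c = [count_crossings(row) for row in image]
    n = len(c)
    # STEP 1 (assumed reading): one stroke, then at least two strokes.
    i1 = next((i for i in range(n - 1) if c[i] == 1 and c[i + 1] >= 2), None)
    if i1 is None:
        return None
    # STEP 2: last row with >= 2 crossings, followed by up to k rows
    # with exactly one crossing (guards against broken strokes).
    for i2 in range(i1 + 1, n - 1):
        if c[i2] >= 2:
            tail = c[i2 + 1 : i2 + 1 + k]
            if tail and all(value == 1 for value in tail):
                return i1, i2
    return None


zero = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
caret = [  # open at the bottom, so no hole
    [0, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]
```

Near the bottom edge of the image the k-row check is truncated to the rows that exist, which is a simplification chosen for this sketch.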
Examples of Numeral Features
• (a) The principal (PA) and secondary (SA) axes
• (b) Number of black pixel blocks in each row and column
• (c) Position of holes
Algorithm of Contour Concavity Detection
Recovery of Drawing Order from Handwriting Images
Partitioning Handwritten Numeral Strings
Figure labels: original string, partitioned string; panels (a) to (d)
Stroke Sequence Strings for “A”’s
Computing distance table
First row: stroke codes of the column string. First column: stroke codes of the row string. "-" marks the empty prefix.
-  |   -   2   1   2   1   0   7   6   7   6
-  |   0   2   4   6   8  10  12  14  16  18
1  |   2   1   2   4   6   8  10  12  14  16
1  |   4   3   1   3   4   6   8  10  12  14
1  |   6   5   3   2   3   5   7   9  11  13
1  |   8   7   5   4   2   4   6   8  10  12
1  |  10   9   7   6   4   3   5   7   9  11
5  |  12  11   9   8   6   5   5   6   8  10
5  |  14  13  11  10   8   7   7   6   8   9
6  |  16  15  13  12  10   9   8   7   7   8
5  |  18  17  15  14  12  11  10   9   9   8
5  |  20  19  17  16  14  13  12  11  11  10
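The distance table above appears consistent with a standard edit-distance recurrence over stroke direction codes. The sketch below rests on two assumptions that the slides do not state explicitly: inserting or deleting a stroke costs 2 (inferred from the steps of 2 along the first row and first column), and the substitution cost wraps around the 8 directions once the difference exceeds 4. Under these assumptions it reproduces the values in the table, ending with a total distance of 10. The function names are hypothetical.

```python
def stroke_distance(a, b):
    """Substitution cost between two 8-direction stroke codes.

    The wraparound branch (8 - diff) is an assumption for differences > 4.
    """
    diff = abs(a - b)
    return diff if diff <= 4 else 8 - diff


def distance_table(s1, s2, gap=2):
    """Dynamic-programming table; rows follow s1, columns follow s2."""
    rows, cols = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # delete every stroke of s1 so far
    for j in range(1, cols):
        table[0][j] = j * gap  # insert every stroke of s2 so far
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j - 1] + stroke_distance(s1[i - 1], s2[j - 1]),
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    return table


row_string = [1, 1, 1, 1, 1, 5, 5, 6, 5, 5]
column_string = [2, 1, 2, 1, 0, 7, 6, 7, 6]
table = distance_table(row_string, column_string)
total_distance = table[-1][-1]
```

The bottom-right cell is the distance between the two complete stroke strings; the unknown character would be assigned to the reference with the smallest such value.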
Segmentation Methods
Taxonomy diagram labels: analytic, holistic, recognition-based, Megabased, dissection, post-process, hidden Markov model, non-Markov, windowing, feature-based, dynamic programming, Markov, hybrid
Segmentation Strategies
• Classical approach (dissection): cutting the image into meaningful components based on character-like properties
• Recognition-based segmentation: matching classes in the alphabet
• Holistic methods: recognition of whole words
Recursive segmentation
Diagram labels: input pattern, windowed input, matching prototype 1, residue, matching prototype 2
Location of the Blocks on the letter
Feature Extraction
Multiple segmentation hypotheses
Figure labels: curve; cusp; angular point; curvature maxima and multiple point; simple loop; anticlockwise orientation
Segmentation hypotheses into physical primitives
Figure labels: curvature maxima and multiple point; simple loop; anticlockwise orientation
Graphemes
Relationship between sub-components and the background
(a) Isolated case
(b) Partially enclosed case
(c) Totally enclosed case
Skewing of Text
Learning 2D Shape Models
• Two training sets of left ventricle and cistern shapes from different patients were automatically divided into clusters
The shape learning method
Recognition of Mathematical Symbols
$a = c_0^2 + \sum_{n=0}^{\infty} c_n^2 / 2$
Block diagram of the Neural Classifier
Neural Network based Feature Extraction and Classification
Diagram labels: input image, extracted coupled feature space, MLP classifier
Recognition rates of letters for increasing training sets
Hidden Markov Model for Text Line
Hidden Markov Model
• (a) Training sample
• (b) Sequence of features
• (c) Hidden Markov model
• (d) Different segmentation features
Arabic Character Recognition
• On-line systems
• Off-line systems
• Arabic OCR
• Necessity of segmentation even for printed text
• Treatment of the sub-words rather than words
Main Characteristics of Arabic Writing
• Right to left
• Always cursive
• Change of character shape according to its location in the word
• Four different shapes
• 28 basic characters: 15 with dots, 13 without
• No fixed character width & No fixed size
Characters distinguished by dots only
• Letters: ب ت ث ي ن
• Middle forms: ـبـ ـتـ ـثـ ـيـ ـنـ
Characters with Hamza
• 4 characters which may take the secondary character “Hamzah ء”.
• Alif أ إ
• Waw ؤ
• Yaa ئ
• Kaf ك
Arabic Characters: Different Forms
Words and sub-words
• رa`ل A word with 3 sub-words
• ر A sub-word with 1 character
• `a A sub-word with 2 characters
• ل A sub-word with 1 character
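The split of a word into sub-words can be sketched in a few lines. The rule used here is general background on Arabic script rather than something stated on the slide: a sub-word ends after any letter that does not join to the following letter. The letter set below covers the common cases only, the input is assumed to be a single word without diacritics, and the sample word رجال is our own illustration (it has the same 1 + 2 + 1 structure as the slide's example but is not necessarily the same word). The name `split_subwords` is hypothetical.

```python
# Letters that never connect to the following letter (common cases only).
NON_CONNECTING = set("اأإآدذرزوؤ")


def split_subwords(word):
    """Split one Arabic word into its connected sub-words."""
    subwords, current = [], ""
    for letter in word:
        current += letter
        if letter in NON_CONNECTING:
            # The cursive stroke breaks after this letter.
            subwords.append(current)
            current = ""
    if current:
        subwords.append(current)
    return subwords


example = split_subwords("رجال")  # sub-words of 1, 2 and 1 characters
```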
Test Example
• Size: about 1.4MB
• 262,647 words
• 1,126,420 characters
• 4.3 characters per word
• 574,383 sub-words
• 2.2 sub-words per word
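The two averages follow directly from the counts in the test example. The short check below uses only the figures listed above; the characters-per-sub-word ratio is a derived value that does not appear on the slide.

```python
words = 262_647
characters = 1_126_420
subwords = 574_383

chars_per_word = characters / words        # about 4.3, as stated above
subwords_per_word = subwords / words       # about 2.2, as stated above
chars_per_subword = characters / subwords  # derived here, roughly 2
```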
Sub-words Shapes Statistics
Proposal for a New Procedure for Recognition of Arabic Characters
• Sub-words of 1 character (stand-alone form): recognise directly without any segmentation.
• Sub-words of 2 characters: segment into two parts only.
– The first character is in the initial form
– The second character is in the final form
• Sub-words of more than two characters:
– The first character is in the initial form
– The last character is in the final form
– The rest are in the middle form
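The rule for which form to expect at each position depends only on the length of the sub-word. A minimal sketch, assuming the sub-word length is already known; the function name `expected_forms` and the form labels are chosen here for illustration:

```python
def expected_forms(length):
    """Expected character form at each position of a sub-word."""
    if length < 1:
        raise ValueError("a sub-word has at least one character")
    if length == 1:
        return ["stand-alone"]
    # First is initial, last is final, everything between is middle.
    return ["initial"] + ["middle"] * (length - 2) + ["final"]


forms_for_four = expected_forms(4)
```

Restricting each position to one form means a classifier only has to compare against the prototypes of that form, which is the simplification the proposal relies on.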
Examples
• One character: ن ق ع
• Two characters sub-words: ل`e fg`آ ig`j
• Three characters sub-words: ikj وlm ikj`ل
• Four characters sub-words: opqr opqrو
• Five characters sub-words:fkqTjا opqTm
Proposed Procedure for Arabic Character Recognition
Approximate Stroke Sequence String Matching
• Individual distance $d_{i,j}$ between the i'th stroke in letter a1 and the j'th stroke in letter a2, where
$d_{i,j} = |a_1(i) - a_2(j)|$ if $|a_1(i) - a_2(j)| \le 4$
The 8-direction stroke convention
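As an illustration of how a pen movement can be mapped to one of 8 direction codes, the sketch below assumes that code 0 points along the positive x axis, that codes increase counterclockwise in 45-degree steps, and that y points upward. The numbering in the slide's figure may differ, and the function name `direction_code` is hypothetical.

```python
import math


def direction_code(dx, dy):
    """Quantize a displacement (dx, dy) into an 8-direction code 0..7."""
    if dx == 0 and dy == 0:
        raise ValueError("zero displacement has no direction")
    angle = math.atan2(dy, dx)  # radians in (-pi, pi]
    return round(angle / (math.pi / 4)) % 8


# Hypothetical stroke: right, then up-right, then up.
codes = [direction_code(dx, dy) for dx, dy in [(1, 0), (1, 1), (0, 1)]]
```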
String matching between the unknown character and character ح
String matching between the unknown character and character ع
(a) Character ح; (b) character ع; (c) unknown character to be matched with (a) and (b)
Remarks about Arabic Language
• The average Arabic word contains about 4.3 characters, with an average of 2.2 sub-words per word.
• The basic block should be the sub-word rather than the word.
• Sub-words range in size from 1 to 8 characters.
• Sub-words with a single character: stand-alone form.
• Sub-words with two characters: a single-shot segmentation divides the sub-word into two characters. The first is in the initial form and the second in the final form.
• Sub-words longer than 2 characters need to be segmented into three or more characters. The first is in the initial form, the last in the final form, and the rest in the middle form.
Conclusions
• On-line OCR is easier
• Off-line OCR for printed text is available for Latin characters
• OCR for handwriting is still at the research stage
• Special applications that recognize handwritten text are available, e.g. checks and postal delivery
• OCR for oriental languages is at the research stage
• Research in Natural Language Processing aids OCR development
Conclusions (continued)
• Designing an Arabic OCR system is much simpler when these facts are taken into account.
• Classification of sub-words according to the number of characters they contain still needs to be addressed.
• Approximate stroke sequence string matching shows promising results.
• The algorithm needs further refinement to achieve a better recognition rate.
• Using neural networks for segmentation is promising.