1 INTRODUCTION OF OPTICAL CHARACTER
RECOGNITION PATTERNS
Introduction
Statement of the Problem
Objective of the Study
Advantages of the Study
Scope of the Study
Limitation of the Study
Literature Review of the Study
Glossary of the Terms
Planning of the Remaining Chapters
References
Chapter 1
1
CHAPTER 1
INTRODUCTION OF OPTICAL CHARACTER
RECOGNITION PATTERNS
1.1 Introduction
In corporations, institutes and offices overwhelming volume of paper-based
data challenges their ability to manage documents and records. Most of the
fields are become computerized because Computers can work faster and more
efficiently than human beings. Paperwork is reduced day by day and computer
system took place of it.
But what about already stored papers? All the early paper work is now become
paperless work. To convert any handwritten, printed or historical documents
(e.g. Book) into text document, either we have to type the whole document or
scan the document. If document size is very large, it becomes cumbersome
and time consuming to type the whole document and there may be the
possibilities of typing errors. Another option is scanning of document.
Whenever document size is very large instead of typing, scanning of document
is preferable. Forms can be scanned through scanner. Scanning is merely an
image capture of the original document, so it can’t be edited or searched
through in any way and it also occupies more space because it is stored as an
image. To edit or search from the document, document must be converted to
doc (text) format.
Optical Character Recognition (OCR) is a process of converting scanned
document into text document so it can be easily edited if needed and becomes
searchable. OCR is the mechanical or electronic translation of images of
handwritten or printed text into machine-editable text. Recognition engine of
the OCR system interpret the scanned images and turn images of handwritten
or printed characters into ASCII data (Machine readable characters). It
occupies very less space than the image of the document. The document which
Chapter 1
2
is converted to text document using OCR occupies 2.35KB while the same
document converted to an image occupies 5.38MB. So whenever document
size is very large instead of scanning, OCR is preferable.
1.2 Statement of the Problem
Title of the present study is: “A study of optical character patterns
identified by the different OCR algorithms and generation of a model for
the elimination of deficiency in identifying the patterns of optical
character”.
The prime theme of the study is to study optical character recognition
algorithms and tools, study character patterns by analyzing structural and
statistical features of characters and find deficiency in identifying the optical
character patterns and develop a model to eliminate it. Present study focuses
on offline isolated handwritten English capital characters and digits. English
language consists of different 26 capital alphabets A to Z, small alphabets a to
z and 10 digits 0 to 9.
1.3 Objectives of the Study
The objective of this research is to study character recognition patterns,
character recognition algorithms and tools and eliminate deficiency in
identifying character patterns.
To fulfill this objective some sub objectives were formed which are as
following.
1.3.1 Study different optical character recognition algorithms.
1.3.2 Study online and offline tools of the character recognition.
1.3.3 Identify character recognition patterns by Study and analysis of
English capital characters for structural and statistical features.
1.3.4 Study and analysis of digits 0 to 9 for structural and statistical features.
Chapter 1
3
1.3.5 Collect handwritten data samples and recognize each character and
digit using different character recognition tools.
1.3.6 Identify the common deficiency in most of the character recognition
software/tools by calculating the recognition rate of each character and
digit and find out the characters and digits whose recognition rate is
very less.
1.3.7 Designing and development of the model to eliminate the common
deficiency identified.
1.3.8 Develop the algorithm to implement the above model.
1.3.9 Testing and Performance evaluation by analyzing results of model.
1.4 Advantages of the study
Advantages of the study can be identified as following:
1.4.1 Scanned document stored as an image file which doesn’t allow editing
and searching. The proposed model will convert a document into
digital files so that document becomes searchable and editable.
1.4.2 It recognizes handwritten characters available on bank cheques,
government records, passport, invoice, bank statement and receipt etc.
1.4.3 Data entry errors can be reduced.
1.4.4 Paper work reduces so it becomes easy to access and manage.
1.4.5 It occupies less space as compared to an image file so it saves
computer space.
1.4.6 The model benefiting users who are unfamiliar with optical character
recognition or who might become confused with multiple applications.
1.4.7 It is possible to combine this model with speech recognition software
and it helps to generate speech output (e.g. Kurzweil 1000 and Open
Book) and it becomes very useful for blind people.
Chapter 1
4
1.5 Scope of the study
The study is concerned with character recognition, for conversion of the
scanned image file into editable and searchable text file. The study is
distributed among the following different software components, which are
developed using Java and OpenCV.
Proposed study focuses on analyzing and recognizing isolated offline English
capital characters and digits. For experimental study, researcher has taken all
capital characters and digits but for the proposed model characters G, N, O, Y
and digit 4 are selected because its recognition rate is very less in selected
character recognition tools.
1.6 Limitation of the study
1.6.1 The model supports only English language.
1.6.2 It identifies only English capital alphabets and digit, it can’t identify
small letters.
1.6.3 Accuracy of model is directly dependent on the quality of input
document.
1.6.4 The model is able to recognize only 4 English capital characters.
1.6.5 The model is able to recognize only 1 digit.
1.6.6 The model can identify only individual character. It can’t identify a
word.
1.6.7 It does not contain any word dictionary so spell check is not possible.
1.6.8 If input document is scanned below 300 dpi then it may affect the
quality and accuracy of OCR results.
Chapter 1
5
1.7 Literature Review of the study
1.7.1 A Glance over Related Literature
Character recognition is an art of detecting segmenting and identifying
characters from image. More precisely Character recognition is process of
detecting and recognizing characters from input image and converts it into
ASCII or other equivalent machine editable form [1], [2], [3].
OCR system is most suitable for the applications like data entry for business
documents (e.g. check, passport etc.), automatic number plate recognition,
multi choice examination; almost all kind of form processing system.
� History
In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by
Paul W. Handel who obtained a US patent on OCR in USA in 1933 (U.S.
Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his
method (U.S. Patent 2,026,329). Tauschek's machine was a mechanical device
that used templates and a photo detector [4].
In 1949 RCA engineers worked on the first primitive computer-type OCR to
help blind people for the US Veterans Administration, but instead of
converting the printed characters to machine language, their device converted
it to machine language and then spoke the letters. It proved far too expensive
and was not pursued after testing [4].
In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security
Agency in the United States, addressed the problem of converting printed
messages into machine language for computer processing and built a machine
to do this, reported in the Washington Daily News on 27 April 1951 and in the
New York Times on 26 December 1953 after his U.S. Patent 2,663,758 was
issued. Shepard then founded Intelligent Machines Research Corporation
Chapter 1
6
(IMR), which went on to deliver the world's first several OCR systems used in
commercial operation [4].
In 1955, the first commercial system was installed at the Reader's Digest. The
second system was sold to the Standard Oil Company for reading credit card
imprints for billing purposes. Other systems sold by IMR during the late 1950s
included a bill stub reader to the Ohio Bell Telephone Company and a page
scanner to the United States Air Force for reading and transmitting by teletype
typewritten messages. IBM and others were later licensed on Shepard's OCR
patents [4].
In about 1965, Reader's Digest and RCA collaborated to build an OCR
Document reader designed to digitize the serial numbers on Reader's Digest
coupons returned from advertisements. The fonts used on the documents were
printed by an RCA Drum printer using the OCR-A font. The reader was
connected directly to an RCA 301 computer (one of the first solid state
computers). This reader was followed by a specialized document reader
installed at TWA where the reader processed Airline Ticket stock. The readers
processed documents at a rate of 1,500 documents per minute, and checked
each document, rejecting those it was not able to process correctly. The
product became part of the RCA product line as a reader designed to process
"Turn around Documents" such as those utility and insurance bills returned
with payments [4].
The United States Postal Service has been using OCR machines to sort mail
since 1965 based on technology devised primarily by the prolific inventor
Jacob Rabinow. The first use of OCR in Europe was by the British General
Post Office (GPO). In 1965 it began planning an entire banking system, the
National Giro, using OCR technology, a process that revolutionized bill
payment systems in the UK. Canada Post has been using OCR systems since
1971. OCR systems read the name and address of the addressee at the first
mechanized sorting center, and print a routing bar code on the envelope based
Chapter 1
7
on the postal code. To avoid confusion with the human-readable address field
which can be located anywhere on the letter, special ink (orange in visible
light) is used that is clearly visible under ultraviolet light. Envelopes may then
be processed with equipment based on simple barcode readers [4].
In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc.
and led development of the first omni-font optical character recognition
system — a computer program capable of recognizing text printed in any
normal font. He decided that the best application of this technology would be
to create a reading machine for the blind, which would allow blind people to
have a computer read text to them out loud. This device required the invention
of two enabling technologies — the CCD flatbed scanner and the text-to-
speech synthesizer. On January 13, 1976 the successful finished product was
unveiled during a widely-reported news conference headed by Kurzweil and
the leaders of the National Federation of the Blind [4].
In 1978 Kurzweil Computer Products began selling a commercial version of
the optical character recognition computer program. LexisNexis was one of
the first customers, and bought the program to upload paper legal and news
documents onto its nascent online databases. Two years later, Kurzweil sold
his company to Xerox, which had an interest in further commercializing
paper-to-computer text conversion. Kurzweil Computer Products became a
subsidiary of Xerox known as Scan soft, now Nuance Communications [4].
1992-1996 Commissioned by the U.S. Department of Energy (DOE),
Information Science Research Institute (ISRI) conducted the most
authoritative of the Annual Test of OCR Accuracy for 5 consecutive years in
the mid-90s. Information Science Research Institute (ISRI) is a research and
development unit of University of Nevada, Las Vegas. ISRI was established in
1990 with funding from the U.S. Department of Energy. Its mission is to foster
the improvement of automated technologies for understanding machine
printed documents [4].
Chapter 1
8
1.7.2 Process of OCR
Any character recognition system performs following steps, i.e. Image
acquisition, Pre-processing, Segmentation, Feature extraction and
Classification.
1.7.2.1 Image Acquisition
Images of OCR system might be acquired by scanning document or by
capturing photograph of document.
1.7.2.2 Pre-processing
Pre-processing step involves RGB to grayscale image conversion,
thresholding, noise removal etc. In RGB to grayscale image conversion, RGB
image is converted to gray image. Thresholding converts gray image to black
and white image. Proper filter like median filter, Gaussian filter etc may be
applied to remove noise from document [5]. Input document may be resized if
Figure 1.1 Optical Character Recognition Process
Chapter 1
9
needed. Proper image pre-processing has a big impact on the quality of the
optical character recognition process.
1.7.2.3 Segmentation
Image segmentation is the process of dividing an image into multiple parts.
This is typically used to identify objects or other relevant information in
digital images [6]. Generally document is processed in hierarchical way. At
first level lines are segmented using row histogram. From each row, words are
extracted using column histogram and finally characters are extracted from
words [5].
1.7.2.4 Feature Extraction
The heart of any character recognition system is the formation of feature
vector to be used in the recognition stage. Feature extraction can be
considered as finding a set of parameters (features) that define the shape of the
underlying character as precisely and uniquely as possible. The term feature
selection refers to algorithms that select the best subset of the input feature set.
Methods that create new features based on transformations, or combination of
original features are called feature extraction algorithms [7].
1.7.2.5 Classification
The classification stage is the decision making part of a recognition system
and it uses the features extracted in the previous stage [8]. Classifiers compare
the input feature with stored pattern and find out the best matching class for
input. In simple terms, it is this part of the OCR which finally recognizes
individual characters and outputs them in machine editable form [7].
1.7.3 Information and Communication Technology (ICT) for data
collection
Collecting data can be a time-consuming, labor intensive process. So
businesses are constantly looking for ways in which data capture and
analysis can be automated. However, manual data collection is still
Chapter 1
10
common for many business processes. This revision note summarizes the
main kinds of data collection you need to be aware of.
1.7.3.1 Manual Input Methods
I. Keyboard
Keyboard is a familiar input device. It is used to input data into personal
computer applications such as databases and spreadsheets.
II. Touch-sensitive screens
Developed to allow computer monitors to be used as an input device.
Selections are made by users touching areas of a screen. Sensors built into the
screen surround, detect what has been touched. These screens are
increasingly used to help external customers input transactional data -
e.g. Buying transport tickets, paying for car parking or requesting
information.
1.7.3.2 Automated Input Methods
I. Magnetic ink character recognition (MICR)
MICR involves the recognition by a machine of specially-formatted characters
printed in magnetic ink. This is an expensive method to set up and use - but it
is accurate and fast. A good example is the use of magnetic ink characters
on the bottom of each cheque in a cheque book.
II. Optical mark reading (OMR)
Optical Mark Reading (OMR) uses paper based forms which users simply
mark (using a dash) to answer a question. OMR needs no special equipment to
mark a form other than a pen/pencil. Data can be processed very quickly and
with very low error rates. An OMR scanner then processes the forms directly
into the required database. An example you are probably familiar with is the
National Lottery entry forms, or answer sheets for those dreaded multiple
choice exam papers.
Chapter 1
11
III. Intelligent Character Recognition (ICR)
Intelligent Character Recognition (ICR) again uses paper based forms which
responders can enter hand printed text such as names, dates etc. as well as
dash marks with no special equipment needed other than a pen/pencil. An
ICR scanner then processes the forms, which are then verified and stored the
required database.
IV. Bar coding and EPOS
A very important kind of data collection method - in widespread use is a Bar
Code. Bar codes are made up of rectangular bars and spaces in varying widths.
Read optically, these enable computer software to identify products and items
automatically. Numbers or letters are represented by the width and position of
each code's bars and spaces, forming a unique 'tag'. Bar codes are printed on
individual labels, packaging or documents. When the coded item is handled,
the bar code is scanned and the information gained is fed into a computer.
Codes are also often used to track and count items.
V. Electronic Point of Sale (EPOS) technology
It is familiar in super markets and many retail operations. Not only saving
time at checkout, EPOS cuts management costs by providing an
automatic record of what is selling and stock requirements. Customers
receive an accurate record of prices and items purchased. Producers use bar
coding for quick and accurate stock control, linking easily to customers.
Distributors use bar codes as a crucial part of handling goods. Larger
businesses and those with high security requirements can use bar codes
for personnel identification and access records for sensitive areas.
VI. EFTPOS (Electronic Funds Transfer at Point Of Sale)
You will find EFTPOS terminals at the till in certain shops. An
EFTPOS terminal electronically prints out details of a plastic card transaction.
The computer in the terminal gets authorization for the payment amount (to
Chapter 1
12
make sure it's within the credit limit) and checks the card against a list of lost
and stolen cards.
VII. Magnetic stripe cards
A card (plastic or paper) with a magnetic strip of recording material on
which the magnetic tracks of an identification card are recorded.
Magnetic stripe cards are in widespread use as a way of controlling access
(e.g. swipe cards for doors, ticket barriers) and confirming identity (e.g. use in
bank and cash cards).
VIII. Smart cards
A smart card (sometimes also called a "chip card") is a plastic card with an
embedded microchip. It is widely expected that smart cards will eventually
replace magnetic stripe cards in many applications. The smart chip provides
significantly more memory than the magnetic stripe. The chip is also capable
of processing information. The added memory and processing capabilities
are what enable a smart card to offer more services and increased
security. Some smart cards can also run multiple applications on one card, this
reducing the number of cards required by any one person.
One of the key functions of the smart card is its ability to act as a stored value
card, such as Mondex and Visa cash. This enables the card to be used as
electronic cash. Smart cards can also allow secure information storage,
making them ideal as ID cards and security keys.
IX. Optical character recognition (OCR) and scanners
OCR is the recognition of printed or written characters by software that
processes information obtained by a scanner. Each page of text is
converted to a digital using a scanner and OCR is then applied to this
image to produce a text file. This involves complex image processing
algorithms and rarely achieves 100% accuracy so manual proof reading is
recommended.
Chapter 1
13
1.7.4 Review of related literature
The researcher has studied all the above mentioned work, and came to some
conclusion as summarized below:
Recognition of printed document is easily done by OCR. But handwritten
character recognition is still the subject of active research. Handwritten
character recognition is difficult task because handwriting differs from writer
to writer. Every individual person has an own style of writing.
1.8 Glossary of Terms
IMR : Intelligent Machines Research
ICR : Intelligent Character Recognition
ICT : Information and Communication Technology
MICR : Magnetic Ink Character Recognition
OMR : Optical Mark Reading
OCR : Optical Character Recognition
EPOS : Electronic Point of Sale
EFTPOS : Electronic Funds Transfer at Point Of Sale
1.9 Planning of the Remaining Chapters
The researcher has distributed the entire work into five different chapters. The
chapter summary for the remaining chapter i.e. from chapter 2 to chapter 5 is
as follow:
Chapter 2 Study of Optical Character Recognition Algorithms and
Tools
This chapter presents study of different optical character recognition
algorithms and offline and online character recognition tools. Researcher has
collected handwritten data samples of capital characters and digits. Result
analysis of each character and digit is carried out and calculated the
Chapter 1
14
recognition rate of each character and digit. Researcher has found out the
characters and digits whose recognition rate is very less in studied tool.
Chapter 3 Study and Analysis of English Characters and Digits for
Structural and Statistical Feature Extraction
This chapter presents study and analysis of structural and statistical features of
English capital characters and digits. Features like Zoning, Number of
endpoints, End point existence in zone, Zone with zero foreground pixels,
Number of vertical lines and Number of horizontal lines are studied and
analyzed for each capital character and digit. Decision tree method is used for
classification. Classification approach as decision table and decision tree is
presented for characters and digits.
Chapter 4 Designing and development of offline handwritten isolated
character recognition model
In this chapter researcher has designed and developed English handwritten
character recognition model for characters G, N, O and Y and digit 4.
Handwritten data samples are collected for those characters and digits.
Digitization of the handwritten data samples, character segmentation,
preprocessing, feature extraction and classification steps are performed to
recognize the characters and digits. Component development and functionality
of the proposed model is described in detail.
Chapter 5 Results, Conclusion and Future Scope for extension of
research work
In this chapter results and analysis of the result as outcome generated by the
proposed model is described. Writer wise result analysis for the characters G,
N, O, Y and digit 4 in the proposed model and performance analysis of
selected tools with the proposed model is described. Further this chapter
presents conclusion of the proposed research work along with direction for
future scope in the present research area.
Chapter 1
15
Reference:
[1] Kai Ding, Zhibin Liu, Lianwen Jin, Xinghua Zhu, “A Comparative
study of GABOR feature and gradient feature for handwritten 17hinese
character recognition”, International Conference on Wavelet Analysis
and Pattern Recognition, pp. 1182-1186, Beijing, China, 2-4 Nov.
2007
[2] Pranob K Charles, V.Harish, M.Swathi, CH. Deepthi, "A Review on
the Various Techniques used for Optical Character
Recognition",International Journal of Engineering Research and
Applications, Vol. 2, Issue 1, pp. 659-662, Jan-Feb 2012
[3] Om Prakash Sharma, M. K. Ghose, Krishna Bikram Shah, "An
Improved Zone Based Hybrid Feature Extraction Model for
Handwritten Alphabets Recognition Using Euler Number",
International Journal of Soft Computing and Engineering, Vol.2, Issue
2, pp. 504-58, May 2012
[4] http://en.wikipedia.org/wiki/Optical_character_recognition
[5] https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&c
d=1&cad=rja&sqi=2&ved=0CCoQFjAA&url=http%3A%2F%2Ffiles.
figshare.com%2F983548%2F2204.pdf&ei=tmP7Us-9HYf-
rAeu6IHYBA&usg=AFQjCNEekQKpByH4GO-LgK-
13ExJ6k8qQA&bvm=bv.61190604,d.bmk
[6] http://www.mathworks.in/discovery/image-segmentation.html
[7] http://shodhganga.inflibnet.ac.in:8080/jspui/bitstream/10603/4166/10/1
0_chapter%202.pdf
[8] http://arxiv.org/ftp/arxiv/papers/1103/1103.0365.pdf
[9] http://www.dataid.com/aboutocr.htm
[10] http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
[11] http://www.codeproject.com/KB/dotnet/simple_ocr.aspx