
Automated Receipt Image Identification, Cropping, and Parsing

Alex Yue
Princeton University
[email protected]

Abstract

The identification and extraction of unstructured data has always been one of the most difficult challenges of computer vision. This form of data exists all around us, from flashy advertisements to basic nutritional information on food products, and provides a large amount of information that exists nowhere else. Parsing this sort of data is very challenging; however, recent advancements in computer vision technology help make this feasible.

In this paper, I introduce a novel data pipeline for the identification, cropping, and extraction of unstructured data within receipt images. Receipts are ubiquitous, commonplace items that many consumers receive, but ones that are difficult to transform into raw data. Receipts contain a dense amount of data that is useful for future analysis, yet there exists no widely available solution for transforming receipts into structured data. Solutions do exist; however, they are either very costly or inaccurate.

The pipeline that is described in this paper outper-forms existing solutions by a large margin, and offers theability to automatically pull out semantic data such astransaction date and price from an image of a receipt.It achieves this success by using Line Segment Detec-tion (LSD), Holistically-Nested Edge Detection (HED), andHough Transform to crop the image. Then, Optical Char-acter Recognition (OCR) is applied to detect chunks of textand process the cropped image. Finally, natural languageprocessing and statistical techniques are used to extractuseful sets of information from the recognition result.

1. Introduction

The task of unstructured data processing has received growing interest over the past few years as computer vision and machine learning algorithms have significantly improved. In particular, this paper is interested in the processing of unstructured receipt image data and converting it into a simple-to-use, analyzable structured data format. This problem is extremely interesting because it is open-ended and the search space is unending: a receipt image can be provided in an unstructured format that varies widely across geographies and industries, and data such as date and price must be extracted from all sorts of receipt types. An airline electronic ticketing receipt is much different from a grocery store receipt, with both containing similar sets of data in drastically different locations.

This paper aims to help resolve some of the fundamental difficulties and inconsistencies associated with parsing this sort of unstructured data. We use a combination of receipt image datasets provided by ExpressExpense [2] and custom receipt data collected over the past few months by a small group of people to train and evaluate the effectiveness of this system. Our aim is to be able to process and parse receipt data from most standard use cases, which involves steps such as correcting the input image orientation, cropping the receipt to remove the background, running optical character recognition (OCR) to pull text from the image, and using heuristics to determine relevant data from the OCR result. A satisfactory system should be able to handle the above characteristics and further edge cases by pulling the transaction date and price from image inputs. The paper will describe the processes by which this result was achieved and provide both a quantitative and qualitative evaluation of the system's success.

2. Related Work

The challenge of automated receipt image data extraction has mostly been taken up in two arenas: academia and industry. Researchers and academics have mostly focused on developing techniques to improve the recognition and extraction of text from unstructured data, whereas industry has focused on creating commercial systems to reduce the manual labor costs associated with inputting receipt image data for analysis or reporting. However, neither produces an optimal system, due to shortcomings in either accuracy or cost. Existing commercial solutions include implementations from companies such as Expensify, Metabrite, and Receipts by Wave. These implementations are all slow, costly, and still rely on manual labor to process edge cases.


2.1. Academic Research

A large amount of research has focused on optical character recognition and text extraction from unstructured data. Less has focused specifically on the subtask of data extraction from receipts. One commonly used extraction mechanism for text detection is a Convolutional Neural Network (CNN) [9] [7] [4]. This class of OCR utilizes a Region Proposal Network (RPN) to propose regions of interest where text may exist, as well as a CNN to determine the likelihood of text appearing at that location. These systems all provide end-to-end pipelines for text identification—from localizing and recognizing text in images to retrieving data within such text. In the specific case of "Multi-Oriented Text Detection with Fully Convolutional Networks" by Zhang et al., the proposed system consistently achieved state-of-the-art performance on numerous text detection benchmarks [9].

Some research has also been done on extracting content directly from receipt text. Gjoreski et al. published "Optical Character Recognition Applied on Receipts Printed in Macedonian Language" [3]. In this work, the team proposed utilizing a k-nearest neighbors classifier to classify individual chunks of text images extracted from receipt images. This approach achieved 87% accuracy, which is decently performant for unsupervised data extraction.

2.2. Commercial Implementation

There are numerous companies that implement basic receipt parsing methods in order to collect structured data from user-uploaded receipts. Most notably, the leader in the field of receipt data extraction is Expensify, an integrated, enterprise expense report solution. Expensify markets its receipt scanning solution as SmartScan, a service that will "lift the Merchant, Date and Amount and create an Expense on your behalf" for a given image [1]. This system is not perfect, and relies heavily on human intervention for its numerous failure cases. More specifically, Expensify outsources part of its data entry process to freelancers on Amazon Mechanical Turk.

Other solutions freely available on the market are offered by companies such as Wave, Shoeboxed, and Metabrite. These systems all provide rudimentary solutions that are not able to process many different types of user-uploaded receipts. In the case of Wave, it is only able to extract data from simple, uncrumpled receipts on a solid background. Other solutions such as Metabrite are expensive, slow, and unreliable.

Figure 1. Demonstration of automated receipt cropping and sample OCR recognition.

3. Methodology

This research project comprises numerous methods and systems that work together in order to automate the extraction of data from receipt images. The primary steps involved in this systems project include: identification of the receipt foreground and cropping, computation of an affine transformation to unwarp the image, extraction of textual information using OCR, and, finally, parsing of the data to generate structured, semantic meaning.

The approach and methodology for each individual component of this project are described in the following subsections, with a high-level sketch of the overall pipeline below.
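As a rough illustration only, the stages can be composed as in the following Python sketch; the stage function names are hypothetical stand-ins for the components detailed in Sections 3.1 through 3.4, not the project's actual API.

import cv2

def parse_receipt(path):
    image = cv2.imread(path)
    corners = localize_receipt(image)    # Section 3.1: edge voting + box score
    cropped = unwarp(image, corners)     # Section 3.2: transform and crop
    boxes = recognize_text(cropped)      # Section 3.3: OCR and box merging
    return extract_date(boxes), extract_total(boxes)  # Section 3.4: parsing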

3.1. Receipt Localization & Cropping

Receipt image localization involves distinguishing the foreground from the background for any given image. For this problem, the system assumes the following acceptable constraints:


Figure 2. Sample Receipt Image

• The receipt is rectangular or approximately rectangular in shape.

• There exists a distinguishable edge along the border of the receipt that provides clear contrast with the background.

Given the above constraints, processed images will result in convex quadrilaterals when projected onto a 2D plane. Thus, our goal is to identify the four points that represent this quadrilateral and to then compute an affine transformation to project it onto a 2D plane. One of the goals of this part of the system is to make it as robust as possible when separating receipt images from their backgrounds. The input data consisted of a wide variety of receipt sizes on an assortment of challenging, non-standard backgrounds. These backgrounds included objects such as park benches, multi-grained tables, and transparent glass surfaces.

Determining the edges of an image is an incredibly complex task when there exists a complex background. This system uses multiple, independent edge detection implementations that are combined using a statistical, weighted voting average in order to generate a final mask determining the edge boundaries of the receipt image. The three implementations used are a Line Segment Detector (LSD), a Probabilistic Hough Transform pre-processed through a Canny Edge Detector, and a deep-learning-aided Holistically-Nested Edge Detection (HED) model that leverages a fully convolutional neural network [8]. The combination of these three processing techniques makes the cropping tool much more likely to identify the correct receipt area, and significantly boosts the accuracy of the overall system. This is because each detector has its own strengths and weaknesses with regard to edge detection; however, as the receipt image boundary is the most significant edge, it should appear in all three detection results.

Figure 3. Line Segment Detector Result for Receipt

The input image is first downscaled to a fixed size and transformed from the RGB color space to grayscale. This is done in order to improve algorithm performance as well as to increase runtime speed. These down-sampled images are then processed in parallel by each of the edge detectors described below; a sketch of this shared preprocessing step follows.
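A minimal sketch of the preprocessing step, assuming OpenCV; the target width is an assumed value, not the paper's.

import cv2

def preprocess(image, target_width=500):  # target_width is an assumption
    # Downscale to a fixed size to speed up the detectors, preserving aspect ratio.
    scale = target_width / image.shape[1]
    resized = cv2.resize(image, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_AREA)
    # Convert to grayscale for the edge detectors.
    return cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)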

3.1.1 Line Segment Detector (LSD)

The Line Segment Detector is a linear-time algorithm developed by von Gioi, Jakubowicz, Morel, and Randall [6]. Essentially, it is a line extraction algorithm that uses down-sampled and blurred Gaussian pyramids to identify line support regions. These line support regions are then filtered using thresholding and region approximations in order to extract line segments without the need for parameter tuning.

The advantages of this algorithm are that it automatically controls its own number of false detections through a thresholding method while simultaneously providing sub-pixel accurate results [6]. Thus, LSD is able to identify all major, sharp edges within an image, regardless of the length or relevancy of the line. This provides a good baseline sampling of all available lines within the image, which are later filtered by the weighted mask votes of the other detectors.
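A minimal sketch of this stage, assuming an OpenCV build that includes the LSD implementation (it is absent from some OpenCV releases for licensing reasons):

import cv2
import numpy as np

gray = preprocess(cv2.imread("receipt.jpg"))  # shared preprocessing sketched above
lsd = cv2.createLineSegmentDetector()
lines, _, _, _ = lsd.detect(gray)

# Rasterize the detected segments into a binary mask for the later voting step.
lsd_mask = np.zeros_like(gray)
if lines is not None:
    for x1, y1, x2, y2 in lines.reshape(-1, 4).astype(int):
        cv2.line(lsd_mask, (x1, y1), (x2, y2), 255, 2)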

3.1.2 Probabilistic Hough Transform

The Hough Transform is a technique used to detect features such as lines within an image. A binary image input, usually of edges from a detector such as the Canny Edge Detector, is used for this probabilistic method: possible lines matching the binary input are generated through a stochastic voting process, and the randomly generated lines that overlap sufficiently with the binary input are subsequently chosen.

This method of line detection is used in this system because it detects only the most significant edges in each given image. The grayscale input image is first convolved with a 5x5 Gaussian kernel in order to remove noise from the image. The processed image is then run through a Canny Edge Detector with hysteresis thresholding and non-maximum suppression. This edge detector generates a binary map of edge locations, which is then run through the Probabilistic Hough Transform to produce a map of predicted line segments. We then constrain the result space by filtering out lines shorter than a predefined minimum length, leaving only the leading edges.
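A sketch of this stage with OpenCV; the Canny thresholds and Hough parameters here are assumptions, not the paper's tuned values.

import cv2
import numpy as np

blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # 5x5 Gaussian kernel, as described
edges = cv2.Canny(blurred, 50, 150)           # hysteresis thresholds assumed
segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                           minLineLength=100, maxLineGap=10)

# Rasterize the surviving segments into a mask for the voting step.
hough_mask = np.zeros_like(gray)
if segments is not None:
    for x1, y1, x2, y2 in segments.reshape(-1, 4):
        cv2.line(hough_mask, (x1, y1), (x2, y2), 255, 2)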

3.1.3 Holistically-Nested Edge Detection

Holistically-Nested Edge Detection is a novel edge detection algorithm by Xie and Tu that uses a deep learning model in order to resolve ambiguity over primary edges within a given image. It performs edge prediction through the use of a fully convolutional neural network and learns the significance of certain edges through training [8]. The model used for this project implements the architecture developed by Xie and Tu and uses the Berkeley Segmentation Dataset and Benchmark as the initial training data [5].

The advantages of using the HED machine learning model stem primarily from the fact that the model was trained to avoid flagging edges within the background of a given image. This was especially helpful when input receipt images had especially noisy backgrounds containing numerous straight lines. However, this algorithm requires a large amount of computational power and provides only marginal benefit for images where the receipt background is not extremely noisy.
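A sketch of running a pretrained HED Caffe model through OpenCV's DNN module; the model file names and the mean values are assumptions based on common HED releases, not artifacts of this project.

import cv2
import numpy as np

net = cv2.dnn.readNetFromCaffe("hed.prototxt", "hed.caffemodel")  # assumed names
img = cv2.imread("receipt.jpg")
h, w = img.shape[:2]
blob = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(w, h),
                             mean=(104.0, 117.0, 123.0))  # assumed training means
net.setInput(blob)
edge_prob = net.forward()[0, 0]              # per-pixel edge probabilities in [0, 1]
hed_mask = (255 * cv2.resize(edge_prob, (w, h))).astype(np.uint8)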

3.2. Affine Transformation & Unwarping

Each line segment mask generated by the three independent edge detectors is combined into a single mask to determine the boundaries of the receipt image. This is done through a weighted probability distribution whereby the values in each mask are transformed to a value between 0 and 1 and the mask is multiplied by a pre-computed weight specific to each edge detector. These weighted masks are then added together and their values quantized into a new binary mask, from which final lines are detected using a Probabilistic Hough Transform. The final lines generated by this transform represent the highest-confidence predicted edges for the given input image, based on the combined votes of the three individual edge detectors for each line segment.
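A minimal sketch of this voting step, continuing from the detector sketches above; the per-detector weights and the quantization threshold are assumptions, not the paper's pre-computed values.

import cv2
import numpy as np

weights = {"lsd": 0.3, "hough": 0.3, "hed": 0.4}   # assumed weights
combined = (weights["lsd"] * (lsd_mask / 255.0)
            + weights["hough"] * (hough_mask / 255.0)
            + weights["hed"] * (hed_mask / 255.0))
binary = ((combined >= 0.5) * 255).astype(np.uint8)  # quantize the weighted votes
final_lines = cv2.HoughLinesP(binary, rho=1, theta=np.pi / 180, threshold=80,
                              minLineLength=100, maxLineGap=10)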

The lines are then segmented into vertical and horizontal lines using a k-means clustering algorithm that clusters based on the ρ value of the lines' polar coordinates. Intersections between vertical lines and horizontal lines are then determined after the clustering is complete. Nearby intersection points are then merged through averaging, using a k-d tree for efficient lookup. These points represent statistically likely candidates for the corner points of the receipt image.
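A sketch of the orientation split and intersection step, continuing from the previous sketch; for illustration this clusters on the segment angle rather than on ρ, and the nearby-point merge via k-d tree is omitted.

import numpy as np
from sklearn.cluster import KMeans

# Assumes final_lines from the previous sketch is not None.
segs = final_lines.reshape(-1, 4).astype(float)   # rows of (x1, y1, x2, y2)
angles = np.arctan2(segs[:, 3] - segs[:, 1], segs[:, 2] - segs[:, 0]) % np.pi
labels = KMeans(n_clusters=2, n_init=10).fit_predict(angles.reshape(-1, 1))

def intersect(a, b):
    # Intersection of the two infinite lines through segments a and b.
    x1, y1, x2, y2 = a
    x3, y3, x4, y4 = b
    d = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(d) < 1e-9:                 # parallel lines never intersect
        return None
    px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / d
    py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / d
    return px, py

corner_candidates = [p for a in segs[labels == 0] for b in segs[labels == 1]
                     if (p := intersect(a, b)) is not None]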

After likely corner points are generated through these statistical methods, the system assigns a box score to each combination of four points in the final corner point list that could represent the boundary of a proposed receipt image. This box score reflects how well those points match the actual receipt boundary edges, with lower scores indicating better candidates. The score is computed by statistically weighting a combination of factors, including the deviation of each corner angle from 90 degrees, the distance from the centroid of the points to the center of the image, and the ratio of the area of the rectangular box to the total area of the image.

The combination of four points with the lowest box score is chosen as the corners of the receipt. These corners are then used to calculate an affine transformation matrix that crops the receipt image from its original form into an axis-aligned and in-plane image for the optical character recognition system. This matrix is applied to the original input image, and the result is returned for use in later steps of the pipeline to extract text from the image.
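A sketch of the unwarping step with OpenCV. The paper describes an affine transformation; a general quadrilateral-to-rectangle mapping in OpenCV uses a perspective transform, which is what is shown here, and the corner variables are placeholders for the box-score winners.

import cv2
import numpy as np

# Four chosen corners in clockwise order, from the box-score search above.
quad = np.float32([top_left, top_right, bottom_right, bottom_left])
w = int(max(np.linalg.norm(quad[1] - quad[0]), np.linalg.norm(quad[2] - quad[3])))
h = int(max(np.linalg.norm(quad[3] - quad[0]), np.linalg.norm(quad[2] - quad[1])))
target = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

M = cv2.getPerspectiveTransform(quad, target)
cropped = cv2.warpPerspective(original_image, M, (w, h))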

3.3. Text Extraction Through OCR

The optical character recognition engine used for this implementation project is Tesseract OCR. Tesseract is a neural-network-enhanced optical character recognition engine that is maintained by Google, and is one of the leading open-source recognition systems available.

Figure 4. Basic Receipt Parsing with a Complex Background

Figure 5. Receipt Parsing with a Noisy Background

Before images in the pipeline are input into Tesseract for recognition, a number of pre-processing and image enhancement steps are performed in order to boost the accuracy of the result output by the OCR engine. These steps include bilateral image filtering in order to remove noise from the image and image thresholding in order to quantize the image into a binarized version. Further enhancements performed on the input image include algorithms such as the median blur.

The processed image is then passed into Tesseract, which returns data including the recognized text, the bounding boxes of each recognized text grouping, and a confidence score indicating how accurate the system predicts each recognition to be. Each individual box is then filtered to remove clearly incorrect recognitions and recognitions in which the system has low confidence.
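A minimal sketch of this pre-processing and recognition step, assuming the pytesseract bindings; the filter parameters and the confidence cutoff are assumptions.

import cv2
import pytesseract
from pytesseract import Output

gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
denoised = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
binary = cv2.medianBlur(binary, 3)

data = pytesseract.image_to_data(binary, output_type=Output.DICT)
# Keep (text, x, y, w, h) tuples for non-empty, sufficiently confident boxes.
boxes = [(data["text"][i], data["left"][i], data["top"][i],
          data["width"][i], data["height"][i])
         for i in range(len(data["text"]))
         if data["text"][i].strip() and float(data["conf"][i]) > 40]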

Figure 6. Receipt Parsing with an Affine Transformation

The final bounding boxes for text contain individual groupings of text that the recognition system probabilistically estimates should be grouped. The receipt parsing pipeline then runs a new set of steps in order to regroup semantically and spatially similar groups of text into larger bounding boxes. This is done by probabilistically selecting pairs of text boxes based on the distance between them. Given two text bounding boxes, the boxes are merged if they contain lexically similar content such as address information or numerical information. For example, if two adjacent bounding boxes both contain integers, the system will automatically merge the two bounding boxes and insert a decimal point, given the statistically likely chance that the OCR system failed to recognize the period separating them. Once the final, merged bounding boxes are computed, they are passed on to the next stage of the pipeline to extract both date and price data from each receipt.
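A sketch of the numeric merge heuristic just described; the adjacency gap, the line tolerance, and the (text, x, y, w, h) box format carried over from the Tesseract sketch above are assumptions.

def merge_numeric_boxes(boxes, max_gap=15):
    # Merge horizontally adjacent boxes that both contain integers,
    # inserting the decimal point presumed lost by the OCR engine.
    merged, used = [], set()
    for i, (ta, xa, ya, wa, ha) in enumerate(boxes):
        if i in used:
            continue
        out = (ta, xa, ya, wa, ha)
        if ta.isdigit():
            for j, (tb, xb, yb, wb, hb) in enumerate(boxes):
                if j == i or j in used or not tb.isdigit():
                    continue
                same_line = abs(ya - yb) < max(ha, hb) / 2
                adjacent = 0 <= xb - (xa + wa) <= max_gap
                if same_line and adjacent:
                    out = (f"{ta}.{tb}", xa, min(ya, yb),
                           (xb + wb) - xa, max(ha, hb))
                    used.add(j)
                    break
        merged.append(out)
    return merged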

3.4. Semantic Meaning & Understanding

Understanding the semantic meaning of text is a very difficult problem, especially when only partial text recognition data is available from the OCR engine. To solve this challenge, this system uses a combination of natural language processing techniques such as tokenization, regular expression matching, and spatial search algorithms in order to identify data located on receipt images. I take advantage of commonalities found on many receipts, such as the shared locations where common data such as the total price and merchant name are typically printed. This semantic understanding system also takes advantage of keywords on the image that typically signal categories, such as "price", "total", or "amount" for the total price.

The two main categories that this system was built to parse were receipt dates and total prices. To parse a given OCR output, words are first tokenized using natural language processing libraries, and keywords for each given category are identified. Once each keyword is identified, I run a spatial search over all text inputs nearest to the one containing the keyword to find useful information. For example, when parsing pricing data, keywords such as "price", "total", and "amount" are first identified from the OCR output. Then, for each positive keyword match, a nearest-neighbor search is conducted to look for text bounding boxes containing pricing information. I then select the keyword-price pair whose bounding boxes are furthest down the page, which usually represents the total price for a given receipt image.
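A sketch of the price extraction just described; the keyword set, the price pattern, and the box format are assumptions carried over from the earlier sketches.

import re

PRICE_KEYWORDS = {"price", "total", "amount"}
PRICE_RE = re.compile(r"\d+\.\d{2}")

def extract_total(boxes):
    # For each keyword box, find the nearest price-like box, then keep the
    # keyword-price pair lowest on the page as the total.
    keywords = [b for b in boxes if b[0].lower().strip(":") in PRICE_KEYWORDS]
    prices = [b for b in boxes if PRICE_RE.fullmatch(b[0])]
    candidates = []
    for _, kx, ky, _, _ in keywords:
        if prices:
            nearest = min(prices,
                          key=lambda p: (p[1] - kx) ** 2 + (p[2] - ky) ** 2)
            candidates.append((ky, nearest[0]))
    if not candidates:
        return None
    return max(candidates)[1]        # the pair furthest down the page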

Once the price and date information are found, this information is returned as output, and a plot containing the individual images generated during the pipeline, combined with the parsed information, is displayed to the user.

4. Documentation

4.1. Modules

The implementation for this project is primarily broken up into four distinct modules, each in charge of a unique subset of features:

• launcher - Responsible for integrating all modules into a single runtime and returning the final output and render.

• detector - Contains all the primary logic for each of the edge detectors, and code to integrate and filter the final detected edge segment results.

• transform - Responsible for taking in an edge mask and input image and returning an axis-aligned and in-plane cropped image.

• recognition - Responsible for running Tesseract OCR and parsing price and date data from the bounding box output of Tesseract.

Figure 7. Predicted Bounding Boxes for Receipt

4.2. Dependencies

This project could not have been finished without the work of numerous other modules that this system depends upon. Packages and scientific computing libraries used in this project include:

• OpenCV

• NumPy

• Caffe

• Tesseract

• matplotlib

• scikit-learn

5. Results

Included above and continuing onto the next page are sample result outputs generated from this receipt recognition pipeline. Further results are also included that demonstrate the rotation invariance and other capabilities of the system.

6. Evaluation

This project is evaluated in two separate stages. First, a qualitative evaluation tested the performance of the receipt recognition, cropping, and parsing. Next, a quantitative evaluation was performed by running the system on a set of custom-generated test receipt images to measure its accuracy.

6.1. Qualitative Evaluation

From tests and run-throughs using this system pipeline, I felt that, subjectively, this system was relatively easy to use and fast at parsing information from receipt images, with most results returned in sub-second intervals. However, the system does slow down when many false-positive line segments are identified during the line detection stage, as the box score calculation algorithm needs to iterate through all possible combinations in order to determine the best available combination of points representing the receipt image. This could be improved in the future through finer tuning of the edge detection mask weights for each individual detector.

I evaluated the quality of the boundary detection for a sample of receipt images. The algorithm developed and implemented for detecting the receipt image boundaries performed relatively well, determining the image boundaries in the majority of cases. This held even for difficult test images where parts of the receipt were obscured or photographed on challenging backgrounds, such as multi-grained tables whose individual wood grains some line detectors would attempt to recognize as line segments.

Through the use of multiple independent edge detectors that vote on the confidence of each proposed line segment, this system is much more robust to edge cases when compared to other document edge detectors.

6.2. Quantitative Evaluation

Data for use against a baseline system is difficult to come by, as no such dataset exists. Thus, I assembled a custom test dataset of 50 unique receipt images for use in evaluating the accuracy of this system. For comparison, these images were also provided to another receipt parsing system called Receipts by Wave.

Ultimately, I found that this system implementation outperformed the competing implementation. Furthermore, this system is much faster, with most results generated within 1 second, compared to minutes for Receipts by Wave.

• This system was able to correctly identify prices from36/50 of the receipt images.

• Receipts by Wave was able to correctly identify pricesfrom 22/50 of the receipt images.

7. Discussion

The system developed over the course of this final project performed relatively well across a wide variety of test images. Given the subset of images tested, this pipeline performed best when the receipt images were crisp and set against a solid, distinguishable background. The system was able to identify data such as price with extremely high accuracy under these circumstances. In situations where the background was noisy, such as a marble countertop or a table advertisement, the boundary detection statistical voting algorithm was able to identify the borders in most cases; however, when the boundary identification failed, the pipeline could not function at all. One possible solution that would allow the pipeline to still run when boundary detection and cropping fail would be to align the receipt text to the horizontal axis using a classification algorithm and then run optical character recognition on the entire image to attempt to extract text. During testing, this approach did not work nearly as well as the cropped version due to the additional noise that the background added to the image.

One other great strength of the developed system was that it was extremely fast in most circumstances. In timing tests of standard run-throughs of sample receipt images, the system would often provide a result within a second. Furthermore, the edge detection algorithm runs in near-real-time, allowing for further developments such as a visual overlay during the image capture process to act as an aid for the user.

8. Further Improvements

This receipt image parser provides a significant baseline platform for pulling pricing data off receipts. However, that is not to say that there cannot be further improvements to this system. One area of improvement that I will focus on implementing is training and inserting an end-to-end region proposal network to identify areas of text and their bounding boxes within each receipt image. One of the shortfalls of Tesseract is that it does not handle font variations within a single image very well. Some receipt images within the training and testing datasets have variations in font size, such as larger fonts for logos and bolded fonts for the total price. Tesseract does not handle these types of receipt images very well and often fails to identify areas of text that have a font face or size different from the majority font used in the image. Adding a region proposal network that identifies text blocks would help by allowing Tesseract to recognize text individually within each bounding box rather than across the entire image at once.

Furthermore, another improvement that I would like to add to this project is better methods to identify and correct errors that occur in the optical character recognition pipeline. Existing optical character recognition systems often recognize characters within some words incorrectly. For example, the word "total" will sometimes be recognized as "tota1", where the letter "l" is substituted by the number "1". Training an LSTM that takes input text and autocorrects errors such as these would significantly boost the accuracy of receipt identification and parsing.

References

[1] K. Barrett. SmartScan 101.
[2] ExpressExpense, LLC. ExpressExpense's massive receipt database, 2016.
[3] M. Gjoreski, G. Zajkovski, A. Bogatinov, G. Madjarov, D. Gjorgjevikj, and H. Gjoreski. Optical character recognition applied on receipts printed in Macedonian language. 04 2014.
[4] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, Jan 2016.
[5] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[6] R. G. von Gioi, J. Jakubowicz, J. Morel, and G. Randall. LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis & Machine Intelligence, 32:722–732, 12 2008.
[7] B. Shi, X. Bai, and S. J. Belongie. Detecting oriented text in natural images by linking segments. CoRR, abs/1703.06520, 2017.


[8] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[9] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. CoRR, abs/1604.04018, 2016.

9. Acknowledgements

Many thanks to Prof. Russakovsky, who was the main advisor for this final project.

10. Honor Code

This paper represents my own work in accordance with University regulations.

/s/ Alex Yue
