
IEEE TRANSACTIONS ON COMPUTERS, NOVEMBER 1970

Use of Contextual Constraints in Recognition of Contour-Traced Handprinted Characters

ROBERT W. DONALDSON AND GODFRIED T. TOUSSAINT

Abstract-A contour-tracing technique described in an earlier paper [9] was used along with monogram, bigram, and trigram contextual constraints to recognize handprinted characters. Error probabilities decreased by factors ranging from two to four over those which resulted when contextual constraints were ignored. A simple technique for searching over only the most probable bigrams and trigrams was used to significantly reduce the computation without reducing the recognition accuracy.

Index Terms-Block decoding, character recognition, contextual constraints, contour analysis, handprinted characters, limited-data experiments, suboptimum classification.

I. INTRODUCTION

In a general pattern recognition problem, patterns are presented, one after another, to a system which extracts features which are then used, along with a priori pattern probabilities, to recognize the unknown patterns. Let C_{ij} denote pattern class i (i = 1, 2, ..., M), where subscript j denotes the jth position in the sequence (j = 1, 2, ..., N). Let X_j denote the corresponding feature vector. Let P(X_1, X_2, ..., X_N | C_{i1}, C_{j2}, ..., C_{kN}) denote the probability of the vector sequence X_1, X_2, ..., X_N conditioned on the class sequence C_{i1}, C_{j2}, ..., C_{kN}, whose a priori probability is P(C_{i1}, C_{j2}, ..., C_{kN}). The probability of correctly identifying an unknown pattern sequence is maximized by maximizing any monotonic function of

R = P(X_1, X_2, ..., X_N | C_{i1}, C_{j2}, ..., C_{kN}) · P(C_{i1}, C_{j2}, ..., C_{kN})    (1a)

with respect to i, j, ..., k. In what follows, the length D_r of feature vector X_r is not necessarily constant, even for an individual pattern class; in fact, these differences in length are used as an aid in recognition. For this reason, (1a) is expanded to account for these differences in length D as follows.

R = P(X_1, X_2, ..., X_N | C_{i1}, C_{j2}, ..., C_{kN}, D_{a1}, D_{b2}, ..., D_{cN})
    · P(D_{a1}, D_{b2}, ..., D_{cN} | C_{i1}, C_{j2}, ..., C_{kN}) · P(C_{i1}, C_{j2}, ..., C_{kN})    (1b)

where D_{ej} denotes that vector X_j corresponding to C_{ij} has D_e components. We now make the following two assumptions, which significantly reduce the amount of computation and data needed to calculate R for any i, j, ..., k:

P(X_1, X_2, ..., X_N | C_{i1}, C_{j2}, ..., C_{kN}, D_{a1}, D_{b2}, ..., D_{cN}) = P(X_1 | C_{i1}, D_{a1}) P(X_2 | C_{j2}, D_{b2}) ... P(X_N | C_{kN}, D_{cN})    (2a)

and

Manuscript received March 31, 1970; revised June 15, 1970. This work was supported by the National Research Council of Canada under Grant NRC A-3308 and by the Defence Research Board of Canada under Grant DRB 2801-30.

The authors are with the Department of Electrical Engineering, University of British Columbia, Vancouver, Canada.

P(D_{a1}, D_{b2}, ..., D_{cN} | C_{i1}, C_{j2}, ..., C_{kN}) = P(D_{a1} | C_{i1}) P(D_{b2} | C_{j2}) ... P(D_{cN} | C_{kN}).    (2b)

Although (2) is not necessarily exact, it enormously reduces the training and storage capacity required for the quantities on the left-hand side of (2a) and (2b). Even so, it is still necessary to store P(C_{i1}, C_{j2}, ..., C_{kN}) for all M^N pattern sequences, and this required storage capacity increases exponentially with N. What is needed, in addition to (2), are techniques which make effective use of contextual constraints without storing excessive amounts of data or using excessive computation time. When the contextual constraints are in terms of

P(X_1, X_2, ..., X_N | C_{i1}, C_{j2}, ..., C_{kN})

and

P(C_{i1}, C_{j2}, ..., C_{kN}),

there are two basic approaches to the problem.

1) Treat a long sequence of N patterns as N/L (1 ≤ L ≤ N) separate sequences, which are then classified independently of each other using (1) or a suitable approximation. This is block decoding. The block length is L.

2) Successively examine all N - L + 1 sequences of adjacent patterns of length 1 ≤ L ≤ N. Following examination of each sequence, classify the pattern to be excluded from all future sequences. Proper use of this procedure, called sequential decoding, avoids exponential growth with L in computation time.
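The block-decoding rule of approach 1) can be sketched in a few lines. The following Python fragment is a minimal illustration, not the authors' program; it assumes per-pattern log-likelihoods and an L-gram log-prior table (both hypothetical data layouts) are already available, and maximizes ln R as in (1a):

```python
import math
from itertools import product

def block_decode(log_lik, log_prior, num_classes, L):
    """Block decoding for one block of L patterns.
    log_lik[r][c] = ln P(X_r | C_c) for position r in the block;
    log_prior[(c1, ..., cL)] = ln P(C_c1, ..., C_cL).
    Returns the class sequence maximizing the block score."""
    best_seq, best_score = None, -math.inf
    for seq in product(range(num_classes), repeat=L):
        score = sum(log_lik[r][c] for r, c in enumerate(seq))
        # Sequences absent from the prior table are treated as illegal
        # (probability zero), i.e., log prior of -infinity.
        score += log_prior.get(seq, -math.inf)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```

Treating missing L-grams as illegal mirrors the paper's later handling of bigrams and trigrams not found in the relative-frequency tables.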

Both approaches have been studied extensively for decoding coded communication signals [1], [2] and have found some application in pattern recognition [3]-[8]. We note here that when the patterns are characters from a natural language, higher level linguistic constraints would also be used along with the statistical constraints referred to above.

This paper presents results of experiments conducted in order to determine the effect of using contextual constraints to recognize contour-traced handprinted characters. Block decoding was used with L = 1 (monogram constraints), L = 2 (bigram constraints), and L = 3 (trigram constraints). Equiprobable characters were also used, and this case is denoted by L = 0. The data and the feature extraction algorithm are described in Section II. The classification algorithm is described in Section III. The results, presented in Section IV, show that error probability decreases by factors between two and four over those which result when statistical contextual constraints are ignored. It is also shown that a technique for searching over only the most probable character sequences significantly reduces the amount of computation without further degradation of performance. The results and their implications are discussed in Section V.

II. DATA AND FEATURE EXTRACTION

The data consisted of 14 upper-case Roman alphabets, two from each of seven persons, as described earlier [9]. Each character was spatially quantized into a 50 x 50 array


[Figure: example CODE and COORD binary words for the letter C under each area division.]
Fig. 1. CODE and COORD word for C. (a) Four-part area division. (b) Six-part area division.

and punched on IBM cards. Subjects were required to refrain from making broken characters. Ten different graduate students were able to recognize the quantized characters with 99.7 percent accuracy.

The feature vector for any character was obtained by simulating a discrete scanner on an IBM 7044 computer; the scanner operates as follows. The scanning spot moves, point by point, from the bottom to the top of the leftmost column, and successively repeats this procedure on the column immediately to the right of the column previously scanned, until the first black point is found. Upon locating this point, the scanner enters the CONTOUR mode, in which the scanning spot moves right after encountering a white point and left after encountering a black point. The CONTOUR mode terminates when the scanning spot completes its trace around the outside of a character and returns to its starting point.

After being scanned, the character is divided into either four or six equal-sized rectangles whose size depends on the height and width of the letter (see Fig. 1). A y threshold equal to one-half each rectangle's height and an x threshold equal to one-half each rectangle's width are defined. Whenever the x coordinate of the scanning spot reaches a local extremum and moves in the opposite direction to a point one threshold away from the resulting extremum, the resulting point is designated as either an x_max or x_min. After an x_max (x_min) has occurred, no additional x_max's (x_min's) are recorded until after an x_min (x_max) has occurred. Analogous comments apply to the y coordinate of the scanning spot. The starting point of the CONTOUR mode is regarded as an x_min. The CODE word for a character consists of a 1 followed by binary digits whose order coincides with the order in which extrema occur during contour tracing; 1 denotes max's and min's in x, while 0 denotes max's and min's in y. The rectangles are designated by binary numbers, and the ordering of these numbers in accordance with the rectangles in which extrema fall in event sequence constitutes the COORD word. The feature vector consists of the CODE word followed by the COORD word. This feature extraction scheme is our modification [9] of one originally devised by Clemens and Mason [10]-[12].
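As an illustration only, the CODE word construction just described can be sketched as follows; the 'x'/'y' labelling of extrema is a hypothetical encoding for this sketch, not the authors' internal representation:

```python
def code_word(extrema):
    """Build the CODE word: a leading 1 followed by one bit per
    extremum, in order of occurrence during contour tracing.
    A 1 denotes an extremum in x (x_max or x_min); a 0 denotes
    an extremum in y. `extrema` is a sequence of 'x'/'y' labels."""
    return '1' + ''.join('1' if e == 'x' else '0' for e in extrema)
```

For example, a trace producing an x extremum, two y extrema, and another x extremum yields code_word(['x', 'y', 'y', 'x']) == '11001'.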

III. CLASSIFICATION ALGORITHM

The D_{ar} components of any vector X_r = (x_{r1}, x_{r2}, ..., x_{rD_{ar}}) are assumed statistically independent, with the result that

P(X_r | C_{ir}, D_{ar}) = P(x_{r1}, x_{r2}, ..., x_{rD_{ar}} | C_{ir}, D_{ar}) = Π_{j=1}^{D_{ar}} P(x_{rj} | C_{ir}, D_{ar}).    (3)

From (1), (2), and (3) it now follows that the optimum classification of a character sequence of length L results if i, j, ..., k are chosen to maximize¹

T_{ij...k} = [Σ_{u=1}^{D_{a1}} ln P(x_{1u} | C_{i1}, D_{a1}) + ln P(D_{a1} | C_{i1})]
    + [Σ_{u=1}^{D_{b2}} ln P(x_{2u} | C_{j2}, D_{b2}) + ln P(D_{b2} | C_{j2})] + ...
    + [Σ_{u=1}^{D_{cL}} ln P(x_{Lu} | C_{kL}, D_{cL}) + ln P(D_{cL} | C_{kL})]
    + ln P(C_{i1}, C_{j2}, ..., C_{kL}).    (4)

To search over all M^L possible sequences causes the computation time to increase exponentially with L. For M = 26, L = 2, and L = 3, there are 676 and 17 576 different sequences, respectively. To reduce the amount of searching required, the following procedure was used for L = 2 (bigrams) and L = 3 (trigrams). First, the feature vector from each character in a sequence of L unknown characters was used to calculate

Q = Σ_{u=1}^{D_a} ln P(x_u | C_i, D_a) + ln P(D_a | C_i)    (5)

for all 26 pattern classes. The pattern classes were then ranked in order corresponding to the value of Q; thus the pattern class for which Q was largest was ranked first, the pattern class for which Q was second largest was ranked second, and so on. Equation (4) was then maximized over all d^L sequences containing the pattern classes having a rank between 1 and d, inclusive.
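A minimal sketch of this reduced search, assuming the per-character scores Q of (5) have already been computed for each position (the data layout here is hypothetical, not the authors' implementation):

```python
import math
from itertools import product

def depth_d_search(Q, seq_log_prior, d):
    """Reduced search over a block of L = len(Q) characters.
    Q[r][c] is the single-character score (5) of class c at position r;
    seq_log_prior maps class tuples to ln P(C_..., ..., C_...).
    Keep only the d top-ranked classes at each position, then maximize
    the block score (4) over the d**L surviving sequences."""
    L = len(Q)
    top = [sorted(range(len(Q[r])), key=lambda c: -Q[r][c])[:d]
           for r in range(L)]
    best_seq, best = None, -math.inf
    for seq in product(*top):
        # Score (4) = sum of per-character scores (5) + sequence log prior;
        # sequences missing from the prior table are illegal.
        s = sum(Q[r][c] for r, c in enumerate(seq))
        s += seq_log_prior.get(seq, -math.inf)
        if s > best:
            best_seq, best = seq, s
    return best_seq
```

With d = 4 and L = 2 this examines at most 16 sequences instead of 676; with L = 3, at most 64 instead of 17 576.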

In the experiments described in the next section, five-digit estimates of P(C_i) for individual characters were obtained from Pierce [13]. The 304 legal bigram and the 2510 legal trigram probabilities were obtained from relative-frequency data in Pratt [14]. All other bigrams and trigrams were considered illegal. Thus, when d was so small that all d^L sequences of characters were illegal, the characters in the sequence were identified individually, without using context, by maximizing Q in (5).

Probabilities P(x_u | C_i, D_a) and P(D_a | C_i) were learned by determining the relative number of times a component x_u or length D_a occurred, given the joint event (C_i, D_a) or the

¹ This algorithm is an extension of algorithm T used earlier [9].


event C_i. When an experimentally determined probability P = 0, we set P = 1/(z + 2) [15], where z is the number of times (C_i, D_a) or C_i occurred during training.
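As a small worked example of this zero-count correction (an illustrative helper, not the authors' code):

```python
def smoothed_prob(count, z):
    """Relative-frequency estimate with the zero-count correction
    used in the paper (after Good [15]): if an outcome was never
    observed in z occurrences of the conditioning event, estimate
    its probability as 1/(z + 2) instead of 0."""
    return count / z if count > 0 else 1.0 / (z + 2)
```

For instance, a component value never seen in z = 8 training occurrences of (C_i, D_a) is assigned probability 1/10 rather than 0, so its log term in (4) and (5) remains finite.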

IV. EXPERIMENTS AND RESULTS

Four experiments were conducted for each of the four values of L. In the first two experiments, the same 14 alphabets were used for learning P(x_u | C_i, D_a) and P(D_a | C_i), and for testing. In the first experiment, the four-part area division was used (see Fig. 1). The six-part division was used in the second experiment. In the last two experiments, which also differed from each other solely on the basis of area division, ten alphabets from five individuals were used for training, while the four alphabets from the remaining two persons were used to form monograms, bigrams, and trigrams for testing. In these last two experiments, the results were averaged over seven trials; in the ith trial, data from persons i and i + 1 (i = 1, 2, ..., 6) were used for testing, while in the seventh trial data from persons 7 and 1 were used.² For L = 1, 2, or 3, 2(2L) sample L-grams were used for testing, while four samples of each character were used for L = 0. When the same data were used for both training and testing, 7(2L) sample L-grams were used for testing for each trial when L = 1, 2, or 3, and 14 samples of each character were used for L = 0. No L-gram test samples consisted of individual characters printed by different persons.

The results of the above experiments yielded maximum

likelihood estimates of P(ε | C_{i1}, C_{j2}, ..., C_{kL}), which denotes the probability of error ε given the sequence C_{i1}, C_{j2}, ..., C_{kL}. The probability of error averaged over all sequences was then obtained from

P(ε) = Σ_{i=1}^{26} Σ_{j=1}^{26} ... Σ_{k=1}^{26} P(ε | C_{i1}, C_{j2}, ..., C_{kL}) P(C_{i1}, C_{j2}, ..., C_{kL}).    (6)

Fig. 2 shows P(ε) versus L for the four experiments. The L = 0 and L = 1 results were obtained by searching over all pattern classes whose samples yielded, during training, a feature vector length equal to that of the unknown character. The L = 2 and L = 3 results were obtained in the same way, except that d = 4. This value for d was obtained by calculating P(ε) versus d for the four-part area division (4-PAD) scheme with L = 2 and L = 3 when the training and test data were disjoint (TS case). For L = 2, d assumed all integer values from 1 to 16 inclusive, since 16 was the maximum needed with our data. For L = 3, d went from 1 to 5. The results in Fig. 3 show a minimum P(ε) at d = 4. This value of d = 4 was subsequently used to obtain all the results for L = 2 and L = 3 in Fig. 2.

V. DISCUSSION

From Fig. 2 it follows that P(ε) decreases as L increases, but at a decreasing rate. The relative improvement resulting from use of context seems to increase as recognition accuracy for L = 0 increases. For the TR-6, TR-4, TS-6, and

² This procedure yields a better estimate of performance on small data sets than does the often used holdout method [16].

[Figure: error probability (percent) versus order of contextual information (0, 1st, 2nd, 3rd); curves TS-4, TS-6, TR-4, TR-6.]
Fig. 2. Error probability versus order of contextual constraints. TS denotes that training and test data were disjoint. TR signifies that training and test data were identical.

[Figure: error probability (percent) versus depth of search d (1 to 16); curves for bigrams and trigrams.]
Fig. 3. Error probability versus depth of search d.

TS-4 cases, P(ε) for L = 3 is, respectively, 0.23, 0.28, 0.52, and 0.49 times its value at L = 0. Such behavior is reasonable, since one would expect correction of errors using context to be easiest when the surrounding text is correct. The 6-PAD is better than the 4-PAD scheme for all L ≤ 3, as noted earlier [9] for L = 0 and L = 1. The TS results are not as good as the corresponding TR results, a difference also observed by others [16]-[19].

In Fig. 3, P(ε) decreases as d increases for 1 ≤ d ≤ 4 because an increase in d allows more legal character sequences to be examined. The fact that P(ε) actually increases as d increases beyond d = 4 has to be due to the fact that an approximation to R, rather than R as given by (1), is being maximized; the approximation results from (2), from (3), and from the fact that P(x_u | C_i, D_a) and P(D_a | C_i) are not known exactly in the TS case.

Consider now two tradeoffs. First is the tradeoff between L and d. Not only is the performance for L = 2 and d = 4 better than for L = 3 and d = 3, but the amount of computation needed in the former case (4²/2 = eight searches per character) is less than that for the latter case (3³/3 = nine searches per character). Second, the 6-PAD L = 2 TS case yields better performance than does the 4-PAD L = 3 TS case. For the TR experiments, the 6-PAD L = 1 and L = 2 cases yield better performance than does the 4-PAD L = 3 case. For the 4-PAD and 6-PAD cases, the average length of the feature vectors was 17 and 25, respectively, and the number of different (C_i, D_a) values was 67 and 66. It follows that with d = 4 the 6-PAD scheme for L = 1 and L = 2 requires less computation time and storage than the 4-PAD L = 3 case; the extra storage needed for P(x_u | C_i, D_a) for the 6-PAD vectors is



more than offset by the saving which results in not having to store trigram probabilities. It is important to balance the amount of information from character features with that obtained using contextual constraints, as Raviv [6] has suggested.

From [9] it follows that the results in Fig. 2 would improve considerably if a few additional tests were used to differentiate between commonly confused characters, and/or if all character samples were from one individual. Higher level semantic constraints, whose use is feasible in a well structured computer language such as FORTRAN, would further reduce the error rate [5]. The results in Fig. 2 would also improve if values L > 3 were used. Unfortunately, the number of different sequences for which T_{ij...k} in (4) would have to be computed increases exponentially with L for fixed d. This exponential growth can be avoided by using sequential decoding [1], [6].

We note that requiring subjects to make unbroken characters did not greatly inconvenience our subjects, and that methods for handling broken characters are described elsewhere [9]. Finally, we note that our results support those of Bakis et al. [20] in that curve-following features extract significant information from handprinting, and those of Raviv [6] and Duda and Hart [5] in that use of context significantly improves recognition accuracy.

REFERENCES

[1] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965, ch. 6.

[2] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968, ch. 6.

[3] B. Gold, "Machine recognition of hand-sent Morse code," IRE Trans. Inform. Theory, vol. IT-5, pp. 17-24, March 1959.

[4] W. W. Bledsoe and J. Browning, "Pattern recognition and reading by machine," 1959 Proc. Eastern Joint Computer Conf., vol. 16, pp. 225-232; also in L. Uhr, Pattern Recognition. New York: Wiley, 1966, pp. 301-316.

[5] R. O. Duda and P. E. Hart, "Experiments in the recognition of hand-printed text: Part II-context analysis," 1968 Fall Joint Computer Conf., AFIPS Proc., vol. 33, pt. 2. Washington, D. C.: Thompson, 1968, pp. 1139-1149.

[6] J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition," IEEE Trans. Inform. Theory, vol. IT-13, pp. 536-551, October 1967.

[7] R. Alter, "Use of contextual constraints in automatic speech recognition," IEEE Trans. Audio, vol. AU-16, pp. 6-11, March 1968.

[8] K. Abend, "Compound decision procedures for pattern recognition,"1966 Proc. NEC, vol. 22, pp. 777-780.

[9] G. T. Toussaint and R. W. Donaldson, "Algorithms for recognizing contour-traced handprinted characters," IEEE Trans. Computers (Short Notes), vol. C-19, pp. 541-546, June 1970.

[10] J. K. Clemens, "Optical character recognition for reading machine applications," Ph.D. dissertation, Dept. of Elec. Engrg., Massachusetts Institute of Technology, Cambridge, Mass., August 1965.

[11] S. J. Mason and J. K. Clemens, "Character recognition in an experimental reading machine for the blind," in Recognizing Patterns, P. A. Kolers and M. Eden, Eds. Cambridge, Mass.: M.I.T. Press, 1968, pp. 156-167.

[12] S. J. Mason, F. F. Lee, and D. E. Troxel, "Reading machine for the blind," M.I.T. Electronics Research Lab., Cambridge, Mass., Quart. Progress Rept. 89, pp. 245-248, April 1968.

[13] J. R. Pierce, Symbols, Signals and Noise. New York: Harper and Row, 1961, p. 283.

[14] F. Pratt, Secret and Urgent. New York: Blue Ribbon, 1942.

[15] I. J. Good, The Estimation of Probabilities, Res. Monograph 30. Cambridge, Mass.: M.I.T. Press, 1965.

[16] L. Kanal and B. Chandrasekaran, "On the dimensionality and sample size in statistical pattern classification," 1968 Proc. Natl. Electronics Conf., vol. 24, pp. 2-7.

[17] J. H. Munson, "The recognition of hand-printed text," in Pattern Recognition, L. N. Kanal, Ed. Washington, D. C.: Thompson, 1968, pp. 109-140.

[18] W. H. Highleyman, "The design and analysis of pattern recognition experiments," Bell Sys. Tech. J., vol. 41, pp. 723-744, March 1962.

[19] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inform. Theory, vol. IT-14, pp. 55-63, January 1968.

[20] R. Bakis, N. M. Herbst, and G. Nagy, "An experimental study of machine recognition of hand-printed numerals," IEEE Trans. Sys. Sci. Cybern., vol. SSC-4, pp. 119-132, July 1968.

Computer Experience on Partitioned List Algorithms

E. MORREALE AND M. MENNUCCI

Abstract-The main characteristics of some programs implementing a number of different versions of partitioned list algorithms are described, and the results of a systematic plan of experiments performed on these programs are reported. These programs concern the determination of all the prime implicants, a prime implicant covering, or an irredundant normal form of a Boolean function. The experiments performed on these programs concern mainly the computer time required, the number of prime implicants obtained, and their distribution in families. The results obtained from these tests demonstrate that relatively large Boolean functions, involving even some thousands of canonical clauses, can be very easily processed by present-day electronic computers.

Index Terms-Boolean function, computer programs, computing times, covering, experimental results, incompletely specified, irredundant normal forms, minimization, partitioned list, prime implicants.

I. INTRODUCTION

Recently a new class of algorithms, partitioned list algorithms, has been proposed [1] for determining, for any given Boolean function, either the set of all the prime implicants or a prime implicant covering of the function, from which an irredundant normal form can be easily obtained. Some theoretical results [2] concerning the computational complexity of partitioned list algorithms for prime implicant determination indicate that partitioned list algorithms compare favorably with both Quine's [3] and McCluskey's [4] nonpartitioned list algorithms. However, through these theoretical evaluations, only a rough estimate can be made of the average computing time actually required for finding all the prime implicants of a Boolean function. Furthermore, in considering the actual performances of partitioned list algorithms, it is interesting both to compare different versions of algorithms having the same purpose, i.e., the determination of all the prime implicants, and also to compare similarly structured algorithms having different purposes, i.e., the determination of all the prime implicants, only a prime implicant covering, or an irredundant normal form.
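For contrast with the partitioned list approach, the classical nonpartitioned combining step in the style of Quine [3] and McCluskey [4] can be sketched as follows. This is an illustrative Python fragment, not the authors' IBM 7090 programs and not a partitioned list algorithm:

```python
from itertools import combinations

def prime_implicants(minterms, nbits):
    """Nonpartitioned Quine-McCluskey-style combining step:
    repeatedly merge pairs of implicants that differ in exactly one
    (non-dash) bit position, replacing that bit with '-'.
    Implicants that are never merged are prime."""
    terms = {tuple(format(m, f'0{nbits}b')) for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(terms, 2):
            diff = [i for i in range(nbits) if a[i] != b[i]]
            if len(diff) == 1 and '-' not in {a[diff[0]], b[diff[0]]}:
                c = list(a)
                c[diff[0]] = '-'          # the differing bit is eliminated
                merged.add(tuple(c))
                used.update({a, b})
        primes |= terms - used            # unmerged implicants are prime
        terms = merged
    return {''.join(t) for t in primes}
```

For example, prime_implicants({0, 1, 2}, 2) returns {'0-', '-0'}. The pairwise comparison over the whole list is what makes this nonpartitioned version costly on large functions, which is the inefficiency the partitioned list algorithms of [1] are designed to address.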

In order to evaluate both the relative and the absolute performances of some different versions of partitioned list algorithms, a number of programs has been realized for the IBM 7090 and a systematic plan of tests has been conducted on them. With the aim of obtaining indications on the absolute performances of partitioned list algorithms on exist-

Manuscript received June 7, 1968; revised December 11, 1969.

The authors are with the Istituto Elaborazione dell'Informazione, Pisa, Italy.


