Integrated Computer-Aided Engineering 21 (2014) 219–234, DOI 10.3233/ICA-140460, IOS Press

Audio-video based character recognition for handwritten mathematical content in classroom videos

Smita Vemulapalli (a) and Monson Hayes (a,b,∗)
(a) Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta, GA, USA
(b) Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul, Korea

Abstract. Recognizing handwritten equations is a challenging problem, and even more so when they are written in a classroom environment. However, since videos of the handwritten text and the accompanying audio refer to the same content, a combination of video and audio based recognition has the potential to significantly improve the recognition accuracy. In this paper, we focus on using a combination of video and audio based recognizers to improve the character recognition accuracy for handwritten mathematical content in videos, and we propose an end-to-end recognition system. The system includes components for video preprocessing, selecting the characters that may benefit from audio-video based combination, establishing a correspondence between the handwritten and the spoken content, and finally combining the recognition results from the audio and video based recognizers. The current implementation of the system makes use of a modified open source text recognizer and a commercially available phonetic word spotter. For evaluation purposes, we use videos recorded in a classroom-like environment, and our experiments demonstrate the significant improvements in character recognition accuracy that can be achieved using our techniques.

Keywords: Video preprocessing, handwriting recognition, speech recognition, classifier combination

1. Introduction

Recent years have witnessed a rapid increase in the number of e-learning and advanced learning initiatives that either use classroom videos as the primary medium of instruction or make them available online for reference by the students. As the volume of recorded video content increases, it is clear that in order to efficiently navigate through the videos there is a need for techniques that can help extract, identify and summarize the video content. In this context, and given the fact that the whiteboard continues to be the preferred and effective medium for teaching complex mathematical and scientific concepts [4,60], this paper focuses on how to use both the audio and the video to achieve a higher recognition accuracy than when only one recognizer is used alone. It is clear that one should use all available information to aid in any recognition task. Since instructors who write their lectures on the whiteboard will typically speak what is being written, it is possible to use the audio to improve the recognition accuracy of a character recognizer. The question is how best to combine the outputs of a character and an audio recognizer, and how to use the audio to assist in character recognition. There are many difficult issues that need to be addressed, such as when the audio should be used to assist in the recognition. For example, if a handwritten character is correctly recognized by the character recognizer, then any further processing using the audio may result in an error. If the character recognizer is having difficulty recognizing a character, and there is ambiguity as to what the correct character is, then how many options should be considered when using the audio to assist in the recognition? When the audio is searched to see if a specific character has been verbalized, over what window should the search be made, and what search terms should be used? The character "2" may be spoken in a variety of ways, depending upon the context in which it is written. It could be spoken, for example, as "two" or "twice" or "square". If two or more utterances for a word are found, which one corresponds to the character that was written? When placing this audio utterance in the context of others, it is important to select the correct one. And how should the context in which a character is spoken be used? In other words, how does one incorporate any knowledge or information about what characters may appear within a neighborhood of an ambiguous character to help resolve the ambiguity? In this paper, we propose a variety of ways to address these and other issues in combining video and audio to recognize handwritten equations.

∗ Corresponding author: Monson Hayes, School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul, Korea. Tel.: +1 404 462-8257; E-mail: [email protected].

ISSN 1069-2509/14/$27.50 © 2014 – IOS Press and the author(s). All rights reserved

There is a significant body of research devoted to the recognition of handwritten equations [1,3,6,32,52,63], and to the extraction and recognition of textual content from video [16,17,29,45,61]. While the research presented in this paper is closely related to and dependent on advances made in these fields, the focus here is on how to use audio to enhance the performance of a character recognizer. Specifically, this paper presents a recognition system for audio-video based character recognition for handwritten mathematical content in classroom videos.

2. Related work

The research presented in this paper lies at the intersection of four distinct and well-studied specializations: video processing, handwritten text recognition, speech recognition and classifier combination. In the following sections, the relevant literature for each of these specializations is reviewed, along with some recent work on audio-video content recognition.

2.1. Video preprocessing for text extraction

A method for detecting and tracking text in digital video is proposed in [29], which implements a scale-space feature extractor that feeds an artificial neural processor to detect text blocks. Methods have also been proposed to detect key frames in video for easy retrieval, as well as methods to acquire data about the movement of the tip of the pen or pen strokes on a whiteboard [45,50]. For example, a system that automatically produces a set of key frames representing all the written content on the whiteboard before each erasure is described in [16]. Video content for which it is important to be able to extract text includes recorded lectures for indexing and retrieval in e-learning initiatives [13], video-taped presentations for summarization [24], and commercial and personal videos for searching and cataloging. An advanced technique for extracting text from faded historic documents that is arranged in a complex pattern, with parallels to mathematical content and out-of-focus handwritten content, is presented in [44].

2.2. Handwritten mathematical text recognition

There is a vast collection of literature related to the recognition of handwritten text, and a comprehensive survey of this research is presented in [38]. Handwriting recognition may be done from a variety of input sources such as paper documents [11,28,43,47], pen-based inputs on an electronic screen [37], video [51] and other specialized input devices [50]. Mathematical content recognition [6,8] presents some challenges that are quite different from those of recognizing text. This is due to the fact that mathematical characters and symbols have different sizes and are often arranged spatially in a complex two-dimensional structure. For mathematical character recognition, researchers have addressed issues that arise from factors such as the large number of similar symbols (u versus μ, or ν versus v) that must be recognized, and the lack of a lexicon that may be used for final validation [32].

Research has shown that the recognition accuracy of mathematical expressions can be improved by combining two or more stages [46], and Prusa et al. have shown how to use a two-dimensional grammar to achieve better mathematical formula recognition [40]. Similarly, a hidden Markov model based method that avoids segmentation during pre-processing by making use of simultaneous segmentation and recognition capabilities is presented in [25,26]. Finally, Awal et al. describe an interesting approach for the simultaneous optimization of segmentation, recognition and structure analysis that is constrained by a mathematical expression grammar [5].

2.3. Speech recognition

Speech recognition is the process of converting an acoustic signal captured by a microphone or a similar device into a sequence of words. Over the last few decades, research in speech recognition has made significant advances that have led to the development of a number of commercial speech recognizers, including Dragon Naturally Speaking [10], Microsoft Speech [34] and IBM ViaVoice [58]. There have also been some important research contributions from the academic community, with the HTK project [19] from the University of Cambridge and the Sphinx project [49] from CMU. Nexidia's word spotting tool [35] provides a fast and efficient approach to searching for words within an audio stream.

2.4. Classifier combination

The field of classifier combination has been constantly evolving to address the challenges posed by new application domains. A comprehensive survey of classifier combination techniques is presented in [53], which partitions the classifier combination methods along several distinct dimensions, including the way in which the outputs of the classifiers are combined and whether the number of classifiers is fixed or the combination method draws classifiers from a large pool. There is also a large body of research that focuses on generic methods for classifier combination. Lucey et al. [30], for example, have proposed a theoretical framework for independent classifier combination. While some classifier combination methods use another classifier to combine the output of multiple classifiers, others make use of rules and functions to combine the outputs. Some of the combination techniques proposed in our research are adaptations of well-known methods such as weighted combination, Borda count [54], and other decision combination techniques [18]. In the context of handwriting and speech recognition, classifier combination techniques have also been used to improve the recognition accuracy of handwriting recognizers [14,41,48,59] as well as speech recognizers [12].

2.5. Audio-video based content recognition

Audio and video signals carry complementary information, and often an error in recognizing a spoken character will not be accompanied by an error in recognizing the written character. Therefore, a combination of both information sources can potentially lead to a significant improvement in the recognition accuracy compared to that which is obtained when either one is used alone. Yu et al. [23,62] propose a classifier combination framework for grammar-guided sentence recognition, and present results for spoken email command recognition, where an acoustic classifier and a visual classifier (for recognizing lip movement, and tongue and teeth visibility) are combined. The Speech Pen system with an advanced digital whiteboard recognizes speech and handwriting in the background and provides the instructor with a list of possible next words that allows the instructor to skip manual writing [27].

A comprehensive collection of research in the field of audio-visual speech recognition is presented in [39]. In the context of classroom videos that utilize slides and digital ink, Anderson et al. present an empirical basis for addressing the problem of the automatic generation of full text transcripts for lectures [2]. Their approach relies on matching spoken content with slide content, and on recognizing the meaning of the content written by the instructor using digital ink. An investigation of a number of strategies for combining HMM classifiers for improving audio-visual recognition is presented in [31], and based on empirical, theoretical and heuristic evidence, a recommendation is made for using a hybrid of the sum and product rules.

Hunsinger et al. have proposed a multimodal mathematical formula editor that combines speech and handwriting recognition. In [21], they describe the speech understanding module of the system, and in [20] they present a multimodal probabilistic grammar that incorporates the syntactic-semantic attributes of spoken and handwritten mathematical formulas. A system for isolated mathematical symbol recognition based on the fusion of speech and handwriting data is presented in [33]. Although neither the speech nor the handwritten data originates from video, the techniques used to combine the output of the character and speech recognizers are similar to the techniques presented in this paper. However, issues such as ambiguity detection and A/V synchronization are not considered in the aforementioned research. Another relevant research effort, closely tied to [33], relates to the creation of a data set with handwritten and spoken mathematical content [41]. Unfortunately, this data set consists of static image segments containing handwritten content, with the corresponding audio stored in separate files. The absence of a video sequence (for audio and video time-stamping) makes this data set unsuitable for our experiments.

3. System overview

The overall system for recognizing handwritten mathematical text and equations is shown in Fig. 1.


Fig. 1. Components of the audio-video character recognition system.

The first stage is the video text recognizer, which includes a video preprocessor and a character recognizer. The video preprocessor includes all of the processing that is required to extract the text that is to be recognized, segment the text into characters, generate timestamps for each character in the video, and tag the location of the characters in the video frame. For each segmented character, the character recognizer then generates a list of one or more possible characters from a dictionary of possible characters. The character recognizer also generates a score for each character in the list that represents the recognizer's belief in the correctness of the character name. Since the characters in this list are based only on the video, they will be referred to as video options.

Following the video text recognizer is the ambiguity detector, which determines whether or not the video option with the highest score is likely to be the correct character. If not, then the character to be recognized is classified as ambiguous, and two or more video options are selected for further processing to determine which is the correct one. This is done in the character disambiguation stage.

The character disambiguation stage consists of three components. The first is an audio text recognizer that first assigns one or more audio search terms to each video option, and then searches the audio within some window for the occurrence of these terms. The output of the audio text recognizer is a set of audio options, which are occurrences in the audio stream of the video option's character name. Each audio option consists of the audio search term, an audio timestamp, and an audio match score. The second component is the audio-video synchronizer, which processes the set of audio options and assigns at most one audio option to each video option. The output of this stage is one or more audio/video pairs. The final step is audio-video combination. Here, the recognition scores of each audio/video pair are analyzed to produce a final recognized character.

In the following section, we begin by looking at the task of video text recognition, i.e., character recognition that uses only the video.

4. Video text recognition

When capturing the video for video text recognition, a few assumptions are made about the recording process. Although these assumptions are not very restrictive, they simplify many of the preprocessing tasks. First, it is assumed that the entire whiteboard is within the field of view of the camera, and that the whiteboard (the region of interest) is easily detected. It is also assumed that the beginning of every recording session has at least one calibration frame, which is a video frame with a clean whiteboard without the instructor. In the recording of a lecture, it is assumed that the board is erased completely before the instructor begins a new board, and that the instructor briefly steps away from the board after a complete erasure so that the entire region of interest is unobstructed.

4.1. Video preprocessing

The first step in the video text recognizer is the video preprocessor, which performs a number of important tasks [55]. The first is to identify the region of interest (the whiteboard), and the second is to detect the frames of interest in the video, which are unobstructed views of the whiteboard just prior to an erasure. Thus, the frame of interest contains all of the characters that are to be detected and recognized. The process of detecting the frame of interest relies primarily on counting the number of connected components contained in video frames over the duration of the video, and selecting the frame with the maximum number of connected components.

Fig. 2. Video text recognition. (a) Text that has been extracted and segmented into characters from a video frame and (b) the vector c(s) that consists of the image of the character s, and the time and location at which the character appeared on the whiteboard.
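As a rough illustration, this frame-of-interest selection can be sketched in a few lines of Python, assuming OpenCV is available; the sampling stride, the Otsu binarization, and the omission of the occlusion and erasure logic described above are simplifications for illustration, not details of the authors' implementation.

```python
import cv2

def find_frame_of_interest(video_path, sample_every=30):
    """Return the sampled frame with the most connected components.

    More connected components suggests more written content on the
    whiteboard. A full implementation would restrict the search to
    unobstructed frames just prior to an erasure.
    """
    cap = cv2.VideoCapture(video_path)
    best_frame, best_count, index = None, -1, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Dark ink on a light board: invert so text is foreground.
            _, binary = cv2.threshold(
                gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
            count, _ = cv2.connectedComponents(binary)
            if count > best_count:
                best_frame, best_count = frame, count
        index += 1
    cap.release()
    return best_frame
```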

The next step is character segmentation. Assuming that a given frame of interest is free of any occlusions or shadows from the instructor, and that individual characters appear as one or more distinct connected components, a connected component analysis algorithm is used to extract the characters [7]. A post-processing step allows for the handling of characters in the dataset that do not appear as a single connected component, such as 'i' and '='. The final step is to produce a video timestamp for each segmented character, which is the time at which the character is written on the whiteboard. An example is given in Fig. 2, which shows a set of segmented characters. Associated with each character is a vector

c(s) = [s, t(s), l(s)]

that contains the image of the character, s, the time at which the character was written on the board, t(s), and the location of the character on the board, l(s). Since the location is used only in the structure analysis of an equation [55,56], it will not be used here.

4.2. Character recognition

Once a character has been extracted, it is forwarded to the character recognizer. Although there are many recognizers that could be used, we chose the GNU Optical Character Recognition program, or GOCR [15]. It is important to point out that although GOCR is not the best character recognizer for handwritten text, the focus of this paper is not to build a state-of-the-art audio/video character recognizer, but rather to investigate ways in which audio and video may be combined to improve the accuracy of any given character recognition system. If the recognizer were perfect, then there would be no need to use audio to assist in the recognition of characters. However, since no handwritten character recognizer is perfect, any recognizer that introduces errors or uncertainties in the recognition of characters may be used to study different approaches for audio-assisted character recognition. It should also be pointed out that, in many cases, it may not be possible to use a state-of-the-art recognizer if one is interested in real-time character recognition on a simple platform. Although the techniques and approaches presented in this paper are not tied to any specific recognizer, some of the parameters, along with the final recognition rates, will differ depending on which character (and audio) recognizer is used.

Two modifications to GOCR were made. The first was to have GOCR return a set of candidate characters rather than a single recognized character or no match at all. The second was to return a score that is based on the number and the relative significance of the recognition rules that are satisfied. Thus, when the image of a character is passed to the character recognizer, a set of possible characters is generated along with a score that indicates how likely it is that the given character is the correct one. Since these candidate characters are based only on the video, they will be referred to as video options, as opposed to audio options, which are based on the audio, as discussed later in Section 6.1. Thus, as illustrated in Fig. 3, the output of the video text recognizer is a set of L video options for each character s, where each video option, v_j(s), is an ordered pair

v_j(s) = [v_j^c(s), v_j^p(s)]

where v_j^c(s) is a character from the dictionary C and v_j^p(s) is the recognition score for that character.
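To make the bookkeeping concrete, the two structures introduced so far can be written as simple Python records; the class and field names below are our own, not identifiers from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class VideoOption:
    """One video option v_j(s) = [v_j^c(s), v_j^p(s)]."""
    char: str     # v_j^c(s): a candidate character from the dictionary C
    score: float  # v_j^p(s): the recognition score for that candidate

@dataclass
class SegmentedCharacter:
    """The vector c(s) = [s, t(s), l(s)] plus the L video options."""
    image: np.ndarray           # s: the segmented character image
    timestamp: float            # t(s): time written on the board (seconds)
    location: Tuple[int, int]   # l(s): board position (unused here)
    options: List[VideoOption]  # video options, sorted by descending score
```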


Fig. 3. A character s with L video options.

The current system recognizes the alphabetic characters (upper case and lower case), numbers and basic arithmetic operators. Expanding the dictionary to include other characters, such as Greek letters and more complicated mathematical symbols, is straightforward.

5. Ambiguity detection and option selection

After the character recognizer generates a set of video options for a character, the next step (ambiguity detection) is to decide whether or not the option with the highest score is likely to be correct. If it is likely to be correct, then it is output as the final recognized character. However, if it is determined that there is a sufficiently high probability that this option may be incorrect, then two or more options that satisfy some option selection criteria are sent to the audio recognizer to assist in the recognition.

5.1. Character recognition score

The recognition scores produced by the video text recognizer are often not the best metric to use to determine whether or not a recognized character should be classified as ambiguous. Therefore, these scores are mapped to a new set of scores that are better matched to the task of tagging the ambiguous characters. Some commonly used score normalization techniques are discussed in [22], but the approach that is used here is to replace the score with an estimate of the conditional probability that the character is correctly classified, given the video match score for that option. More specifically, let G be a function that returns the ground truth for a given character s,

G(s) = c

The conditional probability is then given by

Prob{v_i^c(s) = G(s) | v_i^p(s)} = Prob{v_i^c(s) = G(s), v_i^p(s)} / Prob{v_i^p(s)}

Estimating these conditional probabilities is done using a training set along with the ground truth for each character in this set. Thus, the estimate of this conditional probability that will be used as the character recognition score, v̄_i^p(s), is

v̄_i^p(s) = N(v_i^c(s) = G(s), v_i^p(s)) / N(v_i^p(s))

where the term in the numerator is the number of times v_i^c(s) is correctly classified when its score is v_i^p(s), and the term in the denominator is the number of times v_i^c(s) has a score of v_i^p(s). In some cases, such as when there is a limited training set, it may be necessary to divide the range of scores into intervals and estimate the conditional probability given that v_i^p(s) falls within some range of values.
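A minimal sketch of this binned estimate is given below; the ten-bin layout and the assumption that raw scores lie in [0, 1] are illustrative choices, not the paper's settings.

```python
from collections import defaultdict

def fit_rescoring_table(training, num_bins=10):
    """Estimate Prob{option correct | raw score} for each score bin.

    `training` holds (raw_score, is_correct) pairs gathered by running
    the character recognizer over a labeled training set; binning the
    scores handles limited training data, as suggested above.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for raw_score, is_correct in training:
        b = min(int(raw_score * num_bins), num_bins - 1)
        totals[b] += 1
        hits[b] += int(is_correct)
    return {b: hits[b] / totals[b] for b in totals}

def rescore(raw_score, table, num_bins=10):
    """Replace a raw score with the estimated conditional probability."""
    b = min(int(raw_score * num_bins), num_bins - 1)
    return table.get(b, 0.0)  # unseen bin: no evidence of correctness
```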

5.2. Character classification

Having an appropriate set of scores for each video option for a given character, it is now necessary to determine whether or not a character should be classified as ambiguous. Those that are ambiguous will be sent to the audio recognizer to assist in the recognition. It may at first seem best to send every character to the audio recognizer for verification or correction, but in those cases where the video option with the highest score is correct, the audio recognizer may find an utterance (or no utterance at all) in the audio that makes another character more likely, thereby introducing an error in the final recognition result. Similarly, if an incorrectly recognized character is not forwarded to the audio text recognizer, then there is no possibility for the error to be corrected. Therefore, it is important to determine which characters have a sufficiently high probability of being incorrectly recognized, to tag these as ambiguous, and to send only these to the audio recognizer.

A character is classified as non-ambiguous if its recognition score exceeds some threshold. To perform this classification, two types of thresholds were considered: simple thresholds and character-specific thresholds. In the following, S will be used to denote the set of all characters that are to be recognized, and D(S) will be used to represent the set of all characters in S that are classified as ambiguous. It will be assumed, for simplicity, that the video options for each character, v_j(s), have been ordered according to their recognition score, with the first option having the largest score.


Table 1
Classification of characters in the training set for a given character-specific threshold T

Set        Top video option    Tag
N_t(T)     Correct             Non-ambiguous
N_f(T)     Incorrect           Non-ambiguous
A_t(T)     Incorrect           Ambiguous
A_f(T)     Correct             Ambiguous

5.2.1. Simple thresholds

The first threshold criterion considered for ambiguity detection is one that classifies a character as ambiguous if the option with the largest score is less than some absolute threshold, T_A,

D(S) = {s ∈ S | v̄_1^p(s) < T_A}

The second is to evaluate the ratio of the second largest score, v̄_2^p(s), to the largest score, v̄_1^p(s), and if this ratio exceeds some threshold, T_R, then the character is classified as ambiguous,

D(S) = {s ∈ S | v̄_2^p(s) / v̄_1^p(s) > T_R}

The rationale here is that if the top video option is unequivocally correct, then it should have a score that is significantly larger than that of the second best video option.
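In code, the two criteria reduce to a pair of comparisons on the rescored options (reusing the VideoOption record sketched in Section 4.2); the paper applies one criterion at a time, so exposing both behind optional arguments is purely for illustration.

```python
def is_ambiguous(options, t_abs=None, t_ratio=None):
    """Flag a character as ambiguous using one of the simple criteria.

    `options` are the rescored video options, sorted by descending
    score. Threshold values would be chosen on a training set.
    """
    if t_abs is not None and options[0].score < t_abs:
        return True  # top score too low in absolute terms
    if t_ratio is not None and len(options) > 1 \
            and options[1].score / options[0].score > t_ratio:
        return True  # runner-up too close to the top option
    return False
```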

5.2.2. Character-specific thresholds

Since some characters are more difficult to recognize than others, and since a character recognizer will generally have different recognition rates for different characters, having the same threshold for all characters is generally not the best approach. Therefore, another approach is to use a different threshold for each character in the dictionary. To set these character-specific thresholds (which will depend on the specific character recognizer that is used), a training set for each character in the dictionary is created. Let S(c) denote the training set for the character c. Each character in S(c) is sent through the character recognizer, and this set is then partitioned into four sets, as illustrated in Table 1. This partition, which depends on a threshold T, is generated as follows. Let N(T) denote the set of all characters that would be classified as non-ambiguous using a threshold of T. In other words, the top recognition score for each of these characters is larger than T. As a result, these characters will not be sent to the audio recognizer for further processing, and the video option having the largest recognition score will be the final output. This set is then partitioned into two sets, N_t(T) and N_f(T). The characters in the first set are those for which the video option with the highest score is the correct character, c, and therefore will be correctly recognized. The characters in the second set, on the other hand, are those for which the video option with the highest score is incorrect and will be incorrectly recognized.

All of the characters not in N(T) are in a set denoted by A(T), and these are the characters that would be classified as ambiguous using the threshold T and would be sent to the audio recognizer for additional processing. This set is partitioned into two sets, A_t(T) and A_f(T). The characters in the first set are those that are correctly classified as ambiguous, because the correct character is not the one with the highest recognition score. Therefore, further processing may result in the correct recognition of these characters. For the characters in the second set, the video option with the highest recognition score is the correct one. However, since the score does not exceed the threshold T, they are sent to the audio recognizer for further processing and may, eventually, be recognized incorrectly.

For a given threshold, T, the recognition rate for the character c over the training set S(c) is equal to

α(T, c) = (|N_t(T)| + α_A^L |A_t(T) ∪ A_f(T)|) / |S(c)|

where |A| is the number of elements in the set A, and α_A^L is the recognition accuracy for the ambiguous characters when L video options are sent to the audio recognizer. This value is estimated from the training set. The character-specific threshold T(c) is then defined to be the value of T that maximizes the recognition rate,

T(c) = argmax_T α(T, c)

It is important to note that these thresholds depend on the specific character recognizer that is used.
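The threshold search itself is a one-dimensional sweep. The sketch below assumes the training data for character c has been reduced to (top score, correct?) pairs and that α_A^L has already been estimated; the candidate grid is an illustrative choice.

```python
def character_specific_threshold(samples, alpha_ambiguous, candidates):
    """Pick T(c) that maximizes alpha(T, c) over the training set S(c).

    `samples` is a list of (top_score, is_correct) pairs for character
    c; `alpha_ambiguous` is the estimated accuracy alpha_A^L on
    characters sent to the audio stage; `candidates` is a grid of
    threshold values to try.
    """
    best_t, best_alpha = None, -1.0
    for t in candidates:
        n_correct_kept = sum(1 for s, ok in samples if s > t and ok)
        n_ambiguous = sum(1 for s, _ in samples if s <= t)
        alpha = (n_correct_kept
                 + alpha_ambiguous * n_ambiguous) / len(samples)
        if alpha > best_alpha:
            best_t, best_alpha = t, alpha
    return best_t
```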

5.3. Option selection

After a character has been classified as ambiguous, it is necessary to determine which set of video options should be forwarded to the audio recognizer to help resolve the ambiguity. If the character recognizer produces N video options for a character, then the naïve approach would be to send all N options to the audio recognizer, since this would increase the probability that the correct character would be among the options that are forwarded. However, when the correct character is the one with the highest score, each additional option that is forwarded increases the chance that an error will be made in the final output. On the other hand, if the number of options that are passed is too low, then the chances are higher that the correct character will not be included within this set. Thus, the goal of option selection is to choose a set of video options in such a way that the probability of having the correct character within the list is maximized while, at the same time, minimizing the number of options that are in the list.

Three different option selection strategies were considered. If K options are to be forwarded to the audio recognizer, then the first strategy is simply to select the K video options that have the largest recognition scores. Again assuming that the video options have been ordered according to their recognition score, with v_1(s) having the largest score, the set of video options is

O(s) = {v_1(s), v_2(s), ..., v_K(s)}

The second strategy is to select all video options that have a recognition score that exceeds a threshold T. Thus, if V(s) is the set of all video options, then

O(s) = {v_i(s) ∈ V(s) | v̄_i^p(s) > T}

In this case, the number of video options is variable. The third approach is to select all video options that have a recognition score that exceeds some fraction, T_O, of the highest score,

O(s) = {v_i(s) ∈ V(s) | v̄_i^p(s) / v̄_1^p(s) > T_O}

Here again, the number of options is not fixed. In the event that no video options satisfy the threshold condition, the top one or two options would be selected.
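The three strategies can be sketched as follows, again over options sorted by descending rescored score; the default parameter values are placeholders, not the tuned settings reported in Section 8.

```python
def select_options(options, strategy="relative", k=4, t=0.5, t_o=0.8):
    """Choose which video options to forward to the audio recognizer."""
    if strategy == "top_k":        # fixed number of options
        chosen = options[:k]
    elif strategy == "absolute":   # scores above a fixed threshold T
        chosen = [v for v in options if v.score > t]
    else:                          # scores within a fraction T_O of the best
        chosen = [v for v in options if v.score / options[0].score > t_o]
    # If nothing passed the threshold, fall back to the top option(s).
    return chosen if chosen else options[:2]
```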

6. Audio-video synchronization

Once the video options for the ambiguous characters have been identified, the audio recognition system is used to determine which of these options, if any, are found in the audio within some interval around the time that the character is written on the board. Since the goal is to search for specific phonemes or spoken words, Nexidia's word spotter was used, since it is fast, works well for non-standard grammars, and does not require any training [35]. For the case in which two or more audio options are found for a given video option, it is then necessary to perform audio-video synchronization to match the appropriate audio option with the given video option. First, we discuss what is meant by an audio option.

6.1. Audio options

When a handwritten character s that is written at time t(s) is classified as ambiguous, an audio search window is defined that extends from time t(s) − t_c to time t(s) + t_c.¹ Then, for each video option v_j(s), one or more audio search terms are defined for the character v_j^c(s). These audio search terms are the phonemes or words that might be spoken when the character s is written on the board. For example, if v_j^c(s) = "2", then the audio search terms might be "two", "squared", "twice", and "double". The audio is then searched over the given window for each audio search term, and for each one that is found, an audio option is created. These audio options are vectors that contain the audio search term, a_{j,k}^c(s), the time at which it occurs in the audio, a_{j,k}^t(s), and an audio match score, a_{j,k}^p(s). Thus, the kth audio option for the jth video option v_j(s) has the form

a_{j,k}(s) = [a_{j,k}^c(s), a_{j,k}^t(s), a_{j,k}^p(s)]

An example is shown in Fig. 4, where the character s has two video options. For the first option, v_1(s), only one audio search term is defined, which is "four", and over the audio search window two audio options are found. However, for the second video option, v_2(s), which also has only one audio search term, only one audio option is found. Any audio option that has an audio match score below a certain threshold may be ignored or discarded. For example, if the threshold is set to 0.75, then audio option a_{2,1}(s) would be removed, leaving no audio matches for the second video option.

Fig. 4. Audio options, a_{j,k}(s), for two video options, v_1(s) and v_2(s), for the character s.

¹ The window could be asymmetric, but this adds another parameter, and was found not to be particularly useful.
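The construction of audio options might look as follows. The `spotter.search(term, start, end)` interface is a hypothetical stand-in for a phonetic word spotter (the system uses Nexidia's tool, whose actual API differs), and the search-term table is a small excerpt of our own.

```python
SEARCH_TERMS = {"2": ["two", "squared", "twice", "double"],
                "4": ["four"], "9": ["nine"]}  # excerpt; one entry per character

def find_audio_options(spotter, char, t_written, t_c, min_score=0.75):
    """Build the audio options a_{j,k}(s) for one video option.

    Searches a symmetric window around the time the character was
    written and drops matches below the score threshold.
    """
    start, end = t_written - t_c, t_written + t_c
    options = []
    for term in SEARCH_TERMS.get(char, [char]):
        for time, score in spotter.search(term, start, end):
            if score >= min_score:
                options.append({"term": term, "time": time, "score": score})
    return options
```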

6.2. Audio-video synchronization

When two or more audio options remain for a given video option, it is necessary to perform audio/video synchronization to pair the video option with an appropriate audio option. It may seem that a good choice would be to pick the audio option that has the highest recognition score,

k_0 = argmax_k a_{j,k}^p(s)

but this is not necessarily the best choice, for the following reasons. First, it ignores the times at which the audio options occur in the audio. As a result, an audio option may be selected that is much further away in time from when the character is written on the board than another option with a slightly lower recognition score that corresponds to the correct utterance. This could happen, for example, when there are repeated characters, such as in the equation

x + 99y = z

The first "nine" that is spoken may have a higher recognition score than the second one and, therefore, may be the one that is selected as the audio option for both characters. Therefore, some alternative methods for synchronization have been considered, and are discussed below [55,57].

6.2.1. Time-difference synchronization

Another simple approach to A/V synchronization is to select the audio option a_{j,k}(s) that occurs at a time, a_{j,k}^t(s), that is closest to the time, t(s), at which the video option v_j(s) is written on the board,

k_0 = argmin_k |t(s) − a_{j,k}^t(s)|

For the example given in Fig. 4, v_1(s) has two audio options. Since the time of the second audio option is closer to t(s) than that of the first audio option, a_{1,2}(s) would be the one that is assigned to v_1(s). Although this approach is simple, it does not take into account the audio options that are found within a neighborhood of a given option for the characters that come before or after it. Therefore, some approaches that are based on the context in which the audio option occurs are presented below.
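With audio options represented as in the sketch of Section 6.1, this criterion is a single selection:

```python
def sync_by_time(t_written, audio_options):
    """Pick the audio option whose utterance time is closest to t(s)."""
    return min(audio_options, key=lambda a: abs(t_written - a["time"]))
```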

6.2.2. Neighbor-based methods

To see how context might be used for A/V synchronization, suppose that v_j(s) is a video option for the character s, and let a_{j,k}(s) be one of its audio options. If a_{j,k}^t(s) is the time at which this audio option occurs, define an A/V synchronization window that starts t_b seconds before and ends t_a seconds after this time, i.e.,

[a_{j,k}^t(s) − t_b, a_{j,k}^t(s) + t_a]

With n_a and n_b two positive integers, consider the top video options (those with the largest recognition scores) for the n_b characters that occur before and the n_a characters that occur after the character s. The number of the n_b video options that have an audio option within the A/V window before a_{j,k}^t(s), plus the number of the n_a video options that have an audio option within the given window after a_{j,k}^t(s), is assigned to the variable N(a_{j,k}(s)). Clearly, this variable may have any value between zero and n_a + n_b. The audio option that is then synchronized with the character s is the one that has the largest value of N(a_{j,k}(s)), i.e., a_{j,k_0}(s) where

k_0 = argmax_k N(a_{j,k}(s))

If two or more audio options have the same maximum value of N(a_{j,k}(s)), then the one that is closest in time to v_j(s) is selected.

As an illustrative example, suppose that the following equation is written on the board,

194 + t^2 = x    (1)

and that the character s = "4" has been classified as ambiguous. In addition, suppose that the first video option for this character is v_1^c(s) = "4" and that a_{1,1}^c(s) = "four" is an audio search term. Over the audio search window that is placed symmetrically around t(s), as illustrated in Fig. 5, note that two occurrences of "four" are found. So the question is: which one corresponds to the video option v_1(s)? Suppose that n_a = n_b = 1, and that "9" and "+" are the top video options for the character before and the character after s, respectively, i.e., the video options with the highest recognition scores. Within the A/V synchronization window for a_{1,1}(s), there are no audio options for "nine" before time a_{1,1}^t(s) and no audio options for "plus" after time a_{1,1}^t(s). Therefore, N(a_{1,1}(s)) = 0. Performing the same search over the A/V synchronization window for the second audio option at time a_{1,2}^t(s), we see that N(a_{1,2}(s)) = 2, since an audio option is found for "nine" within the window before time a_{1,2}^t(s) and one is found for "plus" within the window after time a_{1,2}^t(s). Therefore, a_{1,2}(s) is the audio option that would be synchronized (paired) with the character s = "4".

Fig. 5. Illustration of the audio search window for the audio search term "four" and the A/V synchronization windows for the two audio options.
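A sketch of the neighbor count and the resulting selection follows, continuing the dict layout used earlier; the organization of `before` and `after` (the audio options found for the top video options of the n_b preceding and n_a following characters) is an assumption made for illustration.

```python
def neighbor_count(candidate, before, after, t_b, t_a):
    """N(a_{j,k}(s)): neighbors whose audio falls in the sync window."""
    lo, hi = candidate["time"] - t_b, candidate["time"] + t_a
    n = 0
    for opts in before:   # characters written before s
        if any(lo <= a["time"] <= candidate["time"] for a in opts):
            n += 1
    for opts in after:    # characters written after s
        if any(candidate["time"] <= a["time"] <= hi for a in opts):
            n += 1
    return n

def sync_by_neighbors(t_written, audio_options, before, after, t_b, t_a):
    """Pick the option with the most neighbor support; ties go to the
    option closest in time to when the character was written."""
    return max(audio_options,
               key=lambda a: (neighbor_count(a, before, after, t_b, t_a),
                              -abs(t_written - a["time"])))
```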

6.2.3. Selective neighbor-based methods

In the neighbor-based method for A/V synchronization described above, the top video options of the neighboring characters are assumed to be correct, and the audio search terms for these options are the ones that are used when searching for neighbors and in determining the value of N(a_{j,k}(s)). However, since character recognizers are not perfect, the top video options may be incorrect, thereby resulting in poor synchronization. An alternative is to select a subset of the neighboring video options that includes those that have the highest probability of being correct. One way to do this is presented below through an illustrative example.

Suppose that Eq. (1) is written on the board, and that an audio option for the character s = "4" is to be found. Shown in Fig. 6 is a video option v_1(s) for this character, along with one of its audio options, a_{1,k}(s). Also shown are the top video options for the n_b = 2 characters before and the n_a = 2 characters after the character s, and the audio options for these characters that are found within the given synchronization window. Using the neighbor-based method with n_a = n_b = 1, N(a_{1,1}(s)) would be equal to one instead of two, because the video option before the character s is incorrectly recognized as "g" and no utterance of "gee" is found within the given window. However, suppose that n_a = n_b = 2, and that of the two characters before s, the one with the highest recognition score is selected, and the same is done for the two characters after s. In this case, N(a_{1,1}(s)) would be equal to two, since the incorrectly recognized character would not be used. So, with this approach, in addition to the number of characters before and after s that are considered, two additional parameters are defined, l_b and l_a, that specify how many of the n_b and n_a characters, respectively, will be selected.

Instead of selecting which neighbors to use based on video scores, the selection may be based on the audio. More specifically, with audio-based neighbor selection, the video options that are selected are those whose audio search terms have the largest number of phonemes. These are the ones that have a higher probability of correctly finding an audio option within the A/V synchronization window, if one exists. As with video-based neighbor selection, two additional parameters are needed, l_b and l_a, that specify how many of the n_b and n_a characters will be selected.

Fig. 6. The audio options that are found within an A/V synchronization window around a_{1,k}(s) for the top video options of the two characters before and the two characters after the character "4".

6.2.4. Rank sum based synchronization

Five different ways to perform audio-video synchronization were presented in the previous sections: score-based, time-difference, neighbor-based, and selective neighbor-based methods using either video or audio. For each synchronization method, the audio options may be rank-ordered and a rank number assigned to each. For example, if audio option a_{j,k}(s) is the lth best option under the first synchronization method (recognition score), then its rank under this method, R_1(a_{j,k}(s)), would be equal to l. Finding the ranks for this option under the other synchronization methods gives a set of five rank numbers that may be summed to give a rank-sum score. Computing the rank-sum score for each audio option, the one with the lowest score is then selected as the one to be synchronized with v_j(s), i.e., a_{j,k_0}(s) where

k_0 = argmin_k Σ_{i=1}^{5} R_i(a_{j,k}(s))

In the case of a tie, any one of a number of possible tie-breaking strategies may be used, such as selecting the audio option that is closest in time to when the character s was written on the board.
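A sketch of the rank-sum selection is shown below; each synchronization method is represented by a key function under the assumed convention that a larger key means a better option under that method.

```python
def sync_by_rank_sum(audio_options, rankers):
    """Sum each option's rank across all methods; lowest sum wins.

    `rankers` holds one key function per synchronization method, e.g.
    lambda a: a["score"] for the score-based method, or
    lambda a: -abs(t_written - a["time"]) for time difference.
    """
    rank_sum = {id(a): 0 for a in audio_options}
    for key in rankers:
        ordered = sorted(audio_options, key=key, reverse=True)
        for rank, a in enumerate(ordered, start=1):
            rank_sum[id(a)] += rank
    # Ties could be broken by time difference, as described above.
    return min(audio_options, key=lambda a: rank_sum[id(a)])
```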

7. Audio-video combination

The output of the audio-video synchronizer is a set of ambiguous characters that have one or more video options, v_j(s), for each character s, with each video option having only one audio option, a_{j,k_0}(s). For example, shown in Fig. 7 are two video options for the character s. Three audio options are found for v_1(s) using the audio search term "four", and only one audio option is found for v_2(s) using the audio search term "nine". The output of the A/V synchronizer for the first video option is denoted by a_{1,k_0}(s). Since there is only one audio option for v_2(s), no synchronization (pairing) is required. So, with two audio/video pairs,

[v_1(s), a_{1,k_0}(s)], [v_2(s), a_{2,1}(s)]

the last step is to determine which pair is the correct one, thereby leading to the final recognition result, either "4" or "9". This decision is made based on the recognition scores from each recognizer, the audio/video metrics found during synchronization, and perhaps on some other sets of parameters. Similar combination approaches have been used to improve the recognition accuracy of handwriting recognizers [59], speech recognizers [12] and a combination of the two [21]. We considered several rank-level and measurement-level combination techniques [53], such as the rank sum and the weighted sum rule using classifier-specific weights and character-specific weights, as described below.

7.1. Rank based techniques

The challenge often encountered when combining the outputs of two or more classifiers is that the recognition scores are generally not normalized with respect to each other. In other words, a score of 8.0 out of 10 for one classifier may not mean the same as a score of 8.0 out of 10 for another. In such situations, a commonly used approach is the rank sum (Borda count). With this approach, the various options (or features) for each classifier are assigned a rank [53]. The ranks are then summed for each option, and the recognition result having the lowest rank sum is selected as the output. In the case of a tie, a tie-breaking strategy is used. One of the advantages of a rank based technique is that there is no need to assign weights to the audio and video recognizers or to normalize the recognition scores.

Fig. 7. A character s with two video options, with the first one having three audio options. A/V synchronization selects the best audio option for v_1(s), and the final step is to decide which audio-video pair is the correct one, leading to the final recognition result.

Fig. 8. The final recognition step in selecting the best audio-video pair for a character. Shown in this example are the results of recognizing the character "2" based on the sum of the audio and video recognition scores and on the rank sum score.

An example of the rank sum method is shown in Fig. 8, where the number two is to be recognized. The recognition scores for the video and audio recognizers, along with their ranks, are shown in the table. Note that if the sum of the recognition scores were used to select the final character, then the number seven would have been the final result, whereas the rank sum results in a correct recognition. The relatively high audio recognition score for the audio option corresponding to the character "a" is due to the presence of the phoneme for the character "a" in the number eight that occurs just before the number two. There is also a high audio recognition score for the audio option "7" because the number seven actually occurs in the audio just after the number two.
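The rank-sum combination for a single character can be sketched as below. The scores in the usage comment are made up to mirror the situation in Fig. 8 (the sum of scores picks "7" while the rank sum picks "2"); they are not the figure's actual values.

```python
def combine_by_rank_sum(pairs):
    """Borda-count combination of the audio/video pairs for a character.

    `pairs` maps each candidate character to (video_score, audio_score);
    each recognizer contributes a rank and the lowest rank sum wins.
    """
    chars = list(pairs)
    video_order = sorted(chars, key=lambda c: pairs[c][0], reverse=True)
    audio_order = sorted(chars, key=lambda c: pairs[c][1], reverse=True)
    rank_sum = {c: video_order.index(c) + audio_order.index(c) + 2
                for c in chars}
    return min(chars, key=lambda c: rank_sum[c])

# combine_by_rank_sum({"2": (0.80, 0.60), "7": (0.50, 0.95),
#                      "a": (0.55, 0.30)}) returns "2" (rank sum 3),
# even though summing the raw scores would have picked "7" (1.45).
```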

7.2. Weighted sum of recognition scores

As suggested by the example in Fig. 8, another way to combine the audio and video recognition scores is simply to form their sum [53]. However, given that the video and audio recognizers may differ in their ability to recognize characters, this should be accounted for in the sum. The simplest approach would be to use the same recognizer-specific weights, w_v and w_a, for all characters,

z(s) = w_v v_j^p(s) + w_a a_j^p(s)

If, for example, the audio recognizer was determined to be much more accurate than the character recognizer in general, then more weight should be placed on the audio recognition score when selecting the final output. However, since the accuracy of the audio and video text recognizers may be different for each character, using different weights for each character has the potential to further improve the recognition accuracy. Consider, for example, the number seven. Since "seven" has two phonemes, the audio recognizer will have an easier time recognizing this character compared to single phoneme characters such as "b" and "d". The video recognizer, on the other hand, will have more difficulty recognizing the number "1" (compared to the letter "l") than it will have recognizing the number "3".

One way to assign a weight w_V(v_j^c(s)) to the jth video option for the character s is to use the video text recognizer's accuracy-related metrics, such as precision and sensitivity, for the character label v_j^c(s). The value of the audio weight w_A(v_j^c(s)) may either be computed in a similar fashion, with the weights normalized so that they sum to one, or the weight may be set to w_A(v_j^c(s)) = 1 − w_V(v_j^c(s)).

Table 2
Summary of recognition results

Rescoring   Ambiguity detection   Options       AVS   AVC   Rate
−           −                     One           −     −     53.7
✓           −                     One           −     −     62.2
✓           −                     All           ✓     ✓     61.4
✓           −                     T_O = 0.80    ✓     ✓     64.0
✓           T_R = 0.85            All           ✓     ✓     64.5
✓           T_A = 0.98            Four          ✓     ✓     67.6
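Both weighting schemes fit in one small function; the default recognizer-specific weights follow the baseline configuration reported in Section 8.2, while the per-character weight table is an assumed input derived, for example, from per-character precision.

```python
def combine_by_weighted_sum(pairs, w_video=0.8, w_audio=0.2,
                            char_weights=None):
    """Weighted-sum combination z(s) = w_v * v^p(s) + w_a * a^p(s).

    `pairs` maps each candidate character to (video_score, audio_score).
    If `char_weights` gives a video weight w_V per character, the audio
    weight is taken as 1 - w_V; otherwise the recognizer-specific
    weights are used for all characters.
    """
    def score(char):
        v_p, a_p = pairs[char]
        if char_weights is not None:
            w_v = char_weights[char]
            return w_v * v_p + (1.0 - w_v) * a_p
        return w_video * v_p + w_audio * a_p

    return max(pairs, key=score)
```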

8. Experiments

This section summarizes some of the key results from an extensive set of experiments that were done to test the performance of the various approaches to audio-video character recognition presented here [55]. An attempt was made to isolate the contribution of each step in improving the recognition accuracy of the overall system under a variety of conditions, but this is an extremely difficult task because of the interactions of all of the components and the sheer number of possible combinations of the methods introduced. For example, determining which approach is best for option selection will depend on what approaches are used in all of the tasks that follow. Although not presented here, a discussion of the difficulties encountered when there is heavy occlusion by the instructor, poor time-stamping of the video options, or large skews in time between the audio and video may be found in [55]. Here, we summarize the overall performance of the system using what appears to be the best system configuration for the audio and video recognizers that were used. What is significant about the results is not the specific numbers that were obtained for the recognition rates, because they could be improved with any improvement in either recognizer. What is important is the increase that is afforded by incorporating audio into the recognition process, and how the audio is used to achieve this increase.

8.1. Setup: Data set and implementation

The recording equipment used to capture the video consisted of a commercially available off-the-shelf video camera (a Sanyo VPC-HD1A 720p high-definition digital media camera) and a wired microphone. The videos were recorded in a classroom-like setting, with mathematical content being written on the whiteboard and spoken by the instructor. The camera was configured to capture video with a resolution of 1280 × 720 pixels at 30 frames per second. The main data set is organized into two sets, one for training and one for evaluation. The data set has 9,484 characters from two instructors, 4,414 of which are in the training set and 5,070 in the test set. Sample data sets are available online [9], along with instructions on how to obtain the complete data set.

8.2. Baseline system

To evaluate the effectiveness of the audio-video character recognizer, two baseline systems were used for comparison. The first is the character recognizer used alone, with no assistance from the audio. With this system, the recognition accuracy was 53.7% using the raw scores from GOCR, and 62.2% when these scores were replaced with conditional probabilities as described in Section 5.1. These results are given in the first two rows of Table 2.
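Although the details of the rescoring are given in Section 5.1, a minimal sketch of one way to build such a mapping is to bin the raw recognizer scores on the training set and record the fraction of correct recognitions in each bin; the function names and bin count here are illustrative assumptions, and the scores are assumed to be normalized to [0, 1].

    import numpy as np

    # Sketch: estimate P(correct | raw score) by histogram binning on the
    # training set, then use the table to rescore recognizer outputs.
    # raw_scores and is_correct are NumPy arrays (scores in [0, 1]).

    def fit_rescoring_table(raw_scores, is_correct, n_bins=20):
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        bins = np.clip(np.digitize(raw_scores, edges) - 1, 0, n_bins - 1)
        table = np.empty(n_bins)
        for b in range(n_bins):
            mask = bins == b
            # Fall back to the global accuracy for empty bins.
            table[b] = is_correct[mask].mean() if mask.any() else is_correct.mean()
        return edges, table

    def rescore(raw_score, edges, table):
        b = np.clip(np.digitize(raw_score, edges) - 1, 0, len(table) - 1)
        return table[b]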

The second baseline system is one in which all characters are classified as ambiguous (no ambiguity detection or option selection), with rank sum based synchronization and audio-video combination using recognizer-specific weights $w_V = 0.8$ and $w_A = 0.2$. The character recognition accuracy for this system, as shown in the third row of Table 2, was 61.4%. The lower recognition rate here demonstrates the importance of ambiguity detection and option selection.

8.3. Results

After extensive testing, it was found conclusively that instead of using the raw recognition scores from the character recognizer, better overall recognition rates are obtained when they are replaced with estimates of the conditional probability that the character is correctly classified given the raw recognition score, as discussed in Section 5.1. It was observed that this rescoring not only resulted in a reordering of the video options so that more characters ended up with the correct video option with the highest score, but it also resulted in an increase in the number of characters that had the correct video option within the top L scores for any value of L. Therefore, in all of the following results, this rescoring was performed. It is believed that this rescoring should be used for any character recognizer that is used.

In the absence of an ambiguity detector, determining which options are sent to the audio-video recognizer for additional processing may be done in two ways. The first is to select a fixed number of options for each character, and the other is to send only those options that exceed a threshold (or only one option if no options exceed the threshold). Of these two methods, the best was to use an option selection threshold of $T_O = 0.80$, which resulted in a recognition rate of 64.0%, as shown in the fourth row of Table 2.
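The threshold-based variant admits a simple sketch, under the assumption that "exceed a threshold" refers to the rescored conditional probabilities; the names here are illustrative.

    def select_options(options, t_o=0.80):
        """Return the options whose rescored probability exceeds t_o,
        or only the single best option if none do.

        options: list of (label, rescored_probability) pairs for one
        character; t_o = 0.80 matches the fourth row of Table 2.
        """
        selected = [opt for opt in options if opt[1] > t_o]
        return selected if selected else [max(options, key=lambda o: o[1])]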

When an ambiguity detector is used, the methods proposed to classify a character as ambiguous include simple and character-specific thresholds. Which type of threshold to use depends on how the ambiguous characters are processed. If there is no option selection criterion (all characters that are classified as ambiguous are sent to the audio recognizer), then a relative threshold value of $T_R = 0.95$ was the best approach, which resulted in a recognition rate of 64.5%, as shown in the fifth row of Table 2. This is approximately the same as that obtained when it is assumed that all characters are ambiguous and the video options that exceed an option selection threshold of $T_O = 0.80$ are sent to the audio-video synchronizer (fourth row of Table 2).

When both an ambiguity detector and an option selection criterion are used, the best recognition rates were achieved using an absolute threshold value of $T_A = 0.98$ for ambiguity detection and sending four (a fixed number of) options to the audio recognizer for each ambiguous character. As shown in Table 2, in this case the recognition rate was 67.6%. What is interesting to note is that for the data set that was used, 54% of the characters were classified as non-ambiguous, and for these characters the recognition rate was 85.1%. Of the 46% that were classified as ambiguous, 23% of those that would have been incorrectly recognized based on the VTR were corrected by the audio. On the other hand, 1.6% of the characters that would have been recognized correctly were changed to an incorrect character by the audio recognizer.
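The precise threshold definitions appear earlier in the paper; as a rough sketch, one plausible reading is that an absolute threshold flags a character as ambiguous when its top rescored probability falls below $T_A$, while a relative threshold flags it when the runner-up score is within a factor $T_R$ of the best.

    def is_ambiguous(probs, mode="absolute", t_a=0.98, t_r=0.95):
        """Decide whether a character should be sent for audio-video
        processing. probs holds the rescored probabilities of the
        character's video options. (A plausible sketch, not the paper's
        exact definitions.)
        """
        scores = sorted(probs, reverse=True)
        if mode == "absolute":
            return scores[0] < t_a           # top option not confident enough
        second = scores[1] if len(scores) > 1 else 0.0
        return second > t_r * scores[0]      # runner-up too close to the best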

For each video option that is selected for audio-video combination and has two or more audio options, synchronization (selection of one of the audio options) must be performed. Audio-video synchronization is a difficult task, and how well it can be done is determined by many factors, including the accuracy of the video time-stamping (estimation of the time that the character is written on the board), the audio-video alignment (how close in time a character is written to when it is verbalized), and the accuracy of the video and audio recognizers. A summary of the experimental results is as follows. In the case of good video time-stamping (the time difference between the true and estimated times is less than two seconds) and good audio-video alignment (a character is verbalized within four seconds of its being written), each of the synchronization methods performed well, with the feature rank sum based approach performing slightly better. However, in the case of either poor video time-stamping or poor audio-video alignment, the synchronization results become worse: the time difference based techniques perform considerably worse than the others, while the A/V neighbor based and rank sum based techniques perform equally well.
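As a concrete sketch of the rank sum idea, one plausible formulation ranks the candidate audio options once by time proximity to the estimated writing time and once by word-spotter score, and selects the option with the smallest rank sum; the function and variable names are hypothetical, not the paper's exact formulation.

    # Sketch of rank-sum based audio-video synchronization: among several
    # audio detections of the same spoken character, choose the one whose
    # combined rank (time proximity plus spotter score) is best.

    def rank_sum_sync(write_time, audio_options):
        """audio_options: list of (spoken_time, spotter_score) tuples."""
        by_time = sorted(audio_options, key=lambda o: abs(o[0] - write_time))
        by_score = sorted(audio_options, key=lambda o: o[1], reverse=True)
        return min(audio_options,
                   key=lambda o: by_time.index(o) + by_score.index(o))

    # Example: a character written at t = 12.0 s with three audio hits.
    print(rank_sum_sync(12.0, [(9.5, 0.60), (12.8, 0.88), (30.2, 0.95)]))
    # -> (12.8, 0.88): close in time and reasonably well scored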

The final step is audio-video combination. In each of these experiments, ambiguity detection and option selection with a relative threshold of 0.9 were used. With this threshold, 35.7% of the characters are classified as non-ambiguous and 64.3% as ambiguous. The VTR accuracy for the non-ambiguous characters is 79.8%, and 39.2% for the ambiguous characters. Since A/V combination processes only the ambiguous characters, the recognition accuracy for these characters should increase. With rank sum based A/V synchronization, the A/V combination results are as follows. With 64.3% of the characters in the test data set classified as ambiguous, without audio the recognition rate for these characters is 39.2%. If the rank sum method is used for audio-video combination, this rate increases to 50.2%, and if recognizer weights of 0.8 and 0.2 are used for the video and audio recognizers, respectively, then this rate increases to 56.9%.
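What rank sum based combination might look like, under the assumption that each recognizer contributes the rank of each candidate label in its own ordered output and the label with the smallest total rank is selected; this is an illustrative reading, not the paper's exact rule.

    def rank_sum_combine(video_ranking, audio_ranking):
        """Fuse two ordered candidate lists (best label first) by rank sum.

        A label missing from one list is assigned a rank one past the
        end of that list.
        """
        def total_rank(label):
            rv = (video_ranking.index(label) if label in video_ranking
                  else len(video_ranking))
            ra = (audio_ranking.index(label) if label in audio_ranking
                  else len(audio_ranking))
            return rv + ra
        labels = set(video_ranking) | set(audio_ranking)
        return min(labels, key=total_rank)

    # Example: the video recognizer prefers "1", but the audio evidence
    # for "seven" flips the final decision.
    print(rank_sum_combine(["1", "7", "l"], ["7", "l"]))  # -> 7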

9. Conclusions and future work

This paper presented a number of ways to combine the output of a character recognizer with an audio recognizer to improve the overall recognition accuracy of mathematical equations. The components of the system included a video text recognizer, ambiguity detection, an audio recognizer, audio-video synchronization, and audio-video combination (fusion) of the outputs of the audio and video recognizers. Experiments conducted over a large data set, consisting of videos recorded in a classroom-like environment, demonstrate that significant improvements in character recognition accuracy can be achieved by combining audio with video. While the system made use of specific text and speech recognizers (a word spotter), the proposed techniques may be used with other recognizers as well. We are currently investigating the use of a mathematical grammar and the spoken content to improve the structure recognition accuracy associated with handwritten mathematical content. Other avenues for future work include exploring the use of a mathematical grammar in the audio-video based character recognition stage and using time-stamping information along with character recognition results to spot sequences of words instead of single words.

Acknowledgments

This work was supported, in part, by research funds from Chung-Ang University, Seoul, Korea.

References

[1] W. Aly, S. Uchida and M. Suzuki, A large-scale analysis of mathematical expressions for an accurate understanding of their structure, Int. Workshop on Document Analysis Systems, 2008, 549–556.

[2] R. Anderson, C. Hoyer, C. Prince, J. Su, F. Videon and S. Wolfman, Speech, ink, and slides: The interaction of content channels, Int. Conf. on Multimedia, 2004, 796–803.

[3] R.H. Anderson, Two-dimensional mathematical notations, in: Syntactic Pattern Recognition Applications, 1977, 147–177.

[4] L. Anthony, J. Yang and K.R. Koedinger, Evaluation of multimodal input for entering mathematical equations on the computer, CHI '05 Extended Abstracts on Human Factors in Computing Systems, 2005, 1184–1187.

[5] A.-M. Awal, H. Mouchère and C. Viard-Gaudin, Towards handwritten mathematical expression recognition, Int. Conf. on Document Analysis and Recognition, 2009, 1046–1050.

[6] K.-F. Chan and D.-Y. Yeung, Mathematical expression recognition: A survey, International Journal on Document Analysis and Recognition 3 (2000), 3–15.

[7] F. Chang, C.-J. Chen and C.-J. Lu, A linear-time component-labeling algorithm using contour tracing technique, Comput. Vis. Image Understanding 93(2) (2004), 206–220.

[8] P.A. Chou, Recognition of equations using a two-dimensional stochastic context-free grammar, SPIE Visual Comm. and Image Processing IV, 1989, 852–863.

[9] Classroom Video Data Set, http://users.ece.gatech.edu/~smita/dataset/, 2012 (accessed August 20, 2012).

[10] Nuance – Dragon NaturallySpeaking Speech Recognition Software, http://www.nuance.com/naturallyspeaking/, 2012 (accessed August 20, 2012).

[11] R.J. Fateman, T. Tokuyasu, B.P. Berman and N. Mitchell, Optical character recognition and parsing of typeset mathematics, Journal of Visual Communication and Image Representation 7 (1996), 2–15.

[12] J. Fiscus, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), IEEE Workshop on Automatic Speech Recognition and Understanding, 1997, 347–354.

[13] G. Friedland, W. Hurst and L. Knipping, Educational multimedia, IEEE Multimedia (2008), 54–56.

[14] P.D. Gader, M.A. Mohamed and J.M. Keller, Fusion of handwritten word classifiers, Pattern Recogn. Letters (1996), 577–584.

[15] GOCR, http://jocr.sourceforge.net/, 2012 (accessed August 20, 2012).

[16] L.-W. He, Z. Liu and Z. Zhang, Why take notes? Use the whiteboard capture system, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2003, 776–779.

[17] L.-W. He and Z. Zhang, Real-time whiteboard capture and processing using a video camera for remote collaboration, IEEE Transactions on Multimedia, 2007, 198–206.

[18] T.K. Ho, J.J. Hull and S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. (1994), 66–75.

[19] HTK, http://htk.eng.cam.ac.uk/, 2012 (accessed August 20, 2012).

[20] J. Hunsinger and M. Lang, A single-stage top-down probabilistic approach towards understanding spoken and handwritten mathematical formulas, INTERSPEECH, 2000, 386–389.

[21] J. Hunsinger and M. Lang, A speech understanding module for a multimodal mathematical formula editor, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2000, 2413–2416.

[22] A. Jain, K. Nandakumar and A. Ross, Score normalization in multimodal biometric systems, Pattern Recognition, 2005.

[23] X. Jiang, K. Yu and H. Bunke, Classifier combination for grammar-guided sentence recognition, First International Workshop on Multiple Classifier Systems, 2000, 383–392.

[24] S.X. Ju, M.J. Black, S. Minneman and D. Kimber, Summarization of video-taped presentations: Automatic analysis of motion and gesture, IEEE Trans. on Circuits and Systems for Video Technology, 1998, 686–696.

[25] A. Kosmala and G. Rigoll, On-line handwritten formula recognition using statistical methods, Int. Conf. on Pattern Recognition, 1998, 1306–1308.

[26] A. Kosmala, G. Rigoll, S. Lavirotte and L. Pottier, On-line handwritten formula recognition using hidden Markov models and context dependent graph grammars, Int. Conf. on Document Analysis and Recognition, 1999, 107–110.

[27] K. Kurihara, M. Goto, J. Ogata and T. Igarashi, Speech pen: Predictive handwriting based on ambient multimodal recognition, SIGCHI Conference on Human Factors in Computing Systems, 2006, 851–860.

[28] H.-J. Lee and M.-C. Lee, Understanding mathematical expressions using procedure-oriented transformation, Pattern Recognition (1994), 447–457.

[29] H. Li, D. Doermann and O. Kia, Automatic text detection and tracking in digital video, IEEE Transactions on Image Processing, 2000, 147–156.

[30] S. Lucey, V. Chandran and S. Sridharan, A theoretical framework for independent classifier combination, Int. Conf. on Pattern Recognition, 2002.

[31] S. Lucey, T. Chen, S. Sridharan and V. Chandran, Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition, IEEE Trans. Multimedia (2005), 495–506.

[32] C. Malon, S. Uchida and M. Suzuki, Mathematical symbol recognition with support vector machines, Pattern Recognition Letters, 2008, 1326–1332.

[33] S. Medjkoune, H. Mouchère, S. Petitrenaud and C. Viard-Gaudin, Handwritten and audio information fusion for mathematical symbol recognition, Int. Conf. on Document Analysis and Recognition, 2011, 379–383.

[34] SpeechServer – Microsoft Corp., http://www.microsoft.com/SPEECH/, 2012 (accessed August 20, 2012).

[35] Nexidia, http://www.nexidia.com/, 2012 (accessed August 20, 2012).

[36] OpenCV, http://opencv.willowgarage.com/wiki/, 2012 (accessed August 20, 2012).

[37] J.A. Pittman, Handwriting recognition: Tablet PC text input, Computer, 2007, 49–54.

[38] R. Plamondon and S.N. Srihari, On-line and off-line handwriting recognition: A comprehensive survey, IEEE Trans. Pattern Anal. and Machine Intelligence, 2000, 63–84.

[39] G. Potamianos, C. Neti, J. Luettin and I. Matthews, Audio-visual automatic speech recognition: An overview, in: Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson and P. Perrier (eds), MIT Press, 2004.

[40] D. Prusa and V. Hlavac, Mathematical formulae recognition using 2D grammars, Int. Conf. on Document Analysis and Recognition, 2007, 849–853.

[41] S. Quiniou et al., HAMEX – A handwritten and audio dataset of mathematical expressions, Int. Conf. on Document Analysis and Recognition, 2011, 452–456.

[42] A.F.R. Rahman, H. Alam and M.C. Fairhurst, Multiple classifier combination for character recognition: Revisiting the majority voting system and its variations, Int. Workshop on Document Analysis Systems, 2002, 167–178.

[43] S. Rossetto, F. Varejão and T.W. Rauber, An expert system application for improving results in a handwritten form recognition system, Int. Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, 2002, 383–392.

[44] A. Sánchez, C.A.B. Mello, P.D. Suárez and A. Lopes, Automatic line and word segmentation applied to densely line-skewed historical handwritten document images, Integr. Comput.-Aided Eng. (2011), 125–142.

[45] E. Saund, Bringing the marks on a whiteboard to electronic life, Int. Workshop on Cooperative Buildings – Integrating Information, Organizations, and Architecture, 1999, 69–78.

[46] Y. Shi, H. Li and F.K. Soong, A unified framework for symbol segmentation and recognition of handwritten mathematical expressions, Int. Conf. on Document Analysis and Recognition, 2007, 854–858.

[47] M. Shridhar, G.F. Houle and F. Kimura, Recognition strategies for general handwritten text documents, Integr. Comput.-Aided Eng., 2009, 299–314.

[48] K. Sirlantzis, S. Hoque and M.C. Fairhurst, Trainable multiple classifier schemes for handwritten character recognition, Int. Workshop on Multiple Classifier Systems, 2002, 169–178.

[49] CMU Sphinx, http://cmusphinx.sourceforge.net/, 2012 (accessed August 20, 2012).

[50] Q. Stafford-Fraser and P. Robinson, BrightBoard: A video-augmented environment, SIGCHI Conference on Human Factors in Computing Systems: Common Ground, 1996, 134–141.

[51] L. Tang, Semantic content analysis and user interfaces for instructional video indexing, PhD thesis, Columbia University, 2006.

[52] K. Toyozumi, N. Yamada, T. Kitasaka, K. Mori, Y. Suenaga, K. Mase and T. Takahashi, A study of symbol segmentation method for handwritten mathematical formula recognition using mathematical structure information, Int. Conf. on Pattern Recognition, 2004, 630–633.

[53] S. Tulyakov, S. Jaeger, V. Govindaraju and D. Doermann, Review of classifier combination methods, in: Studies in Computational Intelligence: Machine Learning in Document Analysis and Recognition, 2008, 361–386.

[54] M. Van Erp, L. Vuurpijl and L. Schomaker, An overview and comparison of voting methods for pattern recognition, Int. Workshop on Frontiers in Handwriting Recognition, 2002, 195.

[55] S. Vemulapalli, Audio-Video Based Handwritten Mathematical Content Recognition, PhD thesis, Georgia Institute of Technology, 2012.

[56] S. Vemulapalli and M. Hayes, Grammar-assisted audio-video equation recognition, Proc. 18th International Conference on Digital Signal Processing, 2013.

[57] S. Vemulapalli and M. Hayes, Synchronization and combination techniques for audio-video based handwritten mathematical content recognition in classroom videos, Proc. 11th Inter. Conf. on Intelligent Systems Design and Applications, 2011.

[58] Embedded ViaVoice, en.wikipedia.org/wiki/IBM_ViaVoice, 2012 (accessed August 20, 2012).

[59] W. Wang, A. Brakensiek and G. Rigoll, Combination of multiple classifiers for handwritten word recognition, Int. Workshop on Frontiers in Handwriting Recognition, 2002, 117.

[60] M. Wienecke, G.A. Fink and G. Sagerer, Toward automatic video-based whiteboard reading, Int. Journal on Document Analysis and Recognition, 2005, 188–200.

[61] H. Yang, C. Oehlke and C. Meinel, An automated analysis and indexing framework for lecture video portal, in: Advances in Web-Based Learning – ICWL 2012, 285–294.

[62] K. Yu, X. Jiang and H. Bunke, Combining acoustic and visual classifiers for the recognition of spoken sentences, Int. Conf. on Pattern Recognition, 2000, 491–494.

[63] R. Zanibbi, D. Blostein and J.R. Cordy, Recognizing mathematical expressions using tree transformation, IEEE Trans. Pattern Anal. Mach. Intell. (2002), 1455–1467.
