
Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging

Kyoko Sugisaki, Nicolas Wiedmer, Heiko Hausendorf
German Department, University of Zurich

Schönberggasse 9, 8001 Zurich, Switzerland
{sugisaki, nicolas.wiedmer, heiko.hausendorf}@ds.uzh.ch

Abstract
In this paper, we present a corpus of over 11,000 holiday picture postcards written in German and Swiss German. The postcards have been collected for the purpose of text-linguistic investigations on the genre and its standardisation and variation over time. We discuss the processes and challenges of digitalisation, manual transcription, and manual annotation. In addition, we developed our own automatic text segmentation system and part-of-speech tagger, since our texts often contain orthographic deviations, domain-specific structures such as fragments, subject-less sentences, interjections and discourse particles, and domain-specific formulaic communicative routines in salutations and greetings. In particular, we demonstrate that a CRF-based POS tagger can be adapted to domain-specific text by adding a small amount of in-domain data. We show that entropy-based training data sampling was competitive with random sampling in performing this task. The evaluation showed that our POS tagger achieved an F1 score of 0.93 (precision 0.94, recall 0.93), which outperformed a state-of-the-art POS tagger.

Keywords: postcard corpus, POS tagging, German

1. Introduction
In this paper, we report the construction of the language resource Ansichtskartenkorpus ([anko]), 'picture postcard corpus', containing over 11,000 holiday postcards written in Standard German and Swiss German. They were manually transcribed and annotated with structural and discourse-related information, and then automatically annotated with text segmentation, lemma and part-of-speech (POS) information.
We will first characterise the texts contained in the resource (Section 2), and then describe their manual transcription and annotation before outlining the development of an NLP toolkit for text segmentation and POS annotation (Section 3).

2. Data Source
The holiday postcards have been collected at the University of Zurich from 2009 to the present day for the purpose of text-linguistic investigations on the genre and its standardisation and variation over time. The postcards included in our corpus were sent by post by people on holiday, mainly from Switzerland but also from Italy, Germany and other European countries, to their family, friends, colleagues and neighbours living in the German-speaking area of Switzerland. About 95% of the cards (11,760 cards) were written mainly in Standard German. The remainder of the corpus comprises postcards written mainly in Swiss German. Although the postcards are dated from 1898 to 2016, the majority were written in the 1980s (22%) and 1990s (19%). On average, a postcard contains 50 words, while individual postcards vary from one to 350 words.

3. Corpus Construction
In this section, we describe the process of digitalisation, transcription, and annotation carried out manually and automatically to build the corpus of the collected postcards.

3.1. Overall Pipeline: From Digitalisation to XML with Linguistic Annotation

Because the collected holiday postcards were in paper format, we first scanned the front and back of each card. We then considered using an optical character recognition (OCR) system to extract the texts from the scanned images. However, the postcards were handwritten in German, and OCR systems do not work well for handwritten texts in languages other than English. Therefore, we decided on manual transcription, for which we developed a web-based tool. The user interface is illustrated in Figure 1. Each scanned card was integrated into the tool, which displayed the front and back images of each card on the left side and the transcription and annotation forms on the right side. Thus, the transcribers could directly transcribe handwriting, mark paragraphs, note textual discourse structures (e.g. greetings) and enter meta-information (e.g. dates). The data were then saved in a MySQL database, which we converted to an XML representation. We then incorporated our automatic annotations into the XML: 1) text segmentation, 2) lemma and 3) POS tags.
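The paper does not show the resulting XML schema; the following minimal Python sketch only illustrates the kind of structure such a conversion could produce. The element and attribute names (card, text, salutation, token, lemma, pos) are our own assumptions, not the project's actual schema.

    import xml.etree.ElementTree as ET

    # Hypothetical card record: element and attribute names are illustrative only.
    card = ET.Element("card", id="00001", date="1985-07-12")
    text = ET.SubElement(card, "text", lang="de")

    salutation = ET.SubElement(text, "salutation")
    tok = ET.SubElement(salutation, "token", lemma="lieb", pos="ADJA")
    tok.text = "Liebe"
    tok = ET.SubElement(salutation, "token", lemma="Heidi", pos="NE")
    tok.text = "Heidi"

    print(ET.tostring(card, encoding="unicode"))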

3.2. Transcription and Manual Annotation
The picture postcards written in Standard German were transcribed and annotated by four transcribers in a typing office in Germany. The Swiss German postcards were transcribed and annotated by a student whose native language is Swiss German. To ensure the quality of the transcription and the manual annotation, three students checked samples during the process, corrected them manually and gave feedback to the typing office.
Our corpus consists of the main texts as primary data and textual properties as metadata. A picture postcard consists of two sides: the front side and the back side. The front side of a modern postcard typically includes images of tourist attractions and landscapes, including the name of the location, whereas the back side consists of an address field on the right and a message field on the left.



Figure 1: Web-based manual transcription/annotation tool

During the transcription process, the message field was transcribed and regarded as primary data. The address field (e.g. name, postal code, location and country of the receiver) was considered metadata, including latent information such as the genders of both the receiver and the sender, as well as the presence of sketches drawn by the latter.
In addition, our transcribers annotated textual discourse-related information during the transcription process. The message field of a holiday postcard is generally structured as follows: 1) a preface (date, sometimes location, temperature or weather); 2) a salutation (e.g. Dear Heidi); 3) the main message; 4) a greeting including a closing (e.g. Cheers); 5) the signature of the sender. During the transcription, the preface, salutation, greeting and signature were marked directly in the text. The beginning and end of each of these discourse zones were marked with unique markdowns, which consisted of character sequences that hardly ever appear in the main text. The opening of a salutation was marked as star star bar (**|), and its closing was marked as |**. For example, the salutation Dear Heidi in the main text was annotated as **|Dear Heidi|**. Hence, minimal annotation was required, and the mapping to XML opening and closing tags was straightforward.
We considered that sensitive data in the corpus should be explicitly coded. The picture postcards often contained private information, such as the name and address of the receiver, or the telephone number or even the bank account number of the sender. Therefore, the transcribers did not include such sensitive information but coded it as [Vertraulich] (i.e. 'confidential') in the message field. In particular, family names are coded as [NN] (i.e. the short form of Nachname or 'family name'). The sensitive data in the address field were marked as such to ensure that they will not be released in the corpus.1
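As an illustration of how such inline markdowns can be mapped to XML, the following Python sketch converts the **|...|** salutation markers described above into opening and closing tags. The function name and the tag name "salutation" are assumptions for illustration, not part of the published tooling.

    import re

    def markdown_to_xml(text):
        # Map the salutation markdown described above to XML tags.
        # The tag name "salutation" is an assumption for illustration.
        text = re.sub(r"\*\*\|", "<salutation>", text)
        text = re.sub(r"\|\*\*", "</salutation>", text)
        return text

    print(markdown_to_xml("**|Dear Heidi|** ..."))
    # -> <salutation>Dear Heidi</salutation> ...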

3.3. Automatic Text Segmentation
The primary texts were segmented into paragraphs, sentences and words. The segmented texts were then structured in an XML representation.
In German, punctuation generally segments a text into sentences, and spaces are used to segment a sentence into words. However, this rule of thumb was not always applicable to the sentence segmentation of the postcards.

1 A sample of our corpus will be available at http://ansichtskartenprojekt.de

(A) Word/lemma features
A1   Word form: real word forms
A2   Normalized word form: all lower case and without u
A3   Character type of unit: the word form is categorised into the following classes: (1) all special characters, (2) all numbers, (3) capitalized, (4) all alphabetic characters without capitalization, (5) a mix of all possible characters without capitalization
A4-7 Suffix: the last 4, 3, 2 and 1 characters of the word, respectively
A8   Lemma: generated by TreeTagger

(B) POS features
B1   POS: generated by TreeTagger
B2   POS: generated by the Stanford POS tagger

(C) Semantic cluster features
C1-2 Brown clustering: the first 4 digits (C1) and all digits (C2) of the cluster are used as features
C3   Word2vec
C4   Fasttext

Table 1: Features for CRF-based POS tagging

Two cases were particularly problematic: 1) punctuation was part of a token together with the preceding characters; and 2) punctuation was absent. Case 1 covers abbreviations (e.g. z.B. instead of zum Beispiel, 'for example') and brand or proper names (e.g. Sat.1), which are also common in Standard German orthography. Case 2 covers freestanding lines, which typically end with a wide blank space or extra line spacing and often omit punctuation, such as titles, subtitles, addresses, dates, greetings, salutations and signatures (Official German Orthography, 2006). Dates, greetings, salutations and signatures belong to the core text zones of postcards. In addition, freestanding lines were often extended to the end of the paragraph in the texts of the postcards. Furthermore, the following uses of punctuation are also common in postcards and differ from Standard German orthography: (a) repeated punctuation (e.g. !!!, ???, ......) in order to emphasise words, phrases and sentences; (b) emotional pictograms that are typically composed of punctuation (e.g. :), ;-)). Based on these peculiarities, we developed a statistical sequential sentence segmentation system that distinguishes the punctuation of Case (1) from sentence boundaries and deliberately handles Case (2) (Sugisaki, 2017).
With regard to tokenisation, the texts of the postcards showed a frequent use of contractions, which is also common in internet-based and computer-mediated communication (Bartz et al., 2013). In these contractions, the verb was often combined with the pronoun es, 'it', and delimited by an apostrophe (e.g. gibt's instead of gibt es, 'gives it'). The apostrophe was sometimes omitted (e.g. gibts). Not only verbs were concatenated with the pronoun: 'wh question' words (wenn's/wo's instead of wenn es/wo es, 'when/where it') and prepositions (auf's instead of aufs or auf das, 'on the') were as well. Based on this observation, we developed a simple rule-based tokeniser in which 's was separated from the remaining part of the token if that part was not a noun. If it was a noun, the 's was considered a genitive marker and part of the token. We used TreeTagger (Schmid, 1995) to obtain the POS information. However, in the case of contractions without apostrophes, TreeTagger does not provide an accurate POS tag, presumably because such contractions do not belong to standard orthography. We observed that frequently used verbs, such as give, be and have, often occurred with the reduced pronoun s without an apostrophe. Therefore, we created a list of these verbs and some wh question words in order to separate the s from them.
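The following Python sketch illustrates the kind of rule-based splitting described above. The POS lookup and the word list are simplified assumptions, not the authors' actual tokeniser.

    # A minimal sketch of the rule-based handling of German "'s" contractions.
    APOSTROPHE_S = "'s"
    # Frequent verbs and wh-words that occur with a reduced "s" and no apostrophe
    # (illustrative list, not the authors' actual one).
    REDUCED_S_HOSTS = {"gibt", "ist", "hat", "wenn", "wo"}

    def split_contraction(token, pos_tag):
        """Split "'s"/"s" from a token unless the host is a noun (genitive)."""
        if token.endswith(APOSTROPHE_S):
            if pos_tag.startswith("N"):          # noun: genitive marker, keep token
                return [token]
            return [token[:-2], "'s"]            # e.g. gibt's -> gibt + 's
        if token.endswith("s") and token[:-1] in REDUCED_S_HOSTS:
            return [token[:-1], "s"]             # e.g. gibts -> gibt + s
        return [token]

    print(split_contraction("gibt's", "VVFIN"))   # ['gibt', "'s"]
    print(split_contraction("Peters", "NE"))      # ['Peters']
    print(split_contraction("gibts", "VVFIN"))    # ['gibt', 's']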



3.4. Part-of-speech Tagging
The segmented tokens were further annotated with POS tags, which were integrated into the XML representation. We developed a POS tagger for the postcards. The texts comprise a mixture of Standard German and Swiss German. In addition, the targeted texts are written in form, but conceptually they are close to oral language (Koch and Oesterreicher, 2008; Dürscheid, 2016). An off-the-shelf POS tagger is typically trained on a corpus of newspapers written in Standard German. A newspaper article is prototypical written language in both form and concept, and it contains fewer orthographic deviations. Therefore, we experimented with features and training data to determine the best method for optimising the accuracy of the tagger applied to the postcard texts in this study.

3.4.1. Experimental Setting
In the experiments, we used the tagging method of conditional random fields (CRF) (Lafferty et al., 2001). CRF is a supervised machine learning method for sequence labelling. For the experiments, we created the following three data sets (a minimal CRF training sketch follows the list):

1. TüBa-D/Z v. 10, Tübinger Baumbank des Deutschen/Zeitungskorpus (Telljohann et al., 2012), a German newspaper corpus (1,787,801 tokens, henceforth TüBa). In our first experiments, approximately 80% of TüBa (803,040 tokens, henceforth TüBa80) was used as training data, and 20% of the TüBa tokens were used as test data (252,784 tokens, henceforth TüBa20). In the second experiment, we used all of TüBa (TüBa100) as training data. In addition, we used a cross-validation data set (2,239 tokens) in all experiments.

2. NOAH's Corpus of Swiss German Dialects (henceforth, NOAH) (Hollenstein and Aepli, 2014), a Swiss German corpus (94,306 tokens) that contains a variety of texts (blogs, reports, Wikipedia, etc.). In our experiments, we used this corpus as training data.

3. From the Ansichtskartenkorpus, or 'picture postcard corpus' (henceforth, ANKO), we first manually annotated 200 postcards to derive the test data. The test data were sampled randomly from the corpus and divided into two sets: 100 cards for the experiments (5,048 tokens, henceforth ANKO-TEST) and 100 cards for the evaluation (5,341 tokens, henceforth ANKO-EVAL). In addition, we manually annotated 1,500 sentences for the experiments. These sentences were used as training data, and they were sampled as follows: 1) 300 sentences were selected randomly (henceforth, ANKO-R); 2) 1,200 sentences were selected based on word 4-gram-based entropy scores according to four measurements. We describe the entropy sampling method in Section 3.4.3. In our experiments, we used the Standard German sub-corpus of ANKO.
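The paper does not name a specific CRF implementation; as a rough illustration of the experimental setup, the following Python sketch trains a sequence tagger with the sklearn-crfsuite package on toy token/label sequences. The package choice and the deliberately minimal feature template are assumptions; Table 1 lists the features actually used.

    # Minimal illustration of CRF-based POS tagging with sklearn-crfsuite.
    import sklearn_crfsuite

    def token_features(sent, i):
        word = sent[i][0]
        return {"word": word, "word.lower": word.lower(), "suffix3": word[-3:]}

    def sent_to_features(sent):
        return [token_features(sent, i) for i in range(len(sent))]

    # Toy training data (word, tag) pairs; tags are illustrative STTS labels.
    train_sents = [[("Liebe", "ADJA"), ("Heidi", "NE")],
                   [("Viele", "PIAT"), ("Gruesse", "NN")]]

    X_train = [sent_to_features(s) for s in train_sents]
    y_train = [[tag for _, tag in s] for s in train_sents]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train)[0])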

The set of linguistic features used in our experiments is provided in Table 1. The features were divided into (A) word and lemma features; (B) POS features generated by the POS taggers TreeTagger (Schmid, 1995) and the Stanford POS Tagger (Toutanova et al., 2003); and (C) semantic cluster features generated by unsupervised machine learning methods, that is, Brown clustering2 (Brown et al., 1992), word2vec3 (Mikolov et al., 2013) and fasttext4 (Bojanowski et al., 2016).
In the following subsections, we describe the experiments using this set of linguistic features and the data sets.

3.4.2. Features
We trained CRF models on the training set of TüBa and tested them on the test sets of TüBa and ANKO. We trained the different types of features (A to C in Table 1) separately and all features together in context window 0 (i.e., the current token only). The results are shown in Table 2. As expected, tagging accuracy (F1 score) was lower when the training data and test data were derived from different domains. Regardless of the test data, the best features were the word and lemma features (A). The morphosyntactic analysis using the existing POS taggers (B) showed a lower performance, and the semantic cluster features (C) did not achieve high accuracy. However, the combination of these three types of features outperformed the word/lemma features alone. We then extended the feature sets (A), (B) and (C) from context window 0 (the current token) to 5 left and right context windows. The results are also shown in Table 2. The main finding was that the window size did not affect the accuracy as much as expected. However, a wider context window slightly improved the accuracy on the TüBa test set. Therefore, we conducted further experiments using the combination of the feature sets (A), (B) and (C) in context windows 0 to 5.
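A sketch of how context-window features can be generated for the CRF, extending the toy template shown earlier. The window size, feature names and padding token are assumptions for illustration, not the authors' implementation (the paper uses windows up to 5 tokens on each side).

    # Sketch of context-window feature extraction (window 0 to +/-2 here).
    def window_features(tokens, i, window=2):
        feats = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                word = tokens[j]
                feats[f"w[{offset}]"] = word.lower()
                feats[f"suffix3[{offset}]"] = word[-3:]
            else:
                feats[f"w[{offset}]"] = "<PAD>"   # outside the sentence
        return feats

    tokens = ["Liebe", "Heidi", ",", "viele", "Gruesse"]
    print(window_features(tokens, 1))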

3.4.3. Training Data
In this section, we investigate two challenges: 1) how to boost the tagging accuracy on texts with mixed languages, and 2) how to boost it on texts whose domain and morphosyntactic distribution differ from newspapers.
To handle the first challenge, we added the Swiss German training data, NOAH. The results are shown in Table 3. The addition of the Swiss German training data produced results that were similar to those of the model trained only on TüBa100, but it did not improve the tagger.
To address the second challenge, we added small amounts of five types of training data from ANKO to the TüBa100 and NOAH training data. The first in-domain training data were randomly selected from ANKO. The remaining data sets were selected using a cross entropy score. Cross entropy is a variant of perplexity that is used to compare different probability models. The score is measured as follows (Jurafsky and Martin, 2009, pp. 117):

2 For Brown clustering, we used the implementation of P. Liang. To create 100 clusters, we trained the model on TüBa100, NOAH and ANKO (normalized word forms). The first 4 digits and all digits are used as features.

3 For word2vec, we used gensim with the parameters skip-gram, 500 dimensions and context window 5. For K-means clustering, we used scikit-learn to create 30 clusters.

4 We used fasttext with the parameters CBOW, 200 dimensions, context window 5 and 5 word n-grams. For K-means clustering, we used scikit-learn to build 20 clusters.
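As a sketch of how word-cluster features of this kind can be derived (cf. footnotes 3 and 4): train embeddings with gensim, then cluster them with scikit-learn's K-means. Parameter names follow gensim 4.x; the toy corpus, dimensionality and cluster count are only for illustration and do not reproduce the settings above.

    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans

    # Tiny toy corpus; the paper trains on TueBa100, NOAH and ANKO.
    sentences = [["liebe", "heidi"], ["viele", "gruesse", "aus", "rom"]]
    w2v = Word2Vec(sentences, vector_size=50, window=5, sg=1, min_count=1)

    vocab = list(w2v.wv.index_to_key)
    vectors = [w2v.wv[w] for w in vocab]

    kmeans = KMeans(n_clusters=2, n_init=10).fit(vectors)
    word2cluster = dict(zip(vocab, kmeans.labels_))
    print(word2cluster)   # cluster IDs usable as CRF features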



Feature set   Context window   TüBa-Test           ANKO-Test
A             0                .968 (.968, .968)   .883 (.886, .881)
B             0                .960 (.960, .961)   .848 (.850, .846)
C             0                .893 (.893, .894)   .795 (.796, .794)
A-C           0                .974 (.974, .974)   .897 (.900, .895)
A-C           0-1              .977 (.977, .977)   .895 (.897, .893)
A-C           0-3              .978 (.978, .978)   .892 (.895, .890)
A-C           0-5              .978 (.978, .978)   .895 (.897, .893)

Table 2: Experiments with features in context windows 0, 0-1, 0-3 and 0-5; training data = TüBa80; F1 score (precision, recall)

H(w_1 \ldots w_n) = -\frac{1}{N} \log P(w_1 \ldots w_n) \quad (1)

The goal of the in-domain training data selection was the automatic selection of a small number of in-domain sentences that might improve the tagging accuracy. Ideally, the in-domain sentences to be selected should not be observed in the TüBa and NOAH training data but should be typical of ANKO. We considered two methods: 1) ranking-based entropy scoring (henceforth, method [A]) and 2) difference-based entropy scoring (henceforth, method [B]). Ranking-based entropy scoring measures how informative in-domain sentences are based on a language model trained on out-of-domain data. The entropy scores were ranked in order from high to low. In this method, in-domain sentences with high entropy scores were assumed to be distinct from the out-of-domain data and thus more informative. This method is comparable to Axelrod et al. (2011), in which perplexity was used instead of cross entropy. We inspected the top 300 sentences. They included salutations, greetings, signatures and dates. These discourse types are typical of postcards but are rarely included in a newspaper corpus. In contrast, difference-based entropy scoring measures the difference in entropy scores between a language model trained on out-of-domain data and one trained on in-domain sentences. In-domain sentences were considered informative if the difference in scores was large. This method is based on Moore and Lewis (2010). We inspected the top 300 sentences. These sentences were similar to those selected by the ranking-based entropy scores, and they were a mixture of typical discourse structures.
However, the selected sentences did not include the in-domain interpersonal and fragmental sentence patterns typically used in private communication. Thus, we did not find any sentences whose subject was in the first or second person, such as Danke für Deine Karte ('Thank you for your card'), or fragments such as sind glücklich hier oben gelandet ('happily landed up here'). We found that the variance of the higher entropy scores was high in TüBa (mean: 10, variance: 914) and low in ANKO (mean: 1, variance: 3), which indicated that the difference-based entropy scores were mainly driven by the TüBa scores. Therefore, these two methods selected similar sentences.
To detect typical main-message sentences in ANKO, we introduced two further methods: an in-domain ranking-based entropy score (henceforth, method [C]) and a difference-ranking-based entropy score (henceforth, method [D]). Method (C) selects the sentences with the lowest entropy scores based on a language model trained on the in-domain data. In method (D), we simply ranked the entropy scores of language models trained on TüBa and on ANKO, and ordered the differences in ranking from high to low.
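The paper describes the selection only at this level of detail; the following Python sketch illustrates difference-based scoring in the spirit of Moore and Lewis (2010), using add-one-smoothed unigram models instead of the word 4-gram models used in the paper. It is purely illustrative, not the authors' implementation.

    import math
    from collections import Counter

    def train_lm(sentences):
        # Add-one-smoothed unigram model (stand-in for the 4-gram models).
        counts = Counter(w for s in sentences for w in s)
        total = sum(counts.values())
        vocab = len(counts) + 1                      # +1 for unseen words
        return lambda w: (counts[w] + 1) / (total + vocab)

    def cross_entropy(sentence, prob):
        # H(w_1 ... w_n) = -(1/N) log P(w_1 ... w_n), cf. Equation (1)
        return -sum(math.log2(prob(w)) for w in sentence) / len(sentence)

    out_domain = [["der", "bundesrat", "tagt", "in", "bern"]]
    in_domain = [["viele", "gruesse", "aus", "rom"], ["liebe", "heidi"]]

    p_out, p_in = train_lm(out_domain), train_lm(in_domain)

    # Rank candidate in-domain sentences by H_out - H_in (largest first).
    ranked = sorted(in_domain,
                    key=lambda s: cross_entropy(s, p_out) - cross_entropy(s, p_in),
                    reverse=True)
    print(ranked[0])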

Training data                             Test
TüBa100                                   .899 (.902, .897)
TüBa100 + NOAH                            .898 (.901, .896)
TüBa100 + NOAH + 100 ANKO-A               .910 (.913, .908)
TüBa100 + NOAH + 100 ANKO-B               .908 (.911, .906)
TüBa100 + NOAH + 100 ANKO-C               .922 (.924, .920)
TüBa100 + NOAH + 100 ANKO-D               .926 (.928, .924)
TüBa100 + NOAH + 100 ANKO-R               .931 (.934, .929)
TüBa100 + NOAH + 300 ANKO-R/A/B/C/D       .941 (.943, .939)

Table 3: Experiments with training data, using features (A), (B) and (C), tested on ANKO-TEST: F1 score (precision, recall)

We experimented with these in-domain data selection methods (A)-(D), with random selection (R) as our baseline. For this purpose, we manually annotated 300 sentences for each of the training sets (R) and (A)-(D). The results are shown in Table 3. The 300 sentences selected by method (D) outperformed the other three entropy-based sampling methods, which indicates that ranked-difference-based entropy scoring is a viable sampling method, particularly if the difference in the variance of the entropy scores between out-of-domain and in-domain data is large. However, the selected sentences did not outperform the in-domain data selected at random. Finally, we tested the model trained on TüBa, NOAH and 1,200 training sentences from the postcards, which achieved the best F1 score of 0.94.

3.4.4. Evaluation
To evaluate the developed POS tagger, we created a test set derived from the postcard corpus (ANKO-EVAL). We re-trained the CRF model with the features (A), (B) and (C) and the training data TüBa100, NOAH and ANKO (i.e. ANKO-TEST and all ANKO in-domain training sentences R/A/B/C/D). For comparison, we used TreeTagger. The evaluation revealed that our POS tagger achieved an F1 score of 0.93 (precision 0.94, recall 0.93), which outperformed TreeTagger's F1 score of 0.86 (precision 0.86, recall 0.86).
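The paper does not state how precision, recall and F1 were aggregated over tags; a per-token, weighted-average computation such as the following scikit-learn sketch is one plausible reading, shown here only to make the reported scores concrete.

    from sklearn.metrics import precision_recall_fscore_support

    # Toy gold and predicted tag sequences (flattened over tokens).
    gold = ["ADJA", "NE", "$,", "PIAT", "NN"]
    pred = ["ADJA", "NN", "$,", "PIAT", "NN"]

    # Weighted averaging over tags is an assumption, not stated in the paper.
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold, pred, average="weighted", zero_division=0)
    print(f"precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")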

4. Conclusion
In this paper, we described the process of digitalising, transcribing and annotating over 11,000 handwritten postcards. In particular, we demonstrated that a POS tagger can be adapted to domain-specific text by adding a small amount of in-domain data. We showed that entropy-based training data sampling was competitive with random sampling in performing this task. In future work, we will test our POS tagger on texts written in Swiss German.

5. Acknowledgments
This work has been funded under SNSF grant 160238. We thank all the project members: Joachim Scharloth, Noah Bubenhofer, Selena Calleri, Maaike Kellenberger, David Koch, Marcel Naef, Dewi Josephine Obert, Jan Langenhorst and Michaela Schnick.



6. Bibliographical References

Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355-362.

Bartz, T., Beißwenger, M., and Storrer, A. (2013). Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge. JLCL, 28:157-198.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Dürscheid, C. (2016). Einführung in die Schriftlinguistik. Vandenhoeck & Ruprecht, Göttingen, 5th edition.

Hollenstein, N. and Aepli, N. (2014). Compilation of a Swiss German dialect corpus and its application to PoS tagging. COLING 2014, page 85.

Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education International, Upper Saddle River, New Jersey, 2nd edition.

Koch, P. and Oesterreicher, W. (2008). Mündlichkeit und Schriftlichkeit von Texten. In Nina Janich, editor, Textlinguistik: 15 Einführungen. G. Narr.

Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282-289.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Moore, R. C. and Lewis, W. (2010). Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers (ACLShort), pages 220-224.

Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT Workshop, Dublin, Ireland.

Sugisaki, K. (2017). Word and sentence segmentation in German: Overcoming idiosyncrasies in the use of punctuation in private communication. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL).

Telljohann, H., Hinrichs, E. W., Kübler, S., Zinsmeister, H., and Beck, K. (2012). Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Technical report, Universität Tübingen.

Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), pages 173-180.


