From old texts to modern spellings


From old texts to modern spellings: an experiment in automatic normalisation

Iris Hendrickx and Rita Marquilhas

Centro de Linguística da Universidade de Lisboa, Lisbon, Portugal

January 5, 2012

Jan 5, 2012, Heidelberg From old texts to modern spellings


Background & Motivation

Aim: Automatic reduction of spelling variation in a historical corpus

The goal was to reduce the problem of spelling variation in the Portuguese CARDS-FLY corpus of personal letters written from the 16th to the 20th century.

The original spelling in the corpus is kept for language change research. However, the spelling variation is a hindrance:

For lexical or grammatical research.

For searching with a query web interface.

The corpus is valuable as part of the country's cultural heritage, reflecting the everyday lives of ordinary people. Editions intended for the lay public should be in clean text.


Overview

Our approach

We adapt the VARD2 normalisation tool for Portuguese and evaluate its performance on our data.

We study its performance for 4 different time periods.

We investigate whether to use specialised tools trained separately for each time period, or one tool trained on the full training set for the whole period.

Secondly, we investigate the effect of text normalisation on the application of NLP tools such as a POS-tagger.


The corpus

The current version of the CARDS-FLY corpus [1] contains 1802 letters. The letters are manually transcribed into an electronic XML-TEI file format that includes rich and detailed historical and sociological metadata.

Origin of the personal letters

1500-1800: from religious legal proceedings, as evidence used by the Inquisition,

19th C: legal evidence in criminal cases heard by the Portuguese Royal Appeal Court,

20th C: soldiers who fought in World War I or in the Portuguese Colonial War, political prisoners and emigrants.

[1] CARDS-FLY corpus: http://alfclul.clul.ul.pt/cards-fly/


Letter from 1592 addressed to the merchant Joao Nunes

Figure: show an example picture


Manual transcription of the letter

Figure: Full description at:

http://alfclul.clul.ul.pt/cards-fly/index.php?page=infoLetter&carta=CARDS4006.xml


Manual transcription of the letter in XML

Figure: Full description at:

http://alfclul.clul.ul.pt/cards-fly/index.php?page=infoLetter&carta=CARDS4006.xml


Division in 4 time periods

Classical and Modern Portuguese categories (1500-1800 and 1801-now)

The earlier period was subdivided because of language changes in European and Brazilian Portuguese around 1700. The Modern period was subdivided because of the 1911 spelling reform (noticeable by 1930). The current version of the corpus contains 1802 letters; here we use a subset of 200 letters.

Time period   Total   Subset
1500-1700       186       20
1701-1800       505       56
1801-1930       771       86
1931-1974       340       38
total          1802      200


VARD2 normalisation tool

VARD2 is a statistical normalisation tool developed for historical English. It combines several resources to detect and replace spelling variants with normalised forms. VARD2 uses:

a modern lexicon

a spelling variants dictionary that maps variants to their modern counterparts

a list of letter replacement rules

a phonetic matching algorithm

an edit distance algorithm to determine the most likely candidate

a training set with encoded normalisations (optional)
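How such resources can work together is easiest to see in a small sketch. The code below is not VARD2's implementation: it is a simplified illustration, assuming a fixed order of lookups (lexicon, variants dictionary, replacement rules, edit distance) and leaving out phonetic matching and learned weights. The function names, the `max_distance` cut-off, the `nb` to `mb` rule and the toy word lists are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalise(token, lexicon, variants, rules, max_distance=2):
    """Return a modern form for token, or token itself if none is found."""
    if token in lexicon:            # already a modern word: leave untouched
        return token
    if token in variants:           # known variant: use the stored mapping
        return variants[token]
    for old, new in rules:          # letter replacement rules
        candidate = token.replace(old, new)
        if candidate != token and candidate in lexicon:
            return candidate
    # Fall back to the closest lexicon entry (ties broken alphabetically).
    best = min(lexicon, key=lambda w: (edit_distance(token, w), w))
    return best if edit_distance(token, best) <= max_distance else token


# Toy resources, invented for this illustration only.
LEXICON = {"tambem", "um", "coisa"}
VARIANTS = {"ttanbem": "tambem"}
RULES = [("nb", "mb")]

print(normalise("tanbem", LEXICON, VARIANTS, RULES))  # resolved by the rule
print(normalise("cousa", LEXICON, VARIANTS, RULES))   # resolved by edit distance
```

The first lookup also shows why a form that is wrongly listed in the modern lexicon is never normalised: it is returned unchanged before any other resource is consulted.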


Resources for historical Portuguese

The Tycho Brahe Parsed Corpus of Historical Portuguese (TBCHP) [2] consists of 52 texts from different text genres, from the Middle Ages to the Late Modern era. Some texts maintain the original spelling variations, while in others the spelling was standardised.

The Historical Dictionary of Brazilian Portuguese (HDBP) is constructed on the basis of a historical Portuguese corpus of 1733 texts and 5 million tokens (Giusti et al., 2007).

A Brazilian Portuguese (BP) spelling variants dictionary [3] based on HDBP, with a corpus-based tool to automatically generate and test rewrite rules that cluster spelling variants together.

[2] TBCHP: http://www.tycho.iel.unicamp.br/~tycho/corpus/en

[3] http://www.nilc.icmc.usp.br/nilc/projects/hpc/


VARD2 for Portuguese

We re-use existing resources.

As modern lexicon: the Multifunctional Computational Lexicon of Contemporary Portuguese [4], which contains 26K lemmas and 140K tokens.

We created letter replacement rules based on the rule set described in detail by Giusti et al. (2007).

The BP spelling variants dictionary was converted to VARD2 format.

[4] Available for download at: http://www.clul.ul.pt/en/resources/88-project-multifunctional-computational-lexicon-of-contemporary-portuguese-r


BP spelling variants dictionary converted to VARD2 format

The original dictionary clusters spelling variants around one common word form, the so-called head word of the cluster. To integrate it into the VARD2 tool, we needed a list of one-to-one mappings between variants and their modernised counterparts. Mapping variants to head words does not always work, so we check whether the head word is in the modern lexicon; if it is not, we choose the most frequent spelling variant that is in the lexicon as the modern counterpart.

tambem (12211)
tambem (9002)
tambem (3160)
tanbem (47)
ttambem (1)
ttanbem (1)
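The conversion rule described on this slide can be sketched as follows. This is a minimal illustration, not the authors' conversion script: the function name, the dictionary layout and the choice of head word in the example are assumptions, and the frequencies are borrowed from the cluster above purely for illustration.

```python
def cluster_to_mappings(head, variant_counts, lexicon):
    """Turn one variant cluster into one-to-one variant -> modern mappings.

    head           -- head word of the cluster
    variant_counts -- dict of every spelling in the cluster (head included)
                      to its corpus frequency
    lexicon        -- set of modern word forms
    """
    if head in lexicon:
        target = head
    else:
        # Head word is not a modern form: fall back to the most frequent
        # spelling in the cluster that the modern lexicon does contain.
        in_lexicon = [v for v in variant_counts if v in lexicon]
        if not in_lexicon:
            return {}  # no modern counterpart can be determined
        target = max(in_lexicon, key=lambda v: variant_counts[v])
    return {v: target for v in variant_counts if v != target}


# Hypothetical cluster whose head word is not in the modern lexicon.
cluster = {"tambem": 9002, "tanbem": 47, "ttambem": 1, "ttanbem": 1}
mappings = cluster_to_mappings("tanbem", cluster, lexicon={"tambem"})
print(mappings)
```

Clusters in which no spelling is found in the lexicon yield no mappings in this sketch; how such cases were handled in the actual conversion is not stated on the slide.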


Data set

For these experiments:

a random subset of 200 letters from the CARDS-FLY corpus,

tokenised, with names converted to the string ‘NAME’,

normalisation and POS manually verified by a linguist.

This data set was split into 100 letters for training the VARD tool and 100 for the evaluation set.

Evaluation scores are computed as recall, precision and F-score.
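The slides do not spell out how the scores are counted, so the sketch below shows one plausible token-level definition, stated here as an assumption: precision is the share of the system's changes that match the gold standard, recall is the share of tokens needing normalisation that the system corrected, and accuracy is the share of all tokens whose output equals the gold form. The function name and the example tokens are illustrative.

```python
def evaluate(original, system, gold):
    """Token-level scores for a normalisation run (assumed definitions)."""
    assert len(original) == len(system) == len(gold)
    triples = list(zip(original, system, gold))
    changed = sum(o != s for o, s, g in triples)            # system edits
    needed = sum(o != g for o, s, g in triples)             # gold edits
    correct = sum(o != s and s == g for o, s, g in triples)  # right edits
    matches = sum(s == g for o, s, g in triples)            # incl. untouched
    precision = correct / changed if changed else 0.0
    recall = correct / needed if needed else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    accuracy = matches / len(triples) if triples else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


original = ["hum", "dia", "tanbem", "cousa", "NAME"]
system = ["hum", "dia", "tambem", "cousa", "NAME"]   # one correct change
gold = ["um", "dia", "tambem", "coisa", "NAME"]      # three changes needed
scores = evaluate(original, system, gold)
print(scores)
```

In this toy run every change the system makes is right, but it finds only one of the three tokens that need changing, so precision is high and recall is low. Accuracy sits in between because it also counts the tokens that were correctly left alone.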


Statistics

Table: Statistics for the evaluation set of 100 letters, divided into the four time periods. ‘#Tok/file’ is the average number of tokens per letter, ‘#Norm/file’ the average number of manual spelling corrections per letter, and ‘%Norm/tok’ the percentage of all tokens that are normalised.

Period      Files    Tok   #Tok/file  #Norm/file  %Norm/tok
1500-1700      10   2262       226.2        56.9       25.2
1701-1800      28  13913       496.9       120.8       24.3
1801-1930      43  14343       333.6        60.7       18.1
1931-1974      19   6817       358.8        16.1        4.2


Training VARD

Training VARD sets the weights between the different modules and adds new words to the variants dictionary. VARD2 has two parameters:

P1: the balance between recall and precision, set to 1.

P2: the replacement threshold, which decides whether a potential variant should be replaced with the equivalent modern candidate; determined experimentally.

We ran a series of experiments with different thresholds. We divided the training set into 80 letters for training and 20 as a development set. We tested the following settings for this threshold: 1, 5, 10, 20, ..., 90.


Training VARD

Threshold      R      P   F-score
1           49.3   97.2      65.5
5           47.5   98.3      64.0
10          47.5   98.3      64.0
20          47.5   98.3      64.0
30          47.5   98.3      64.0
40          47.5   98.3      64.0
50          47.0   98.2      63.5
60          46.6   98.2      63.2
70          45.2   98.2      61.9
80          39.5   98.1      56.4
90          15.4   97.9      26.6
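A threshold can be read off these development-set results mechanically. The slides do not say which value was finally used, so the selection rule below (highest F-score, ties broken in favour of the lower threshold) is an assumption, and the names `DEV_RESULTS` and `best_threshold` are illustrative.

```python
# Development-set results from the table: threshold -> (recall, precision, F).
DEV_RESULTS = {
    1: (49.3, 97.2, 65.5),
    5: (47.5, 98.3, 64.0),
    10: (47.5, 98.3, 64.0),
    20: (47.5, 98.3, 64.0),
    30: (47.5, 98.3, 64.0),
    40: (47.5, 98.3, 64.0),
    50: (47.0, 98.2, 63.5),
    60: (46.6, 98.2, 63.2),
    70: (45.2, 98.2, 61.9),
    80: (39.5, 98.1, 56.4),
    90: (15.4, 97.9, 26.6),
}


def best_threshold(results):
    """Pick the threshold with the highest F-score; prefer lower on ties."""
    return max(results, key=lambda t: (results[t][2], -t))


print(best_threshold(DEV_RESULTS))
```

Under this rule the lowest threshold wins, because raising the threshold costs recall while precision stays almost flat.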


Train on 100, Test on 100

Table: Accuracy (Acc), recall (R), precision (P) and F-score for VARD2 trained on 100 training letters.

Time period     Acc      R      P   F-score
total         91.92  61.10  97.21     75.03
1500-1700     91.75  68.72  98.74     81.04
1701-1800     90.77  65.95  97.72     78.75
1801-1930     91.06  56.00  96.81     70.96
1931-1974     96.46  35.23  87.50     50.24

The VARD2 tool has a much higher precision than recall.

Better results would have been expected for the period 1801-1930, which has the most data. However, this was a time when the lower classes were becoming semi-literate, and they produced many creative misspellings that are extremely difficult to predict.

The latest time period shows high accuracy but remarkably low recall and F-score. Accuracy also takes true negatives into account, i.e. the tokens that need no normalisation and are left unchanged.


Table: Accuracy (Acc), recall (R), precision (P) and F-score for specialised VARD2 tools, trained and tested separately for the four time periods.

Period        Acc      R      P   F-score
1500-1700   92.52  71.88  98.55     83.13
1701-1800   91.81  70.24  97.49     81.65
1801-1930   90.61  53.45  97.08     68.94
1931-1974   96.32  31.88  87.96     46.80

The change in training set mainly affects recall, which increases for the oldest data. For the two more modern data sets, precision increases slightly at the cost of recall.


Error analysis

Most frequent error: the spelling of ‘um’ with ‘h-’ (hum). VARD2 does not recognise this as a variant because hum is listed in the modern lexicon.

For all periods: problems with diacritics.

The older letters contain many archaisms (e.g. inda, cousa) that are erroneously part of the VARD2 modern lexicon list.

The older letters also contain many abbreviations (e.g. v., va., etcra.) which are difficult to recognise automatically.

Confusion between different spellings: for 1500-1700, s/c/ss for the sound [s]; for 1701-1800, z/s for the sound [z]; for 1801-1930, the phonetic spelling of ‘i’ for ‘e’ occurs frequently.


Evaluation in Use

Quantify the effect of automatic normalisation on an NLP tool such as a POS-tagger.

We trained a POS-tagger on 19 texts from TBCHP (280 POS-labels) and tested it on 3 versions of the test data:

the original unnormalised text,

text automatically standardised using VARD2 (trained on 100 letters),

the gold standard of manual annotation.


Results POS-tagger

Table: POS-tagger accuracy for the evaluation set of 100 letters, based on the original non-normalised text, text automatically normalised by VARD2, and the gold standard created by manual annotation.

              Total   Unknown    Known
# tokens     37,335     5,869   31,466
Original      76.86     42.34    87.06
VARD2         83.41     47.57    90.10
Gold          86.58     49.11    91.94
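One way to read this table is to ask how much of the distance between the unnormalised text and the manual gold standard the automatic normalisation recovers. This measure is not reported on the slides; the helper below is a small illustrative calculation on the accuracy figures above, and the name `gap_closed` is hypothetical.

```python
# POS-tagger accuracy from the table: column -> (original, VARD2, gold).
ACCURACY = {
    "Total": (76.86, 83.41, 86.58),
    "Unknown": (42.34, 47.57, 49.11),
    "Known": (87.06, 90.10, 91.94),
}


def gap_closed(original, system, gold):
    """Fraction of the original-to-gold accuracy gap recovered by the system."""
    return (system - original) / (gold - original)


for column, (orig_acc, vard_acc, gold_acc) in ACCURACY.items():
    share = gap_closed(orig_acc, vard_acc, gold_acc)
    print(f"{column}: {share:.1%} of the gap closed")
```

For the Total column this comes to roughly two thirds of the gap, which puts the 6.55-point gain from VARD2 in proportion to the 9.72 points that perfect normalisation would give.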


In sum

Standardising the spelling of historical Portuguese

We adapted VARD2 to Portuguese, re-using several Portuguese resources.

We evaluated the tool for 4 different time periods.

We also investigated whether it was more useful to have specialised normalisation tools for each time period, or whether the tool benefits more from one large training set covering the whole time period.

We did an extrinsic evaluation by looking at the effect of normalisation on a POS-tagger's performance.


Conclusions

In all periods, the letter writers can be seen to struggle with two problems:

how to master etymological spellings without knowing Latin, Greek, or Old Portuguese,

how to master phonographic spellings when these never obey purely phonetic facts (morphological and lexical information influence the spelling too).


Conclusions

Observations

VARD2 performs best on the older letters and worst on the most modern ones.

For the Classical period the advantage of a specialised tool outweighs the smaller amount of data.

For the Modern period a tool trained using a larger, diverse data set works better.

Automatic normalisation helps improve the performance of the POS-tagger on historical data.


Future work

Improving the Portuguese VARD2 by fine-tuning the modules:

improve the rewrite rules by generating them automatically on the basis of the corpus,

manually check the variants lexicon,

filter out archaisms from the modern lexicon,

evaluate the effect of the phonetic matching element,

treat abbreviations separately, since they are marked in the original XML.

We would also like to study the performance of the POS-tagger for the different time periods.
