Post on 18-Dec-2014
Background | Data and Methods | Results | Conclusions
From old texts to modern spellings: an experiment in automatic normalisation
Iris Hendrickx and Rita Marquilhas
Centro de Linguística da Universidade de Lisboa, Lisbon, Portugal
January 5, 2012
Jan 5, 2012, Heidelberg From old texts to modern spellings
Background & Motivation
Aim: Automatic spelling variation reduction in a historical corpus
The goal was to reduce the problem of spelling variation in the Portuguese CARDS-FLY corpus of personal letters written from the 16th to the 20th century.
The original spelling in the corpus is kept for research on language change. However, the spelling variation is a hindrance:
For lexical or grammatical research,
For searching with a query web interface.
The corpus is valuable as part of the country’s cultural heritage, reflecting the everyday lives of ordinary people. Editions intended for the lay public should be in clean text.
Overview
Our approach
We adapt the VARD2 normalisation tool for Portuguese and evaluate its performance on our data.
We study its performance for 4 different time periods.
We investigate whether to use specialised tools trained separately for each time period, or one tool trained on the full training set for the whole period.
Secondly, we investigate the effect of text normalisation on the application of NLP tools such as a POS-tagger.
The corpus
The current version of the CARDS-FLY corpus¹ contains 1802 letters. The letters are manually transcribed into an electronic XML-TEI file format including rich and detailed historical and sociological meta-data.
Origin of the personal letters
1500-1800: from religious legal proceedings, as evidence used by the Inquisition,
19th C: legal evidence, in criminal cases heard by the Portuguese Royal Appeal Court,
20th C: soldiers who fought in World War I or in the Portuguese Colonial War, political prisoners and emigrants.
¹ CARDS-FLY corpus: http://alfclul.clul.ul.pt/cards-fly/
The corpus | VARD2 | Experimental setup
Letter from 1592 addressed to the merchant Joao Nunes
Figure: image of the original manuscript letter
Manual transcription of the letter
Figure: Full description at:
http://alfclul.clul.ul.pt/cards-fly/index.php?page=infoLetter&carta=CARDS4006.xml
Manual transcription of the letter in XML
Figure: Full description at:
http://alfclul.clul.ul.pt/cards-fly/index.php?page=infoLetter&carta=CARDS4006.xml
Division into 4 time periods
Classical and Modern Portuguese (1500-1800 and 1801-present)
The earlier period was subdivided because of European and Brazilian Portuguese language changes around 1700. The Modern period was subdivided because of the 1911 spelling reform (noticeable by 1930). The current version of the corpus contains 1802 letters; here we use a subset of 200 letters.
Time period   Total   Subset
1500-1700       186       20
1701-1800       505       56
1801-1930       771       86
1931-1974       340       38
total          1802      200
VARD2 normalisation tool
VARD2 is a statistical normalisation tool developed for historical English. It combines several resources to detect spelling variants and replace them with normalised forms. VARD2 uses:
a modern lexicon
a spelling variants dictionary that matches variants against their modern counterparts
a list of letter replacement rules
a phonetic matching algorithm
an edit distance algorithm to determine the most likely candidate
a training set with encoded normalisations (optional)
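The edit distance component can be illustrated with a short Python sketch. This is not VARD2's own implementation; the function names and the three-word lexicon are assumptions made for the example. It ranks lexicon entries by Levenshtein distance to a variant, here the archaic cousa (modern coisa).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def rank_candidates(variant, lexicon):
    """Sort lexicon words by edit distance to the variant (ties alphabetical)."""
    return sorted(lexicon, key=lambda word: (levenshtein(variant, word), word))


# Hypothetical mini-lexicon, for illustration only.
modern_lexicon = ["coisa", "casa", "causa"]
print(rank_candidates("cousa", modern_lexicon))
# prints: ['causa', 'coisa', 'casa']
```

In this toy case causa and coisa are both one edit away from cousa, which suggests why edit distance is combined with the other resources in the list rather than used alone.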
Resources for historical Portuguese
The Tycho Brahe Parsed Corpus of Historical Portuguese (TBCHP)² consists of 52 texts from different text genres from the Middle Ages to the Late Modern era. Some texts maintain the original spelling variations while in others the spelling was standardised.
The Historical Dictionary of Brazilian Portuguese (HDBP) is constructed on the basis of a historical Portuguese corpus of 1733 texts and 5 million tokens (Giusti et al., 2007).
A Brazilian Portuguese (BP) spelling variants dictionary³ based on the HDBP, with a corpus-based tool to automatically generate and test rewrite rules that cluster spelling variants together.
² TBCHP: http://www.tycho.iel.unicamp.br/~tycho/corpus/en
³ http://www.nilc.icmc.usp.br/nilc/projects/hpc/
VARD2 for Portuguese
We re-use existing resources.
As modern lexicon: the Multifunctional Computational Lexicon of Contemporary Portuguese⁴, which contains 26K lemmas and 140K tokens.
We created letter replacement rules based on the rule set described in detail by Giusti et al. (2007).
The BP spelling variants dictionary was converted to VARD2 format.
⁴ Available for download at: http://www.clul.ul.pt/en/resources/88-project-multifunctional-computational-lexicon-of-contemporary-portuguese-r
BP spelling variants dictionary converted to VARD2 format
The original dictionary clusters spelling variants around one common word form, the so-called head word of the cluster. To integrate it into the VARD2 tool, we needed a list of one-to-one mappings between variants and their modernised counterparts. Mapping variants to head words does not always work, because the head word is not necessarily a modern form. We therefore check whether the head word is in the modern lexicon; if it is not, we choose the most frequent spelling variant that is in the lexicon as the modern counterpart.
Example cluster (variant, frequency):
tambem (12211)
tambem (9002)
tambem (3160)
tanbem (47)
ttambem (1)
ttanbem (1)
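The conversion rule described above can be sketched in Python as follows. The function names, the small cluster and its frequencies are hypothetical, and returning no mapping when no cluster member is in the lexicon is an assumption, since that case is not discussed here.

```python
def modern_counterpart(head_word, cluster, modern_lexicon):
    """Choose the modern form for one cluster of spelling variants.

    cluster maps each variant spelling to its corpus frequency.
    Returns None if no member of the cluster is in the modern lexicon
    (an assumed fallback, not taken from the original work).
    """
    if head_word in modern_lexicon:
        return head_word
    in_lexicon = [variant for variant in cluster if variant in modern_lexicon]
    if not in_lexicon:
        return None
    # Otherwise take the most frequent variant that the lexicon knows.
    return max(in_lexicon, key=lambda variant: cluster[variant])


def to_mappings(head_word, cluster, modern_lexicon):
    """Turn one cluster into one-to-one (variant, modern form) pairs."""
    target = modern_counterpart(head_word, cluster, modern_lexicon)
    if target is None:
        return []
    return [(variant, target) for variant in sorted(cluster) if variant != target]


# Hypothetical cluster and lexicon, for illustration only.
cluster = {"também": 120, "tambem": 90, "tanbem": 4}
lexicon = {"também", "casa"}
print(to_mappings("tambem", cluster, lexicon))
# prints: [('tambem', 'também'), ('tanbem', 'também')]
```

Here the head word tambem is not in the lexicon, so the most frequent in-lexicon variant becomes the target of every mapping.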
Data set
For these experiments we used:
a random subset of 200 letters from the CARDS-FLY corpus,
tokenised, with names converted to the string ‘NAME’,
with normalisation and POS tags manually verified by a linguist.
This data set was split into 100 letters for training the VARD tool and 100 letters for the evaluation set.
Evaluation scores are computed as recall, precision and F-score.
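As a reminder of how these scores relate, here is a minimal Python sketch. The function names and the counts in the usage example are hypothetical, and treating a true positive as a variant that received the correct modern form is an assumption about the scoring.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from counts of normalisation decisions."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 weighs them equally)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 60 variants normalised correctly,
# 2 wrong replacements, 40 variants missed.
precision, recall = precision_recall(true_pos=60, false_pos=2, false_neg=40)
print(f"P={precision:.3f} R={recall:.3f} F={f_score(precision, recall):.3f}")
```

Applied to the overall precision and recall reported in the results (97.21 and 61.10), `f_score` gives about 75.0, in line with the reported F-score of 75.03.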
Statistics
Table: Statistics for the evaluation set of 100 letters, divided into the four time periods. ‘#Tok/file’ is the average number of tokens per letter, ‘#Norm/file’ the average number of manual spelling corrections per letter, and ‘%Norm/tok’ the percentage of all tokens that were normalised.
Period      Files     Tok   #Tok/file   #Norm/file   %Norm/tok
1500-1700      10    2262       226.2         56.9        25.2
1701-1800      28   13913       496.9        120.8        24.3
1801-1930      43   14343       333.6         60.7        18.1
1931-1974      19    6817       358.8         16.1         4.2
Training VARD
Training VARD sets the weights between the different modules and adds new words to the variants dictionary. VARD2 has two parameters:
P1: the balance between recall and precision, set to 1.
P2: the replacement threshold, which decides whether a potential variant should be replaced with the equivalent modern candidate; determined experimentally.
We ran a series of experiments with different thresholds. We divided the training set into 80 letters for training and 20 as a development set. We tested the following settings for this threshold: 1, 5, 10, 20, ..., 90.
Training VARD
Threshold      R      P   F-score
1           49.3   97.2      65.5
5           47.5   98.3      64.0
10          47.5   98.3      64.0
20          47.5   98.3      64.0
30          47.5   98.3      64.0
40          47.5   98.3      64.0
50          47.0   98.2      63.5
60          46.6   98.2      63.2
70          45.2   98.2      61.9
80          39.5   98.1      56.4
90          15.4   97.9      26.6
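Reading a setting off this table can be sketched in Python as below. The selection criterion (highest F-score on the development set, ties going to the lowest threshold) is an assumption, since the slides do not state which criterion or threshold was finally used; the helper name is hypothetical.

```python
# (threshold, recall, precision, F-score) as reported for the development set.
dev_results = [
    (1, 49.3, 97.2, 65.5),
    (5, 47.5, 98.3, 64.0),
    (10, 47.5, 98.3, 64.0),
    (20, 47.5, 98.3, 64.0),
    (30, 47.5, 98.3, 64.0),
    (40, 47.5, 98.3, 64.0),
    (50, 47.0, 98.2, 63.5),
    (60, 46.6, 98.2, 63.2),
    (70, 45.2, 98.2, 61.9),
    (80, 39.5, 98.1, 56.4),
    (90, 15.4, 97.9, 26.6),
]


def best_threshold(results):
    """Return the row with the highest F-score; ties go to the lowest threshold."""
    return max(results, key=lambda row: (row[3], -row[0]))


threshold, recall, precision, f_score = best_threshold(dev_results)
print(threshold, f_score)
# prints: 1 65.5
```

Under that assumed criterion the lowest threshold wins: precision stays near 98 across all settings, so the differences in F-score come almost entirely from recall.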
Results of normalisation | Results of POS-tagging
Train on 100, Test on 100
Table: Accuracy, recall, precision and F-score for VARD2 trained on the 100 training letters.
Time period     Acc       R       P   F-score
total         91.92   61.10   97.21     75.03
1500-1700     91.75   68.72   98.74     81.04
1701-1800     90.77   65.95   97.72     78.75
1801-1930     91.06   56.00   96.81     70.96
1931-1974     96.46   35.23   87.50     50.24
The VARD2 tool has a much higher precision than recall.
Better results might have been expected for the period 1801-1930, which has the most data. However, this was a time when the lower classes were becoming semi-literate, and they produced many creative misspellings that are extremely difficult to predict.
The latest time period shows high accuracy but remarkably low recall and F-score. Accuracy also takes into account true negatives (tokens that are correctly left unchanged), and in this period few tokens need normalisation.
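The gap between accuracy and recall can be made concrete with a small Python sketch. The counts are hypothetical, chosen only to resemble a text in which about 4% of the tokens need normalisation; they are not the counts behind the table.

```python
def scores(tp, fp, fn, tn):
    """Accuracy, recall and precision from the four outcome counts.

    tp: variants replaced with the correct modern form
    fp: tokens replaced wrongly
    fn: variants that were left unnormalised
    tn: tokens correctly left unchanged
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision


# Hypothetical text of 1000 tokens, 40 of which are spelling variants.
accuracy, recall, precision = scores(tp=14, fp=2, fn=26, tn=958)
print(f"accuracy={accuracy:.1%} recall={recall:.1%} precision={precision:.1%}")
# prints: accuracy=97.2% recall=35.0% precision=87.5%
```

Because the 958 untouched tokens dominate the total, accuracy stays high even though most of the variants are missed.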
Table: Accuracy, recall, precision and F-score for specialised VARD2 tools, trained and tested separately for the four time periods.
Period          Acc       R       P   F-score
1500-1700     92.52   71.88   98.55     83.13
1701-1800     91.81   70.24   97.49     81.65
1801-1930     90.61   53.45   97.08     68.94
1931-1974     96.32   31.88   87.96     46.80
The change in training set mainly affects recall, which increases for the two older periods. For the two more modern periods, precision increases slightly at the cost of recall.
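The comparison between the single tool and the specialised tools can be summarised per period with a few lines of Python. The F-scores are copied from the two result tables; the dictionary and function names are hypothetical.

```python
# F-scores per period: one tool trained on all 100 letters vs. period-specific tools.
general = {"1500-1700": 81.04, "1701-1800": 78.75,
           "1801-1930": 70.96, "1931-1974": 50.24}
specialised = {"1500-1700": 83.13, "1701-1800": 81.65,
               "1801-1930": 68.94, "1931-1974": 46.80}


def f_score_gain(general_scores, specialised_scores):
    """F-score of the specialised tool minus that of the general tool, per period."""
    return {period: round(specialised_scores[period] - general_scores[period], 2)
            for period in general_scores}


for period, gain in f_score_gain(general, specialised).items():
    better = "specialised" if gain > 0 else "general"
    print(f"{period}: {gain:+.2f} ({better} tool scores higher)")
```

The sign of the difference flips between 1701-1800 and 1801-1930: the specialised tools score higher for the two Classical periods and lower for the two Modern ones.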
Error analysis
Most frequent error: the spelling of ‘um’ with an initial ‘h-’ (hum). VARD2 does not recognise this as a variant, since hum is listed in the modern lexicon.
For all periods: problems with diacritics.
The older letters have many archaisms (e.g. inda, cousa) that are erroneously included in the VARD2 modern lexicon.
The older letters also have many abbreviations (e.g. v., va., etcra.), which are difficult to recognise automatically.
Confusion between different spellings: for 1500-1700, s/c/ss for the sound [s]; for 1701-1800, z/s for the sound [z]; for 1801-1930, the phonetic spelling of ‘i’ for ‘e’ occurs frequently.
Evaluation in Use
Quantify the effect of automatic normalisation on an NLP tool such as a POS-tagger.
We trained a POS-tagger on 19 texts from TBCHP (280 POS labels) and tested it on 3 versions of the test data:
the original unnormalised text,
text automatically standardised using VARD2 (trained on 100 letters),
the gold standard of manual annotation.
Results POS-tagger
Table: POS-tagger accuracy for the evaluation set of 100 letters, based on the original non-normalised text, text automatically normalised by VARD2, and the gold standard created by manual annotation.
             Total   Unknown    Known
# tokens    37,335     5,869   31,466
Original     76.86     42.34    87.06
VARD2        83.41     47.57    90.10
Gold         86.58     49.11    91.94
In sum
Standardising the spelling of historical Portuguese
We adapted VARD2 to Portuguese, re-using several Portuguese resources.
We evaluated the tool for 4 different time periods.
We also investigated whether it was more useful to have specialised normalisation tools for each time period, or whether the tool benefits more from one large training set covering the whole time period.
We did an extrinsic evaluation by looking at the effect of normalisation on a POS-tagger’s performance.
Conclusions
In all periods, the letter writers can be seen to struggle with two problems:
how to master etymological spellings without knowing Latin, Greek, or Old Portuguese,
how to master phonographic spellings when these never follow purely phonetic facts (morphological and lexical information influence the spelling too).
Conclusions
Observations
VARD2 performs best on the older letters and worst on the most modern ones.
For the Classical period the advantage of a specialised tool outweighs the smaller amount of data.
For the Modern period a tool trained using a larger, diverse data set works better.
Automatic normalisation helps improve the performance of the POS-tagger on historical data.
Future work
Improving the Portuguese VARD2 by fine-tuning the modules:
improve the rewrite rules by generating them automatically on the basis of the corpus,
manually check the variants lexicon,
filter out archaisms from the modern lexicon,
evaluate the effect of the phonetic matching element,
treat abbreviations separately, since they are marked in the original XML.
We would also like to study the performance of the POS-tagger for the different time periods.