Colloquia Linguistica
Part II: The development of Automated Syntactic Taggers
Leif Grönqvist
Göteborg University
Overview
• Some basic things about corpora (quick)
  – What is a corpus
  – What can we do with it
• Part-of-speech tagging (slower)
  – What is the problem
  – Some common approaches
    • A rule based tagger
    • A statistical tagger
• Corpus tools
  – Different tools
  – Demonstration of Multitool
What is a corpus for a computational linguist?
• Various properties are important but the word ‘corpus’ is just Latin for ‘body’
• These properties should be considered:
  – Representativeness
  – Size
  – Form (annotation standard)
  – Standard reference
Representativeness
• A corpus used for analyzing spoken Swedish should ideally contain all utterances of Swedish ever spoken
• But this is impossible, so there are at least two strategies, depending on purpose:
  – Try to collect various dialogue types in sizes proportional to the "complete corpus"
  – Collect big enough portions of each type to make sure all wanted phenomena are found
• Regardless of which strategy you use, it is important to select the samples from each type carefully, preferably using random sampling
Corpus size: how big should it be?
• Depends on purpose!
• Some strategies:
  – Monitor corpus: as big as possible
    • Bank of English > 500 million tokens
    • Used for lexicography
  – Finite size, big enough for the current task
    • POS tagging, ~100 tags: 1 million tokens
    • Language model for automatic speech recognition: 100 million tokens
Machine readable form
• Corpora have been used in linguistics for more than 100 years.
• Now: a corpus is expected to be machine readable
• The annotations should be made in a way that makes extraction of wanted features as simple as possible
Standard reference (quick)
• Typical content of a research article: "We used the corpus XX, took 90% for training, and 10% for testing with our new algorithm. We then got 97.2% correctness, which is a significant improvement over the old tagger at the 99% level"
  – Exactly this corpus XX must then be available to other research groups
What to do with a corpus
• Check our linguistic intuition
• Annotate interesting features manually
• Use it for training of taggers and parsers
  – Annotate new data automatically
• But, be careful! A corpus is not the complete language
Text encoding
• Various encoding schemes around
  – Text based
    • Human and machine readable
    • Could be difficult to check for validity
  – Word processor based
    • Only human readable
    • Rarely used in computational linguistics
  – XML/SGML based
    • Machine readable
    • May be transformed to human readable form using XSLT
    • Formalisms and tools for free (well, more or less free)
    • Limitations of XML may be annoying sometimes
Some important properties (skip)
Important properties according to Geoffrey Leech:
• Possibility to extract the original corpus
• Possibility to separate the annotations
• Based on well defined guidelines
• Make clear how the annotations were done
• Make clear that there may be errors in the corpus
• Widely agreed theory-neutral annotation scheme
• No annotation scheme is the a priori standard scheme
Some annotation standards

TEI (Text Encoding Initiative)
• Huge standard for all types of texts and corpora, developed by the TEI Consortium since 1987
• SGML based in the beginning, but now XML

(X)CES (XML Corpus Encoding Standard)
• Highly inspired by the TEI
• Not as complicated, but only in beta version

ISLE (International Standards for Language Engineering)
• Developed by three working groups (lexicon, multimodality and evaluation)

CDIF (Corpus Document Interchange Format)
• Used by the British National Corpus
• A lot in common with the TEI
Some typical results directly extracted from a corpus
• Concordances (KWIC)
• Frequency lists
• N-gram statistics
• Probabilities
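All of these can be produced with a few lines of code. Below is a minimal sketch (not from the original slides) that computes a frequency list, bigram statistics and simple relative-frequency probabilities from a tokenized corpus; the tiny token list is a placeholder for real corpus data:

    from collections import Counter

    # Placeholder for a real tokenized corpus.
    tokens = ["det", "är", "som", "det", "är", "bra"]

    # Frequency list: token counts, most frequent first.
    freq_list = Counter(tokens).most_common()

    # N-gram statistics (here n=2): counts of adjacent token pairs.
    bigrams = Counter(zip(tokens, tokens[1:])).most_common()

    # Relative frequencies serve as simple probability estimates.
    probs = {w: c / len(tokens) for w, c in Counter(tokens).items()}

    print(freq_list)     # [('det', 2), ('är', 2), ('som', 1), ('bra', 1)]
    print(bigrams[0])    # (('det', 'är'), 2)
    print(probs["det"])  # 0.333...

Concordances (KWIC) are similar in spirit: find each occurrence of a keyword and print a fixed-width window of context around it, as in the next slide.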
Concordances
rer, matematiker och dataloger i Göteborgsregionen,
  bandavskrifter och dataloggar, skriver Feldt.|Si
, bandavskrifter och dataloggar.|Men den nya Palme
 Ahlberg, forskare i datalogi på Chalmers.|Av PER-
 Ahlberg, forskare i datalogi på Chalmers.|SIDAN 4
und blir professor i datalogi vid Umeå universitet
und blir professor i datalogi vid Umeå universitet
a fyra olika kurser: datalogi, pedagogik, teknik o
ybjer och Jan Smith, datalogi.|Sektionen för maski
atorer eller pluggar datalogi.|Så på fritiden leke
r det gäller trådlös datalogistik, nu kommer det ö
Frequency lists

74556 de      77810 det     90304 .
48104 ja      36843 är      56075 ,
39947 e       35471 och     40438 och
34342 å       32404 ja      33978 i
25694 så      30439 att     26358 att
25639 att     28628 jag     25634 det
22378 va      26059 så      21830 en
19134 som     19205 som     21333 som
18679 vi      18681 inte    19743 på
18084 inte    18469 har     15754 är
17611 på      18421 vi      14333 med
17214 man     17719 på      13837 för
16870 i       17377 man     13683 av
16846 då      17343 då      13547 jag
N-gram statistics

Bigrams:          4-grams:
3395 det är       42 i stället för att
2913 för att      36 för några år sedan
2451 det var      35 men det är inte
1560 att det      34 en stor del av
1351 är det       33 på samma sätt som
1278 i en         32 det var som om
1174 att han      31 att det är en
1003 i den        30 är en av de
 966 som en       30 men det var inte
 920 men det      28 vad är det för
 889 på en        28 det är svårt att
 884 att jag      27 det är som om
 882 är en        27 att det inte var
 882 med en       26 för ett år sedan
Part-of-speech tagging
• We want to assign the right part-of-speech (just as an example) to each word in a corpus
• Input is a tokenized corpus
• The tagset is determined in advance
• The word types in the corpus have various properties with respect to the training data:
  – Some are unambiguous
  – Some are ambiguous (typically 2-7 POS each)
  – Some are unknown (not in the training data)
An example
Tagset: noun, verb, pron, art, infmrk, prep
In:  $A: you have to book a chair on deck
Out: pron verb infmrk verb art noun prep noun
• But, “book” and “chair” may be either verb or noun - the tagger has to disambiguate!
• Several approaches to do this, all based on patterns and regularities in the language
Terms used in tagging
• Tagging: put the right label (i.e. word class) on each token
• Tagset: all possible labels (word classes)
• Tokenizing: divide the corpus into tokens (words, sentence boundaries)
• Training: find the rules or probabilities needed by the tagger
Various approaches
• Rule based tagging
  – Constraint based tagging (SweTwol, EngTwol by Lingsoft)
  – Transformation-based tagging (Eric Brill)
• Stochastic tagging (HMM)
  – Calculate the most probable tag sequence
  – Using maximum likelihood estimation
  – Or some bootstrap based training
Constraint based tagging
• Basic idea:
  – Assign all possible tags to each word
  – Remove tags according to a set of rules of the type: "if word+1 is an adj, adv or quantifier and the following is a sentence boundary and word-1 is not a verb like 'consider' then eliminate non-adv else eliminate adv"
  – Continue until no rule is applicable, but never remove the last tag on a word
• Typically more than 1000 hand written rules, but they may also be machine learned (see the sketch below)
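A minimal sketch of the eliminative idea in Python; the lexicon and the single constraint are invented for the running "book a chair" example, not real SweTwol/EngTwol rules:

    # Minimal sketch of eliminative (constraint-based) tagging. The lexicon
    # and the single constraint are invented, not real SweTwol/EngTwol rules.
    LEXICON = {
        "you": {"pron"}, "have": {"verb"}, "to": {"infmrk", "prep"},
        "book": {"noun", "verb"}, "a": {"art"},
        "chair": {"noun", "verb"}, "on": {"prep"}, "deck": {"noun"},
    }

    def not_verb_after_article(tags, i, words):
        """Constraint: eliminate 'verb' directly after an unambiguous article."""
        if i > 0 and LEXICON[words[i - 1]] == {"art"}:
            return tags - {"verb"}
        return tags

    def tag(words, constraints):
        tags = [set(LEXICON[w]) for w in words]  # step 1: all possible tags
        changed = True
        while changed:                           # step 2: apply rules until stable
            changed = False
            for i in range(len(words)):
                for rule in constraints:
                    new = rule(tags[i], i, words)
                    if new and new != tags[i]:   # never remove the last tag
                        tags[i] = new
                        changed = True
        return tags

    words = "you have to book a chair on deck".split()
    print(tag(words, [not_verb_after_article]))
    # "chair" is now unambiguously a noun; "book" and "to" stay ambiguous

Real constraint grammars use much richer contextual conditions, but the control loop is the same: apply rules until nothing changes, and never remove a word's last remaining tag.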
The example: Constraint grammar
• Tagset: nn, vb, pron, art, infmrk, prep
• First: look up all possible classes for each word
• Rules will then remove unwanted tags
Word    Step 1
you     pron
have    verb
to      infmrk
book    noun, verb
a       art
chair   noun, verb
on      prep
deck    noun
Transformation-based tagging
• Basic idea:
  – Set the most probable tag for each word as a start value
  – Change tags according to rules of the type: "if a word is tagged as a verb and the word before is an article, then change the tag to noun". Perform the rules in a specific order!
• Training is done using a tagged corpus:
  1. Write a set of rule templates of the type: "if word-1 or word+1 is an X then change the tag for word to Y"
  2. Among the set of possible rules, find the one with the highest score
  3. Continue from 2 until a lowest score threshold is passed
  4. Keep the ordered set of rules
• Rules will make errors that are corrected by later rules (see the sketch below)
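A toy sketch of the tagging phase in Python (the rule learning loop in steps 1-4 is omitted); the initial tags and the single transformation are invented for the running example:

    # Toy sketch of transformation-based (Brill-style) tagging.
    # A real tagger learns the ordered rule list from a tagged corpus.
    MOST_COMMON_TAG = {
        "you": "pron", "have": "verb", "to": "infmrk", "book": "noun",
        "a": "art", "chair": "noun", "on": "prep", "deck": "noun",
    }

    # Ordered transformations: (previous tag, old tag, new tag) means
    # "if the previous tag is X and this word is tagged old, retag it new".
    RULES = [
        ("infmrk", "noun", "verb"),  # after an infinitive marker: noun -> verb
    ]

    def tag(words):
        # Step 1: give every word its most common tag.
        tags = [MOST_COMMON_TAG[w] for w in words]
        # Step 2: apply the learned rules in order, left to right.
        for prev_tag, old, new in RULES:
            for i in range(1, len(tags)):
                if tags[i - 1] == prev_tag and tags[i] == old:
                    tags[i] = new
        return tags

    words = "you have to book a chair on deck".split()
    print(tag(words))
    # ['pron', 'verb', 'infmrk', 'verb', 'art', 'noun', 'prep', 'noun']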
The example: Transformation based learning
• Tagset: nn, vb, pron, art, infmrk, prep
• First: look up the most common tag for each word
• Rules will then change the tags to the right ones
Word    Step 1
you     pron
have    verb
to      infmrk
book    noun
a       art
chair   noun
on      prep
deck    noun
An HMM tagger: uses statistics (brief)
• The problem may be formulated as: find the tag sequence $T$ that is most probable given the word sequence $W$:

    $\hat{T} = \arg\max_T P(T \mid W)$

• Which may be reformulated (using Bayes' rule) as:

    $\hat{T} = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}$

• But the denominator is constant and may be removed, and we get:

    $\hat{T} = \arg\max_T P(W \mid T)\, P(T)$
HMM tagger, cont. (brief)
The Markov assumption (for n=3) and the chain rule give us:

    $P(W \mid T)\, P(T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-2}, t_{i-1})$

What we need now is:
• The lexical probabilities $P(w_i \mid t_i)$
• The tag trigram probabilities $P(t_i \mid t_{i-2}, t_{i-1})$
The example: HMM

Word    Seq.1   Seq.2   Seq.3   Seq.4
you     pron    pron    pron    pron
have    verb    verb    verb    verb
to      infmrk  infmrk  infmrk  infmrk
book    noun    noun    verb    verb
a       art     art     art     art
chair   noun    verb    noun    verb
on      prep    prep    prep    prep
deck    noun    noun    noun    noun
Select the sequence with the highest probability!
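A brute-force sketch of that selection for the toy example; all probabilities are invented, and the lexical probabilities $P(w \mid t)$ are omitted for brevity. Enumerating every sequence like this explodes for long sentences, which is what the Viterbi algorithm on a later slide avoids:

    from itertools import product

    # Brute-force sketch: score every candidate tag sequence, keep the best.
    LEXICON = {
        "you": ["pron"], "have": ["verb"], "to": ["infmrk"],
        "book": ["noun", "verb"], "a": ["art"],
        "chair": ["noun", "verb"], "on": ["prep"], "deck": ["noun"],
    }
    # P(tag | previous tag) -- invented numbers for the example.
    P_TRANS = {("infmrk", "verb"): 0.8, ("infmrk", "noun"): 0.2,
               ("art", "noun"): 0.9, ("art", "verb"): 0.1}

    def score(tags):
        p = 1.0
        for prev, cur in zip(tags, tags[1:]):
            p *= P_TRANS.get((prev, cur), 0.5)  # default for unlisted pairs
        return p

    words = "you have to book a chair on deck".split()
    best = max(product(*(LEXICON[w] for w in words)), key=score)
    print(best)
    # ('pron', 'verb', 'infmrk', 'verb', 'art', 'noun', 'prep', 'noun')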
Training of an HMM tagger
• The best way is Maximum Likelihood Estimation, but it requires a hand-tagged corpus
• A fancy name for a simple principle: expect the new data to be like the training data. Count the things there:
  – P(c) = freq(c) / Ntok
  – P(w,c) = freq(w,c) / Ntok
  – P(w|c) = P(w,c) / P(c)
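A minimal sketch of these counts in code; the six-token corpus is a stand-in for a real hand-tagged corpus:

    from collections import Counter

    # Sketch of Maximum Likelihood Estimation from a hand-tagged corpus;
    # the six (word, class) pairs are a stand-in for real training data.
    tagged = [("you", "pron"), ("have", "verb"), ("to", "infmrk"),
              ("book", "verb"), ("a", "art"), ("chair", "noun")]

    n_tok = len(tagged)
    freq_c = Counter(c for _, c in tagged)   # freq(c)
    freq_wc = Counter(tagged)                # freq(w,c)

    p_c = {c: f / n_tok for c, f in freq_c.items()}      # P(c) = freq(c)/Ntok
    p_wc = {wc: f / n_tok for wc, f in freq_wc.items()}  # P(w,c) = freq(w,c)/Ntok
    p_w_given_c = {(w, c): p_wc[(w, c)] / p_c[c]         # P(w|c) = P(w,c)/P(c)
                   for (w, c) in p_wc}

    print(p_w_given_c[("book", "verb")])  # 0.5: "book" is one of two verb tokens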
Evaluation (skip)
• The result is compared with a so-called "Gold Standard" (manually coded)
  – Typically, accuracy reaches 96-97%
  – This may be compared with the result for a baseline tagger, for example a tagger not using context at all
  – Similarity between two gold standards may be verified with the kappa measure (see the formula below)
• Important to note that 100% is impossible even for human annotators
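For reference, the kappa measure compares the observed agreement $P_o$ between two annotations with the agreement $P_e$ expected by chance:

    $\kappa = \frac{P_o - P_e}{1 - P_e}$

so $\kappa = 1$ means perfect agreement and $\kappa = 0$ means no more agreement than chance would give.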
Problems (quick)
• Words and sequences are missing in the training data. This is cured using smoothing:
  – Additive: add one occurrence to each event frequency (see the sketch below)
  – Good-Turing estimation: try to calculate the number of unseen events to get a better estimate of their probabilities
  – Back-off and linear interpolation
  – Morphology may help (-arity, -s)
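The additive strategy is the easiest to show in code. A minimal sketch, with an invented vocabulary containing one unseen word:

    from collections import Counter

    # Additive (add-one / Laplace) smoothing: pretend every event in the
    # vocabulary occurred once more than it actually did.
    tokens = ["det", "är", "som", "det"]
    vocab = {"det", "är", "som", "en"}  # "en" never occurs in the data

    counts = Counter(tokens)
    n, v = len(tokens), len(vocab)
    p_smoothed = {w: (counts[w] + 1) / (n + v) for w in vocab}

    print(p_smoothed["en"])  # 0.125 instead of zero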
The Viterbi algorithm (quick)
• Calculating the probabilities for all possible sequences of tags would take too long
• The Viterbi algorithm helps us find the most probable path in time linear in the length of the text and quadratic in the number of states, using dynamic programming
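A compact sketch of the algorithm for a bigram HMM (a trigram model as on the earlier slide just adds one more tag of history); the probability tables are assumed to come from MLE training with smoothing:

    def viterbi(words, tags, p_start, p_trans, p_emit):
        """Most probable tag sequence under a bigram HMM.

        p_start[t]      = P(t at sentence start)
        p_trans[(s, t)] = P(t | previous tag s)
        p_emit[(w, t)]  = P(word w | tag t)
        """
        # best[t] = probability of the best tag path ending in tag t
        best = {t: p_start.get(t, 0) * p_emit.get((words[0], t), 0)
                for t in tags}
        back = []  # backpointers for recovering the best path
        for w in words[1:]:
            ptr, new = {}, {}
            for t in tags:
                # Keep only the best predecessor for each tag: this is what
                # makes the run time linear in the text length and
                # quadratic in the number of tags.
                s = max(tags, key=lambda s: best[s] * p_trans.get((s, t), 0))
                ptr[t] = s
                new[t] = best[s] * p_trans.get((s, t), 0) * p_emit.get((w, t), 0)
            back.append(ptr)
            best = new
        # Trace back from the best final tag.
        path = [max(tags, key=best.get)]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))

Called with tables estimated as on the training slide, it returns the same sequence as the brute-force search above, without enumerating all candidate sequences.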
Example of corpus tools at the linguistics department in Göteborg
• The Corpus Browser
  – A tool for searching (for words and expressions) and browsing in our transcriptions
• TraSA
  – A tool that counts things like number of words, utterances, overlaps, vocabulary richness, etc.
• Multitool
  – A tool for browsing and coding a transcription, with audio and video available at the same time
  – Demonstration?
Thank you!
• Thank you for listening!
• Well, do we have any time left for questions?