Tagging with Combined Language Models and Large Tagsets

Dan Tufiş
Romanian Academy
[email protected]

Proceedings of the TELRI International Seminar on "Text Corpora and Multilingual Lexicography", Bratislava, Slovakia, November 1999. Institute of Linguistics, Slovak Academy of Sciences

Abstract

The paper discusses experiments, results, applications and further developments in tagging a highly inflectional language, based on multiple register-diversified language models. The texts are accurately disambiguated in terms of a large tagset (611 tags) in two linear-time processing steps (tiered tagging). The underlying tagger simultaneously uses multiple register language models, and the final annotation is chosen by a combined-classifier decision-making procedure.

1. Introduction

In characterizing different NLP systems one usually finds the symbolic/numeric dichotomy referring to the way the linguistic phenomena are modeled. However, this distinction tends to fade, since more often than not the symbolic and numeric (probabilistic) approaches are put to work together, presumably doing the job better and faster. Yet, there is one persistent distinction that appears to differentiate between what we call introspective and evidence-based approaches to natural language processing: the way the underlying language model is built up.

The introspective orientation is human-centered: it relies on the linguist's abilities and perceptiveness as far as the coverage of the language model is concerned. An introspective language model (ILM) is inherently symbolic and in most cases rule-based. Many ILMs concentrate on hard but rather rare/infrequent linguistic constructs, aiming at completeness and paying little attention to computational efficiency. Whether completeness in computational language modeling is a feasible goal is a debatable issue, and quite a few researchers consider language too poorly understood a phenomenon to be modeled adequately. Adequately usually means being able to account, in a psychologically meaningful way, for all the idiosyncrasies of language learning, usage and understanding.

At the other end of the spectrum lies evidence-based language modeling. This computer-centered approach relies on the reasonable assumptions that language is not a chaotic system and that, provided a machine is fed with enough relevant linguistic data, it is possible to automatically learn (most or all of) the regularities underlying natural language communication. The fuzzy terms enough and relevant are in most cases defined empirically with respect to a specific processing task, for which capturing most of the regularities in language use is considered sufficient for building a useful language model. Now, what a useful language model means is again a matter of dispute. For the sake of this paper we assume a definition generally accepted within the corpus linguistics community: covering as many linguistic phenomena as possible in arbitrary texts and allowing for processing that is as fast as possible. The speed, accuracy and required computational resources certainly depend on the problems to be solved, as well as on the granularity at which these problems are examined and encoded in the language model.

The linguistic problem we address here is the morpho-syntactic disambiguation (MS-tagging) of arbitrary natural language texts. This is a well-defined task, intensively studied, with many practical applications. The problem can be simply stated as follows: given a natural language text, tokenized into meaningful language units, each of them annotated with all possible morpho-syntactic interpretations, choose for each token the single interpretation that is correct in the given context (assuming that only one interpretation per token in context is allowable). Lexical ambiguity resolution is a key task in natural language processing [Baayen & Sproat (1996)]. It can be regarded as a classification problem: an ambiguous lexical item is one that in different contexts can be classified differently, and given a specified context the disambiguator/classifier decides on the appropriate class. The features relevant to the classification task are encoded into the tags. It is part of the corpus linguistics lore that in order to get a high accuracy level in statistical POS disambiguation, one needs small tagsets and reasonably large training data. The effect of tagset size on tagger performance has been discussed in [Elworthy (1995)]; what a reasonable training corpus means typically varies from 100,000 words up to more than one million. Although some taggers are advertised as being able to learn a language model from raw (unannotated) texts, they require a post-validation of the output and a bootstrapping procedure that would eventually bring the tagger (possibly in several iterations) to an acceptable error rate.


The larger the tagset, the larger the necessary training corpora [Berger et al. (1996)]. Provided that enough training data is available, this should not be too problematic as long as the response time is not seriously affected.

It is generally believed that the state of the art in MS-tagging still leaves room for significant improvements as far as correctness is concerned. The granularity feature we mentioned before determines the difficulty and the complexity of the MS-tagging task. If the underlying language model distinguishes only a few categories of linguistic units and each of them has a small number of attributes, then the cardinality of the necessary tagset will be small. On the contrary, if the language model distinguishes among a larger number of classes of linguistic units and they are described in terms of a larger set of attributes, the cardinality of the necessary tagset will inherently be higher than in the previous case. Having to choose from a larger set of possibilities, it seems quite intuitive that MS-tagging becomes harder as the granularity of the language model becomes finer. Harder here means slower, usually less accurate, and requiring more computational resources.

In evidence-based (data-driven) approaches to any computational problem there are three key factors that determine the effectiveness of a specific solution: the learning algorithm, the quality of the training data, and the problem solver proper (the component that applies the knowledge the learner extracted from the training data to new, unseen data).

The learning algorithm constructs a language model usable by the MS-tagger, the accuracy of which is in direct relation with the quality of the training data. The quality of the training data is judged in terms of both accuracy (correctness of the token classifications) and coverage (that is, providing enough evidence for item classification so that reliable generalizations can be drawn).

Most of the work in part-of-speech or morpho-syntactic tagging has relied on the assumption that some high quality training data is available, and has concentrated on ways to improve the performance of the learners and of the taggers. Among the most successful approaches were those that enhanced the statistical learners with different smoothing techniques [Baayen & Sproat (1996), Brown et al. (1993), Chiang et al. (1995), Dunning (1993), Merialdo (1994), Dermatas & Kokkinakis (1995), etc.], combined statistical with rule-based methods [Tapanainen & Voutilainen (1993, 1994), Brill (1995), Ezeiza et al. (1998), etc.], used tiered tagging [Tufiş (1998), Tufiş & Mason (1998)] or used combined classifiers [Dietterich (1997), van Halteren, Zavrel & Daelemans (1998), Brill & Wu (1998)].

However, building a high quality training corpus is a huge enterprise because it is assumed to be done by hand and is therefore extremely slow and expensive. A usual claim for justifying poor performance or incomplete evaluation of MS-taggers is the lack of sufficient training data. We will show that the process of building high quality training corpora can be significantly sped up, with a corresponding decrease in costs.

2. Combined classifiers method

Out of the methods used to improve the accuracy of taggers, the one relevant for the proposal we make in this paper is the combined classifiers method [van Halteren et al. (1998), Brill & Wu (1998)]. The combined classifier approach to MS-tagging naturally emerged from work done on tagger evaluation [Chanod & Tapanainen (1994), Teufel et al. (1996), Samuelsson & Voutilainen (1997), Rajman et al. (1998), Padró & Márquez (1998), etc.]. The combined classifier approach to MS-tagging is intuitively described below. Having k different MS-tagging systems (learner + tagger) and a training corpus, build k language models, one for each system. Then, given a new text T, run each trained tagging system (LMi + taggeri) on it and get k disambiguated versions of T, namely T1, T2, ..., Ti, ..., Tk. Put otherwise, each token in T is assigned k interpretations (not necessarily distinct). Given that each tagging system has its own view on the processed text (encoded in its associated language model), it is very unlikely that the k versions of T will be identical. However, compared to the truth (a human-judged annotation), the probability for an arbitrary token in T to have been assigned the correct interpretation in at least one of the k versions of T is very high (close to 99%). Let us call the hypothetical guesser of this correct tag an oracle (as [Brill & Wu (1998)] call it). Implementing an oracle, i.e. automatically deciding which of the k interpretations is the correct one, is a very difficult problem. However, the oracle concept, as defined above, is very useful since its accuracy gives an estimation of the upper bound of correctness that can be achieved by a given combination of taggers.

The experiment described in [van Halteren et al. (1998)] is based on the tagged LOB corpus [Johansson (1986)] and uses four different taggers: a trigram HMM tagger [Steetskamp (1995)], a memory-based tagger [Daelemans et al. (1996)], Brill's transformation rule-based tagger [Brill (1994)] and a maximum-entropy tagger [Ratnaparkhi (1996)]. Several decision-making procedures are proposed: four simple (individual) voting strategies with different weighting policies (majority, total precision, tag precision, precision-recall), plus a combined pair-wise voting strategy. With the pair-wise voting strategy, the combined classifier system outscored all the individual tagging systems (97.92%). However, given that the oracle's precision (99.22%) was not reached in the experiment described in [van Halteren et al. (1998)], there is clearly still room for further investigation of the decision-making procedure.

An almost identical point of view and very similar results are reported in [Brill & Wu (1998)]. Their experiment is based on the Penn Treebank Wall Street Journal corpus [Marcus et al. (1993)] and uses an HMM trigram tagger, Brill's transformation rule-based tagger [Brill (1994)] and a maximum-entropy tagger [Ratnaparkhi (1996)]. Here, the evaluated accuracy of the oracle is 98.59%, and using the "pick-up tagger" combination method, an overall accuracy of 97.2% was obtained1.

What is important about these experiments is that they clearly show that different taggers, based on language models constructed from the same training data, make complementary errors, so that it really makes sense to look at combination methods. An intuitive evaluation of the error complementarity of two taggers A and B can be obtained with the following simple measure [Brill & Wu (1998)]:

COMP(A,B) = (1 - Ncommon / NA) * 100

where Ncommon represents the number of cases where both taggers are wrong and NA stands for the number of cases where tagger A is wrong.

The COMP measure gives the percentage of cases in which tagger B is right when A made a wrong classification. If the two taggers made the same mistakes, or if the errors made by tagger B were a superset of those made by A, then COMP(A,B) would be 0; but, as shown by the two experiments described above, this is not the case, neither when one considers two taggers performing equally well, nor when one considers a sophisticated tagger (such as a maximum-entropy tagger) and a very simple one (such as a unigram tagger).

Note that COMP is not commutative, i.e. COMP(A,B) ≠ COMP(B,A). In the experiment described in Brill & Wu (1998), for instance, COMP(Unigram-tagger, Maximum-Entropy-tagger) = 69.4% while COMP(Maximum-Entropy-tagger, Unigram-tagger) = 34.9%.
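As a small illustration (not part of the original paper), the COMP measure can be computed from the error sets of two taggers evaluated against the same gold standard; the function name and data layout below are assumptions made for this sketch only:

def comp(errors_a, errors_b):
    """COMP(A, B) = percentage of tokens mistagged by A that B gets right.

    errors_a, errors_b: sets of token positions where tagger A / tagger B
    disagree with the gold-standard annotation.
    """
    if not errors_a:
        return 0.0
    n_common = len(errors_a & errors_b)            # both taggers wrong here
    return (1 - n_common / len(errors_a)) * 100

# toy example: A is wrong on 4 tokens, B on 3, and they share one error
errors_a = {3, 17, 42, 58}
errors_b = {17, 90, 101}
print(comp(errors_a, errors_b))   # 75.0   -> B corrects 75% of A's errors
print(comp(errors_b, errors_a))   # ~66.7  -> COMP is not commutative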

While the idea of combined classifiers is a very promising one, the way it is used in the experiments described above is limited. The difference in performance among the different classifiers is explained by a mixture of the technological device (the tagger proper) and, to a lesser extent, by the linguistic nature of the training data. Training the different taggers on the same training corpus and tagging the same (unseen) text is meant mainly to highlight the performance of the different approaches to the tagging problem. The linguistic relevance of the training text is not easily measurable.

In spite of the general claim that proper training of learning taggers is impeded by the lack of readily available large training corpora, it is surprising how little effort has been made towards automating the tedious and very expensive hand-annotation procedures underlying the construction or extension of a training corpus. The utility of a training corpus lies not only in its correctness, but also in its size. Having some small texts from different language registers annotated/validated by hand can be enough to construct a highly accurate large training corpus with much less effort than is usually assumed. This is the very issue we will address in the rest of this paper. Although all the experiments were done on Romanian, the proposed procedure is language independent.

What we propose here is apparently an approach similar to the one above in combining classifiers, but it is in fact rather different and, in our view, much more useful.

So, in our experiment, we used a single tagger T (it can be any tagger), but trained it on different kinds of text registers, constructing different language models (LM1, LM2, ...). Obviously, the different language models have the same representation, though different content. A new text (unseen, and of unknown register) is independently tagged with the same tagger but using the different language models. It is quite intuitive that such an approach, involving N language models built with the same tool but from different training texts and applied to the same text, can easily be made computationally more efficient (faster) than the baseline obtained by multiplying the response time of the tagger by N. It follows that running one tagger with N language models is faster than running N different taggers, each with its own language model. However, the basic difference between our approach and those described in Brill & Wu (1998) and van Halteren et al. (1998) lies not in better performance but in the very nature of the optimisation criterion. While Brill & Wu (1998) and van Halteren et al. (1998) rely on the algorithmic differences among the different taggers, our approach relies on the different linguistic properties of texts from various registers. While in the "multiple taggers" approach it is very hard to see what the influence of the type of text is, in our "multiple registers" approach text register identification is a byproduct of the methodology. As our experiments have shown, when a new text was in the register of a specific register language model, that language model ensured, without exception, the highest accuracy in tagging. Therefore, it is reasonable to assume that when tagging a new text by a "multiple register" approach, if the final result is closest to the individual version generated with language model LM, then the new text probably belongs to, or is close to, the register of LM. With a general hint about the type of the text currently being processed, stronger identification criteria can then be used to confirm the hypothesis.

1 The differences in the accuracy figures for the two experiments could be explained by the different number of taggers used (4 versus 3, with two taggers in common), but also by potentially different levels of noise in the two training corpora (see [Padró & Márquez (1998)] for a discussion of this topic).

3. Combining views from different linguistic registers

The experiment we engaged in can be briefly described as follows: given three reasonably large hand-annotated corpora (presumably error-free) C1, C2 and C3, from three different language registers R1, R2 and R3, we wanted to generate by bootstrapping, based on C1, C2 and C3, other annotated corpora C4, C5, ..., Cn from other language registers R4, R5, ..., Rn. This was motivated partly by the scientific interest of such an enterprise, but mainly by human resource limitations. The construction was supposed to be as cheap, fast and reliable as possible.

What we did was to use a specific tagger and train it on each corpus, thus getting three language models LM1, LM2 and LM3. Then we tagged corpus C4 using each language model in turn. We obtained this way a tagged corpus C4' where each item had been assigned three tags, one from each LM. In most cases the three tags were identical, so, in line with some previous experiments (described in the next section), we credited such cases as correct. For the remaining tokens, it was up to a human expert to make the final judgement. As we will show in the following, this very simple combination resulted in a highly accurate disambiguated corpus, with the human expert looking at less than 10% of the total number of tokens.

With XEmacs (the text editor we used) the human decision-making was facilitated by automatic positioning of the cursor on the next token in dispute. In parallel with the human decision-making, each language model was given credits or penalties with respect to each tag disambiguated by the human expert: if the human decision (say tagi) coincided with the proposal made by language model LMj, then LMj was given a bonus for tagi. Otherwise, considering that LMj proposed tagk instead of tagi, the language model in question was given a penalty encoded as a pair tagk/tagi (meaning that LMj wrongly proposed tagk instead of the correct tag tagi). This way, at the end of the manual checking of the tagged corpus C4, each language model was associated with a credibility profile CPi containing statistical information on how well it did on each tag and, in the cases where it was wrong, how often it confused the right tag with other tags in the same ambiguity class. With the credibility profiles and C4 as a test corpus, we developed different decision procedures for further constructing C5, C6, ..., Cn without human intervention.

In the next sections we will briefly describe the tagger we used and the language resources (dictionary and corpora). Afterwards we will present the methodology and the experiment in more detail and provide the evaluation results.

3.1 The Probabilistic Tagger

In our experiments we used a tiered tagging system [Tufiş (1998)], based on Oliver Mason's QTAG probabilistic tagger, as described in [Tufiş & Mason (1998)]2.

2 A new stand-alone Java version is freely available at http://www-clg.bham.ac.uk/oliver/java/qtag.

We should mention that neither the tiered tagging approach nor the LM combination procedure described in the next section depends on a specific tagger. However, given that QTAG is freely available, at least as accurate as other free trigram taggers, very simple to use and to plug into other applications, and extremely fast (both in training and in disambiguation), we found QTAG a very good option.

In general terms, tiered tagging is a two-step procedure that first disambiguates an input text using a hidden, reduced tagset (which we call the C-tagset) and then maps the tags from the C-tagset onto finer-grained tags from a larger tagset (which we call the MSD-tagset). For Romanian we used 89 tags for the C-tagset and 621 tags for the MSD-tagset (including the punctuation tags).
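To picture the recovery step, here is a deliberately simplified sketch (an illustration under assumed data structures, not the quasi-deterministic procedure of [Tufiş (1998)]): the fine-grained tag is recovered by intersecting the MSDs listed for the word form in the dictionary with the MSDs collapsed under the assigned C-tag. The C-tag name and MSD codes in the toy data are hypothetical.

def recover_msd(word, c_tag, lexicon, ctag_to_msds):
    """Map a coarse C-tag back to fine-grained MSD(s) for one token.

    lexicon: dict word form -> set of MSDs attested for that word form
    ctag_to_msds: dict C-tag -> set of MSDs it collapses
    Returns a singleton set in the fully deterministic case.
    """
    candidates = lexicon.get(word, set()) & ctag_to_msds[c_tag]
    return candidates or ctag_to_msds[c_tag]   # fall back to all MSDs of the C-tag

# illustrative toy data (hypothetical codes)
lexicon = {"casa": {"Ncfsry", "Ncfsoy"}}
ctag_to_msds = {"NSRY": {"Ncfsry", "Ncmsry"}}
print(recover_msd("casa", "NSRY", lexicon, ctag_to_msds))   # {'Ncfsry'}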

The basic algorithm of the hidden-level tagging (performed by QTAG) is fairly straightforward: first, the tagger looks up in the dictionary all possible tags that the current token can have (for unknown words a guesser is invoked, see below), together with their respective lexical probabilities (i.e. the probability distribution of the possible tags for the word form). This is then combined with the contextual probability of each tag occurring in a sequence preceded by the two previous tags. The tag with the highest combined score is selected. Two further processing steps also take into account the scores of the tag as the second and first element of the triplet, as the following two tokens are evaluated. QTAG works by combining two sources of information: a dictionary of words with their possible tags and the corresponding frequencies, and a matrix of tag sequences, also with associated frequencies. Both resources can easily be generated from a pre-tagged corpus.

The tagging works on a window of three tokens, which is padded with two dummy words at the beginning and at the end of the text. Tokens are read and added to the window, which is shifted one position to the right each time. The token that 'falls' out of the window is assigned a final tag. The tagging procedure is as follows:

1. read the next token

2. look it up in the dictionary

3. if not found, guess possible tags

4. for each possible tag

a. calculate Pw = P(tag|token), the probability of the token having the specified tag

b. calculate Pc = P(tag|t1,t2), the probability of the tag following the tags t1 and t2

c. calculate Pw,c = Pw * Pc, the combination of the lexical probability with the contextual probability

5. repeat the computation for the other two tags in the window, but using different values for the contextual probability: the probabilities of the tag being surrounded and followed by the two other tags respectively.

For each recalculation (three for each token) the resulting probabilities are combined to give the overall probability of the tag being assigned to the token. As these values become very small very quickly, they are represented as base-10 logarithms. For output, the tags are sorted according to their probability, and the difference in probabilities between the tags gives some measure of the confidence with which the tag ought to be correct (see below for an example of this).
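As a rough illustration of the windowed scoring, here is a much-simplified, greedy left-to-right sketch (not QTAG's actual code: it scores each token only once rather than re-scoring it three times as the window moves, and the data structures, back-off constant and fallback tag are assumptions):

import math

def tag_text(tokens, lexical, trigram):
    """Greedy trigram-window tagging sketch.

    lexical: dict token -> {tag: P(tag|token)}      (lexical probabilities)
    trigram: dict (t1, t2, tag) -> P(tag|t1, t2)    (contextual probabilities)
    Dummy tags pad the left context at the start of the text.
    """
    prev2, prev1 = "<s>", "<s>"
    result = []
    for token in tokens:
        candidates = lexical.get(token, {"X": 1.0})        # guesser fallback
        best_tag, best_score = None, -math.inf
        for tag, p_w in candidates.items():
            p_c = trigram.get((prev2, prev1, tag), 1e-6)   # unsmoothed floor
            score = math.log10(p_w) + math.log10(p_c)      # log-space product Pw * Pc
            if score > best_score:
                best_tag, best_score = tag, score
        result.append((token, best_tag))
        prev2, prev1 = prev1, best_tag
    return result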

Unknown words are dealt with by a guessing module which searches the word for its inflectional ending(s)3. The unknown word is assumed to belong to a main open class (noun, adjective, verb)4. Each ending (including the 0-ending) is associated with an ambiguity class consisting of appropriate tags for open-class words (the 0-ending also includes tags for abbreviations, residuals and interjections). By a retrograde (right-to-left) analysis of the unknown word, the guesser identifies all possible endings. The ambiguity classes corresponding to all the possible endings are merged, with higher probability assigned to the interpretations provided by longer endings. Depending on the way the guesser is invoked, the unknown word is assigned either this fully merged ambiguity class (the default) or the ambiguity class corresponding to merging the ambiguity classes of the two longest matched endings. In [Tufiş & Mason (1998)] we showed that the recall of the guesser, in an experiment run on almost half a million different words, was almost 100% (99.96%). The precision was reasonable (66.3%), so the ambiguity class returned by the guesser for an unknown word contains on average at least one noisy tag.

3 This guesser is language (Romanian) specific, but QTAG comes with a language independent guesser which simply computes guessing scores based on the last three letters of each word existing in the lexicon.

4 Some other words which are not expected to be found in the dictionary (proper names, abbreviations, numbers, dates, etc.) are taken care of by the tokenizer [Tufiş et al. (1997)], which assigns them a proper unambiguous tag. Therefore, they never reach the guesser.
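A rough illustration of the retrograde ending analysis (a sketch under assumed data structures and weights, not the guesser distributed with the tagger):

def guess_ambiguity_class(word, ending_classes, max_len=6):
    """Merge the ambiguity classes of all endings matching the unknown word.

    ending_classes: dict ending -> {tag: weight}; "" is the 0-ending.
    Interpretations suggested by longer endings receive higher weight.
    """
    merged = {}
    for n in range(min(max_len, len(word)), -1, -1):        # longest ending first
        ending = word[-n:] if n else ""
        for tag, w in ending_classes.get(ending, {}).items():
            merged[tag] = max(merged.get(tag, 0.0), w * (n + 1))
    total = sum(merged.values()) or 1.0
    return {tag: w / total for tag, w in merged.items()}    # normalised distribution

# hypothetical Romanian-like endings and tags, for illustration only
ending_classes = {"ele": {"Ncfpry": 0.7}, "e": {"Ncfp-n": 0.5, "Vmip3s": 0.3}, "": {"Np": 0.1}}
print(guess_ambiguity_class("fetele", ending_classes))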

3.2 Language Resources

The table in Figure 1 provides information on the data content of the main Romanian dictionary that was used for the corpus analysis. The MSDs (Morpho-Syntactic Descriptions) represent a set of codes developed in the MULTEXT-EAST project5. A full account of the dictionary encoding strategies and data content, from a multilingual perspective (the 7 languages of the project), can be found in [Tufiş et al. (1997)].

Entries  Word-forms  Lemmas  MSDs  AMB-MSD  AMB-POS
418737   347252      33552   611   869      89

Figure 1: Romanian dictionary overview

AMB-MSD represents the number of ambiguity classes [Weischedel et al. (1993); Abney (1997)] or genotypes (Tzoukermann & Radev (1997)). For a given word-form, several MSDs might be applicable (thus accounting for homographs). The set of all the MSDs applicable to a given word defines the MSD-ambiguity class for that word. When only the part of speech is used for the clustering, an MSD-ambiguity class is turned into a POS-ambiguity class. The Romanian lexicon contains 869 MSD-ambiguity classes and 89 POS-ambiguity classes (for more details on the Romanian dictionary and corpora encoding, and several relevant statistics, see [Tufiş et al. (1997)]).

We constructed three training corpora for different registers (fiction, philosophy and journalism), based on Orwell's "1984", Plato's "The Republic" and several issues of "România Liberă" and "Adevărul" (the newspapers with the largest distribution in Romania). These three corpora cover all the MSDs and more than 92% of the MSD-ambiguity classes defined in the lexicon. A brief overview of these texts is given below:

Corpus     Occurrences  Items  Lemmas  MSDs  AMB-MSD
1984       101449       14040  7008    396   524
Republic   114718       10350  4697    369   490
News       98194        9673   5944    403   416

Figure 2: Romanian training corpora overview

For testing purposes, we hand-tagged about 60,000 more words from different texts in the three registers. "1994" is a follow-up story to Orwell's famous novel, written by a Romanian author (Păun, 1993); "Aristotle" is a monograph on Aristotle's work (Barnes, 1993); and "MoreNews" is a collection of articles from newspapers other than those included in the training corpora. An overview of the test texts is shown in Figure 3.

Corpus     Occurrences
1994       20078
Aristotle  20116
MoreNews   20033

Figure 3: Romanian test corpora overview

5 For the final reports see http://nl.ijs.si/ME

3.3 Training, Biased Evaluation and Multiple LM Tagging

The first phase of the experiment was to evaluate each LM on pieces of text from the same texts as those used for training (but unseen before). The rationale for this kind of evaluation, which we call biased evaluation (and which is nevertheless the current practice in tagging experiments), was (among others) the estimation of the maximum performance of each LM, considering that the highest accuracy should be obtained on data similar to that used in the training phase.

So, we built the first LM based on 90% of "1984", the second based on 90% of "The Republic" and the third based on 90% of "News". The dictionary contained all the words in the three corpora. The three LMs were tested on the unseen 10% of the corresponding corpora, with the results shown in Figure 4.

data / LM            # words  # errors  average ambiguity  accuracy
10% 1984 / 90% 1984  11791    189       1.55               98.39%
10% Rep  / 90% Rep   13696    256       1.63               98.13%
10% News / 90% News  9938     167       1.54               98.32%

Figure 4: Biased evaluation of the tagging results

The second phase of the experiment was to reconstruct the three LMs from the full texts (100%) and to use the three additional hand-tagged texts (see Figure 3) for test-tagging.

Each test text was tagged with each available LM and the error-complementarity hypothesis was confirmed: when a text is tagged with two different register LMs, the errors made in the two resulting versions are complementary, that is, the errors in one version are not a subset of those in the other. The numbers shown in Figure 5 represent the values of the function COMP(TLM1, TLM2) computed for the three test texts (Fiction, Philosophy and Newspapers), considering all meaningful LM combinations. For instance, when tagging the FICTION text with the "Rep" LM and the "News" LM, the complementarity value is COMPFICTION(TRep, TNews) = 56.85%.

FICTION (1994)
LM     1984   Rep    News
1984   -      43.15  47.63
Rep    52.21  -      56.85
News   54.44  55.37  -

PHILOSOPHY (Aristotle)
LM     1984   Rep    News
1984   -      42.29  59.54
Rep    44.59  -      62.25
News   54.16  55.46  -

NEWSPAPERS (MoreNews)
LM     1984   Rep    News
1984   -      47.38  63.24
Rep    43.65  -      66.23
News   42.11  50.41  -

Figure 5: Error complementarity between different pairs of LMs as applied to different texts

The full paper will comment on some interesting properties of the COMP values shown in Figure 5.

4. Language Models Combinations

LM Credibility Profile

As a side-effect of the biased evaluation of the models we constructed a data structure called the language model credibility profile, which contains information on the overall accuracy (OA) and, for each tag in the training corpus, its correct assignment probability (CAP) as well as the tag confusion set. The confusion set, as used here, is associated with a tag tagk and represents an association list where each pair tagi:pi denotes the probability pi that tagk will be confused with the tag tagi. These probabilities are estimated by the simple counts given below:

Correct assignment probability estimate for tagk = (# of correct assignments of tagk) / (# of tagk)

Probability estimate for confusing tagk with tagi = (# of times tagi was wrongly assigned instead of tagk) / (# of tagk)

In general these estimates may suffer from data sparseness, so in order to avoid unreliability some smoothing techniques should be considered. However, in this experiment we did not use them, but considered an ideal setting where we tagged the whole register corpus with its own LM. This way we ensured for each tag an average occurrence count of more than 1500 and, for most errors of the type "tagi instead of tagk", an average count of 198.72. There are 32 pairs of tags that are confused quite frequently (unfortunately with no regularity discovered so far). Each of these errors appeared more than 90 times in each corpus. We call these pairs major confusion pairs. In a confusion list, all the non-major confusion pairs are considered equally probable.
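The bookkeeping behind a credibility profile can be sketched as follows (an illustrative sketch with assumed data structures; the paper does not spell out an implementation):

from collections import Counter, defaultdict

def build_profile(gold_tags, predicted_tags):
    """Build a credibility profile for one LM from parallel tag sequences.

    Returns the overall accuracy (OA), the per-tag correct assignment
    probability (CAP) and, for each gold tag, the distribution of tags
    it was confused with.
    """
    total = Counter()                  # occurrences of each gold tag
    correct = Counter()                # correct assignments per gold tag
    confused = defaultdict(Counter)    # gold tag -> wrongly proposed tag -> count
    for g, p in zip(gold_tags, predicted_tags):
        total[g] += 1
        if g == p:
            correct[g] += 1
        else:
            confused[g][p] += 1
    oa = sum(correct.values()) / sum(total.values())
    cap = {t: correct[t] / total[t] for t in total}
    confusion = {t: {w: c / total[t] for w, c in confused[t].items()} for t in confused}
    return oa, cap, confusion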

Using the idealized setting described before, we slightly over-estimated the correct assignment probabilities, while the confusion probabilities were under-estimated.

The LM credibility profile can be formally defined as follows:

PROFILE(LMi) = {OA <X1 (X1:P1 Xm:Pm ... Xk:Pk)> <X2 (X2:P2 Xq:Pq ... Xi:Pi)> ... <Xn (Xn:Pn Xs:Ps ... Xj:Pj)>}

Each pair <Xa (Xa:Pa Xb:Pb ... Xz:Pz)> in the profile of LMi refers to a specific tag (Xa) and provides its correct assignment probability (Pa) as well as the probabilities (Pb, ..., Pz) of it being confused with other tags (Xb, ..., Xz). If a tag Xi does not appear in the list associated with Xa, we assume that the probability of mistagging a token with Xi when it should be tagged with Xa is 0.

Let us define CONFUSION(LMi, Xa, (X1, X2, ...Xk)) = SUM(Pj) where Pj is the probability of confusing Xa with Xj. If Xj does not appear in the Xa‘s confusion list then Pj will be 0. This function computes the probability that a token which should be tagged with Xa will be mistagged by LMi with one of X1, X2, ... or Xk. Furthermore, we define the function CONFIDENCE(LMi, Xa, (X1, X2, ...Xk)) as the difference between the probability that LMi will correctly assign a given token the tag Xa and the probability that LMi will confuse the right tag with one of X1, X2, ... or Xk.
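Assuming the (OA, CAP, confusion) profile layout of the sketch above, the two functions translate directly into code (again illustrative only):

def confusion_score(profile, x_a, competitors):
    """Probability that a token that should get tag x_a is mistagged by this LM
    with one of the competing tags."""
    _oa, _cap, conf = profile
    return sum(conf.get(x_a, {}).get(x_j, 0.0) for x_j in competitors)

def confidence_score(profile, x_a, competitors):
    """CAP(x_a) minus the probability of confusing x_a with one of the competitors."""
    _oa, cap, _conf = profile
    return cap.get(x_a, 0.0) - confusion_score(profile, x_a, competitors)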

Let there be k LMs used to tag the same text. Each token will receive a specific tag from each LM; therefore the decision-making procedure has to choose one of the k tags (not necessarily distinct) as the most probably correct one. We will call this decision-making procedure a tag-judge.

Certainly, the simplest tag-judge is the majority vote: the tag that has been proposed by most of the LMs will be the selected one. We will call it MAJORITY.

Another possible tag-judge, which will be called COMPETENCE, calculates a weighted sum for each proposed tag, and the tag with the highest score is selected. The weight of the tag Xi proposed by LMi is the overall accuracy (OA) of LMi.

The third tag-judge, called CONFIDENCE, chooses the tag proposed by the LM with the highest CONFIDENCE score versus the tags proposed by the other LMs.

The last tag-judge we considered, called COMPETENCE&CONFIDENCE, is a mixture of the second and the third tag-judges. The selected tag is the one proposed by the LM for which the product OA * CONFIDENCE(LM, Xa, (X1, X2, ...Xk)) is the highest.
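For a single token, the four tag-judges can be sketched as below (again over the assumed profile layout; the helper mirrors the CONFIDENCE function defined earlier, and all names are illustrative):

from collections import Counter

def _confidence(profile, tag, competitors):
    """CAP(tag) minus the probability of confusing it with any competing tag."""
    _oa, cap, conf = profile
    return cap.get(tag, 0.0) - sum(conf.get(tag, {}).get(c, 0.0) for c in competitors)

def judge(proposals, profiles, strategy="COMP&CONF"):
    """Pick one tag for a token from the proposals of several LMs.

    proposals: dict LM name -> proposed tag
    profiles:  dict LM name -> (OA, CAP, confusion) credibility profile
    """
    if strategy == "MAJORITY":
        return Counter(proposals.values()).most_common(1)[0][0]
    scores = {}
    for lm, tag in proposals.items():
        oa, _cap, _conf = profiles[lm]
        rivals = [t for m, t in proposals.items() if m != lm and t != tag]
        if strategy == "COMPETENCE":
            scores[tag] = scores.get(tag, 0.0) + oa                     # OA-weighted vote
        elif strategy == "CONFIDENCE":
            scores[tag] = max(scores.get(tag, 0.0), _confidence(profiles[lm], tag, rivals))
        else:                                                           # COMPETENCE & CONFIDENCE
            scores[tag] = max(scores.get(tag, 0.0), oa * _confidence(profiles[lm], tag, rivals))
    return max(scores, key=scores.get)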

For the tag-judges defined above, the experimental results are shown in the table in Figure 6.

The figures in bold face clearly show that combined LM tagging provides much better results than tagging based on a single LM, even one built from the same training texts (the merged corpus).

The full paper will comment on these figures and will provide evidence for the language independence of the methodology discussed here. Computational performance will also be analyzed, and a comparison with other approaches to MS-tagging of highly inflectional languages will be provided.

                                           fiction          philosophy       newspapers
                                           (20076 words)    (20119 words)    (20038 words)
Individual accuracy (%), LM = 1984         98.10            97.83            97.13
Individual accuracy (%), LM = Republic     97.74            98.09            97.32
Individual accuracy (%), LM = News         97.82            97.74            98.17
Accuracy (%) with an LM built from the
merged training corpora (1984+Republic+News)  98.31         98.36            97.89
Correct agreement (%)                      96.47            96.56            95.99
Incorrect agreement (%)                    0.59             0.53             0.67
Disagreement, right tag included (%)       2.91             2.88             3.27
Disagreement, right tag missing (%)        0.03             0.03             0.07
ORACLE accuracy (theoretic) (%)            99.38            99.44            99.26
MAJORITY accuracy (%)                      98.36            98.32            98.21
COMPETENCE accuracy (%)                    98.48            98.37            98.25
CONFIDENCE accuracy (%)                    98.89??          98.65??          98.32??
COMP&CONF accuracy (%)                     98.89??          98.67??          98.36??

Figure 6: Evaluation results

5. Conclusions

We have shown that tiered tagging (tagging a text with a reduced, hidden tagset layer, followed by recovery of the information in the initial tagset) allows statistical methods to be applied successfully to highly inflectional languages (in our case Romanian), which require large tagsets. We have also shown that, given the error complementarity, combining register-diversified LMs is rational, and that by doing so one can significantly increase the accuracy of the tagging process, with an additional bonus: hypothesising the linguistic register of the input text.

3.2 Language Resources

Two corpora used in the experiments and evaluations reported here were developed within the MULTEXT-EAST and TELRI projects respectively. They represent only the Romanian subcorpora of two multilingual parallel corpora.

The first one is based on Orwell's "1984" and officially contains 7 language versions (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene and English as a hub) which are word-level CES-encoded [Ide & Veronis (1993)], fully disambiguated and fully aligned with respect to the English version. An overall presentation of this multilingual parallel corpus is given in [Dimitrova et al. (1998)]. The MULTEXT-EAST corpus has recently been extended with more language versions (Russian, Serbo-Croat and Lithuanian).

The second corpus, from which we extracted the Romanian sub-component, is the TELRI multilingual corpus, which contains 22 translations (all Central and Eastern European languages, plus English, French, German and Chinese) of Plato's "The Republic". All versions are TEI-Lite encoded and pairwise aligned. A general presentation of this parallel corpus can be found in [Erjavec et al. (1998)].

Besides the Romanian versions of Orwell's "1984" and Plato's "The Republic", a newspaper corpus was prepared for the work reported here, based on several issues of "România Liberă" (the newspaper with the largest distribution in Romania). Each corpus was CES-encoded, hand-tagged and carefully validated. A brief overview of these texts is given below:

Corpus        Occurrences  Words  Lemmas  MSDs  AMB-MSD  AMB-POS
1984          101449       14040  7008    396   524      85
The Republic  114718       10350  4697    369   490      79
News          39183        9673   5944    403   416      80
Covered       211466       29876  14405   602   738      89

Figure 2: Romanian corpora overview

The Occurrences column contains the number of word-form occurrences in each corpus, without eliminating duplicates. The Words column shows the number of distinct word-forms. The MSDs column shows the number of distinct MSDs used in each of the three corpora. AMB-MSD and AMB-POS give, respectively, the numbers of MSD-ambiguity classes and POS-ambiguity classes found in the corpora. Covered represents the union of the information provided by the three corpora. As can be seen from the table in Figure 2, the three corpora cover a little more than 8% of the lexical stock, almost all the MSDs (only 9 missing), a great deal of the MSD-ambiguity classes (about 85%) and all the POS-ambiguity classes. This is to say that most of the words raising ambiguity problems appear in at least one of the three corpora. One continuous preoccupation of our group is to add new texts minimally, so as to ensure complete coverage in terms of ambiguity classes. We will continue to add texts from the journalistic register to our hand-annotated 3-register corpora (fiction, philosophy and journalism), either until full coverage of the MSD-ambiguity classes is reached or until the News corpus reaches a size comparable to the other two. If, in the latter case, full coverage of the MSD-ambiguity classes has not been reached, a new register will be added to the corpus collection. Surprisingly, the third corpus (News), although much smaller than the others, supplemented very well the distributional properties of the two books in the text collection (see [Tufiş et al. (1997)] for the same kind of information computed only on the basis of "1984" and "The Republic").

The existing SGML markup (TEI conformant) was stripped off and the texts were tokenized. Please note that a token is not necessarily a word: one orthographic word may be split into several tokens (the Romanian "dă-mi-l" (give it to me) is split into 3 tokens), or several orthographic words may be combined into one token (the Romanian words "de la" (from) are combined into the single token "de_la"). Each lexical unit in the tokenized texts was automatically annotated with all its applicable MSDs and then hand-disambiguated, thus obtaining the MSD-tagged corpora.
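As an illustration of the two token-level operations (splitting clitic clusters and merging multiword units), here is a toy sketch with hand-listed, hypothetical rules; it is not the actual tokenizer of [Tufiş et al. (1997)], and the exact segmentation of "dă-mi-l" shown below is assumed:

# toy splitting and merging tables (hypothetical, for illustration only)
SPLIT = {"dă-mi-l": ["dă", "-mi", "-l"]}     # one orthographic word -> 3 tokens (assumed segmentation)
MERGE = {("de", "la"): "de_la"}              # two orthographic words -> 1 token

def tokenize(words):
    """Split clitic clusters and merge multiword units into single tokens."""
    expanded = [t for w in words for t in SPLIT.get(w, [w])]
    out, i = [], 0
    while i < len(expanded):
        pair = tuple(expanded[i:i + 2])
        if pair in MERGE:
            out.append(MERGE[pair]); i += 2
        else:
            out.append(expanded[i]); i += 1
    return out

print(tokenize(["vin", "de", "la", "şcoală"]))   # ['vin', 'de_la', 'şcoală']
print(tokenize(["dă-mi-l"]))                     # ['dă', '-mi', '-l']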

3.3 The C-tagset, Training Corpora, Training Process and the Biased Evaluation of the Language Models

The tagset used for the hidden tagging phase of the tiered tagging (see above) is called the C-tagset and contains 79 tags for the different morpho-syntactic categories, plus 10 tags for punctuation. This tagset has been derived by a trial-and-error procedure from the 611 morpho-syntactic description codes (MSDs) defined for encoding the Romanian lexicon, so that the information left out can easily be recovered by a quasi-deterministic procedure [Tufiş & Mason (1998), Tufiş (1998)]. The training corpora (CTAG-corpora) were obtained from the MSD-tagged corpora by substituting the MSDs with their corresponding corpus tags.

The first phase of the experiment was to evaluate each language model on texts from the same register. The rationale for this kind of evaluation, which we call biased evaluation (and which is nevertheless the current practice in tagging experiments) was the assessment of the maximum performance of each LM, considering that the highest accuracy could be obtained on data similar to that used in the training phase. These performance figures will be used later in the definition of the combined classifier.

So, we built the first language model based on 90% of "1984", the second based on 90% of "The Republic" and the third based on 90% of "News". The three language models were tested on the unseen 10% of the corresponding corpora, with the results shown in Figure 3.

data / LM                  no. of words  no. of errors  average ambiguity  accuracy
10% 1984 / 90% 1984 LM     11791         189            1.55               98.39%
10% Republic / 90% Rep LM  13696         393            1.63               97.13%
10% News / 90% News LM     5683          118            1.54               97.92%

Figure 3: Biased evaluation of the tagging results

The accuracy was computed in the usual manner, as the number of correctly assigned tags divided by the total number of tags. The average ambiguity, an approximate measure of the complexity of the disambiguation task, was computed again in the traditional simple way, as the average number of tags per word, i.e. the sum of all possible tags for all words divided by the number of words. A more informative measure would disregard punctuation, as it is almost always assigned a unique tag. An even better measure would consider only the ambiguous words, i.e. drop any item (word or punctuation) that is uniquely labeled by the look-up (or guessing) procedure. For instance, the three ambiguity scores for the texts used in the experiment reported here are shown in the table below:

Average ambiguity  SM    NPM   AM
1984               1.55  1.60  2.49
Republic           1.63  1.72  2.37
News               1.54  1.58  2.48

Figure 4: Different measures of text ambiguity

SM (Simple Measure) = number of tags / number of tokens
NPM (Non-Punctuation Measure) = number of non-punctuation tags / number of non-punctuation tokens
AM (Ambiguity Measure) = number of tags assigned to ambiguous tokens / number of ambiguous tokens
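Computed over a list of tokens with their candidate tag sets (after dictionary look-up or guessing), the three measures can be sketched as follows; the punctuation test and the toy tag codes are assumptions:

def ambiguity_measures(items):
    """items: list of (token, candidate_tags) pairs. Returns (SM, NPM, AM)."""
    is_punct = lambda tok: all(not ch.isalnum() for ch in tok)      # assumed punctuation test
    total_tags = sum(len(tags) for _, tags in items)
    sm = total_tags / len(items)
    non_punct = [(t, tags) for t, tags in items if not is_punct(t)]
    npm = sum(len(tags) for _, tags in non_punct) / len(non_punct)
    ambiguous = [(t, tags) for t, tags in items if len(tags) > 1]
    am = sum(len(tags) for _, tags in ambiguous) / len(ambiguous)
    return sm, npm, am

# toy example: "vin" is genuinely ambiguous (noun "wine" / verb "I come")
items = [("vin", {"Ncms-n", "Vmip1s"}), (",", {"COMMA"}), ("merge", {"Vmip3s"})]
print(ambiguity_measures(items))    # approximately (1.33, 1.5, 2.0)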

5. References

Abney, S. (1997): Part-of-Speech Tagging and Partial Parsing. In Young, S., Bloothooft, G. (eds.) Corpus Based Methods in Language and Speech Processing, Text, Speech and Language Technology Series, Kluwer Academic Publishers, pp. 118-136

Baayen, Harald, Sproat, Richard (1996). "Estimating Lexical Priors for Low-Frequency Morphologically Ambiguous Forms." Computational Linguistics, 22(2), 155-166.

Berger, A. L., Della Pietra, S. A., Della Pietra, V. J. (1996): A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-72, March 1996

Brill, Eric, and Wu, Jun (1998). “Classifier Combination for Improved Lexical Disambiguation” In Proceedings of COLING-ACL’98 Montreal, Canada, 191-195

Brill, Eric (1995). “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging.” Computational Linguistics, 21(4), 543-565

Brill, Eric (1994). “Some Advances in Transformation-Based Part-of-Speech Tagging.” In Proceedings of AAAI’94, ???, ???-???

Brown, Peter F., Della Pietra, Stephen A., Della Pietra, Vincent J., and Mercer, Robert L. (1993). "The Mathematics of Statistical Machine Translation: Parameter Estimation." Computational Linguistics, 19(2), 263-312.

Carletta, Jean (1996). "Assessing Agreement on Classification Tasks: The Kappa Statistic." Computational Linguistics, 22(2), 249-255

Chanod, Jean-Pierre, Tapanainen, Pasi (1994). “Statistical and Constrained-based Taggers for French.” Technical Report MLTT-016, RXRC Grenoble, 1994, 38pp

Chelba, Ciprian, Jelinek Frederic (1998). “Exploiting Syntactic Structure for Language Modeling” In Proceedings of COLING-ACL’98, Montreal, Canada, 225-231

Chiang, Tung-Hui, Lin, Yi-Chung and Su, Keh-Yih (1995). "Robust Learning, Smoothing and Parameter Tying on Syntactic Ambiguity Resolution." Computational Linguistics, 21(3), 321-350.

Church, Kenneth (1989). "A stochastic parts program and noun phrase parser for unrestricted text." In Proceedings of the IEEE 1989 International Conference on Acoustics, Speech and Signal Processing, Glasgow, 695-698

Cutting, D., Kupiec, J., Pedersen, J. and Sibun, P. (1992). "A Practical Part-of-Speech Tagger." In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 133-140

Deligne, Sabine, Sagisaka, Yoshinori (1998). “Learning a Syntagmatic and Paradigmatic Structure from Language Data with Bi-Multigram Model.” In Proceedings of COLING-ACL’98, Montreal, Canada, 300-306

Dermatas, Evanghelos, Kokkinakis, George (1995). “Automatic Stochastic Tagging of Natural Language Texts.” Computational Linguistics, 21(2), 321-350.

Dietterich, Thomas (1997). "Machine Learning Research: Four Current Directions." In AI Magazine, Winter 1997, 97-136

Dimitrova, Ludmila, Erjavec, Tomaž, Ide, Nancy, Kaalep, J. Heiki, Petkevič, Vladimir, and Tufiş, Dan (1998). “Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages” In Proceedings of COLING-ACL’98, Montreal, Canada, 315-319

Dunning, Ted (1993). “Accurate Methods for the Statistics of Surprise and Coincidence.” Computational Linguistics, 19(1), 61-74.

Elworthy, D. (1995): Tagset Design and Inflected Languages, In Proceedings of the ACL SIGDAT Workshop, Dublin, (also available as cmp-lg archive 9504002)

Erjavec, Tomaž, Lawson, Anne, Romary Laurent (1998) “East Meets West: Multilingual Resources in a European Context” In Proceedings of First International Conference on Language Resources and Evaluation, Granada, Spain, 981-986

Ezeiza, N., Alegria, I., Arriola, J. M., Urizar, R., Aduriz, I. (1998). "Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages." In Proceedings of COLING-ACL'98, Montreal, Canada, 380-384

Hajič, Jan, Hladká, Barbora (1998). “Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset.” In Proceedings of COLING-ACL’98, Montreal, Canada, 483-490.

van Halteren, Hans, Zavrel, Jakub, and Daelemans, Walter(1998). “Improving Data Driven Wordclass Tagging by System Combination” In Proceedings of COLING-ACL’98, Montreal, Canada, 491-497

Ide, Nancy, Veronis, Jean (1993). "Background and Context for the Development of a Corpus Encoding Standard." EAGLES Working Paper, 30 p. Available at <http://www.cs.vassar.edu/CES/CES3.ps.gz>

Johansson, S. (1986). "The Tagged LOB Corpus: User's Manual." Norwegian Computing Centre for the Humanities, Bergen, Norway, 149 pp.

McMahon, John G., Smith, Francis J. (1996). "Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies." Computational Linguistics, 22(2), 217-248.

Marcus, M., Santorini, B., Marcinkiewicz, M. (1993). “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19(2), 313-330

Merialdo, Bernard (1994). “Tagging English Text with a Probabilistic Tagger.” In Computational Linguistics, 20(2), 155-172.

Padró, Luís, Márquez, Luís (1998). “On the Evaluation and Comparison of Taggers: the Effect of Noise in Testing Corpora.” In Proceedings of COLING-ACL’98. Montreal, Canada, 997-1002

Adda, Gilles, Mariani, Joseph, Lecompte, Josette, Paroubek, Patrick, Rajman, Martin (1998) “The GRACE French Part-of-Speech Tagging Evaluation Task”.In Proceedings of First International Conference on Language Resources and Evaluation, Granada, Spain, 433-441

Ratnaparkhi, Adwait (1996). "A Maximum Entropy Part of Speech Tagger." In Proceedings of EMNLP'96, Philadelphia, Pennsylvania.

Samuelsson, C., Voutilainen, A. (1997). "Comparing a Linguistic and a Stochastic Tagger." In Proceedings of the Joint EACL/ACL Conference, Madrid, Spain, ??-??

Steetskamp, R. (1995). “An implementation of a Probabilistic Tagger.” TOSCA Research Group, University of Nijmegen, The Netherlands, 48 pp.

Tapanainen, Pasi, Voutilainen, Atro (1994). "Tagging Accurately – Don't guess if you know." In Proceedings of the 4th Conference on Applied Natural Language Processing, Stuttgart, Germany, 47-52

Tapanainen, Pasi, Voutilainen, Atro (1993). "Ambiguity Resolution in a Reductionistic Parser." In Proceedings of EACL'93, Utrecht, Netherlands, ??-??

Teufel, Simone, Schiller, Anne, Heid, Ulrich (1996) “Task on Tagset and Tagger Interaction. EAGLES Validation (WP-4) Final Report”, 53pp.

Tufiş, Dan, Barbu, Ana-Maria, Pătraşcu, Vasile, Rotariu, Georgiana, Popescu, Camelia (1997). "Corpora and Corpus-Based Morpho-Lexical Processing" in Dan Tufiş, Poul Andersen (eds.) Recent Advances in Romanian Language Technology, Editura Academiei, 35-56 (also available at http://www.racai.ro/books)

Tufiş, Dan, Mason, Oliver (1998). "Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger." In Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 589-596

Tufiş, Dan, Ide, Nancy, Erjavec, Tomaž (1998). "Standardized Specifications, Development and Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European Languages." In Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 233-240

Tufiş, Dan (1998). “Tiered Tagging” In Journal of Information Science and Technology, 1(2), 103-128

Tufiş, Dan, Chiţu Adrian (1998). “Automatic Diacritics Insertion in Arbitrary Romanian Texts” In Journal of Information Science and Technology, forthcoming

Tür, Gökhan, Oflazer, Kemal (1998). "Tagging English by Path Voting Constraints." In Proceedings of COLING-ACL'98. Montreal, Canada, 1277-1281

Tzoukermann, E., Radev, D. (1997): Tagging French Without Lexical Probabilities - Combining Linguistic Knowledge and Statistical Learning. cmp-lg/9710002, 10 October 1997

Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., Palmucci, J. (1993): Coping with Ambiguity and Unknown Words through Probabilistic Models in Computational Linguistics, vol. 19, no. 2 (pp. 219-242), June 1993

