
When Errors Become the Rule: Twenty Years with Transformation-Based Learning

    MARCUS UNESON, Lund University

Transformation-based learning (TBL) is a machine learning method for, in particular, sequential classification, invented by Eric Brill [Brill 1993b, 1995a]. It is widely used within computational linguistics and natural language processing, but surprisingly little in other areas.

TBL is a simple yet flexible paradigm, which achieves competitive or even state-of-the-art performance in several areas and does not overtrain easily. It is especially successful at catching local, fixed-distance dependencies and seamlessly exploits information from heterogeneous discrete feature types. The learned representation, an ordered list of transformation rules, is compact and efficient, with clear semantics. Individual rules are interpretable and often meaningful to humans.

The present article offers a survey of the most important theoretical work on TBL, addressing a perceived gap in the literature. Because the method should be useful also outside the world of computational linguistics and natural language processing, a chief aim is to provide an informal but relatively comprehensive introduction, readable also by people coming from other specialities.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods; I.5.4 [Pattern Recognition]: Applications; J.5 [Arts and Humanities]: Linguistics

    General Terms: Algorithms, Performance

Additional Key Words and Phrases: Transformation-based learning, error-driven rule learning, sequential classification, computational linguistics, natural language processing, supervised learning, Brill tagging

ACM Reference Format:
Marcus Uneson. 2014. When errors become the rule: Twenty years with transformation-based learning. ACM Comput. Surv. 46, 4, Article 50 (March 2014), 51 pages.
DOI: http://dx.doi.org/10.1145/2534189

1. INTRODUCTION

1.1. Four Aspects of a Restaurant Conversation

Consider the following hypothetical fragment of a dialogue (Example 1), perhaps between a head waiter and a nervous, newly employed colleague in an overworked restaurant kitchen.

(1) —Replace the fork on table four.
—OK. Should I apologize for the wait?
—For now it's enough to light the candle on the table.

It is a perfectly ordinary piece of language, English, in this particular case; indeed, it may be ordinary enough to be uninteresting to most people.

Author's address: Marcus Uneson, Centre for Languages and Literature, Lund University, Box 201, 221 00 Lund, Sweden; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2014 ACM 0360-0300/2014/03-ART50 $15.00
DOI: http://dx.doi.org/10.1145/2534189


But let us assume that we have some valid reason to study this sample: maybe we are engineers and want to build some technical application that might receive it as input, or maybe we are linguists and would like to investigate the language phenomena it exemplifies. In any case, we probably have many more samples like it.

We would like to describe them all in some abstracted way, which highlights their similarities and differences with regard to the aspect we currently happen to be interested in. For instance, in Example 2, our domain of interest is the sequence of words, and to each element of this sequence we wish to assign a part-of-speech, or POS: classes such as verb, noun, preposition, and the like. We will use the notation w1/POS1 w2/POS2 . . . to indicate such a classification (and generalize as needed).1

(2) Replace/VB the/DT fork/NN on/IN table/NN four/CD ./.
OK/JJ ./. Should/MD I/PN apologize/VBP for/IN the/DT wait/NN ?/.
For/IN now/RB it/PP 's/VBZ enough/JJ to/TO light/VB the/DET candle/NN on/IN the/DT table/NN ./.
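As a small aside for readers who prefer code: the slashed notation above can be read mechanically into word/tag pairs. The helper below is purely illustrative (our own, not from the article):

    from typing import List, Tuple

    def parse_tagged(s: str) -> List[Tuple[str, str]]:
        # split on whitespace, then split each token on its LAST slash,
        # so that the token "./."  becomes the pair (".", ".")
        return [tuple(tok.rsplit("/", 1)) for tok in s.split()]

    parse_tagged("Replace/VB the/DT fork/NN on/IN table/NN four/CD ./.")
    # [('Replace', 'VB'), ('the', 'DT'), ('fork', 'NN'), ('on', 'IN'),
    #  ('table', 'NN'), ('four', 'CD'), ('.', '.')]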

In another scenario (Example 3), we are more interested in the turns of the dialogue itself than in the exact wordings of the utterances. In dialogue act tagging, we try to label entire utterances by an abstracted representation of the speaker's intentions: GREET, INFORM, REQUEST, SUGGEST, REJECT, APOLOGIZE . . .

(3) Replace the fork on table four. / REQUEST
OK. / ACCEPT
Should I apologize for the wait? / YES-NO-QUESTION
For now it's enough / REJECT
to light the candle on the table. / REQUEST

Going from larger elements to very small ones, in Example 4, we instead want to study how letters correspond to speech sounds (or, to use a posh term, their graphophonemic relationships). Here, the data consist of an alignment of each letter to its corresponding pronunciation (bottom row, in IPA2), with placeholders or groupings when one-to-one alignment is inappropriate. Similar subtasks often appear in speech processing systems, but may also be useful for things like spelling correction or normalization of names in search queries.3

1 The parts-of-speech in this example are taken from the Penn Treebank tagset. So are their sometimes inscrutable abbreviations: VB for verb, NN for noun, IN for preposition, etc. For the purposes of this article, it is enough to think of such names as arbitrary, atomic labels; see http://www.cis.upenn.edu/treebank/ for external definitions. Admittedly, these labels bear little similarity to what one might have learned about word classes in fifth grade. To remove any possible misunderstandings, parts-of-speech are not given by nature (this becomes very clear when unrelated languages are compared). Instead, for practical purposes the set of allowable class labels, the tagset, needs to be specified stipulatively. The Penn Treebank tagset (along with a few others) is commonly used for English, but needs extensive modification to be useful even for closely related languages.
2 IPA is the International Phonetic Alphabet, http://www.langsci.ucl.ac.uk/ipa/ipachart.html.
3 This problem formulation, which employs single-letter correspondences, is chosen for illustration rather than efficiency. It works well for many languages, but due to its very irregular orthography, English is not one of them. Some unnaturalness in the mapping is unavoidable; here, we (rather arbitrarily) introduced grouped phonemes as well as empty ones (denoted by _). However, the point of this section is to provide a few introductory examples of sequential classification rather than solve graphophonemic representation problems, so we will gloss over such details here.


(4) [alignment figure: letters in the top row, their IPA pronunciations in the bottom row]

In yet another setting (Example 5), we are interested in finding exactly what part of a sentence is modified by some prepositional phrase (PP). For instance, in the example, we would like to decide whether on the table says something about the activity of lighting or about the candle that is being lighted. This is the problem of PP attachment: in this case, attachment to the verb (light) or to its associated object noun phrase (the candle). We index these two possible attachment sites and indicate the corresponding association by co-indexing the PP itself as appropriate. In the example, the latter choice turns out to be the correct one; whereas in, for instance, for now it's enough to dance the rumba on the table, it would have been the former.4

(5) [Replace] [the fork] [on table four]PP.
OK. Should I apologize for the wait?
For now it's enough to [light] [the candle] [on the table]PP.

1.2. Classification of Elements and Sequences

All of the tasks described are common, often needed (as preprocessing steps) in real-world applications. They all involve classification: given a set of observations, each one describable by some predefined characteristics (its features), the job is to assign each observation to one out of a likewise predefined set of discrete classes. A more demanding variant is probabilistic classification, where we need to return a probability distribution over the entire set of classes (or, less ambitiously, a ranked list of the k most probable ones).

What kind of knowledge sources do we have at our disposal, to inform such a classification? Well, if the observations are taken from a predefined set, then we might have some a priori knowledge, irrespective of the dataset at hand. We might know what the most common class for each element is, or we might even have a probability distribution over all possibilities. If there is no such predefined domain (or if there is one, but it is not closed and thus not guaranteed to contain all new data), then we will sooner or later encounter elements that we have never seen before. However, we might still make an educated guess from a dynamic analysis of the features, which thus constitute a second knowledge source.

Actually, in Examples 2-4, what we are given is not a set of observations, but a set of sequences of observations. The classification of each element depends on its local context: its neighbors (within some not-too-wide window) and their classifications.

4 Most such ambiguities pass unnoticed by humans: we disambiguate on semantic grounds, usually without even noticing that we did. But it is not difficult to come up with examples where also humans will be hesitant. Consider the attachments of the prepositional phrase in the following examples:

I [tripped] [the man] [with my umbrella]PP (?)
I [tripped] [the man] [with the black umbrella]PP (?)
I [tripped] [the man] [with the umbrella]PP/PP (??)


Sequential classification tasks often appear when we deal with symbols ordered in time or space, such as those present in human language. In such tasks, an additional third knowledge source, by definition, is the sequential context: which are the neighbors of the sample we are trying to classify, what are their features, and what is our (current) idea of their classification?5

The example applications illustrate the varying importance of these knowledge sources:

- In part-of-speech tagging, the domain is semi-closed: most words are likely to be known beforehand, and we might well have them specified in a lexicon. Still, previously unseen words are certain to occur now and then in any real-world application, and we are much helped by being able to make intelligent guesses from dynamic feature analysis, for instance, guessing that staycation is a noun and defriend is a verb.6 Generally, ambiguous words cannot be resolved without sequential context.

- In dialogue act tagging, the domain is truly infinite, and only seldom will we listen to utterances that we have heard in their entirety before (when it does happen, it is usually short phrases: single words, or word-like groups of words: yes, what's up, I don't know). Thus, appropriate feature extraction is crucial. Sequential context is clearly important: the answer to a SUGGEST is much more likely to be an instance of ACCEPT or REJECT than GREET, no matter the phrasing.

- In finding letter-to-sound correspondences, we are unlikely to encounter any previously unseen letters. Thus, feature extraction is pointless: whatever we might wish to use features for would better have been included elsewhere, as a priori knowledge. The background knowledge specifies default correspondences, and sequential context can (crucially, for many languages) be used to emend these.

- In PP attachment, the domain is again infinite, but, in contrast to the other examples, sequential context has no influence: the fact that a PP was attached to the verb in the previous sentence tells us nothing about the current one. Thus, intelligent feature extraction is the single source of information.

Another interesting dimension along which these examples vary is the well-definedness of the classifier range. In PP attachment, we generally have two answers to choose from, and if we look at a wide enough context, exactly one of them is correct. In finding letter-to-sound correspondences, we may argue about the best alignment, but there is usually reasonable agreement on the lexical pronunciation (at least if we consider some reference variety of the target language).

5 In pattern recognition, individual samples to be classified are often assumed to be independent and identically distributed (iid). As we mention, this is not a good fit in all cases; for instance, in many tasks where time is a dimension (explicit or implicit), elements will often depend on their context, their closest neighbors in a sequence.
Rephrasing the element-wise iid assumption for sequential classification can be done along at least two different axes. One redefines the task itself, so that classification applies to entire sequences, iid among themselves. The dependencies of individual elements on their context are then hardcoded and implicit in the representation. Another approach retains the view of classification as something done to individual elements, and explicitly encodes in the representation any necessary information on context (say, as pointers to neighboring elements).
The first choice is simple and good enough for most practical linguistic tasks (if we accept the idea that some sequences may hold only a single element, as in Example 5, and that some sequences are just arbitrary, imposed serializations of hierarchical structures, as in Figure 13). In this paper, however, we try not to make silent assumptions on the data, and we thus prefer the second view on neighborhood. That is, we wish TBL to classify elements which come in isolation, with no neighbors; or in sequences, with neighbors to the left and/or to the right. But, even if we do not have any specific real-life examples in mind, we would also like to be able to use TBL on elements which come in n-dimensional grids, with 2*n neighbors; or indeed in arbitrary graphs, where these can be considered as fixed and given by the problem specification.
We will still stick to the term sequential classification, since it is so commonly used and well describes (almost) all of our examples. However, for covering the more general cases, if and when the day comes, something like structured classification or structured prediction might arguably fit the bill better.
6 New entries in the Oxford English Dictionary 2010.


POS tagging is trickier: it is only meaningful with respect to some stipulatively defined tagset, specific to a language and sometimes also to a certain dataset.7 Dialogue act tagging, finally, is less studied and understood; thus, it has the characteristics of POS tagging to an even higher extent, with tagsets depending also on domain or setting. As can be expected, human interannotator agreement for the four tasks decreases in the order given.

1.3. Today's Meal: Classification by Transformation

The sequences from the restaurant conversation presented earlier were all short. On the other hand, we may have many of them (thousands, millions, or billions), and we certainly want a computer to help us with the classification. One way to automate the chore is to implement a classifier as a set of manually specified rules. For some combinations of task and data, this is actually the best solution. Restricting the data type to natural language for the sake of discussion, it is easy enough to write a letter-to-sound converter for Finnish, Spanish, or Turkish by enumerating the few necessary rules in a page or two. Much more substantial human effort was invested into the thousands of rules of the EngCG POS tagger [Karlsson et al. 1995], for a long time one of the best part-of-speech taggers for English.8

For most instances of sequential classification, including those illustrated in Section 1.1, this approach is simply infeasible: there are too many and too weak dependencies, and it is far too laborious to try to specify them by hand. Instead, we may choose among many reasonable machine learning approaches: decision trees (DTs), hidden Markov models, neural networks, maximum entropy models, and memory-based learning, to mention just a few. These are all well-known techniques in the machine learning community and certainly good choices in many situations. Their main drawback for the tasks we are interested in is the opacity of the learned representation. For the mentioned techniques, learning amounts to filling a black, inscrutable box with estimated parameters. The main exception is DTs, which do slightly better: they give us a somewhat interpretable tree with if...else questions at every node. Nevertheless, the questions tend to be overwhelmingly many for real-world tasks.

The focus of the present survey is on yet another machine learning method: transformation-based learning (TBL). It was invented 20 years ago by Eric Brill [1993b, 1995a] and has been refined by him and many others since. In terms of the techniques mentioned, TBL is a hybrid: its representation involves rules, or transformations, but these are learned automatically from the training data. Rules are iteratively created and evaluated based on how well they deal with the current set of errors in the data; hence, the approach is often termed error-driven.

TBL is typically used as a supervised machine learning technique for classification of sequences, where each element is represented as a symbolic feature vector and assigned a single symbolic value in the classification. Interpretability of representation is but one out of several properties that make the method appetizing for applications involving natural language; some others are the natural ease with which it handles local (especially fixed-width) dependencies, its resistance to overtraining, and its general flexibility and adaptability to different tasks. In the next section, we substantiate these claims further.

The TBL method itself, however, does not (and good implementations should not) make any assumptions that are valid only for natural language classification tasks.

7 Of course, such tagsets are not created in a vacuum; they build on each other and, for a given language, differences between them partly reflect the number of subdivisions made. Thus, a larger set can often be converted to a smaller one with relative ease.
8 EngCG was later augmented by automatically derived rules. See also Section 4.2.2.


To be sure, natural language abounds with ambiguities to pit machine learning methods against, TBL or others; and it also abounds with weak, mostly local, and incompletely understood dependencies that these methods may exploit. But life offers many examples of symbols ordered in time or space, depending on their neighbors to a greater or lesser degree; just to mention a few, DNA and protein sequences, musical notes, and suns and clouds in weather forecasts. Furthermore, the problem characteristics described are typical, not mandatory: with the appropriate problem encoding, it is perfectly possible to apply TBL to almost any supervised classification task. In addition, many extensions have been suggested, some of which are specifically aimed at enhancing the method's expressive power. For instance, TBL can be rebuilt into a regressor (with real-valued output) or a probabilistic classifier (with a probability distribution over the entire set of classes as output); see Section 4.

With this article, our aim is to provide a self-contained but still relatively comprehensive introduction to TBL. It is intended to be readable without much acquaintance with either specialized linguistic terminology or the toolbox of a computational linguist. Linguistic terminology cannot be entirely avoided, however: to date, TBL has been applied almost exclusively to natural language data, and most citations and examples must necessarily be drawn from that area. Where deemed necessary, we have tried to explain nonelementary concepts in a phrase or two in the body text; sometimes we also provide more extensive but less crucial comments in footnotes. This is hopefully enough to illustrate the inputs and outputs of a certain problem, but it is almost certainly not enough to convey the rationales behind posing it in the first place. Furthermore, in some cases, where exact understanding of the terminology might not be needed for the understanding of the algorithmic aspects, we found that further detours added more clutter than clarity. When explanations given here are insufficient, we refer to some dedicated textbook in computational linguistics, for instance Jurafsky and Martin [2008].

The article is organized as follows. By way of introduction, in the practically oriented Section 2, we review the original algorithm proposed by Brill [1993b, 1995a], with an eye on its main design choices and inherent strong points. This tutorial-styled section contains many forward references to later, more detailed descriptions, but is otherwise intended to be readable in isolation.

The rest of the article is a survey of the most important TBL literature. Roughly, this set can be divided into a handful of publications at least partly dealing with TBL from a theoretical perspective, a few dozen that develop or extend the method along different dimensions, and a larger number that report on the application of the method to new domains (possibly requiring clever problem encodings or otherwise noteworthy recasting considerations).9

This classification also forms the basis for the organization of the survey. The theoretical descriptions of the original TBL method are summarized in Section 3: its relation to its cousin DTs and its closer relatives decision lists and, in particular, the less well-known decision pylons [Bahl et al. 1989]. Section 4 provides an overview of the most important developments of the original paradigm, aimed at augmenting it in different directions. We review attempts to extend the range of possible inputs and the expressivity and quality of the output, to relax the amount of supervision needed, to improve efficiency, and to ease the problem description. Section 5 contains an overview, necessarily brief, of TBL uses in practical applications: many and varied, but, as mentioned, almost exclusively within the realm of computational linguistics and its closest neighbors.

9 There are, of course, many more papers that simply apply TBL to a known problem or report on comparisons with other systems for some particular task or dataset. These largely fall outside the scope of the article, but many are found in the TBL bibliography (see the online Appendix).


Fig. 1. A barnyard scene, for transformation-based painters. From Samuel [1998b], reprinted with permission.

Section 6 concludes the article and hints at some possible future directions. An Appendix is dedicated to practical TBL resources, including brief presentations of existing TBL implementations.

2. PLAIN VANILLA TRANSFORMATION-BASED LEARNING

2.1. A Painting Analogy

A useful TBL picture-painting analogy is offered by Samuel [1998b], who attributes it to Terry Harvey. It generates several of the right intuitions, so we will retell it here (slightly adapted).

Consider the barnyard picture in Figure 1 (where, for simplicity, colors have been named rather than rendered). A painter comes by. As it happens, he is actually a transformation-based painter, which is mostly like any other painter, except he does not ever want to change from a smaller brush to a larger. He is also more-than-average cavalier about making mistakes, claiming that they can always be fixed later.

Our painter finds the barnyard picture and decides to reproduce it on his own canvas, as follows. First, he looks at the current state of his painting (a blank canvas) and compares it to the target, or truth, represented by the figure. He notes that the most efficient way to reduce the difference between his painting and the truth is to take the largest brush he has and paint the entire canvas blue. When the paint has dried, he again compares the current state of his painting to the truth. This time, he finds that the easiest way to increase the similarity to the target is to take a slightly smaller brush and paint the filled outlines of a red barn. There is no need to worry about non-red details, such as windows, doors, and roof, as these will be taken care of in later stages, by smaller brushes.

And so our painter goes on. With each change of color, he picks a finer brush and uses it with increasingly thin and precise strokes. Coarser brushes, used early on, cover a large part of the picture: they add a lot of paint, but they also make many mistakes. With later, thinner brushes, less paint will be added, but also fewer errors. The final step might be to fill in the fine black lines with a very fine brush.

The main point of the analogy is that the painter uses a sequence of color-brush pairs, in descending order according to how much paint they add to the canvas. Each point of the canvas may be repainted several times; although we can be convinced that the overall result looks increasingly like the target with each application of a brush, we cannot be sure about the color of a specific point until all brushes have been applied.


Fig. 2. Data flow of TBL training, Brill's original algorithm. Main loop in heavier stroke.

2.2. Algorithmic Overview

Transformation-based learning works in much the same way as transformation-based painting. The method produces a sequence of rules, or transformations, ordered after impact. Early rules are very general and may change classifications on large fractions of the data, usually committing errors in so doing. Subsequent rules are more specific and may correct errors introduced by earlier ones. A single transformation rule has the general form

if CONDITION(x) then do ACTION(x)

where x is a data sample; CONDITION, sometimes referred to as context, is a predicate (i.e., a boolean-valued function) on attributes of x and/or its local context; and ACTION changes some attribute of x. The rules are induced automatically from the training data. The actual structure of CONDITION and ACTION is delineated by user-specified patterns known as templates; thus, templates define the transformation space.

The flow of data in the original TBL training algorithm (as presented in the seminal works Brill [1993b, 1995a]) is shown in Figure 2, and pseudocode for a typical implementation including a few typical optimizations is given in Algorithm 1. The output of the learning algorithm is an ordered sequence of transformation rules. New data (once they have been initialized in the same way as the training data, see Sections 2.2.3 and 2.3.1) can now be classified by applying this sequence of learned transformations to it. In the following, we look at the datasets, at the templates, and at the flow of control. Finally, we return to some critical points where design choices lurk.
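To make the rule anatomy concrete, here is a minimal Python sketch (ours and purely illustrative; the class and field names are not from any TBL implementation). A template corresponds to fixing the shape of the rule, here one context tag at a single offset, while leaving the four fields as variables to be instantiated from the data:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Rule:
        from_tag: str   # only samples currently tagged from_tag are touched ...
        to_tag: str     # ... and they are retagged to to_tag
        offset: int     # context position relative to the sample, e.g. -1
        ctx_tag: str    # tag required at that position

        def condition(self, tags, i):
            # CONDITION: the current tag matches, and the neighbor at
            # offset carries ctx_tag (out-of-bounds contexts never match)
            j = i + self.offset
            return (tags[i] == self.from_tag
                    and 0 <= j < len(tags)
                    and tags[j] == self.ctx_tag)

        def apply(self, tags):
            # ACTION: retag every matching position (simultaneous application)
            return [self.to_tag if self.condition(tags, i) else t
                    for i, t in enumerate(tags)]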

2.2.1. Corpora. Like most machine learning methods, TBL takes as its point of departure a dataset of a certain size. For applications dealing with natural language, such a dataset is usually referred to as a corpus; indeed, corpora are the bread and butter of computational linguistics.10

10 A corpus, pl. corpora, is basically a sizable collection of real-world natural language data, text or speech, or something more exotic, such as video recordings of sign language. We use that term because it is the most common for the tasks TBL has been applied to. However, it was not created by any transformation-based deities, and the reader should feel free to replace it with big dataset at any time. For nonlinguistic applications, this may roll more smoothly off the tongue.


ALGORITHM 1: Original TBL. Algorithm 3.3 of Florian [2002b], adapted and reprinted with permission.
Input: set of templates T, training corpus D, score threshold θ
Output: an ordered list L of transformation rules
/* baselinetag(x) computes an initial classification of x (Sections 2.2.3 and 2.3.1) */
/* C[x] returns the current classification of sample x (or sets it) */
/* D[x] returns the true classification of sample x */

(1) foreach x ∈ D : C[x] ← baselinetag(x)
(2) i ← 0; L ← [] /* iteration 0: L is empty list */
(3) Repeat:
    (a) Ri ← ∅; b ← none /* best rule so far, with score 0 */
    (b) foreach x ∈ D such that C[x] ≠ D[x]:
        i. generate all correcting rules R(x) from the templates
        ii. foreach r ∈ R(x) : increment posD(r)
        iii. Ri ← Ri ∪ R(x)
    (c) foreach r ∈ Ri, sorted in descending order of posD(r):
        i. foreach x ∈ D such that C[x] = D[x] and C[r(x)] ≠ D[x]:
            A. update negD(r)
            B. if scoreD(r) < scoreD(b): stop evaluating rule r (continue loop)
        ii. if posD(r) < scoreD(b): stop evaluating rules Ri (break loop)
        iii. if scoreD(b) < scoreD(r) : b ← r
    (d) if scoreD(b) < θ : return L and exit
(4) Append b to L; apply b to the training corpus (updating C); i ← i + 1; goto (3)
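Read as a whole, the loop greedily picks the best-scoring candidate rule, applies it, and repeats until no rule clears the threshold. A compact, runnable Python sketch of this loop (ours, purely illustrative: a toy version with a single "tag at fixed offset" template and none of Algorithm 1's caching or early stopping):

    from collections import Counter

    def apply_rule(rule, tags):
        f, t, off, ctx = rule
        return [t if (tag == f and 0 <= i + off < len(tags)
                      and tags[i + off] == ctx) else tag
                for i, tag in enumerate(tags)]

    def tbl_train(current, truth, offsets=(-1, 1), threshold=1):
        learned = []
        while True:
            # candidate rules are generated at error sites only, so every
            # candidate corrects at least one error somewhere in the corpus
            candidates = Counter()
            for i, (c, d) in enumerate(zip(current, truth)):
                if c != d:
                    for off in offsets:
                        if 0 <= i + off < len(current):
                            candidates[(c, d, off, current[i + off])] += 1
            if not candidates:
                return learned

            def score(rule):
                # the straightforward scoring: good changes minus bad changes
                after = apply_rule(rule, current)
                good = sum(a == t != c for a, c, t in zip(after, current, truth))
                bad = sum(a != t == c for a, c, t in zip(after, current, truth))
                return good - bad

            best = max(candidates, key=score)
            if score(best) < threshold:
                return learned
            learned.append(best)
            current = apply_rule(best, current)  # apply, then iterate

    # toy run: initial tags vs. reference tags
    print(tbl_train(list("aaca"), list("abca")))  # e.g. [('a', 'b', -1, 'a')]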

In the TBL case, this dataset is assumed to come with reference classifications, making it a reference corpus. Thus, TBL is a supervised method: it depends on the existence of some annotations that can be taken as truth. The annotations are usually provided (or at least proof-read) by humans. Manually annotated corpora thus represent large investments, sometimes enormous, and creating them from scratch for a single project is rarely an option.

A corpus can generally be thought of as a set of items that are independent of each other, by nature or by assumption, with respect to the task at hand. For instance, in POS tagging (Example 2), we can be reasonably sure that the POS of the words in the current sentence does not depend on those in the previous one. In dialogue act tagging, each utterance (Example 3) is clearly dependent on the previous one, but the dialogue acts in one conversation are likely independent of those in another. Partitioning the corpus into a set of subsets for which we can assume mutual independence is highly beneficial, for the quality of the learned representation as well as the efficiency with which it can be learned. To tag POS, we will want our corpus split into sentence-sized chunks. For the case of dialogue acts, the corpus will hopefully contain some delimiters distinguishing individual dialogues (this case also illustrates that the need of relevant and reliable annotations may go beyond individual samples). The individual classifications are often called tags (irrespective of their being true or not, and irrespective of the task at hand).11 In the TBL case, we will often speak of the training corpus, which is essentially the reference corpus with the annotations deleted.12

11 Somewhat confusingly, the term tag may sometimes be short for part-of-speech. However, in this article, we will treat the latter as a special case of the former.
12 Here and in Figure 2, we make the distinction between the two to emphasize the conceptual read-only nature of the reference corpus; by contrast, the training corpus can conceivably be thought of as mutable data. Some authors may use the term training corpus for both senses.
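In code, the distinction might be pictured like this (a hypothetical toy structure, ours, with the training corpus derived from the reference corpus by stripping the annotations):

    # one reference corpus item (sentence) = one sequence of (word, true_tag)
    reference = [
        [("Replace", "VB"), ("the", "DT"), ("fork", "NN")],
        [("OK", "JJ")],
    ]

    # training corpus: same partitioning, annotations deleted
    training = [[word for word, _ in sentence] for sentence in reference]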


Table I. Sample Templates for POS Tagging by TBL, in Original Phrasing, in μ-TBL Syntax, and as Prose
Template | Change A to B whenever . . .
[table rows omitted]


Table II. Sample Templates for POS Tagging by TBL
As in Table I but generalized: the precondition of the current tag is moved from ACTION to CONDITION.
[table rows omitted]


Table III. Handling Templates with Out-of-Bound (OOB) Accesses
Five example strategies. From the perspective of a template, a sequence w1 . . . wn may behave (a) as if surrounded by nothing, so that templates reaching outside the sequence are simply not applicable; or (b) as if preceded and followed by an infinite number of identical OOB tokens; or (c) as if it had a special left boundary marker at position 0 and a special right boundary marker at position n + 1; or (d) a combination of (b) and (c); or (e) as if surrounded by an infinite number of position-unique tokens in both directions. See also Section 2.3.3.
[table rows omitted]
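Strategies (b) and (c) are easy to realize as padding before rule matching; a small illustrative sketch (padding symbols and names are ours):

    def pad_single_oob(tags, width):
        # strategy (b): one OOB token repeated; `width` copies approximate
        # the infinite supply, for templates reaching at most `width` out
        return ["<OOB>"] * width + list(tags) + ["<OOB>"] * width

    def pad_boundary_markers(tags):
        # strategy (c): one left marker at position 0, one right marker at n+1
        return ["<S>"] + list(tags) + ["</S>"]

    print(pad_single_oob(["DT", "NN"], 2))
    # ['<OOB>', '<OOB>', 'DT', 'NN', '<OOB>', '<OOB>']
    print(pad_boundary_markers(["DT", "NN"]))
    # ['<S>', 'DT', 'NN', '</S>']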

Generating candidate rules only from the actual error sites in this way guarantees each candidate to be at least somewhat helpful: no time will be wasted with rules that will never correct any errors, or (worse) rules that will never be triggered by the training data at all.15

We return to training efficiency issues (Section 4.3.1); here, we only note that most rules that were candidates for the current corpus c_i also will be so for the next, c_{i+1}. Thus, much of what we record about the rules could be recorded once and then cached and minimally updated between iterations. This observation underlies several of the faster techniques we will see later.

A point of ambiguity in deriving rules, unfortunately not often specified in the description of practical implementations, is how to handle templates that refer to nonexisting positions in the sequence. For instance, consider the following single-sequence, two-sample corpus, while again learning with the tag:A>B template.


In many settings, the natural evaluation measure e is simply accuracy, the percentage of correct classifications. For those cases, f can be a simple function of the number of good (g_r) and bad (b_r) changes that rule r will bring about if applied. The straightforward f(g, b) = g − b given earlier is commonly used for f. We note, however, that, as far as the basic TBL algorithm is concerned, any f that fulfills f(g, b) > 0 iff g > b is good enough. Put into words, the only requirement is that any rule with a positive score will decrease the total error count (and thus the algorithm must terminate), and any rule that decreases the total error count will have a positive score (and thus all positive rules may be learned). With this observation, it is conceivable for f to introduce a bias; for instance, we might simply generalize the previous scoring function to f_α(g, b) = g^α − b. With α = 1, we retrieve the original function. With α < 1, the scoring function will reward high-accuracy rules (for instance, f_0.9(100, 10) > f_0.9(200, 100)). Similarly (but probably less useful), with α > 1, it will reward rules with large impact on the corpus (for instance, f_1.1(200, 100) > f_1.1(120, 10)).

So far, we have ignored the number of neutral applications n_r, where r just changes one error into another. It is also mostly ignored in the literature or found to be of little importance when mentioned. For example, in experiments described by Lager [1999b], the rules learned by the two different scoring functions f = g_r − b_r and f' = g_r − b_r − n_r were not significantly different. Nevertheless, the task at hand may dictate valid reasons for letting f depend also on n_r; perhaps the main interest lies with the rules learned, rather than in the number of reduced errors, and we prefer low-impact rules to high-impact ones with the same or even higher g_r − b_r.16

Some tasks do not use accuracy but rather slightly more involved forms of e. This is true in particular when classification is not the goal in itself but rather an artifact of the particular problem encoding (cf. the discussion and the encoding examples in Section 5.1). For example, in named entity recognition, the goal is to identify substrings of words referring to specific locations, people, places, organizations, and the like. Here, evaluation is based on the measures precision and recall, well-known in pattern recognition and information retrieval. The two measures are often combined into their harmonic mean, the F1 measure, F1 = 2PR/(P + R).17 Similarly, in parsing, where the goal is to assign a tree structure to an input sentence, there are a large number of ways to evaluate system output against the gold standard, as defined by human annotators [Carroll et al. 1998].

Of course, more generally speaking, any scoring function that reflects our ideas of the task at hand by quantifying the gain of applying a candidate rule could be used. Such a function could well take additional inputs, say, sequence length, current correctness, or estimated classification probability (cf. Section 4.2.1). For instance, we might be more interested in maximizing the number of correctly tagged sequences rather than the number of correctly tagged sequence elements. If so, we will weight rule applications in almost-correct sequences, whether they correct or introduce errors, differently from those in sequences with more errors.

More elaborate scoring schemes seem to be little explored (but see Section 4.2.3). One should note that some of the faster training algorithms we meet later (Section 4.3.1) assume simple scoring rules and will not work with more complex variants. More importantly, the scoring function is used not only to rank the rules, but also in the stopping criterion; thus, with exotic scorings the user may need to assume the responsibility of defining a terminating process.

16 It should be noted that most reports on the empirical behavior of TBL describe the special case of POS tagging on English, which has several properties not necessarily shared by other tasks (such as initial accuracy on the order of 90%, a tagset size on the order of 100, and few dependencies on distance larger than 2 or 3). Other questions, perhaps yet unasked, may have 1% or 99% as initial accuracy, tagset sizes of 2 or 2,000, and much wider sequential dependencies.
17 http://en.wikipedia.org/wiki/F1_score.


If a user-defined scoring function cannot be used directly, a simple way to do so is to exploit the fact that the rules are ranked after their estimated usefulness and just predefine some maximum number of rules learned. See also the discussion on scoring in a multidimensional learning setting, Section 4.1.1.
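The biased scoring family fits in a few lines; the following sketch (ours) encodes f_α(g, b) = g^α − b and checks the two numeric comparisons from the text:

    def make_score(alpha=1.0):
        # f_alpha(g, b) = g**alpha - b; alpha = 1 recovers plain g - b
        return lambda good, bad: good ** alpha - bad

    f09, f11 = make_score(0.9), make_score(1.1)
    assert f09(100, 10) > f09(200, 100)   # alpha < 1 favors high-accuracy rules
    assert f11(200, 100) > f11(120, 10)   # alpha > 1 favors high-impact rules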

2.3.5. Rule Selection. The selection of the best rule uses greedy search: whenever a choice needs to be made, the locally optimal one (here, the highest-scoring rule) is picked, without worrying about its impact on future choices. This simplification represents a major pruning of the search space, which is indeed immense. Somewhat simplifying an example from Curran and Wong [2000], if we consider only rules where conditions and actions read the same attribute and conditions all refer to all positions in a fixed window of width C, then with a tag vocabulary of size |Vt| and templates of the type exemplified in Table II, we get |Vt|^(C+1) different transformation candidates for each rule. Learning P of these in the optimal order involves P! * |Vt|^((C+1)*P) possibilities. Pruning is clearly necessary for other than toy-sized examples.
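The numbers grow astronomical already for modest settings; a back-of-envelope check (toy numbers, ours):

    import math

    Vt, C, P = 50, 3, 100                     # tagset size, window width, rules
    per_rule = Vt ** (C + 1)                  # |Vt|^(C+1) candidates per rule
    orderings = math.factorial(P) * Vt ** ((C + 1) * P)  # P! * |Vt|^((C+1)P)

    print(per_rule)             # 6250000
    print(len(str(orderings)))  # the ordering count has over 800 digits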

The highest scoring rule is not guaranteed to be the optimal one, however. It is not difficult to find problem instances in which greedy rule selection will fail to produce the globally best solution. For instance, consider the single-sequence, six-sample corpus of Example 7 (for clarity, with only the attribute tag shown) and learning with the same single template as before.


Rules may be applied as soon as a site matches, with each change immediately visible to subsequent matching (immediate application), or all matching sites may first be identified against the current corpus state and then changed in one sweep (simultaneous application). In the immediate case, the sequence may be traversed from the left or from the right (or, to be sure, in any other of the N! ways, like left-to-right but odd positions before even; but let us restrict discussion to the less far-fetched options).

Brill [1995a] points out the differences but takes no obvious stand. In our view, the intelligibility and declarativity of the rule representation may suffer badly with immediate application. For instance, modifying an example of Brill's, suppose we have the rule tag:a>b
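A sketch (ours) assuming a rule of that shape which retags a as b when the preceding tag is a:

    def simultaneous(tags):
        # all sites are matched against the ORIGINAL tags, then changed at once
        return ['b' if i > 0 and tags[i - 1] == 'a' and t == 'a' else t
                for i, t in enumerate(tags)]

    def immediate_ltr(tags):
        # left-to-right; each change is visible when matching later positions
        out = list(tags)
        for i in range(1, len(out)):
            if out[i] == 'a' and out[i - 1] == 'a':
                out[i] = 'b'
        return out

    print(simultaneous(list('aaaa')))   # ['a', 'b', 'b', 'b']
    print(immediate_ltr(list('aaaa')))  # ['a', 'b', 'a', 'b']

Under simultaneous application the rule means "retag every a whose predecessor was an a"; under immediate left-to-right application the very same rule instead retags every second a of a run, which is much harder to read off from the rule text.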


Table IV. Sample POS Tagging Transformations
The first few transformations learned from English text (60,000 words of financial news from the Wall Street Journal), with example phrases from the same source. Templates as in Brill [1995a]. For clarity, the examples only show the tags matching the corresponding rule. Tags appearing: VBP: verb, present, not 3rd person singular; VB: verb base form (infinitive); VBN: verb, past participle; VBD: verb, past tense; MD: modal verb (e.g., can, will, should, may); TO: the word to; NNS: plural noun; NN: singular or mass noun; DT: determiner; NP: proper noun; IN: preposition or subordinating conjunction; WDT: wh-determiner (e.g., relative which). [The condition part of each rule is not reproduced here.]

ID | Score | Acc  | Rule        | Example
1  | 98    | 0.99 | pos:VBP>VB  | ... VB ... on rates
2  | 51    | 1.00 | pos:VBP>VB  | ... VB the ... interests
3  | 42    | 0.82 | pos:VB>VBP  | ... VBP ... to undermine
4  | 42    | 1.00 | pos:NN>VB   | ... VB the Treasury ... far more
5  | 41    | 0.81 | pos:VB>NN   | ... NN in the field
6  | 41    | 0.67 | pos:IN>WDT  | ... WDT ... don't pollute
7  | 38    | 0.97 | pos:VBN>VBD | ... VBD its ... interest in 1987
8  | 28    | 0.60 | pos:NN>VB   | ... VB ... equipment

Unlike most statistical models, the learned representation does not consist of large tables of numbers (conditional probabilities or other statistical parameters). But the sequence of rules is understandable enough that it might encourage inspection, modification, experimentation, and occasionally give a new insight.

To illustrate with the POS tagging task,18 Table IV gives the first few learned transformation rules from the Wall Street Journal, a widely used English corpus, and provides examples from the same text. All of the rules cited are general enough that they could be suggested as rules of thumb for human POS taggers (at least inexperienced ones, still trying to internalize and abstract the definitions). The accuracy figures simultaneously hint at the reliability of the rules thus learned (however, the actual scores of the rules are unimportant, because they depend on the size of the training corpus and the specific tagset chosen). For instance, rules 4-5 may in such a setting be paraphrased:

Uncertain if you look at a noun or a verb? Well, here is some help. If the previous word is a modal, such as can, should, may, then what you see is almost certainly a verb. If one of the two previous words is a determiner (of which the most common cases are articles, such as a, an, the), then you probably see a noun.

The two things to note here are that these rules make perfect sense and that they were extracted automatically.

For ease of exposition, we have used the same 26 templates as in Brill [1995a]. Note, however, that the alternative templates mentioned in Section 2.3.1 might permit even

18 We have chosen POS tagging (Example 2) as a recurring example not because we believe it is the only thing TBL is good for, but rather because it is a practical, basic, and well-defined task, often needed as a preprocessing step, and frequently treated in the literature.
We also observe that very often the challenges of any task for a particular language are decided by its typological properties: how many, how regular, how complicated are the inflection patterns, how rigid the word order, how well-defined the word boundaries, etc. POS tagging of Russian is very different from POS tagging of Chinese, and both are different again from POS tagging of English. In this way, the typological variety of the thousands of languages of the world brings to light a wide range of interesting subchallenges. POS tagging, being one of the most basic tasks, has inspired (or at least been used to illustrate) several extensions of popular machine learning algorithms, including TBL.


Table V. Sample Stress and Word Accent Rules for Swedish
The first few transformations learned from 33,390 syllables in 12,396 noncompound, noninflected entries of a Swedish pronunciation dictionary [Hedelin et al. 1987]. 35 templates with the following features: relative syllable position (left/right); word length (redundantly) both in numeric and syllabic representation (#syll[able]s, mono/bi/tri+); pref[ix]/suff[ix] of length 1...6. Accuracy threshold = 0.96. Positive indices count syllables from the beginning of the word, negative from the end. For instance, rule 7 says that bisyllabic words ending in -ig, such as konstig 'strange', stenig 'stony', will have accent II on the next-to-last syllable. See also Section 2.4.1.
ID | Score | Acc | Accent@Syll | ...
[table rows omitted]


The learned representation is very compact. As an extreme example, Brill [1994] presents a TBL system for unknown-word-guessing (i.e., assigning the most likely POS to words not in the system lexicon; for instance, reviving the examples from Section 1.2, guessing that staycation is a noun and defriend a verb). He quotes comparable performance for his 148-rule system and an existing statistical unknown-word-guesser with 100,000,000 parameters.

2.4.3. Real-World Objective Function. TBL is an example of error-driven learning: the function we use to evaluate and choose between different solution candidates (the objective function, in optimization problem terminology) is typically (a monotonic function of) the (current) number of errors in the corpus. Crucially, however, this number is interesting not only for the system, but also for its end user: it is the primary criterion for judging the system's success. Thus, our way of evaluating competing rules or rule sets directly optimizes a measure in which we have a practical, real-world interest, and differences in this measure correspond to differences in performance. By contrast, most classifiers use some less straightforward objective function that is only indirectly related to the classifier performance. Although the correlation is, of course, designed to be strong, it is not necessarily perfect.

The real-world relevance of the objective function also allows TBL training to be recast as an optimization problem. This view permits the application of typical optimization techniques to the task. For instance, Wilson and Heywood [2005] use genetic algorithms to minimize the error function between reference and current corpus (see further Section 4.1.3).

2.4.4. Resistance to Overtraining. For most machine learning algorithms, a major problem is overtraining: the learned representation describes random error or noise and thus fails to generalize outside the training data. For instance, it is very easy to detrimentally overfit DTs, and careful measures must be taken to avoid it (e.g., by growing the trees to completion and then back-pruning, or by performing some statistical analysis before deciding that a node should be further split [Mitchell 1997]). TBL, by contrast, comes with an implicit ranking of the learned rules: they are automatically ordered after expected impact. This fact is the main reason for the method's remarkable insensitivity to overtraining.

To be clear, TBL does overtrain; that is, if left to train until conclusion with a very low score threshold, it will learn a large number of spurious, low-impact rules with no prospects of generalization (most of which will apply to a single site in the training corpus). But this trail of irrelevant rules does not significantly influence overall performance (in either direction). In case we prefer not seeing them anyway (perhaps because our main interest is the relevant rules only and not overall classifier performance), an efficient filter is just to raise the score threshold.20 More sophisticated approaches are also conceivable (see more on classifier combination, discussed in Sections 5.2 and 5.3).

Perhaps more disturbingly, irrelevant rules may conceivably emanate from unfortunate choices of templates. Ramshaw and Marcus [1994] briefly investigate this issue. They report on experiments with training in the presence of a template that can safely be assumed to be irrelevant (such as the POS of the word 37 positions to the left of the current). When used in isolation, such a template naturally yields a large number of spurious rules; but when combined with relevant templates, its influence is largely neutralized.

20 Exactly where to put it will depend on task, intention, corpus, tagset size, etc., and may need some experimentation, but it need not be very high. As a comparison, Brill recommends a score threshold of 2 for his POS tagger designed for English (typical tagset sizes 50-150), on the corpus sizes of the mid-90s (10^5 words).


Their conclusion is that the presence of irrelevant templates will have little impact, if only they are mixed with relevant ones. This is useful knowledge, particularly for cases where we are uncertain on what templates best catch the dependencies of the problem: disregarding practical matters such as training time and memory usage, there is little risk in specifying all possible templates we can think of (see also Section 4.1.2).

We note that Ramshaw and Marcus [1994] are brief on their results, and in any case, more investigation into TBL overtraining behavior would be welcome, for differently sized data and tagsets and for other templates and tasks. In the words of Manning and Schütze [2001], it appears to be more of an empirical result than a theoretical one.

2.4.5. Search during Training Rather Than Application. In TBL learning, new rules are computed on the current, updated version of the training corpus; the learner may thus exploit all knowledge gained so far in the process (including that encoded in previously learned rules). Moreover, as Florian [2002b, p. 21; p. 33] observes, this recursive search takes place at training time rather than test time, when greedily selecting the most efficient error-correcting rule sequence. Differently from many machine learning algorithms, no search is involved during application; the rules are simply applied in the order they were learned. This is of course preferable; normally, a system is applied much more often than it is trained.

2.4.6. Integration of Heterogeneous Features. Like many but not all machine-learning methods, TBL can seamlessly integrate heterogeneous features, especially symbolic-ranged, boolean-valued ones. For instance, without any explicit modeling of independence relations, POS tagging could easily draw from information sources such as "is the previous word capitalized?", "does the next word contain a number?", or even "does this sentence have eight words?"

This flexibility becomes particularly powerful when combined with an expressive way of specifying templates, perhaps in the form of a little programming language of its own. See Section 4.6 and Figure 10 for examples.
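Such heterogeneous context predicates are just boolean functions on the sample's surroundings, freely combinable in one condition; a small illustrative sketch (ours):

    def prev_word_capitalized(words, i):
        return i > 0 and words[i - 1][:1].isupper()

    def next_word_contains_digit(words, i):
        return i + 1 < len(words) and any(ch.isdigit() for ch in words[i + 1])

    def sentence_has_eight_words(words, i):
        return len(words) == 8

    # a rule condition may mix such predicates without independence modeling
    def condition(words, i):
        return prev_word_capitalized(words, i) and next_word_contains_digit(words, i)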

2.4.7. Competitive Performance. The rules induced by TBL may be interesting or at least interpretable to humans. Sometimes, however, we don't really care about such fringe benefits, but only about classification performance. As a stand-alone classifier, TBL generally reaches competitive results on a wide range of tasks and state of the art on some. For other tasks, it lags somewhat behind the best statistical classifiers (at the time of writing, often Support Vector Machines). However, the trend in recent years is that the best overall results are reached not by single, stand-alone classifiers, but by combining several of these into ensemble learners. Such systems generally gain from diversity in their constituents, and indeed TBL often contributes diversity. We return briefly to classifier combination later (Section 5.2).

    2.5. TBL vs. Decision TreesAs several authors have noted [Ramshaw and Marcus 1994; Brill 1995a; Manningand Schutze 2001; Florian 2002b], TBL has several commonalities with decision trees[Breiman et al. 1984; Quinlan 1993]. Decision trees (DTs) are a widely used and well-researched machine learning method, often one of the first to fall out of the discreteclassifier compartment when opening the machine learners toolbox. Given their pop-ularity, we close this practically oriented section with a brief comparison betweendecision trees and TBL from a pragmatic point of view. In Section 3, we revisit thecomparison from a more theoretical perspective.A prototypical DT outputs a set of yes/no questions that can be asked about a sample

to get at its classification, similar to the context part of a transformation rule. Just as for TBL, the questions may refer to attributes of the sample being classified, but also to


those of other elements on which it may depend; and, just as for TBL, the learner easily and seamlessly integrates information of different types without any need of explicit modeling of dependencies.

The differences are just as important, however. First, DTs automatically synthesize complex questions on any subset of the available attributes, whereas vanilla TBL requires the format of the rules to be specified by the templates.[21] Second, DTs have no (easy) way of saving away current hypotheses; thus, the questions cannot refer to intermediate predictions. In DTs (as in most other ML classification schemes), classification is therefore performed once and never changed. By contrast, TBL makes several passes through the data, and earlier predictions may be changed later. As Ramshaw and Marcus [1994] put it: "[D]ecision trees are applied to a population of non-interacting problems that are solved independently, while rule sequence learning is applied to a sequence of interrelated problems that are solved in parallel, by applying rules to the entire corpus."[22]

At least for such interrelated problems, then, we should not be surprised to find real-world cases in which a solution with transformations is much more concise and natural than an equivalent DT. Brill exemplifies with tagging a word whose left neighbor is "to", which may be either an infinitival marker (to eat) or a preposition (to Scotland). In the first case, "to" is an excellent cue for verbs; in the second, a good one for nouns. However, it may well be that the tagging of "to" itself into these two classes is unreliable. If so, TBL can automatically delay the exploitation of this cue until it has been more reliably established by other intermediate rules, working on current predictions. By contrast, a DT that exploits the same information is quite complicated and will likely contain duplicate subtrees. This property is at the heart of what Florian [2002b, p. 21] terms the dynamic behavior of TBL: bidirectional dependencies can be exploited at training time without penalty. In effect, the rules can first find and classify islands of relative certainty at any position in the sequence, then leverage the knowledge gained in the slightly deeper waters on either side.

This fundamental difference between DTs and TBL can also be viewed as one between stateless and stateful classification. State is a mixed blessing, the discussions on which fill many a book in an average computer science library.[23] Here, we only note that a DT at work generally has no notion of order or time, whereas learned TBL rule sets are strictly ordered, with current predictions representing state.[24]

State aside, several (although not all) of the points mentioned in the previous section (2.4) apply also to the comparison between DTs and TBL. We here only note the transparency of the objective function in TBL (which in DTs corresponds to some rather indirect measure of node impurity, as estimated by, for instance, information gain or the Gini coefficient [Breiman et al. 1984]) and the much better resistance to problems with sparse data and overtraining. As for the latter, a TBL rule has access to the entire training data when being evaluated; by contrast, at each node, a DT recursively splits its training data into smaller subsets that, from that point on, know nothing about each other: the lower the nodes are placed in the tree, the more fragmented the data they learn from. In addition to overtraining, this fragmentation often leads to inefficient representations in the form of subtree duplication: once subsets of the training data have been divided along different branches, they are independent and may well induce identical subtrees [Oliver 1992] (see also Figure 3(a)).

[21] However, see also Sections 4.1.3 and 4.1.2.
[22] Brill [1995a] proves by induction that for any DT there is a TBL rule list that returns the same classification. Furthermore, he shows that ordinary TBL rules are strictly more expressive than DTs: precisely due to the possibility of leveraging intermediate results, there are classification tasks that can be solved by TBL but not by DTs. However, as pointed out by Florian [2002b, p. 85], the proof makes the unrealistic assumption that the number of target classes is unbounded. With predicates containing at most k conjuncts and a fixed number c of allowed classifications, there are DTs (for instance, of depth k(c − 1) + 1) that cannot be expressed as a TBL rule list.
[23] The classic Abelson and Sussman [1996] is but one of them.
[24] Brill [1995a] describes a transformation list, when applied, as a processor rather than a classifier.


Fig. 3. Decision topologies recognizing the boolean formula (A ∧ B) ∨ (C ∧ D). The edge widths are intended to indicate the relative number of samples going down each branch (assuming, for the sake of example, uniform distribution and conditional independence, p(X) = p(X|Y) = 0.5; X, Y ∈ {A, B, C, D}). Decision trees (a) repeatedly ask partial queries, following every sample from root to leaf. They can express arbitrary conjunctions but often fragment the data; also, disjunctions may lead to subtree duplication (see the subtrees under the two C nodes). Decision lists (b) look for complex queries that can perform the classification in a single step. They thus tend to concentrate on the exceptional cases early on, with more general rules later. Decision pylons (c) may apply any number of queries to an input sample, depending on its present state (branch). During classification, samples may be moved back and forward between the branches; to a pylonic classifier, all paths leading to a given state are equivalent, and all samples in that state are indistinguishable. As a result, predicates cannot exploit classification history, but training suffers less from data fragmentation.



3. TBL THEORY

In this section, after some brief preliminaries, we survey the theoretical work on TBL, especially in terms of its relation to similar, more well-known classifiers. Much work remains to be done on TBL theory; only a handful of publications on TBL include any theoretical considerations at all. Notable exceptions are Ramshaw and Marcus [1994] and, in particular, Florian [2002b, Ch. 3], on which most of this section is based. For reasons of space, we must necessarily omit much detail and especially the longer proofs; we refer to the original when the full picture is needed.

We have noted (Section 2.5) the practical similarities between TBL and DTs. From a

more theoretical perspective, they both belong to the larger family of decision topologies: the classifier poses a series of questions on the item to be classified, the answers of which will effectively steer the evaluation through a graph. Some nodes are leaves (terminals, exits) and output a classification; others contain further questions.

The class of decision topologies includes several other members, the properties of the

underlying graph being their most important defining trait. Florian bases a theoretical account of TBL on two of them: decision lists [Rivest 1987], less familiar than DTs but still tried on many problems, and decision pylons [Bahl et al. 1989], rather more exotic.[25] The relevance of decision pylons, put into a single sentence, is that they are just another representation of TBL rule sequences and that they can be shown to subsume the class of decision lists, thus enabling the reuse of several known theoretical results for the latter class.

The subsequent sections, after providing some background, expand on these claims.

    The following notation is used, taken from Florian [2002b] and Rivest [1987].

- true and false are represented by 1 and 0, respectively.
- X_n = {0,1}^n is the n-dimensional Boolean space (or, equivalently, the set of binary strings of length n). We assume, in a particular discrete learning situation, that the samples are represented by n boolean-valued attributes and that we have some predefined encoding of these attributes (for instance, the ith bit set representing that attribute i is true). Then, an element x ∈ X_n describes an object or, more formally, represents the truth value for that object of a specific predicate on the feature space: for instance, "is the object edible?" or "is the current word that and the POS of the previous word NNS?"
- F ⊆ X_n is the space of samples; F_n is the space of dimensionality n.
- C is the set of target classifications (the classifier range); for a binary classifier, |C| = 2 and, by convention, C = {0,1}.
- We also write cl(x) for the class that classifier cl assigns to sample x.
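
For concreteness, a minimal Python sketch of this encoding (attribute names invented; the fixed attribute order plays the role of the assumed "predefined encoding"):

    ATTRS = ["prev_word_is_to", "word_is_capitalized", "word_has_digit"]

    def encode(sample):
        # map a dict of n boolean attributes to an element x of X_n = {0,1}^n
        return tuple(int(sample[a]) for a in ATTRS)

    x = encode({"prev_word_is_to": True,
                "word_is_capitalized": True,
                "word_has_digit": False})     # x == (1, 1, 0)

    # a predicate with k conjuncts inspects at most k positions of x:
    p = lambda x: x[0] == 1 and x[1] == 1     # a 2-conjunct predicate on X_3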

3.1. Decision Lists

Decision trees are well-known players on the ML stage. Briefly, a DT, once learned, is a rooted tree in which each node n_i holds a question q_i regarding the input. At evaluation time, the answer to the question q_i decides which of the children of n_i, c(n_i), to visit next. The number of possible answers to q_i equals the number of branches b_i = |c(n_i)|. When reaching a leaf, an answer is given, usually as the top-candidate classification or as a probability distribution over all classes. If all nodes have b_i = 2, then the tree is binary, and the questions generally have the form of predicates.

The less well-known decision lists [Rivest 1987] are a special case of a binary DT, in

which one of the branches (conventionally the yes branch) points to a leaf: that is, it immediately yields an answer.

[25] To the extent that relative familiarity can be operationalized as the ratio of hit counts in a popular search engine, decision trees are roughly 25 times as familiar as decision lists, which in turn are 1,000 times as familiar as decision pylons (July 2013).


More importantly, in contrast to (most variants of) DTs, where each q_i queries only a single feature of the data, the questions (predicates) of a decision list often are complex conjuncts that can refer to many different features. When building a decision list, the predicate space is searched for predicates p_i such that the subset T_i of the examples for which p_i is true has zero or close to zero entropy; that is, for which (almost) all members of T_i have the same classification. In practice, this tends to result in lists with specialized predicates (high precision but possibly low coverage) early on and more general (higher coverage but probably lower precision) rules later.

The number of allowed conjuncts k in the predicates of a decision list is the main

determinative factor of its expressive power (as well as of the flip side of expressive power, data sparsity). Another relevant characteristic of a decision list is the number of nodes it contains, which we term its length; or, when we prefer the tree view, its depth. More formally, a k-decision list is a sequence of pairs

    (p_1, v_1), . . . , (p_r, v_r),    (8)

where p_i is a predicate term with at most k conjuncts; v_i ∈ C; and r is the length of the list. p_r is always true, and v_r is called the default value. The set of k-decision lists with predicates taken from X_n is denoted by k-DL(n). The algorithmic interpretation of Equation 8 is given in Algorithm 2. An example of a simple boolean formula expressed as a DT and a decision list is shown in Figures 3(a) and 3(b).

ALGORITHM 2: Evaluating a decision list
if p_1 then
    return v_1
else if p_2 then
    return v_2
else if . . . then
    . . .
else
    return v_r
end if
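
In Python, Algorithm 2 might be rendered as follows (a sketch; the representation as (predicate, value) pairs, with the last predicate constantly true, is assumed):

    def evaluate_decision_list(dl, x):
        for p, v in dl:          # short-circuiting: first true predicate wins
            if p(x):
                return v

    # the formula of Figure 3, (A and B) or (C and D), as a 2-decision list:
    dl = [(lambda x: x[0] and x[1], 1),      # p_1 = A and B
          (lambda x: x[2] and x[3], 1),      # p_2 = C and D
          (lambda x: True, 0)]               # p_3 = true; v_3 is the default
    print(evaluate_decision_list(dl, (1, 1, 0, 0)))   # -> 1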

3.2. Decision Pylons

Decision pylons were introduced in Bahl et al. [1989] in order to make the construction of a tree-based language model with composite questions tractable. They fit excellently as an abstract, foundational model for TBL. Indeed, Florian [2002b, p. 17] succinctly but aptly summarizes TBL as a hill-climbing strategy of learning decision pylons.

A decision pylon starts out with a default classification of each sample. Then, it

applies a list of queries in order. For each query, which is a pair of a predicate p (defined on the feature space and the current class) and a new class c, it identifies the subset of samples for which p is true and changes the classification of that subset of samples to c.

Crucially, decision pylons are not trees: there may be more than one path to a given

node. They keep no record of classification history and thus operate with a simpler predicate space; but, on the other hand, they suffer less from data fragmentation.

As for decision lists, the most important characteristic of a pylon is the maximum

number of conjuncts k in its queries. Pylons also have what we term a depth, the number of questions, and a width, the number of output classes. More formally, a k-decision pylon is a sequence of pairs

    (v_0), (p_1, c_1 → v_1), (p_2, c_2 → v_2), . . . , (p_r, c_r → v_r)    (9)


Fig. 4. Example of a decision pylon of class 1-DP(n) that can correctly classify the "number of set bits" problem, not solvable by any k-decision list if k < n. Figure 3.2 of Florian [2002b], reprinted with permission.

where p_i is a predicate term with at most k conjuncts; v_i ∈ C; i = 1 . . . r; and v_0 is called the default value of the decision pylon. The set of k-decision pylons with predicates taken from X_n is denoted by k-DP(n).

Examples of decision pylons are shown in Figures 3(c) and 4. The algorithmic interpretation of Equation 9 is given in Algorithm 3.

ALGORITHM 3: Evaluating a decision pylon
c ← v_0 {c holds current classification}
if p_1 and c = c_1 then
    c ← v_1
end if
if p_2 and c = c_2 then
    c ← v_2
end if
. . .
if p_r and c = c_r then
    c ← v_r
end if
return c
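
A corresponding Python sketch of Algorithm 3 (again with an assumed representation: a default value plus (predicate, from_class, to_class) triples) makes the contrast with the decision list explicit: every rule is tested, in order, against the current class:

    def evaluate_decision_pylon(v0, rules, x):
        c = v0                          # c holds the current classification
        for p, c_i, v_i in rules:       # no shortcuts: every rule is tested
            if p(x) and c == c_i:       # ...but fires only in state c_i
                c = v_i
        return c

    # the formula of Figure 3, (A and B) or (C and D), as a 2-decision pylon:
    rules = [(lambda x: x[0] and x[1], 0, 1),
             (lambda x: x[2] and x[3], 0, 1)]
    print(evaluate_decision_pylon(0, rules, (0, 0, 1, 1)))   # -> 1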

3.3. Lists to Pylons and Back

Decision lists and decision pylons have some obvious differences and similarities. The main operational difference is that, for decision lists, the evaluation process is short-circuiting: once a single predicate returns true, evaluation terminates and a classification is output. By contrast, decision pylons take no shortcuts: all predicates are tested and applied, irrespective of the applicability of previous rules.

In terms of representational power, Florian [2002b, p. 57] shows that, for the general

classification case, with independent samples:[26]

(1) k-DL(n) is a proper subset of the class k-DP(n).
(2) k-DP(n) is a proper subset of the class p-DL(n), where p = min(k(|C| − 1), n).

[26] That is, the results do not generalize to sequential classification or other types of structured prediction, where chains of rules can exploit dependencies longer than the template width.


Here, we give Florian's constructive proofs of these two crucial properties (slightly adapted and shortened). The two proofs follow and generalize earlier work given in Roth [1998], where the claim k-DP(n) ⊆ k-DL(n) is proven for binary classification.

(1) For every decision list dl ∈ k-DL(n), there is an equivalent decision pylon dp ∈ k-DP(n) such that dl(x) = dp(x), ∀x ∈ X_n. To show this, we introduce the notation (p_k, ∗ → v_k) as a shorthand for (p_k, c_1 → v_k), (p_k, c_2 → v_k), . . . , (p_k, c_m → v_k), where C = {c_1, . . . , c_m} is an enumeration of the classification space. First, consider the decision list

    dl = (p_1, v_1), . . . , (p_k, v_k).

Let x ∈ X_n be an arbitrary sample, and let j be the first index such that p_j(x) = true. The classification assigned by the decision list, dl(x), is then v_j. Now, construct the decision pylon

    dp = (v_k), (p_k, ∗ → v_k), (p_{k−1}, ∗ → v_{k−1}), . . . , (p_1, ∗ → v_1)

and apply it to x. At some point, the evaluation reaches . . . , (p_j, ∗ → v_j). After that, no more predicates p_{j−1}, p_{j−2}, . . . , p_1 will be true, and thus dp(x) is again v_j.

To show that the inclusion is strict, that there are k-DP(n) that cannot be represented by any k-DL(n), consider the simple problem of counting the number of set bits in a bit string of length n. Because decision lists are stateless and lack memory, a correct solution needs to inspect all the n attributes at once. Therefore, any k-DL(n), k < n, cannot solve this problem. By contrast, a 1-DP(n) solution is given in Figure 4 (see also the code sketch following the proof).

(2) For every decision pylon dp ∈ k-DP(n), there is an equivalent decision list dl ∈ p-DL(n), with p = min(k(|C| − 1), n), such that dp(x) = dl(x), ∀x ∈ F_n. To see this, we consider the k-decision pylon

    dp = (v_0), (f_1, c_1 → v_1), . . . , (f_m, c_m → v_m)

and build an equivalent decision list dl by exhaustive construction, as follows.

(a) For every possible sample x ∈ F_n:
    i. Trace the series of states σ = σ_0, σ_1, σ_2, . . . , σ_m that x will pass through when evaluated by dp, with σ_0 = v_0.
    ii. Construct the predicate P = true ∧ ∧ f_i, for all i ∈ 1 . . . m such that σ_{i−1} ≠ σ_i (that is, for every state change, add the triggering predicate to P).
    iii. Add (P, σ_m) to dl, if not already present.
(b) Finally, add v_0 as the default value of dl.

To see that any predicate P in dl needs at most min(n, k(c − 1)) terms, we first note the trivial case that a term of size n always is sufficient (all variables can be inspected at once). For the more interesting case, form the directed graph G = (V, E), where V = C and (σ, τ) ∈ E whenever σ = σ_{i−1} and τ = σ_i for some i ∈ 1 . . . m (that is, there is an edge between states σ and τ whenever τ follows σ at some point in the trace). Also, associate the predicate f_i with that edge, if it had no previous association.

Clearly, a path through G from v_0 to σ_m must exist (because the graph contains the path implicitly described by the trace σ). Now, because a decision pylon immediately forgets a sample's classification history once the sample is in a given state, any path through G from v_0 to σ_m is equivalent. In particular, the shortest path S will do. Such a shortest path can be no longer than |C| − 1 edges, because it can at most visit all |C| nodes. Thus, a P formed by the conjunction of all predicates associated with the edges of S, each being at most k-sized, needs at most k(c − 1) terms.
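
The strictness argument in part (1) can be made concrete. The following Python sketch builds the 1-decision pylon of Figure 4 for arbitrary n (the rule representation is an assumption: each rule inspects a single bit and is conditioned on the current count; ordering the rules by decreasing current count ensures that a sample is promoted at most once per bit):

    def count_bits_pylon(n):
        # rule = (bit index to inspect, current count, new count)
        return [(j, c, c + 1)
                for j in range(n)              # examine each bit in turn
                for c in range(j, -1, -1)]     # highest current count first

    def evaluate(pylon, x):
        c = 0                                  # default classification v_0
        for j, c_i, v_i in pylon:
            if x[j] == 1 and c == c_i:
                c = v_i
        return c

    assert evaluate(count_bits_pylon(4), (1, 0, 1, 1)) == 3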


These are important results because several known properties of decision lists will then also apply to decision pylons. Below, following Florian [2002b], we comment on the most relevant from a decision-pylonic point of view. For full details and proofs on decision lists, we refer to Rivest [1987], the major theoretical contribution on the topic.

3.3.1. For Binary Classification, k-DL(n) = k-DP(n). As an immediate consequence of the two properties just given, we note that if c = |C| = 2, then p = k, and thus k-DL(n) = k-DP(n). That is: for the important class of binary classifiers, the class of k-decision lists is equivalent to the class of k-decision pylons.

3.3.2. The Class k-DP(n) Contains the Classes k-CNF(n), k-DNF(n), and k-DT(n). Each of the classes k-CNF(n) (Boolean formulae expressed in Conjunctive Normal Form, i.e., as a conjunction of disjunctions of at most k literals); k-DNF(n) (Boolean formulae expressed in Disjunctive Normal Form, i.e., as a disjunction of conjunctions of at most k literals); and k-DT(n) (decision trees of depth at most k) is a proper subset of the class k-DL(n) [Rivest 1987]. Thus, they are also contained in k-DP(n).

3.3.3. k-DP(n) is PAC Learnable. PAC learnability [Valiant 1984; Mitchell 1997] is used to characterize a computational learning problem in terms of complexity. For a binary classification function, PAC learnability implies that there is some algorithm that, with as high a probability 1 − δ as desired (the P of PAC, for "probably"), can construct a classifier with an error rate ε as low as desired compared to the target (the AC of PAC, for "approximately correct"), independently of the distribution of the sample space. Furthermore, if this algorithm is polynomial in (N, 1/δ, 1/ε), where N is the input size, then the problem is said to be efficiently PAC learnable. Rivest [1987] shows that the class k-DL(n) is efficiently PAC learnable; thus, this is true also for the class k-DP(n) [Florian 2002b, p. 62 f.].

4. TBL EXTENSIONS AND DEVELOPMENTS

In Section 2.4, we showcased several desirable properties of the basic TBL method, but there is certainly room enough for development, improvement, and extension. In the following, we briefly survey the most important new ideas for TBL: extending the domain of the learned hypothesis (Section 4.1), extending its range (i.e., the type of its predictions; Section 4.2), increasing efficiency (Section 4.3), decreasing the need for supervision (Section 4.5), and facilitating declarative problem specifications by domain-specific languages (Section 4.6). For convenience, Table VI summarizes the subheadings of this section, with selected references for each.

4.1. Extending the Hypothesis Domain

4.1.1. Multidimensional Learning. Many real-world applications involve more than one

subtask: perhaps one classifier to assign POS to each word and another one for combining the result into, say, binary branching parse trees. A simple way of performing such combined tasks is to put the appropriate classifiers in a pipeline, resulting in a strictly feed-forward system, in which the outcome of task k cannot influence the outcome of task k − 1. For very different and/or truly independent tasks, this may be the best way. A variation is beam search, in which we keep n > 1 hypotheses from early steps and defer complete disambiguation until later.

However, when two tasks A and B are somewhat dependent, we may benefit more

from multitask learning [Caruana 1997]. Intuitively, if we can expect that solutions to A may provide useful information for solving B, and vice versa, then it would be better not to impose an artificial ordering of the tasks on the system but rather solve them in parallel. In that way, the easy cases (of both A and B) may provide information for the more difficult ones (of both A and B). Two such interrelated tasks are POS tagging and chunking: dividing sentences into larger, non-overlapping units. We look closer at


Table VI. TBL Developments and Extensions, with Selected References

Development                               Main references
Extending the hypothesis domain
  Multidimensional learning               Florian and Ngai [2001]; Florian [2002b]
  Automatic template learning             Curran and Wong [2000]; Milidiú et al. [2007]; dos Santos and Milidiú [2009]; dos Santos [2009]
  Template-free TBL                       Wilson and Heywood [2005]
Extending the hypothesis range
  Predicting probabilities                Florian et al. [2000]; dos Santos and Milidiú [2007]
  Predicting sets                         Lager [1999b, 2001]
  Predicting numbers                      Bringmann et al. [2002]
Improving efficiency
  Efficiency in training                  Ramshaw and Marcus [1994]; Samuel [1998b]; Hepple [2000]; Carberry et al. [2001]; Ngai and Florian [2001a]; Florian [2002b]
  Efficiency in application               Roche and Schabes [1995]
Improving robustness: redundant rules     Florian [2002b]
Unsupervised TBL                          Brill [1995b]; Aone and Hausman [1996]; Becker [1998]
DSLs and template compositionality        Lager [1999b]; Lager and Zinovjeva [1999]

chunking later (Figure 12); here, we only note that there are easy cases for both the POS and the chunking part, where the tag can be predicted directly from word forms, and that these two sets do not overlap completely.

Indeed, multitask learning on well-chosen representations may be worthwhile even

when we are not really interested in solving all of the tasks; learning POS by jointly learning chunks is conceivably easier than learning POS alone.

One way to set up such a joint classifier is to simply create a new derived feature

as a concatenation or tuple of the features we wish to combine. However (in analogy with computing joint probability distributions), this strategy may cause problems of data sparseness with many types of data. Unusual joint values will have unreliable estimates, and joint values that do not occur in the training data will never be predicted.

A better way for many purposes is to let the tasks share a common representation

and let each classifier work on it simultaneously. TBL is well suited for joint learning in this way. Florian and Ngai [2001] point out the few changes needed to the main algorithm, mainly small modifications to the scoring function f. As stated before (Section 2.3.3), to guarantee termination, f needs to assign positive values only to rules that actually decrease the current error count. This requirement is easily met also in a multidimensional setting, by letting f take more than one field into account:

    f(r) = Σ_{s ∈ D} Σ_{i=1}^{n} w_i (S_i(r(s)) − S_i(s))

Here, D is the set of training samples (the training corpus); r is a candidate transformation rule to be scored; r(s) is the transformed sample resulting from applying r to sample s; n is the number of fields (tasks); w_i is an optional weight or priority for task i; and S_i(s) is an indicator of the current classification of sample s for task i (1 if correct, 0 if not).

The weights w_i could be used to manually assign priorities to each subtask [Florian

and Ngai 2001]. They might conceivably also be initialized from training data based on the counts of correct and incorrect sample classifications for each field: as constants for the entire training session, or even as per-iteration values to be recalculated and reassigned after each rule application. Different weights for the target attributes, whether set manually or derived from data, are apparently an unexplored option.
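
A sketch of this scoring function in Python, under assumed representations (each sample carries per-task current predictions and true labels; the rule is assumed to return an updated copy of the sample rather than mutate it):

    def score(rule, corpus, weights):
        # f(r) = sum over samples s and tasks i of w_i * (S_i(r(s)) - S_i(s))
        total = 0.0
        for s in corpus:
            t = rule(s)                        # r(s): rule applied to a copy
            for i, w in enumerate(weights):
                s_before = int(s["pred"][i] == s["true"][i])   # S_i(s)
                s_after = int(t["pred"][i] == t["true"][i])    # S_i(r(s))
                total += w * (s_after - s_before)
        return total

A rule is then learnable only if its score is positive, which preserves the termination guarantee in the multidimensional setting.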


To conclude, in the TBL case, multidimensional learning can be seen as a generalization of one of the previous key arguments for the method: we can use intermediate results to guide later predictions. The multidimensional addition is that such intermediate results may refer to and exploit more than one feature.

4.1.2. Automatic Template Learning. Templates are the main tool for embedding and encoding domain knowledge in TBL. Although this strategy is flexible, templates can be tedious or difficult to produce. The tedium may to a large extent be alleviated by automation (e.g., the template compiler described in Lager [1999a]). As we have seen, overtraining is seldom a problem in TBL, so unnecessary templates will mostly only affect training time. A greater concern is insufficient domain knowledge: for less well-understood problems, it may be difficult to know where to start. Learning templates automatically then becomes an attractive option.

Curran and Wong [2000] envision evolving templates that change in size, shape and

number as the learning algorithm continues, starting from few and simple templates that are gradually refined into (or replaced by) more complex ones in a data-driven fashion. They present no implementation, but they show empirically that the number of conditions in the templates and their specificity (e.g., words rather than tags) increase during learning: simple templates with few conditions are most efficient early on, but later in the learning session more involved templates tend to pay off better. However, the task providing their data is again POS tagging for English; it would be desirable to see their claims corroborated for other tasks and datasets.

Less abstractly, Milidiú et al. [2007] implement a genetic algorithm [Mitchell 1997]

for automatic template learning. Their system shows impressive performance, but their reported setup suffers from slow training; with the TBL algorithm as the fitness function, training must happen for each individual of the population, for each generation. For anything but the smallest feature sets, this rapidly becomes intractable.

A more recent approach [dos Santos and Milidiú 2009; dos Santos 2009], dubbed

Entropy-based Transformation Learning (ETL), instead constructs templates from a DT trained on the task at hand. A DT [Mitchell 1997; Quinlan 1993] can in this context be thought of as a series of yes/no questions asked about an object to be classified, with the questions ordered by expected usefulness (cf. Section 2.5). The main idea behind ETL is now that the features that are addressed by the DT-induced questions on task X are likely to make up a good set of TBL templates for X. Each path from the root to any internal node of the learned DT corresponds to a specific series of questions and thus a specific set of features (Figure 5); that is, a template.

Some care needs to be taken to avoid typical DT training problems. First, the stan-

dard algorithms will strongly favor high-dimensional features, for instance, word identity rather than POS. The version of ETL described in dos Santos [2009] tries to control this by sorting the values of a high-dimensional feature in decreasing order of information gain (IG) and replacing all except the z top-scoring ones with a dummy value, where z is a parameter of the algorithm. Second, as discussed in Section 2.5, DTs are inherently stateless and have no notion of a current classification. For the purposes of template generation, ETL solves this by introducing the true value of the classifications in the context (but not for the current object).

dos Santos and Milidiú [2009] report excellent results for ETL on a number of tasks

[Fernandes et al. 2010; Milidiú et al. 2008, 2010; dos Santos et al. 2008, 2010]. The approach has apparently so far only been used by this single group of researchers; a more thorough evaluation will have to wait.
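
The path-to-template idea itself is easy to sketch in Python. Assuming a toy tree representation (a (feature, children) pair for internal nodes, a plain label for leaves; feature names invented), the following collects one candidate template per path from the root to an internal node:

    def extract_templates(tree, path=()):
        if not isinstance(tree, tuple):        # a leaf: contributes nothing
            return set()
        feature, children = tree
        templates = {path + (feature,)}        # path ending at this node
        for subtree in children.values():
            templates |= extract_templates(subtree, path + (feature,))
        return templates

    toy_tree = ("word[0]", {"that": ("tag[-1]", {"NNS": "IN", "VB": "DT"}),
                            "other": "NN"})
    print(sorted(extract_templates(toy_tree)))
    # -> [('word[0]',), ('word[0]', 'tag[-1]')]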

4.1.3. Template-Free TBL. An even more radical approach than automatic template learning is TBL without any templates at all. Wilson and Heywood [2005] suggest a genetic algorithm [Mitchell 1997], which (in contrast to the previous genetic approach


Fig. 5. Template extraction from a decision tree [dos Santos 2009]. First, the leaves and the edge labels of the learned tree (left, for POS tagging on Swedish; only a small fragment shown) are discarded. In the resulting subtree, each path from the root to an internal node corresponds to a (not necessarily unique) template (right).

Fig. 6. Bit string encoding of one (fictive) particular rule in TBL by genetic algorithm. Any subset of tags within a window of ±3 around the current position can be encoded, replacing a manually specified template set. From Wilson and Heywood [2005], adapted and redrawn with permission.

by Milidiú et al. [2007]) does away with templates entirely. Instead, an entire sequence of rules corresponds to one individual in the population: one individual is a collection (378, in the experiment) of rules, each represented as a bit string of fixed length (48, in the experiment; Figure 6). Rule order within individuals is altered with a crossover operator.

Although an undeniably novel approach, several critical choices are tuned to the

specifics of POS tagging (in particular, for English). Thus, the rather small tagset size is part of the encoding assumptions: it is not clear how the method would handle a larger target set or many-valued features (as may well be crucial in other tasks). It is also unclear how it would respond to a lower-scoring baseline. The paper does not exemplify any rules learned; that would otherwise have made for interesting comparisons with regular TBL.

TBL without manually specified templates seems particularly well-motivated when

the choice of templates is problematic, perhaps due to poor understanding of the domain. This is hardly the case for POS tagging in English, so template-less genetic algorithms may have more potential for other tasks. Performance is also quite a bit


lower (the authors report 89.8% from a baseline of 80.3%; Brill's original algorithm scores 97.0% on the same corpus).
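
To give a feel for the fixed-length encoding, here is a purely illustrative Python decoding of a rule bit string. The actual field layout in Wilson and Heywood [2005] differs; the tag inventory, field widths, and helper names below are all invented for illustration:

    TAGS = ["DT", "IN", "NN", "VB"]          # toy tagset, 4 bits per mask

    def decode(bits):
        # 7 window positions (-3 .. +3), each a tag mask, plus a 2-bit new tag
        assert len(bits) == 7 * len(TAGS) + 2
        rule = {}
        for w in range(7):
            mask = bits[w * 4:(w + 1) * 4]
            allowed = {t for t, b in zip(TAGS, mask) if b}
            if allowed:                      # empty mask = "don't care"
                rule[w - 3] = allowed
        new_tag = TAGS[bits[-2] * 2 + bits[-1]]
        return rule, new_tag

    print(decode([0] * 28 + [1, 0]))         # -> ({}, 'NN')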

4.2. Extending the Hypothesis Range

4.2.1. Predicting Probabilities: Soft Classification. The vanilla output of a TBL classifier is

one class per sample, with no record kept of the confidence in the choice. Such output is based on hard decisions, each committing the system to a single possibility with no information on the certainty of the choice.

Probabilistic systems, by contrast, make soft decisions, providing confidence mea-

sures for each classification. Confidence measures are useful for many purposes and indispensable for some. For instance, in ensemble systems (Section 5.2), the member classifiers may disagree; it is then very useful to know how much they are willing to insist. In active learning, the idea is to minimize manual effort; such systems use confidence measures to identify the most uncertain (and thus, to the system, most informative) samples and ask the annotator for their classification. For some applications, hard decisions are simply inappropriate. Larger, multicomponent systems (e.g., speech recognizers) are generally built from many smaller modules that all deal with probabilistic input and output (say, as probability distributions or as ranked candidate lists).

Two notable attempts have been made to enhance TBL with probabilistic classifica-

tion. They share the basic idea of splitting the training data into equivalence classes: all the samples that have been tagged X for reasons Y are considered together. Then probabilities can be estimated for each equivalence class by standard means (e.g., maximum likelihood estimation, probably with some smoothing).[27]

Florian et al. [2000] give an algorithm for transforming a learned rule list into an

equivalent DT and then taking the leaves of that tree as equivalence classes. They note that the equivalence classes thus constructed will tend to vary a lot in size. In particular, with a good baseline, the equivalence class of samples to which no rules apply at all can easily make up most of the corpus, and the probability estimates of that class will be close to those arrived at without any learning at all. To remedy that, the learned tree is treated as a (highly accurate) prefix, whose paths are grown further with standard DT learning methods [Quinlan 1993].

dos Santos and Milidiú [2007] instead construct equivalence classes from the baseline

classification and rule traces; for instance, all samples that had initial classification NN and were later touched by rules 11, 57, and 88 form an equivalence class of their own. The problem of unevenly sized classes is solved by subdividing on manually specified auxiliary features. The authors claim a significant improvement over Florian et al. [2000] on comparable tasks. The

