Clues from Information Theory Indicating a Phased Emergence of Grammar


Caroline Lyon, Chrystopher L. Nehaniv, Bob Dickerson

1 Introduction

In this chapter we present evidence that there is an underlying local sequential structure in present day language, and suggest that the components of such a structure could have been the basis of a more highly evolved hierarchical grammar. The primary local sequential structure is shown to have its own benefits, which indicate that there could be an intermediate stage in the evolution of grammar, before the advantages of a fully developed syntax were realised.

A consequence of having such a structure is that the consecutive segments that compose it have internal cohesion, so we expect local dependencies to be pronounced, part of the small world effect. The closest dependencies are between neighbouring elements of a sequence, with a few long distance dependencies, and we expect sequential processing to play a key role.

The evidence we present is primarily drawn from investigations into the underlying characteristics of present day language. Linguistic communication is based on the interaction between the production of speakers and the perception of hearers, and we show that the processing of heard speech is more efficient when, rather than being taken as a string of individual words, it is segmented in a sequential local structure. This is achieved through the application of information theoretic tools. We note that recent neurobiological research supports our case, showing the key role played by primitive sequential processing in language production and perception. We also draw on simple observations of the abundance of homophones in everyday speech, to illuminate human language processing. The fact that we usually have no trouble disambiguating homophones indicates that words are taken in their context rather than individually.

1.1 Overview of the investigations

The core of the work described in this chapter is an investigation into the statistical characteristics of spoken and written language which can help explain why language was likely to evolve with a certain structure. We take a large corpus of written text and transcribed speech to see whether the efficiency of encoding and decoding the stream of language is improved by processing a short sequence of words rather than individual words. To do this we measure the entropy of the word sequence, comparing values when we take single words, pairs, triples and quadruples. A decline in entropy indicates an increase in predictability, an improvement in decoding efficiency. Our experiments show that entropy does indeed decline as sequences of up to three words are processed, and thus support the hypothesis that local sequential processing underlies communication through language.

We also measure the entropy with and without punctuation, to see whether communication is more efficient if the stream of words is broken into segments that usually correspond to syntactic components. Entropy indeed declines further with the inclusion of punctuation. As there is a strong correlation between punctuation and prosodic markers in speech (Fang and Huckvale, 1996; Taylor and Black, 1997), this decline indicates that there is an advantage in processing language in the segments that prosodic markers provide, since it is then easier to decode.

This suggests that there could be an intermediate stage in the development of a full hierarchical grammar. Processing a linear stream of words that is appropriately segmented is more efficient for the decoder than taking unsegmented, continuous strings of words. Such segments can then be the components of a hierarchical grammar.

Experiments have been carried out with the British National Corpus, BNC (visited March 2006), about 100 million words of text and transcribed speech from many different domains.

Earlier work was carried out on a small parsed corpus, in which the subject and predicate of sentences were marked. The corpus was mapped onto part-of-speech tags, and "virtual tags" were inserted into the sequence to mark the subject and predicate boundaries. Again, it was found that the entropy of the corpus with these virtual tags was lower than that without these syntactic components being represented (Lyon and Brown, 1997), though the corpus in this case was too small for the results to be very significant.

Other recent work in this field has been done on a comparatively small corpus of 26,000 words of transcribed speech, annotated with prosodic markers (Lyon et al. 2003). However, using the large BNC corpus enables us to confirm and expand upon those results.

1.2 Related work

Recent work on the small world phenomenon has investigated possible universal patterns of organization in complex systems (Ferrer i Cancho and Sole, 2001). This effect, which is evident in natural language, picks up on the dominance of local dependencies, and research is continuing into how robust complex systems can emerge (see Section 5).

Other work that supports our hypotheses includes neurobiological investigations with fMRI and PET (Lieberman, 2002). Furthermore, observations on the frequency of homophones in everyday speech, and the ability of speakers and hearers to disambiguate them without difficulty, lend further credence to our hypothesis, as discussed below.

Another related area of research is in the role of 'chunking' mechanisms in human perception, cognition and learning (Gobet et al. 2001), and associated computational models such as CHREST (Chunk Hierarchy and REtrieval STructures). A variant of this, MOSAIC, simulates the early acquisition of syntactic categories by children aged 2-3 years, and indicates that chunking mechanisms play a significant role.

2 Background to this work

2.1 Co-operative communication

A number of different scenarios have been used to introduce hypotheses on the evolution of language. They have included a range of possibilities, such as "gossip, deceit, alliance building, or other social purposes" (Bickerton, 2002). In contrast, the work described here is based on those scenarios where producers and receivers are co-operating, sharing information. As Fitch (this volume) explains, "it is in the signaller's interest to share information honestly, and the receiver's to accept this information unskeptically". A typical scenario for co-operative communication would be in group hunting or fishing situations, where deceit would be counter-productive. Even with Bickerton's manipulative communication scenario a degree of co-operation is required to enable understanding. We look at modes of communication that are most efficient for producers and receivers. To investigate this we take a large corpus of spoken and written language and apply an analytic tool from information theory, the entropy measure, to determine the efficiency of different modes of communication.

2.2 Entropy indicators

The original concept of entropy was introduced by Shannon (1993). Informally, it is related to predictability: the lower the entropy, the better the predictability of a sequence of symbols. Shannon showed that the entropy of a sequence of letters declined as more information about adjacent letters is taken into account; it is easier to predict a letter if the previous ones are known. Entropy is represented as H, and we measure:

– H0 : entropy with no statistical information, symbols equi-probable;

– H1 : entropy from information on the probability of single symbols occurring;

– H2 : entropy from information on the probability of 2 symbols occurring consecutively; and

– Hn : entropy from information on the probability of n symbols occurring consecutively.

Hn measures the uncertainty of a symbol, conditional on its n − 1 predecessors; we call Hn the conditional entropy.

For an introductory explanation of the concept of entropy, see Lyon et al. (2003), page 170. The derivation of the formula for calculating entropy is in Appendix B. For many years Automated Speech Recognition developers have used entropy metrics to measure performance (e.g. Jelinek, 1990).
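To make these measures concrete, the following short sketch (ours, not code from the original experiments) estimates Hn for a toy sequence of symbols, following the definition of conditional entropy given in Appendix B. In the experiments the symbols are part-of-speech codes rather than words, but the calculation is the same.

```python
from collections import Counter
from math import log2

def conditional_entropy(symbols, n):
    """Estimate H_n: the uncertainty of a symbol given its n-1 predecessors,
    in bits. For n = 1 this reduces to the ordinary single-symbol entropy H_1."""
    positions = range(len(symbols) - n + 1)
    ngrams = Counter(tuple(symbols[i:i + n]) for i in positions)
    blocks = Counter(tuple(symbols[i:i + n - 1]) for i in positions)
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_gram = count / total                 # p(b_i, j)
        p_cond = count / blocks[gram[:-1]]     # p_{b_i}(j) = p(b_i, j) / p(b_i)
        h -= p_gram * log2(p_cond)
    return h

# A toy symbol sequence (words here, part-of-speech tags in the experiments).
seq = "the cat sat on the mat and the dog sat on the rug".split()
print(f"H0 = {log2(len(set(seq))):.2f}")          # symbols taken as equi-probable
for n in (1, 2, 3):
    print(f"H{n} = {conditional_entropy(seq, n):.2f}")
```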

2.3 Using real language

A significant amount of language analysis has not been done with real language. Well known examples include Elman's experiments with recurrent nets (1991), which use a 23 word vocabulary: 12 verbs, 10 nouns and a relative pronoun. Sentences like boy sees boy are considered grammatical, because there is number agreement between the subject and verb, though this sentence would be considered ungrammatical in real language because the determiners are missing. Elman himself is careful to say that this language is artificial, but this is not the case with many of his followers, who assert that it is a subset of natural language.

This artificial example consists almost exclusively of content words, but in fact many, sometimes most, of the words most people utter are function words. Though in any model we have to abstract out the features we consider most significant, a focus on content words alone introduces distortions if we are looking at human communication. We present evidence that there is a phrasal basis to language, a view also supported by Wray (this volume). This is not to say that there are not other essential characteristics of language. For instance, a key development in its evolution is the emergence of compositionality (Kirby, this volume), but word focused compositionality is compatible with a linearly structured phrasal basis.

3 The British National Corpus

The BNC corpus is composed of a representative collection of English texts; about 10% of the total is transcribed speech. As we want to investigate the processing of natural language, headlines, titles, captions and lists are excluded from our experiments. Including punctuation marks leads to a corpus of about 107 million symbols.

To carry out an analysis on strings of words it is necessary to reduce an unlimited number of words to a smaller set of symbols, and so words are mapped onto part-of-speech tags. As well as making the project computationally feasible, this approach is justified by evidence that implicit allocation of parts of speech occurs very early in language acquisition by infants, even before lexical access to word meanings (Morgan et al. 1996).

The BNC corpus has been tagged, with a tagset of 57 parts of speech and 4 punctuation markers. We have mapped these tags onto our own tagset of 32 classes, of which one class represents any punctuation mark (Appendix A). Tagsets can vary in size but our underlying aim is to group together words that function in a similar way, having similar neighbours. Thus, for example, lexical verbs can usually have the same type of predecessors and successors whether they are in the present or past tense (We like swimming / We liked swimming), so in our tagset they are in one class. We believe we have not lost discrimination by moving to the smaller tagset, as is evidenced by structure detected in our experiments.

Another reason for mapping the BNC tagset onto our smaller set is that the entropy measures are more pronounced for the smaller set.
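As an illustration, the fragment below (a sketch of ours, showing only a handful of the tags listed in Appendix A) groups functionally similar BNC tags under one code before any entropy is measured.

```python
# Excerpt of the Appendix A mapping: BNC tag -> reduced code.
# Present and past tense lexical verbs share code 26; all punctuation shares 0,
# so "We like swimming" and "We liked swimming" map their verb to the same code.
BNC_TO_REDUCED = {
    "AJ0": 1, "AJC": 1, "AJS": 1,            # adjectives
    "AT0": 2, "CRD": 2, "DT0": 2, "DTQ": 2,  # articles, numbers, determiners
    "NN1": 9,                                # singular common noun
    "VVB": 26, "VVD": 26,                    # lexical verb, base and past tense
    "PUL": 0, "PUN": 0, "PUQ": 0, "PUR": 0,  # punctuation marks
}

def reduce_tags(bnc_tags):
    """Map a sequence of BNC tags onto our smaller tagset."""
    return [BNC_TO_REDUCED[tag] for tag in bnc_tags]

print(reduce_tags(["AT0", "AJ0", "NN1", "VVD", "PUN"]))   # -> [2, 1, 9, 26, 0]
```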

4 Experiments

We have run the following experiments. First, we have processed the whole corpus of 107 million part-of-speech tags, with punctuation, and found H1, H2, H3, H4 and H5, as shown in Table 1. We also ran experiments over each of the 10 directories in which the corpus material is placed to see if there was much variation. In fact, variation between the directories is small: the results cluster round the measure for the whole corpus. An example is shown in the lower part of Table 1.

We also process a comparable set of randomly generated numbers, in order to ensure that distortions do not occur because of under sampling. With 32 tags the number of possible sequences of 5 tags is 33,554,432. If too small a sample is used, the entropy appears lower than it should because not all the infrequent cases have occurred. A simple empirical test on sample size is through a random number sequence check. For a random sequence, the entropy should not decline as more of the information over preceding items is taken into account, since they are generated independently. Thus H for a sequence of random numbers in the range 0 to 31 should stay at 5.0. Sequences of random numbers were produced by the Unix random number generator. The results show that for the whole corpus we can be fully confident up to the H4 figure, but H5 should be treated with caution. For the 10 subdirectories, H4 should be treated with caution, and H5 far underestimates the entropy given the sample size and so is omitted.
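The sketch below (ours, reusing the conditional_entropy function shown in Section 2.2) illustrates the check: for uniformly random symbols drawn from 32 values the estimates should sit near 5.0 bits, and any apparent decline at higher orders is purely an under-sampling artefact. The sample sizes shown are illustrative and far smaller than the corpus.

```python
import random

random.seed(0)
for size in (10_000, 1_000_000):
    rand_seq = [random.randrange(32) for _ in range(size)]
    estimates = [round(conditional_entropy(rand_seq, n), 2) for n in (1, 2, 3, 4)]
    print(size, estimates)
# Higher-order estimates fall below 5.0 when the sample is too small relative
# to the number of possible n-grams (32**4 is about a million), mimicking a
# spurious "decline" in entropy.
```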

Secondly, we process the whole BNC corpus, but omitting punctuation marks, as shown in Table 2. This time there are 31 tags. The number of symbols is reduced, since punctuation marks were counted as words in the first experiment.

4.1 Analysis of results

The results in Table 1 show that entropy declines as processing is extended over 2 and then 3 consecutive part-of-speech tags. There is a small further decline when 4 consecutive tags are taken. The results for 5 consecutive tags are not considered fully reliable, in view of the random sequence check for 107 million symbols.

Compare these results with those in Table 2. This time there is one less tag symbol, so we expect unpredictability to decrease compared to that for the corpus tagged with 32 symbols, and entropy to be less. This is what we find for H0 and for H1. However, as we take words 2, 3 and 4 at a time we find that entropy is slightly greater than in the first case. This indicates that punctuation captures some of the structure of language, and by removing it we increase the uncertainty. Paraphrasing Shannon, we can say that a string of words between punctuation marks is a cohesive group with internal statistical influences, and consequently the sequences within such phrases, clauses or sentences are more restricted than those which bridge punctuation (Shannon, 1993, page 197).

Corpus                                      H0    H1    H2    H3    H4      H5
107 million words + punctuation, 32 tags    5.0   4.19  3.27  2.94  2.84    (2.75)
107 million random words, 32 tags           5.0   5.0   5.0   5.0   5.0     4.8
10 million words, subdirectory F, 32 tags   5.0   4.18  3.25  2.91  (2.79)  -
10 million random words, 32 tags            5.0   5.0   5.0   5.0   4.93    3.05

Table 1. Entropy measures for the BNC corpus, mapped onto 32 part-of-speech tags. 3-grams, 4-grams and 5-grams that span a punctuation mark are omitted. Figures in brackets are to be treated with caution.

Corpus                                      H0    H1    H2    H3    H4      H5
94 million words, no punctuation, 31 tags   4.95  4.16  3.29  3.14  3.07    (3.01)
94 million random words, 31 tags            4.95  4.95  4.95  4.95  4.95    4.72

Table 2. Entropy measures for the BNC corpus, mapped onto 31 part-of-speech tags, omitting punctuation. The figure in brackets should be treated with caution.

These results indicate that a stream of language is easier to decode if words are taken in short sequences rather than as individual items, and they support the hypothesis that local sequential processing underlies communication through language.
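In outline, the comparison behind Tables 1 and 2 can be reproduced with the same entropy function: measure the tag stream once with the punctuation code included and once with it stripped out. The few lines below are only a simplified sketch of ours, again reusing conditional_entropy from Section 2.2; in the reported experiments, n-grams spanning a punctuation mark were also omitted, which this sketch does not attempt.

```python
# tags: a list of reduced codes with punctuation mapped to 0 (see Appendix A).
# Hypothetical input; in the experiments it comes from the tagged BNC.
tags = [2, 1, 9, 26, 2, 9, 0, 13, 26, 27, 0, 2, 9, 26, 2, 1, 9, 0]

with_punct = tags
without_punct = [t for t in tags if t != 0]

for label, stream in (("with punctuation", with_punct),
                      ("without punctuation", without_punct)):
    print(label, [round(conditional_entropy(stream, n), 2) for n in (1, 2, 3)])
```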

5 Other evidence for local processes

5.1 Computer modelling and the small world effect

It is worth looking at syntactic models based on dependency grammar and related concepts. Dependency grammar assumes that syntactic structure consists of lexical nodes (words) and binary relations (dependencies) linking them. Though these models are word based, phrase structure emerges. A practical example is the Link Parser (Sleator et al. 2005), where you can parse your own texts on line and see how the constituent tree emerges. Now, it is reported (Ferrer i Cancho, 2004) that, in experiments in Czech, German and Romanian with a related system, about 70% of dependencies are between neighbouring words, 17% at a distance of 2 words, with fewer long range dependencies. This is one of the characteristics of the small world effect. A significant amount of syntactic knowledge is available from local information, before our grammatical capability is enhanced by the addition of long range dependencies associated with hierarchical structures.
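Given any dependency-parsed sample, proportions of this kind can be checked directly by histogramming the linear distance between each head and its dependent. The sketch below is ours and assumes dependencies are supplied as (head index, dependent index) pairs; it is not tied to the parser used in the cited study.

```python
from collections import Counter

def dependency_distance_profile(arcs):
    """Proportion of dependency arcs at each linear distance.
    arcs: iterable of (head_index, dependent_index) pairs."""
    distances = Counter(abs(head - dep) for head, dep in arcs)
    total = sum(distances.values())
    return {d: round(count / total, 2) for d, count in sorted(distances.items())}

# Hypothetical parse of "the old dog chased the cat", word indices 0..5.
arcs = [(2, 0), (2, 1), (3, 2), (3, 5), (5, 4)]
print(dependency_distance_profile(arcs))   # {1: 0.6, 2: 0.4}
```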

This again suggests that an intermediate stage in the development of a fully fledged grammar could have been based on local syntactic constraints.

Returning to another computer model, Elman's recurrent networks, we note that they could have a useful role to play in modelling short strings of words, but there are inherent obstacles to modelling longer dependencies (Bengio, 1996; Hochreiter et al. 2001).

5.2 Neurobiological evidence

Further evidence for concatenated linear segments as a basis for language structure is provided by the fact that primitive sequential processors in the basal ganglia play an essential role in language processing (Lieberman, 2000, 2002). Language processing is not confined to Broca's and Wernicke's neocortical areas. An overview of the evidence that language and motor abilities are connected is given in a special edition of Science (Holden, 2004). In detail, Lieberman reports on the results of recent investigations with fMRI and PET that indirectly track activity in the brain. The subcortical neural processors that control the sequencing of motor movements, which include articulatory acts, also play a role in sequencing cognitive activities. In studies of patients with Parkinson's disease, deficits in sequencing manual motor movements and linguistic sequencing in a sentence comprehension task were correlated (Lieberman, 2002, p.45). The basal ganglia play a part in sequencing elements that make up a component in speech production, and can interrupt it, such as switching a sequence at a clause boundary (ibid. p.57).

The importance of sequential processing supports the hypothesis that local dependencies play a key role in language production and perception. These local dependencies provide the internal cohesion for segments of an utterance, or piece of text, at a sub-sentential level. And these segments, which are concatenated to create a linear structure, are usually grammatical fragments.

5.3 Evidence from homophony

Any hypothesis on the evolution of language needs to explain why all languages seem to have homophones. In English some of the most frequently used words have more than one meaning, such as to / too / two. Even young children are able to disambiguate them without difficulty. In an agglutinative language such as Finnish they are rarely used by children, but occur in adult speech (Warren, 2001).

We can classify homophones into two groups: those in which the homophonous forms have the same part of speech, and those in which they do not. In English and other languages the latter group is much the larger (Ke et al. 2002). They are frequently function words, and the fact that we disambiguate them so easily provides clues to our underlying language processing abilities. Homophones such as there / their or no / know have dependencies on neighbouring words, so usually only one is possible in a given context. In the case of homophones with different parts of speech these dependencies are primarily syntactic. Local context is the key factor: words are not taken alone, but in sequential segments.
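A toy sketch of this idea (ours, not the authors' method) is to choose between homophonous spellings by the part of speech of the neighbouring word, here using counts gathered from a tiny hand-tagged sample; the BNC tags are those listed in Appendix A.

```python
from collections import Counter

# Tiny hand-tagged sample (word, BNC tag); a real model would use a large corpus.
tagged_sample = [
    ("there", "EX0"), ("is", "VBZ"), ("their", "DPS"), ("dog", "NN1"),
    ("there", "EX0"), ("are", "VBB"), ("their", "DPS"), ("house", "NN1"),
]

# Count how often each spelling is followed by a word with a given tag.
followed_by = Counter()
for (word, _), (_, next_tag) in zip(tagged_sample, tagged_sample[1:]):
    followed_by[(word, next_tag)] += 1

def choose(candidates, next_tag):
    """Pick the candidate most often seen before a word with this tag."""
    return max(candidates, key=lambda w: followed_by[(w, next_tag)])

print(choose(["there", "their"], "NN1"))   # 'their': a noun follows
print(choose(["there", "their"], "VBZ"))   # 'there': a verb follows
```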

5.4 An unrealistic model

Recently Nowak et al. have proposed that words in evolutionarily advanced language have a single meaning, that "the evolutionary optimum is reached if every word is associated with exactly one signal" (Nowak et al. 1999, page 151). It is also asserted that there is a "loss of communicative capacity that arises if individual sounds are linked to more than one meaning" (Nowak et al., 2002, page 613) and that lack of ambiguity is a mark of evolutionary fitness. While such models may be logically attractive, and could be the basis of a communication system for some artificial agents and engineering applications, in no way do they represent human language.

We might have expected that there would be an optimum number of phonemes that provide the basis for speech, but in fact the number of phonemes in different languages varies from about 12 to over 100 (Maddieson, 1984). There is massive redundancy, and no need for homophones because of a shortage of phonemes.

However, if we accept the hypothesis that local sequential processing underlies our language capability then there is not a problem in accounting for the homophone phenomenon: they are disambiguated by the local context, and there is no reason why they should not have occurred.

6 Conclusion

When we look for clues to the evolution of language we can examine the state we are in now and reason about how we could have arrived at the present position. This may take the form of brain studies, but it can also include the sort of analysis of language that we are doing. Chomsky (1957) once famously claimed that "One's ability to produce and recognize grammatical structures is not based on notions of statistical approximation and the like". We do not suggest that statistics are consciously used in the production or perception of speech: but they contribute to a post hoc analysis, and can illuminate the way in which language processing is carried out. Investigations on large corpora can now be done that were not possible a few decades back; a computer is to a linguist what a telescope is to an astronomer.

When we look at language around us we see that much of it is not composed of syntactically correct sentences. Newspaper headlines, advertisements, titles of books and papers are often sub-sentential, as is much informal conversation. However, though these fragments are not complete sentences, they are typically grammatical: if the words in a headline are mixed up the meaning is lost. We argue that such grammatical fragments are also the underlying components of longer elements such as sentences. There are local dependencies between neighbouring words that produce cohesive segments, and these segments make up a linear structure.

The fact that there are significant local dependencies does not mean that there are not also long distance dependencies. Consider a sentence like "The pipe connections of the boiler are regularly checked." As well as the linear phrasal structure we have number agreement between the head of the subject, the plural "connections", and the plural verb "are", although these words are at a distance. The noun preceding the verb is the singular "boiler", but this does not determine the number of the verb. The case we are making is that a full hierarchical grammar could have been preceded by an intermediate linear structure that has its own benefits.

In fact, this preliminary stage is still the basis for successful language processing applications, such as automated speech recognition, which typically treat language as having a regular grammar that can be processed using Markov models. Though a regular grammar is known to be inadequate, it can still produce acceptable results (Lambourne et al. 2004). Accuracy levels of around 95% are typical of current speech recognisers, which could be interpreted as an average error rate of one word in every sentence, assuming an average of 20 words a sentence. At this level of performance the constraints of a hierarchical grammar with long range dependencies are masked by the dominant local constraints.
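As a sketch of what such a Markov treatment looks like (ours, much simpler than a production recogniser's language model), a first-order model over the reduced tag codes scores locally well-formed orderings far above scrambled ones, while remaining blind to long range agreement.

```python
from collections import Counter
from math import log2

def train_bigram(tags):
    """First-order Markov transition probabilities over tag codes."""
    bigrams = Counter(zip(tags, tags[1:]))
    contexts = Counter(tags[:-1])
    return {(a, b): c / contexts[a] for (a, b), c in bigrams.items()}

def log_prob(tags, model, floor=1e-6):
    """Log2-probability of a tag sequence; unseen transitions get a small floor."""
    return sum(log2(model.get(pair, floor)) for pair in zip(tags, tags[1:]))

# Illustrative training data using Appendix A codes:
# 2 = determiner/article, 1 = adjective, 9 = singular noun,
# 26 = lexical verb, 0 = punctuation.
training = [2, 1, 9, 26, 2, 9, 0, 2, 9, 26, 2, 1, 9, 0]
model = train_bigram(training)
print(log_prob([2, 9, 26, 2, 9], model))   # locally plausible ordering
print(log_prob([9, 2, 26, 9, 2], model))   # scrambled ordering scores far lower
```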

Our experiments have indicated that utterances are processed in segments composed of a few words. These segments are either grammatical fragments or have a looser local sequential structure, making the utterance easier to comprehend than unstructured strings of words. We suggest that these segments may serve as the building blocks out of which a hierarchical grammar is built.

References

Bell, T. C., Cleary, J. G., Witten, I. H., 1990. Text Compression. Prentice Hall.
Bengio, Y., 1996. Neural Networks for Speech and Sequence Recognition. ITP.
Bickerton, D., 2002. Foraging versus social intelligence in the evolution of protolanguage. In: Wray, A. (Ed.), The Transition to Language. OUP, pp. 207-225.
BNC, visited March 2006. The British National Corpus. The BNC Consortium, http://www.hcu.ox.ac.uk/BNC.
Chomsky, N., 1957. Syntactic Structures. The Hague: Mouton.
Elman, J. L., 1991. Distributed representations, simple recurrent networks and grammatical structure. Machine Learning, 195-223.
Fang, A. C., Huckvale, M., 1996. Synchronising syntax with speech signals. In: Speech, Hearing and Language. University College London.
Ferrer i Cancho, R., 2004. Patterns in syntactic dependency networks. Physical Review E 69, 051915.
Ferrer i Cancho, R., Sole, R. V., 2001. The small world of human language. Proceedings of The Royal Society of London, Series B, Biological Sciences 268 (1482), 2261-2265.
Gobet, F., Lane, P. C. R., Croker, S., Cheng, P. C-H., Jones, G., Oliver, I., Pine, J. M., 2001. Chunking mechanisms in human learning. Trends in Cognitive Sciences 5 (6), 236-243.
Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., 2001. Gradient flow in recurrent nets: the difficulty of learning long term dependencies. In: Kremer, S. C., Kolen, J. F. (Eds.), A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Holden, C., 2004. The origin of speech. Science 303, 1316-1319.
Jelinek, F., 1990. Self-organized language modeling for speech recognition. In: Waibel, A., Lee, K. F. (Eds.), Readings in Speech Recognition. Morgan Kaufmann, pp. 450-503.
Ke, J., Wang, F., Coupe, C., 2002. The rise and fall of homophones: a window to language evolution. In: Proceedings of the 4th International Conference on the Evolution of Language.
Lambourne, A., Hewitt, J., Lyon, C., Warren, S., 2004. Speech-based real time subtitling services. International Journal of Speech Technology 4, 251-349.
Lieberman, P., 2000. Human Language and our Reptilian Brain. Harvard University Press.
Lieberman, P., 2002. On the nature and evolution of the neural bases of human language. Yearbook of Physical Anthropology.
Lyon, C., Brown, S., 1997. Evaluating parsing schemes with entropy indicators. In: MOL5, 5th Meeting on the Mathematics of Language.
Lyon, C., Dickerson, B., Nehaniv, C. L., 2003. The segmentation of speech and its implications for the emergence of language structure. Evolution of Communication 4 (2), 161-182.
Maddieson, I., 1984. Patterns of Sounds. Cambridge University Press.
Morgan, J., Shi, R., Allopenna, P., 1996. Perceptual bases of rudimentary grammatical categories. In: Morgan, J., Demuth, K. (Eds.), Signal to Syntax. Lawrence Erlbaum.
Nowak, M. A., Komarova, N. L., Niyogi, P., 2002. Computational and evolutionary aspects of language. Nature 417, 611-617.
Nowak, M. A., Plotkin, J. B., Krakauer, D. C., 1999. The evolutionary language game. J. Theoretical Biology 200, 147-162.
Shannon, C. E., 1993. Prediction and entropy of printed English (1951). In: Sloane, N. J. A., Wyner, A. D. (Eds.), Shannon: Collected Papers. IEEE Press.
Sleator, D., Temperley, D., Lafferty, J., 2005. Link Grammar. Carnegie Mellon University, http://www.link.cs.cmu.edu/link/, visited 17 Jan 2006.
Taylor, P., Black, A., 1997. Assigning phrase breaks from part-of-speech sequences. In: Eurospeech '97, Vol. 2.
Warren, S., 2001. Phonological acquisition and ambient language: a corpus based, cross-linguistic exploration. Ph.D. thesis, University of Hertfordshire, UK.

Appendix A

The tagset of the British National Corpus is mapped onto our tagset. Each of the BNC tags is mapped onto an integer, as shown below, so that functionally similar tags are grouped together.

Tag   Code   Description

AJ0   1      Adjective (general or positive) (e.g. good, old, beautiful)
AJC   1      Comparative adjective (e.g. better, older)
AJS   1      Superlative adjective (e.g. best, oldest)
AT0   2      Article (e.g. the, a, an, no)
AV0   3      General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest)
AVP   3      Adverb particle (e.g. up, off, out)
AVQ   3      Wh-adverb (e.g. when, where, how, why, wherever)
CJC   4      Coordinating conjunction (e.g. and, or, but)
CJS   4      Subordinating conjunction (e.g. although, when)
CJT   4      The subordinating conjunction that
CRD   2      Cardinal number (e.g. one, 3, fifty-five, 3609)
DPS   5      Possessive determiner-pronoun (e.g. your, their, his)
DT0   2      General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0
DTQ   2      Wh-determiner-pronoun (e.g. which, what, whose, whichever)
EX0   6      Existential there, i.e. there occurring in the there is ... or there are ... construction
ITJ   7      Interjection or other isolate (e.g. oh, yes, mhm, wow)
NN0   8      Common noun, neutral for number (e.g. aircraft, data, committee)
NN1   9      Singular common noun (e.g. pencil, goose, time, revelation)
NN2   10     Plural common noun (e.g. pencils, geese, times, revelations)
NP0   11     Proper noun (e.g. London, Michael, Mars, IBM)
ORD   1      Ordinal numeral (e.g. first, sixth, 77th, last)
PNI   12     Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
PNP   13     Personal pronoun (e.g. I, you, them, ours)
PNQ   14     Wh-pronoun (e.g. who, whoever, whom)
PNX   15     Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
POS   16     The possessive or genitive marker 's or '
PRF   17     The preposition of
PRP   18     Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
PUL   0      Punctuation: left bracket, i.e. ( or [
PUN   0      Punctuation: general separating mark, i.e. . , ! : ; - or ?
PUQ   0      Punctuation: quotation mark, i.e. ' or "
PUR   0      Punctuation: right bracket, i.e. ) or ]
TO0   19     Infinitive marker to
UNC   7      Unclassified items which are not appropriately considered as items of the English lexicon
VBB   20     The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative]
VBD   20     The past tense forms of the verb BE: was and were
VBG   21     The -ing form of the verb BE: being
VBI   22     The infinitive form of the verb BE: be
VBN   23     The past participle form of the verb BE: been
VBZ   24     The -s form of the verb BE: is, 's
VDB   20     The finite base form of the verb DO: do
VDD   20     The past tense form of the verb DO: did
VDG   21     The -ing form of the verb DO: doing
VDI   22     The infinitive form of the verb DO: do
VDN   23     The past participle form of the verb DO: done
VDZ   24     The -s form of the verb DO: does, 's
VHB   20     The finite base form of the verb HAVE: have, 've
VHD   20     The past tense form of the verb HAVE: had, 'd
VHG   21     The -ing form of the verb HAVE: having
VHI   22     The infinitive form of the verb HAVE: have
VHN   23     The past participle form of the verb HAVE: had
VHZ   24     The -s form of the verb HAVE: has, 's
VM0   25     Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)
VVB   26     The finite base form of lexical verbs (e.g. forget, send, live, return) [including the imperative and present subjunctive]
VVD   26     The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
VVG   27     The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
VVI   28     The infinitive form of lexical verbs (e.g. forget, send, live, return)
VVN   29     The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)
VVZ   30     The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
XX0   31     The negative particle not or n't
ZZ0   7      Alphabetical symbols (e.g. A, a, B, b, c, d)

Appendix B

The derivation of the formula for calculating entropy

This is derived from Shannon's work on the entropy of symbol sequences. He produced a series of approximations to the entropy H of written English, taking letters as symbols, which successively take more account of the statistics of the language.

H0 represents the average number of bits required to determine a symbol with no statistical information. H1 is calculated with information on single symbol frequencies; H2 uses information on the probability of 2 symbols occurring together; Hn, called the n-gram entropy, measures the amount of entropy with information extending over n adjacent symbols. As n increases from 0 to 3, the n-gram entropy declines: the degree of predictability is increased as information from more adjacent symbols is taken into account. If n − 1 symbols are known, Hn is the conditional entropy of the next symbol, and is defined as follows.

Let $b_i$ be a block of $n-1$ symbols, and let $j$ be an arbitrary symbol following $b_i$. Then $p(b_i, j)$ is the probability of the n-gram consisting of $b_i$ followed by $j$, and $p_{b_i}(j)$ is the conditional probability of symbol $j$ after block $b_i$, that is $p(b_i, j) / p(b_i)$. The conditional entropy is

$$
\begin{aligned}
H_n &= -\sum_{i,j} p(b_i, j)\,\log_2 p_{b_i}(j) \\
    &= -\sum_{i,j} p(b_i, j)\,\log_2 p(b_i, j) + \sum_{i,j} p(b_i, j)\,\log_2 p(b_i) \\
    &= -\sum_{i,j} p(b_i, j)\,\log_2 p(b_i, j) + \sum_{i} p(b_i)\,\log_2 p(b_i),
\end{aligned}
$$

since $\sum_{j} p(b_i, j) = p(b_i)$ for each block $b_i$, and hence $\sum_{i,j} p(b_i, j) = \sum_{i} p(b_i)$.

N.B. This notation is derived from that used by Shannon. It differs from that used, for instance, by Bell et al. (1990).
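The identity above can also be checked numerically: estimate the block entropies of the n-grams and of the (n-1)-symbol blocks from counts, and their difference equals the conditional entropy computed directly. The following short check is ours, not code from the original study.

```python
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def conditional_entropy_two_ways(symbols, n):
    """Return H_n computed directly and via the identity in Appendix B."""
    ngrams = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    blocks = Counter()
    for gram, count in ngrams.items():
        blocks[gram[:-1]] += count            # p(b_i) = sum_j p(b_i, j)
    total = sum(ngrams.values())
    direct = -sum(c / total * log2(c / blocks[g[:-1]]) for g, c in ngrams.items())
    via_identity = entropy_of_counts(ngrams) - entropy_of_counts(blocks)
    return direct, via_identity

print(conditional_entropy_two_ways(list("abracadabra" * 50), 3))  # equal values
```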