ENGLISH LINGUISTIC GUIDE · 2007-10-22

Date post: 19-Apr-2020
ENGLISH LINGUISTIC GUIDE

Despite the remarkable and irreversible changes that have come upon the English language since the Anglo-Saxon period, it has not yet reached a point of perfection and stability such as we sometimes associate with Latin of the Golden Age, the language of Virgil, Horace, and others.

DR. ROBERT BURCHFIELD, 'The English Language', in The Oxford Guide to the English Language, 1984

What a patch-work has been our old saxon, by the bitter frost that nipped its early budding, and the constant habit of borrowing thence resulting, the learned among us, as well as the unlearned, though in very different ways, are constantly made to feel. The English language as we have it now is not so much a coherent growth as a disturbed organism.

PROF. JOHN STUART BLACKIE, Gaelic Self Taught, ����


CONTENTS

1 ENGLISH ORTHOGRAPHY
1.1 Number of spellings
1.2 Spelling number
1.3 Language Code
1.4 Frequency of Spelling
1.5 Spelling
1.5.1 Diacritics
1.5.2 Reverse transcriptions
1.6 Spelling columns
1.6.1 Transcriptions for corpus types
1.6.2 Transcriptions for English lemmas
1.6.2.1 Spellings for English headwords
1.6.2.2 Spellings for syllabified headwords
1.6.3 Transcriptions for wordforms
1.6.3.1 Spellings for plain wordforms
1.6.3.2 Spellings for syllabified wordforms

2 ENGLISH PHONOLOGY
2.1 Number of pronunciations
2.2 Pronunciation number
2.3 Status of pronunciation
2.4 Phonetic transcriptions
2.4.1 Computer phonetic character sets
2.4.2 Plain transcriptions
2.4.3 Syllabified transcriptions
2.4.4 Stressed and syllabified transcriptions
2.4.5 Example transcriptions
2.5 Phonetic patterns

3 ENGLISH MORPHOLOGY
3.1 Morphology of English lemmas
3.1.1 How to segment a stem
3.1.2 Types of analyses
3.1.2.1 The Derivation
3.1.2.2 The Compound
3.1.2.3 The Derivational Compound
3.1.2.4 The Neo-Classical Compound
3.1.2.5 The Noun-Verb-Affix Compound
3.1.3 How to assign an analysis
3.1.3.1 The Noun-Verb-Affix Compound
3.1.4 Status and language codes
3.2 Derivational/compositional information
3.2.1 Analysis type codes
3.2.2 Immediate segmentation
3.2.3 Complete segmentation (flat)
3.2.4 Complete segmentation (hierarchical)
3.3 Other codes
3.4 Morphology of English wordforms
3.4.1 Inflectional features
3.4.2 Type of flection
3.4.3 Inflectional Transformation

4 ENGLISH SYNTAX
4.1 Word class codes - letters or numbers
4.2 Subclassification - Y or N
4.3 Subclassification nouns
4.4 Subclassification verbs
4.5 Subclassification adjectives
4.6 Subclassification adverbs
4.7 Subclassification numerals
4.8 Subclassification pronouns
4.9 Subclassification conjunctions

5 ENGLISH FREQUENCY
5.1 Frequency information for lemmas and wordforms
5.1.1 Frequency information from written and spoken sources
5.1.2 Written corpus information
5.1.3 Spoken corpus information
5.2 Frequency information for COBUILD corpus types
5.3 Frequency information for COBUILD written corpus types
5.4 Frequency information for COBUILD spoken corpus types


1 ENGLISH ORTHOGRAPHY

Detailed and varied information is available on the orthographic forms of headwords and wordforms. You can choose from a range of transcriptions: they can be syllabified or unsyllabified, they can include or omit diacritics (as explained below) or, in some cases, they come with the order of the letters reversed, or with the letters sorted alphabetically. In addition, there are columns which tell you the number of letters or syllables a particular transcription contains.

This FLEX window is the menu you see for a lemma or a wordform lexicon when you choose the Orthography option of the first ADD COLUMNS menu.

ADD COLUMNS

Number of spellings
Spelling number (1-N)
Language code
Frequency of spelling �
Spelling �

TOP MENU
PREVIOUS MENU

1.1 NUMBER OF SPELLINGS

This option in the ADD COLUMNS menu is a column which tells you how many ways each lemma, wordform or abbreviation (according to the type of lexicon you are using) can be spelt. For the verb lemma generalize, this column has the value 2, which means there are two possible ways of spelling it. Unless you construct a restriction on your lexicon, generalize will occur twice: one row using the form generalize, the other using the form generalise.

This column is particularly useful when you want to identify words which have spelling variants. To exclude from your lexicon all items which only have one possible spelling, keeping instead those which can be spelt in a number of ways, you can construct an expression restriction which simply states that the number of spellings must be greater than 1: OrthoCnt > 1.
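Outside FLEX, the same restriction can be pictured as a simple row filter. The sketch below is only an illustration in Python: the row layout and the function name `with_spelling_variants` are hypothetical, and the count for walk is an assumed value (the counts for generalize and monologue follow the examples in this section).

```python
# Hypothetical rows from a lemma lexicon: (headword, OrthoCnt).
# The value for "walk" is assumed for illustration.
rows = [
    ("generalize", 2),
    ("generalise", 2),
    ("walk", 1),
    ("monologue", 2),
    ("monolog", 2),
]


def with_spelling_variants(rows):
    """Keep only rows whose number of spellings is greater than 1."""
    return [(head, count) for head, count in rows if count > 1]


variants = with_spelling_variants(rows)
for head, count in variants:
    print(head, count)
```

Rows for words with a single spelling drop out, while every variant of a multiply-spelt word stays in.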

The FLEX name and description of this column are as follows:

OrthoCnt

(OrthoCntLemma)

Number of spellings

1.2 SPELLING NUMBER

Just as the very first available ADD COLUMNS option is a number which uniquely identifies each lemma or wordform (according to the type of lexicon you are using), so this column uniquely identifies every spelling to be found for each lemma or wordform.

If you are using a lemma lexicon, the spelling variants are given in the form of headwords (with or without syllable markers). For example, the verb generalize has two spellings: one is the form generalize, and another is the form generalise. These have the spelling numbers 1 and 2 respectively. If you use a syllabified stem representation in place of the plain headword representation, spelling 1 takes the form gen-er-al-ize and spelling 2 the form gen-er-al-ise.

This means you can use the universal sequence number to identify a particular lemma (or wordform, depending on the type of lexicon you are using) and then the spelling number to identify the different individual spellings used for each lemma. Usually, no preference is indicated by the spelling numbers each variant has: generalize is as valid as generalise. However, the number 1 spelling is always an acceptable British form, and any American variants are always given a higher number. Thus the lemma monologue has the British form monologue as its number 1 spelling and the American form monolog as its number 2 spelling. (You can find out whether a spelling is British or American by using the OrthoStatus column described below.)

One important point to remember is that the spelling number can be used to eliminate unwanted rows from your lexicon. If you only want to see one spelling for each lemma (or wordform), you should construct a restriction which states that only rows with a spelling number equal to 1 are to be included (in the form OrthoNum = 1). If you don't do this, you usually end up with lexicons that are too long because they needlessly repeat certain pieces of information. Take the example generalize again, only this time imagine you want to know its pronunciation rather than the various ways it can be spelt. You create a lexicon with three columns: one giving the spelling number, one giving the orthography of the stem, and one giving the pronunciation of the headword. Without the restriction, FLEX returns two rows for each pronunciation of generalize:

Spelling Number   Headword     Pronunciation
1                 generalize   dZ�E�n���r���l�aI�z�
2                 generalise   dZ�E�n���r���l�aI�z�
1                 generalize   dZ�E�n�r���l�aI�z�
2                 generalise   dZ�E�n�r���l�aI�z�

This is unnecessary, since you are interested only in the possible pronunciations. The extra row merely gives you a spelling variant while the pronunciations remain the same. When you include the restriction OrthoNum = 1, however, only the rows with the number 1 spelling are included:

Spelling Number   Headword     Pronunciation
1                 generalize   dZ�E�n���r���l�aI�z�
1                 generalize   dZ�E�n�r���l�aI�z�

And of course the more lemmas your lexicon contains, the greater the number of eliminated lines becomes, simply as a consequence of adding this important restriction.

If you are particularly interested in spelling variation, then do not add this OrthoNum = 1 restriction; that way you get to see all the variant orthographic forms of each lemma. Otherwise, whenever you just want to use a simple orthographic transcription as a means of representing the lemma in your lexicon, always remember to insert it.
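The effect of the spelling-number restriction can be sketched in a few lines of Python. This is an illustration only: the tuple layout and the function name `first_spelling_only` are hypothetical, and the pronunciation labels are placeholders rather than real transcriptions.

```python
# Hypothetical rows: (OrthoNum, headword, pronunciation label).
# "pron-A" and "pron-B" stand in for the two pronunciations of generalize.
rows = [
    (1, "generalize", "pron-A"),
    (2, "generalise", "pron-A"),
    (1, "generalize", "pron-B"),
    (2, "generalise", "pron-B"),
]


def first_spelling_only(rows):
    """Keep only the rows whose spelling number equals 1."""
    return [row for row in rows if row[0] == 1]


restricted = first_spelling_only(rows)
for ortho_num, head, pron in restricted:
    print(ortho_num, head, pron)
```

Each pronunciation now appears once, paired with the number 1 spelling, which is the shortened lexicon described above.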

The FLEX name and description of this column are as follows:

OrthoNum

(OrthoNumLemma)

Spelling number

1.3 LANGUAGE CODE

For every different spelling there is a code which tells you whether it is an acceptable British form (B) or whether it is only ever an American form (A). This applies to lemmas and wordforms. A British spelling is always the first one given (that is, its spelling number is always 1), while one which only occurs in American English is never the first form, and always has a spelling number greater than 1.

Spelling type   Status code   Example
British         B             adviser
American        A             advisor

Table 1: Orthographic status codes for English spellings

The FLEX name and description of this column are as follows:

OrthoStatus

(OrthoStatusLemma)

Status of spelling

1.4 FREQUENCY OF SPELLING

There are figures available which tell you how frequently each spelling of each lemma or wordform occurs in the COBUILD corpus, along with deviation figures which give a range of error for each frequency. They differ from the main frequency figures in that they are specific to one spelling, whereas the frequency columns proper refer to the more general frequency counts for the whole lemma or for each wordform.

To arrive at these figures, a count has to be made of the number of times each string occurs in the current 17.9 million word version of the COBUILD corpus. This lets you see that the string beauty (for example) occurs �� times, and that truth occurs ���� times, and these figures are the spelling frequencies for the wordforms which are spelt that way.

Usually (but not always) this string frequency is the same as the wordform frequency, and it's then possible to formulate spelling frequencies for each lemma. This simply means adding together the frequencies for each wordform in each inflectional paradigm. Thus the spelling frequency for the lemma beauty is �� (the spelling frequency for the wordform beauties) plus �� (the spelling frequency for the wordform beauty), giving a total of � �. Similarly the spelling frequency for the lemma truth is ��� (truths) plus ���� (truth), giving a total of � ��.

On the occasions when the string frequency cannot be linked to just one lemma, an alternative plan of action is used. Take as an example the spelling tender, which can refer to four different lemmas: the first is the adjective meaning soft or gentle, the second is the verb meaning offer or present, the third is a noun meaning financial estimate or proposition, and the fourth is a noun meaning the wagon which comes behind a steam engine. The problem is that the string frequency can be assigned to any of these lemmas; it is always ambiguous. To overcome this problem, it is possible to check every occurrence of tender in the corpus and work out exactly how many belong to each of the four lemmas. To a certain extent this can be done by computer program, but CELEX undertook the task by hand, reading occurrences in context and then deciding to which lemma the ambiguous string belongs. This approach clearly requires more time, but the investment yields a much more dependable result.

The problem is, though, that the words which require disambiguation (and there are approximately � �� of them) are usually very frequent. Disambiguating all the occurrences of just one word could involve reading thousands of corpus sentences. To avoid this, a random sample of occurrences is taken from the corpus, up to a maximum of 100 (whenever the frequency is greater than 100). Disambiguating such a set produces a simple ratio which can be used to calculate the final frequency figure.

The string tender occurs ��� times, and after examining 100 occurrences of the word in the COBUILD corpus, �� out of the hundred were gentle, � were the verb offer, � was the noun financial proposition, and � was the noun wagon. The ratio of one meaning to the other is thus ��� � � � � � � � � ��. The frequency of tender (that is, the adjective gentle) is then ��� multiplied by ���, which is � �, while the frequency of the verb offer is ��� multiplied by � ��, which is ��. The noun financial proposition and the noun wagon share the frequency of ��� multiplied by � ��, which rounded up comes to four.
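The sampling procedure just described comes down to scaling the string frequency by each lemma's share of the hand-checked sample. The Python sketch below illustrates that arithmetic only. The function name `apportion`, the lemma labels and every number in it are invented for the example; they are not the COBUILD counts for tender.

```python
def apportion(string_freq, sample_counts):
    """Split a string frequency across lemmas by their share of the sample.

    string_freq   -- how often the ambiguous string occurs in the corpus
    sample_counts -- lemma label -> occurrences found in the random sample
    """
    sample_size = sum(sample_counts.values())
    return {
        lemma: string_freq * count / sample_size
        for lemma, count in sample_counts.items()
    }


# Invented figures: a string occurring 200 times, with a sample of 100
# occurrences disambiguated by hand.
estimates = apportion(
    200,
    {"adjective": 90, "verb": 8, "noun (offer)": 1, "noun (wagon)": 1},
)
for lemma, freq in estimates.items():
    print(lemma, freq)
```

The estimates are left unrounded here; how fractional results are rounded for the published columns is not something this sketch tries to reproduce.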

However, the story is still not complete. Occasionally it is impossible to decide which lemma a particular spelling belongs to. The contraction I'll is one such word, since it can mean I will or I shall. In such cases a safe, unambiguous decision cannot be made, and the ratio is said to be �� � ��. If there are three possible options, the ratio is ����� � ����� � �����, and so on.

The result of this work is that you have an accurate frequency figure for each spelling of each lemma, wordform, or abbreviation in the database, and that figure is contained in this column, the FLEX name and description of which are as follows:

CobSpellFreq

(CobSpellFreqLemma)

Spelling frequency, COBUILD 17.9m word corpus

How accurate are the figures in the CobSpellFreq column? The answer is that if there are no ambiguities to be resolved, then the figures are naturally completely accurate. This is true for most of the words in the database. From the above description, though, it's clear that in certain cases, a degree of approximation is included. When ambiguities do occur, then it is possible to calculate a deviation figure which specifies the range of error to an accuracy of at least 95%. This is the required formula:

$$N \times 1.96 \times \sqrt{\frac{p\,(1-p)}{n} \times \frac{N-n}{N-1}}$$

where N is the frequency of the word as a whole, n is the total number of words which were disambiguated in the random sample, and p is the ratio figure for the word when it belongs to one particular lemma. Thus for tender (the adjective gentle), N is ���, n is 100, and p is ����, and the formula gives �� as the deviation. This means that the true frequency for this form of tender is almost certain (95% certain at least) to lie between ��� and ���.
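As a rough check on how the deviation behaves, the formula can be written out in Python. This is a sketch under stated assumptions: it takes the formula to be N × 1.96 × √(p(1 − p)/n × (N − n)/(N − 1)), with 1.96 as the usual multiplier for 95% confidence, and the function name `spelling_deviation` and the sample figures are invented for illustration.

```python
import math


def spelling_deviation(N, n, p, z=1.96):
    """95% deviation for a frequency estimated from a random sample.

    N -- frequency of the string as a whole
    n -- number of occurrences disambiguated in the sample
    p -- share of the sample belonging to the lemma in question
    z -- multiplier for the confidence level (1.96 for 95%, assumed here)
    """
    return N * z * math.sqrt(p * (1 - p) / n * (N - n) / (N - 1))


# Invented figures: 200 occurrences, 100 sampled, 90% judged to be one lemma.
dev = spelling_deviation(200, 100, 0.9)
estimate = 200 * 0.9
print(round(estimate - dev), "to", round(estimate + dev))
```

Two properties are worth noticing. When the whole population is sampled (n equal to N) the deviation is zero, and the deviation is largest when p is one half, which fits the remark that spellings which cannot be disambiguated carry the widest margins.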

Occasionally you may come across cases where the deviation figure is greater than or equal to the frequency figure itself. This indicates that you are dealing with a spelling which cannot be disambiguated, as with the example I'll discussed above. While the frequency figures in such cases are arbitrary, the accompanying deviation figures are 100% accurate.

So while CobSpellFreq gives the disambiguated frequency figure for each spelling, this column indicates the statistical deviation of that figure. Its FLEX name and description are as follows:

CobSpellDev

(CobSpellDevLemma)

95% confidence deviation, COBUILD 17.9m word corpus

Finally, remember that the columns described here refer only to the frequencies of individual spellings; most of the frequency information is dealt with in section 5 (English Frequency).

1.5 SPELLING

Before defining the specific spelling columns available with both of the English lexicon types, it's worth considering a few important general features which apply to many of the columns, namely diacritics and reversed transcriptions. After that come the individual spelling columns themselves.

1.5.1 DIACRITICS

As you work your way down the ADD COLUMNS menus, you can see that on several occasions the last menu in the series allows you to select transcriptions which contain (or omit) diacritics. Diacritics are the accents written above certain characters as a guide to pronunciation. Usually, only foreign words use such markers consistently in English: words like vicuña or soupçon or débâcle. These special accented characters are eight-bit characters designed for use on certain DIGITAL terminals (the VT��� and newer terminals). If you use such a terminal, or can get your own terminal to emulate it, then you can look at the diacritics columns with no problems at all. If you have a completely different terminal, you can still use diacritics columns by selecting the MODIFY COLUMNS option CONVERT to change the DIGITAL eight-bit codes to the form your terminal needs to produce the same diacritic characters.

To do this, you need a table of the DIGITAL eight-bit codes that CELEX uses, such as the one given in part � of the manual, the Appendices. In it you can find out the hexadecimal codes of the letters you need to convert. You also need a table of the codes your terminal uses to produce the same diacritical markers. The example that follows converts all the DIGITAL eight-bit codes that are used in the English database to their MS-DOS equivalents (as defined in the ���� Olivetti MS-DOS User Guide). The characters which occur with diacritic markers are as follows: à â ç è é ê ï ô ñ and ü. When you reach the MODIFY CONVERSION window, first select a column which contains transcriptions with diacritics, then type in the following string:

�x� ��x�F��

��xE ��x����xE���x�� ��xE���x����xE���x�A

��xE���x����xEA��x�� ��xEF��x�B��xF���x��

��xF���xA���xFC��x����

Once installed, this pattern will convert all the diacritic characters whenever you SHOW or EXPORT the column. If you're new to the pattern matcher and its capabilities then it may appear very mysterious, but in fact it's straightforward. Read the next couple of paragraphs for a full explanation.

The first line indicates that one or more normal ASCII codes (those with hexadecimal values between � and �F) are allowed.

The remaining lines indicate the changes that must be made to any 8-bit characters that occur. The pattern matcher uses the � sign to indicate a conversion: the element to the left of the � is converted to the element on the right. (This use of the � sign is different from the 'wildcard' function it has in an expression restriction or query.) The pattern matcher also uses the symbols \x to mean that the two characters which follow form a hexadecimal code; thus in the DIGITAL eight-bit code \xF1 actually means ñ. In the MS-DOS coding set, the same ñ character is represented by the code \xA4. So to tell the pattern matcher to convert from a DIGITAL ñ to an MS-DOS ñ, you must type \xF1�\xA4.

So far, this accounts for one diacritic character. To convert all the diacritic characters, you have to add extra parts to the pattern as appropriate, until you end up with a pattern like the one above. Each element is separated by the or marker �. The whole pattern comes between brackets followed by an asterisk at the end, which means 'the word may be made up of zero or more of the elements between the brackets'.
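The conversion described above is a byte-for-byte substitution, which can be pictured outside FLEX with a small lookup table. The Python sketch below is an illustration, not the FLEX pattern itself. It assumes that the DIGITAL codes for these ten letters coincide with ISO Latin-1 and that the MS-DOS set meant is code page 437; the code values come from those standard tables rather than from the manual, and the names `DIGITAL_TO_MSDOS` and `convert` are hypothetical.

```python
# Assumed mapping for the ten accented letters: DIGITAL (Latin-1 compatible)
# code on the left, MS-DOS code page 437 code on the right.
DIGITAL_TO_MSDOS = {
    0xE0: 0x85,  # a grave
    0xE2: 0x83,  # a circumflex
    0xE7: 0x87,  # c cedilla
    0xE8: 0x8A,  # e grave
    0xE9: 0x82,  # e acute
    0xEA: 0x88,  # e circumflex
    0xEF: 0x8B,  # i diaeresis
    0xF4: 0x93,  # o circumflex
    0xF1: 0xA4,  # n tilde
    0xFC: 0x81,  # u diaeresis
}


def convert(data: bytes) -> bytes:
    """Replace each mapped byte; leave all other bytes untouched."""
    return bytes(DIGITAL_TO_MSDOS.get(byte, byte) for byte in data)


digital = "vicuña".encode("latin-1")
msdos = convert(digital)
print(msdos.decode("cp437"))
```

Plain ASCII bytes pass through unchanged, which corresponds to the first line of the pattern; each table entry corresponds to one conversion element.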

1.5.2 REVERSE TRANSCRIPTIONS

Transcriptions without diacritics are often available in reverse order: each item is given back to front. Thus back is given as kcab. The reason for this is that with a draft lexicon, looking up word endings can be done much more quickly when you use reverse transcriptions.
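The reason reversed forms help with endings is that a shared ending becomes a shared beginning, so an ordinary prefix search or sort does the work. A minimal Python sketch, with a hypothetical word list and function names:

```python
def reverse(word):
    """Return the word back to front, as in a reverse transcription."""
    return word[::-1]


def ending_with(suffix, heads):
    """Find words with a given ending by prefix-matching the reversed forms."""
    prefix = reverse(suffix)
    return [head for head in heads if reverse(head).startswith(prefix)]


# Hypothetical list of headwords.
heads = ["back", "walk", "increase", "embargo", "talk"]

print(reverse("back"))
print(ending_with("alk", heads))
# Sorting on the reversed form groups words with the same ending together.
print(sorted(heads, key=reverse))
```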

1.6 SPELLING COLUMNS

This section sets out the columns with spellings available for each lexicon type. First there is a short subsection dealing with corpus type transcriptions, then a longer subsection on the headword transcriptions available with a lemma lexicon, finishing up with a subsection on wordform transcriptions.

1.6.1 TRANSCRIPTIONS FOR CORPUS TYPES

One column is available. It gives plain transcriptions, which include lower case letters, hyphens, full stops, apostrophes, round brackets, and digits. If you're not sure exactly what corpus types are, check part 1 of the manual, the Introduction, to find out. The FLEX name and description of this column are as follows:

Type

Graphemic transcription

1.6.2 TRANSCRIPTIONS FOR ENGLISH LEMMAS

The English lemma is always represented by the headword, as described in the Introduction, section ���. When you choose a column which contains orthographic transcriptions of headwords, it is as if you are choosing the bold-type headword in a dictionary. All the other columns in the database contain information specific to individual headwords, so the main function of the orthographic transcription is to identify any other information you look up: looking at a list of lemma frequency figures isn't meaningful unless you can see the lemmas they refer to. However, you may not always need to see the orthographic form of the headword. If you're looking for phonetic transcriptions with certain interesting syllable-final characteristics, say, you may not be interested in the orthographic headword, in which case you needn't keep it on view, and you might even want to leave it out of your lexicon altogether.

Described below are several different forms of orthographic transcription, and each form is assigned its own column. The first distinction you can make between them is whether or not syllable markers are included. Thereafter you can choose between back-to-front transcriptions, transcriptions with (or without) diacritics, transcriptions which consist only of lower case characters, and even transcriptions with the letters of the headword re-ordered alphabetically. Read the details, and then choose the columns which best help you to build useful lexicons.

1.6.2.1 SPELLINGS FOR ENGLISH HEADWORDS

There are six columns available which do not give any indication of the orthographic syllabification; they just deal with the letters in each headword.

ADD COLUMNS

Without diacritics
Without diacritics, reversed
With diacritics
Purely lowercase alphabetical
Purely lowercase alphabetical, sorted
Number of letters

TOP MENU
PREVIOUS MENU

The first column has information which is basic to the other five columns. It simply contains headwords composed of upper and lower case characters, hyphens and apostrophes, with no diacritics or any other alterations. So, the headword which represents the verbal family of inflections walk, walks, walking and walked is, quite simply, walk. (For information about which forms are used as headwords, see the Introduction, section ���.) The FLEX name and description of this column are as follows:

Head

(HeadLemma)

Headword, without diacritics

The second column contains the same transcriptions as the first, only the order of the letters is reversed. Thus the headword walk is given as klaw, and increase is given as esaercni. (However, with words like repaper you might not notice much of a difference. And proceed cautiously with the word embargo: the reverse transcription is ograbme.) The FLEX name and description of this column are as follows:

HeadRev

(HeadRevLemma)

Headword, reversed

The third column gives spellings which include diacritics as well as the basic upper and lower case characters, hyphens and apostrophes of the basic transcriptions. So, while the first column gives the plain form cloisonne, this column includes the authentic French acute accent: cloisonné. Likewise debacle becomes débâcle and deshabille becomes déshabillé. The characteristics of diacritics are described in section 1.5.1 above. The FLEX name and description of this column are as follows:

HeadDia

(HeadDiaLemma)

Headword, diacritics

The fourth column contains the same basic transcription as the first except that any upper case letters which occur are reduced to lower case, and any non-alphabetic characters are removed. Thus Jeremiah becomes jeremiah and Sax becomes sax.

Such a column is useful when you're trying to sort a list of words into true alphabetical order, as opposed to ASCII order. Each letter, whether upper or lower case, has a different ASCII number, and computers usually sort and order letters on the basis of these numbers. Because lower case letters all have higher numbers than upper case ones, the results of a sort program aren't always what you expect. However with this column, since all the characters are lower case, the problem doesn't arise. So, to make a file which contains a true alphabetical list of plain headwords, make a lexicon which consists of the plain headwords column Head and this column, and when you export it, put a � against this purely lower case column, and in this way alphabetic normality will be restored. The FLEX name and description of this column are as follows:

HeadLow

(HeadLowLemma)

Headword, lowercase, alphabetical
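The difference between ASCII order and true alphabetical order is easy to reproduce. The Python sketch below imitates the purely lowercase column with a hypothetical helper, `head_low`, and uses it as a sort key; the word list is invented, and the helper assumes headwords without diacritics.

```python
import re


def head_low(head):
    """Lower-case the headword and drop every non-alphabetic character."""
    return re.sub(r"[^a-z]", "", head.lower())


# Hypothetical list of plain headwords.
heads = ["walk", "Jeremiah", "Sax", "apple"]

ascii_order = sorted(heads)                # capitals sort before lower case
alpha_order = sorted(heads, key=head_low)  # true alphabetical order

print(ascii_order)
print(alpha_order)
```

Sorting on the lowercase form while displaying the plain headword is the same idea as exporting Head alongside the purely lowercase column.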

The most important feature of the fifth column is that the letters which make up each headword are sorted into alphabetical order. (This does not refer to the dictionary-like alphabetical order of headwords listed in the database; it's to do with the letters within each word.) And in addition, any upper case characters are reduced to lower case, and non-alphabetic characters are removed. Thus, for example, Jeremiah becomes aeehijmr, and dread (perhaps confusingly) becomes adder. Using this column, anagrams can be solved quickly, and searches for words containing certain numbers of letters can be carried out with ease: creating a query which looks for aaa� in this column can return a list of words (from another column) which contain at least three a characters. The FLEX name and description of this column are as follows:

HeadLowSort

(HeadLowSortLemma)

Headword, lowercase, alphabetical, sorted
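To see why sorted letters make anagram and letter-count searches simple, here is a small Python sketch. The helper name `head_low_sort` and the word list are hypothetical; the Jeremiah and dread results match the examples above.

```python
def head_low_sort(head):
    """Lower-case the headword, keep only a-z, and sort the letters."""
    return "".join(sorted(ch for ch in head.lower() if "a" <= ch <= "z"))


# Hypothetical list of headwords.
heads = ["dread", "adder", "banana", "bazaar", "walk"]

# Two words are anagrams when their sorted-letter forms are identical.
anagrams_of_dread = [h for h in heads if head_low_sort(h) == head_low_sort("dread")]

# Words with at least three a's have a sorted form beginning with "aaa".
three_as = [h for h in heads if head_low_sort(h).startswith("aaa")]

print(head_low_sort("Jeremiah"))
print(anagrams_of_dread)
print(three_as)
```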

The sixth and last column contains counts of the number of letters in each headword. Here letters means any upper or lower case alphabetic characters, excluding hyphens and apostrophes. This means that sometimes the length of a word is different from the number of letters it contains: the number of letters in fo'c'sle for example is 6. The FLEX name and description of this column are as follows:

HeadCnt

(HeadCntLemma)

Headword, number of letters
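The distinction between the length of a word and its number of letters can be shown in a couple of lines of Python. The function name `head_cnt` is hypothetical, and the sketch assumes plain headwords without diacritics.

```python
def head_cnt(head):
    """Count alphabetic characters only, ignoring hyphens and apostrophes."""
    return sum(1 for ch in head if ch.isalpha())


word = "fo'c'sle"
print(len(word), head_cnt(word))
```

The string is eight characters long but contains six letters, since the two apostrophes are not counted.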

1.6.2.2 SPELLINGS FOR SYLLABIFIED HEADWORDS

There are two columns which contain headwords with their orthographic syllable markers. In these columns, a hyphen marks the boundary between each pair of syllables within the headword. Thus the plain headword abandonment is given as a-ban-don-ment. There is a third column relating to syllabified headwords, and it tells you the number of orthographic syllables each headword has.

ADD COLUMNS

Without diacritics
With diacritics
Number of syllables

TOP MENU
PREVIOUS MENU

The first column contains the basic headwords plus syllable markers, each transcription consisting of upper and lower case characters, hyphens and apostrophes. The FLEX name and description of this column are as follows:

HeadSyl

(HeadSylLemma)

Headword, syllabified

The second column contains the same headwords as the first, except that diacritics are included where appropriate. The FLEX name and description of this column are as follows:

HeadSylDia

(HeadSylDiaLemma)

Headword, syllabified, diacritics

Some people like to use only partially syllabified headwords, that is, syllabified transcriptions which omit the first syllable marker if the first syllable consists of only one letter. For example, the partially syllabified transcription of abandonment would be aban-don-ment. Such transcriptions are useful for automatic hyphenation programs, since typographic convention says that the part of a divided word left at the end of a line should consist of more than one character. To obtain transcriptions in this form, you can use the CONVERT option of the MODIFY COLUMNS menu. When you reach the MODIFY CONVERSION window, select a column containing normal syllabified headwords, and then type the following string:

����������

This means 'first there is one character of some sort. Then, if there is a hyphen followed by a character which is not a hyphen, convert the hyphen into nothing; then there are zero or more other characters of some sort'. Thus whenever you SHOW or EXPORT your lexicon, the syllabified transcriptions will always appear in partially syllabified form. Two hyphens together after a first letter indicate that there is an orthographic hyphen (as opposed to a syllable marker) in the spelling at this point (as in T-shirt, for example). They are left as two hyphens to differentiate this sort of hyphen from the other syllable markers the word might contain.
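The rule spelled out above translates directly into a regular expression. The Python sketch below is an equivalent written with the standard `re` module, not the FLEX pattern itself, and the function name `partially_syllabified` is hypothetical.

```python
import re


def partially_syllabified(head_syl):
    """Drop the first syllable marker when the first syllable is one letter.

    One leading character, then a hyphen that is followed by a non-hyphen:
    the hyphen is removed. A double hyphen (orthographic hyphen) is kept.
    """
    return re.sub(r"^(.)-(?=[^-])", r"\1", head_syl)


for form in ["a-ban-don-ment", "gen-er-al-ize", "T--shirt"]:
    print(form, "->", partially_syllabified(form))
```

Only forms whose first syllable is a single letter change; the double hyphen in T--shirt fails the lookahead and is left alone, as the explanation requires.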

The third and last column for syllabified headwords tells you how many syllables each headword contains. The number of syllables in the word a-ban-don-ment, for example, is 4. The FLEX name and description of this column are as follows:

HeadSylCnt (HeadSylCntLemma)   Number of orthographic syllables

����� TRANSCRIPTIONS FOR WORDFORMS

Wordforms are the words we use in everyday speech and writing. Elsewhere in the database, families of wordforms (inflectional paradigms) are represented by one form, the lemma. When you work with a wordform lexicon, all the wordforms are available as separate entries, not just one representative form. All the other columns in the database contain information specific to individual wordforms, so the main function of the orthographic transcription is to identify any other information you look up: looking at a list of syntactic class codes isn't very meaningful unless you can see the wordforms they refer to. A full description of the properties of wordforms can be found in part one of the manual, the Introduction, under the section called 'Lexicon types'.

Orthographic transcriptions of wordforms are available either with or without syllable markers. The next section deals with plain (unsyllabified) transcriptions, and the one after that deals with syllabified transcriptions.

������� SPELLINGS FOR PLAIN WORDFORMS

Described below are several different forms of orthographic transcription, and each form is assigned its own column. Using the ADD COLUMNS menu shown below, you can choose between back-to-front transcriptions, transcriptions with (or without) diacritics, transcriptions which consist only of lower case characters, and even transcriptions with the letters of each wordform re-ordered alphabetically. Read the details below, and then choose the columns which best help you to build useful lexicons.

ADD COLUMNS

Without diacritics
Without diacritics, reversed
With diacritics
Purely lowercase alphabetical
Purely lowercase alphabetical, sorted
Number of letters

TOP MENU
PREVIOUS MENU

The first column contains information which is basic to the other five columns. It simply contains wordforms composed of upper and lower case characters, hyphens and apostrophes, with no diacritics or any other alterations.

Word Word

The second column contains all the wordforms to be found in the first column, except that the order of the letters is reversed. Thus the wordform walks is given as sklaw, and increased is given as desaercni. (However, with wordforms like deified you might not notice a big difference. And proceed cautiously with the word desserts, because in this column it has to be stressed.) The FLEX name and description of this column are as follows:

WordRev   Word, reversed
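The reversed column is simple to reproduce for checking purposes. The sketch below is only an illustration in Python (the function name is an assumption, not part of FLEX); it also shows one common reason for wanting reversed spellings, namely that sorting them groups words by their endings.

```python
def reverse_word(word: str) -> str:
    """Return the letters of a wordform in back-to-front order."""
    return word[::-1]


for word in ["walks", "increased", "deified", "desserts"]:
    print(word, "->", reverse_word(word))

# Sorting on the reversed form brings words with the same ending together.
words = ["walks", "talked", "talks", "walked"]
print(sorted(words, key=reverse_word))  # ['talked', 'walked', 'talks', 'walks']
```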

The third column gives spellings which include diacritics as well as the basic upper and lower case characters, hyphens and apostrophes of the basic transcriptions. The characteristics of diacritics are described in the section on diacritics above. The FLEX name and description of this column are as follows:

WordDia   Word, diacritics

The fourth column contains the same basic transcription as the first except that any upper case letters which occur are reduced to lower case. Thus Peking becomes peking and Uranus becomes uranus. In addition, any non-alphabetic characters (hyphens, apostrophes) are removed. Such a column is useful when you're trying to sort a list of words into true alphabetical order, as opposed to ASCII order. Each letter, whether upper or lower case, has a different ASCII number, and computers usually sort and order letters on the basis of these numbers. Because lower case letters all have higher numbers than upper case ones, the results of a sort program aren't always what you expect. However, with this column, since all the characters are lower case, the problem doesn't arise. So, to make a file which contains a true alphabetical list of plain wordforms, make a lexicon which consists of the plain wordforms column Word and this column, and when you export it, mark this purely lower case column as the one to sort on. In this way alphabetic normality will be restored. The FLEX name and description of this column are as follows:

WordLow   Word, lowercase, alphabetical
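The difference between ASCII order and true alphabetical order is easy to see in a small Python sketch. The helper name below is an assumption for illustration; it mimics the column by lowercasing each wordform and dropping hyphens and apostrophes.

```python
def lowercase_alphabetical(word: str) -> str:
    """Lowercase a wordform and drop any non-alphabetic characters."""
    return "".join(ch.lower() for ch in word if ch.isalpha())


words = ["zebra", "Peking", "apple", "Uranus", "shouldn't"]

# Plain sort: upper case letters have lower codes, so they come first.
print(sorted(words))
# ['Peking', 'Uranus', 'apple', "shouldn't", 'zebra']

# Sorting on the lowercase alphabetical form gives dictionary-like order.
print(sorted(words, key=lowercase_alphabetical))
# ['apple', 'Peking', "shouldn't", 'Uranus', 'zebra']
```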

The most important feature of the fifth column is that the letters which make up each wordform are sorted into alphabetical order. (This does not refer to the dictionary-like alphabetical order of wordforms listed in the database; it's to do with the letters within each word.) And in addition, any upper case characters are reduced to lower case. Thus Peking is given as egiknp, and Uranus as anrsuu. (Another example is the word editorials, whose sorted form is adeiilorst. Interestingly enough, adeiilorst is also the sorted form of the word idolatries.) Using this column, anagrams can be solved quickly, and searches for words containing certain numbers of letters can be carried out with ease: creating a query which looks for aaa in this column can return a list of words (from another column) which contain at least three a characters. The FLEX name and description of this column are as follows:

WordLowSort   Word, lowercase, alphabetical, sorted
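A minimal Python sketch of the same idea follows. The function name is an assumption, and the sketch assumes that non-alphabetic characters are dropped as in the lowercase alphabetical column. Two words are anagrams exactly when their sorted forms are equal, and a run such as aaa in the sorted form means the word has at least three a characters.

```python
from collections import defaultdict


def sorted_letters(word: str) -> str:
    """Lowercase a wordform, keep letters only, and sort them alphabetically."""
    return "".join(sorted(ch.lower() for ch in word if ch.isalpha()))


print(sorted_letters("Peking"))      # egiknp
print(sorted_letters("editorials"))  # adeiilorst

# Group words that share a sorted form: each group is a set of anagrams.
groups = defaultdict(list)
for word in ["editorials", "idolatries", "Uranus", "banana"]:
    groups[sorted_letters(word)].append(word)
print(groups["adeiilorst"])  # ['editorials', 'idolatries']

# Words with at least three a characters.
print([w for w in ["banana", "Uranus"] if "aaa" in sorted_letters(w)])  # ['banana']
```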

The sixth and last column contains counts of the number of letters in each wordform. Here letters means any upper or lower case alphabetic characters, excluding hyphens and apostrophes. This means that sometimes the length of a word is different from the number of letters it contains: the number of letters in shouldn't, for example, is 8. The FLEX name and description of this column are as follows:

WordCnt   Word, number of letters
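The distinction between string length and letter count can be checked with a short Python sketch (the function name is an assumption for illustration):

```python
def letter_count(word: str) -> int:
    """Count alphabetic characters only, ignoring hyphens and apostrophes."""
    return sum(1 for ch in word if ch.isalpha())


print(len("shouldn't"))           # 9 characters in the string
print(letter_count("shouldn't"))  # 8 letters
```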

������� SPELLINGS FOR SYLLABIFIED WORDFORMS

There are two columns which contain wordforms with their orthographic syllable markers. In these columns, a hyphen marks the boundary between each pair of syllables within the wordform. Thus the plain wordform abandoning is given as a-ban-don-ing. There is a third column relating to syllabified wordforms, and it tells you the number of orthographic syllables each wordform has.

ADD COLUMNS

Without diacritics
With diacritics
Number of syllables

TOP MENU
PREVIOUS MENU

The first column contains wordforms plus syllable markers, each transcription consisting of upper and lower case characters, hyphens and apostrophes. The FLEX name and description of this column are as follows:

WordSyl   Word, syllabified

The second column contains the same wordforms as the first, except that diacritics (as explained in the section on diacritics above) are included where appropriate. The FLEX name and description of this column are as follows:

WordSylDia   Word, syllabified, with diacritics

Some people like to use only partially syllabified wordforms, that is, syllabified transcriptions which omit the first syllable marker if the first syllable consists of only one letter. For example, the partially syllabified transcription of abandoning is aban-don-ing. Such transcriptions are useful for automatic hyphenation programs, since typographic convention says that when a word is divided at the end of a line, each part should consist of more than one character. To obtain transcriptions in this form, you can use the CONVERT option of the MODIFY COLUMNS menu. When you reach the MODIFY CONVERSION window, select a column containing normal syllabified wordforms, and then type the following string:

���part��������part����part���part��

This basically means 'give the first character a name. The second character may or may not be a hyphen. Any subsequent characters are given a second name. Re-write the whole word as the first named part plus the second'. Only the parts of the word assigned to one of the two names are re-written, thus excluding the first hyphen whenever it occurs, because it is not assigned to any variable. When you SHOW or EXPORT your lexicon, the syllabified transcriptions will always appear in partially syllabified form. However, note that on this occasion, when a double (orthographic) hyphen occurs after the first letter, only one of the two hyphens is written. So if you ever do see a hyphen as the second character in your converted column, you know for sure that it is actually an orthographic and not a syllabic hyphen.
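The behaviour of this second conversion, including its effect on a double hyphen, can be sketched in Python as follows. This is an illustration only and not the FLEX pattern syntax; the function name and the `T--shirts` test spelling are assumptions.

```python
def partially_syllabify_wordform(syllabified: str) -> str:
    """Keep the first character, skip one hyphen if it comes next, keep the rest."""
    if len(syllabified) >= 2 and syllabified[1] == "-":
        return syllabified[0] + syllabified[2:]
    return syllabified


print(partially_syllabify_wordform("a-ban-don-ing"))  # aban-don-ing
print(partially_syllabify_wordform("T--shirts"))      # T-shirts
```

Because exactly one hyphen is skipped, a double hyphen is reduced to a single one, which is why any hyphen left in second position must be orthographic.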

The third and last column for syllabified wordforms tells you how many syllables each wordform contains. The number of syllables in the word a-ban-don-ing, for example, is 4. The FLEX name and description of this column are as follows:

WordSylCnt   Word, number of orthographic syllables

� ENGLISH PHONOLOGY

Phonetic transcriptions are available for each lemma and wordform in the database. They are specified in four different character sets, and variant pronunciations are also given whenever they occur. The transcriptions you choose can also include syllable markers or stress markers. Each pronunciation has a unique identification number, so that in conjunction with the lemma or wordform number, you can identify every single pronunciation in the database. Also available are CV patterns, stress patterns, and phoneme and phonetic syllable counts. In addition, when you are using a wordform lexicon, you can get phonetic information (and other information too) about the lemmas of any of the wordforms. The sections below deal with the information under the headings which correspond to those used in the ADD COLUMNS menus.

ADD COLUMNS

Number of pronunciations
Pronunciation number (1-N)
Status of pronunciation
Pronunciation >
Phonetic Patterns >

TOP MENU
PREVIOUS MENU

Exactly the same columns are available for lemma lexicons and wordform lexicons, so all the phonetic column descriptions and definitions are equally valid for both lexicon types.

��� NUMBER OF PRONUNCIATIONS

This option in the ADD COLUMNS menu is a column which tells you how many ways each lemma or wordform (according to the type of lexicon you are using) can be pronounced. For the lemma dexterous, this column has the value 2, which means there are two possible ways of pronouncing it. Unless you construct a restriction on your lexicon, dexterous occurs twice: one row with d.E.k.s.t.@.r.@.s. and another row with d.E.k.s.t.r.@.s. (these examples are from the PhonSAM column).

This column is particularly useful when you want to identify words which have pronunciation variants. To exclude from your lexicon all lemmas or wordforms which only have one possible pronunciation, keeping instead those which can be pronounced in a number of ways, you can construct an expression restriction which simply states that the number of pronunciations must be greater than 1: PronCnt > 1.

The FLEX name and description of this column are as follows:

PronCnt (PronCntLemma)   Number of Pronunciations
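To make the effect of such a restriction concrete, here is a small Python sketch that works on exported rows rather than inside FLEX. The row layout and the filler entry for pit are assumptions made for the example; the two dexterous transcriptions are the PhonSAM forms discussed above.

```python
from collections import Counter

# Hypothetical exported rows: (headword, delimited SAM-PA transcription).
rows = [
    ("dexterous", "d.E.k.s.t.@.r.@.s."),
    ("dexterous", "d.E.k.s.t.r.@.s."),
    ("pit", "p.I.t."),  # invented filler row with a single pronunciation
]

# Number of pronunciations per headword, playing the role of PronCnt.
pron_cnt = Counter(headword for headword, _ in rows)

# Equivalent of the restriction PronCnt > 1: keep only words with variants.
variants = [(h, t) for h, t in rows if pron_cnt[h] > 1]
for headword, transcription in variants:
    print(headword, transcription)
```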

��� PRONUNCIATION NUMBER

Just as the very first available ADD COLUMNS option is a number which uniquely identifies each lemma or wordform (according to the type of lexicon you are using), so this column uniquely identifies every pronunciation to be found for each lemma, wordform, or abbreviation.

For example, the adjective dexterous has two pronunciations: first d.E.k.s.t.@.r.@.s., and second an alternative form d.E.k.s.t.r.@.s. These have the pronunciation numbers 1 and 2 respectively. If you use a syllabified transcription in place of the plain transcription, pronunciation 1 takes the form dEk-st@-r@s and pronunciation 2 the form dEk-str@s.

This means you can use the universal sequence number to identify a particular lemma or wordform and then the pronunciation number to identify the different individual pronunciations given for each lemma. Moreover, the pronunciation number allows you to identify quickly the 'primary' pronunciation (as laid down in the English Pronouncing Dictionary by Daniel Jones, A.C. Gimson and Susan Ramsaran) because such forms are always first in the list: for every lemma or wordform, the number 1 pronunciation is the 'primary' form. The classifications of variant pronunciations are dealt with fully in the next section on the 'status of pronunciation' column.

One important point to remember is that the pronunciation number can be used to eliminate unwanted rows from your lexicon. If you only want to see one pronunciation for each lemma (or whatever), you should make a restriction which states that only rows with a pronunciation number equal to 1 are to be included (in the form PronNum = 1). If you don't do this, you usually end up with lexicons that are too long because they needlessly repeat certain pieces of information. Take the example dexterous again, in a lexicon with three columns: one giving the pronunciation number, one giving the orthography of the headword, and one giving the pronunciation of the headword. Without the restriction, FLEX returns two rows for dexterous:

Pronunciation  Headword   Pronunciation
Number
1              dexterous  dEk-st@-r@s
2              dexterous  dEk-str@s

You may find this unnecessary if you need to see only one 'default' pronunciation: the extra row merely gives you an extra pronunciation. When you include the restriction PronNum = 1, however, only the row with the preferred pronunciation is included:

Pronunciation  Headword   Pronunciation
Number
1              dexterous  dEk-st@-r@s

And of course the more lemmas or wordforms your lexicon contains, the greater the number of eliminated lines becomes, simply as a consequence of adding this important restriction.

For some words, there are many variants given: transitional has forty possible pronunciations, each of which is given on a separate row in the database. If you are particularly interested in pronunciation variation of this kind, then do not add the PronNum = 1 restriction; that way you get to see all the variant forms. Otherwise, whenever you just want to use a simple phonetic transcription, always remember to insert it.
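The same restriction can be mimicked on exported data. The following Python sketch is an illustration only: the dictionary keys are assumed names for the three columns of the dexterous example, not a FLEX export format.

```python
# Hypothetical exported rows for the three-column dexterous example.
rows = [
    {"PronNum": 1, "Headword": "dexterous", "PhonSylSAM": "dEk-st@-r@s"},
    {"PronNum": 2, "Headword": "dexterous", "PhonSylSAM": "dEk-str@s"},
]

# Equivalent of the restriction PronNum = 1: one row per lemma or wordform.
primary_only = [row for row in rows if row["PronNum"] == 1]

for row in primary_only:
    print(row["PronNum"], row["Headword"], row["PhonSylSAM"])
```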

The FLEX name and description of this column are as follows:

PronNum (PronNumLemma)   Pronunciation ID number

��� STATUS OF PRONUNCIATION

For every different pronunciation, there is a code which tells you whether it is a primary pronunciation (P) or a secondary pronunciation (S). This applies to both lemmas and wordforms. According to the English Pronouncing Dictionary, primary and secondary forms are all standard forms, but primary forms are heard more frequently.

The first pronunciation given for each word (that is, the pronunciation with PronNum = 1) is usually a citation form, the sort of pronunciation you are most likely to hear if you asked a speaker of 'standard' English to pronounce a particular word by itself. It is always given the status code P for 'primary' pronunciation. For example, the number one pronunciation of transparent is tr�n��sp��r�nt.

Sometimes stylistic variants of this first form are recorded, variants which indicate the highly frequent elision of sounds that occurs in connected speech. Since the formal pronunciation of a word is quite rare, the stylistic variants are probably heard more often than the citation form. The second variant for transparent is tr�n��sp��rn�t, where the third syllable loses the schwa and the n� is a syllabic consonant. Frequent stylistic variants retain the status code P for 'primary pronunciation', but always have a pronunciation number which is greater than 1. Less frequent stylistic variants are considered secondary pronunciations, such as tr�n��sp��r�nt and trn���sp��r�nt.

Of course there may be alternative, less frequent pronunciations of the primary citation form, alternatives which can't be attributed to the word being used in connected speech. Such differences (sometimes known as speaker-to-speaker variations) remain in all the stylistic variants each different citation form has. If you ask someone how to pronounce the word transparent, you might hear the primary form above, or instead a secondary form like trA�n��sp��r�nt or tr�n��spE��r�nt, the difference being the vowel used in the first and second syllables respectively. When the secondary form occurs in speech, it might be pronounced with some stylistic variation as trA�n��sp��rn�t or tr�n��spE��r�nt. No matter what the stylistic variations, the phoneme which distinguishes one citation form from another remains the same.

Secondary forms, whether citation forms or stylistic variants, are classified as such because they are thought to be used less commonly than primary forms. All secondary forms get the code S, and all have a pronunciation number greater than 1. Unlike primary forms, where the first form listed is automatically the citation form, secondary forms are not given in any sort of order. So, while the first secondary form you see listed in the database might be a citation form, it could just as well be one of many stylistic variants.

Pronunciation  Status  Pronunciation  Example:
type           code    Number         passenger

Primary        P       1              p�sIn�dZ�r�
               P       2              p�sIn�Z�r�
Secondary      S       3              p�s�n�dZ�r�
               S       4              p�s�n�Z�r�
               S       5              p�sn��dZ�r�
               S       6              p�sn��Z�r�

Table 1: Pronunciation status codes for English

The FLEX name and description of this column are as follows:

PronStatus (PronStatusLemma)   Status of pronunciation

��� PHONETIC TRANSCRIPTIONS

When you begin selecting phonetic transcriptions from the CELEX databases, you have to choose from a wide array of options. First you must decide whether or not you want to use transcriptions which include syllable boundaries and stress markers. Then you have to choose which of the four sets of computer phonetic codes best suits your task and your personal preferences. So, before defining each of the columns in turn, the section below describes the features of the four available sets of computer phonetic codes. In the last subsection, after the column definitions, there is a table of examples which gives some transcriptions from each of the transcription columns; it is especially useful for comparing the different types of transcription (plain, syllabified, or stressed and syllabified).

����� COMPUTER PHONETIC CHARACTER SETS

Four different sets of phonetic character codes are available from CELEX. The first three sets are SAM-PA, CELEX and CPA, and they can be thought of as computerized versions of IPA. They use standard ASCII codes (those which can be typed in and read on almost any terminal) to represent certain of the IPA characters. As far as possible, these sets have been designed to resemble IPA: a lot of the characters you type or read look like their IPA counterparts. As with IPA, diphthongs and affricates are represented by writing the two appropriate characters next to each other, and long vowels are indicated by length markers. In some cases, however, these conventions can lead to ambiguity: are the two vowels shown next to each other really a diphthong, or are they in fact two separate vowels? To overcome such problems, there are columns which contain transcriptions with syllable markers, and also columns available which have a delimiter placed after each consonant, affricate, vowel, long vowel or diphthong. So, these sets of computer codes for phonetic transcription can provide a readable approximation of IPA, with extra provision made to overcome the possibility of ambiguity.

The tables over the next two pages list the basic set of segments for English. Each line gives an IPA character alongside a word which exemplifies the sound and the equivalent characters in the four computer-usable sets available with CELEX.

The first of the three IPA-like sets is the SAM-PA set. It was developed in connection with a European Community research program, and it has been presented in the Journal of the International Phonetic Association ����� �� � ���pp� ����� as a widely-agreed computer-readable phonetic character set suitable for use with Danish, Dutch, English, French, German and Italian. For technical reasons, the version of SAM-PA implemented by CELEX has to include one change: the � character (ASCII code ��) representing the 'half-open front rounded' vowel sound has been implemented as � (ASCII code ��). The second is a set originally designed for use within CELEX. The third is CPA, the Computer Phonetic Alphabet, or Esprit ���, which was developed in the Ruhr-Universität Bochum, Germany.

IPA   example    SAM-PA  CELEX  CPA   DISC

p     pat        p       p      p     p
b     bad        b       b      b     b
t     tack       t       t      t     t
d     dad        d       d      d     d
k     cad        k       k      k     k
ɡ     game       g       g      g     g
ŋ     bang       N       N      N     N
m     mad        m       m      m     m
n     nat        n       n      n     n
l     lad        l       l      l     l
r     rat        r       r      r     r
f     fat        f       f      f     f
v     vat        v       v      v     v
θ     thin       T       T      T     T
ð     then       D       D      D     D
s     sap        s       s      s     s
z     zap        z       z      z     z
ʃ     sheep      S       S      S     S
ʒ     measure    Z       Z      Z     Z
j     yank       j       j      j     j
x     loch       x       x      x     x
h     had        h       h      h     h
w     why        w       w      w     w
tʃ    cheap      tS      tS     T/    J
dʒ    jeep       dZ      dZ     J/    _
ŋ̩     bacon      N,      N,     N,    C
m̩     idealism   m,      m,     m,    F
n̩     burden     n,      n,     n,    H
l̩     dangle     l,      l,     l,    P
*     father     r*      r*     r*    R     (possible linking 'r')

Table 2: Computer phonetic codes for English consonants

IPA   example    SAM-PA  CELEX  CPA    DISC

ɪ     pit        I       I      I      I
ɛ     pet        E       E      E      E
æ     pat        {       &      ^/     {
ʌ     putt       V       V      ^      V
ɒ     pot        Q       O      O      Q
ʊ     put        U       U      U      U
ə     another    @       @      @      @
iː    bean       i:      i:     i:     i
ɑː    barn       A:      A:     A:     #
ɔː    born       O:      O:     O:     $
uː    boon       u:      u:     u:     u
ɜː    burn       3:      3:     @:     3
eɪ    bay        eI      eI     e/     1
aɪ    buy        aI      aI     a/     2
ɔɪ    boy        OI      OI     o/     4
əʊ    no         @U      @U     O/     5
aʊ    brow       aU      aU     A/     6
ɪə    peer       I@      I@     I/     7
ɛə    pair       E@      E@     E/     8
ʊə    poor       U@      U@     U/     9
æ̃     timbre     {~      &~     ^/~    c
ɑ̃ː    détente    A~:     A~:    A~:    q
æ̃ː    lingerie   {~:     &~:    ^/~:   0
ɒ̃ː    bouillon   O~:     O~:    O~:    ~

Table 3: Computer phonetic codes for English vowels and diphthongs

The fourth set is the DISC set, so called because it is a computer phonetic alphabet made up of distinct single characters. It is fundamentally different from the other three in that it assigns one ASCII code to each distinct phonological segment in the sound systems of Dutch, English and German. Here segment means a consonant, an affricate, a short vowel, a long vowel or a diphthong. There are two main advantages to this set. First, it provides one character for one segment, in contrast to the other three sets which use extra characters for long vowels, affricates and diphthongs. Second, there is no possibility of ambiguous transcriptions. A diphthong is always shown as a diphthong, and two separate vowels in proximity to each other (say on either side of a syllable boundary) can thus no longer be confused with a real diphthong; an affricate is always shown as such, and not as two consonants. For both these reasons, those interested in processing phonetic transcriptions (as opposed to reading transcriptions in a character set that resembles the familiar IPA) may well choose transcriptions in this character set.

Its most basic codes correspond to SAM-PA: all the SAM-PA codes which represent short vowels and consonants are included in this set. The remaining long vowels, diphthongs and affricates have been assigned codes not already in use for other purposes. The resulting character set thus does not look as elegant and IPA-like as the other three sets. However, if you are mainly interested in the computer processing of transcriptions, such aesthetic considerations might not be so important.

Clearly, you have a wide choice of transcriptions available to you. The type you choose will depend on the nature of the task you have in mind. For IPA-like readability and non-ambiguous transcriptions, use the SAM-PA, CELEX or CPA sets. For computer processing tasks which need one-character-to-one-segment correspondence, use the DISC set. In Appendix I there is a table which sets out DISC and how it relates to Dutch, English and German. One final point worth noting is that if, instead of the standard sets of codes offered here, you want to use a set of your own making, you can implement it by means of the pattern transducer available in the FLEX window MODIFY CONVERSION, and the DISC character set is probably the easiest set to convert.
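The one-character-per-segment property is what makes conversion between sets straightforward. As an illustration outside FLEX, the Python sketch below converts a delimited SAM-PA transcription into DISC. The mapping covers only the multi-character English segments (long vowels, diphthongs and affricates) and is an assumption based on the commonly documented SAM-PA and DISC codes; syllabic consonants and nasalized vowels are left out, and the function and dictionary names are invented for this example.

```python
# Partial SAM-PA to DISC mapping for multi-character English segments.
# Single-character SAM-PA codes are the same in DISC and pass through as-is.
SAMPA_TO_DISC = {
    "i:": "i", "A:": "#", "O:": "$", "u:": "u", "3:": "3",
    "eI": "1", "aI": "2", "OI": "4", "@U": "5", "aU": "6",
    "I@": "7", "E@": "8", "U@": "9",
    "tS": "J", "dZ": "_",
}


def sampa_to_disc(delimited: str, delimiter: str = ".") -> str:
    """Convert a delimited SAM-PA transcription into a DISC string."""
    segments = [seg for seg in delimited.split(delimiter) if seg]
    return "".join(SAMPA_TO_DISC.get(seg, seg) for seg in segments)


print(sampa_to_disc("tS.i:.p."))  # Jip  (cheap)
print(sampa_to_disc("b.OI."))     # b4   (boy)
```

Starting from a delimited column avoids having to guess where one segment ends and the next begins.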

����� PLAIN TRANSCRIPTIONS

The first set of columns offers plain transcriptions, that is, transcriptions which do not have any syllable markers or stress markers, written in each of the four coding systems already described. However, three of these columns have one special feature: each phonetic segment ends with a delimiter. Here a segment means a vowel, a consonant, a long vowel, a diphthong, or an affricate. Using a delimiter avoids any possibility of ambiguity between the two parts of a diphthong or an affricate. These delimiter transcriptions are available in the SAM-PA, CELEX, and CPA character sets. Delimiters are not given with DISC transcriptions since the unique single-character nature of that set obviates the need to delimit each segment in this way. Also available with the plain transcriptions is a column which indicates how many phonemes each transcription contains.

The first plain transcription column uses the SAM-PA character set, and full stops ( . ) as delimiters. The FLEX name and description of this column are as follows:

PhonSAM (PhonSAMLemma)   Unsyllabified, SAM-PA character set

The second column uses the CELEX character set, and full stops ( . ) as delimiters. The FLEX name and description of this column are as follows:

PhonCLX (PhonCLXLemma)   Unsyllabified, CELEX character set

The third column uses the CPA character set, and full stops ( . ) as delimiters. Normally CPA uses full stops as syllable markers, but here, of course, no syllable markers are used. The FLEX name and description of this column are as follows:

PhonCPA (PhonCPALemma)   Unsyllabified, CPA character set

The last plain transcription column uses the DISC set. No delimiters, syllable markers or stress markers are included. The FLEX name and description of this column are as follows:

PhonDISC (PhonDISCLemma)   Unsyllabified, DISC character set

A count which tells you how many phonemes each headword you select contains is available. Since certain phonemes (long vowels and diphthongs) are sometimes given by two characters, this count is more sophisticated than merely the length of the string. Here are some examples: the lemma farmhouse has six phonemes in its transcription �fa�m�haUs�, and ammeter also has six � �mI�t�r��.

PhonCnt (PhonCntLemma)   Number of phonemes
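To see why the phoneme count differs from the string length, consider this small Python sketch. It works on a delimited transcription of the kind held in the PhonSAM column; the function names are assumptions, and the transcription of barn is constructed here from the SAM-PA code for its vowel purely as an example.

```python
def split_segments(delimited: str, delimiter: str = ".") -> list[str]:
    """Split a delimited transcription into its phonetic segments."""
    return [seg for seg in delimited.split(delimiter) if seg]


def phoneme_count(delimited: str, delimiter: str = ".") -> int:
    """Count segments rather than characters."""
    return len(split_segments(delimited, delimiter))


print(split_segments("b.A:.n."))            # ['b', 'A:', 'n']
print(phoneme_count("b.A:.n."))             # 3 phonemes, though 'bA:n' has 4 characters
print(phoneme_count("d.E.k.s.t.@.r.@.s."))  # 9
```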

����� SYLLABIFIED TRANSCRIPTIONS

The next set of transcriptions uses the same basic transcriptions as the 'plain' set, but this time they are given without any phoneme delimiters. Instead, the syllables which make up each word are shown in one of two ways. The first method is to use a hyphen (or, in the case of CPA, a full stop) to mark every syllable boundary within each word. The second method, available with the CELEX character set, is to enclose each syllable within square brackets.

The first syllabified transcription column uses the SAM-PA character set. It uses a hyphen as the syllable marker, and its FLEX name and description are as follows:

PhonSylSAM (PhonSylSAMLemma)   Syllabified, SAM-PA character set

The next two syllabified transcription columns use the CELEX character set, and the first of these uses hyphens to mark every syllable boundary in each word. The FLEX name and description of this column are as follows:

PhonSylCLX (PhonSylCLXLemma)   Syllabified, CELEX character set

The other CELEX syllabified column uses the brackets notation as described above, and its FLEX name and description are as follows:

PhonSylBCLX (PhonSylBCLXLemma)   Syllabified, CELEX character set (brackets)

The fourth syllabified transcription column uses the CPA character set, and every syllable boundary within each word is marked by a full stop. The FLEX name and description of this column are as follows:

PhonSylCPA (PhonSylCPALemma)   Syllabified, CPA character set

The fifth syllabified transcription column uses the character set called DISC, and every syllable boundary within each word is marked by a hyphen. The FLEX name and description of this column are as follows:

PhonSylDISC (PhonSylDISCLemma)   Syllabified, DISC character set

The last column in this set gives for each lemma or wordform a count which tells you the number of phonetic syllables in the word. For example, �fa�m�haUs � contains two syllables, and � �mI�t�r� � contains three.

SylCnt (SylCntLemma)   Number of phonetic syllables
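The three syllable notations described in this section can all be counted in much the same way. The Python sketch below is an illustration only; the function name and the `notation` labels are assumptions, and the sample strings render the first pronunciation of dexterous in each notation.

```python
import re


def syllable_count(transcription: str, notation: str = "hyphen") -> int:
    """Count phonetic syllables in a syllabified transcription.

    notation: "hyphen" (SAM-PA, CELEX, DISC), "cpa" (full stops),
    or "brackets" (each syllable enclosed in square brackets).
    """
    if notation == "brackets":
        return len(re.findall(r"\[[^\]]*\]", transcription))
    marker = "." if notation == "cpa" else "-"
    return len([syl for syl in transcription.split(marker) if syl])


print(syllable_count("dEk-st@-r@s"))                  # 3
print(syllable_count("dEk.st@.r@s", "cpa"))           # 3
print(syllable_count("[dEk][st@][r@s]", "brackets"))  # 3
```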

����� STRESSED AND SYLLABIFIED TRANSCRIPTIONS

The third set of columns gives transcriptions which are syllabified and also have primary and secondary stress markers. There are four such columns, each containing transcriptions in one of the four computer-usable phonetic character sets described above.

The first column uses the SAM-PA character set, and as well as using hyphens to mark syllable boundaries, these transcriptions show points of primary stress by means of the 'double quote' character ( " ) and points of secondary stress by means of the 'percent' character ( % ). These characters are placed immediately before a stressed syllable. The FLEX name and description of this column are as follows:

PhonStrsSAM (PhonStrsSAMLemma)   Syllabified, with stress marker, SAM-PA character set

The second column uses the CELEX character set, and as well as using hyphens to mark syllable boundaries, these transcriptions show the points of primary stress with an inverted comma ( ' ) and the points of secondary stress with a 'double quote' ( " ).

PhonStrsCLX (PhonStrsCLXLemma)   Syllabified, with stress marker, CELEX character set


The third column uses the CPA character set, including full stops to mark syllable boundaries; these transcriptions show points of primary stress with an inverted comma (') and the points of secondary stress with a 'double quote' (").

PhonStrsCPA (PhonStrsCPALemma)

Syllabified, with stress marker, CPA character set

The fourth column uses the DISC character set, along with hyphens to mark syllable boundaries; these transcriptions show points of primary stress with an inverted comma (') and points of secondary stress with a 'double quote' (").

PhonStrsDISC (PhonStrsDISCLemma)

Syllabified, with stress marker, DISC character set

A stress pattern is available for each lemma or wordform, as well as a count of the number of syllables and phonemes each contains.

A stress pattern is a string which shows how each syllable is stressed in speech. Each syllable is represented by one numeric character, either 0, 1 or 2. 2 indicates that the syllable receives secondary stress, 1 indicates that it receives primary stress, and 0 that it does not receive primary or secondary stress. The examples below contrast the syllabified phonetic transcription which includes a primary stress marker with the stress pattern described here.

Example      Transcription      Stress pattern

biographic   �baI��U�gr�fIk     2010

googly       gu��glI            10

Table: Example stress patterns

StrsPat (StrsPatLemma)

Stress pattern
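The relationship between a stressed, syllabified transcription and its stress pattern can be sketched as follows. This is a minimal illustration, not the procedure CELEX used: it assumes SAM-PA style input with hyphens as syllable boundaries, a double quote before a primary-stressed syllable and a percent sign before a secondary-stressed one, and it assumes the digit convention 1 = primary, 2 = secondary, 0 = unstressed. The function name and the sample transcriptions are hypothetical.

```python
def stress_pattern(transcription: str) -> str:
    """Derive a stress pattern string from a hyphen-syllabified
    transcription that uses '"' (primary) and '%' (secondary) markers."""
    digits = []
    for syllable in transcription.split("-"):
        if syllable.startswith('"'):
            digits.append("1")   # primary stress
        elif syllable.startswith("%"):
            digits.append("2")   # secondary stress
        else:
            digits.append("0")   # no primary or secondary stress
    return "".join(digits)


# Illustrative transcriptions (assumed, not copied from the database).
print(stress_pattern('%baI-@U-"gr{-fIk'))
print(stress_pattern('"gu:-glI'))
```

The same idea carries over to the other character sets by swapping the marker characters and, for CPA, the boundary character.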


EXAMPLE TRANSCRIPTIONS

Column Examples

excruciatingly oceanic

Plain transcriptions

PhonSAM I�k�s�k�r�u��S�I�eI�t�I�N�l�I� �U�S�I��n�I�k�

PhonCLX I�k�s�k�r�u��S�I�eI�t�I�N�l�I� �U�S�I���n�I�k�

PhonCPA I�k�s�k�r�u��S�I�e��t�I�N�l�I� O��S�I����n�I�k�

PhonDISC IkskruSI�tINlI �SInIk

Syllabified transcriptions

PhonSylSAM Ik�skru��SI�eI�tIN�lI �U�SI��nIk

PhonSylCLX Ik�skru��SI�eI�tIN�lI �U�SI���nIk

PhonSylBCLX �Ik��skru���SI��eI��tIN��lI� ��U��SI�����nIk�

PhonSylCPA Ik�skru��SI�e��tIN�lI O��SI����nIk

PhonSylDISC Ik�skru�SI���tIN�lI ��SI��nIk

Syllabified and stressed transcriptions

PhonStrsSAM Ik�skru��SI�eI�tIN�lI ��U�SI��nIk

PhonStrsCLX Ik��skru��SI�eI�tIN�lI �U�SI����nIk

PhonStrsCPA Ik��skru��SI�e��tIN�lI O��SI�����nIk

PhonStrsDISC Ik��skru�SI���tIN�lI ��SI���nIk

Table: Example English Phonetic Transcriptions

PHONETIC PATTERNS

Phonetic patterns here means CV patterns: the consonant and vowel patterns for the phonetic transcription (as opposed to the orthographic transcription) of any headword or wordform you select. Instead of the basic CV pattern, which uses hyphens to mark phonetic syllable boundaries within words, you may want to use the alternative notation which delimits syllables by means of square brackets. The phonetic CV patterns used here represent each short vowel as V, each long vowel and diphthong as VV, each consonant and affricate as C, and each syllabic consonant as S.

This table illustrates the two different formats you can choose for your CV patterns:


Example     Transcription   CV pattern   CV pattern with brackets

farmhouse   �fa�m�haUs      CVVC-CVVC    [CVVC][CVVC]

ammeter     ���mI�t�r�      V-CV-CVC     [V][CV][CVC]

Table: Example CV patterns

The basic phonetic CV patterns include hyphens as syllable markers. The FLEX name and description of this column are as follows:

PhonCV (PhonCVLemma)

Phonetic CV pattern

Alternatively you can choose phonetic CV patterns of headwords which use square brackets to delimit the syllables. This column has the following FLEX name and description:

PhonCVBr (PhonCVBrLemma)

Phonetic CV pattern, with brackets
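The two notations carry the same information, so one can be derived from the other. The sketch below shows one way to convert between them; the function names are hypothetical and the code assumes well-formed patterns containing only the letters C, V and S plus the delimiters.

```python
def cv_to_brackets(cv_pattern: str) -> str:
    """Turn a hyphen-delimited CV pattern into the bracketed notation."""
    return "".join(f"[{syllable}]" for syllable in cv_pattern.split("-"))


def brackets_to_cv(bracketed: str) -> str:
    """Turn a bracketed CV pattern back into the hyphen-delimited notation."""
    syllables = [part for part in bracketed.replace("]", "").split("[") if part]
    return "-".join(syllables)


print(cv_to_brackets("CVVC-CVVC"))       # pattern given for farmhouse
print(brackets_to_cv("[V][CV][CVC]"))    # pattern given for ammeter
```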


ENGLISH MORPHOLOGY

Information on English morphology is available with lemma lexicons and wordform lexicons. If you are interested in inflectional morphology, then you should use a wordform lexicon, and if you are interested in derivational and compositional morphology, you should use a lemma lexicon.

MORPHOLOGY OF ENGLISH LEMMAS

The morphological analyses given for lemmas in the CELEX databases always use the headword form of the lemma, because this form (unlike Dutch) is usually the shortest in any inflectional paradigm, without any visible inflectional endings. However, when discussing English morphology, stem is the normal term used to describe this form, and so in this section stem is used instead of headword, just to fit in with common practice.

Before finding out details about each of the columns available, you should look at the sections below, which try to give some explanation of the methods used to obtain the analyses given in the database. You will then know what CELEX means by terms such as immediate segmentation, hierarchical segmentation, compound, derivation, and derivational compound. After all that, you'll understand more clearly what each of the various columns has to offer.

HOW TO SEGMENT A STEM

The first and most fundamental type of segmentation is immediate segmentation. This simply involves splitting a stem into its largest constituent parts. If you continue to carry out immediate segmentation until there is nothing left to segment, you arrive at the stem's complete segmentation. Depending on your requirements, you can look at a complete segmentation in two forms. The first is the flat form, which shows every morpheme that makes up the stem. The second is the hierarchical form, which, as well as pointing out the individual morphemes in a stem, also shows all the analyses which have to be made to identify those morphemes. The flat segmentation gives the conclusion reached, while the hierarchical segmentation shows the working.

To illustrate the three types of segmentation, take as an example the word interdenominational.

The first type of analysis (immediate segmentation) gives the affix inter plus the stem denominational:

interdenominational

inter denominational

The second type of analysis (complete segmentation, flat) shows you what you get if you keep applying immediate segmentation, namely the constituent morphemes of interdenominational: the affix inter plus the affix de plus the stem nominate plus the affix ion plus the affix al.

interdenominational

inter de nominate ion al

The third type (complete segmentation, hierarchical) shows you the full analysis of the word, including each individual immediate segmentation carried out. It gives you enough information to produce a hierarchical tree diagram like this one:

interdenominational
    inter
    denominational
        denomination
            denominate
                de
                nominate
            ion
        al
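The three views of a segmentation can be related to one another in code. The sketch below represents the hierarchical analysis of interdenominational as a nested structure and derives the immediate and flat segmentations from it. The data structure and function names are assumptions made for illustration; they do not reflect how the database itself stores its analyses.

```python
from typing import List, Tuple, Union

# A node is either a morpheme (str) or a (form, children) pair.
Node = Union[str, Tuple[str, List["Node"]]]

TREE: Node = ("interdenominational", [
    "inter",
    ("denominational", [
        ("denomination", [
            ("denominate", ["de", "nominate"]),
            "ion",
        ]),
        "al",
    ]),
])


def immediate_segmentation(node: Node) -> List[str]:
    """Return the largest constituent parts of a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [child if isinstance(child, str) else child[0] for child in children]


def flat_segmentation(node: Node) -> List[str]:
    """Apply immediate segmentation repeatedly until only morphemes remain."""
    if isinstance(node, str):
        return [node]
    _, children = node
    morphemes: List[str] = []
    for child in children:
        morphemes.extend(flat_segmentation(child))
    return morphemes


print(immediate_segmentation(TREE))
print(flat_segmentation(TREE))
```

The flat result discards the intermediate stems (denominational, denomination, denominate), which is exactly the "working" that the hierarchical form preserves.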


For most stems in the database, representations of each of these three types of segmentation are available. Sometimes there is more than one representation, because certain stems can have more than one immediate segmentation. To explain this fully, the next section describes the basic analyses that result from immediate segmentation.

TYPES OF ANALYSES

When you attempt to split a stem into its biggest component parts, the result is always some combination of stems or flections and affixes. A flection (such as the freezing in freezing point) is treated the same as a stem, so that whenever an analysis involves a stem, you know that the stem could also be a flection. The most straightforward analysis of all is a stem which consists of only one (free) morpheme: it is monomorphemic, and clearly can't be split up. Every other stem, however, consists of one smaller stem or affix plus at least one affix or one other stem, and can be termed either a Derivation, a Compound, a Derivational Compound, or a Neo-classical Compound. It is important to understand the differences between these four terms, since they are at the heart of the morphological information CELEX provides. So, in the subsections below, each is defined in terms of stems and affixes. Examples are given, and simple 'tree' diagrams illustrate the appropriate immediate analyses.

THE DERIVATION

A derivation involves affixation, whereby affixes can be added to an existing stem or flection to form a new stem. The immediate analysis always takes one of four possible forms:

(i) a binary split into a stem or flection plus an affix (the word careful, for example: care + ful)

stem

stem affix


(ii) a binary split into an affix plus a stem or flection. For example, the word barometer is analysed as baro + meter.

stem

affix stem

(iii) a triform split into an affix, a stem or flection, and an affix (the word extracurricular, for example: extra + curriculum + ar)

stem

affix stem affix

(iv) a triform split into a stem or flection, an affix and another affix. Such words can be derivations of inflected forms (the word falteringly, for example, is analysed as falter + ing + ly, which is a stem plus an inflectional affix plus a derivational affix). Alternatively, they can be lexicalised forms of inflected derivations like countrified, analysed as country + ify + ed, which is a stem plus a derivational affix plus an inflectional affix. This sort of analysis is only appropriate when the stem and the affix which immediately follows it don't together form a lemma, because otherwise (as with the word inflationary) the immediate analysis would be like type (i) above: a stem plus an affix (inflation + ary).

stem

stem affix affix


THE COMPOUND

A compound is the joining of two stems or flections into one new stem. The immediate analysis always takes one of two forms:

(i) a binary split into two stems (the word nameplate, for example: name + plate)

stem

stem stem

(ii) a triform split into a stem or flection, an affix (simply a 'link' morpheme), and a stem or flection (the word bandsman, for example: band + s + man)

stem

stem affix stem

Words which consist of more than two stems aren't analysed as compounds, since they normally have the structure of a phrase or sentence. So headwords like nevertheless, Australian Rules football and be-all-and-end-all don't get a morphological analysis.

THE DERIVATIONAL COMPOUND

A derivational compound is a compound which can only be formed in combination with a derivational affix (as opposed to a simple link morpheme). The immediate analysis normally takes the form of a triform split into a stem or flection, another stem or flection, and an affix (the word icebreaker, for example: ice + break + er).

stem

stem stem affix


A couple of words can be analysed as a quaternary split into a stem, an affix, a stem, and an affix (the word brinksmanship, for example, is analysed as brink + s + man + ship, and whippersnapper is analysed as whip + er + snap + er). However, this is a very rare form of analysis.

stem

stem affix stem affix

THE NEO-CLASSICAL COMPOUND

A neo-classical compound is a word which appears to be made up of two affixes, neither of which can occur as a word in its own right, like aerodrome (aero + drome) or neurology (neuro + ology). The affixes which combine to form this type of compound are generally known as combining forms.

stem

affix affix
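The shapes of immediate analysis described in these subsections can be summarised as a lookup from a stem/affix pattern to an analysis type. The sketch below is only an illustration: the pattern strings use s for a stem (or flection) and a for an affix, and the table and function names are hypothetical. It classifies by shape alone, so it cannot settle cases where one word admits several analyses.

```python
# s = stem or flection, a = affix; patterns taken from the forms listed above.
ANALYSIS_TYPES = {
    "s": "monomorphemic",
    "sa": "derivation",            # care + ful
    "as": "derivation",            # baro + meter
    "asa": "derivation",           # extra + curriculum + ar
    "saa": "derivation",           # falter + ing + ly
    "ss": "compound",              # name + plate
    "sas": "compound",             # band + s + man
    "ssa": "derivational compound",   # ice + break + er
    "sasa": "derivational compound",  # brink + s + man + ship
    "aa": "neo-classical compound",   # aero + drome
}


def classify(pattern: str) -> str:
    """Map an immediate-analysis pattern to the analysis type it implies."""
    return ANALYSIS_TYPES.get(pattern, "unclassified")


for pattern in ("sa", "sas", "ssa", "aa"):
    print(pattern, "->", classify(pattern))
```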

THE NOUN-VERB-AFFIX COMPOUND

Problems sometimes arise with the analysis of words which look like derivational compounds. The general definition of a derivational compound is normally sufficient, but when the second stem is a verbal form, things become more complicated. A stem which comprises a noun plus a verb plus an affix can normally be considered a derivational compound, but some people may want to treat it as an ordinary compound or derivation. The distinction is important, since it can affect not only the appearance of a single immediate segmentation branch, but also the appearance of a complete hierarchical tree. The stem copy editor is such a 'problem' compound. If you consider it to be an ordinary compound (the stem copy plus the stem editor), its complete hierarchical tree looks like this:

copy editor
    copy
    editor
        edit
        -or

If you consider it to be an ordinary derivation (the stem copy-edit plus the affix -or), its complete hierarchical tree looks like this:

copy editor
    copy-edit
        copy
        edit
    -or

But if you consider it to be a derivational compound, the first immediate segmentation gives you the stem copy plus the stem edit plus the affix -or, which gives the full hierarchical tree a different appearance:

copy editor
    copy
    edit
    -or

HOW TO ASSIGN AN ANALYSIS

When you're faced with a headword that needs to be analysed, how do you work out the correct analysis? How did the people at CELEX who carried out the morphological analysis by hand arrive at the answers contained in the database? In particular, in the case of noun-plus-verb-plus-affix words, how did they decide which of the analysis types discussed in the previous section were appropriate?

To illustrate the principles used in analysing the information, there are two diagrams, given as tables below. The first illustrates the general strategy adopted for each headword, and the second deals with the special problems that arise with noun-plus-verb-plus-affix words. In both diagrams abbreviations are used: s means stem, and a means affix, making it easy to refer back to the sections above which define derivations, compounds, and derivational compounds in terms of stems and affixes. When an analysis is acceptable, it means that the component parts identified are current stems or affixes, and that the word can be defined as a derivation, a compound, or a derivational compound according to the definitions given in the sections above. An acceptable stem is one which appears in the Collins English Dictionary without being marked as 'obsolete' or 'archaic'.

Following the first diagram, analysis starts with an attempt to see if the word under scrutiny is just the same as an already existing word with a different word class. The word railroad, for example, can be used as a verb, and it is said to come from the corresponding noun railroad. This phenomenon is called conversion or zero derivation, since there is no difference in the form of the two words even though they have a different word class. Conversion is explained in full under the section 'Status and language codes'. If conversion has occurred, the analysis need go no further: in MorphStatus the word gets the code Z, and NVAffComp and its subordinate columns Der, Comp, and DerComp are all set to N.

If the word is not a conversion, then the next step is to check whether it fits with the definition of a derivation given in the section 'The Derivation' above. For example, the word calculator is analysed as the stem calculate with the suffix -or; encircle is analysed as the prefix en- with the stem circle; and unflappable as the prefix un- plus the stem flap plus the suffix -able. In all three cases, the word is classified as a derivation: in MorphStatus the word gets the code C to indicate that it is complex, and NVAffComp and its subordinate columns Der, Comp, and DerComp are all set to N.

Is the lemma derived from another lemma identical in form but different in word class?
    yes: Zero Derivation
    no:  Is the analysis sa or as or asa acceptable?
        yes: Derivation
        no:  Is the analysis ss or sas acceptable?
            yes: Compound
            no:  Is the analysis ssa (or sasa) acceptable?
                yes: Derivational Compound
                no:  Monomorphemic

Table: How to carry out morphological analysis

If the word turns out not to be a derivation, then the next stage is to see if it fits with the definition of a compound given in the section 'The Compound' above. For example, the noun keyboard is analysed as the stem key plus the stem board, and groundsman as the stem ground plus the infix -s- plus the stem man. In both cases, the word is classified as a compound: under the MorphStatus column it gets the code C to indicate that it is complex, and NVAffComp and its subordinate columns Der, Comp, and DerComp are all set to N.

If the word still hasn't been classified, then the last stage is to see whether the word is a derivational compound, as defined in the section 'The Derivational Compound' above. For example, the adjective barefaced is analysed as the stem bare plus the stem face plus the affix -ed. The word is therefore classified as a derivational compound: under MorphStatus it gets the code C to indicate that it is complex, and NVAffComp and its subordinate columns Der, Comp, and DerComp are all set to N.

It is possible that the word might not fit into any of the above categories: this makes the word monomorphemic, and the 'analysis' is simply the word itself (chair, for example, or llama). Under MorphStatus the code M is given, and NVAffComp and its subordinate columns Der, Comp, and DerComp are all set to N. In other cases where no analysis can be carried out, the code under MorphStatus indicates why. You can read about these codes in the section 'Status and language codes'.

THE NOUN-VERB-AFFIX COMPOUND

The general scheme explained above is enough to arrive at an analysis in most cases. However, difficulties in applying the system arise when you start considering so-called noun-verb-affix compounds: those words which contain a verbal element, which aren't conversions, and which could be analysed as a nominal stem plus a verbal stem plus an affix. Examples of such words are stockholder and copy-editor. This type of compound is characterised by a Y in the NVAffComp column. As the diagram below shows, just because they could be analysed in such a way, it doesn't mean they necessarily should be. Stockholder is both a compound and a derivational compound, and copy-editor is a derivation, a compound and a derivational compound. The approach outlined below is designed to keep as many morphologists as possible happy with the information available in the database: it's possible to choose for yourself whether to restrict your lexicon to just one type of analysis, or to permit them all, according to your own requirements.

The first step is to see whether the NVAffComp word you are dealing with can be classified as a derivation, in accordance with the definition in the section 'The Derivation' above. Take the word dive-bomber as an example: it can be analysed as the stem dive-bomb plus the affix -er. The verb dive-bomb is accepted as legitimate because it occurs as such in the Collins English Dictionary (CED). So this word does meet the definition of a derivation, and thus gets the code Y in the Der column. Another example is the word bricklayer: since a verb bricklay doesn't exist (according to the CED), it can't be analysed as a stem plus an affix, and so it gets the code N in the Der column.


Is the analysis sa or as or asa acceptable?

yes    no

Derivation    Is the analysis ss or sas acceptable?

yes    no

Compound    Is the analysis ssa (or sasa) acceptable?

yes    no

Derivational Compound

Table: Dealing with noun-verb-affix compound analyses

The next stage is to see whether the NVAffComp word in question can be classified as a compound, as defined in the section 'The Compound' above. Again, the word dive-bomber meets the definition: it can be analysed as the stem dive plus the stem bomber, and it is a particular sort of bomber. It can therefore be called a compound, and it gets the code Y in the Comp column. Applying the same rules to bricklayer produces the opposite result: a bricklayer isn't really a particular sort of layer, since (according to the CED) layer doesn't mean 'someone who lays things'. So, bricklayer gets the code N in the Comp column to show that it isn't a compound.

The last stage is to decide whether the word is a derivational compound, as defined in the section 'The Derivational Compound' above. This time dive-bomber does not qualify. Even though it can have the structure stem (dive) plus stem (bomb) plus affix (-er), the noun dive cannot be the object of the verb bomb: you can't talk about 'bombing a dive'. Since dive isn't any sort of obligatory complement to the verb bomb, dive-bomber is not a derivational compound, and therefore gets the code N in the DerComp column. Applying the rules to bricklayer again produces the opposite result. It can have the structure stem (brick) plus stem (lay) plus affix (-er), and brick is the object of the verb lay: it is possible to talk about 'laying bricks'. So since this time bricks is some sort of complement to the verb lay, bricklayer gets the code Y in the DerComp column to show that it is a derivational compound.

To illustrate this process further, more examples are given in the table below.

Word           Classifications

               NVAffComp  Der  Comp  DerComp

typesetter     Y          Y    N     Y

dive-bomber    Y          Y    Y     N

copy-editor    Y          Y    Y     Y

stockholder    Y          N    Y     Y

churchgoer     Y          N    N     Y

cub reporter   Y          N    Y     N

Table: Example noun-verb-affix compound analyses

It shows six examples and the codes each one gets in NVAffComp, Der, Comp, and DerComp, as a quick way of showing how the words are classified in the NVAffComp analysis scheme. In the database, however, each separate analysis gets a separate row, and if you looked up analyses for these six words you would get something like this:


Stem           MorphNum  NVAffComp  Der  Comp  DerComp  Def  Imm

typesetter     �         Y          Y    N     N        Y    typeset er

typesetter     �         Y          N    N     Y        Y    type set er

dive-bomber    �         Y          Y    N     N        Y    dive-bomb er

dive-bomber    �         Y          N    Y     N        Y    dive bomber

copy-editor    �         Y          Y    N     N        Y    copy-edit or

copy-editor    �         Y          N    Y     N        Y    copy editor

copy-editor    �         Y          N    N     Y        Y    copy edit or

stockholder    �         Y          N    Y     N        Y    stock holder

stockholder    �         Y          N    N     Y        Y    stock hold er

churchgoer     �         Y          N    N     Y        Y    church go er

cub reporter   �         Y          N    Y     N        Y    cub reporter
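The step from the one-line summary of a word to its separate database rows can be sketched as follows. This is an illustration under stated assumptions, not CELEX code: the function name and the dictionary layout are hypothetical, only the four classification columns are modelled, and NVAffComp is assumed to be Y for every word passed in.

```python
from typing import Dict, List


def expand_analyses(stem: str, der: str, comp: str, dercomp: str) -> List[Dict[str, str]]:
    """Produce one row per accepted analysis type, each with exactly
    one of Der, Comp and DerComp set to Y."""
    rows = []
    for column, flag in (("Der", der), ("Comp", comp), ("DerComp", dercomp)):
        if flag == "Y":
            row = {"Stem": stem, "NVAffComp": "Y",
                   "Der": "N", "Comp": "N", "DerComp": "N"}
            row[column] = "Y"
            rows.append(row)
    return rows


# copy-editor is summarised as Der=Y, Comp=Y, DerComp=Y, so it yields three rows.
for row in expand_analyses("copy-editor", "Y", "Y", "Y"):
    print(row)
```

A word such as churchgoer, with only DerComp set to Y in the summary, yields a single row.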

If you've followed in full the explanation of how CELEX carried out its morphological analysis of English words, most of this example lexicon should be clear. The columns it contains, along with other columns, are described and defined in the sections that follow. Using the columns available you can control the number of analyses you see for each stem, as well as the type of analyses, by means of restrictions on the 'number' and 'status' columns which are defined below. You can decide for yourself whether your lexicon should contain just one 'default' analysis per stem, or whether it should contain more than one analysis per stem. In cases where a stem can be analysed as a derivation, a compound or a derivational compound, you can choose to include whichever type you prefer, leaving out the other types. In short, you have the freedom to build lexicons which contain morphological information in the form you most prefer.

STATUS AND LANGUAGE CODES

The first ADD COLUMNS menu you see after you select the 'Morphology' option is this one:


ADD COLUMNS

Status
Language information
Derivational/compositional information

TOP MENU
PREVIOUS MENU

Before dealing with the various derivational/compositional information columns, which form the bulk of the available morphological information, the first two columns are dealt with here.

The first column simply tells you by means of a single code whether each stem is morphologically simple, morphologically complex, or a conversion, or why it is as yet unanalysed. The table below shows the codes that are used, and it is followed by a description of each of the eight codes. Just before the description concludes with the column definition, there is a diagram which illustrates the strategies CELEX used to determine a status code for each stem.

Status                              Code  Example

Morphological analysis available:

Morphologically complex             C     sandbank
Monomorphemic                       M     camel
Conversion (Zero Derivation)        Z     abandon
Contracted form                     F     I've

Morphological analysis unavailable:

Morphology irrelevant               I     meow
Morphology obscure                  O     dedicate
Morphology may include a 'root'     R     imprimatur
Morphology undetermined             U     hinterland

Table: Derivational morphology status codes

If a stem contains at least one stem plus at least one other stem or affix, then it is said to be morphologically complex. Details of how the stem can be analysed are given in the derivational/compositional segmentation columns described in the section below. Thus if a stem has the morphological status code C for 'complex', you know that information about its derivational and/or compositional morphology is available in the database.

If a stem is monomorphemic, then it contains only one morpheme, and no further analysis is required. The morphological status code M means 'monomorphemic', and you know that a simple one-stem analysis is given as the derivational and/or compositional morphology for each stem with this code.

If a stem appears to be derived from another stem which is identical in form but different in word class, it gets the code Z for 'zero derivation' or conversion. The noun delinquent, for example, can be said to derive from the adjective delinquent. Normally derivations from one word class to another are clearly marked by means of an affix: sheepish is an adjective derived from the noun sheep, for example. But conversions, on the other hand, are not so marked: it's as if an affix containing nothing had been added to the original stem.

Naturally enough, when conversion occurs, it's not immediately obvious which stem is the original and which is the derivative. In analysing these words, CELEX adopted a strategy for determining the direction of the conversion (that is, if a verb has been converted into a noun, the derivation is in the direction of the noun). The table below indicates the normal direction of conversion. Conversion in the opposite direction is also possible provided that it is specified in the Shorter Oxford English Dictionary (SOED).

Default direction        Example

verb to noun             paint
adjective to noun        parallel
adjective to adverb      pretty
adjective to verb        pale
preposition to adverb    past

Table: Direction of conversions

The status code F indicates that the 'analysis' given is in fact a contraction. The single contraction 'd can represent had, would or did, and the complex contraction he's represents he is or he has. Each contracted form gets its own row in the database and the status code F.

In the case of monomorphemic stems, complex stems, conversion stems, and contractions of stems, morphological analyses are provided in the various segmentation columns. However, there remains a large number of stems which have no analysis, and in such cases, codes indicating the reasons for the lack of analysis are given in this column. These codes and reasons are explained below.

First of all, sometimes even attempting morphological analysis is not appropriate for a particular stem. Usually this is true when the stem is an exclamation or an interjection of some sort (gosh, prithee or meow, for example), or when it is a proper noun: Spooner and Germany aren't analysed. In addition, those few words which seem to have taken on the structure of a short sentence (or at least consist of three or more stems), like nowadays or whodunit, don't get an analysis. So, whenever a stem has the code I for 'irrelevant', you know that a morphological analysis isn't considered necessary, and that its entries in the segmentation columns described below are therefore empty.

Some stems are recognizable recent loanwords which have achieved some sort of currency in English: words like virtuoso or pretzel or mazurka. Since providing analyses for such stems would, in many cases, mean delving into the morphology of languages not covered by CELEX, they simply receive the code U for 'undetermined'. The languages loanwords originate from are shown in the next column, Lang.

On other occasions, an analysis seems possible, but cannot be fully explained. The stem tabby, for instance, appears to consist of the productive suffix -y plus what might be another stem tab. However, tab bears no immediate relation to the adjective tabby, so that tabby gets the code O to indicate that the morphological analysis is 'obscure'.

In most cases morphological analysis is carried out on a synchronic basis: the stems or affixes which make up a word must occur in modern, current English, regardless of the historical origins they might have. On many occasions, however, an etymological root could explain the morphology of a stem which would otherwise be unanalysable. The stem patrimony, for example, appears to be made up of a Latin prefix patri- and what may be a Latin suffix -mony. Stems like this, which could be analysed on the basis of the historical roots of their constituent parts, are given the code R for 'root'.

If you want to understand this coding system more fully, you can examine the table 'How to assign MorphStatus codes', a diagram that sets out a scheme for arriving at appropriate morphological status codes. Starting with a stem whose morphology is relevant (that is, it doesn't belong with those stems that have the code I), you can work out the code it should have by following the diagram through. This strategy is the one actually used by CELEX to determine the correct codes.

This column can be used to eliminate from your lexicon stems for which there are no morphological analyses, allowing you to concentrate on those which have them. Simply add a restriction which states that you only want stems which are morphologically complex: MorphStatus = C.

The column which contains these morphological status codes has the following FLEX name and description:

MorphStatus

(MorphStatusLemma)

Morphological status

The second column contains codes that identify the particular geographical origins of lemmas, including the foreign languages from which some of the lemmas have been borrowed. The headword conquistador thus gets the code S to indicate that it is of Spanish origin, and the verb waffle gets the code B because it's reckoned to be a peculiarly British usage. Whenever a lemma doesn't have a code, it isn't a recent borrowing from another language, and it doesn't have associations with one regional variety of English. The table of language codes below sets out all the codes used and their meanings.

1. Is the lemma derived from another lemma identical in form but different in word class?
   yes: Z (Zero derivation)
   no:  go to question 2

2. Does the lemma contain at least one modern, productive lemma or affix plus one other element?
   yes: go to question 3
   no:  go to question 4

3. Does the other element contain at least one productive lemma or affix?
   yes: C (Complex)
   no:  O (Obscure)

4. Is the lemma a recent loanword?
   yes: U (Undetermined)
   no:  go to question 5

5. Does the lemma contain at least one etymologically recognisable root?
   yes: R (Root)
   no:  M (Monomorphemic)

Table: How to assign MorphStatus codes
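The decision scheme for MorphStatus codes can be written down as a small function. This is only a sketch: the function name and its five yes/no parameters are hypothetical stand-ins for the questions in the diagram, and the judgements themselves (is this affix productive? is this a recent loanword?) still have to come from a human analyst. Stems with the code I are assumed to have been set aside beforehand.

```python
def assign_morph_status(
    zero_derived: bool,
    has_productive_part: bool,
    rest_is_analysable: bool,
    recent_loanword: bool,
    has_etymological_root: bool,
) -> str:
    """Walk the MorphStatus decision scheme from top to bottom.

    Each argument answers one question of the diagram; the names are
    illustrative assumptions, not CELEX column names.
    """
    if zero_derived:
        return "Z"  # zero derivation
    if has_productive_part:
        # A productive lemma or affix was found, so the verdict
        # depends on whether the remaining element is analysable too.
        return "C" if rest_is_analysable else "O"
    if recent_loanword:
        return "U"  # undetermined
    return "R" if has_etymological_root else "M"


# tabby: productive suffix -y, but 'tab' bears no relation to the adjective.
print(assign_morph_status(False, True, False, False, False))
# patrimony: no productive parts, not a recent loanword, but Latin roots.
print(assign_morph_status(False, False, False, False, True))
```

The two calls reproduce the codes the text gives for tabby (O) and patrimony (R).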

Language Code   Meaning            Example

A               American English   billfold
B               British English    divvy
D               German             sauerkraut
F               French             patisserie
G               Greek              eureka
I               Italian            cicerone
L               Latin              emeritus
S               Spanish            siesta

Table: Language codes for English headwords

The FLEX name and description of the column which contains these codes are as follows:

Lang

(LangLemma)

Language information

DERIVATIONAL/COMPOSITIONAL INFORMATION

ADD COLUMNS

Number of morphological analyses
Analysis number (1-N)
Status of morphological analysis
Segmentations
Other

TOP MENU
PREVIOUS MENU

These options give you information about the derivational and compositional morphology of stems, including how many analyses are available for each stem, a unique number for each analysis, an indication of the way in which each analysis has been made, and a marker for the 'default' analyses for each stem.

The first option is a column which simply indicates how many analyses have been made for each stem. For example, backfire has one analysis, flashbulb has two, and treasurer three. The number of analyses for each stem also equals the number of rows that stem can have with distinct analyses, since each morphological analysis is assigned to its own individual row.

You can use this column to construct restrictions for your lexicon. A simple example would be one that includes in your lexicon only those stems which have more than one analysis. This would take the form MorphCnt > 1. The FLEX name and description of this column are as follows:

MorphCnt

(MorphCntLemma)

Number of morphological analyses
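Outside FLEX, a restriction of this kind amounts to filtering rows on a column value. The following sketch assumes the lexicon has been exported as a list of dictionaries keyed by column name. The helper `restrict` is hypothetical, and so are the MorphStatus values and tabby's count of 0, which are filled in only for illustration. Only the MorphCnt values for backfire, flashbulb and treasurer come from the text.

```python
# One summary row per stem, for simplicity; in the database every
# analysis of a stem sits on its own row.
rows = [
    {"Stem": "backfire", "MorphStatus": "C", "MorphCnt": 1},
    {"Stem": "flashbulb", "MorphStatus": "C", "MorphCnt": 2},
    {"Stem": "treasurer", "MorphStatus": "C", "MorphCnt": 3},
    {"Stem": "tabby", "MorphStatus": "O", "MorphCnt": 0},  # count assumed
]


def restrict(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


# Counterpart of the restriction MorphStatus = C.
complex_only = restrict(rows, lambda r: r["MorphStatus"] == "C")

# Counterpart of the restriction MorphCnt > 1.
several_analyses = restrict(rows, lambda r: r["MorphCnt"] > 1)

print([r["Stem"] for r in several_analyses])
```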

The second option is a column which identifies each analysis of a particular stem. Each different morphological analysis of a stem is assigned to a different row, and this column gives the number of the row. Thus the adjective lemma flashbulb has two rows: one has the MorphNum 1, the other has the MorphNum 2. The FLEX name and description of this column are as follows:

MorphNum

(MorphNumLemma)

Morphological analysis ID

ANALYSIS TYPE CODES

Under the 'status of morphological analysis' option there are five 'yes/no'-type columns which, when you use them to construct restrictions, can help you extract the analyses you want from the many stem segmentations available.

Each distinct morphological analysis of each stem has a number, and is given (in several different forms) on its own row in the database. These columns give simple information about each analysis, and are particularly useful whenever a stem is a noun-verb-affix compound. (A noun-verb-affix compound, as discussed elsewhere in this guide, can correctly be analysed as a derivation, a compound, or a derivational compound.) The five columns in question are called NVAffComp, Der, Comp, DerComp, and Def.

Whenever NVAffComp contains a Y, you know that 'yes, this row contains a stem which is considered a noun-verb-affix compound, and which therefore might be analysed in three different ways'. And naturally whenever it contains an N, you know that the row contains a stem which is not considered a noun-verb-affix compound. The FLEX name and description of this column are as follows:

NVAffComp

(NVAffCompLemma)

Noun-verb-affix compound

Whenever Der contains a Y, you know that 'yes, this row contains a noun-verb-affix compound which is analysed as a derivation'. And whenever it contains an N, you know that the row contains a noun-verb-affix compound which is not analysed as a derivation, or a stem which is not of the noun-verb-affix compound type. The FLEX name and description of this column are as follows:

Der

(DerLemma)

Derivation analysis

Whenever Comp contains a Y, you know that 'yes, this row contains a noun-verb-affix compound which is analysed as a compound'. And again, N means that the row contains a noun-verb-affix compound which isn't analysed as a compound, or a stem which is not of the noun-verb-affix compound type. The FLEX name and description of this column are as follows:

Comp

(CompLemma)

Compound analysis

Likewise, whenever DerComp contains a Y, you know that 'yes, this row contains a noun-verb-affix compound which is analysed as a derivational compound'. And naturally, N means that the noun-verb-affix compound isn't analysed as a derivational compound, or that it is a stem which is not of the noun-verb-affix compound type. The FLEX name and description of this column are as follows:

DerComp

(DerCompLemma)

Derivational compound analysis

If a stem has more than one analysis, it's sometimes helpful to be able to identify one which is the best or most useful, or at least to discard unwanted alternatives. Whenever Def contains a Y, you know that 'yes, this row contains a default analysis', and when it contains an N, you know that the row contains another, non-default analysis.

Since there are three types of analyses which can be assigned to a complex stem, there might also be up to three default analyses for one word: a default derivation analysis, a default compound analysis, and a default derivational compound analysis. (Of course, not many words are eligible for three default analyses.) While morphological analysis was being carried out, rules were formulated to determine which analyses should take precedence over the others, and these rules are explained in the table of preferential orders below.

The left-hand column gives the problem which those doing the analysis came up against, the central column shows which of the two possible analyses should be the default (or 'take precedence'), and the right-hand column illustrates the principle with an example. The first part of the table shows the preferential order for derivations, and the second part shows the preferential order for compounds. A part for derivational compounds isn't necessary since they are only analysed in one way.

If, despite the range of analyses available, you want just one default analysis, then you can get it by making a restriction on MorphNum: MorphNum = 1. The first analysis for a lemma is always a default analysis. Analyses which are derivations take precedence over compounds, and likewise compounds take precedence over derivational compounds.

Using this column in conjunction with the three preceding columns, you can construct restrictions which select or omit the analyses you specify. The FLEX name and description of this column are as follows:

Def

(DefLemma)

Default analysis
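One of the precedence rules from the table of preferential orders, the choice between a verb base and a noun base for a suffixed word, depends only on the suffix, so it is easy to sketch in code. The function name and the two sets are hypothetical; only the suffix lists themselves are taken from the table. Suffixes the rule does not mention (such as -age, which has its own semantic rule) return None.

```python
from typing import Optional

# Suffix lists as given in the table of preferential orders.
VERB_FIRST = {"able", "er", "or", "ure"}
NOUN_FIRST = {"ery", "ism", "ist", "ous", "some", "y"}


def preferred_base_class(suffix: str) -> Optional[str]:
    """Return 'V' or 'N' for the base class that takes precedence.

    Returns None when the suffix is not covered by this rule.
    """
    bare = suffix.lstrip("-")  # accept both '-able' and 'able'
    if bare in VERB_FIRST:
        return "V"
    if bare in NOUN_FIRST:
        return "N"
    return None


print(preferred_base_class("-able"))  # comfortable: verb comfort first
print(preferred_base_class("-y"))     # chatty: noun chat first
```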

Preferential order for the analysis of derivations

Option:   stem + affix, or affix + stem
Solution: stem + affix takes precedence over affix + stem.
Example:  disavowal is analysed first as disavow + -al, then as dis- + avowal.

Option:   verb ending in -ate, -ete, -ote or -ute + the affix -ion, or verb not ending in -ate, -ete, -ote or -ute + an affix like -tion
Solution: The verb with the higher frequency in the COBUILD type list takes precedence.
Example:  annunciation is analysed first as the verb announce plus the affix -iation, and then as the verb annunciate plus the affix -ion.

Option:   adjective + suffix -ly, or adjective + suffix -ally
Solution: The adjective with the higher frequency in the COBUILD type list takes precedence.
Example:  problematically is analysed first as problematic + -ally, and second as problematical plus -ly.

Option:   verb + suffix, or noun + suffix
Solution: Verb takes precedence when the suffix is -able, -er, -or or -ure; noun takes precedence when the suffix is -ery, -ism, -ist, -ous, -some or -y.
Example:  comfortable is analysed first as the verb comfort plus the suffix -able, and second as the noun comfort and the suffix -able; chatty is analysed first as the noun chat plus the suffix -y, and second as the verb chat plus the suffix -y.

Option:   verb + suffix -age, or noun + suffix -age
Solution: When the word denotes action or an instance of a phenomenon, then the verb takes precedence; when the word denotes a measure or collection of something, then the noun takes precedence.
Example:  leakage is analysed first as the verb leak + the suffix -age, and second as the noun leak + the affix -age.

Option:   prefix a- + noun, or prefix a- + verb
Solution: Verb takes precedence.
Example:  aglow is analysed first as the prefix a- plus the verb glow, and second as prefix a- plus the noun glow.

Preferential order for the analysis of compounds

Option:   verb + noun, or noun + noun
Solution: Verb + noun takes precedence.
Example:  checkpoint is analysed first as the noun check plus the noun point, and second as the verb check plus the noun point.

Option:   noun + noun, or noun + verb
Solution: Noun + noun takes precedence.
Example:  windfall is analysed first as the noun wind plus the noun fall, and second as the noun wind plus the verb fall.

Table: How to order multiple analyses of compounds and derivations

To illustrate how you can use these columns, imagine that you have chosen Imm and ImmClass as the form of morphological analysis you want to see. Imm shows the analysis, and ImmClass shows the word class of the analysed parts (these columns, and the other columns containing the same analyses in different forms, are described in the sections following this one). Then say that you are interested in two stems: dive-bomber, which has three different analyses, and typesetter, which has two. Both words are noun-verb-affix type words which may be derivations or compounds or derivational compounds, and this accounts for four of the analyses given. However, for the compound analysis of dive-bomber, the stem dive is analysed as a verb but can also be thought of as a noun, which gives an extra analysis.

First you can decide whether you want just one default analysis for each stem, or whether you want to see all the available analyses.

If you want to see all possible segmentations, then you don't need to add extra restrictions. As the MorphCnt column indicates, there are three analyses given for dive-bomber and two for typesetter, so this is what the unrestricted example lexicon looks like:

Stem         MorphNum  NVAffComp  Der  Comp  DerComp  Def  Imm           ImmClass

dive-bomber  1         Y          Y    N     N        Y    dive-bomb+er  Vx
dive-bomber  2         Y          N    Y     N        Y    dive+bomber   VN
dive-bomber  3         Y          N    Y     N        N    dive+bomber   NN
typesetter   1         Y          Y    N     N        Y    typeset+er    Vx
typesetter   2         Y          N    N     Y        Y    type+set+er   NVx

Derivations take precedence over compounds, so for both words the first row, with analysis number 1, contains the derivation and gets Y under Der. And since each word has only one possible derivation analysis, both are also default analyses, and therefore get Y under Def too. The N codes under Comp and DerComp confirm that they are not compounds or derivational compounds.

Compounds take precedence over derivational compounds, so for dive-bomber the next two rows contain the two compound analyses, with analysis numbers (MorphNum) 2 and 3. Both get the code Y under Comp. Since verb + noun compounds take precedence over noun + noun compounds, the verb + noun analysis is a default analysis: it gets 2 as its MorphNum, and Y under Def. The noun + noun analysis gets 3 as its MorphNum, and N under Def. The N codes under Der and DerComp confirm that neither of these analyses is a derivation or a derivational compound.

The last row in the lexicon gives the derivational compound analysis of typesetter, with Y under DerComp. Since it is the only possible derivational compound analysis, it is also a default analysis, and therefore gets Y under Def too. The N codes under Der and Comp confirm that it is not a derivation or a compound.

However, rather than including all of these analyses in your lexicon, you might want to ignore the derivation and derivational compound analyses, and just see the compound analyses. To do this for all the stems in the database, you should add an 'expression' restriction to your lexicon which states that Comp = Y. In the example lexicon, this one restriction produces the following result:

Stem         MorphNum  NVAffComp  Der  Comp  DerComp  Def  Imm          ImmClass

dive-bomber  2         Y          N    Y     N        Y    dive+bomber  VN
dive-bomber  3         Y          N    Y     N        N    dive+bomber  NN

In the same way, if you want to examine derivational compound analyses, and leave out all the other analyses, you should add an 'expression' restriction to your lexicon which states that DerComp = Y. In the example lexicon, this restriction produces the following result:

Stem         MorphNum  NVAffComp  Der  Comp  DerComp  Def  Imm          ImmClass

typesetter   2         Y          N    N     Y        Y    type+set+er  NVx

Rather than seeing a number of analyses, you might prefer to look at just one straightforward default analysis, no matter how many alternatives are given in subsequent rows. Again, you can quickly construct restrictions to make this possible. The quickest way is to use the MorphNum column, which gives a number to each analysis of each stem. You can say MorphNum = 1, which means that only the very first analysis of each stem appears in your lexicon.

Sometimes there may be more than one default analysis. If you want to see just the default analysis of each compound, you should use these two restrictions: Def = Y and Comp = Y. In the example lexicon, this means that the non-preferred noun + noun analysis is left out:

Stem         MorphNum  NVAffComp  Der  Comp  DerComp  Def  Imm          ImmClass

dive-bomber  2         Y          N    Y     N        Y    dive+bomber  VN
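The restrictions discussed in this example can be reproduced on the five-row example lexicon with a few lines of code. This is a sketch rather than a description of how FLEX works internally: the `restrict` helper is hypothetical, the rows are typed in by hand from the example, and the + separator and the column name NVAffComp are read from the damaged original and should be treated as assumptions.

```python
COLUMNS = ["Stem", "MorphNum", "NVAffComp", "Der", "Comp",
           "DerComp", "Def", "Imm", "ImmClass"]

RAW_ROWS = [
    ("dive-bomber", 1, "Y", "Y", "N", "N", "Y", "dive-bomb+er", "Vx"),
    ("dive-bomber", 2, "Y", "N", "Y", "N", "Y", "dive+bomber", "VN"),
    ("dive-bomber", 3, "Y", "N", "Y", "N", "N", "dive+bomber", "NN"),
    ("typesetter", 1, "Y", "Y", "N", "N", "Y", "typeset+er", "Vx"),
    ("typesetter", 2, "Y", "N", "N", "Y", "Y", "type+set+er", "NVx"),
]

lexicon = [dict(zip(COLUMNS, raw)) for raw in RAW_ROWS]


def restrict(rows, **wanted):
    """Keep rows whose columns equal all the requested values."""
    return [row for row in rows
            if all(row[col] == value for col, value in wanted.items())]


compounds = restrict(lexicon, Comp="Y")                    # Comp = Y
derivational = restrict(lexicon, DerComp="Y")              # DerComp = Y
default_compounds = restrict(lexicon, Def="Y", Comp="Y")   # Def = Y and Comp = Y
first_analyses = restrict(lexicon, MorphNum=1)             # MorphNum = 1

for row in default_compounds:
    print(row["Stem"], row["MorphNum"], row["Imm"], row["ImmClass"])
```

Combining keyword arguments corresponds to stacking several restrictions, which is how the noun + noun analysis of dive-bomber drops out in the last example.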

These explanations may appear complicated, but by reading them, you can get to know the important restrictions that you can use to extract the types of analyses you really want.

IMMEDIATE SEGMENTATION

Immediate segmentation is the least detailed form of analysis offered here. It doesn't give you a full analysis, right down to all the smallest elements a stem contains; rather it is a simple, one-level breakdown of a stem into its next biggest elements. So, while complete segmentation is equivalent to a full analytical tree, immediate analysis can be thought of as a close look at a particular level.

There are ten columns which present the immediate segmentation of stems to you. The first gives the orthography of the analysed elements. The next three give more general codings, so that using the FLEX options SHOW and QUERY, you can look for stems which have a particular form: a preposition plus a noun, say, or a stem plus a stem plus an affix, and so on. The remaining six deal with particular features which sometimes occur in morphological analysis: stem allomorphy, affix substitution, opacity, derivational transformation, infixation and reversion.

In the first column, you get the orthography of the first-level elements themselves, each separated by a + sign. Diacritical markers are not included. Thus the stem nameplate is shown as name+plate, in accordance with the various rules discussed elsewhere in this guide. Note that each element is given in the form of a headword or an affix, even when the original word doesn't use that particular form. Thus the stem liturgical is analysed as liturgy+ical, where liturg is rewritten in the normal form of the stem liturgy. The FLEX name and description of this column are as follows:

Imm

(ImmLemma)

Immediate segmentation

The second column is like the first, except that where the first column gives you the orthography of each element, this column gives you the word class of each element, leaving out any + signs. Single letter labels are used to represent the syntactic class of each element, which is unlike many of the syntactic codes used in other parts of the database. The use of a single character means that there is no possibility of a code becoming ambiguous, since each character is unique. The table below shows you the labels used in this column.

Word Class            Label

Noun                  N
Adjective             A
Numeral               Q
Verb                  V
Article               D
Pronoun               O
Adverb                B
Preposition           P
Conjunction           C
Interjection          I
Single contraction    S
Complex contraction   T
Affix                 x

Table: Word class labels (immediate segmentation)

Using these codes, nameplate is given the code NN, to indicate that it is made up of two nouns (a compound), and emigration has the code Vx to indicate that it is made up of a verb and an affix (a derivation). The FLEX name and description of the column that gives you these codes are as follows:

ImmClass

(ImmClassLemma)

Immediate segmentation, word class labels
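Because ImmClass holds exactly one character per element of Imm, the two columns can be lined up position by position. The sketch below pairs each element with the name of its word class. It assumes that + is the separator used in Imm, and the function name `label_elements` is hypothetical. The label table is the one given above.

```python
WORD_CLASS = {
    "N": "Noun", "A": "Adjective", "Q": "Numeral", "V": "Verb",
    "D": "Article", "O": "Pronoun", "B": "Adverb", "P": "Preposition",
    "C": "Conjunction", "I": "Interjection", "S": "Single contraction",
    "T": "Complex contraction", "x": "Affix",
}


def label_elements(imm: str, imm_class: str):
    """Pair each element of an Imm value with its word class name."""
    elements = imm.split("+")  # '+' as separator is an assumption
    if len(elements) != len(imm_class):
        raise ValueError("Imm and ImmClass describe different numbers of elements")
    return [(element, WORD_CLASS[label])
            for element, label in zip(elements, imm_class)]


print(label_elements("name+plate", "NN"))
print(label_elements("dive-bomb+er", "Vx"))
```

The length check is a cheap way to catch rows where the two columns have drifted out of step, for example after a faulty export.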

The third column provides more detailed information about the syntactic categorization of verbal stems. The basic codes used are exactly the same as in the ImmClass column, except that instead of the V code to represent a verb, any one of a number of codes is given. The table below shows you these codes, along with their meaning.

Verbal subcategory Label

Intransitive �

Transitive �

Intransitive � transitive �

Unmarked for transitivity �

Table ��� One�character verbal subclass labels


In this column� the word emigration has the code �x� It isexactly the same as the code in the previous column� exceptthat the V is replaced by the number �� indicating in moredetail what sort of verb it is�

The FLEX name and description of this column are as follows:

ImmSubCat

(ImmSubCatLemma)

Immediate segmentation, subcat labels

The fourth immediate segmentation column simply tells you whether the elements identified are stems or affixes. Upper case S indicates a stem, upper case A indicates an affix, and upper case F indicates a flectional form of a stem. Thus emigration is represented as SA, and bagpipes as SF. The FLEX name and description of this column are as follows:

ImmSA

(ImmSALemma)

Immediate segmentation, stem/affix labels

The fifth immediate segmentation column concerns stem allomorphy. Within a word, a stem sometimes takes a form different from the one used when it is written down as a word in its own right. When morphological analysis is noted down, any resulting stems are given their normal stem form, because it's easiest to understand. An example is the word abundant, which comprises the stem abound and the affix -ant. Note the difference between what appears in the original word (abund) and its regular stem form (abound): each has the same meaning, and the only difference between them is their spelling. This is an example of derivational stem allomorphy, since a new word has been derived by linking a different form of stem to an affix.

Another sort of stem allomorphy sometimes occurs with conversions, that is, words which change their word class without the addition of an affix (sleep is both a noun and a verb, for example). When conversion occurs, and the form of the stem seems to have altered, the process can be termed conversion with allomorphy. The verb halve is an instance of conversion with allomorphy, since it is a conversion from the noun form half. Thus half and halve are considered allomorphs: two different forms or representations of the same stem. There are three types of conversion with allomorphy. The first is the voicing of the final consonant with the addition of a final -e; thus the verb thieve is considered to be a conversion of the noun thief. The second is the same process in reverse (the removal of a final -e, and the devoicing of the last consonant); thus the noun belief is a conversion of the verb believe. The third is the change in spelling from final s to c; the noun practice can thus be thought of as a zero-derivation from the verb practise.

The next type of stem allomorphy is flectional allomorphy, a relatively rare type. When the irregular past tense of a verb is used as an adjective, both are said to derive from the infinitive form, so that the adjective drunken comes from the verb drink. The same is true for past participle forms: the adjective born thus derives from the verb bear.

There are two other categories which are dealt with under stem allomorphy even though they're not really instances of stem allomorphy: clippings and blends. Clippings are shortened forms of words which do not change word class. For example, phone is a simple clipping of telephone. Sometimes a clipping consists of more than one morpheme: vibes is a clipping of vibraphone which contains the stem vibraphone and the affix -s, and hanky is a diminutive form consisting of the stem handkerchief and the affix -y.

A blend is a word which is made up of two stems, at least one of which may be shortened. The word smog is made up of the stems smoke and fog, and paratrooper consists of the stems parachute and trooper. Note that the definition of a blend only allows for stems, not affixes.

The table below summarizes the five types of allomorphy and shows the codes used to identify them in the ImmAllo column.

Stem Allomorphy   Code   Example

Blend             B      breathalyse
Clipping          C      phone
Derivational      D      clarify
Flectional        F      born
Conversion        Z      belief

Table: Stem allomorphy codes

The FLEX name and description of this column are as follows:

ImmAllo

(ImmAlloLemma)

Stem allomorphy, top level

The sixth immediate segmentation column marks stems with a morphological analysis involving affix substitution. This is the process whereby an affix replaces part of a stem when that stem and the affix join to form another stem. For example, active is analysed as the stem action and the affix -ive: the affix -ion has disappeared, and the new affix -ive has taken the place of the old one. So, this column gives Y for yes if the immediate analysis of the stem involves affix substitution, or N for no if it does not. The FLEX name and description of this column are as follows:

ImmSubst

(ImmSubstLemma)

Affix substitution, top level

The seventh column identifies those words whose analysis is opaque, that is, words made up of morphemes which are recognisable, but where the meaning of the head element isn't reflected in the meaning of the full word. An example of this is accordion: it appears to be made up of the verbal stem accord (the head element) and the affix -ion. Since the semantic link between accord and accordion is far from obvious, the analysis is marked as being opaque, and it gets a Y in this column. Words whose analyses are morphologically and semantically clear get the code N. The FLEX name and description of this column are as follows:

ImmOpac

(ImmOpacLemma)

Opacity, top level

The eighth immediate segmentation column gives simple expressions to illustrate any orthographic alterations the analysis of a word involves. A morpheme boundary is marked by a !, letters removed from either side of a morpheme are prefixed by a -, and letters which are added are prefixed by a +. Letters which do not change are considered part of a morpheme, and are not shown. A simple example is !, the pattern for the word unable, since it consists of the affix un- and the stem able, and neither morpheme alters. On the other hand, undersized is given as !-e!, since nothing happens to the first morpheme under-, the final e of the second morpheme size is removed, and nothing happens to the last morpheme -ed. The FLEX name and description of the column that contains these expressions are as follows:

TransDer

(TransDerLemma)

Derivational transformation, top level

The ninth column indicates which stems have an immediate analysis involving derivation by means of an infix. Usually, derivational affixes are added to the beginning or end of a stem, but in some cases the affix is inserted into a multi-word, as in derivations from verb-and-particle combinations like hanger-on from hang on and looker-on from look on. Stems marked for this type of infixation get the code Y in this column; all other analyses get the code N.

ImmInfix

(ImmInfixLemma)

Infixation, top level

The last immediate segmentation column deals with stems analysed as conversions from multi-words which have undergone reversion of their parts in the process of conversion. For instance, the noun downpour is considered a conversion of the verb pour down, and the adjective off-putting is derived from the verb put off via its flection putting off. Whenever a stem is analysed in this way, this column yields a Y code. In all other cases, an N code is given.

ImmRevers

(ImmReversLemma)

Reversion, top level

COMPLETE SEGMENTATION (FLAT)

Complete segmentation is 'complete' in the sense that it identifies all the morphemes a stem contains. This is in contrast to immediate segmentation, which only picks out the next two (sometimes three) morphological elements. The complete segmentation discussed in this section is also flat, which means that you can see what the constituent morphemes are without knowing the details of the full morphological analysis which has been carried out. When you draw a morphological 'tree diagram', this information gives the outermost branches only: you cannot analyse any further, and you cannot see the intermediate levels. So, when you want to see the complete (flat) segmentation of counter-revolutionary for example, you get this sort of information:

counter-revolutionary

counter   revolt   ution   ary

There are three columns with complete segmentation (flat) information. The first contains the morphemes themselves. The second contains the word class of each morpheme, and the third simply states whether each morpheme is a stem or an affix. The last two columns are useful when you're looking for a stem with a particular combination of morphemes: using the FLEX SHOW and QUERY options, you can hunt out stems which are made up of a noun plus an affix plus a noun, say, or all the stems which contain at least three other stems.

The first column gives you each stem split into its morphemes by + signs. Thus the stem counter-revolutionary is written in the following way:

counter+revolt+ution+ary

No diacritics are included. The FLEX name and description of this column are as follows:

Flat

(FlatLemma)

Flat segmentation

The second column uses single-letter codes to represent the word class of each morpheme.

Word Class            Label

Noun                  N
Adjective             A
Numeral               Q
Verb                  V
Article               D
Pronoun               O
Adverb                B
Preposition           P
Conjunction           C
Interjection          I
Single contraction    S
Complex contraction   T
Affix                 x

Table: Word class labels (flat segmentation)

Using these codes, the stem counter-revolutionary is given as xVxx. The FLEX name and description of the column are as follows:

FlatClass

(FlatClassLemma)

Flat segmentation, word class labels

The last column simply indicates whether each morpheme is a stem, a flection or an affix. Upper case S means Stem, upper case F means Flection, and upper case A means Affix. The full code for counter-revolutionary is thus ASAA. The FLEX name and description of this column are as follows:

FlatSA

(FlatSALemma)

Flat segmentation, stem/affix labels
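The kinds of search described for the flat columns (a particular sequence of word classes, or a minimum number of stems) can be sketched with ordinary string operations. The helper names below are hypothetical, the + separator is assumed, and the single row is the counter-revolutionary example from the text. Regular expressions here merely stand in for whatever pattern syntax FLEX itself offers.

```python
import re

row = {
    "Stem": "counter-revolutionary",
    "Flat": "counter+revolt+ution+ary",
    "FlatClass": "xVxx",
    "FlatSA": "ASAA",
}


def count_stems(flat_sa: str) -> int:
    """Number of morphemes labelled S (stem) in a FlatSA code."""
    return flat_sa.count("S")


def class_matches(flat_class: str, pattern: str) -> bool:
    """True if the whole FlatClass code matches a regular expression."""
    return re.fullmatch(pattern, flat_class) is not None


def is_consistent(row: dict) -> bool:
    """Check that all three flat columns describe the same number of morphemes."""
    n = len(row["Flat"].split("+"))
    return n == len(row["FlatClass"]) == len(row["FlatSA"])


print(count_stems(row["FlatSA"]))              # stems in counter-revolutionary
print(class_matches(row["FlatClass"], "NxN"))  # noun + affix + noun?
print(is_consistent(row))
```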

COMPLETE SEGMENTATION (HIERARCHICAL)

Complete� hierarchical segmentation gives the most detailedanalysis available for each stem� It is called hierarchicalbecause it can cover several di�erent levels� it is arrivedat after immediate analysis has been carried out on everystem that can be identi�ed within a larger stem� Withthis information� you can draw a complete morphological�tree diagram�� from the root to the outermost branches�with every intermediate branch fully represented� So� for

Page 73: ENGLISH LINGUISTIC GUIDE · 2007-10-22 · ENGLISH LINGUISTIC GUIDE Despite the remark able and irrev ersible c hanges that ha v e come up on the English language since the AngloSaxon

Complete segmentation �hierarchical� ����

the stem counter�revolutionary� you can get the followingmorphological analysis�

                counter-revolutionary
        counter-revolution
                revolution
counter     revolt     ution     ary

There are six columns which give information about the full segmentations of stems. Three of them give the hierarchical segmentations themselves. The simplest of these tells you what the constituent morphemes of the stem are, indicating with algebra-like brackets the structure of the "tree". Also available are similar bracket notations which supply a word class label alongside each element on each level, or the word class without the spelling of the element itself. The remaining three columns indicate whether stem allomorphy, affix substitution or opacity has occurred anywhere in the full hierarchical analysis.

The first column provides all the information you need to draw a tree diagram like the one above: that is, the constituent morphemes of a stem, each delimited by a comma and enclosed in brackets which indicate its complete morphological structure. The stem counter-revolutionary thus looks like this:

(((counter),((revolt),(ution))),(ary))

Each identifiable stem or affix is enclosed by a pair of brackets, beginning with the brackets round the full original stem. Then there are brackets round the stem counter-revolution, and subsequently round the stem revolution. Finally there are brackets round each of the four morphemes.

The FLEX name and description of the column which contains morphological analyses in this form are as follows:

Struc

(StrucLemma)

Structured segmentation
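The bracket notation can be read back into a tree mechanically. The following is only a minimal sketch, assuming the string has the form `(((counter),((revolt),(ution))),(ary))` reconstructed from the description above; the function names `parse_struc` and `morphemes` are hypothetical helpers for illustration and are not part of FLEX.

```python
def parse_struc(code):
    """Turn a bracketed segmentation string into nested Python lists.

    A single morpheme becomes a str; a complex stem becomes a list of nodes.
    (Hypothetical helper; the bracket format is assumed from the guide's text.)
    """
    pos = 0

    def parse_node():
        nonlocal pos
        if code[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1  # consume the opening bracket
        if code[pos] == "(":
            # Complex stem: comma-separated sub-nodes.
            node = [parse_node()]
            while code[pos] == ",":
                pos += 1
                node.append(parse_node())
        else:
            # Single morpheme: read the spelling up to the closing bracket.
            end = code.index(")", pos)
            node = code[pos:end]
            pos = end
        if code[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1  # consume the closing bracket
        return node

    tree = parse_node()
    if pos != len(code):
        raise ValueError("trailing characters after the outermost bracket")
    return tree


def morphemes(node):
    """Return the morphemes (the leaves of the tree) from left to right."""
    if isinstance(node, str):
        return [node]
    return [m for child in node for m in morphemes(child)]


tree = parse_struc("(((counter),((revolt),(ution))),(ary))")
print(tree)
print(morphemes(tree))
```

Nested lists are used here only because they mirror the brackets one for one; any tree type would do.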


The next two columns use extra labels to indicate the word class of each segment. They are given between square brackets to the right of each closing round bracket, so that every segment on every level within the original stem has a word class code. The word class codes used are as follows:

Word Class             Label

Noun                   N
Adjective              A
Numeral                Q
Verb                   V
Article                D
Pronoun                O
Adverb                 B
Preposition            P
Conjunction            C
Interjection           I
Single contraction     S
Complex contraction    T

Table ��� Word class labels (complete segmentation)

The codes used for affixes are combinations of these word class labels. The stem counter-revolutionary can be represented as follows:

(((counter)[N|.N],((revolt)[V],(ution)[N|V.])[N])[N],(ary)[A|N.])[N]

This example illustrates the special form affix codes take. There are two elements in each affix code which are separated by a vertical bar (|). In front of the vertical bar is a single code which is the word class of the stem which the affix in question helps to form. After the vertical bar comes a combination of single-letter codes which indicate the word class of each element within the stem formed, and the position of the affix itself is given by a dot.

In the counter-revolutionary example above, the code given alongside the affix counter is N|.N. The N before the bar means that the affix counter helps to form a stem which is a noun (counter-revolution). The .N after the bar means that the segmentation of the noun counter-revolution is affix plus noun. These detailed codes can help you to identify the way affixes are used, and to get lists of stems which contain affixes used in particular contexts: the fact that the second part of the counter code is .N helps you to see at once that this affix helps to form a derivation in conjunction with a noun.

Sometimes a pair of affixes can only be used together, as in the word aerodrome: the word aero does not exist and the word drome does not exist. In such cases, x marks the other affix, and denotes that the affixes must occur in combination with each other (so-called combining forms). The code for the aero- of aerodrome is thus N|.x, and the code N|x. is given for the -drome.

So, this column is particularly useful for two things. First, you get the word class of each stem in the segmentation alongside the orthographic representations of individual morphemes. Second, you get detailed information about each affix each stem contains. The FLEX name and description of this column are as follows:

StrucLab

(StrucLabLemma)

Structured segmentation, word class labels
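An affix code such as N|.N can be unpacked in a few lines. The sketch below is illustrative only: it assumes the vertical bar and dot conventions described above, and `decode_affix_code` is a hypothetical name, not a FLEX facility.

```python
def decode_affix_code(code):
    """Split an affix code such as 'N|.N' into its parts.

    Returns the word class of the stem formed, the sequence of elements in
    that stem (with 'affix' standing where the dot was), and the position of
    the affix within that sequence. Hypothetical helper for illustration.
    """
    result_class, pattern = code.split("|")
    if pattern.count(".") != 1:
        raise ValueError("an affix code should contain exactly one dot")
    elements = ["affix" if ch == "." else ch for ch in pattern]
    return {
        "forms": result_class,
        "elements": elements,
        "affix_position": pattern.index("."),
    }


# counter- in counter-revolution: forms a noun, as affix plus noun.
print(decode_affix_code("N|.N"))
# -ution in revolution: forms a noun, as verb plus affix.
print(decode_affix_code("N|V."))
# aero- in aerodrome: x marks the other combining form.
print(decode_affix_code("N|.x"))
```

A restriction such as "all affixes that attach to the front of a noun" then becomes a check that `elements` equals `["affix", "N"]`.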

The next column shows the hierarchical structure of each stem by means of round brackets and commas, and the full word class labels between square brackets, just as with the previous column. The only difference is that in this column the orthographic representation of the constituent stems and affixes is left out altogether. Thus the stem counter-revolutionary gets the following representation:

(([N|.N],([V],[N|V.])[N])[N],[A|N.])[N]

This column again helps you to search for stems which have a particular morphological structure and particular combinations of syntactic elements. The FLEX name and description of this column are as follows:

StrucBrackLab

(StrucBrackLabLemma)

Structured segmentation, word class labels only
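The relationship between the two labelled notations can be shown with a single substitution. This is a sketch under the assumption that spellings never contain brackets or commas; `strip_spellings` is a hypothetical name, and the input string is the reconstruction of the counter-revolutionary example used above.

```python
import re


def strip_spellings(struc_lab):
    """Remove every bracketed spelling, keeping brackets and class labels.

    A spelling is matched as '(' + characters that are not brackets or
    commas + ')', which only fits the innermost, morpheme-level brackets.
    """
    return re.sub(r"\([^()\[\],]+\)", "", struc_lab)


labelled = "(((counter)[N|.N],((revolt)[V],(ution)[N|V.])[N])[N],(ary)[A|N.])[N]"
print(strip_spellings(labelled))
```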

The fourth hierarchical segmentation column deals with stem allomorphy. Within words, stems sometimes take a form different from their generally accepted stem form. When a morphological analysis is noted down, the resulting stems are given their normal stem orthography. An example is the word inedible, which comprises the affix in-, the stem eat and the affix -ible; note the difference between ed and eat, where the one element is spelt two different ways. This is stem allomorphy. If stem allomorphy occurs at any point in a stem's complete hierarchical segmentation, a code is given in this column to show what sort of stem allomorphy occurs. The table below shows the codes, and you can read more about what each code means in section ����� above; they are the same codes used in the ImmAllo column.

Stem Allomorphy    Code    Example

Blend              B       breathalyse
Clipping           C       phone
Derivational       D       clarify
Flectional         F       born
Conversion         Z       belief

Table ��� Stem allomorphy codes

The FLEX name and description for this column are as follows:

StrucAllo

(StrucAlloLemma)

Stem allomorphy, any level

The fifth hierarchical segmentation column marks stems with a morphological analysis involving affix substitution. This is the process whereby an affix replaces part of a stem when that stem and the affix join to form another stem. For example, melodic is analysed as the stem melody plus the affix -ic: the affix -y has disappeared, and the new affix -ic has taken the place of the old one. So, this column gives Y for yes if the complete analysis of the stem involves affix substitution, or N for no if it does not. The FLEX name and description of this column are as follows:

StrucSubst

(StrucSubstLemma)

Affix substitution, any level

The sixth and last hierarchical segmentation column identifies those words whose analysis is completely or partly opaque, that is, words made up of morphemes which are recognisable, but where the meaning of the head element isn't reflected in the meaning of the full word. An example of this is ladykiller: it appears to be made up of the noun stem lady and the noun stem killer (which can subsequently be analysed as kill plus -er). Since the meaning of the head element killer doesn't relate directly to the meaning of the full word, the analysis is marked as being opaque, and it gets a Y in this column. Words whose analyses are morphologically and semantically clear get the code N. The FLEX name and description of this column are as follows:

StrucOpac

(StrucOpacLemma)

Opacity, any level

��� OTHER CODES

The remaining three columns give counts of various sorts: the number of components (i.e. stems and affixes) in the immediate analysis of each stem, the number of morphemes a stem contains after complete segmentation, and the number of levels involved in the complete hierarchical analysis of each stem.

The first of these columns is the simple count of the number of components each stem contains. The normal figure is two: words are generally split into two parts each time one level of morphological analysis takes place. Sometimes three components can be identified: derivational compounds are usually analysed as a stem plus a stem plus an affix, as are normal compounds which are joined with a special "link morpheme" (-a-, -o-, or -s-). And of course, monomorphemic words only contain one component. Any stems which cannot receive an adequate morphological analysis (for the reasons given in section ���) get the number 0.

Some examples: in the stem counter-revolutionary, the number of components is two (the stem counter-revolution and the affix -ary) and for law-breaker it is three (the stem law, the stem break, and the affix -er).

The FLEX name and description of this column are as follows:

CompCnt

(CompCntLemma)

Number of morphological components


The second column gives you the number of morphemes in each stem. For words without a morphological analysis, the number given is zero. The number of morphemes in the stem counter-revolutionary, for example, is four, while for law-breaker it is three.

The FLEX name and description of this column are as follows:

MorCnt

(MorCntLemma)

Number of morphemes

The last of the three columns gives a count of the number of levels in the complete hierarchical segmentation described above, which is best illustrated by another look at the tree diagram illustrating the analysis of counter-revolutionary:

                counter-revolutionary
        counter-revolution
                revolution
counter     revolt     ution     ary

Including the stem at the top, the diagram covers four lines: this is the number of levels the stem has. It is the number of times you can carry on doing immediate analysis when you analyse a particular stem in full. Do not confuse it with the number of all the immediate analyses required to arrive at the complete hierarchical segmentation: any one level of analysis may include more than one immediate segmentation. Monomorphemic stems always get the number 1, while stems without analysis (for reasons explained in section ���) get the number 0.

The FLEX name and description of this column are as follows:

LevelCnt

(LevelCntLemma)

Number of morphological levels
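The three counts can be related to the bracketed segmentation with a single pass over the string. The sketch below is an illustration rather than the procedure used to build the database: it assumes the bracket format `(((counter),((revolt),(ution))),(ary))`, assumes that an unanalysed stem has an empty segmentation field, and uses a made-up segmentation `((law),(break),(er))` for law-breaker. The name `segmentation_counts` is hypothetical.

```python
def segmentation_counts(struc):
    """Return (components, morphemes, levels) for a bracketed segmentation.

    components: number of parts directly below the full stem
    morphemes:  number of innermost bracket pairs (the leaves)
    levels:     deepest bracket nesting, counting the full stem as level 1
    An empty string (assumed to mean "no analysis") gives (0, 0, 0).
    """
    if not struc:
        return (0, 0, 0)
    depth = max_depth = morphemes = components = 0
    previous = ""
    for ch in struc:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
            if depth == 2:
                components += 1  # a part directly below the full stem
        elif ch == ")":
            if previous != ")":
                morphemes += 1   # closing bracket right after a spelling
            depth -= 1
        previous = ch
    # A monomorphemic stem has no inner brackets: it is its own component.
    return (max(components, 1), morphemes, max_depth)


print(segmentation_counts("(((counter),((revolt),(ution))),(ary))"))
print(segmentation_counts("((law),(break),(er))"))
print(segmentation_counts("(cat)"))
```

For counter-revolutionary this gives two components, four morphemes and four levels, in line with the figures quoted in the text.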


��� MORPHOLOGY OF ENGLISH WORDFORMS

There are two types of morphology information available for the wordforms given in the CELEX database: first, information about the lemma which underlies each family of wordforms, and second, a simple identification of the inflectional features which are specific to each wordform, either in the form of thirteen "yes/no" feature columns or one column with feature identification codes.

Dictionaries present their lexical information under bold-type headwords, which are used instead of listing every individual inflected form separately. Such a form is often called the canonical form, since it represents a full canon of inflections. Thus the word eat is understood as referring not only to the form eat itself, but also to the forms eats, eating, ate, and eaten. To print full details about every inflected form separately would result in a lot of needless repetition and enormous books which no one could lift from the bookshelf.

However, for many applications, lemma information has to be listed for each individual wordform, and in a CELEX lexicon of type wordform, you can do just that when you include certain "morphological" columns. This is done by providing a link between the wordform information and the lemma information. When you choose the option Lemma information from the ADD COLUMNS menu, you are in fact being allowed into the lemma information by the back door. You can now look up information specific to a particular wordform in your lexicon, and at the same time see general information which is common to all the other forms in the same inflectional paradigm. One particularly useful type of lemma information you can use in your wordform lexicon is the syntactic information, which can give the word class of any wordform you are looking at. There is also an important distinction which you may be able to draw upon with the frequency information. The wordform lexicon gives you a COBUILD frequency figure specific to each wordform, while the lemma information available lets you see the sum frequency for all the inflectional forms in the same paradigm, a figure referred to as the lemma frequency.

All the lemma information has already been defined elsewhere in this linguistic guide, so there is no point in repeating it here. All that needs to be pointed out is that the column names used in a real lemma lexicon differ from those used in the lemma information option in the morphology of wordforms. When a FLEX column name and description are defined in the course of lemma lexicon text, the column name given in brackets is the name of the column when it is used as part of a wordforms lexicon. Usually this name is identical to the lemma lexicon name, except that the word Lemma is added to the end.

ExampleName

(ExampleNameLemma)

The column names used for lemma information in a Wordforms lexicon are given in brackets, as this Example Name shows.

All the other details and definitions remain the same in both cases. So, when you're looking for the columns of lemma information provided with a wordforms lexicon under morphology, just go back to the original lemma information: it's all there.

����� INFLECTIONAL FEATURES

There are thirteen special columns available only with a lexicon of type wordforms. Each one corresponds to a particular inflectional attribute which a wordform can have. There can only be one of two codes in each column: Y for "yes, this wordform has this attribute", or N for "no, this wordform does not have this attribute". These columns are therefore useful for constructing restrictions on your lexicons, restrictions which need not be "on view": it's unlikely that you will want to look at the contents of these columns with the SHOW option. (If, on the other hand, you want to have a label which lets you see at a glance all the inflectional features each wordform has, then you should use the "type of flection" codes described in the next section.)

An example: to make a lexicon which gives you all first person, present tense verb forms in the database, you have to include at least three columns in the wordforms lexicon you create, namely a column which gives the orthographic representations you prefer, along with Pres and Sin1 (which are amongst the thirteen columns described below). You must then construct two restrictions for your lexicon, one stating that Pres must be equal to Y, and another stating that Sin1 must be equal to Y. You can then format your lexicon to make sure that Pres and Sin1 are not "on view": that way, when you SHOW or EXPORT your lexicon, you just get the list of words you require without two lists of Y's. To this basic lexicon you can of course add any other columns you require, either the orthographic and frequency information specific to each wordform, or the general lemma information (particularly syntax) which is available through the "Morphology of English wordforms" options.
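Outside FLEX, the same two restrictions can be imitated on exported rows. The sketch below is hypothetical throughout: the rows are invented sample data rather than database output, the column name Word is made up, and `restrict` and `hide` are illustrative helpers, not FLEX commands.

```python
# Invented sample rows; the Y/N values are for illustration only.
rows = [
    {"Word": "go",   "Pres": "Y", "Sin1": "Y"},
    {"Word": "goes", "Pres": "Y", "Sin1": "N"},
    {"Word": "went", "Pres": "N", "Sin1": "Y"},
]


def restrict(rows, **conditions):
    """Keep only the rows in which every named column has the given value."""
    return [
        row for row in rows
        if all(row[column] == value for column, value in conditions.items())
    ]


def hide(rows, *columns):
    """Drop columns that were needed for restrictions but not for display."""
    return [
        {name: value for name, value in row.items() if name not in columns}
        for row in rows
    ]


first_person_present = hide(restrict(rows, Pres="Y", Sin1="Y"), "Pres", "Sin1")
print(first_person_present)
```

Restricting first and hiding afterwards mirrors the order described above: the columns must be present for the restrictions to work, but need not be on view.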

The first inflectional features column indicates whether a wordform is a singular form of any sort. This means past and present tense verb forms such as hibernated or babbles, or nouns such as sagacity. The FLEX name and description of this column are as follows:

Sing

Inflectional feature, singular

The second column indicates whether a wordform is a plural inflection of any sort. This means past and present tense verb forms such as hibernate or submerged, or nouns such as jocularities. The FLEX name and description of this column are as follows:

Plu

Inflectional feature, plural

The third column marks all the wordforms which are positive forms, that is, not comparative or superlative forms like better and best, but plain adjectival or adverbial forms like good or often. Thus adjectives like goofy and idiomatic or adverbs like seldom and idiomatically get the code Y, while all other forms get the code N. The FLEX name and description of this column are as follows:

Pos

Inflectional feature, positive

The fourth column marks all the wordforms which are comparative forms, almost always adjectives. Wordforms such as better or angrier or cannier thus get the code Y, while all other, non-comparative forms get the code N. There is also a small number of comparative adverbs which get the Y code, such as further. The FLEX name and description of this column are as follows:

Comp

Inflectional feature, comparative


The fifth column marks all superlative forms, so that wordforms such as best or angriest get the code Y, and every other form gets the code N. There is also a small number of superlative adverbs which get the Y code, such as furthest. The FLEX name and description of this column are as follows:

Sup

Inflectional feature, superlative

The sixth column marks the form of the verb usually known as the infinitive. It is used as a headword in the CELEX databases, and in most dictionaries. Words like waffle or have or eat, which can be used with the particle to in front of them, are infinitives. Any wordform which is an infinitive gets a Y code in this column; all the others get the code N. The FLEX name and description for this column are as follows:

Inf

Inflectional feature, infinitive

The seventh column marks any participles, past tense or present tense. Present participles are normally formed by adding -ing to the stem of the verb, with the exception of some irregular verbs. Past participles add a suffix ending in -d to the stem, and they are used in the formation of the perfect tense: "I've lived in Nijmegen for four years". Again, many irregular verbs don't match this rule (gone is the past participle of go, for example). Most past participles can also be used adjectivally, as in "the panelled walls". Any wordforms which are participles get the code Y, and all the rest get the code N. The FLEX name and description of this column are as follows:

Part

Inflectional feature, participle

The eighth column identifies any present tense forms, including the present participles mentioned under Part. Thus verb forms like gleam, gleams and gleaming get the code Y, while all other forms (including infinitives, which are marked in a different column) get the code N. The FLEX name and description of this column are as follows:

Pres

Inflectional feature, present tense


The ninth column identifies any past tense forms, including the past participles mentioned under Part. Thus forms like occupied and elicited get the code Y, while all other forms (including infinitives, which are marked in a different column) get the code N. The FLEX name and description of this column are as follows:

Past

Inflectional feature, past tense

The tenth column marks first person singular forms of verbs, whether present tense or past tense. So, all first person singular forms, like "I go" or "I finished off", are given the code Y, and every other form gets the code N. The FLEX name and description of this column are as follows:

Sin1

Inflectional feature, 1st person verb

The eleventh column marks second person singular forms of verbs, whether present tense or past tense forms. For most verbs, the second person form is the same as the first person form, but some irregular verbs are exceptions. So all second person forms like "you are" or "you shout" are given the code Y, and every other form gets the code N. The FLEX name and description of this column are as follows:

Sin2

Inflectional feature, 2nd person verb

The twelfth column identifies third person singular forms of verbs, whether present tense or past tense forms. For most verbs, the third person present tense form consists of the stem plus the suffix -s. Thus forms like "he stood up" or "Gilbert acts" get the code Y while every other form gets the code N. The FLEX name and description for this column are as follows:

Sin3

Inflectional feature, 3rd person verb

The thirteenth and last column marks rare forms, normally forms which have become outdated, like brethren, shouldst, or wert. Such forms have the code Y in this column, while every other wordform gets the code N. The FLEX name and description of this column are as follows:

Rare

Inflectional feature, rare form


����� TYPE OF FLECTION

In the "Inflectional Features" section above, thirteen different inflectional features are distinguished, and assigned to thirteen separate "yes/no" columns. The same information is also available in one single column, using combinations of single-letter codes to show all the features each wordform has. The "yes/no" columns are useful for constructing restrictions on your lexicon, whereas the "type of flection" column described here provides you with a label that identifies at a glance all the features each wordform has. Table �� below sets out all the single-letter codes that can be combined.

Inflectional feature     Label    "yes/no" column name

Singular                 S        Sing
Plural                   P        Plu
Positive                 b        Pos
Comparative              c        Comp
Superlative              s        Sup
Infinitive               i        Inf
Participle               p        Part
Present tense            e        Pres
Past tense               a        Past
1st person verb          1        Sin1
2nd person verb          2        Sin2
3rd person verb          3        Sin3
Rare form                r        Rare
Headword form            X
(not nouns, verbs,
adjectives or adverbs)

Table ��� Type of flection labels

For a full definition of these flection types, read the details given for the appropriate "yes/no" columns in section ���� above. However, note that there is one type of flection label which does not correspond to a "yes/no" column. The X label identifies many forms not covered by the other labels, including prepositions like among or less, pronouns like that or hers, conjunctions like immediately or that, numerals like fifth or thousand, contracted forms like I'll or hadn't, and interjections like phew or amen. These forms are always the same as those used as the headword form of the lemma (thus the very few inflected adverbial forms do not get the code X). No nouns, verbs, adjectives or adverbs ever get the code X.

Each wordform may have more than one code attached to it. Thus the wordform boasted has the code a3S: a means it is a past tense form, 3 means that it is a third person form, and S means that it is singular.

The FLEX name and description of this column are as follows:

FlectType

Type of flection
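A type of flection label can be expanded back into its individual features with a lookup table. The sketch below assumes the single-letter codes as reconstructed in the table above (including the digits 1, 2 and 3 for person); `FLECTION_LABELS` and `decode_flect_type` are hypothetical names used only for illustration.

```python
# Label -> (yes/no column name, feature). X has no yes/no column.
FLECTION_LABELS = {
    "S": ("Sing", "singular"),
    "P": ("Plu", "plural"),
    "b": ("Pos", "positive"),
    "c": ("Comp", "comparative"),
    "s": ("Sup", "superlative"),
    "i": ("Inf", "infinitive"),
    "p": ("Part", "participle"),
    "e": ("Pres", "present tense"),
    "a": ("Past", "past tense"),
    "1": ("Sin1", "1st person verb"),
    "2": ("Sin2", "2nd person verb"),
    "3": ("Sin3", "3rd person verb"),
    "r": ("Rare", "rare form"),
    "X": (None, "headword form"),
}


def decode_flect_type(code):
    """Return the feature named by each single-letter code, in order."""
    return [FLECTION_LABELS[letter][1] for letter in code]


def yes_columns(code):
    """Return the yes/no columns that would hold Y for this label."""
    return [
        FLECTION_LABELS[letter][0]
        for letter in code
        if FLECTION_LABELS[letter][0] is not None
    ]


print(decode_flect_type("a3S"))  # the label given for boasted
print(yes_columns("a3S"))
```

Note that the codes are case-sensitive: upper case S is singular, lower case s is superlative.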

����� INFLECTIONAL TRANSFORMATION

The last column shows how the orthographic form of a stem is altered when a flection is formed. Each string of letters in the stem is shown by the symbol @, so the first person present tense form of the verb whose stem is abide is simply given as @. Any blanks or hyphens in the stem are shown as a blank, so abide by is shown as @ @. Letters removed from the front or back of a string are prefixed by a minus sign (-), and letters added are prefixed by a plus sign (+), so abiding by is given as @-e+ing @: first the final -e of abide is removed, and then the suffix -ing is added. This formalism is an unambiguous way of showing the inflectional transformations that occur in the orthographic formation of wordforms.

Whenever the inflectional transformation is irregular (as with the past tense forms of the verb sing, for example: sang and sung), no transformation is given: the field remains empty.
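The formalism can be applied mechanically to a headword. The sketch below is an illustration under stated assumptions: it takes @ to be the symbol that stands for each string of letters, it treats an empty code as "irregular, no transformation given", and `apply_transformation` is a hypothetical helper, not part of FLEX. The law-breaker line uses a made-up code.

```python
import re


def apply_transformation(headword, transformation):
    """Build a wordform from a headword and a code such as '@-e+ing @'.

    Returns None when the code is empty (irregular inflection).
    """
    if not transformation:
        return None
    # Split the headword on blanks and hyphens, keeping the separators.
    tokens = re.split(r"([ -])", headword)
    parts, separators = tokens[0::2], tokens[1::2]
    codes = transformation.split(" ")
    if len(codes) != len(parts):
        raise ValueError("code does not match the number of strings in the headword")

    results = []
    for part, code in zip(parts, codes):
        front_ops, marker, back_ops = code.partition("@")
        if marker != "@":
            raise ValueError(f"missing @ in {code!r}")
        # Operations written before @ act on the front of the string.
        for sign, letters in re.findall(r"([+-])([a-z]+)", front_ops):
            if sign == "-":
                if not part.startswith(letters):
                    raise ValueError(f"{part!r} does not start with {letters!r}")
                part = part[len(letters):]
            else:
                part = letters + part
        # Operations written after @ act on the back of the string.
        for sign, letters in re.findall(r"([+-])([a-z]+)", back_ops):
            if sign == "-":
                if not part.endswith(letters):
                    raise ValueError(f"{part!r} does not end with {letters!r}")
                part = part[:-len(letters)]
            else:
                part = part + letters
        results.append(part)

    # Put the original blanks and hyphens back between the strings.
    wordform = results[0]
    for separator, part in zip(separators, results[1:]):
        wordform += separator + part
    return wordform


print(apply_transformation("abide by", "@-e+ing @"))
print(apply_transformation("abide", "@"))
print(apply_transformation("law-breaker", "@ @+s"))  # made-up code
print(apply_transformation("sing", ""))
```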

The tables below show all the letter groups represented in the database which can be subtracted from or added to a headword to make a wordform.

Letter groups removed from a headword

e     ey    f     fe    y

Table ��� Inflectional transformation codes (letters removed)


Letter groups added to a headword

bed     ber     best    bing    d
ded     der     dest    ding    ed
er      es      est     ged     ger
gest    ging    ied     ier     ies
iest    ing     ked     king    led
ler     lest    ling    med     mer
mest    ming    ned     ner     nest
ning    ped     ping    r       red
ring    s       sed     ses     sing
st      ted     ter     test    ting
ved     ving    ves     zed     zes
zing

Table ��� Inflectional transformation codes (letters added)

The FLEX name and description of this column are as follows:

TransInfl

(TransInflLemma)

Inflectional transformation


� ENGLISH SYNTAX

Syntactic information is available for lexicons of type lemma, or with the lemma information presented with Morphology of English Wordforms. The subsections which make up this section correspond to the seven subclassification options FLEX offers you when you ask for syntactic information under the ADD COLUMNS window.

ADD COLUMNS

Word Class
Subclassification nouns
Subclassification verbs
Subclassification adjectives
Subclassification adverbs
Subclassification numerals

TOP MENU
PREVIOUS MENU
                                   v

ADD COLUMNS

Subclassification pronouns
Subclassification conjunctions

TOP MENU
PREVIOUS MENU

First and foremost, as shown in the two ADD COLUMNS menus, there are basic word class codes available for all the lemmas in the database, in the form of numbers or labels. Then there is a wealth of subclassification information which supplements the basic word class codes: for each of the more important classes of lemma, there are a number of columns which indicate whether or not a certain lemma has a particular feature. For example, you can quickly see that the verb filch is transitive and the verb fizzle is not when you check the entries for these words in the Trans_V column. An explanation of the format used for the subclassification columns is given in section ��.

��� WORD CLASS CODES � LETTERS OR NUMBERS

There are two ways of representing the syntactic code of each lemma. You can choose for yourself whether to use numbers (Numeric codes) or shortened verbal codes (Labels). An adverb, for example, is represented by the number 7 or the letters ADV. No matter which type of code you decide to use, the information remains the same; only the format changes. Examples are provided in the table below.

Word class codes are a simple way of identifying the syntactic class of every lemma in the database. In addition, they also identify the word class of every wordform, since all the wordforms which make up the inflectional paradigm of any given lemma naturally share the same word class. So if you want a wordform lexicon which gives you the word class of any wordform, you have to use the lemma information columns which are available under Morphology of English Wordforms.

Twelve basic categories (set out in Table �� below) are distinguished, and you can use them in either of the two forms described here. Only the last two classifications given may need some explanation: they deal with contractions, those words which are made up of shortened forms of other words. For example, isn't is a contraction of is and not, and d'you is a contraction of do and you. Two types of contraction are identified. A simple contraction is a shortened form in isolation: 've and 're are both simple contractions. A complex contraction is one form which contains two words, one of which might be shortened: I've and couldn't are both complex contractions.


Word Class              Columns               Example
                        ClassNum    Class

Noun                    1           N         garrison
Adjective               2           A         biblical
Numeral                 3           NUM       twentieth
Verb                    4           V         chortle
Article                 5           ART       the
Pronoun                 6           PRON      mine
Adverb                  7           ADV       sheepishly
Preposition             8           PREP      through
Conjunction             9           C         whereas
Interjection            10          I         alleluia
Single contraction      11          SCON      're
Complex contraction     12          CCON      you're

Table �� English word class codes

If you want word class codes in the form of numbers, then choose the column which has the following FLEX name and description:

ClassNum

(ClassNumLemma)

Word class, numeric

If you want word class codes in the form of short verbal symbols, choose the column which has the following FLEX name and description:

Class

(ClassLemma)

Word class, labels
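Since the two columns carry the same information, converting between them is a simple lookup. The sketch below assumes the numbering 1 to 12 in the order of the table above, which is a reconstruction and should be checked against the database itself; `CLASS_LABELS`, `label_for` and `number_for` are hypothetical names.

```python
# Assumed numbering: 1..12 in the order of the word class table.
CLASS_LABELS = {
    1: "N", 2: "A", 3: "NUM", 4: "V", 5: "ART", 6: "PRON",
    7: "ADV", 8: "PREP", 9: "C", 10: "I", 11: "SCON", 12: "CCON",
}
CLASS_NUMBERS = {label: number for number, label in CLASS_LABELS.items()}


def label_for(class_num):
    """Convert a ClassNum value to the corresponding Class label."""
    return CLASS_LABELS[int(class_num)]


def number_for(label):
    """Convert a Class label to the corresponding ClassNum value."""
    return CLASS_NUMBERS[label]


print(label_for(7))
print(number_for("CCON"))
```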

��� SUBCLASSIFICATION � Y OR N

All the remaining syntactic columns come under a general heading of subclassifications. Each of these subclassification columns represents a particular syntactic attribute that a lemma might have. The values contained in each column can only be one of two characters: Y or N. Using these columns you can carry out quick checks on the qualities of any lemma. Is the conjunction whereas a coordinating conjunction? Look up its value in Cor_C and you discover that N, no, it is not. Is the noun brother a vocative form? Check its value in the column Voc_N, and you discover that Y, yes, it is (or rather, yes, it can be used in this way).


Of course it's possible to do more with these columns than simple checks on individual lemmas. You can use them to define the contents of your lexicon by constructing a restriction with one or more of them. If you want (say) a lexicon which only contains entries that can be ditransitive verbs, then you should first include the column Ditrans_V in your lexicon, then construct a restriction which states that Ditrans_V = Y. Moreover you can construct any number of restrictions in this way, to make the contents of your lexicon even more closely defined.

Remember, though, that these Y/N answers are not exclusive answers. The verb write can be both transitive (She writes very long letters) and intransitive (What does he do for a living? He writes.). So under the columns Trans V and Intrans V, write gets the code Y. It follows that if you want a list of verbs which are always intransitive, you must construct two restrictions which state first that Trans V = N and second that Intrans V = Y. To get the best from these subclassification codes, then, you should read the descriptions given below carefully, work out exactly how to get what you want using restrictions, and then use FLEX to build carefully planned lexicons.
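The way several restrictions combine can be sketched outside FLEX. The Python below is only an illustration, not FLEX itself: the rows are a hypothetical in-memory stand-in for a lexicon, the column names use an underscore in place of the space, and the codes for cycle in the Intrans V column and for modify in the Trans V column are assumptions made for the example.

```python
# Hypothetical lexicon rows; real FLEX lexicons are built interactively.
LEMMAS = [
    {"Lemma": "write",  "Trans_V": "Y", "Intrans_V": "Y"},
    {"Lemma": "alight", "Trans_V": "N", "Intrans_V": "Y"},
    {"Lemma": "modify", "Trans_V": "Y", "Intrans_V": "N"},
    {"Lemma": "cycle",  "Trans_V": "N", "Intrans_V": "Y"},
]


def apply_restrictions(rows, restrictions):
    """Keep only the rows that satisfy every 'column = value' restriction."""
    return [
        row
        for row in rows
        if all(row.get(column) == value for column, value in restrictions.items())
    ]


# Verbs that are always intransitive: Trans V = N and Intrans V = Y.
always_intransitive = apply_restrictions(
    LEMMAS, {"Trans_V": "N", "Intrans_V": "Y"}
)
print([row["Lemma"] for row in always_intransitive])  # ['alight', 'cycle']
```

A single restriction Intrans V = Y would also let write through, which is why the second restriction is needed.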

SUBCLASSIFICATION NOUNS

There are eleven possible attributes which a noun can have, so two ADD COLUMNS menus are needed to cover them all. (The v symbol and its counterpart in the bottom right-hand corner of either screen indicate that you should use the next and prev keys to move from one part of the menu to the other.)

ADD COLUMNS

Count
Uncount
Singular use
Plural use
Group Count
Group Uncount

TOP MENU
PREVIOUS MENU

v

ADD COLUMNS

Attributive
Postpositive
Vocative
Proper noun
Expression

TOP MENU
PREVIOUS MENU

What follows now is a concise description of each of these eleven attributes, along with the column names and descriptions used in FLEX.

The first column answers the question 'is this lemma a count noun?' Such nouns can be treated as individual units, and thus counted: you can talk about seventy-six trombones, but not *two musics. Thus the noun lemma trombone gets the code Y because it is a count noun, while the noun lemma music gets the code N because it isn't a count noun. And of course every lemma with a word class other than noun automatically gets the code N. The FLEX name and description of this column are as follows:

C N

(C NLemma)

For nouns, countable

The second column is a mirror image of the first: it answers the question 'is this lemma an uncountable noun?' Uncountable nouns are those nouns which are continuous entities; they are not individual units which can occur in both singular and plural forms. You cannot talk about a dozen handwritings, but it is quite permissible to talk about twelve pens. The lemma handwriting thus gets the code Y, while the lemma pen gets the code N. Another term used for uncountable noun is mass noun.

Note, however, that some nouns appear to be both countable and uncountable. The noun space is an uncountable noun when used in a phrase like Space: the Final Frontier, but countable when used in a phrase like The Netherlands: flat open spaces. This is one example of a lemma which gets a Y in the count and the uncount column.

Any noun lemma which can be uncountable gets the code Y in this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Unc N

(Unc NLemma)

For nouns, uncountable
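Because the count and uncount columns are independent Y/N answers, a lemma can fall into any of four combinations. The short Python sketch below only illustrates that reading of the two codes; the function name is hypothetical, and the sample codes are partly assumptions (the guide states only one of the two codes for trombone, music, pen and handwriting).

```python
def countability(c_n, unc_n):
    """Describe a lemma from its count (C N) and uncount (Unc N) codes."""
    if c_n == "Y" and unc_n == "Y":
        return "countable and uncountable"
    if c_n == "Y":
        return "countable only"
    if unc_n == "Y":
        return "uncountable only"
    return "neither (for example, a lemma that is not a noun)"


# (C N, Unc N) pairs; the second code of each pair is assumed except for space.
NOUNS = {
    "trombone": ("Y", "N"),
    "music": ("N", "Y"),
    "pen": ("Y", "N"),
    "handwriting": ("N", "Y"),
    "space": ("Y", "Y"),
}

for lemma, (c_n, unc_n) in NOUNS.items():
    print(f"{lemma}: {countability(c_n, unc_n)}")
```

A restriction on one column alone therefore never excludes lemmas like space; to isolate nouns that are only ever uncountable you would need both Unc N = Y and C N = N.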

The third column answers the question 'does this lemma ever occur in a singular-only form?' This refers to what are sometimes called singularia tantum. The lemma monopoly, when used in a phrase like they think they have a monopoly on the truth, can only be singular, and it therefore gets the code Y in this column. (Note however that in other circumstances, monopoly does occur in the plural: the phrase private or state monopolies, for example.) Any noun lemma that can occur in a singular-only form gets the code Y in this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Sing N

(Sing NLemma)

For nouns, singular use

The fourth column answers the question 'does this lemma ever occur in a plural-only form?' This refers to what are often called pluralia tantum. The noun lemma people occurs in election-time phrases like the people have spoken, where it means the general population of a country or region. It only occurs with a plural verb form, and therefore gets the code Y in this column. All lemmas which can never be plural-only nouns get the code N. The FLEX name and description for this column are as follows:

Plu N

(Plu NLemma)

For nouns, plural use

The fifth and sixth columns deal with collective nouns, those nouns which refer to groups of people or things (the lemma majority, for example). Usually, all such lemmas can take a plural or a singular form of a verb: the majority is in favour is as acceptable as the majority are in favour. However, only in certain cases does the inflectional paradigm of the lemma include a plural form. The lemma mankind, for example, can take a plural or a singular verb, but it has no plural form *mankinds. In contrast, the lemma crew can take a plural or a singular form of the verb and it has a plural form crews.

The fifth column answers the question 'is this lemma a collective noun that has a singular and a plural form?' The lemma government gets the code Y, because it's always possible to talk about a particular (say the French) government as well as several (say the European) governments. The lemma mankind gets the code N since it never has a plural form under any circumstances. Lemmas which are not collective nouns also get the code N. The FLEX name and description of this column are as follows:

GrC N

(GrC NLemma)

For nouns, group countable

The sixth column answers the question 'is this lemma a collective noun that only has a singular form, and not a plural form?' The lemma populace has the code Y because it is never possible to use the form *populaces, and for the same reason the lemma mankind also gets a Y code. The FLEX name and description of this column are as follows:

GrUnc N

(GrUnc NLemma)

For nouns, group uncountable

The seventh column answers the question 'can this lemma be used attributively?' This refers to nouns like machine, which occurs in phrases like machine translation: the first word says something about the word that follows it. So in this column, machine gets the code Y since it can be used attributively. A lemma like gadgetry, on the other hand, gets the code N, because it cannot be used attributively. The FLEX name and description of this column are as follows:

Attr N

(Attr NLemma)

For nouns, attributive

The eighth column answers the question 'can this lemma ever be used in a postpositive way?' Here postpositive refers to nouns which come after another noun and which qualify that other noun. An example is proof, which occurs in phrases like Bushmills whiskey is forty-two percent proof. The FLEX name and description of this column are as follows:

PostPos N

(PostPos NLemma)

For nouns, postpositive

The ninth column answers the question 'is this lemma used to address people or things?' This usually refers to nouns like chicken, which can be used when you are speaking directly to a hen (as in the well-known song chick chick chick chick chicken lay a little egg for me) or when you are speaking to someone you consider to be a coward (Chicken! Come back and fight!). The noun chicken thus gets the code Y because it can be used as a vocative, whereas a noun like daffodil gets the code N because normally (the more flowery poets excepted) it cannot be used as a vocative. The FLEX name and description of this column are as follows:

Voc N

(Voc NLemma)

For nouns, vocative

The tenth column answers the question 'is this lemma used as a proper noun?' Proper nouns are the names of people or places, so that lemmas such as Arthur or York get the code Y, while lemmas which aren't proper nouns get the code N. Most of the proper nouns in the database are included because they form a necessary part of the morphological analyses provided in the 'Morphology of English lemmas' columns. The FLEX name and description of this column are as follows:

Proper N

(Proper NLemma)

For nouns, proper noun

The eleventh and last column answers the question 'is this noun lemma only ever used in combination with certain other words to make up a particular phrase?' Examples are the word loggerheads, which only ever occurs in the phrase at loggerheads, and brunt, which only ever occurs in the phrase bear the brunt of. Such nouns get the code Y, while all other words get the code N. The FLEX name and description of this column are as follows:

Exp N

(Exp NLemma)

For nouns, expression

SUBCLASSIFICATION VERBS

There are nine verbal subclassification columns available, and they all take the form of Y/N answers to questions (as described in the section on Y or N subclassification above). They are presented to you in two ADD COLUMNS windows.

ADD COLUMNS

Transitive
Transitive plus complementation
Intransitive
Ditransitive
Linking verb
Phrasal

TOP MENU
PREVIOUS MENU

v

ADD COLUMNS

Prepositional
Phrasal prepositional
Expression

TOP MENU
PREVIOUS MENU

The first of the nine columns answers the question 'is this a verb which can (sometimes) take a direct object?' The verb lemma crash, for example, gets the code Y, because you can say things like he crashed the car. So does the verb admit, because you can say things like he admitted that he was wrong, where the direct object is a clause. The verb lemma cycle gets the code N because you can't say things like *he cycled the bike. The FLEX name and description of this column are as follows:

Trans V

(Trans VLemma)

For verbs, transitive


The second column answers the question 'is this a verb which has an object complement?' In a phrase like the jury found him guilty, there is a direct object (him) plus a complement which relates to that direct object (guilty). These object complements can take the form of a noun phrase (they had made him chairman) or an adjective phrase (so he thought it odd) or a prepositional phrase (when they threw him into jail) or an adverb phrase (and kept him there) or a clause (since it caused him to be embarrassed). Verbal lemmas which can take such complements get the code Y in this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

TransComp V

(TransComp VLemma)

For verbs, transitive plus complementation

The third verbal subclassification column answers the question 'is this a verb which can (sometimes) occur without a direct object?' The verb alight, for example, can never take a direct object: he got the bus and alighted at the City Hall. The verb leave can occur with or without a direct object: she left a will or she left at ten o'clock. Both verbs thus get the code Y in this column. The verb modify gets the code N, however, since it always takes a direct object: you cannot say *he modified, but you can say he modified his opinion on the matter. The FLEX name and description of this column are as follows:

Intrans V

(Intrans VLemma)

For verbs, intransitive

The fourth column answers the question 'is this a verb which can be ditransitive?' Here ditransitive refers to verbs which can take two objects: one direct object plus one indirect object. The verb envy, for example, gets the code Y in this column, since you can say he envied his colleagues their success. So does tell, because you can say things like she told him she would keep in touch. Verbs like dance, on the other hand, which cannot take two objects, and all non-verbal lemmas, get the code N. The FLEX name and description of this column are as follows:

Ditrans V

(Ditrans VLemma)

For verbs, ditransitive


The fifth verbal subclassification column answers the question 'is this lemma ever a linking verb?' The copula be is a linking verb: it links a subject (I) with a complement which describes that subject (a doctor) in a sentence like I am a doctor. These subject complements can take the form of a noun phrase (she is an intelligent woman), an adjective phrase (she looks worried), a prepositional phrase (she lives in Cork), an adverb phrase (how did she end up there?) or a clause (her main intention is to move somewhere else). Verbs which can have such subject complements get the code Y in this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Link V

(Link VLemma)

For verbs, linking verb

The sixth column answers the question 'is this lemma a phrasal verb?' A phrasal verb is a verb which is linked to a particular adverb, such as speak out or run away. Often these combinations acquire a specific, idiomatic meaning. Phrasal verbs in the database get the code Y if they are marked as such in volume one of the Oxford Dictionary of Current Idiomatic English. All other lemmas get the code N. The FLEX name and description of this column are as follows:

Phr V

(Phr VLemma)

For verbs, phrasal verb

The seventh column answers the question 'is this lemma a prepositional verb?' A prepositional verb is one which is linked to a particular preposition, such as minister to or consist of. Prepositional verbs in the database get the code Y if they are marked as such in the Oxford Dictionary of Current Idiomatic English. All other lemmas get the code N. The FLEX name and description of this column are as follows:

Prep V

(Prep VLemma)

For verbs, prepositional verb

The eighth column answers the question 'is this lemma a phrasal prepositional verb?' A phrasal prepositional verb is, naturally, one which is linked to a particular adverb and also to a particular preposition, such as walk away with or cry out against. Phrasal prepositional verbs in the database are current phrasal prepositional verbs: only those which are in use nowadays are given, on the basis of their inclusion in the Oxford Dictionary of Current Idiomatic English. All phrasal prepositional verbs get the code Y, and all other lemmas get the code N. The FLEX name and description of this column are as follows:

PhrPrep V

(PhrPrep VLemma)

For verbs, phrasal prepositional verb

The ninth and last column answers the question 'is this verb lemma only ever used in combination with certain other words to make up a particular phrase?' Examples are the verb toe, which only ever occurs in the phrase toe the line, and bell, which only ever occurs in the phrase bell the cat. Such verbs get the code Y, while all other words get the code N. The FLEX name and description of this column are as follows:

Exp V

(Exp VLemma)

For verbs, expression

SUBCLASSIFICATION ADJECTIVES

There are five attributes of adjectives covered in this version of the database, which are shown in the ADD COLUMNS window below. Each of these attributes or subclassifications has its own column in the database, and it always contains the code Y or N (this simple coding system is explained in the section on Y or N subclassification above). Each of the five attributes and their database columns are explained below.

ADD COLUMNS

Ordinary
Attributive
Predicative
Postpositive
Expression

TOP MENU
PREVIOUS MENU

The first adjective subclassification column is a simple one: it answers the question 'is this lemma an ordinary adjective?', where ordinary means that it can be used both attributively ('the new book') and predicatively ('the book is new'). Thus adjectives such as new and elementary get the code Y because they are ordinary adjectives, while actual and ablaze get the code N because you cannot say *'the reason is actual' or *'the ablaze house'. The FLEX name and description of this column are as follows:

Ord A

(Ord ALemma)

For adjectives, ordinary

The second column gives an answer to the question 'is this lemma an adjective which in some contexts can only be used attributively?' Here, attributive refers to adjectives which always come before the noun or phrase they qualify, as in sheer nonsense, where sheer cannot come after the noun it qualifies. The FLEX name and description of this column are as follows:

Attr A

(Attr ALemma)

For adjectives, attributive

The third column answers the question 'is this lemma an adjective which in some contexts can only be used predicatively?' Here predicatively refers to adjectives like awake, which can only qualify a noun when linked to it by a verb: the cat is awake, for example. The FLEX name and description of this column are as follows:

Pred A

(Pred ALemma)

For adjectives, predicative

The fourth column answers the question 'can this adjectival lemma ever be used in a postpositive way?' Here postpositive refers to lemmas like the adjective everlasting, which occurs in the phrase life everlasting: here it comes after the noun it modifies, whereas it is more normal in English for a modifier to come before the noun it qualifies (everlasting life is also acceptable). So everlasting gets the code Y in this column, while adjectives like durable, which never occur postpositively, get the code N.

PostPos A

(PostPos ALemma)

For adjectives, postpositive


The fifth and last column answers the question 'is this adjective lemma only ever used in combination with certain other words to make up a particular phrase?' Examples are the adjective bated, which only ever occurs in the phrase with bated breath, and the adjective put, which only ever occurs in the phrase stay put. Such adjectives get the code Y, while all other words get the code N. The FLEX name and description of this column are as follows:

Exp A

(Exp ALemma)

For adjectives, expression

SUBCLASSIFICATION ADVERBS

There are five adverb subclassification columns available, and they all take the form of Y/N answers to questions (as described in the section on Y or N subclassification above). They are presented to you in this ADD COLUMNS window.

ADD COLUMNS

Ordinary
Predicative
Postpositive
Combinatory adverb
Expression

TOP MENU
PREVIOUS MENU

The first of these five columns answers the question 'is this lemma an ordinary adverb?', where ordinary simply means that it doesn't necessarily have any special subclassification features: it's just an adverb which can modify a verb or an adjective. Examples include generously, which occurs in a phrase like people have given very generously or a generously illustrated book. The FLEX name and description of this column are as follows:

Ord ADV

(Ord ADVLemma)

For adverbs, ordinary


The second of the five adverb subclassification columns answers the question 'is this lemma an adverb which can usually only be used predicatively?' Adverbs which get the code Y here can be distinguished from ordinary adverbs on the basis of their predicative use with the verb be: it is possible to say the boat is adrift but not *the boat is quickly. Thus adrift is a predicative adverb, while quickly is not. Typically many predicative adverbs begin with the letter a: around, astern or awry, for example. Many don't, however: high, inland and downtown, amongst others. All predicative adverbs get the code Y in this column, and all other lemmas get the code N. The FLEX name and description of this column are as follows:

Pred ADV

(Pred ADVLemma)

For adverbs, predicative

The third adverb subclassification column answers the question 'is this lemma an adverb that can be used postpositively?' Here postpositive means 'after a noun', as with offside in the phrase several yards offside or apart in the phrase a race apart. Adverbs which can be used in this way get the code Y in this column; all other adverbs get the code N. The FLEX name and description for this column are as follows:

PostPos ADV

(PostPos ADVLemma)

For adverbs, postpositive

The fourth of the five adverb subclassification columns answers the question 'is this lemma an adverb which can be used in combination with a preposition or another adverb?' An example is the adverb clean: you can say the sledgehammer broke clean through the door, where an adverb combines with a preposition. Another example is the adverb all: you can say they left the dog all alone, where an adverb combines with another adverb. Adverbs which can combine in one of these two ways get the code Y; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Comb ADV

(Comb ADVLemma)

For adverbs, combinatory adverb


The last of the five adverb subclassification columns answers the question 'is this adverb lemma only ever used in combination with certain other words to make up a particular phrase?' Examples are the adverb amok, which only ever occurs in the phrase run amok, and the adverb screamingly, which only ever occurs in the phrase screamingly funny. Such adverbs get the code Y, while all other words get the code N. The FLEX name and description of this column are as follows:

Exp ADV

(Exp ADVLemma)

For adverbs, expression

SUBCLASSIFICATION NUMERALS

The numerals given in the database are sufficient to let you spell any number in full: it's possible to reconstruct the orthography of number ��� ����� (for example) using numerals from the database. As well as the numbers themselves, a few extra terms such as score or umpteenth are also given. There are two subclassification columns available for the numerals specified in the database, and they both take the form of Y/N answers to questions (as explained in the section on Y or N subclassification above). In addition, there is a column which identifies those numerals used in certain expressions. All three columns are presented to you in this ADD COLUMNS window.

ADD COLUMNS

Cardinal
Ordinal
Expression

TOP MENU
PREVIOUS MENU

The first numeral subclassification column answers the question 'is this lemma a cardinal number?' Cardinal numbers, like three or seven or twelve, are the most important forms of numbers, and they simply indicate quantity rather than rank order. Any lemma in the database which is a cardinal number gets the code Y in this column, and every other lemma gets the code N. The FLEX name and description of this column are as follows:

Card NUM

(Card NUMLemma)

For numerals, cardinal
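The introduction to this section notes that the cardinal numerals are enough to spell any number in full. As a rough sketch of how that could work, the Python below composes a spelling from a small word list. The word list is a hypothetical stand-in for the cardinal lemmas in the database, and the rules for hyphens and for 'and' are assumptions that follow common British usage rather than anything the database prescribes.

```python
# Hypothetical stand-ins for the cardinal numeral lemmas.
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {
    20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
    60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety",
}


def spell_below_thousand(n):
    """Spell a number from 1 to 999."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(UNITS[hundreds] + " hundred")
    if rest:
        if rest < 20:
            parts.append(UNITS[rest])
        else:
            tens, unit = divmod(rest, 10)
            parts.append(TENS[tens * 10] + ("-" + UNITS[unit] if unit else ""))
    return " and ".join(parts)


def spell(n):
    """Spell a whole number from 0 to 999,999 in full."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0 to 999,999")
    if n == 0:
        return UNITS[0]
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(spell_below_thousand(thousands) + " thousand")
    if rest:
        parts.append(spell_below_thousand(rest))
    # British usage inserts "and" before a final part below one hundred.
    joiner = " and " if thousands and 0 < rest < 100 else " "
    return joiner.join(parts)


print(spell(342))    # three hundred and forty-two
print(spell(64000))  # sixty-four thousand
```

Larger numbers would need further lemmas such as million, handled in the same way as thousand.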

The second numeral subclassification column answers the question 'is this lemma an ordinal number?' In contrast to cardinal numbers, ordinals, like third or seventh or twelfth, indicate quantity and rank order. Any lemma in the database which is an ordinal number gets the code Y in this column, and every other lemma gets the code N. The FLEX name and description of this column are as follows:

Ord NUM

(Ord NUMLemma)

For numerals, ordinal

The last numeral subclassification column answers the question 'is this numeral ever used in combination with certain other words to make up a particular phrase?' Examples are ninety-nine, which occurs in the phrase ninety-nine times out of a hundred, and sixty-four, which occurs in the phrase sixty-four thousand dollar question. Such numerals get the code Y, while all other words get the code N. The FLEX name and description of this column are as follows:

Exp NUM

(Exp NUMLemma)

For numerals, expression

SUBCLASSIFICATION PRONOUNS

There are a total of eight pronoun subclassification columns available, and they all take the form of Y/N answers to questions (as explained in the section on Y or N subclassification above). They are presented to you in these ADD COLUMNS windows.

ADD COLUMNS

Personal
Demonstrative
Possessive
Reflexive
Wh-pronoun
Determinative use

TOP MENU
PREVIOUS MENU

v

ADD COLUMNS

Pronominal use
Expression

TOP MENU
PREVIOUS MENU

The first of the eight pronoun subclassification columns answers the question 'is this lemma a personal pronoun?' Pronouns which refer directly to people or things are personal pronouns. This can include subject pronouns, such as the masculine third person singular form he, and object pronouns, such as the third person plural form them. These lemmas thus get the code Y; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Pers PRON

(Pers PRONLemma)

For pronouns, personal

The second of these columns answers the question 'is this lemma a demonstrative pronoun?' The lemmas this, that, these and those, as used in phrases like that dress or this sceptr'd isle, get the code Y under this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Dem PRON

(Dem PRONLemma)

For pronouns, demonstrative

The third column answers the question 'is this lemma a possessive pronoun?' Lemmas like her or hers (as in her decision is hers and hers alone, for example), which can indicate ownership of some sort, both get the code Y, while all other lemmas get the code N. The FLEX name and description of this column are as follows:

Poss PRON

(Poss PRONLemma)

For pronouns, possessive

The fourth pronoun subclassification column answers the question 'is this lemma a reflexive pronoun?' Lemmas like yourself in give yourself a holiday or ourselves in we saw it for ourselves get the code Y in this column; all other lemmas get the code N. The FLEX name and description for this column are as follows:

Refl PRON

(Refl PRONLemma)

For pronouns, reflexive

The fifth pronoun subclassification column answers the question 'is this lemma a wh-pronoun?' Such pronouns mostly begin with the letters wh, and can be used as relative pronouns (an expert is one who knows more and more about less and less) or as interrogative pronouns (who do you love?). All wh-pronouns, such as who, whither, whence, whosoever, that, howsoever and so on, get the code Y under this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Wh PRON

(Wh PRONLemma)

For pronouns, wh-pronoun

The sixth pronoun subclassification column answers the question 'can this pronoun lemma be used as a determiner?' A determiner helps to clarify which particular noun or noun phrase you are referring to. For example, you might talk about their cat in order to differentiate it from your cat or another cat. As the name suggests, such words help determine what you're talking about. Any lemma that can be used as a determiner gets the code Y in this column, and every other lemma gets the code N. The FLEX name and description of this column are as follows:

Det PRON

(Det PRONLemma)

For pronouns, determinative use

The seventh pronoun subclassification column answers the question 'can this pronoun lemma be used pronominally?', where pronominally means 'instead of a noun or noun phrase and independent of any other noun'. For example, other can be used in a phrase like choose the other, and mine in a phrase like you can't, it's mine or mine is better, so both these pronouns can be used pronominally. In contrast, my cannot be used pronominally: it can only occur in phrases like my book. All pronoun lemmas which can be used pronominally get the code Y in this column, and all other lemmas get the code N. The FLEX name and description of this column are as follows:

Pron PRON

(Pron PRONLemma)

For pronouns, pronominal use

The eighth and last pronoun subclassification column answers the question 'is this pronoun lemma only ever used in combination with certain other words to make up a particular phrase?' Examples of this are few, but one is aught, which occurs in the phrase for aught I know or for aught I care. Such pronouns get the code Y, while all other words get the code N. The FLEX name and description of this column are as follows:

Exp PRON

(Exp PRONLemma)

For pronouns, expression

SUBCLASSIFICATION CONJUNCTIONS

There are two conjunction subclassification columns available, and they both take the form of Y/N answers to questions (as described in the section on Y or N subclassification above). They are presented to you in this ADD COLUMNS window.

ADD COLUMNS

Coordinating
Subordinating

TOP MENU
PREVIOUS MENU

The first column answers the question 'is this lemma a co-ordinating conjunction?' A conjunction like and is a co-ordinating conjunction: it can link two (or more) clauses together in such a way that they remain of equal value to each other (or to put it another way, one clause of the new sentence is not subordinate to the other). The other co-ordinating conjunctions are but and or, and these three lemmas get the code Y under this column. All other lemmas get the code N. The FLEX name and description of this column are as follows:

Cor C

(Cor CLemma)

For conjunctions, coordinating

The second of the two conjunction subclassification columns answers the question 'is this conjunction a subordinating conjunction?' A conjunction like because is a subordinating conjunction: it can link two clauses together in such a way that one clause becomes dependent on the other, as in a sentence like he's reading a book because the television is broken. Here because acts as a subordinating conjunction: the link between the two clauses is more complex than the use of a co-ordinating conjunction like and would imply. All conjunction lemmas which are subordinating conjunctions get the code Y in this column; all other lemmas get the code N. The FLEX name and description of this column are as follows:

Sub C

(Sub CLemma)

For conjunctions, subordinating

ENGLISH FREQUENCY

The frequency information given in the database (that is, details of how often words occur in English) is available both for lemmas and wordforms. It is taken from the COBUILD corpus of the University of Birmingham, which in the early ���� version extracted for and corrected by CELEX contained about ���� million words, taken from written sources of many kinds, and some spoken sources as well.

The starting point for calculating frequency information is the COBUILD 17.9 million word corpus: a count is made of the number of times each string occurs. This task is easy for a computer, which can quickly make a count of all the words that appear in the corpus. The resulting figures are raw "string" counts; that is, they indicate how many times each separate group of letters occurs in the corpus, taking no account of the different meanings or word classes that can be applied to each group. You can see the remaining raw counts in a COBUILD corpus types lexicon when you select the Freq column. The string count of families, for example, is […], while for bank it is […]. To develop this basic string count into a more helpful word count, the strings must be identified either as wordforms which can be linked to a particular lemma, or as other things not represented in the database, such as personal names, foreign words, and words mistyped or misread by an Optical Character Recognition machine.
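The raw string count can be sketched in a few lines of Python. This is only an illustration: the sample sentence, the function name string_counts, and the lower-casing and letters-only tokenisation are assumptions made for the example, not the procedure actually used on the COBUILD texts.

```python
import re
from collections import Counter


def string_counts(text):
    """Count how often each letter string occurs, ignoring meaning and word class."""
    # Assumption: a "string" is a run of letters, compared case-insensitively.
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical miniature "corpus"; real counts come from the corpus itself.
sample = "The bank will bank on the families by the river bank."
counts = string_counts(sample)
print(counts["bank"], counts["families"])
```

Note that the count for bank lumps the noun and the verb together, which is exactly why a raw string count still has to be disambiguated.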

Sometimes this identification process is straightforward: the string millstones is only ever the plural wordform of the noun lemma millstone. So in this case the raw string frequency of the string millstones is also the frequency of the wordform millstones, and so in the wordform lexicon Cob column it gets the same frequency as the string.

Once you know the frequencies of the wordforms associated with a particular lemma, working out a frequency figure for the lemma as a whole is straightforward: all you have to do is add up the appropriate wordform frequencies. In this way the frequency of the noun lemma millstone is the frequency of the wordform millstones plus the frequency of the wordform millstone, and this total is the figure given in the lemma lexicon Cob column.
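As a small sketch of this summation, the following Python fragment adds up wordform frequencies per lemma. The frequencies shown are invented for illustration and the names WORDFORM_FREQ and lemma_frequency are hypothetical; they are not taken from the database.

```python
# Hypothetical wordform frequencies, keyed by (lemma, wordform).
WORDFORM_FREQ = {
    ("millstone", "millstone"): 30,
    ("millstone", "millstones"): 12,
}


def lemma_frequency(lemma, wordform_freq):
    """Sum the frequencies of every wordform that belongs to the given lemma."""
    return sum(freq for (lem, _form), freq in wordform_freq.items() if lem == lemma)


print(lemma_frequency("millstone", WORDFORM_FREQ))
```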

Many strings, however, are ambiguous: jumper, for instance, can be an item of clothing or someone who jumps. The only way to sort out the individual frequencies of such strings is to look at the way they are used in the corpus, a process known as disambiguation. It's possible to carry out this task quickly by computer program, but at present the results of such programs can never be wholly accurate. For this reason, CELEX chose to disambiguate by hand, which means that someone reads each occurrence of each ambiguous form in the corpus, and notes the lemma to which it belongs. While such an approach is both costly and time-consuming, it does produce results which are more dependable and accurate. For jumper, it seems that […] of the occurrences mean item of clothing, and […] mean someone who jumps. These are the two figures given in the wordform lexicon Cob column for the two different jumper wordforms. Sometimes not all occurrences refer to wordforms in the database. Some may be proper nouns (surnames, for example) or typing errors, and some simply can't be disambiguated. For example, jumper occurs […] times in relation to a person's name. Such information is not given in the database since it doesn't relate directly to any of the lemmas or wordforms available.

Again, once the wordform frequencies have been clarified, working out the lemma frequencies is straightforward. For the two lemmas with the form jumper, the lemma frequencies are […] (meaning clothing) and […] (meaning someone who jumps), giving a total of […]. These lemma frequency figures are given in the lemma lexicon Cob column, and in the same column to be found with the "lemma information" given for wordforms.

When strings occur very frequently in the corpus, the work required to disambiguate each case by hand can be daunting. It may also be unnecessary, since an intelligent estimate coupled with an indication of how far that estimate is accurate should usually be enough. So, whenever ambiguous words occur more than 100 times in the corpus, not all the occurrences are checked individually. Instead, one hundred occurrences of the string are taken at random from the corpus and then analysed. In this way it's possible to formulate a ratio which indicates the proportions of the various interpretations, and this ratio can then be applied to the real figures to see an estimate of how the fully disambiguated figures would look.

As an example, take the string bank. Its basic corpus string frequency is […]. It can either be a singular noun, or an instance of a verb: the first or second person singular form, the plural form, or the infinitive. Here is a lexicon which shows these wordforms with their word class and frequency:

Word    Class   Cob
bank    N       […]
bank    V       […]
bank    V       […]
bank    V       […]
bank    V       […]

To calculate these figures, 100 occurrences of the string bank were taken from the corpus and disambiguated by hand. It turned out that […] of the occurrences belonged to the verbal lemma, and […] to the noun lemma. So to estimate the real frequency of the wordform belonging to the noun lemma, divide the number of times it occurred in the sample by the total number of successfully disambiguated forms, and then multiply the result by the original string frequency: […]. Repeating this procedure gives […] for the verb lemma. This latter figure is divided equally among the four possible wordforms: […] each for the first person singular form, the second person singular form, the plural form and the infinitive. This is the usual way of sorting out ambiguous verbal flections, since disambiguating every verbal form by hand is a task which would involve a great deal of work yielding results of interest to only a few.
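The estimation procedure can be sketched as follows. The string frequency of 2000 and the 95/5 split are invented numbers, the function names are hypothetical, and rounding to the nearest whole number is an assumption; the guide does not say how fractions are handled.

```python
def estimate_from_sample(string_freq, sample_tallies):
    """Scale hand-checked sample tallies up to the full string frequency."""
    # Only successfully disambiguated sample items count towards the total.
    disambiguated = sum(sample_tallies.values())
    if disambiguated == 0:
        raise ValueError("no occurrences could be disambiguated")
    return {
        lemma: round(string_freq * hits / disambiguated)
        for lemma, hits in sample_tallies.items()
    }


def split_equally(lemma_freq, wordforms):
    """Share one lemma estimate equally among formally identical wordforms."""
    share = round(lemma_freq / len(wordforms))
    return {form: share for form in wordforms}


# Hypothetical figures: a string occurring 2000 times, 100 sampled by hand.
estimates = estimate_from_sample(2000, {"bank (N)": 95, "bank (V)": 5})
verb_forms = split_equally(
    estimates["bank (V)"], ["1st sing", "2nd sing", "plural", "infinitive"]
)
print(estimates, verb_forms)
```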

For most items in the database, the frequency figures are accurate. However, when estimates have to be made on the basis of a hundred examples, deviation figures have to be calculated, to let you see just how accurate the estimates are. This formula gives the required deviation figure:

$$N \times 1.96 \times \sqrt{\frac{p(1-p)}{n} \times \frac{N-n}{N-1}}$$

where N is the frequency of the string as a whole, n is the number of items which could be disambiguated in the random 100-item sample, and p is the ratio figure for the item when it belongs to one particular lemma. Thus for the noun wordform bank, N is […], n is […], and p is […]. The formula gives […] as the deviation. This means that the true frequency for this form of bank is almost certain (at least 95% certain) to lie between […] and […].

Word    ClassLemma   Cob     CobDev
bank    N            […]     […]
bank    V            […]     […]
bank    V            […]     […]
bank    V            […]     […]
bank    V            […]     […]

Whenever the deviation is greater than the frequency itself, then you know for sure that some sort of arbitrary approximation has been carried out. This happened for the verbal forms of bank, as you can see in the table above.
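A deviation figure of this kind can be computed with a short function. The sketch below assumes the formula reads N × 1.96 × sqrt(p(1 − p)/n × (N − n)/(N − 1)), where 1.96 is the usual multiplier for 95% confidence and the second factor is a finite-population correction. The input values are hypothetical and the function name deviation is not a CELEX or FLEX name.

```python
import math


def deviation(N, n, p, z=1.96):
    """Deviation for an estimated frequency.

    N: frequency of the string as a whole
    n: number of sample items that could be disambiguated
    p: proportion of the sample assigned to the lemma in question
    z: confidence multiplier (1.96 is assumed here, for 95% confidence)
    """
    if N <= 1 or n <= 0:
        return 0.0
    variance = (p * (1.0 - p) / n) * ((N - n) / (N - 1))
    # Guard against a tiny negative value when n exceeds N.
    return N * z * math.sqrt(max(variance, 0.0))


# Hypothetical case: string frequency 2000, sample of 100, 95 hits for one lemma.
estimate = 2000 * 0.95
dev = deviation(2000, 100, 0.95)
print(round(estimate - dev), round(estimate + dev))
```

With these invented inputs the deviation comes out at roughly 83, so the estimate of 1900 would be read as lying somewhere between about 1817 and 1983. When every occurrence has been checked (n equals N), the correction term is zero and so is the deviation.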

Working out deviation figures for a lemma involves adding together the frequencies of its disambiguated wordforms. And once again, whenever the resulting deviation figure is equal to or greater than the frequency itself, you know that some arbitrary "disambiguation" has been necessary.

One final point to note here is that some frequency information is available with the orthographic columns. This relates directly to the different spellings that wordforms or headwords can have. It does not affect the frequency information given here, which deals with each form as a whole regardless of how it can be spelt. For instance, realize can also be spelt realise, and the lemma frequency given alongside each of them is the same, […]. The spelling frequency on the other hand shows that the spelling realize occurs […] times while the spelling realise occurs […]. For more details about this extra layer of disambiguation, read the appropriate subsection under "English Orthography".

FREQUENCY INFORMATION FOR LEMMAS AND WORDFORMS

Now that the background details have been explained, the individual column names and descriptions can be formally defined. For both lemmas and wordforms, there are four columns available which express the COBUILD frequency figures in various ways.

The first column gives the plain COBUILD frequency count for each lemma or wordform. The figure given in the lemma version of the column for collate is […], which means that out of the 17,900,000 words in the corpus, […] are the word collate in some form or other. The figures given in the wordform version of this column reveal how frequently each of the possible forms occurs: for collate the figure is […]; for collates it is […]. There are also […] occurrences of collated, and […] occurrences of collating. The FLEX name and description of this column are as follows:

Cob

(CobLemma)

COBUILD frequency

The second column indicates how accurate the frequencies in the previous column are by providing a deviation figure for each lemma or wordform, calculated according to the methods described in the previous section. If a word has been fully disambiguated without the need for any estimates, the figure is 0. When some estimation has been required, the figure will be greater than zero. If the figure should ever be equal to or greater than the frequency it qualifies, then you know that full disambiguation was not possible. The figure given for the verb lemma shine (in the sense of "be bright" or "direct the light") is […], and when you use it in conjunction with the COBUILD frequency figure of […], it indicates that you can be almost certain (95% certain) that shine occurs in one form or another somewhere between […] and […] times. The FLEX name and description of this column are as follows:

CobDev

(CobDevLemma)

COBUILD frequency deviation

The next column contains the same frequency figures as the first column, except that they have been scaled down to a range of 1 to 1,000,000 instead of the usual 1 to 17,900,000. This is done by dividing the normal COBUILD frequency for each word by the number of words in the whole corpus, and then multiplying the answer by 1,000,000. The end result is a set of figures which are probably easier to understand: it makes greater sense to say that the word magnificent is twenty in a million than it does to say that it's […] words out of 17,900,000. And since other well-known text corpora, such as the Lancaster-Oslo/Bergen (LOB) and Brown corpora of English, are also based on a count of one million, this scale provides the opportunity for interesting comparisons to be made. However, as you might expect, some detail is lost in the scaling-down process: the words barbecue and babysitter, which have the 17.9 million word lemma frequencies of […] and […] respectively, both share the same 1 million word frequency of […].

CobMln

(CobMlnLemma)

COBUILD frequency (1,000,000)
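The scaling to a per-million figure can be sketched like this. The corpus size of 17,900,000 is the approximate figure for the full corpus; the function name per_million and the choice to round to the nearest whole number are assumptions, since the guide does not state whether values are rounded or truncated.

```python
FULL_CORPUS_SIZE = 17_900_000  # approximate size of the full COBUILD corpus


def per_million(freq, corpus_size=FULL_CORPUS_SIZE):
    """Scale a raw corpus frequency to occurrences per million words."""
    # Assumption: the scaled value is rounded to the nearest whole number.
    return round(freq * 1_000_000 / corpus_size)


# Two different (hypothetical) raw frequencies collapse to the same scaled value.
print(per_million(30), per_million(40))
```

The same function covers the written and spoken figures if corpus_size is set to the size of the written or spoken part of the corpus instead.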

For those whose work requires a further transformation of the figures (psycholinguists working with stimulus response times, for example), a column containing logarithmic values is available. The effect of the logarithmic scale is to emphasize the importance of lower frequency words in a way that the usual linear scale does not. For example, the difference between two words, one of frequency […] and the other of frequency […], becomes much greater than the difference between two words of frequency […] and […]. (For the first pair of words, the difference is […], while for the second pair the difference is a mere […].) This confirms mathematically what we know intuitively: because there are so many words with a low frequency, the differences between them are that much more significant. With a high frequency word, a difference of one or two isn't very significant.

The values given are the base 10 logarithms of each COBUILD frequency (1,000,000) described above. In place of a scale from 1 to 1,000,000, the resulting logarithmic values in this column range from zero (log 1) to 6 (log 1,000,000). And when a word has a normal frequency of zero, the logarithmic value is also given as zero. This is mathematically inaccurate (the logarithm of zero doesn't exist) but, at least in this context, relatively unimportant: any word with a logarithmic frequency of 0 occurs at the very most only […] times in the full COBUILD 17.9 million word corpus. The thing to remember is that only words which have a COBUILD 1,000,000 frequency value of two or more (or, if you prefer, only words which occur […] or more times in the COBUILD corpus) have a logarithmic value greater than zero.

CobLog

(CobLogLemma)

COBUILD frequency, logarithmic
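The logarithmic column, including its convention for zero, can be sketched as below. The function name cob_log is hypothetical, and the number of decimal places actually stored in the database is not reproduced here.

```python
import math


def cob_log(per_million_freq):
    """Base 10 logarithm of a per-million frequency, with zero mapped to zero."""
    if per_million_freq <= 0:
        # log10(0) is undefined; zero is returned by convention.
        return 0.0
    return math.log10(per_million_freq)


for value in (0, 1, 2, 1_000_000):
    print(value, cob_log(value))
```

Both 0 and 1 per million map to a logarithmic value of zero, which is why only per-million values of two or more produce a value greater than zero.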

FREQUENCY INFORMATION FROM WRITTEN AND SPOKEN SOURCES

About 16,600,000 words in the COBUILD corpus make up written texts, and the remaining 1,300,000 words make up spoken texts. In a sense, then, there are two other corpora you can use: one which deals with written texts only and one with spoken texts only. You can choose for yourself whether you wish to use either written or spoken figures in place of the full figures explained in the preceding sections. The methods used in working out the figures given are the same as those described in the previous section.

The columns available for written and spoken corpus frequencies are roughly the same as those for the full corpus, with the exception of the deviation figures: they are not re-calculated for the written and spoken texts. Instead, you can use the figures given for the full corpus, though remember that when you apply them to frequencies for the written and spoken corpora, the range of error is actually larger than it would otherwise be.

WRITTEN CORPUS INFORMATION

There are three columns which contain frequency information for the written sources in the COBUILD corpus. The first gives the plain written frequency count. The figure given in the lemma version of the column for memory is […], which means that out of the 16,600,000 words in the written corpus, […] are the word memory in some form or other. The figures given in the wordform version of this column reveal how frequently each of the possible forms occurs: for memory the figure is […], and for memories it is […]. The FLEX name and description of this column are as follows:

CobW

(CobWLemma)

COBUILD written frequency 16.6m

The next column contains the same frequency figures as CobW, except that they have been scaled down to a range of 1 to 1,000,000 instead of the usual 1 to 16,600,000. This is done by dividing the normal COBUILD written frequency for each word by the number of words in the written corpus (about 16,600,000), and then multiplying the answer by 1,000,000. The end result is a set of figures which are probably easier to understand: it makes greater sense to say that a word is one in a million than it does to say that it's […] words out of 16,600,000. However, as you might expect, some detail is lost in the scaling-down process: all words which have 16.6 million word lemma frequencies of between […] and […] share the same 1 million word frequency of […].

CobWMln

(CobWMlnLemma)

COBUILD written frequency (1,000,000)

The third and last written corpus column contains the base 10 logarithms of each CobWMln frequency, for the reasons described above in connection with the full corpus. In place of a scale from 1 to 1,000,000, then, the resulting logarithmic values in this column range from zero (log 1) to 6 (log 1,000,000). And when a word has a normal frequency of zero, the logarithmic value is also given as zero. This is mathematically inaccurate (the logarithm of zero doesn't exist) but, at least in this context, relatively unimportant: any word with a logarithmic frequency of 0 occurs at the very most only […] times in the COBUILD 16.6 million written word corpus. The thing to remember is that only words which have a CobWMln frequency value of two or more (or, if you prefer, only words which occur […] or more times in the COBUILD corpus) have a logarithmic value greater than zero.

CobWLog

(CobWLogLemma)

COBUILD written frequency, logarithmic

SPOKEN CORPUS INFORMATION

There are three columns which contain frequency information for the spoken sources in the COBUILD corpus. The first gives the plain spoken frequency count. The figure given in the lemma version of the column for memory is […], which means that out of the approximately 1,300,000 words in the spoken corpus, […] are the word memory in some form or other. The figures given in the wordform version of this column reveal how frequently each of the possible forms occurs: for memory the figure is […], and for memories it is […]. The FLEX name and description of this column are as follows:

CobS

(CobSLemma)

COBUILD spoken frequency 1.3m

The next column contains the same frequency figures as CobS, except that they have been scaled down to a range of 1 to 1,000,000 instead of the usual 1 to 1,300,000. This is done by dividing the normal COBUILD spoken frequency for each word by the number of words in the spoken corpus, and then multiplying the answer by 1,000,000.

CobSMln

(CobSMlnLemma)

COBUILD spoken frequency (1,000,000)

The third and last spoken corpus column contains the base 10 logarithms of each CobSMln frequency, for the reasons described above in connection with the full corpus. In place of a scale from 1 to 1,000,000, the resulting logarithmic values in this column range from zero (log 1) to 6 (log 1,000,000). And when a word has a normal frequency of zero, the logarithmic value is also given as zero. This is mathematically inaccurate (the logarithm of zero doesn't exist) but, at least in this context, relatively unimportant: any word with a logarithmic frequency of 0 occurs at the very most only once in the COBUILD 1.3 million spoken word corpus, and is consequently only of interest to those concerned with the more esoteric branches of lexicography. The thing to remember is that only words which have a CobSMln frequency value of two or more (or, if you prefer, only words which occur two or more times in the COBUILD spoken corpus) have a logarithmic value greater than zero.

CobSLog

(CobSLogLemma)

COBUILD spoken frequency, logarithmic

FREQUENCY INFORMATION FOR COBUILD CORPUS TYPES

The frequency information given in COBUILD corpus types lexicons consists of the raw string counts from which all the other frequency figures for lemmas, wordforms and abbreviations are derived. Also available are figures for the spoken and written texts in the corpus, as well as for British English and American English types which are not to be found amongst the wordforms and abbreviations given in the CELEX database. If you are not already familiar with the terms token and type, then check the glossary and the first part of the manual, the Introduction, in the section "Lexicon types".

The first column simply lists the orthographic forms of all types as they occur in the COBUILD corpus. The FLEX name and description of this column are as follows:

Type      Graphemic transcription

The second column is the basic "string" count which tells you how many times each type occurs in the COBUILD corpus, which contains about 17,900,000 tokens. The FLEX name and description of this column are as follows:

Freq      Absolute frequency

FREQUENCY INFORMATION FOR COBUILD WRITTEN CORPUS TYPES

There are four columns which contain raw string counts from the written texts in the COBUILD corpus. The first contains the frequencies of all types which occur more than once in all the written texts:

FreqW     COBUILD written frequency, 16.6m

The next column contains raw string counts from the written texts that are normal British usages, as opposed to American English or some other brand of English:

FreqWB    COBUILD written frequency, British English

The third column contains raw string counts from the written texts that are normal American usages, as opposed to British English or some other brand of English:

FreqWA    COBUILD written frequency, American English

The fourth column contains raw string counts from the written texts that are not normal British or American usages, but are instead from an unidentified brand of English:

FreqWU    COBUILD written frequency, undetermined origin

FREQUENCY INFORMATION FOR COBUILD SPOKEN CORPUS TYPES

There are three columns which contain raw string counts from the spoken texts in the COBUILD corpus. About 1.3 million words were transcribed from recorded conversations and included in the corpus. None of the conversations transcribed involved American English, so no separate figures for American English spoken types are available.

The first column contains the frequencies of all types which occur more than once in the spoken texts:

FreqS     COBUILD spoken frequency, 1.3m

The next column contains raw string counts from the spoken texts that are marked as normal British usages:

FreqSB    COBUILD spoken frequency, British English

The third column contains raw string counts from the spoken texts that are not normal British usages, but are instead from an unidentified brand of English:

FreqSU    COBUILD spoken frequency, undetermined origin

TREE DIAGRAMS AND COLUMN DESCRIPTIONS

This appendix is divided into sections corresponding to the lexicon types currently available. Each section begins with a set of tree diagrams which give you an overview of the columns you can choose when you select a particular type of lexicon, and then there are technical details about each of those columns: the type of the column, its minimum and maximum values and lengths, the number of null values it contains, and the characters used in each column. These details are particularly useful when you export a file from FLEX.

Whenever a new version of the database is released, the corresponding section in this appendix will also be replaced with the relevant diagrams and technical details. Always remember to check the name and lexicon number when you're using this appendix: you can see which lexicon type and version you are dealing with by reading the title of each diagram or the line at the top of each right-hand page.

ORTHOGRAPHY OF ENGLISH LEMMAS (E[…])

Number of spellings                                  OrthoCnt
Spelling number (1-N)                                OrthoNum
Language code                                        OrthoStatus
Frequency of spelling
    COBUILD frequency 17.9m                          CobSpellFreq
    COBUILD 95% confidence deviation 17.9m           CobSpellDev
Spelling
    Plain
        Without diacritics                           Head
        Without diacritics, reversed                 HeadRev
        With diacritics                              HeadDia
        Purely lowercase alphabetical                HeadLow
        Purely lowercase alphabetical, sorted        HeadLowSort
        Number of letters                            HeadCnt
    Syllabified
        Without diacritics                           HeadSyl
        With diacritics                              HeadSylDia
        Number of syllables                          HeadSylCnt

PHONOLOGY OF ENGLISH LEMMAS (E[…])

Number of pronunciations                             PronCnt
Pronunciation number (1-N)                           PronNum
Status of pronunciation                              PronStatus
Phonetic transcriptions
    Plain
        SAM-PA char set                              PhonSAM
        CELEX char set                               PhonCLX
        CPA char set                                 PhonCPA
        DISC char set                                PhonDISC
        Number of phonemes                           PhonCnt
    Syllabified
        SAM-PA char set                              PhonSylSAM
        CELEX char set                               PhonSylCLX
        CELEX char set, brackets                     PhonSylBCLX
        CPA char set                                 PhonSylCPA
        DISC char set                                PhonSylDISC
        Number of syllables                          SylCnt
    Syllabified, with stress
        SAM-PA char set                              PhonStrsSAM
        CELEX char set                               PhonStrsCLX
        CPA char set                                 PhonStrsCPA
        DISC char set                                PhonStrsDISC
Phonetic patterns
    Stress pattern                                   StrsPat
    Syllabified
        CV pattern                                   PhonCV
        CV pattern, brackets                         PhonCVBr

MORPHOLOGY OF ENGLISH LEMMAS (E[…])

Status                                               MorphStatus
Language information                                 Lang
Derivational/compositional information
    Number of morphological analyses                 MorphCnt
    Analysis number (1-N)                            MorphNum
    Status of morphological analyses
        Noun-verb-affix compound                     NVAffComp
        Derivation method                            Der
        Compound method                              Comp
        Deriv. compound method                       DerComp
        Default analysis                             Def
    Segmentations
        Immediate segmentation
            Stems and affixes                        Imm
            Class labels                             ImmClass
            Class and verb subcat labels             ImmSubCat
            Stem-affix labels                        ImmSA
            Stem allomorphy                          ImmAllo
            Affix substitution                       ImmSubst
            Opacity                                  ImmOpac
            Derivational transformation              TransDer
            Infixation                               ImmInfix
            Reversion                                ImmRevers
        Complete segmentation (flat)
            Stems and affixes                        Flat
            Class labels                             FlatClass
            Stem-affix labels                        FlatSA
        Complete segmentation (hierarchical)
            Stems and affixes                        Struc
            Stems and affixes, labelled              StrucLab
            Empty brackets, labelled                 StrucBrackLab
            Stem allomorphy                          StrucAllo
            Affix substitution                       StrucSubst
            Opacity                                  StrucOpac
    Other
        Number of components                         CompCnt
        Number of morphemes                          MorCnt
        Number of levels                             LevelCnt

SYNTAX OF ENGLISH LEMMAS (NOUNS, VERBS) (E[…])

Word class
    Numeric codes                                    ClassNum
    Labels                                           Class
Subclassification nouns
    Count                                            C_N
    Uncount                                          Unc_N
    Singular use                                     Sing_N
    Plural use                                       Plu_N
    Group count                                      GrC_N
    Group uncount                                    GrUnC_N
    Attributive                                      Attr_N
    Postpositive                                     PostPos_N
    Vocative                                         Voc_N
    Proper noun                                      Proper_N
    Expression                                       Exp_N
Subclassification verbs
    Transitive                                       Trans_V
    Transitive and complementation                   TransComp_V
    Intransitive                                     Intrans_V
    Ditransitive                                     Ditrans_V
    Linking verb                                     Link_V
    Phrasal                                          Phr_V
    Prepositional                                    Prep_V
    Phrasal prepositional                            PhrPrep_V
    Expression                                       Exp_V

SYNTAX OF ENGLISH LEMMAS (ADJECTIVES, ADVERBS, NUMERALS, PRONOUNS, CONJUNCTIONS) (E[…])

Subclassification adjectives
    Ordinary                                         Ord_A
    Attributive                                      Attr_A
    Predicative                                      Pred_A
    Postpositive                                     PostPos_A
    Group uncount                                    GrUnc_A
    Expression                                       Exp_A
Subclassification adverbs
    Ordinary                                         Ord_ADV
    Predicative                                      Pred_ADV
    Postpositive                                     PostPos_ADV
    Combinatory                                      Comb_ADV
    Expression                                       Exp_ADV
Subclassification numerals
    Cardinal                                         Card_NUM
    Ordinal                                          Ord_NUM
    Expression                                       Exp_NUM
Subclassification pronouns
    Personal                                         Pers_PRON
    Demonstrative                                    Dem_PRON
    Possessive                                       Poss_PRON
    Reflexive                                        Refl_PRON
    Wh-pronoun                                       Wh_PRON
    Determinative use                                Det_PRON
    Pronominal use                                   Pron_PRON
    Expression                                       Exp_PRON
Subclassification conjunctions
    Coordinating                                     Cor_C
    Subordinating                                    Sub_C

FREQUENCY OF ENGLISH LEMMAS (E[…])

COBUILD all sources
    COBUILD frequency 17.9m                          Cob
    COBUILD 95% confidence deviation 17.9m           CobDev
    COBUILD frequency 1m                             CobMln
    COBUILD frequency, logarithmic                   CobLog
COBUILD written sources
    COBUILD written frequency 16.6m                  CobW
    COBUILD written frequency 1m                     CobWMln
    COBUILD written frequency, logarithmic           CobWLog
COBUILD spoken sources
    COBUILD spoken frequency 1.3m                    CobS
    COBUILD spoken frequency 1m                      CobSMln
    COBUILD spoken frequency, logarithmic            CobSLog

Attr_A        For adjectives: attributive
Type: character          Null values: 0
Minimum value: N         Minimum length: 1
Maximum value: Y         Maximum length: 1
Characters: N Y

Attr_N        For nouns: attributive
Type: character          Null values: 0
Minimum value: N         Minimum length: 1
Maximum value: Y         Maximum length: 1
Characters: N Y

C_N           For nouns: countable
Type: character          Null values: 0
Minimum value: N         Minimum length: 1
Maximum value: Y         Maximum length: 1
Characters: N Y

Card_NUM      For numerals: cardinal
Type: character          Null values: 0
Minimum value: N         Minimum length: 1
Maximum value: Y         Maximum length: 1
Characters: N Y

Class         Word class, labels
Type: character          Null values: 0
Minimum value: A         Minimum length: 1
Maximum value: V         Maximum length: […]
Characters: A C D E I M N O P R S T U V

Column descriptions for English Lemmas (E[…])

ClassNum      Word class, numeric
Type: numeric            Null values: 0
Minimum value: […]       Minimum length: […]
Maximum value: […]       Maximum length: […]
Characters: 0 1 2 3 4 5 6 7 8 9

Cob           COBUILD frequency
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: 6
Characters: 0 1 2 3 4 5 6 7 8 9

CobDev        COBUILD frequency deviation
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: 6
Characters: 0 1 2 3 4 5 6 7 8 9

CobLog        COBUILD frequency, logarithmic
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: 6
Characters: . 0 1 2 3 4 5 6 7 8 9

CobMln        COBUILD frequency (1,000,000)
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: […]
Characters: 0 1 2 3 4 5 6 7 8 9

CobS          COBUILD spoken frequency 1.3m
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: […]
Characters: 0 1 2 3 4 5 6 7 8 9

CobSLog       COBUILD spoken frequency, logarithmic
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: 6
Characters: . 0 1 2 3 4 5 6 7 8 9

CobSMln       COBUILD spoken frequency (1,000,000)
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: […]
Characters: 0 1 2 3 4 5 6 7 8 9

CobSpellDev   COBUILD spelling frequency deviation
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: 6
Characters: 0 1 2 3 4 5 6 7 8 9

CobSpellFreq  COBUILD spelling frequency
Type: numeric            Null values: 0
Minimum value: 0         Minimum length: 1
Maximum value: […]       Maximum length: 6
Characters: 0 1 2 3 4 5 6 7 8 9


CobW COBUILD written frequency �"�"m

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ��"��� Maximum length� "

Characters� � � � � � " � � �

CobWLog COBUILD written frequency� logarithmic

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����"� Maximum length� "

Characters� � � � � � � " � � �

CobWMln COBUILD written frequency �� � �

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �

Comb ADV For adverbs� combinatory

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Comp Compound analysis

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y
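
Each column description in this appendix lists a type, a minimum and maximum length, and the set of characters that occur in the column. As a rough illustration, those properties can be turned into a simple check on a field value. The sketch below is not part of CELEX: `ColumnSpec` and `check_value` are hypothetical names, and the `COMP` spec assumes the flag is exactly one character long, as its N/Y character set suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """One column description (hypothetical helper, not a CELEX structure)."""
    name: str
    kind: str         # "numeric" or "character"
    min_length: int
    max_length: int
    characters: str   # every character the column may contain


def check_value(spec: ColumnSpec, value: str) -> list[str]:
    """Return a list of problems; an empty list means the value fits the spec."""
    problems = []
    if not spec.min_length <= len(value) <= spec.max_length:
        problems.append(
            f"{spec.name}: length {len(value)} outside "
            f"{spec.min_length}..{spec.max_length}"
        )
    unexpected = sorted(set(value) - set(spec.characters))
    if unexpected:
        problems.append(f"{spec.name}: unexpected characters {unexpected}")
    if spec.kind == "numeric":
        try:
            float(value)
        except ValueError:
            problems.append(f"{spec.name}: not a number")
    return problems


# Assumed spec for a Y/N flag such as Comp: one character, N or Y.
COMP = ColumnSpec("Comp", "character", 1, 1, "NY")

print(check_value(COMP, "Y"))
print(check_value(COMP, "yes"))
```

The same helper can be reused for any other column by copying the values from its description.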


CompCnt Number of morphological components

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � �

Cor C For conjunctions� coordinating

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Dem PRON For pronouns� demonstrative

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Der Derivation analysis

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

DerComp Derivational compound analysis

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


Det PRON For pronouns� determinative use

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Ditrans V For verbs� ditransitive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Exp A For adjectives� expression

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Exp ADV For adverbs� expression

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Exp N For nouns� expression

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


Exp NUM For numerals� expression

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� N Maximum length� �

Characters� N

Exp PRON For pronouns� expression

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� N Maximum length� �

Characters� N

Exp V For verbs� expression

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Flat Flat segmentation

Type� character Null values� � � �

Minimum value� April Minimum length� �

Maximum value� zoom Maximum length� ��

Characters� # � � A B D E F G H I J L M O P Q S T U V W ab c d e f g h i j k l m n o p q r s t u v w xy z


FlatClass Flat segmentation� word class labels

Type� character Null values� � � �

Minimum value� A Minimum length� �

Maximum value� xxxx Maximum length� "

Characters� A B C D I N O P Q S T V x

FlatSA Flat segmentation, stem/affix labels

Type� character Null values� � � �

Minimum value� AA Minimum length� �

Maximum value� SSSA Maximum length� "

Characters� A F S

GrC N For nouns� group countable

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

GrUnc N For nouns� group uncountable

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


Head Headword

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� zoophyte Maximum length� �

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z

HeadCnt Headword� number of letters

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � � �

HeadDia Headword� diacritics

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� $ep$ee Maximum length� �

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z %a �a &c %e $e �e '( �n �o 'u

HeadLow Headword� lowercase� alphabetical

Type� character Null values�

Minimum value� a Minimum length� �

Maximum value� zoophyte Maximum length� �

Characters� a b c d e f g h i j k l m n o p q r s t u v wx y z


HeadLowSort Headword� lowercase� alphabetical� sorted

Type� character Null values�

Minimum value� a Minimum length� �

Maximum value� ttuu Maximum length� �

Characters� a b c d e f g h i j k l m n o p q r s t u v wx y z

HeadRev Headword� reversed

Type� character Null values�

Minimum value� �oht Minimum length� �

Maximum value� zzuf Maximum length� �

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z
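
The reversed, lowercase and sorted headword columns described above can all be derived from the plain headword. The following is a minimal sketch of that derivation, not the procedure CELEX itself used. The function name `derived_head_columns` is hypothetical, and the sketch assumes that "lowercase, alphabetical" means lowercasing the headword and keeping only the letters a to z, which is what the character sets listed for `HeadLow` and `HeadLowSort` suggest.

```python
def derived_head_columns(head: str) -> dict[str, str]:
    """Derive reversed, lowercase and sorted variants of a headword.

    Assumption: HeadLow keeps only the letters a-z after lowercasing,
    so apostrophes, hyphens and spaces are dropped.
    """
    low = "".join(ch for ch in head.lower() if "a" <= ch <= "z")
    return {
        "HeadRev": head[::-1],               # characters in reverse order
        "HeadLow": low,                      # lowercase letters only
        "HeadLowSort": "".join(sorted(low)), # same letters, sorted a-z
    }


for word in ("fuzz", "tutu", "April"):
    print(word, derived_head_columns(word))
```

Under these assumptions "fuzz" reverses to "zzuf" and the sorted letters of "tutu" are "ttuu", which is consistent with the maximum values listed for `HeadRev` and `HeadLowSort`.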

HeadSyl Headword� syllabified

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� zoom Maximum length� ��

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z

HeadSylCnt Headword� number of orthographic syllables

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � �


HeadSylDia Headword� syllabified� diacritics

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� $e�p$ee Maximum length� ��

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z %a �a &c %e $e �e '( �n �o 'u

IdNum Lemma number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �

Imm Immediate segmentation

Type� character Null values� � � �

Minimum value� April Minimum length� �

Maximum value� zoom Maximum length� ��

Characters� # � � A B D E F G H I J L M O P Q S T U V W ab c d e f g h i j k l m n o p q r s t u v w xy z

ImmAllo Stem allomorphy� top level

Type� character Null values� � � �

Minimum value� B Minimum length� �

Maximum value� Z Maximum length� �

Characters� B C D F N Z


ImmClass Immediate segmentation� word class labels

Type� character Null values� � � �

Minimum value� A Minimum length� �

Maximum value� xxx Maximum length� �

Characters� A B C D I N O P Q S T V x

ImmInfix Infixation, top level

Type� character Null values� � � �

Minimum value� N Minimum length� �

Maximum value� N Maximum length� �

Characters� N

ImmOpac Opacity� top level

Type� character Null values� � � �

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

ImmRevers Reversion� top level

Type� character Null values� � � �

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

ImmSA Immediate segmentation, stem/affix labels

Type� character Null values� � � �

Minimum value� AA Minimum length� �

Maximum value� SSA Maximum length� �

Characters� A F S


ImmSubCat Immediate segmentation� subcat labels

Type� character Null values� � � �

Minimum value� Minimum length� �

Maximum value� xxx Maximum length� �

Characters� � � � A B C D I N O P Q S T x

ImmSubst Affix substitution� top level

Type� character Null values� � � �

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Intrans V For verbs� intransitive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Lang Language information

Type� character Null values� �����

Minimum value� A Minimum length� �

Maximum value� S Maximum length� �

Characters� A B D F G I L S

LevelCnt Number of morphological levels

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " �


Link V For verbs� linking verb

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

MorCnt Number of morphemes

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� " Maximum length� �

Characters� � � � � � "

MorphCnt Number of morphological analyses

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � �

MorphNum Morphological analysis number

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � �

MorphStatus Morphological status

Type� character Null values�

Minimum value� C Minimum length� �

Maximum value� Z Maximum length� �

Characters� C F I M O R U Z


NVAffComp Noun-verb-affix compound

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Ord A For adjectives� ordinary

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Ord ADV For adverbs� ordinary

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Ord NUM For numerals� ordinal

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

OrthoCnt Number of spellings

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � �


OrthoNum Spelling number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � �

OrthoStatus Status of spelling

Type� character Null values�

Minimum value� A Minimum length� �

Maximum value� B Maximum length� �

Characters� A B

Pers PRON For pronouns� personal

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

PhonCLX Phon� headword� CELEX charset

Type� character Null values�

Minimum value� �N�S���s� Minimum length� �

Maximum value� z�u��m� Maximum length� ��

Characters� � � � � � � A D E I N O S T U V Z a b d e fg h i j k l m n p r s t u v w x z �


PhonCnt Headword� number of phonemes

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� �� Maximum length� �

Characters� � � � � � " � � �

PhonCPA Phon� headword� CPA charset

Type� character Null values�

Minimum value� �� Minimum length� �

Maximum value� z�u��m� Maximum length� ��

Characters� � � � � � � A D E I J N O S T U Z � a b d e fg h i j k l m n o p r s t u v w x z �

PhonCV Headword� phon� CV pattern

Type� character Null values�

Minimum value� C Minimum length� �

Maximum value� VVCC�CVCC Maximum length� �"

Characters� � C S V

PhonCVBr Headword� phon� CV pattern� with brackets

Type� character Null values�

Minimum value� CCCVCCC� Minimum length� �

Maximum value� V�VC� Maximum length� ��

Characters� C S V �


PhonDISC Phon� headword� DISC charset

Type� character Null values�

Minimum value� ! Minimum length� �

Maximum value� �t�t Maximum length� ��

Characters� ! ) � � � � � " � � � � C D E F H I J N P QR S T U V Z * b c d f g h i j k l m n p q r st u v w x z � �

PhonSAM Phon� headword� SAM�PA charset

Type� character Null values�

Minimum value� ���T� Minimum length� �

Maximum value� ���n�Z�eI�n�j�u�� Maximum length� ��

Characters� � � � � � � A D E I N O Q S T U V Z a b d e fg h i j k l m n p r s t u v w x z � �

PhonStrsCLX Syll� phon� headword� with stress� CELEX charset

Type� character Null values�

Minimum value� � �b���lI�S��nIst Minimum length� �

Maximum value� zU��O�l��dZIst Maximum length� ��

Characters� � � � � � � � � A D E I N O S T U V Z a b de f g h i j k l m n p r s t u v w x z �

PhonStrsCPA Syll� phon� headword� with stress� CPA charset

Type� character Null values�

Minimum value� ����mO�r�l Minimum length� �

Maximum value� zU��O�l��J�Ist Maximum length� �

Characters� � � � � � � � � A D E I J N O S T U Z � a b de f g h i j k l m n o p r s t u v w x z �


PhonStrsDISC Syll� phon� headword� with stress� DISC charset

Type� character Null values�

Minimum value� �!��mEn Minimum length� �

Maximum value� �n��t�t Maximum length� ��

Characters� � ! ) � � � � � � � " � � � � C D E F H I JN P Q R S T U V Z * b c d f g h i j k l m n pq r s t u v w x z � �

PhonStrsSAM Syll� phon� headword� with stress� SAM�PA charset

Type� character Null values�

Minimum value� ����TI Minimum length� �

Maximum value� �z��bE�stQs Maximum length� ��

Characters� � � � � � � � � A D E I N O Q S T U V Z a b de f g h i j k l m n p r s t u v w x z � �

PhonSylCLX Syll� phon� headword� CELEX charset

Type� character Null values�

Minimum value� �SI Minimum length� �

Maximum value� zu�m Maximum length� �"

Characters� � � � � � � A D E I N O S T U V Z a b d e fg h i j k l m n p r s t u v w x z �

PhonSylBCLX Syll� phon� headword� CELEX charset brackets�

Type� character Null values�

Minimum value� N�S�s� Minimum length� �

Maximum value� zu�m� Maximum length� ��

Characters� � � � � � A D E I N O S T U V Z � a b d ef g h i j k l m n p r s t u v w x z �


PhonSylCPA Syll� phon� headword� CPA charset

Type� character Null values�

Minimum value� � Minimum length� �

Maximum value� zu�m Maximum length� ��

Characters� � � � � � � A D E I J N O S T U Z � a b d e fg h i j k l m n o p r s t u v w x z �

PhonSylDISC Syll� phon� headword� DISC charset

Type� character Null values�

Minimum value� ! Minimum length� �

Maximum value� �n�t�t Maximum length� �"

Characters� ! ) � � � � � � " � � � � C D E F H I J N PQ R S T U V Z * b c d f g h i j k l m n p q rs t u v w x z � �

PhonSylSAM Syll� phon� headword� SAM�PA charset

Type� character Null values�

Minimum value� ���TI Minimum length� �

Maximum value� ��n�ZeI�nju� Maximum length� �"

Characters� � � � � � � A D E I N O Q S T U V Z a b d e fg h i j k l m n p r s t u v w x z � �

Phr V For verbs� phrasal verb

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


PhrPrep V For verbs� phrasal prepositional verb

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Plu N For nouns� plural use

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Poss PRON For pronouns� possessive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

PostPos A For adjectives� postpositive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

PostPos ADV For adverbs� postpositive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


PostPos N For nouns� postpositive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Pred A For adjectives� predicative

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Pred ADV For adverbs� predicative

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Def Default analysis

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Prep V For verb� prepositional verb

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


Pron PRON For pronouns� pronominal use

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

PronCnt Number of pronunciations

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � � �

PronNum Pronunciation number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � � �

PronStatus Status of pronunciation

Type� character Null values�

Minimum value� P Minimum length� �

Maximum value� S Maximum length� �

Characters� P S

Proper N For nouns� proper noun

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


Refl_PRON For pronouns, reflexive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Sing N For nouns� singular use

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

StrsPat Headword� stress pattern

Type� character Null values�

Minimum value� � Minimum length� �

Maximum value� �� � Maximum length� �

Characters� � �

Struc Structured segmentation

Type� character Null values� � � �

Minimum value� confide��ent���ence���trick�����

Minimum length� �

Maximum value� zoom� Maximum length� ��

Characters� # � � � A B D E F G H I J L M O P Q S T U VW a b c d e f g h i j k l m n o p q r s t u vw x y z


StrucAllo Stem allomorphy� any level

Type� character Null values� � � �

Minimum value� B Minimum length� �

Maximum value� Z Maximum length� �

Characters� B C D F N Z

StrucBrackLab Structured segmentation� word class labels only

Type� character Null values� � � �

Minimum value� �V��N� Minimum length� �

Maximum value� �V� Maximum length� ��

Characters� # � � � A B C D I N O P Q S T V � x �

StrucLab Structured segmentation� word class labels

Type� character Null values� � � �

Minimum value� confide�V��ent�A�V���A��ence�N�A���N��trick�V��N��N��N��V�

Minimum length� "

Maximum value� zoom�V� Maximum length� ��

Characters� # � � � � A B C D E F G H I J L M N O P Q ST U V W � a b c d e f g h i j k l m n o p qr s t u v w x y z �

StrucOpac Opacity� any level

Type� character Null values� � � �

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


StrucSubst Affix substitution� any level

Type� character Null values� � � �

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Sub C For conjunctions� subordinating

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

SylCnt Headword� number of phonetic syllables

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � �

Trans V For verbs� transitive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

TransComp V For verbs� transitive plus complementation

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


TransDer Derivational transformation� top level

Type� character Null values� �����

Minimum value� ! Minimum length� �

Maximum value� �yupon�i! Maximum length� �

Characters� ! � � a b c d e f g h i k l m n o p q r s t uv w x y z

Unc N For nouns� uncountable

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Voc N For nouns� vocative

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

Wh_PRON For pronouns, wh-pronoun

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y


� ORTHOGRAPHY OF ENGLISH WORDFORMS �E���

Number of spellings                                   OrthoCnt
Spelling number (1-N)                                 OrthoNum
Language code                                         OrthoStatus

Frequency of spelling
    COBUILD frequency 17.9m                           CobSpellFreq
    COBUILD 95% confidence deviation 17.9m            CobSpellDev

Spelling
    Plain
        Without diacritics                            Word
        Without diacritics, reversed                  WordRev
        With diacritics                               WordDia
        Purely lowercase alphabetical                 WordLow
        Purely lowercase alphabetical, sorted         WordLowSort
        Number of letters                             WordCnt
    Syllabified
        Without diacritics                            WordSyl
        With diacritics                               WordSylDia
        Number of syllables                           WordSylCnt


�� PHONOLOGY OF ENGLISH WORDFORMS �E���

Number of pronunciations                              PronCnt
Pronunciation number (1-N)                            PronNum
Status of pronunciation                               PronStatus

Phonetic transcriptions
    Plain
        SAM-PA char set                               PhonSAM
        CELEX char set                                PhonCLX
        CPA char set                                  PhonCPA
        DISC char set                                 PhonDISC
        Number of phonemes                            PhonCnt
    Syllabified
        SAM-PA char set                               PhonSylSAM
        CELEX char set                                PhonSylCLX
        CELEX char set, brackets                      PhonSylBCLX
        CPA char set                                  PhonSylCPA
        DISC char set                                 PhonSylDISC
        Number of syllables                           SylCnt
    Syllabified, with stress
        SAM-PA char set                               PhonStrsSAM
        CELEX char set                                PhonStrsCLX
        CPA char set                                  PhonStrsCPA
        DISC char set                                 PhonStrsDISC

Phonetic patterns
    Stress pattern                                    StrsPat
    Syllabified
        CV pattern                                    PhonCV
        CV pattern, brackets                          PhonCVBr


�� MORPHOLOGY OF ENGLISH WORDFORMS �E���

Lemma information
    Numeric id                                        IdNumLemma
    Orthography                                       ORTHOGRAPHY OF ENGLISH LEMMAS
    Phonology                                         PHONOLOGY OF ENGLISH LEMMAS
    Morphology                                        MORPHOLOGY OF ENGLISH LEMMAS
    Syntax                                            SYNTAX OF ENGLISH LEMMAS
    Frequency                                         FREQUENCY OF ENGLISH LEMMAS
    (See the information in these diagrams for the available columns.)

Inflectional features
    Singular                                          Sing
    Plural                                            Plu
    Positive                                          Pos
    Comparative                                       Comp
    Superlative                                       Sup
    Infinitive                                        Inf
    Participle                                        Part
    Present tense                                     Pres
    Past tense                                        Past
    1st person verb                                   Sin1
    2nd person verb                                   Sin2
    3rd person verb                                   Sin3
    Rare form                                         Rare

Type of flection                                      FlectType

Inflectional transformation                           TransInfl


�� FREQUENCY OF ENGLISH WORDFORMS �E���

COBUILD all sources
    COBUILD frequency 17.9m                           Cob
    COBUILD 95% confidence deviation 17.9m            CobDev
    COBUILD frequency 1m                              CobMln
    COBUILD frequency, logarithmic                    CobLog

COBUILD written sources
    COBUILD written frequency 16.6m                   CobW
    COBUILD written frequency 1m                      CobWMln
    COBUILD written frequency, logarithmic            CobWLog

COBUILD spoken sources
    COBUILD spoken frequency 1.3m                     CobS
    COBUILD spoken frequency 1m                       CobSMln
    COBUILD spoken frequency, logarithmic             CobSLog
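
The frequency columns above come in three flavours for each source: a raw count, a count scaled to one million tokens, and a logarithmic value. The sketch below shows one plausible way the scaled and logarithmic figures relate to the raw count. It rests on assumptions that this section does not spell out: a corpus size of 17.9 million tokens for all sources, rounding to the nearest whole number for the per-million figure, a base-10 logarithm of the per-million figure, and a value of 0 when the frequency is 0. The function names are hypothetical.

```python
import math

# Assumed size of the full COBUILD corpus in tokens (17.9 million).
CORPUS_TOKENS = 17_900_000


def per_million(raw_count: int, corpus_tokens: int = CORPUS_TOKENS) -> int:
    """Scale a raw corpus count to occurrences per one million tokens."""
    return round(raw_count * 1_000_000 / corpus_tokens)


def log_frequency(per_mln: int) -> float:
    """Base-10 logarithm of a per-million frequency.

    Assumption: a frequency of 0 maps to 0.0, because log10(0) is undefined.
    """
    return round(math.log10(per_mln), 4) if per_mln > 0 else 0.0


raw = 179                      # hypothetical raw count, for illustration only
scaled = per_million(raw)
print(raw, scaled, log_frequency(scaled))
```

For the written or spoken figures the same functions could be called with the size of the corresponding sub-corpus as `corpus_tokens`.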


Column descriptions for English wordforms �E���

Cob COBUILD frequency

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ��"��� Maximum length� "

Characters� � � � � � " � � �

CobDev COBUILD frequency deviation

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ��"��� Maximum length� "

Characters� � � � � � " � � �

CobLog COBUILD frequency� logarithmic

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����� Maximum length� "

Characters� � � � � � � " � � �

CobMln COBUILD frequency �� � �

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� � ��� Maximum length� �

Characters� � � � � � " � � �

CobS COBUILD spoken frequency ���m

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� �"��� Maximum length� �

Characters� � � � � � " � � �


CobSLog COBUILD spoken frequency� logarithmic

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ���� � Maximum length� "

Characters� � � � � � � " � � �

CobSMln COBUILD spoken frequency �� � �

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �

CobSpellDev COBUILD spelling frequency deviation

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ��"��� Maximum length� "

Characters� � � � � � " � � �

CobSpellFreq COBUILD spelling frequency

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ��"��� Maximum length� "

Characters� � � � � � " � � �

CobW COBUILD written frequency ����m

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ��"��� Maximum length� "

Characters� � � � � � " � � �


CobWLog COBUILD written frequency� logarithmic

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� �����" Maximum length� "

Characters� � � � � � � " � � �

CobWMln COBUILD written frequency �� � �

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� �� �� Maximum length� �

Characters� � � � � � " � � �

Comp Inflectional feature� comparative

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

FlectType Type of flection

Type� character Null values�

Minimum value� P Minimum length� �

Maximum value� s Maximum length� �

Characters� � � � P S X a b c e i p r s

IdNum Word number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � ""� Maximum length� "

Characters� � � � � � " � � �


IdNumLemma Lemma number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �
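
`IdNumLemma` is the link between a wordform and its lemma: it holds the lemma number (`IdNum` in the lemma lexicon) of the entry the wordform belongs to. A minimal sketch of how that link might be followed is shown below. The rows are invented for illustration and are not taken from the database, and `with_headword` is a hypothetical helper name.

```python
def with_headword(wordform_rows, lemma_rows):
    """Attach each wordform's lemma headword by following IdNumLemma."""
    heads = {row["IdNum"]: row["Head"] for row in lemma_rows}
    return [
        {**wf, "Head": heads.get(wf["IdNumLemma"])}  # None if no such lemma
        for wf in wordform_rows
    ]


# Invented example rows: the identifiers are placeholders, not real CELEX numbers.
lemma_rows = [{"IdNum": 1, "Head": "zoom"}]
wordform_rows = [
    {"IdNum": 1, "IdNumLemma": 1, "Word": "zoom"},
    {"IdNum": 2, "IdNumLemma": 1, "Word": "zoomed"},
]

for row in with_headword(wordform_rows, lemma_rows):
    print(row["Word"], "->", row["Head"])
```

Building the lookup dictionary once keeps the join linear in the number of rows, which matters for lexicons with many thousands of entries.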

Inf Inflectional feature� infinitive

Type� character Null values�

Minimum value� N Minimum length� �

Maximum value� Y Maximum length� �

Characters� N Y

OrthoCnt Number of spellings

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � �

OrthoNum Spelling number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � �

OrthoStatus Status of spelling

Type: character Null values: 0

Minimum value: A Minimum length: 1

Maximum value: B Maximum length: 1

Characters: A B

Part Inflectional feature: participle

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Past Inflectional feature: past tense

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

PhonCLX Phon� wordform� CELEX charset

Type� character Null values�

Minimum value� �N�S���s� Minimum length� �

Maximum value� z�u��z� Maximum length� ��

Characters� � � � � � � A D E I N O S T U V Z a b d e fg h i j k l m n p r s t u v w x z �

PhonCnt Wordform� number of phonemes

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� �� Maximum length� �

Characters� � � � � � " � � �

PhonCPA Phon� wordform� CPA charset

Type� character Null values�

Minimum value� �� Minimum length� �

Maximum value� z�u��z� Maximum length� ��

Characters� � � � � � � A D E I J N O S T U Z � a b d e fg h i j k l m n o p r s t u v w x z �

PhonCV Wordform� phon� CV pattern

Type� character Null values�

Minimum value� C Minimum length� �

Maximum value� VVCCC Maximum length� ��

Characters� � C S V

PhonCVBr Wordform� phon� CV pattern� with brackets

Type� character Null values�

Minimum value� CCCC� Minimum length� �

Maximum value� V�VC� Maximum length� ��

Characters� C S V �

PhonDISC Phon� wordform� DISC charset

Type� character Null values�

Minimum value� ! Minimum length� �

Maximum value� �t�ts Maximum length� ��

Characters� ! ) � � � � � " � � � � C D E F H I J N P QR S T U V Z * b c d f g h i j k l m n p q r st u v w x z � �

PhonSAM Phon� wordform� SAM�PA charset

Type� character Null values�

Minimum value� ���D�z� Minimum length� �

Maximum value� ���n�Z�eI�n�j�u��z� Maximum length� ��

Characters� � � � � � � A D E I N O Q S T U V Z a b d e fg h i j k l m n p r s t u v w x z � �

PhonStrsCLX Syll� phon� wordform� with stress� CELEX charset

Type� character Null values�

Minimum value� � �b���lI�S��nIst Minimum length� �

Maximum value� zU��O�l��dZIsts Maximum length� ��

Characters� � � � � � � � � A D E I N O S T U V Z a b de f g h i j k l m n p r s t u v w x z �

PhonStrsCPA Syll� phon� wordform� with stress� CPA charset

Type� character Null values�

Minimum value� ����mO�r�l Minimum length� �

Maximum value� zU��O�l��J�Ists Maximum length� �

Characters� � � � � � � � � A D E I J N O S T U Z � a b de f g h i j k l m n o p r s t u v w x z �

PhonStrsDISC Syll� phon� wordform� with stress� DISC charset

Type� character Null values�

Minimum value� �!��mEn Minimum length� �

Maximum value� �n��t�ts Maximum length� ��

Characters� � ! ) � � � � � � � " � � � � C D E F H I JN P Q R S T U V Z * b c d f g h i j k l m n pq r s t u v w x z � �

PhonStrsSAM Syll� phon� wordform� with stress� SAM�PA charset

Type� character Null values�

Minimum value� ����TI Minimum length� �

Maximum value� �z��bE�stQs Maximum length� ��

Characters� � � � � � � � � A D E I N O Q S T U V Z a b de f g h i j k l m n p r s t u v w x z � �

PhonSylCLX Syll� phon� wordform� CELEX charset

Type� character Null values�

Minimum value� �SI Minimum length� �

Maximum value� zu�z Maximum length� ��

Characters� � � � � � � A D E I N O S T U V Z a b d e fg h i j k l m n p r s t u v w x z �

PhonSylBCLX Syll� phon� wordform� CELEX charset brackets�

Type� character Null values�

Minimum value� N�S�s� Minimum length� �

Maximum value� zu�z� Maximum length� �"

Characters� � � � � � A D E I N O S T U V Z � a b d ef g h i j k l m n p r s t u v w x z �

PhonSylCPA Syll� phon� wordform� CPA charset

Type� character Null values�

Minimum value� � Minimum length� �

Maximum value� zu�z Maximum length� ��

Characters� � � � � � � A D E I J N O S T U Z � a b d e fg h i j k l m n o p r s t u v w x z �

PhonSylDISC Syll� phon� wordform� DISC charset

Type� character Null values�

Minimum value� ! Minimum length� �

Maximum value� �n�t�ts Maximum length� �"

Characters� ! ) � � � � � � " � � � � C D E F H I J N PQ R S T U V Z * b c d f g h i j k l m n p q rs t u v w x z � �

PhonSylSAM Syll� phon� wordform� SAM�PA charset

Type� character Null values�

Minimum value� ���TI Minimum length� �

Maximum value� ��n�ZeI�nju�z Maximum length� ��

Characters� � � � � � � A D E I N O Q S T U V Z a b d e fg h i j k l m n p r s t u v w x z � �

Plu Inflectional feature: plural

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Pos Inflectional feature: positive

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Pres Inflectional feature: present tense

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

PronCnt Number of pronunciations

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � � �

PronNum Pronunciation number

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � � �

PronStatus Status of pronunciation

Type: character Null values: 0

Minimum value: P Minimum length: 1

Maximum value: S Maximum length: 1

Characters: P S

Rare Inflectional feature: rare form

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Sin1 Inflectional feature: 1st person verb

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Sin2 Inflectional feature: 2nd person verb

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Sin3 Inflectional feature: 3rd person verb

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

Sing Inflectional feature: singular

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

StrsPat Wordform� stress pattern

Type� character Null values�

Minimum value� � Minimum length� �

Maximum value� �� � Maximum length� �

Characters� � �

Sup Inflectional feature: superlative

Type: character Null values: 0

Minimum value: N Minimum length: 1

Maximum value: Y Maximum length: 1

Characters: N Y

SylCnt Wordform� number of phonetic syllables

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � �

TransIn� Inflectional transformation

Type� character Null values� ��"�

Minimum value� � Minimum length� �

Maximum value� ��y�iest Maximum length� ��

Characters� # � � � b d e f g i k l m n p r s t v y z

Word Word

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� zoos Maximum length� �

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z

WordCnt Word� number of letters

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � � �

WordDia Word� diacritics

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� $ep$ees Maximum length� �

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z %a �a &c %e $e �e '( �n �o 'u

WordLow Word� lowercase� alphabetical

Type� character Null values�

Minimum value� a Minimum length� �

Maximum value� zoos Maximum length� �

Characters� a b c d e f g h i j k l m n o p q r s t u v wx y z

WordLowSort Word� lowercase� alphabetical� sorted

Type� character Null values�

Minimum value� a Minimum length� �

Maximum value� ttuu Maximum length� �

Characters� a b c d e f g h i j k l m n o p q r s t u v wx y z

WordRev Word� reversed

Type� character Null values�

Minimum value� �oht Minimum length� �

Maximum value� zzuf Maximum length� �

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z
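The derived spelling columns WordLow, WordLowSort and WordRev can be illustrated with a short sketch. The rules below are only inferred from the column names and the sample values listed in this appendix (for instance the maximum WordRev value zzuf, and the maximum WordLowSort value ttuu, which is what sorting the letters of a word such as tutu would give). They are not taken from the CELEX software, and the function names are hypothetical.

```python
def word_low(word):
    """Lowercase the word and keep only the letters a-z (assumed rule)."""
    return "".join(ch for ch in word.lower() if "a" <= ch <= "z")


def word_low_sort(word):
    """Sort the letters of the lowercase, letters-only form alphabetically."""
    return "".join(sorted(word_low(word)))


def word_rev(word):
    """Reverse the spelling character by character."""
    return word[::-1]


print(word_low("Anglo-Saxon"))   # letters only, lowercase
print(word_low_sort("tutu"))     # letters in alphabetical order
print(word_rev("fuzz"))          # spelling reversed
```

A reversed spelling of this kind is handy for sorting a word list by its endings, while the sorted-letters form groups together words that are anagrams of one another.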

WordSyl Word� syllabified

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� zoos Maximum length� ��

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z

WordSylCnt Word� number of orthographic syllables

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � Maximum length� �

Characters� � � � � � " � �

WordSylDia Word� syllabified� diacritics

Type� character Null values�

Minimum value� �d Minimum length� �

Maximum value� $e�p$ees Maximum length� ��

Characters� # � � A B C D E F G H I J K L M N O P Q R S TU V W Y a b c d e f g h i j k l m n o p q r st u v w x y z %a �a &c %e $e �e '( �n �o 'u


�� ENGLISH COBUILD CORPUS TYPES �E���

Orthography              Graphemic transcription                  Type

COBUILD all sources      Absolute frequency                       Freq

COBUILD written sources  Written frequency                        FreqW
                         British English, written frequency       FreqWB
                         American English, written frequency      FreqWA
                         Undetermined origin, written frequency   FreqWU

COBUILD spoken sources   Spoken frequency                         FreqS
                         British English, spoken frequency        FreqSB
                         Undetermined origin, spoken frequency    FreqSU

Freq Absolute frequency

Type� numeric Null values�

Minimum value� � Minimum length� �

Maximum value� � ����" Maximum length� �

Characters� � � � � � " � � �

FreqS COBUILD spoken frequency� ���m

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� " ��� Maximum length� �

Characters� � � � � � " � � �

FreqSB COBUILD spoken frequency� British English

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �

FreqSU COBUILD spoken frequency� undetermined origin

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �

FreqW COBUILD written frequency� ����m

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� � ����� Maximum length� �

Characters� � � � � � " � � �

FreqWA COBUILD written frequency� American English

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� ����� Maximum length� �

Characters� � � � � � " � � �

FreqWB COBUILD written frequency� British English

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� �� �� Maximum length� "

Characters� � � � � � " � � �

FreqWU COBUILD written frequency� undetermined origin

Type� numeric Null values�

Minimum value� Minimum length� �

Maximum value� �"��� Maximum length� �

Characters� � � � � � " � � �

Type Graphemic transcription

Type� character Null values�

Minimum value� �� mm Minimum length� �

Maximum value� zzzzzzrrrrr Maximum length� ��

Characters� � � � � � � � � � " � � � a b c d e f g h ij k l m n o p q r s t u v w x y z


� COMPUTER PHONETIC CHARACTER CODES

The tables in this appendix exemplify the DISC character set in full. DISC is the character set which gives a single, unique code to each phonetic segment in the standard sound systems of Dutch, English, and German. Here segment means consonant, affricate, syllabic consonant, short vowel, long vowel, diphthong or nasalized vowel.

Each table gives the IPA characters at the far left-hand side, and the corresponding DISC characters on the far right-hand side. In between come examples (where they occur) of words in Dutch, English, and German which exemplify the segments in question, and the code or codes used to represent those segments in the other character coding sets available: SAM-PA, CELEX, and CPA. This means you can use this appendix both as a full overview of DISC and as a check on every phonetic character code used in the CELEX databases. If you just want to see the codes used for one particular language, then you should consult the Phonology section of the appropriate Linguistic Guide; you can also find general descriptions of the character sets there.
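As a rough illustration of how such tables can be used for conversion, the sketch below rewrites a transcription in the CELEX character set into DISC. It is a minimal sketch, not CELEX's own converter: the mapping holds only the few two-character codes that are still legible in this copy of the tables, the function name is hypothetical, and every other character is assumed to be identical in both sets, which is true of rows such as p, b, t and s but not of every segment.

```python
# Two-character CELEX-charset codes and their single DISC characters,
# copied from rows of the tables in this appendix.
CELEX_TO_DISC = {
    "tS": "J",  # English cheap, German Matsch
    "EI": "K",  # Dutch wijs
    "UI": "L",  # Dutch huis
    "AU": "M",  # Dutch koud
    "ai": "W",  # German weit
    "au": "B",  # German Haut
    "Oy": "X",  # German freut
}


def celex_to_disc(transcription):
    """Replace known two-character codes; copy everything else unchanged."""
    out = []
    i = 0
    while i < len(transcription):
        pair = transcription[i:i + 2]
        if pair in CELEX_TO_DISC:
            out.append(CELEX_TO_DISC[pair])
            i += 2
        else:
            # Assumption: the code is the same in both character sets.
            out.append(transcription[i])
            i += 1
    return "".join(out)


print(celex_to_disc("hUIs"))  # Dutch huis
print(celex_to_disc("wEIs"))  # Dutch wijs
```

Trying the two-character codes first matters because the CELEX set spells some segments with two characters, whereas DISC always uses exactly one character per segment.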

IPA Dutch English German SAM-PA CELEX CPA DISC

p put pat Pakt p p p p

b bad bad Bad b b b b

t tak tack Tag t t t t

d dak dad dann d d d d

k kat cad kalt k k k k

� goal game Gast g g g g

� lang bang Klang N N N N

m mat mad Ma� m m m m

n nat nat Naht n n n n

l lat lad Last l l l l

r� J rat� later rat Ratte r r r r

f fiets fat falsch f f f f

v vat vat Welt v v v v

S thin T T T T

� then D D D D

s sap sap Gas s s s s

z zat zap Suppe z z z z

M sjaal sheep Schiff S S S S

� ravage measure Genie Z Z Z Z

j jas yank Jacke j j j j

x� �c licht� gaat loch Bach� ich x x x x

regen G G G G

h had had Hand h h h h

w why waterproof w w w w

Y wat w w w w

pf Pferd pf pf pf �

ts Zahl ts ts C� �

Q cheap Matsch tS tS T� J

� jazz jeep Gin dZ dZ J� �

�j bacon N� N� N� C

mj idealism m� m� m� F

nj burden n� n� n� H

lj dangle l� l� l� P

� father �linking �r�� r r r R

DISC Computer Phonetic Codes: consonants, affricates and syllabic consonants

IPA Dutch English German SAM-PA CELEX CPA DISC

iq liep bean Lied i i i i

iqq analyse i i i �

q barn Advantage A A A �

aq laat klar a a a a

�q born Allroundman O O O

uq boek boon Hut u u u u

�q burn Teamwork � � � �

yq buut für y y y y

yqq centrifuge y y y �

�q scene Käse E E E �

�q freule � U Q

�q zone Q O o �

eq leeg Mehl e e e e

�q deuk Möbel � � q �

oq boom Boot o o o o

e� bay Native eI eI e� �

a� buy Shylock aI aI a� �

�� boy Playboy OI OI o� �

V no �U �U O� �

aV brow Allroundsportler aU aU A� �

� peer I� I� I� �

� pair E� E� E� �

V poor U� U� U� �

�i wijs EI EI y� K

�y huis �I UI q� L

u koud Au AU A� M

ai weit ai ai a� W

au Haut au au A� B

�y freut Oy Oy o� X

DISC Computer Phonetic Codes: long vowels and diphthongs

IPA Dutch English German SAM-PA CELEX CPA DISC

� lip pit Mitte I I I I

� Pfütze Y Y Y Y

� leg pet Bett E E E E

� Götter � Q Q �

� pat Ragtime � � �� �

a hat a a a �

lat Kalevala A A A A

� pot Q O O Q

� putt Plumpudding V V � V

� bom Glocke O O O O

V put Pult U U U U

T put � U Y� �

gelijk another Beginn � � � �

��q Parfum �� Q� Q� �

�� timbre impromptu �� �� ��� c

�q détente Détente A� A� A� q

��q lingerie Bassin �� �� ��� �

��q bouillon Affront O� O� O� �

DISC Computer Phonetic Codes: short vowels and nasalized vowels

IPA Description SAM-PA CELEX CPA DISC

q length marker

� syllable marker � � � �

h primary stress ! ! !

i secondary stress "

� nasalization � � �

examples� A� A� A�

DISC Computer Phonetic Codes: length, stress, syllable and nasalization markers

� ASCII AND EIGHT-BIT CHARACTER CODES

The two tables which follow show full details of the seven and eight bit character codes used by CELEX on its DIGITAL VAX/VMS computer systems. They are particularly useful when you need to transfer data to or from the CELEX machine: you can find out which codes must be converted. The first table shows the basic characters in use; they are the standard seven bit ASCII codes, and most ASCII terminals and printers should reproduce these characters as shown. The second table shows the eight bit codes which DIGITAL VT��� and VT���-series terminals can reproduce; these are the codes which provide the diacritic characters available in some columns in the CELEX databases.

Most of the printable seven and eight bit codes conform to the standard character set known as ISO 8859-1 (Latin Alphabet No. 1 or ECMA-94). There are some exceptions, however. The ISO 8859 (decimal) characters �� � ��� ���� ���� ������� ���� ��� �� � � �� ���� � � ��� and ��� are not implemented in the DIGITAL set, and ���� ���� and �� each produce a character other than the ISO 8859 recommended one.

For details about each character, consult the DIGITAL VMS General User Guide, Volume �A Guide to using VMS (VMS version �� � April ����), pages A���A���.
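When preparing a transfer, a first practical step is to find out which eight bit codes a file actually contains, since only those can need conversion. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the sample text is encoded as ISO 8859-1 (Latin-1) purely for illustration.

```python
def eight_bit_codes(data):
    """Return {code: count} for every byte above the seven bit ASCII range."""
    counts = {}
    for byte in data:
        if byte > 0x7F:
            counts[byte] = counts.get(byte, 0) + 1
    return counts


# Two accented letters, encoded as Latin-1 for the sake of the example.
sample = "caf\u00e9 na\u00efve".encode("latin-1")

# List each eight bit code in octal, decimal and hexadecimal, as the tables do.
for code, count in sorted(eight_bit_codes(sample).items()):
    print(f"octal {code:03o}  decimal {code:3d}  hex {code:02X}  seen {count}x")
```

Any code this reports can then be looked up in the eight bit table to see whether the receiving system renders it as the same character.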

[Table: DIGITAL/CELEX SEVEN-BIT ASCII CODES. Each cell gives a character together with its octal, decimal and hexadecimal code.]

[Table: DIGITAL/CELEX EIGHT-BIT CODES. Each cell gives a character together with its octal, decimal and hexadecimal code.]


� GLOSSARY

ABBREVIATION A term which refers to the shortened form of a normal word or phrase which can be used when the word or phrase itself is thought to be too long or unwieldy. A special lexicon type which contains nothing but abbreviations is available. An abbreviation can take one of three general forms, of which the most common is the contraction, where particular letters, often vowels, are removed from the word (thus Gld. is an abbreviation of Gelderland). Another form is the acronym, where the initial letters of each constituent word in a phrase are joined to make a new word (thus FIFA is an acronym of Fédération Internationale de Football Association). Finally there is also truncation, where a number of letters is removed from the end of a word (thus the chemical symbol for Argon is Ar).

ABSTRACT STEM This term refers to an alternative orthographic form of the stem. When a stem ends in '-s' or '-f', and when, in any of its related wordforms, that 'f' becomes a 'v' (the verb leven: ik leef, wij leven) or that 's' becomes a 'z' (the noun kaas: singular kaas, plural kazen), then the abstract stem is given with the endings '-v' and '-z' respectively (thus leev and kaaz instead of the normal stems leef and kaas). All other abstract stems have the same form as the normal stem.
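The rule for abstract stems can be sketched in a few lines. This is an illustration only: the function name and the alternates flag are hypothetical, and the flag stands in for the linguistic check (whether any related wordform shows 'v' or 'z') that the definition requires and that cannot be read off the stem itself.

```python
def abstract_stem(stem, alternates):
    """Return the abstract stem for a Dutch stem.

    alternates: True if, in any related wordform, the final 'f' becomes
    'v' or the final 's' becomes 'z' (supplied by the caller).
    """
    if alternates and stem.endswith("f"):
        return stem[:-1] + "v"
    if alternates and stem.endswith("s"):
        return stem[:-1] + "z"
    # All other abstract stems have the same form as the normal stem.
    return stem


print(abstract_stem("leef", True))   # verb leven
print(abstract_stem("kaas", True))   # noun kaas, plural kazen
print(abstract_stem("boek", False))  # unchanged
```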

AFFIX SUBSTITUTION This refers to the process by which an affix replaces part of a lemma when the affix and the lemma combine to make a new lemma. An example is the English lemma fatuity, where the headword fatuous can be said to lose the affix -ous and gain the affix -ity.

ALPHABETIC KEYS This refers to the letters, as opposed to the numbers and various other symbols, that are on your keyboard: twenty-six upper case and twenty-six lower case characters. When you are working in FLEX, you can use them to move to the nearest menu option which begins with the letter you press.

AND OPERATOR The logical connective combining two restrictions (or groups of restrictions) x and y in such a way that a row is included in the lexicon only if both x and y are true for that row; otherwise the row is not included in the lexicon.
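The effect of the AND operator on a set of rows can be shown with a small sketch. The column names Word, Plu and WordCnt are taken from the column descriptions in this guide, but the rows, the two restrictions and the helper name and_operator are invented for illustration and do not reflect how FLEX itself is implemented.

```python
rows = [
    {"Word": "zoo", "Plu": "N", "WordCnt": 3},
    {"Word": "zoos", "Plu": "Y", "WordCnt": 4},
    {"Word": "abbreviations", "Plu": "Y", "WordCnt": 13},
]


def and_operator(x, y):
    """Combine two restrictions so that a row passes only if both hold."""
    return lambda row: x(row) and y(row)


def is_plural(row):
    return row["Plu"] == "Y"


def is_short(row):
    return row["WordCnt"] <= 5


both = and_operator(is_plural, is_short)
lexicon = [row for row in rows if both(row)]
print([row["Word"] for row in lexicon])  # prints ['zoos']
```

Only zoos satisfies both restrictions: zoo fails the plural restriction and abbreviations fails the length restriction, so neither row enters the lexicon.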

ASCII The seven-bit binary number coding system used to represent alphabetic, numeric, punctuation and other characters in some types of computers. The letters stand for American Standard Code for Information Interchange.

BACKTRACK KEY This is a FLEX term which refers to the key you press to reverse back down the menu path you have just come along. You generally use the backtrack key to leave a menu window you do not wish to use, and to return to the previous window.

BAR This refers to the way FLEX indicates which option you are choosing in a particular window. Usually bold text, underlined text or reverse video text is used to differentiate the current option (the one you will get if you press return) from the others.

BATCH MODE This allows you, or FLEX working for you, to submit certain commands to the computer which it carries out as a separate, non-interactive job. FLEX uses batch mode for certain types of job, because in this way they are executed while the computer is not being used for smaller jobs or more important jobs.

BELL This refers to the noise your terminal can make to attract your attention. In FLEX, the bell normally sounds when a new window appears, or when a message is displayed.

CANCELLED This refers to the status of a FLEX job which either you or FLEX has asked to be stopped. When you cancel a job, the computer stops working on it, and ignores any results already achieved by that job.

CLASS LABELS A simple coding system used to indicate the syntactic class of a word: n means a noun, a means an adjective, and so on. They can be used instead of a numeric coding system, or typing the syntactic class in full.

COBUILD This is an acronym for Collins Birmingham University International Language Database. In 1987 COBUILD published the Collins Cobuild English Language Dictionary, which is based on analysis of their large corpus of modern English. The frequency information in the CELEX English database was taken from this corpus, which at the time contained �������� words.

COLUMN A database term which refers to the storage of one particular type of information: a column can contain a specific sort of words, or codes, or analyses.

COMPLETE SEGMENTATION This means the full derivational morphological analysis of a lemma into all its constituent morphemes.

COMPLETED This is a FLEX term which indicates that a job working in batch mode has now finished successfully.

COPY This is a FLEX term which refers to the creation of a new lexicon by using the definitions (i.e. the columns and restrictions) already specified for a different lexicon. The lexicon you copy can be your own, or, if you have a grant, someone else's.

CORPUS A sizeable collection of words, usually written texts, which can be used and processed by computers. Three text corpora were used to provide CELEX's frequency information: the INL, COBUILD, and Eindhoven corpora. They all contain modern-day texts drawn from diverse printed sources, such as recently-published books, newspapers and magazines, and sometimes, though to a much lesser extent, transcriptions made from recordings of speech.

CORPUS TOKEN A term which refers to the units distinguished during the disambiguation by computer of a text corpus used to provide frequency information. A corpus token is any string containing at least one alphabetic character, along with zero or more alphanumeric characters. The INL corpus contained ������� �� tokens, and the COBUILD corpus contained �������� tokens.

CORPUS TYPE A term which refers to a corpus token that occurs one or more times in a corpus. During the process of disambiguation, the occurrences of corpus tokens can be quantified. Whenever a new corpus token is discovered, it is also noted as a corpus type, and thereafter any re-occurrences are counted to give the frequency count of the type. The type which accounts for the greatest number of tokens in the INL corpus is de: it occurs �������� times.
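The relationship between corpus tokens and corpus types can be sketched as follows. This is a simplified illustration, not the procedure used to build the CELEX frequency counts: the function names are hypothetical, tokens are taken to be runs of unaccented letters and digits that contain at least one letter, and upper and lower case are left distinct.

```python
import re
from collections import Counter


def corpus_tokens(text):
    """Return the alphanumeric strings that contain at least one letter."""
    candidates = re.findall(r"[A-Za-z0-9]+", text)
    return [c for c in candidates if any(ch.isalpha() for ch in c)]


def corpus_types(text):
    """Count how often each distinct token (type) occurs."""
    return Counter(corpus_tokens(text))


sample = "de kat en de hond, 42 keer: de kat"
types = corpus_types(sample)
print(len(corpus_tokens(sample)))  # number of tokens
print(len(types))                  # number of types
print(types.most_common(1))        # the most frequent type
```

In the sample, the string 42 is not a token because it contains no alphabetic character, and de is the type that accounts for the most tokens.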

CPA A computer phonetic alphabet developed at the Ruhr-Universität Bochum. The letters stand for Computer Phonetic Alphabet.

CURSOR An indicator, usually a small flashing box or a line, used to indicate where the next character will appear or, in FLEX, to mark the current menu option, usually in conjunction with a bar.

CV PATTERNS A CV pattern is a re-written orthographic, phonetic or phonological transcription in which, generally speaking, any vowels or diphthongs are replaced by the letter V, and consonants by the letter C.
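A CV pattern for a spelling can be sketched as below. This is a deliberately simple orthographic version under stated assumptions: the function name is hypothetical, only a, e, i, o and u count as vowels, every vowel letter gets its own V (so a written vowel sequence is not collapsed into one V as a phonetic diphthong would be), and non-letters are dropped.

```python
VOWEL_LETTERS = set("aeiou")  # assumption: 'y' is treated as a consonant


def cv_pattern(spelling):
    """Replace vowel letters by V and all other letters by C."""
    pattern = []
    for ch in spelling.lower():
        if ch in VOWEL_LETTERS:
            pattern.append("V")
        elif ch.isalpha():
            pattern.append("C")
    return "".join(pattern)


print(cv_pattern("bang"))  # consonant, vowel, consonant, consonant
print(cv_pattern("idea"))
```

For a phonetic transcription in a one-character-per-segment set such as DISC, the same loop would apply with the set of vowel and diphthong codes in place of the vowel letters.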

DATABASE A database is a collection of information stored in computer files in such a way as to make the retrieval of that information quicker and more flexible.

DATANET-1 This is the name of the main public PSDN in the Netherlands. At present, all SURFnet nodes are DATANET-1 nodes, since SURFnet uses DATANET-1.

DBMS These letters stand for database management system, which is computer software designed to facilitate the use and development of a database. CELEX and FLEX use the relational DBMS marketed by the ORACLE company.

DELIMITER This refers to a character or group of characters used in a file to indicate the beginning or end of every field.

DERIVATIONAL/COMPOSITIONAL SEGMENTATION This is the type of morphological analysis which identifies the constituent lemmas, affixes and morphemes in a lemma, as opposed to inflectional analysis, which deals with the wordforms each lemma takes.

DIACRITICS The markers used in conjunction with regular orthographic characters to indicate some difference in pronunciation or stress, as with the German umlaut, the French acute, and the Czech háček.

DISAMBIGUATION This term refers to the process by which the frequency of words in a large text corpus can be established, either by computer, or people, or both. The process tries to link each word in the corpus (that is, each string consisting of one alphanumeric character plus at least one alphabetic character, with a space on either side) with a lemma. If a string occurs more than once, and if such a link can be made, then the word is considered to be a wordform, and the number of times the link was made is the frequency of that wordform.

DISC This is the name of the CELEX computer phonetic alphabet which uses one unique, distinct character for each vowel, long vowel, diphthong, consonant and affricate. Although not elegant in appearance, it is useful for computer processing.

DRAFT A FLEX term which refers to the version of a lexicon. It indicates that its definition is stored by FLEX, and that when you use the lexicon, the information is extracted from the main CELEX database using that definition. Contrast with fixed.

DTE These letters stand for data terminal equipment, which, for most CELEX users, normally just means 'computer'.

ETHERNET A special communications set-up for a LAN which allows different sorts of computers and other devices to be linked without central control from any one computer.

EXECUTING This is a FLEX term which indicates that a job is currently being carried out in batch mode.

EXPORT This is a FLEX term which refers to the process of making a normal VAX/VMS file from the contents of a lexicon.

EXPRESSION This is a FLEX term which refers to the right-hand part of a restriction, that is, the part which contains some number, word, or wildcard. A column name is linked to an expression by means of an operator.

FIELD In FLEX, this refers to that part of a window where information from the database appears. In a VAX/VMS file, it refers to a specific part of a line which is used for a particular sort of information.

FILE A collection of data stored for computer use, and arranged in a way which is significant to the user.

FINITE FORMS This refers to those flections which can occur in their own right in a main clause or sentence, and which indicate differences in tense and person, for example ik beweeg, ik bewoog, wij bewegen, wij bewogen.

FIXED A FLEX term which refers to the version of a lexicon. A fixed lexicon is a separate, independent database which contains information originally taken from the central CELEX databases, and when you use it, the information is extracted from this database rather than the central CELEX database. Contrast with draft.

FIXED FORMAT FILE This is a file whose fields are always a fixed number of characters wide, regardless of the width of the data each field contains.
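
A minimal Python sketch of reading and writing one line of a fixed format file follows. The field widths (10, 4 and 6 characters) and the sample values are hypothetical and do not describe any actual CELEX export layout.

```python
def format_fixed(values, widths):
    """Pad (or cut) every value to its field width and join the fields."""
    return "".join(str(v).ljust(w)[:w] for v, w in zip(values, widths))

def parse_fixed(line, widths):
    """Slice a line into fields by position and strip the padding."""
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].rstrip())
        pos += w
    return fields

WIDTHS = [10, 4, 6]  # hypothetical layout: word, class, frequency
line = format_fixed(["radio", "N", 123], WIDTHS)
print(repr(line))
print(parse_fixed(line, WIDTHS))
```

Because every field starts at a known position, no delimiter is needed; the cost is that short values are padded and over-long values are cut off.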

FLAT SEGMENTATION This is one type of derivational/compositional morphological analysis. It reduces a lemma directly to its constituent morphemes, without showing any of the intermediate levels of analysis. Contrast with hierarchical segmentation.

FLEX�EXP This is a logical name which refers to the directory of your CELEX account which is set aside specifically for files which are extracted from FLEX using the export facility.

FREQUENCY The number of times a corpus type occurs in a particular corpus. For example, the wordform radio has an INL frequency of ����� as counted in the ������� �� word INL corpus. This figure can also be expressed proportionally (i.e. the frequency expected per million words) or logarithmically. To arrive at a figure for the frequency of a lemma, the frequencies of its inflectional forms (that is, its wordforms) are added together.
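
The three ways of expressing a frequency can be sketched as follows. All numbers here are invented for illustration, and the choice of a base-10 logarithm (with 0.0 returned for a zero count) is an assumption rather than a statement about how CELEX stores its figures.

```python
import math

def per_million(count, corpus_size):
    """Frequency expected per million words of the corpus."""
    return count * 1_000_000 / corpus_size

def log_frequency(count):
    """Base-10 logarithm of a count (assumed base); 0.0 for a zero count."""
    return math.log10(count) if count > 0 else 0.0

def lemma_frequency(wordform_counts):
    """A lemma's frequency is the sum over its wordforms."""
    return sum(wordform_counts.values())

# Invented counts for the wordforms of one lemma in a 2,000,000-word corpus.
walk = {"walk": 50, "walks": 20, "walked": 25, "walking": 5}
total = lemma_frequency(walk)
print(total, per_million(total, 2_000_000), log_frequency(total))
```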

FULLY SYLLABIFIED This refers to orthographic transcriptions which have a syllable marker whenever a syllable boundary occurs within a word, including single-letter syllables which occur at the beginning or end of a word. Contrast with partially syllabified.

GATEWAY This refers to the point of interconnection between two different communications networks. Often users are not aware they are using gateways; occasionally, though, you may first have to connect to a gateway before being able to use the other network.

GRANT This is a FLEX term that indicates whether one or more particular FLEX users, or every FLEX user, can copy a lexicon created by you.

GRAPHEMIC This is the adjective used to denote characters which occur in normal Dutch, English or German orthography. It is used to distinguish phonetic or phonological transcriptions, which use specifically phonetic character alphabets, from transcriptions which are written or typed using the roman alphabet.

HEADWORD A term which refers to one of the two forms a lemma is given in the CELEX databases. It corresponds to the traditional lexicographic headword to be found in dictionaries. In Dutch, German, and English the forms used always resemble words that occur naturally in the language, rather than abstract forms. Thus in Dutch, the headword of a noun is its singular form. (For a definitive list of the forms used, consult Appendix IV.) Contrast with stem.

HELP KEY This is a FLEX term that refers to the key you press to receive on-line advice on how to use FLEX as you are working with it.

HIDDEN This is a FLEX term which refers to the columns displayed using the show option. If your lexicon contains so many columns that not all of them can be displayed at once on screen, then you can indicate that certain columns should temporarily be left out of the display, so that you can see other columns of more interest. The columns left out are called hidden columns.

HIERARCHICAL SEGMENTATION This is one type of derivational or compositional morphological analysis. It reduces a lemma to its constituent morphemes, showing all the intermediate levels of analysis involved in arriving at all the morphemes. Contrast with flat segmentation.

IMMEDIATE SEGMENTATION This is one type of derivational or compositional morphological analysis. It reduces a lemma to its next biggest components: other lemmas, affixes or morphemes. To arrive at complete segmentation, immediate segmentation may have to be carried out several times.
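
The three kinds of segmentation can be related to one another with a small Python sketch. Here a hierarchical analysis is represented as nested tuples, which is an illustrative data structure rather than the CELEX storage format, and the example analysis is invented.

```python
def immediate(analysis):
    """One level of analysis: the next biggest components only."""
    return list(analysis) if isinstance(analysis, tuple) else [analysis]

def flat(analysis):
    """Repeat immediate segmentation until only morphemes remain."""
    if not isinstance(analysis, tuple):
        return [analysis]
    morphemes = []
    for part in analysis:
        morphemes.extend(flat(part))
    return morphemes

# Hypothetical hierarchical analysis with one intermediate level.
tree = (("un", "happy"), "ness")
print(immediate(tree))  # [('un', 'happy'), 'ness']
print(flat(tree))       # ['un', 'happy', 'ness']
```

The nested tuple itself plays the role of the hierarchical segmentation, since it keeps every intermediate level that the flat list discards.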

INDEX This is a database term which refers to columns whose contents are indexed in a way conceptually identical to the indexing of a book. Information from columns with an index can be looked up more quickly by the DBMS.

INL This is the normal abbreviation for Instituut voor Nederlandse Lexicologie, the Dutch Lexicography Institute in Leiden. They are developing a large text corpus of modern written Dutch, and the frequency information contained in the CELEX Dutch database was extracted from this corpus when it contained over �� million words. It is still being extended, and now contains more than �� million words.

INTEGRITY A term which refers to the protection of information stored in a database when it can be altered by two or more sources. A database maintains its integrity so long as only one source can alter the data at any one time. If two people try to alter the same data at the same time, the resulting information is no longer consistent, and the integrity of the database is lost.

INTERVAL This is a FLEX option which allows you to specify a particular set of consecutive rows in your lexicon for export.

IPA These letters stand for International Phonetic Alphabet, the set of written characters approved for phonetic transcription by the International Phonetic Association.

ISO This is the short name of the International Organization for Standardization, the Swiss-based organization which is involved in developing and coordinating worldwide standards.

LAN These letters stand for local area network, and refer to a communications network which links a number of computers over a relatively small area, such as a factory plant or university.

LAT These letters stand for local area transport, and refer to the protocols a DEC terminal server uses to communicate with computers using VAX/VMS over an ethernet.

LANGUAGE CODES These are codes used in the English database to provide background information about some lemmas, such as the national origin of words loaned from other languages and whether certain lemmas are more likely to be British or American English.

LEMMA A term intended to signify the abstract notion which underlies a family of inflected forms, so that, for example, walk could be the lemma underlying the verbal forms walk, walks, walked, and walking. In the CELEX databases, lemmas are distinguished on the basis of (1) the pronunciation, (2) the syntactic class, (3) the morphological structure, (4) the orthographic form of their various wordforms, as well as (5) the full inflectional paradigm of the lemma. No explicit consideration of meaning is involved, so in the CELEX databases, the lemmas of any two (or more) words which differ in meaning but which otherwise are identical in each of these five ways are reduced to one lemma. In principle, any convenient form could be used to represent a lemma: an abstract form, or even a number. In practice, CELEX uses two forms: the headword and the stem.

LEVEL This refers to any one analytical step in morphological analysis; complete segmentation is finished when every possible level of analysis has been carried out.

LEXICON This term refers to a subset of one of the CELEX databases which you can define for yourself using FLEX. Rather than using the entire database at all times, you specify certain columns and delimit their contents using restrictions to form a coherent subset of information drawn from the central database.

LEXICON TYPE This is a FLEX term that indicates which of the central CELEX databases the information in a lexicon is drawn from. Each of the central databases has as its main subject one type of canonical form, such as Dutch lemmas or English wordforms. The type of canonical form is then used to indicate the type of lexicon.

LISP This is a high-level programming language often used in artificial intelligence work. In particular, it uses a special brackets notation for its input and output data.

LOCKED This means that FLEX is currently working on your lexicon, and that in order to protect its integrity, you cannot do any more work with it until the job FLEX is doing has finished.

LOGICAL COMBINATION This is a FLEX term which refers to the way restrictions or groups of restrictions linked by brackets work together to delimit the contents of a lexicon, by means of the AND operator, the OR operator, and the NOT operator.

LOGICAL NAME A VAX/VMS term which refers to a specific directory in your account. It is used as part of a file name to help you to remember where it is, and the computer to know how to find it or store it.

LOGIN This refers to the way you identify yourself to the computer before beginning any work. You normally have to give the name of your account and a password.

LOGOUT This refers to the way you indicate to the computer that you want to stop working. On the CELEX machine, you simply type logout.

MAIL This is a FLEX term and a VMS term. In FLEX, it refers to the main menu option mail, which allows you to communicate with other FLEX users purely within FLEX; it does not link in with the other national or international networks. In VMS, there is a more comprehensive mail facility which allows you to send messages to other CELEX computer users, as well as users on other computers via DECNET or DATANET-1.

MENU This is a FLEX term which refers to the boxes displayed on your screen from which you can choose an option that allows you to continue with your work, or a particular type of information. Compare window.

MESSAGE LINE This is a FLEX term which refers to the line immediately above the bottom line of the screen. It displays instructions, error messages and other information to help you as you use FLEX, and whenever CELEX computer system messages are sent to your terminal, they are also displayed here.

MODEM This is an acronym for the words modulator and demodulator. It is a machine which converts the characters from your computer (a digital bit stream) into a form (an analog signal) that can be transmitted along a telephone line; this is modulation. It can also convert the analog signal received down a telephone line back into the digital bit stream used in your computer; this is demodulation. Thus you can use telephone lines to work interactively with a computer that might be located hundreds of miles away, provided that you have a terminal and a modem, and the remote computer is also linked to a modem.

NEXT KEY This is a FLEX term which refers to the key you press to display more information in a window or menu.

NOT OPERATOR The logical connective applied to one restriction or group of restrictions z in such a way that a row is included in the lexicon if z is untrue. If z is true, the row is not included in the lexicon.

ON VIEW This is a FLEX term which is important for columns that are used in the construction of restrictions. If a column is on view, you can see it when you display your lexicon using the show or export options. If it is not on view, you never see it, but it still works in any restriction you have made with it. All columns are on view by default; you can change this in the edit restrictions menu.

OPERATING SYSTEM This refers to the software which you use specifically to control a computer or a computer system. The commands you type to start a program running or to give a directory listing are operating system commands. The CELEX computers use VAX/VMS.

OPERATOR This is a FLEX term which refers to the simple mathematical relation symbols that you can use in restrictions.

OR OPERATOR The logical connective combining two restrictions (or groups of restrictions) x and y in such a way that a row is included in the lexicon if (i) either x or y is true for that row, or (ii) both x and y are true for that row; otherwise the row is not included in the lexicon.
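
The way restrictions combine can be sketched in Python, treating each restriction as a function that is true or false for a row. This is not FLEX syntax: the column names, the rows, and the helper names restriction, AND, OR and NOT are all hypothetical.

```python
def restriction(column, op, value):
    """Build a test 'column op value' for one row (a dict)."""
    ops = {
        "=": lambda a, b: a == b,
        "<": lambda a, b: a < b,
        ">": lambda a, b: a > b,
    }
    test = ops[op]
    return lambda row: test(row[column], value)

def AND(x, y):
    return lambda row: x(row) and y(row)

def OR(x, y):
    return lambda row: x(row) or y(row)  # true if either or both are true

def NOT(z):
    return lambda row: not z(row)

rows = [  # invented rows
    {"Word": "walk", "Class": "V", "Freq": 120},
    {"Word": "radio", "Class": "N", "Freq": 40},
    {"Word": "bottle", "Class": "N", "Freq": 5},
]
keep = OR(restriction("Class", "=", "V"),
          AND(restriction("Class", "=", "N"),
              NOT(restriction("Freq", "<", 10))))
lexicon = [r["Word"] for r in rows if keep(r)]
print(lexicon)  # ['walk', 'radio']
```

The nesting of the function calls plays the role that brackets play when restrictions are grouped.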

PAD These letters stand for packet assembler/disassembler, a device (or program) which gathers individual characters that you send from your terminal or computer and puts them into groups (that is, packets) which can then be sent across a PSDN to some other computer. Likewise when packets come back to your computer across the PSDN, the PAD splits them up into individual characters again, ready for display on your terminal.
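
The assembling and disassembling described here can be illustrated with a deliberately simplified Python sketch. The packet size of four characters is arbitrary, and real packets on a PSDN also carry addressing and control information that is ignored here.

```python
def assemble(chars, size=4):
    """Group a stream of characters into packets of at most `size`."""
    return [chars[i:i + size] for i in range(0, len(chars), size)]

def disassemble(packets):
    """Split packets back into individual characters, in order."""
    return [ch for packet in packets for ch in packet]

packets = assemble("logout")
print(packets)                       # ['logo', 'ut']
print("".join(disassemble(packets)))  # logout
```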

PAGE This is a FLEX term that refers to data displayed in the show window: there is room for ten lines of information on screen, and one page is equal to these ten lines.

PARTIALLY SYLLABIFIED This refers to orthographic transcriptions which indicate each syllable boundary within a word by means of a hyphen, with the exception of syllables at the beginning or end of the word which consist of only one letter; such syllables are not marked. Compare with fully syllabified.
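
The difference between the two kinds of syllabification can be shown by deriving a partially syllabified form from a fully syllabified one. The Python sketch below assumes that the fully syllabified form also uses a hyphen as its syllable marker, and the example syllabifications are invented rather than taken from the database.

```python
def partially_syllabified(fully, marker="-"):
    """Drop the marker next to a one-letter first or last syllable."""
    syllables = fully.split(marker)
    if len(syllables) > 1 and len(syllables[0]) == 1:
        syllables[:2] = [syllables[0] + syllables[1]]
    if len(syllables) > 1 and len(syllables[-1]) == 1:
        syllables[-2:] = [syllables[-2] + syllables[-1]]
    return marker.join(syllables)

print(partially_syllabified("a-bout"))   # about
print(partially_syllabified("ra-di-o"))  # ra-dio
print(partially_syllabified("wa-ter"))   # wa-ter
```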

PENDING This is a FLEX term which means that a batch job cannot be executed by the computer at the moment, usually because other batch jobs are being carried out. A job which is pending will eventually be carried out, however, unless you cancel it.

PREV KEY This is a FLEX term which refers to the key you press to re-display old information that you have already seen in the window or menu you are currently working in.

PSDN These letters stand for packet switching data network, which is a wide area network that can control the rapid transmission of packets of data, possibly prepared by a PAD, for example, between different points in the network. PSDNs enable you to work interactively on a computer which is located hundreds of miles away. In the Netherlands, the public X.25 PSDN is called DATANET-1, and it is currently used in the implementation of SURFNET.

PSI These letters stand for packetnet system interface, the VAX/VMS software product that enables VAX computers to link up with PSDNs. It performs the function of a PAD.

PSS These letters stand for packet switch stream, the name of the British X.25 PSDN.

QUERY This is a FLEX term that refers to the show menu option that allows you to look at a particular part of your lexicon. It does not permanently alter your lexicon.

REDRAW This is a FLEX term that refers to the key which you press to re-display all the FLEX information currently displayed on screen. It allows you to correct any badly-drawn lines or get rid of unwanted messages or stray characters.

RESTRICTION This is a FLEX term which refers to a simple logical statement you formulate to specify in detail the information to be included in your lexicon, with reference to the contents of the columns already in your lexicon.

ROW A database term which refers to the storage of different types of information which refer to one word: each row contains an orthographic transcription, a phonetic transcription, a morphological analysis, a syntactic code and a frequency count (and more besides) for each word.

SAM-PA These letters stand for Speech Assessment Methods Phonetic Alphabet. SAM is an Esprit (European Community funded) project. The development of the phonetic alphabet was co-ordinated by John Wells with the intention of it becoming the standard European computer phonetic alphabet.

SEGMENTATION This is a term which refers to the process of morphological analysis of words into their constituent lemmas, affixes and morphemes.

SQL*PLUS This is the name of the standard DBMS produced by the ORACLE company. It is the DBMS used by CELEX; when you work with FLEX, you are using a system which generates SQL*PLUS code to access the CELEX databases.

STATUS LINE This is a FLEX term which refers to the very bottom line of the screen. It displays your FLEX username, the name of the lexicon you have selected, and the version number of the FLEX program you are using.

STEM A term which refers to one of the two forms a lemma is given in the CELEX Dutch database, and the term used in place of headword in English morphology. It is that part of a lemma's inflectional paradigm which is common to all the inflected forms, separate from the inflectional affixes themselves. Usually, it is identical to the headword except for Dutch verbs, where it takes the form of the first person singular, present tense (but see also abstract stem). In English morphology, a stem is a headword, or sometimes a flectional form of a headword.

STRESS PATTERN This refers to special strings of numbers, each of which represents one phonetic syllable and indicates how that syllable is stressed. A zero always means 'unstressed'; a '1' indicates 'stressed' in stress patterns for Dutch words and 'primary stress' for English words; '2' indicates 'secondary stress' for English words.
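
A stress pattern can be decoded syllable by syllable, as in the Python sketch below. It assumes that the digits are 0, 1 and 2 with the meanings just described; the pattern strings used are invented examples.

```python
LABELS_ENGLISH = {"0": "unstressed", "1": "primary stress", "2": "secondary stress"}
LABELS_DUTCH = {"0": "unstressed", "1": "stressed"}

def describe_stress(pattern, labels=LABELS_ENGLISH):
    """Return one label per syllable; an unknown digit raises KeyError."""
    return [labels[digit] for digit in pattern]

print(describe_stress("201"))
print(describe_stress("10", LABELS_DUTCH))
```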

SURFNET This is the Dutch national academic computer network which provides electronic mail facilities and logins to computers all over the Netherlands. At present, it uses DATANET-1 to carry out its work.

SYLLABIC CONSONANT This term refers to a consonant which by itself or with other consonants forms a distinct syllable in the pronunciation of a word, without the presence of a vowel. The final l in the word bottle can be realised as a syllabic consonant.

TERMINAL This is a device which can accept data from and transmit data to a computer. For most people a terminal is a visual display unit (VDU for short), which consists of a television-like screen to display data received, and a keyboard to transmit data, including operating system commands. There are many types of terminal, all with their own specific control codes and capabilities.

TERMINAL EMULATOR This is a type of software which allows your personal computer or terminal to behave and respond like another sort of terminal.

TERMINAL SERVER A device that connects terminals (and modems and printers) to an ethernet.

VAX/VMS The trademark used by the Digital Equipment Corporation (DEC) to identify the operating system used on their VAX series computers. VAX stands for Virtual Address eXtension, and VMS stands for Virtual Memory System.

VERSION This is a FLEX term that refers to the way your lexicon is stored. If it is a draft lexicon, only the definition is stored, and when you use it, the data it requires is looked up in the main CELEX database. If it is a fixed lexicon, it is a separate, probably much smaller database which is quicker and easier to use.

VT��� This refers to a standard DEC type of terminal. Users who have such a terminal, or who have a terminal emulator which can imitate such a terminal, should be able to log into CELEX and use FLEX with no problems.

VT��� This refers to a standard DEC type of terminal which is newer than the VT���. It is the default terminal type for CELEX and FLEX.

WAN These letters stand for wide area network and refer to a communications network which links a number of computers over a relatively large area. Sometimes these networks cover entire nations (such as SURFNET in the Netherlands) or even larger areas (such as EARN, the European academic network).

WILDCARD This refers to the � and � characters which can be used in a restriction or query to indicate respectively 'any character or group of characters' and 'any single character'.
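
Wildcard matching of this kind can be sketched in Python by translating an expression into a regular expression. The two wildcard characters are passed in as parameters; the defaults `*` and `?` are chosen purely for illustration and are not necessarily the characters FLEX uses. The sketch also assumes that the 'group of characters' wildcard may match nothing at all.

```python
import re

def wildcard_to_regex(expr, many="*", single="?"):
    """Translate a wildcard expression into a compiled regular expression."""
    parts = []
    for ch in expr:
        if ch == many:
            parts.append(".*")   # any group of characters (possibly empty)
        elif ch == single:
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matches(expr, word, **wildcards):
    """True if the whole word fits the wildcard expression."""
    return wildcard_to_regex(expr, **wildcards).fullmatch(word) is not None

print(matches("walk*", "walking"))  # True
print(matches("w?lk", "walk"))      # True
print(matches("w?lk", "walks"))     # False
```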

WINDOW This is a FLEX term which refers to the boxes shown on your screen which contain either menu options, data drawn from the database, or other relevant FLEX information. A window which contains options is almost always called simply a menu.

WORDFORM A term which is synonymous with word in the general sense. Wordforms are the units occurring in natural language, which, when written, are bounded on either side by a space, and which can be associated with a lemma. (However, some English and Dutch wordforms include spaces: swimming pool, for example, or Nederlandse Spoorwegen.) They are the inflected forms in regular use, as opposed to lemmas, stems, and headwords, which are convenient, but abstract, representations of complete families of wordforms.

X.25 This refers to the standard protocols recommended by the Comité Consultatif International Télégraphique et Téléphonique for equipment operating within a PSDN.

X.29 This refers to the standard procedures recommended by the Comité Consultatif International Télégraphique et Téléphonique for the exchange of user data and the required control information between your terminal and a remote PAD over a PSDN.
