A statistical estimate of infant and toddler vocabulary size from CDI analysis


Julien Mayor1,2 and Kim Plunkett2

1. Basque Center on Cognition, Brain, and Language (BCBL), Donostia, Spain
2. Department of Experimental Psychology, University of Oxford, UK

Abstract

For the last 20 years, developmental psychologists have measured the variability in lexical development of infants and toddlers using the MacArthur-Bates Communicative Development Inventories (CDIs) – the most widely used parental report forms for assessing language and communication skills in infants and toddlers. We show that CDI reports can serve as a basis for estimating infants’ and toddlers’ total vocabulary sizes, beyond serving as a tool for assessing their language development relative to other infants and toddlers. We investigate the link between estimated total vocabulary size and raw CDI scores from a mathematical perspective, using both single developmental trajectories and population data. The method capitalizes on robust regularities, such as the overlap of individual vocabularies observed across infants and toddlers, and takes into account both shared knowledge and idiosyncratic knowledge. This statistical approach enables researchers to approximate the total vocabulary size of an infant or a toddler, based on her raw MacArthur-Bates CDI score. Using the model, we propose new normative data for productive and receptive vocabulary in early childhood, as well as a tabulation that relates individual CDI measures to realistic lexical estimates. The correction required to estimate total vocabulary is non-linear, with a far greater impact at older ages and higher CDI scores. Therefore, we suggest that correlations of developmental indices to language skills should be made to vocabulary size as estimated by the model rather than to raw CDI scores.

Introduction

How many words does an infant know? This question is central in developmental psychology, as researchers often evaluate different aspects of an infant’s cognitive development in the context of her developing vocabulary skills (Thal, Marchman, Stiles, Aram, Trauner, Nass & Bates, 1991; Werker, Fennell, Corcoran & Stager, 2002). Traditionally, this question has been answered by counting the number of different words an infant produces within a representative period of time (Nice, 1926; Carey, 1978). Diary methods and home-based recordings provide an estimate of the infant’s productive vocabulary. Such direct measures face a dilemma: short recordings lead to a sub-sampling of the infant’s total vocabulary knowledge, as infants do not use their entire lexicon every day, and longer recordings are time-consuming and expensive strategies for assessing individual vocabulary sizes. A further important limitation of these approaches is that they only provide a measure of productive vocabulary, which may inadequately reflect an infant’s receptive vocabulary knowledge: typically, comprehension precedes production in infancy (Fenson, Marchman, Thal, Dale, Reznick & Bates, 2007).

As an alternative to diary methods and home-based recordings, parental reports offer a rich source for evaluating infants’ vocabulary knowledge. Parents can be reliable judges of whether their infant comprehends and/or produces a given word (Ring & Fenson, 2000; Fenson et al., 2007; Styles & Plunkett, 2009) when asked to fill in questionnaires which contain words likely to be known by their infant, providing valuable information about the status of an infant’s lexicon on the day the form is completed. Furthermore, the time taken to complete a questionnaire is relatively short compared to diary methods and recordings, and so is an efficient and inexpensive means of assessment. Many researchers assessing infants’ or toddlers’ lexical development now prefer this method and rely on the MacArthur-Bates Communicative Development Inventories (CDIs),¹ Words and Gestures (referred to hereafter as CDI-WG, for infants) or Words and Sentences (CDI-WS, for toddlers) (Fenson et al., 2007). The MacArthur-Bates CDI-WG and CDI-WS include a list of words frequently encountered by infants and toddlers, respectively. Both the CDI-WG and the CDI-WS contain a section in which the caregiver is asked to indicate whether each word on the checklist is understood (comprehension) and/or said (production) by the infant or toddler. Compilation of CDI reports into a database, the LEX2005 database, from 1800 infants and toddlers has enabled Dale and Fenson (1996) to produce month-by-month norms for comprehension and production data from the MacArthur-Bates CDIs. These norms provide an index for comparing an infant’s or toddler’s vocabulary with other infants of the same age, enabling child language researchers and others to determine whether an individual infant or toddler is advanced or delayed in her vocabulary development.

¹ Throughout the manuscript, we will use the term ‘CDIs’ when speaking of both the MacArthur-Bates CDI, Words and Gestures, and Words and Sentences. The term ‘CDI score’ will refer to the number of words reported to be known (in comprehension) and/or produced (in production) on the appropriate CDI. Note that, in principle, the method developed in the manuscript can be applied to adaptations of the MacArthur-Bates CDI in other languages (see Appendix 2 for details).

Address for correspondence: Julien Mayor, Basque Center on Cognition, Brain, and Language (BCBL), Donostia, Spain; e-mail: [email protected] or Kim Plunkett, Department of Experimental Psychology, University of Oxford, UK; e-mail: [email protected]

Developmental Science 14:4 (2011), pp 769–785 DOI: 10.1111/j.1467-7687.2010.01024.x

© 2010 Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

Researchers have used CDI scores as an index of total
vocabulary size by examining whether different aspects of cognitive development correlate to raw CDI scores. For example, Bates and Goodman (1997) used raw CDI scores at 20 months to predict grammatical development at 28 months, using the CDI score as an index for a toddler’s total vocabulary size. Swingley and Aslin (2000) correlated response latencies with vocabulary size as indexed by CDI scores, and Werker et al. (2002) ‘correlated productive vocabulary size, as measured by the CDI ... to performance’ (p. 11). In all these cases, CDI scores were used with the underlying assumption that they could be used as an index for total vocabulary size. This assumption is valid just so long as the index of vocabulary size provided by the CDI scores behaves like the total vocabulary size when calculating correlational effects.

Unfortunately, this further assumption does not
always hold. As we shall see, the outcome of correlational analyses of developmental indices to CDI scores and to total vocabulary size may differ if there is a non-linear relationship between total vocabulary size and CDI score. Hence, it is unwise to rely on the raw CDI score as an index of total vocabulary size when the goal is to evaluate the correlation of vocabulary size with some other measure. The index of vocabulary development provided by the raw CDI score may underestimate the correlation with other measures and perhaps lead to conclusions of independence that are inappropriate. An estimate of total vocabulary size offers a more powerful metric for researchers interested in relating vocabulary development to other indices of infant development.

We will demonstrate that raw MacArthur-Bates CDI
scores can nevertheless be used to provide an estimate of total vocabulary size and, furthermore, that CDI scores provide suitable input to the model for estimating total vocabulary size. This estimate can then feature as a factor in correlational analyses involving vocabulary development so that transformed CDI scores can be used to calculate the correlation between vocabulary development
and other indices of development. The validity of performing correlations of other cognitive development indices to raw CDI scores, instead of performing the correlation to their total vocabulary size, will also be discussed.

The originators of the MacArthur-Bates CDIs have
strived to include a representative sample of words that infants know at different ages. However, the CDI is not intended to be an exhaustive listing of all the words that any infant might know. For example, when vocabulary size reaches a significant proportion of the CDI listing, the likelihood that infants would know words that are not listed on the CDI increases dramatically: ‘Although the present index might approach the status of an atlas for the younger children, it becomes an increasingly smaller subset of vocabulary for older children’ (Fenson, Dale, Reznick, Thal, Bates, Hartung, Pethick & Reilly, 1993, p. 40). A simple vocabulary count based on a CDI is therefore unlikely to be an accurate estimate of total vocabulary size for older infants. In order to test whether the MacArthur-Bates CDI can be used as a measure of total productive vocabulary size, Robinson and Mervis (1999) compared the CDI-based productive vocabulary estimate with an exhaustive diary report based on data from a single child. The discrepancy in CDI and diary scores increased dramatically with age, thereby confirming Fenson et al.’s (1993) concerns. Similarly, Roy, Franck and Roy (2009) reported a much higher productive vocabulary at 24 months of age as assessed by dense recordings than the CDI score would predict. Such findings would appear to undermine the utility of the CDI in providing an estimate of the total number of words an individual infant knows and/or produces. In particular, the CDI would lead to a systematic underestimate of infant total vocabulary knowledge as, by construction, the stratified selection of words made in the CDI would leave many words out of a list of the most frequent words encountered by infants of a given age.

The goal of this paper is to demonstrate that CDIs are
not only an accurate tool for assessing relative development of infants’ and toddlers’ language skills, but that CDI scores can also be used as the basis for an estimate of total vocabulary sizes, both in comprehension and production. This new estimate takes into account both the idiosyncrasies of an infant’s individual vocabulary and frequent words that have not been included in the CDI and aims at providing an estimate of the total vocabulary size of an infant, even when her CDI score reaches a substantial fraction of the CDI size. After demonstrating how raw CDI scores can be converted to an accurate estimate of total vocabulary size, we provide in an appendix look-up tables that indicate an estimate of the total vocabulary size given the number of words reported on the MacArthur-Bates CDIs (CDI-WG and CDI-WS), as well as new estimates of normative data for productive and receptive vocabulary in early childhood.

Our analyses reveal a non-linear transformation from
CDI scores to total vocabulary size such that corrections
of higher CDI scores and at older ages have a far greater impact than at young ages or lower CDI scores. Failure to take account of these non-linearities can undermine attempts to identify correlations between vocabulary scores as measured by the CDI and other aspects of cognitive, linguistic, social or emotional development. Therefore, we advocate that attempts to identify correlations with such developmental indices be made to vocabulary size as estimated by the model rather than to raw CDI scores.
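The concern about non-linearity can be made concrete with a small numerical sketch. It is purely illustrative: the vocabulary sizes, the developmental index and the saturating checklist curve below are hypothetical and are not the mapping derived in this paper. The sketch only shows that when a checklist score saturates near its ceiling, its Pearson correlation with a measure that is linear in total vocabulary falls below the correlation obtained with total vocabulary itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical total vocabulary sizes for six children (assumption).
total_vocab = [50, 150, 400, 800, 1500, 2500]

# Hypothetical developmental index, linear in total vocabulary (assumption).
other_index = [0.01 * v + 2.0 for v in total_vocab]

# Hypothetical saturating checklist score with a ceiling of 680 words.
# This curve is an illustrative stand-in, NOT the paper's model.
CDI_SIZE = 680
raw_cdi = [CDI_SIZE * (1.0 - math.exp(-v / 600.0)) for v in total_vocab]

r_total = pearson(total_vocab, other_index)  # perfect by construction
r_raw = pearson(raw_cdi, other_index)        # attenuated by the ceiling

print(f"correlation with total vocabulary: {r_total:.3f}")
print(f"correlation with raw checklist score: {r_raw:.3f}")
```

Under these assumptions the raw score still correlates positively with the index, but less strongly than total vocabulary does, which is the kind of attenuation the argument above warns about.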

Mapping raw MacArthur-Bates CDI scores to vocabulary size

Overview of the strategy

MacArthur-Bates CDI forms have been compiled into a database, the LEX2005 database (Dale & Fenson, 1996), so that it is possible to compute the proportion of infants at a given age that are reported to understand and/or produce any given word on the CDI. In other words, we can determine the probability that a given word is understood and/or produced by infants at a given age. The average of individual CDI scores for all infants is equivalent to computing the sum, for all words w_i on the CDI, of the probabilities p(w_i) that words w_i are known to an infant. A mean CDI score (Voc_est) can be computed as in Equation 1:

$$\mathrm{Voc}_{\mathrm{est}} = \sum_{i=1}^{W} p(w_i) = \frac{1}{N} \sum_{j=1}^{N} \mathrm{voc}(j) \qquad (1)$$

where W is the number of words on the CDI, voc(j) measures the CDI score of infant j, and N represents the number of infants (see Appendix 2 for a full mathematical description of the procedure).

The advantage of this ‘by items’ analysis is that the
estimate of the probability that an infant knows a specific word approaches the ‘exact’ value as the number of infants increases, assuming that caregivers respond accurately. Any inaccuracy in the estimate of total vocabulary size is now the outcome of having a calculation that runs over words on the CDI, rather than all the words in the language. That is, for a word not included in the CDI we lack information regarding the probability that it is known by an infant. The task of estimating total vocabulary size from CDI scores now translates to estimating the fraction of infants that know words that are not listed on the CDI.

We distinguish two sources of underestimation of an
infant’s vocabulary size. First, individual infants’ lexicons are only partly overlapping. For example, an infant whose parent is a car mechanic is likely to possess an early knowledge about car-related words, otherwise rare among other infants. Such idiosyncratic words cannot be listed in a CDI, despite their contribution to overall lexicon size, because they would greatly inflate the size of
CDIs and the time taken to complete the form. The second source of underestimation derives from frequent words in the language that are not listed in the CDI. CDIs have been compiled by listing commonly used words in the infant’s vocabulary at different ages. In so doing, the stratified structure of the CDI will mean that some relatively frequent words are not listed. In the next section, we describe how to evaluate both effects in order to provide an accurate description of the lexicon size based on CDI reports.
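The equivalence stated in Equation 1 can be checked on a toy data set. The sketch below uses a small hypothetical matrix of caregiver reports (four infants, five checklist words); the values are invented for illustration and do not come from the LEX2005 database.

```python
# Hypothetical checklist data (assumption): one row per infant, one column
# per CDI word; 1 means the caregiver reported the word as known.
reports = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0],
]
N = len(reports)      # number of infants
W = len(reports[0])   # number of words on the checklist

# 'By infants': voc(j) is the CDI score of infant j, averaged over infants.
voc = [sum(row) for row in reports]
mean_by_infants = sum(voc) / N

# 'By items': p(w_i) is the proportion of infants who know word i,
# summed over all words on the checklist.
p = [sum(row[i] for row in reports) / N for i in range(W)]
sum_by_items = sum(p)

print(mean_by_infants, sum_by_items)
```

Both routes sum the same table of ones and zeros, once row by row and once column by column, so the two quantities agree for any report matrix.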

The first correction and second correction: a graphical illustration

Suppose words are initially sorted in descending order according to the proportion of infants that know the word as reported on the CDI. We then plot the probability that infants know a given word against its rank on that list. For example, a word that is known by the vast majority of infants (‘daddy’) will be ranked high on the list, and a word known by only a small fraction of infants will have a low rank. Figure 1(a) plots a hypothetical distribution of word rankings and the probability that a word is known to a given infant. This probability distribution is a monotonically decreasing function where words of low probability occur in the tail of the distribution and are not listed on the CDI. Idiosyncratic words – accounted for by the first correction – correspond to words that are only known to a small minority of infants. These are words that occur in the tail of the distribution. The size of this underestimate thereby corresponds to estimating the length of the tail, as depicted in Figure 1(b).

Commonly occurring words absent from the CDI – the
second source of underestimation – change the shape of the probability distribution as shown in Figure 1(c). Increasing the number of commonly known words inflates the distribution but maintains the monotonic decrease. Quantification of the length of the tail and estimation of the parameters for the correct shape of the probability distribution permits a more accurate estimate of an infant’s vocabulary size. The total vocabulary size therefore corresponds to the CDI score plus an estimate of the size of the shaded areas in Figures 1(b) and 1(c).

The model will be developed in two steps. First,
parameters describing the distribution of word knowledge will be evaluated for the different age groups on the LEX2005 database. The evaluation of the number of idiosyncratic words infants know will enable us to provide the first correction to the raw CDI score. Second, the estimation of the number of words left out of the stratified selection of words included in CDIs will be done by analysing individual trajectories for which we possess both a high-density recording and its corresponding CDI score. Exhaustive comparison of these two measures will constrain the model estimate and allow for the determination of the mapping function between MacArthur-Bates CDI scores and total vocabulary sizes. A formal
presentation of the mathematical procedure for estimating total vocabularies from CDI scores is given in Appendix 2.
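The graphical picture of the first correction in Figure 1(a) and 1(b) can be sketched numerically. The word list and probabilities below are hypothetical, and the sketch assumes, for illustration only, that the checklist contains exactly the W highest-ranked words; the second correction, which relaxes that assumption, is not modelled here.

```python
# Hypothetical proportion of infants knowing each word (assumption).
known_fraction = {
    "daddy": 0.98, "mommy": 0.97, "bye": 0.90, "ball": 0.75, "dog": 0.70,
    "shoe": 0.55, "spoon": 0.40, "tractor": 0.20, "wrench": 0.05,
    "gasket": 0.02,
}

# Sort words in descending order of the proportion of infants knowing them.
ranked = sorted(known_fraction.items(), key=lambda kv: kv[1], reverse=True)
probs = [fraction for _, fraction in ranked]

W = 7  # hypothetical checklist size: only the first W ranks are listed

cdi_area = sum(probs[:W])         # area left of the cut-off: mean CDI score
total_area = sum(probs)           # area under the whole curve: mean vocabulary
tail_area = total_area - cdi_area  # idiosyncratic words missed by the list

print(f"mean CDI score: {cdi_area:.2f}")
print(f"mean total vocabulary: {total_area:.2f}")
print(f"missed by the checklist: {tail_area:.2f}")
```

In real data the probabilities beyond rank W are unknown, which is why the tail has to be estimated from a fitted curve rather than summed directly as it is here.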

An overview of the first correction: adding idiosyncratic words to the lexicon

After sorting words in the LEX2005 database for a given age group, according to the proportion of infants that know the words, we model this distribution of word knowledge using a standard sigmoid function that describes the probability that a word is known given its rank among other words. The sigmoid function provides an intuitive fit of this distribution with values close to 100% for highly ranked words (very common words, known by every infant) and values closer to 0% for low ranked words, known to only a very small subset of the population. Furthermore, sigmoid functions have only two free parameters, referred to as a and b (see Equation 3 in Appendix 2). The first of these parameters, a, determines the location along the x-axis of the non-linearity in a sigmoidal curve. In our case, this first free parameter determines the number of words that are known to 50% of the infants; it can be seen as an index of overall vocabulary size (see Figure 2(a) and 2(b)). The second parameter, b, determines the steepness of the non-linearity in the sigmoidal curve. In the present model, this second free parameter determines the overlap of word knowledge across the population of infants at a given age. A very low value for b corresponds to a steep probability distribution, whereas a high value yields a shallow distribution (see Figure 2(c)). Shallow distributions correspond to low overlap of individual vocabularies, whereas low values correspond to high overlap (see Figure 2(d)).

We have identified idiosyncratic words as occupying
the tail of this probability distribution. They are the words of lowest rank. However, the graphical analysis shows that the size of the tail is influenced by both the free parameters a and b. Fortunately, the sigmoid function possesses another useful property: once the values of the parameters a and b are determined, the overall shape of the probability distribution is known and so we can determine the size of the tail. Thus, quantification of a and b allows us to determine the value of the first correction.

Figure 1 (a) General shape of the probability distribution that infants know a word given its rank. The area below the curve corresponds to the total vocabulary size. (b) Cut-off for idiosyncratic words not included in the CDI (first correction). The area below the curve on the left of the dashed line corresponds to the CDI score. (c) Inflation of the probability distribution to include frequent words not included in the CDI (second correction). [Figure not reproduced. Axes: Rank (x) against Percentage of infants knowing the word (y); the cut-off is marked W (CDI size).]

In principle, the parameters a and b can vary independently from one age to another. For example, young infants may have high overlap in vocabulary (small b) whereas older infants have less overlap in vocabulary (higher b), or indeed vice versa. Consequently, we must quantify the values of parameters a and b across the entire age range if we are to estimate the first correction accurately. Luckily, we will show that the degree of overlap between individual vocabularies remains approximately constant over the age range considered. The fact that the overlap parameter b does not vary much with age has important consequences. Graphically, it means that the shape of the sigmoid curve (the probability distribution of word knowledge) does not change with age, and only undergoes a shift to larger lexicons (parameter a) as age increases. The constancy of
parameter b across ages allows us to derive a unique mapping from CDI scores to the vocabulary estimate after applying the first correction, independently from age, because only parameter a varies with age in the model. As demonstrated later, the second correction will only modulate this unique mapping.

Figure 3 depicts the first correction for different values
of parameter b, when we assume that parameter b does not change with age. The curves are obtained by measuring the total area below the sigmoidal curve in Figure 1(b) as a function of the area on the left of the dashed line, corresponding to the CDI score. Note that if b is constant, there is a one-to-one correspondence between CDI scores and parameter a. In the case of a perfect overlap (b = 0, Figure 2(c)), all infants know the same set of words and no infant knows any other word. In this extreme case of perfectly overlapping vocabularies, there are no idiosyncratic words and the first correction is redundant, and parameter a is equal to the CDI score. However, when infants reach the maximal CDI score (680), the total vocabulary size cannot be determined, as they may know any number of words that are not on the CDI. This is equivalent to attempting to estimate the number of rabbits in a forest by counting the number of traps that are full after a given period. If all traps are occupied, the only knowledge we have gained is that there are at least as many rabbits as traps.

Figure 2 (a) As parameter a increases, words that were in the tail of the distribution join the main body. Hence, increasing a leads to larger vocabularies that include the earlier vocabularies – panel (b). (c) As parameter b increases, the likelihood that a word of high rank (say 5) is known by all infants decreases. Thus, there is less overlap in vocabulary knowledge across the population of infants – panel (d). [Figure not reproduced. Axes: Rank (x) against Percentage of infants knowing the word (y); curves are shown for b = 0, b = 100 and b = 300.]

Figure 3 Different overlap parameters b lead to different non-linear corrections from CDI scores (first correction). The curves are obtained by plotting the total area below the curve in Figure 1(b) as a function of the area on the left of the dashed line, for different positions of the curve with respect to the dashed line. [Figure not reproduced. Axes: CDI score (x) against First correction of vocabulary size (y); curves are shown for b = 0, b = 100 and b = 300, and the maximal CDI score is marked.]

In the case of partial overlap in vocabularies, CDI
scores still need to be corrected even when smaller than the maximal CDI score. Different overlap parameters lead to different non-linear corrections as shown in Figure 3. As the parameter b increases, i.e. overlap decreases, the required correction increases for any given value of the CDI score. Determination of the overlap parameter b, based on comparison of individual CDI scores, allows us to uniquely define the value of the first correction.
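The construction behind Figure 3 can be sketched in a few lines. The exact sigmoid is given as Equation 3 in Appendix 2, which is not reproduced in this section, so the form used below, p(r) = 1 / (1 + exp((r - a) / b)), is an assumption chosen to match the description of a (the rank known to 50% of infants) and b (the steepness). Ranks are also treated as continuous, so areas are integrals rather than sums. The function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def area_total(a, b):
    """Area under the assumed sigmoid for ranks from 0 to infinity."""
    return b * softplus(a / b)


def area_listed(a, b, W):
    """Area under the assumed sigmoid for ranks 0..W (modelled CDI score)."""
    return b * (softplus(a / b) - softplus((a - W) / b))


def first_correction(cdi_score, b, W=680):
    """Map a CDI score to the total area under the curve for a fixed b > 0.

    Finds, by bisection, the shift a whose listed area equals cdi_score,
    then returns the total area for that a.
    """
    if not 0 < cdi_score < W:
        # At the ceiling the total is undetermined (all traps are full).
        raise ValueError("cdi_score must lie strictly between 0 and W")
    lo, hi = -50.0 * b, W + 50.0 * b
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # area_listed increases monotonically with a.
        if area_listed(mid, b, W) < cdi_score:
            lo = mid
        else:
            hi = mid
    return area_total(0.5 * (lo + hi), b)


for b in (1.0, 100.0, 300.0):
    print(b, round(first_correction(300, b), 1), round(first_correction(600, b), 1))
```

Under these assumptions, a very small b leaves the score essentially unchanged, while larger b values give larger corrections, and the relative correction grows as the score approaches the checklist size. This reproduces the qualitative pattern described for Figure 3, not the paper's tabulated values.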

Application of the first correction to the MacArthur-Bates CDI

The method is applied for both production (16 months to 30 months old, CDI-WS) and comprehension (8 months to 18 months old, CDI-WG) based on the LEX2005 database (Dale & Fenson, 1996). For each age group, words are sorted in descending order from those that are known by most infants to the least known words within each group. A regression is applied to identify the parameters a and b of the sigmoidal function relating word rank to the fraction of infants knowing the word, by minimizing the squared error between the data and the model. Figure 4 displays the probability that infants know a word given its rank on the CDI for data from the CDI-WS (at 20 and 30 months of age) and the corresponding sigmoidal fit. Note that at 20 months of age,
the distribution of word knowledge is concave (itresembles an exponential curve) whereas at 30 months ofage, the same distribution is convex. The good agreementbetween the data and the sigmoidal fits provide anadditional justification for the choice of the type of curvecapturing the distribution of word knowledge across theage range considered; the fits to these two different dis-tributions are in fact the exact same function, onlyshifted to the right for the older age group. The param-eter b, determining the shape of the non-linearity, is thesame at 20 months of age and at 30 months of age (it canbe considered to be age-independent, as we will demon-strate later) and only the parameter a increases with age.Thus, all the distributions of word knowledge for thedifferent age groups on the CDI-WS can be fitted withthe same sigmoidal curve, shifted to the right as ageincreases. All sigmoidal fits of the CDI data explain atleast 80% of the variance, indicating that a regression (seeEquation 3 in Appendix 2) is applicable. The first cor-rection for idiosyncratic words (thereby removing theceiling effect introduced by a limited-size sampling ofwords) of the vocabulary size as predicted by the model isobtained with the measured parameters a and b (seeEquation 4 in Appendix 2). Figure 5(a) depicts a com-parison of the first correction of the vocabulary incomprehension with the CDI data. The model predictsthe first correction to be larger for older age groups(14 months and older) than direct CDI measures. Themodel also indicates that the estimate of vocabulary sizeafter the first correction is smaller than the CDI estimatefor younger age groups (up to 12 months). The discrep-ancy is plotted in Figure 5(b). The negative discrepancyat 8–12 months may be due to an unsuitable equation to

0 5000

20

40

60

80

100

Word Rank

% o

f chi

ldre

n kn

owin

g th

e w

ord

CDI!WS 20mSigmoidal fit

0 5000

20

40

60

80

100

Word Rank

CDI!WS 30mSigmoidal fit

Figure 4 Percentage of toddlers knowing a word as a functionof its rank on the CDI, for 20-month-olds (left panel) and30-month-olds (right panel). The sigmoidal curve fittingexperimental data has exactly the same shape for both agegroups (parameter b is constant) and is only shifted to the rightfor the older age group (a = 609 for 30-month-olds whereasa = 142 for 20-month-olds).

[Figure 5: three panels plotted against age in months (8 to 18): (a) receptive vocabulary size, CDI versus first correction; (b) discrepancy in %; (c) parameters a and b.]

Figure 5 Comprehension: (a) comparison of vocabulary estimates from the CDI and from the model after applying the first correction; (b) discrepancy between the two estimates; (c) evolution of the two free parameters with age.

774 Julien Mayor and Kim Plunkett

© 2010 Blackwell Publishing Ltd.

describe these early words or proto-words (in ranked order: mommy, daddy, bye, peekaboo, bottle, no, hi). Note that this source of discrepancy is not itself a measure of idiosyncratic words in the infant vocabulary, but an indirect effect of estimating the parameters a and b which quantify the idiosyncratic contribution. As expected, the discrepancy between the direct CDI measure and the model estimate after applying the first correction increases with age. At 15–18 months of age, the model suggests that idiosyncratic words would increase vocabulary size by the order of 8–10% (see Figure 5(b)). However, for the ages ranging from 11 to 13 months, the estimate based on a direct count converges with the model.

Both parameters a and b are plotted for the different age groups in Figure 5(c). The decomposition of the vocabulary curve into two parameters allows us to disentangle the contribution to overall vocabulary growth, a, and the overlap of vocabulary knowledge across infants, b. Figure 5(c) shows that parameter a mirrors the overall vocabulary growth, approximating the typical vocabulary size of the infants. Parameter b shows that the amount of overlap across infants in the vocabulary space stays relatively constant and that the number of ‘unique’ idiosyncratic words stays approximately constant over the age range under consideration. A regression of parameter b with age revealed that the hypothesis that b is independent of age cannot be rejected, in comprehension, for infants older than 11 months (CI = [−2.50, 8.43], p = .23). As a consequence, we will use b_WG = 100 (where WG stands for values pertaining to the CDI-WG) for all age groups older than 11 months on the CDI-WG. The constancy of parameter b will enable us to derive a unique mapping from CDI-WG scores to total vocabulary size.

The same procedure is applied to the productive vocabulary of toddlers, from 16 months to 30 months of age, using the LEX2005 database, compiling data from the MacArthur-Bates CDI-WS. Figure 6(a) depicts the development of productive vocabulary according to both the direct CDI measure and the model after applying the first correction. The model again predicts a higher vocabulary size after the first correction than a direct count from the age of 19 months, with a discrepancy that increases with age. Figure 6(b) depicts the discrepancy between the direct CDI estimate and the vocabulary size obtained by the model after the first correction. From about 19 months of age, the contribution of idiosyncratic words to vocabulary size increases steadily to reach about 18% at 30 months of age. This gradual increase with age suggests that the underestimate of productive vocabulary based on a direct CDI count will be even greater for older toddlers. Parameters a and b can also be computed for productive vocabulary, as depicted in Figure 6(c). Parameter a reflects the overall vocabulary size for older age groups (from about 20 months). An analysis of the regression coefficients revealed that the hypothesis that b is independent of age, in production, cannot be rejected for infants older than 20 months (CI = [−1.09, 6.21], p = .15). Parameter b remains essentially constant after 20 months of age, indicating that shared productive vocabulary does not change over time. As a consequence, we will use b_WS = 180 (where WS stands for values pertaining to the CDI-WS) for all age groups older than 20 months on the CDI-WS, enabling us to derive a unique mapping from CDI-WS scores to total vocabulary size.

In summary, we have shown that the sigmoidal function is an appropriate description of the distribution of word knowledge captured by direct CDI scores and that parameter b remains approximately constant for both the CDI-WG and the CDI-WS, over the age range where we attempt to quantify the underestimate in vocabulary size. These findings allow us to derive a unique mapping from raw CDI scores to the first correction of vocabulary size.
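The role of the two parameters can be illustrated with a minimal sketch. It assumes a logistic form for the sigmoid, 100 / (1 + exp((rank − a) / b)); the actual regression is Equation 3 in Appendix 2, which is not reproduced in this section, so the functional form is an assumption. The values a = 142, a = 609 and b = 180 are taken from the text. The checklist length of 680 items and the truncation rank of 20000 are assumptions introduced for the illustration, and reading the first correction as "the tail of the curve beyond the checklist" is one plausible interpretation, not a restatement of Equation 4.

```python
import math

def percent_knowing(rank, a, b):
    """Assumed logistic form: percentage of children knowing the word at `rank`."""
    return 100.0 / (1.0 + math.exp((rank - a) / b))

def expected_words(a, b, max_rank):
    """Expected number of known words among ranks 1..max_rank."""
    return sum(percent_knowing(r, a, b) / 100.0 for r in range(1, max_rank + 1))

B_WS = 180              # overlap parameter for the CDI-WS (from the text)
A_20, A_30 = 142, 609   # position parameter at 20 and 30 months (Figure 4)
CDI_WS_ITEMS = 680      # assumption: number of words on the CDI-WS checklist
TAIL_RANK = 20000       # assumption: truncation standing in for an unbounded tail

# Same b, different a: the 30-month curve is the 20-month curve shifted right.
shift = A_30 - A_20
same_shape = all(
    math.isclose(percent_knowing(r, A_30, B_WS),
                 percent_knowing(r - shift, A_20, B_WS))
    for r in range(1, CDI_WS_ITEMS + 1)
)

# Words expected on the checklist versus words expected once the tail is included.
on_list = expected_words(A_30, B_WS, CDI_WS_ITEMS)
with_tail = expected_words(A_30, B_WS, TAIL_RANK)
discrepancy = (with_tail - on_list) / on_list

print(f"curves identical up to a shift: {same_shape}")
print(f"expected words on the checklist: {on_list:.1f}")
print(f"expected words including the tail: {with_tail:.1f}")
print(f"relative discrepancy: {100 * discrepancy:.1f}%")
```

Under these assumptions, the printed discrepancy can be compared with the value reported for 30-month-olds in Figure 6(b). Because b is fixed, the only quantity that changes with age in this sketch is the horizontal position a, which is why a single mapping from checklist count to tail-corrected count exists.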

Overview of the second correction: filling the gaps in stratified CDIs

Having estimated the parameters a and b from the LEX2005 database, we now attempt to estimate the size of an individual infant’s vocabulary given her CDI score. We have established that for both receptive and productive vocabulary, there is an increase in the magnitude of underestimation as the total vocabulary increases. We have seen that the overlap parameter b remains constant for older age groups. Therefore, we can assume that the magnitude of the underestimation may be calculated using the same value of parameter b for the oldest age groups. We use different values for b in calculating the corrections for the CDI-WG (b_WG = 100) and CDI-WS (b_WS = 180). The constancy of parameter b across

[Figure 6: three panels plotted against age in months (16 to 30): (a) productive vocabulary size, CDI versus first correction; (b) discrepancy in %; (c) parameters a and b.]

Figure 6 Production data: (a) comparison of vocabulary estimates from the CDI and from the model after applying the first correction; (b) discrepancy between the two estimates; (c) evolution of the two free parameters with age.


different age groups determines the uniqueness of the mapping from CDI scores to the first correction, and the shape of its non-linearity. However, the first correction does not estimate the fraction of frequent words that are not included in the stratified selection made when creating the MacArthur-Bates CDIs.²

MacArthur-Bates CDIs have been constructed as a stratified selection of words that are typically acquired at different ages, so as to be sensitive to lexical development. Thus, some highly frequent words are not present on the CDI, even though they are likely to be part of many infants’ lexicons. We investigate the impact of the absence of frequent words in the CDI on the estimated vocabulary size. The number of missing words can only be estimated based on a comparison of a diary count and CDI count for single case studies (Robinson & Mervis, 1999; Roy et al., 2009).

Application of the second correction to the MacArthur-Bates CDI

The fraction of words omitted from the CDI is likely to be smaller among the most frequent words, where an exhaustive list of well-known words can be established relatively easily, compared to less frequent items, where listing all the better-known words is a difficult task and where the number of potential candidates for inclusion in the CDI increases with decreasing rank.

For simplicity, we assume that the fraction of missing words increases linearly with the lexicon size (see Equation 5 in Appendix 2). However, in order to estimate the number of frequent words omitted from the CDI we require a comparison of an individual infant’s raw CDI score with a direct measure of that infant’s vocabulary size. Since we have already provided an estimate of the contribution of idiosyncratic words (first correction) based on raw CDI scores, we can calculate the size of the second correction for this infant. The second correction is just that infant’s total observed vocabulary minus the first corrected vocabulary score based on the infant’s raw CDI score. Note that this second correction applies in a linear fashion to the first correction across the range of vocabulary sizes (b is constant).

An exhaustive comparison of productive vocabulary based on a detailed diary report with a CDI-based estimation is presented in Robinson and Mervis (1999). They reported the vocabulary production of one male child from about 10 months of age to 2 years of age and identified, on a monthly basis, the total number of different words produced and counted how many of these words were listed on the CDI-WS. As predicted, the underestimation (the words produced that are not on the CDI) increased with vocabulary size. This comparison allows us to calculate the fraction of frequent words omitted from the CDI (the single free parameter a of Equation 5 in Appendix 2) by fitting the fully corrected curve to the data provided by Robinson and Mervis. This provides an estimate of the second correction.

Figure 7 depicts total vocabulary sizes based on CDI-WS scores from several different single case studies (Robinson & Mervis, 1999; Roy et al., 2009; Haggerty, 1929). The model’s fit to the Robinson and Mervis data is constrained by the pre-established value of parameter b and the value of the second correction obtained by comparing Robinson and Mervis’ observed data with corresponding raw CDI scores. The fit to the Robinson and Mervis (1999) data confirms the strong non-linearity of the mapping from raw CDI score to total vocabulary size, as the CDI score increases. This clear agreement between experimental data and the model suggests that the two corrections account for the increasing underestimation of the lexicon as the number of words reported on the CDI increases. Of course, the reliance on a single child for determining the magnitude of the second correction calls into question the capacity of the model to generalize beyond that particular child. In order to demonstrate the

Figure 7 Estimated total productive vocabulary size as a function of raw CDI score (solid line). Comparison of diary reports with CDI data from Robinson and Mervis (1999) is shown. Roy et al. (2009) as well as Haggerty’s (1929) study on the vocabulary produced by a 30-month-old child during a single day are described in Appendix 1. The dotted line corresponds to the raw CDI score. See text for further details.

2 The omission of frequent words from the CDI may also have an impact on the estimate of parameters a and b because it changes the shape of the probability distribution, thereby leading to a biased estimate of the contribution of the tail of the distribution of word knowledge to the overall vocabulary. In order to disentangle the impact of the absence of frequent words from the contribution of idiosyncratic words, we ran an additional experiment in which we randomly deleted a percentage of words on the CDI and compared the direct CDI count with the model estimate. We found that the omission of a fraction of words on the CDI had the same impact on the direct count and on the model estimate, indicating that missing frequent words do not induce complex biases when estimating the tail of the distribution of word knowledge.


validity of the Robinson and Mervis (1999) corpus for estimating the second correction reliably beyond a particular child, we also provide a series of demonstrations using other databases and cross-validation techniques in Appendix 1.

These additional tests show that: (1) the mapping from raw CDI score to total vocabulary size for Robinson and Mervis’ child is best when we use exactly the same overlap parameter calculated for the other infants from the LEX2005 database; (2) estimation of the second correction is consistent across all data points provided by the Robinson and Mervis study; and (3) additional corpora (Roy et al., 2009; Haggerty, 1929) are consistent with the strong non-linearity of the mapping from raw CDI score to total vocabulary size. Altogether, these tests confirm that reliance on Robinson and Mervis’ (1999) study to constrain the second correction is valid, and has the predictive power to extend beyond a single case study.

Quantitative predictions of the model

The model is now fully constrained: the first correction for idiosyncratic words is determined by measuring the overlap of vocabulary knowledge over the population of infants (parameter b) and the second correction for the omission of frequent words on the CDI is measured by fitting the model to the data provided by Robinson and Mervis (1999) (parameter a). We can now apply these corrections to individual raw CDI scores so as to provide an estimate of total vocabulary size as a function of the number of known CDI words. Table A1 (in Appendix) lists total vocabulary sizes as a function of the number of words produced on the CDI-WS, together with the relative underestimation error. The relative underestimation error is simply the sum of the two corrections divided by the CDI score. Whereas small CDI scores lead to similar total vocabulary estimates, the estimated vocabulary of an infant knowing 675 words on the CDI is more than four times larger: 2877 words. Table A1 also lists the corresponding model estimates (in bold) for the raw CDI scores presented in Robinson and Mervis (1999). The discrepancy between the model and the data is always smaller than 42 words and the RMS error is 22.8 words. The model predicts a total productive vocabulary size of 1277.9 words based on the infant’s CDI score of 545, whereas Robinson and Mervis reported the infant as knowing 1275 words, an error of only 0.2%. This agreement indicates that the model’s predictive validity for larger vocabularies is high, despite the large discrepancies between raw CDI scores and estimated total vocabulary size.

We use the parameter a, which estimates the number of words omitted from the CDI-WS, as an approximation to the number of words missing from the CDI-WG. Using parameter b to measure the overlap of vocabularies (b = 100), we can also provide an estimation of the total receptive vocabulary size of an infant, given the number of words reported on the CDI-WG. Table A2 tabulates the total vocabulary size as a function of the CDI-WG score. As with production, the underestimation increases with measured vocabulary, so that an infant reported to know 395 words on the CDI-WG is likely to possess a total receptive vocabulary of about 1870 words.

Although applying the model to the distribution of CDI scores is straightforward, the non-linearity of the mappings implies that mean total vocabularies cannot be obtained by mapping the mean of the CDI scores. Mean total vocabulary sizes are obtained by computing the mean of the individual mappings. The model was applied to the distribution of MacArthur-Bates CDI scores (CDI-WS database for production and CDI-WG for comprehension) for girls, boys and both sexes as shown in Table A3 (production) and in Table A4 (comprehension).

Several aspects of these mappings are noteworthy. First, the total vocabularies for the older age groups are considerably larger than expected. To give just one example, raw CDI-WS scores at 30 months of age, for boys and girls together, yield a mean CDI score of 518.6 words, whereas the mean total vocabulary reaches 1313.6 words. Second, the non-linearity of the mapping has important implications. Applying the model to the mean of raw CDI scores is inappropriate. Furthermore, the non-linearity of the corrections has an impact on the relative precocity of infants compared to other infants despite the fact that their ranking is not altered.

Figure 8 compares the standardized raw CDI scores and the total vocabulary of 30-month-olds. A standard CDI score of 0 corresponds to the mean CDI score

[Figure 8: standard total vocabulary size plotted against standard CDI score for 30-month-olds.]

Figure 8 Standard total vocabulary sizes as predicted by the model as a function of the standard CDI scores for 30-month-olds. The solid line corresponds to a linear transformation of the raw CDI scores. The dashed lines define the limit for which the correction is larger than 1 standard deviation from the standard mean.


and a standard CDI score of −1 is one standard deviation below the mean CDI score. Similarly, a standard total vocabulary size of 0 corresponds to the mean total vocabulary size. The individual raw CDI scores of 30-month-olds are also plotted in Figure 8. For one infant, the raw CDI score indicates that she is more than 3 standard deviations below the mean. Once the model is applied, her total vocabulary is only about 2 standard deviations below the mean. Similarly, another infant presents with a CDI score about 1 standard deviation above the mean. After applying the model, her total vocabulary is closer to 2 to 3 standard deviations above the mean. In other words, when dealing with normal distributions, this infant is not only in the top 10% according to the raw CDI score, but more likely to belong in the top 1% of the infants of her age.

As another consequence of the non-linearity of the mappings, the shape of the distribution of total vocabulary size is markedly different from the distribution of raw CDI scores. Figure 9(a) depicts the distribution of CDI-WS scores at 30 months of age, along with the distribution of total vocabulary size. The distribution of raw CDI scores is heavily skewed, revealing a ceiling effect due to the limited number of words listed on the CDI. After applying the model, the skewness disappears and reveals a broad distribution of vocabulary sizes. The application of a Jarque-Bera test (Jarque & Bera, 1980) for normality on the distribution of productive vocabulary at 30 months of age revealed that the hypothesis of normality can be rejected (p = .006) for raw CDI scores but it cannot be rejected for total vocabulary size as obtained with the model (p = .46). This near-zero skewness of the distribution of total vocabulary size is, however, only attained for older age groups.

In populations of younger infants, the distribution is positively skewed, as depicted in Figure 9(b). The positive skewness arises from the fact that vocabulary size is a positive value. Early in life, most of the infants possess a very small vocabulary, whereas a small subset of infants already have larger lexicons. As infants add new words to their vocabulary, the distribution of vocabulary sizes shifts and the value for skewness becomes smaller. After 24 months of age, the distributions of raw CDI scores are negatively skewed, highlighting the fact that the number of words infants know becomes larger when compared to the number of words listed on the CDI. This negative skewness is an artefact of this ceiling effect. However, when corrected, the distribution of vocabulary size displays another trend. Its skewness always remains positive, asymptotically reaching a non-skewed distribution for older toddlers. Application of the model effectively removes the ceiling effect induced by the limited size of the CDIs.

Since the distribution of vocabulary sizes for early age groups is heavily skewed, means and standard deviations must be interpreted with caution. In order to offer a more complete picture of the distributions of early vocabularies, we provide tables with percentiles based on raw data. Given the limited number of infants at each age, the raw percentiles at the tails of the distributions, the 10th and the 90th, are relatively noisy. This can explain, for example, why 9-month-old infants seem to know fewer words than 8-month-old infants (Table A4). This artefact due to a random sampling of infants is already present in the raw CDI score. Compilation of CDI forms with an even larger number of infants and toddlers would remove this artefact.
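The caution about means and standard deviations for young age groups can be made concrete with a small sketch. The scores below are hypothetical and chosen only to be positively skewed; they are not data from the CDI database. The sketch uses Python's standard `statistics` module, whose default quantile method is one of several common conventions and is not necessarily the one used for the tables in the Appendix.

```python
import statistics

# Hypothetical raw scores for one young age group (not data from the paper):
# most infants have very few words, a handful already have many.
scores = [0, 1, 2, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 30, 45, 70, 110, 180]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)
median = statistics.median(scores)

# With n=10, statistics.quantiles returns the nine decile cut points;
# the first and the last are the 10th and 90th percentiles.
deciles = statistics.quantiles(scores, n=10)
p10, p90 = deciles[0], deciles[-1]

print(f"mean = {mean:.1f}, SD = {sd:.1f}, mean - SD = {mean - sd:.1f}")
print(f"10th = {p10:.1f}, median = {median:.1f}, 90th = {p90:.1f}")
```

In this toy sample the standard deviation exceeds the mean, so "one standard deviation below the mean" is a negative word count, which is not interpretable. The percentiles stay within the observed range, although with so few observations the 10th and 90th percentiles rest on one or two infants each, which is the source of the noise described above.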

[Figure 9: (a) histogram of the number of infants against lexicon size (0 to 3000), for CDI scores and for total vocabulary; (b) skewness against age in months (16 to 30), for the distribution of CDI scores and the distribution of total vocabulary sizes.]

Figure 9 (a) Distribution of productive vocabulary sizes at 30 months, for both girls and boys. The white bars depict the distribution of raw CDI scores based on the CDI-WS (data, courtesy of the MacArthur-Bates CDI board). Note the negative skewness due to the limited number of words present on the CDI. The black bars depict the distribution of total productive vocabulary sizes, after applying the model to the raw CDI scores. The distribution approaches normality. (b) Skewness measured on the raw CDI scores and on the total vocabulary size distribution. Whereas saturation of CDI scores generates a negative skewing for older age groups, the total vocabulary distribution approaches symmetry.
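How a ceiling alone can produce negative skewness and a rejected normality test can be sketched as follows. The simulated vocabulary sizes are hypothetical (normally distributed, with an arbitrary mean and standard deviation), the checklist length of 680 is an assumption, and a hard cap is a deliberately crude stand-in for the saturation of CDI scores; it is not the model's mapping. The Jarque-Bera statistic is computed from its textbook definition, with the p-value taken from its asymptotic chi-square distribution with 2 degrees of freedom.

```python
import math
import random

def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Sample skewness: third central moment over variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def jarque_bera(values):
    """Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    s = skewness(values)
    k = central_moment(values, 4) / central_moment(values, 2) ** 2
    statistic = n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

rng = random.Random(0)      # fixed seed so the illustration is repeatable
CHECKLIST_SIZE = 680        # assumption: number of words a parent can tick

# Hypothetical "true" vocabulary sizes, then the same children seen through
# a checklist that cannot report more than CHECKLIST_SIZE words.
true_sizes = [max(0.0, rng.gauss(600, 200)) for _ in range(2000)]
capped = [min(v, CHECKLIST_SIZE) for v in true_sizes]

for label, sample in (("uncapped", true_sizes), ("capped", capped)):
    stat, p = jarque_bera(sample)
    print(f"{label}: skewness = {skewness(sample):.2f}, JB = {stat:.1f}, p = {p:.3g}")
```

The capped sample piles up at the ceiling, so its left tail is much longer than its right tail and the skewness turns negative, mirroring the pattern described for raw CDI scores in older age groups.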


Table A5 displays the raw percentiles for girls, boys and for both sexes between 16 and 30 months of age. Similarly, we provide tables with raw percentiles for vocabulary in comprehension for girls, boys and both sexes, from 8 to 18 months of age (Table A6).
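Two pieces of arithmetic from this section can be checked directly. The first part of the sketch below uses only figures quoted in the text (CDI scores of 675, 395 and 545 and their reported estimates). The second part illustrates why the mean of the individual mappings differs from the mapping of the mean; it relies on `toy_mapping`, a hypothetical increasing and convex function, and on hypothetical children's scores, because the tabulated mapping of Table A1 is not reproduced here.

```python
import statistics

def relative_underestimation(cdi_score, total_vocabulary):
    """Sum of the two corrections (total minus CDI) divided by the CDI score."""
    return (total_vocabulary - cdi_score) / cdi_score

# (raw CDI score, model estimate) pairs quoted in the text.
print(f"CDI-WS 675 -> 2877: {relative_underestimation(675, 2877):.2f}")
print(f"CDI-WG 395 -> 1870: {relative_underestimation(395, 1870):.2f}")

# Agreement between the model and the diary count at a CDI score of 545.
model_estimate, diary_count = 1277.9, 1275
relative_error = abs(model_estimate - diary_count) / diary_count
print(f"model vs diary: {100 * relative_error:.1f}%")

def toy_mapping(cdi_score):
    """Hypothetical increasing, convex stand-in for the tabulated mapping."""
    return cdi_score + 0.004 * cdi_score ** 2

cdi_scores = [120, 300, 480, 560, 640, 670]   # hypothetical children
totals = [toy_mapping(s) for s in cdi_scores]

# For a convex mapping, the mean of the mapped scores exceeds the mapped mean.
mean_of_mappings = statistics.mean(totals)
mapping_of_mean = toy_mapping(statistics.mean(cdi_scores))
print(f"mean of mappings = {mean_of_mappings:.1f}, mapping of mean = {mapping_of_mean:.1f}")

def standardize(values):
    """z-scores relative to the sample's own mean and standard deviation."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

# The ranking is unchanged, but standardized positions move.
z_cdi, z_total = standardize(cdi_scores), standardize(totals)
for raw, z_before, z_after in zip(cdi_scores, z_cdi, z_total):
    print(f"CDI {raw}: z = {z_before:+.2f} -> {z_after:+.2f}")
```

The first ratio corresponds to a relative underestimation of roughly 326% at a CDI score of 675, consistent with the statement that the total is more than four times the raw score. In the toy example the lowest-scoring child moves closer to the mean and the highest-scoring child moves further above it, which is the qualitative effect described for Figure 8.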

Discussion

We have proposed a mathematical model for estimating the total vocabulary size of infants given their raw CDI scores. Two corrections are applied to the raw CDI measurements: first, the number of idiosyncratic words is estimated via a measure of the overlap of individual vocabularies in the infant population. Second, an estimation of the number of frequent words omitted from the CDI is constrained by a comparison of diary reports and direct CDI counts. The model is applied to the MacArthur-Bates CDI database, Words and Gestures, for comprehension and Words and Sentences, for production. The model predicts that the underestimation of total vocabulary size increases with the number of words reported as known on the CDI. This prediction confirms the expectations of the original compilers of the CDI instrument (see Fenson et al., 1993, p. 40). Moreover, mapping from CDI score to total vocabulary size is highly non-linear; for small vocabularies, the raw CDI count is an accurate measure for total vocabulary size, whereas when infants know 90% of the words on the CDI, they are likely to possess about three times as many words in their vocabularies. The close agreement with the data reported by Robinson and Mervis suggests that the predictions of the model are accurate even for large CDI scores. We demonstrated that the predictive validity of the model is not compromised by its reliance on a single case study. Convergence of parameter estimation from different sources (CDI data, Robinson & Mervis’ (1999) data, Roy et al.’s (2009) data, Haggerty’s (1929) data) as well as cross-validation assessments attest to the accuracy of our estimates.

The non-linearity in the mapping has important implications. For example, CDIs are a widely used tool for diagnosing language delays. Whereas criteria for identifying delays are not absolute, we advocate against their use based on raw CDI scores. At 30 months of age, we have shown that some outliers (2 SD below the mean) on raw CDI scores are not outliers on corrected scores. The opposite can also hold, i.e. individuals may be outliers on corrected vocabulary scores (> 2 SD above the mean) even though they are not on the raw CDI score (≤ 1 SD above the mean). Diagnosis of delay should take these findings into account.

Beyond the exact numbers predicted by the model, the non-linearity of the mapping has additional repercussions. Many studies report correlations of different variables, such as grammatical knowledge (Bates & Goodman, 1997) or spoken word recognition (Swingley & Aslin, 2000), to vocabulary size. Any correlation to vocabulary knowledge based on raw CDI scores is biased, and may lead, in the worst case, to incorrect interpretation of the results, and in the best case, to inaccurate statistical confidence levels. However, the use of CDI scores is more likely to weaken the researcher’s opportunity to find interesting correlations between language behaviour and CDI scores, since CDI scores are weak at differentiating between infants or toddlers with high vocabularies and infants or toddlers with very high vocabularies. Therefore, we advocate the use of total vocabulary size as predicted by the model in correlational analyses rather than raw CDI scores.

The high variability of vocabulary acquisition when measured with raw CDI scores has been subject to debate. Feldman, Dollaghan, Campbell, Kurs-Lasky, Janosky and Paradise (2000) criticized the MacArthur-Bates CDI in several ways (see also the reply by Fenson, Bates, Dale, Goodman, Reznick & Thal, 2000); we list here three aspects that they questioned, and highlight the contribution of our approach. First, Feldman et al. (2000) doubted that CDIs are a relevant tool for assessing vocabulary acquisition, given that standard deviations for early vocabularies are as large as (or larger than) the mean vocabularies. The question is whether this strong variability is an intrinsic effect of early lexical acquisition or an artefact of the application of the CDI. Given the close agreement between the model and experimental data, we suggest that CDIs are indeed a useful tool for assessing vocabulary development, once the model has been applied. Moreover, the means and standard deviations of total vocabulary size reported in Table A3 and Table A4 confirm the high variability for the young age groups to be a true variation of early vocabulary acquisition rather than the product of applying a biased diagnostic tool. Second, Feldman et al. (2000) measured the correlation between the scores of the CDI-WG at 12 months of age and the CDI-WS at 24 months of age and found them to be moderate. As a consequence, they suggest that CDIs are not a reliable tool for screening for delayed language or for identifying children at risk. Although we do not question that clinicians should exercise caution in applying the CDI, again we suggest that any correlational analysis should be applied to total vocabulary size as predicted by the model rather than to raw CDI scores, as the non-linearity of the mapping would induce biases in the estimation. It remains to be verified whether vocabulary size at 12 months of age is a reliable predictor of the vocabulary size at 24 months of age or not. Third, Feldman et al. (2000) criticized the use made of percentiles, as a small number of words would span a large number of percentile ranks for young infants, but the same number of words would span a much-reduced number of percentile ranks for older infants. The large positive skewness of vocabulary distributions for early age groups suggests that percentiles should be used instead of means and standard deviations for the


population of young infants. Whereas the distribution of CDI scores is negatively skewed for older age groups as shown in Figure 9(b), the distribution of total vocabulary size has skewness close to zero. As a consequence, the use of means and standard deviations is fully justified for the older age groups. The application of the model effectively removes the ceiling effect due to the limited number of words listed on the CDI.

The removal of the ceiling effect inherent in the CDI scores for older infants can also offer additional information for those interested in characterizing the vocabulary spurt often observed at the end of the second year of life. Many researchers attempt to identify an inflection point in increasing vocabulary size (Ganger & Brent, 2004). Often, direct measure from the CDI indicates a slowing-down in the speed of acquisition of new words after the spurt. Again, the non-linearity of the correction suggests more caution in the analysis of vocabulary sizes, as the deceleration is likely to disappear when analysing total vocabulary sizes rather than CDI scores.

It is important to reiterate that the model is partially constrained by a single case study. This may seem to cast doubt on the generality of the corrections for other infants’ vocabularies. However, the close agreement between the model and data from independent sources suggests that the model offers a good approximation, even though a more extensive comparison of diary reports to raw CDI scores has the potential to increase the accuracy of the estimate. Nevertheless, even rich diary studies and high-density recordings can underestimate vocabulary size. Consequently, we must assume that the estimates delivered by the model provide a lower bound for total vocabulary sizes. A further correction to the present estimate could be achieved by comparing extensive diary reports and ‘true’ vocabulary knowledge as assessed, for example, by monitoring infant looking behaviour at pictures while hearing words in an intermodal preferential looking paradigm. In addition, there is no general consensus as to when a word is properly learnt since it can be understood first in a given context (duck refers to the bath) and then progressively refined to a decontextualized label (duck being a name for all ducks). Any systematic biases in parental judgements as to what constitutes proper word understanding will clearly also impact the accuracy of CDI measures.³

The present study used diary reports (production data) to provide mappings from CDI scores to vocabulary size in both comprehension and production. From this perspective, we assumed that the number of frequent words omitted from the selection of words in CDIs remains the same for equivalent CDI scores in production and comprehension. Diary reports about infants’ receptive vocabulary compared with CDI scores would allow for a validation of this assumption. In the absence of such reports, we must assume that productive vocabulary size estimates are more accurate than receptive vocabulary size estimates. As a further limitation, the model does not allow the identification of specific missing words on the CDI nor infants’ individual idiosyncrasies. However, the model permits an efficient and accurate use of CDI data in establishing a quantitative evaluation of an infant’s lexicon.

All data being used to constrain the estimate of vocabulary size come from typically developing infants and toddlers. The estimate of vocabulary size from CDI scores has only been validated for unimpaired children, and relies on regularities observed across this population, such as parameter b monitoring the overlap of vocabulary knowledge. If such regularities were to be observed among other populations, such as Williams syndrome children, Down syndrome children, autistic children, to name but a few, the present mapping from CDI scores to vocabulary size may be applicable beyond typically developing children. A comparison of CDI scores to diary reports in atypical populations would be needed to assess the applicability of the present mapping to these populations.

Finally, the analysis of the amount of overlap between individual vocabularies suggests that the total number of idiosyncratic words does not change with age over the age range considered. This observation may offer some important boundary conditions in the attempt to apply statistical models of lexicon growth such as preferential attachment (Steyvers & Tenenbaum, 2005) or preferential avoidance (Hills, Maouene, Maouene, Sheya & Smith, 2009). For example, populations of networks implementing lexical growth with preferential attachment may exhibit a relatively strong overlap of individual lexicons as they are likely to start with similar core words, whereas populations of ‘preferential avoidance’ networks may exhibit a diverging overlap between individual vocabularies.

Acknowledgements

We would like to thank the MacArthur-Bates CDI advisory board for generously giving us access to data concerning the distribution of CDI scores. This work was supported by the Economic and Social Research Council Grant RES-062-23-0194 awarded to Kim Plunkett.

Note: Online calculators providing estimates of vocabulary sizes from CDI scores can be found on http://www.bcbl.eu/cdi and http://babylab.psy.ox.ac.uk/research/oxford-cdi/.

References

Bates, E., & Goodman, J. (1997). On the inseparability of grammar and the lexicon: evidence from acquisition, aphasia and real-time processing. Language and Cognitive Processes, 12 (5/6), 507–584.

3 See Styles and Plunkett (2009) for a discussion of parents’ evaluation of their children’s understanding of words in the context of CDI reports.

780 Julien Mayor and Kim Plunkett

© 2010 Blackwell Publishing Ltd.

Carey, S. (1978). The child as word learner. In M. Halle, J. Bresnan, & G.A. Miller (Eds.), Linguistic theory and psychological reality (pp. 264–293). Cambridge, MA: MIT Press.

Dale, P., & Fenson, L. (1996). Lexical development norms for young children. Behavior Research Methods, Instruments & Computers, 28 (1), 125–127.

Feldman, H., Dollaghan, C., Campbell, T., Kurs-Lasky, M., Janosky, J., & Paradise, J. (2000). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, 71 (2), 310–322.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J., & Thal, D. (2000). Reply: Measuring variability in early child language: don’t shoot the messenger. Child Development, 71 (2), 323–328.

Fenson, L., Dale, P.S., Reznick, J.S., Thal, D., Bates, E., Hartung, J.P., Pethick, S., & Reilly, J.S. (1993). MacArthur Communicative Development Inventories: User’s guide and technical manual. San Diego, CA: Singular Press.

Fenson, L., Marchman, V.A., Thal, D., Dale, P., Reznick, S., & Bates, E. (2007). MacArthur-Bates Communicative Development Inventories: User’s guide and technical manual, 2nd edn. Baltimore, MD: Paul H. Brookes.

Ganger, J., & Brent, M. (2004). Reexamining the vocabulary spurt. Developmental Psychology, 40, 621–632.

Haggerty, L. (1929). What a two-and-one-half-year-old child said in one day. Journal of Genetic Psychology, 38, 75–100.

Hills, T., Maouene, M., Maouene, J., Sheya, A., & Smith, L. (2009). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science, 20 (6), 729–739.

Jarque, C., & Bera, A. (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6 (3), 255–259.

MacWhinney, B. (1991). The CHILDES project: Tools for analyzing talk. Hillsdale, NJ: Lawrence Erlbaum Associates.

Nice, M. (1926). On the size of vocabulary. American Speech, 2, 1–7.

Ring, E.D., & Fenson, L. (2000). The correspondence between parent report and child performance for receptive and expressive vocabulary beyond infancy. First Language, 20 (59), 141–159.

Robinson, B., & Mervis, C. (1999). Comparing productive vocabulary measures from the CDI and a systematic diary study. Journal of Child Language, 26 (1), 177–185.

Roy, B., Franck, M., & Roy, D. (2009). Exploring words in a high-density longitudinal corpus. In N. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 2106–2111). Austin, TX: Cognitive Science Society.

Thal, D.J., Marchman, V., Stiles, J., Aram, D., Trauner, D., Nass, R., & Bates, E. (1991). Early lexical development in children with focal brain injury. Brain and Language, 40, 491–527.

Steyvers, M., & Tenenbaum, J. (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29 (1), 41–78.

Styles, S., & Plunkett, K. (2009). What is ‘word understanding’ for the parent of a one-year-old? Matching the difficulty of a lexical comprehension task to parental CDI report. Journal of Child Language, 36 (4), 895–908.

Swingley, D., & Aslin, R.N. (2000). Spoken word recognition and lexical representation in very young children. Cognition, 76 (2), 147–166.

Werker, J.F., Fennell, C.T., Corcoran, K.M., & Stager, C.L. (2002). Infants’ ability to learn phonetically similar words: effects of age and vocabulary size. Infancy, 3 (1), 1–30.

Zipf, G.K. (1949). Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, MA: Addison-Wesley Press.

Received: 3 February 2010
Accepted: 21 September 2010

Appendix 1. Additional tests for validating the reliance on Robinson and Mervis (1999) for the second correction

Note that the first correction for the absence of idiosyncratic words is uniquely determined by the parameter b that we have shown to remain constant over vocabulary size. The fit to Robinson and Mervis’ data is a linear transformation of this first correction. The strong non-linearity predicted by the model therefore derives from the first correction introduced by idiosyncratic words. The omission of frequent words on the CDI serves only to modulate this non-linearity. Consequently, the quality of the fit to total vocabulary for the Robinson and Mervis study is strongly dependent on the overlap parameter b derived from other infants. Stronger or weaker overlap between infant vocabularies would change the shape of the mapping from CDI score to total vocabulary size, thus leading to lower accuracy. We attempted to optimize the fit to Robinson and Mervis, using different overlap parameters b. Figure A1 depicts RMS errors of the best fit, given different parameters b (through the optimization of the second-correction factor α, see Equation 5, in Appendix 2). The fit from raw CDI score to total vocabulary size is best when parameter b = 180. This is precisely the same value obtained for the overlap parameter calculated for a population of infants on the LEX2005 database, CDI-WS (see Figure 6(c)). This strongly suggests that the unique mapping determined from the analysis of the LEX2005 database (see Figure 3) precisely captures the non-linear correction from raw CDI score to total vocabulary. In other words, despite being an early talker, Robinson and Mervis’ (1999) infant follows a trajectory in vocabulary space that is captured by the analysis of hundreds of other infants from the MacArthur-Bates CDI, thereby justifying the use of a single case study to evaluate the fraction of words omitted from the CDI.

Figure A1 Goodness of fit to the data provided by Robinson and Mervis (1999). RMS error is reported for different overlap parameters b. The mapping from CDI score to total vocabulary size is most accurate for b = 180, the value corresponding to the vocabulary overlap measured on infants other than the one studied by Robinson and Mervis, thereby validating the reliance on this study. Axes: overlap parameter b (horizontal); RMS error on fit to R&M (1999) (vertical).

Moreover, every single measurement of the total vocabulary size (and its associated CDI score) for the Robinson and Mervis study provides an estimate of the second correction. This permits us to evaluate the reliability of our estimate using cross-validation. The error between the model’s estimation averaged over nine data points and a new data point left for validation provides an evaluation of the consistency of the estimations across all data points. The mean relative error between the model estimate and the new validation data point is 3.5%. In other words, the model predicts total vocabulary size from raw CDI score for this infant with a relative error of less than 3.5%.
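The leave-one-out check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each diary measurement is taken to be a triple (CDI score x, vocabulary after the first correction y, total vocabulary T), each triple is assumed to yield its own second-correction factor via T = y · (1 + α · x), and the triples in `EXAMPLE_POINTS` are made up for the example (they are not Robinson and Mervis’ measurements). The function names are hypothetical.

```python
def alpha_from_point(cdi_score, corr1, total):
    """Second-correction factor implied by one measurement,
    assuming total = corr1 * (1 + alpha * cdi_score)."""
    return (total / corr1 - 1.0) / cdi_score


def leave_one_out_errors(points):
    """Relative prediction error for each held-out point, using the
    mean alpha of all the other points."""
    errors = []
    for k, (x, y, total) in enumerate(points):
        others = [alpha_from_point(*p) for j, p in enumerate(points) if j != k]
        alpha = sum(others) / len(others)
        predicted = y * (1.0 + alpha * x)
        errors.append(abs(predicted - total) / total)
    return errors


# Made-up (CDI score, first-correction size, total size) triples.
EXAMPLE_POINTS = [
    (100.0, 103.0, 121.0),
    (200.0, 209.0, 283.0),
    (300.0, 319.0, 485.0),
    (400.0, 437.0, 747.0),
]

if __name__ == "__main__":
    errs = leave_one_out_errors(EXAMPLE_POINTS)
    print("mean relative error:", sum(errs) / len(errs))
```

With real diary data, the mean of the returned errors would play the role of the mean relative error reported in the text.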

Since the cross-validation procedure relies on data from the same infant, we conducted a third test using other direct comparisons of total vocabulary size and CDI scores. Unfortunately, this type of data is scarce, as most corpora are based on relatively short interactions, in which the infant is likely to use only a small subset of her total vocabulary. In the Speechome project (Roy et al., 2009), an infant’s utterances were recorded with a high temporal resolution over the first 3 years of the infant’s life. By 24 months of age, the infant produced 517 different words, 265 of which were on the online LEX2005 database. The online version of the LEX2005 database lists only 652 words out of the 680 words listed on the CDI-WS forms. As a consequence, the number of words produced by Roy et al.’s (2009) infant that belong to the CDI-WS may vary between 265 (if the infant does not produce any of the 28 words missing from the online version of the LEX2005 database) and 293 (if the infant produces all the missing words). We reported both estimates on Figure 7. Although this uncertainty in the correct CDI-WS score prevents us from using Roy et al.’s (2009) data to constrain the factor α in the second correction, there is a clear compatibility between the model’s estimate and Roy et al.’s data.

Table A1 Total productive vocabulary sizes based on CDI-WS scores (the model estimate for Robinson & Mervis data is reported in bold)

CDI count  Estimate  Error (%)  |  CDI count  Estimate  Error (%)  |  CDI count  Estimate  Error (%)
5    5.2    3.4   |  230  338.6  47.2   |  455  916.4   101.4
10   10.5   4.7   |  235  348.0  48.1   |  460  934.8   103.2
15   15.8   5.2   |  240  358.8  49.5   |  465  951.0   104.5
20   21.3   6.5   |  245  368.6  50.4   |  470  969.6   106.3
25   26.9   7.5   |  250  378.4  51.4   |  475  988.4   108.1
30   32.4   8.1   |  255  388.5  52.4   |  480  1004.9  109.4
35   38.3   9.4   |  260  398.8  53.4   |  485  1023.8  111.1
40   44.0   9.9   |  265  409.2  54.4   |  490  1042.9  112.8
45   49.9   10.8  |  270  419.8  55.5   |  495  1062.0  114.6
50   56.0   11.9  |  275  430.5  56.6   |  500  1081.3  116.3
55   62.1   12.9  |  280  441.5  57.7   |  505  1103.1  118.4
60   68.2   13.6  |  285  452.6  58.8   |  510  1122.5  120.1
65   74.8   15.0  |  290  463.9  60.0   |  515  1142.0  121.7
70   81.1   15.9  |  295  475.3  61.1   |  520  1164.0  123.9
75   87.5   16.6  |  300  486.9  62.3   |  525  1186.2  125.9
80   93.9   17.3  |  305  498.7  63.5   |  530  1208.4  128.0
85   100.6  18.4  |  310  509.0  64.2   |  535  1230.6  130.0
90   107.3  19.2  |  315  521.1  65.4   |  540  1253.0  132.0
95   114.4  20.4  |  320  533.4  66.7   |  545  1277.9  134.5
100  121.3  21.3  |  325  545.8  67.9   |  550  1302.8  136.9
105  128.5  22.4  |  330  558.4  69.2   |  555  1327.8  139.2
110  135.5  23.1  |  335  569.4  70.0   |  560  1352.9  141.6
115  142.7  24.1  |  340  582.3  71.3   |  565  1377.9  143.9
120  150.3  25.3  |  345  595.3  72.6   |  570  1405.6  146.6
125  157.6  26.0  |  350  608.6  73.9   |  575  1433.2  149.3
130  165.0  27.0  |  355  620.0  74.7   |  580  1463.4  152.3
135  172.8  28.0  |  360  633.5  76.0   |  585  1493.6  155.3
140  180.9  29.2  |  365  647.2  77.3   |  590  1523.8  158.3
145  188.4  30.0  |  370  661.0  78.6   |  595  1556.4  161.6
150  196.2  30.8  |  375  675.0  80.0   |  600  1591.6  165.3
155  205.2  32.4  |  380  687.0  80.8   |  605  1626.7  168.9
160  212.5  32.8  |  385  701.3  82.1   |  610  1664.2  172.8
165  221.1  34.0  |  390  715.6  83.5   |  615  1704.1  177.1
170  229.9  35.3  |  395  730.2  84.8   |  620  1746.4  181.7
175  238.0  36.0  |  400  744.8  86.2   |  625  1791.0  186.6
180  246.3  36.8  |  405  759.6  87.6   |  630  1837.9  191.7
185  255.9  38.3  |  410  774.5  88.9   |  635  1889.5  197.6
190  264.6  39.3  |  415  789.5  90.2   |  640  1945.7  204.0
195  273.6  40.3  |  420  804.7  91.6   |  645  2006.4  211.1
200  281.6  40.8  |  425  820.0  92.9   |  650  2076.2  219.4
205  291.0  41.9  |  430  835.4  94.3   |  655  2157.4  229.4
210  300.6  43.1  |  435  850.9  95.6   |  660  2249.7  240.9
215  310.4  44.4  |  440  868.8  97.5   |  665  2364.4  255.5
220  319.2  45.1  |  445  884.6  98.8   |  670  2516.9  275.7
225  329.5  46.4  |  450  900.5  100.1  |  675  2742.2  306.3

A second transcript, from the CHILDES database (MacWhinney, 1991), is the Haggerty corpus (Haggerty, 1929), which described the vocabulary produced by a 30-month-old child during a single day. The child produced 693 different words, 302 of them being on the CDI-WS. The use of Haggerty’s (1929) corpus for constraining the second estimate is limited in two ways. First, it is unlikely that this child produced all the words in her lexicon on a single day. Therefore, analysis of the corpus provides only a lower bound for her vocabulary. Similarly, the selection of words uttered that belong to the CDI provides only a lower bound to her potential CDI score. Direct counts of the words produced by the infant and the associated CDI-WS count are therefore biased estimates of both total vocabulary size and of the correct CDI-WS score.

Fortunately, we can evaluate the total vocabulary size from a sufficiently dense speech corpus. It is known that the frequency distribution of word usage follows Zipf’s law (Zipf, 1949): typically, speakers produce the most frequent word twice as often as the second most frequent word, three times as often as the third, etc. Consequently, there is an algorithmic relation between the number of different words uttered during a given time window and the total number of words uttered in that same time window, given the total lexicon size of that person. We ran additional simulations in which we sampled N words (corresponding to the number of words uttered by Haggerty’s child in a day) out of different lexicon sizes L, and measured the number of different words produced. Analysis of our simulations revealed that the maximum likelihood estimate of the lexicon size is 836 words, with a 95% confidence interval ranging from 794 to 878 words.
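The kind of simulation described here can be sketched as follows. This is a minimal illustration, not the authors’ code: it assumes word probabilities proportional to 1/rank (Zipf’s law as stated above), the token count used in the demo is a hypothetical placeholder rather than the number of words uttered by Haggerty’s child, and `closest_lexicon_size` is a crude stand-in for a proper maximum likelihood search.

```python
import random


def zipf_weights(lexicon_size):
    """Unnormalised Zipfian weights: the word of rank r has weight 1/r."""
    return [1.0 / rank for rank in range(1, lexicon_size + 1)]


def mean_distinct_words(n_tokens, lexicon_size, n_runs=20, seed=0):
    """Average number of different words observed when n_tokens are
    drawn at random from a Zipfian lexicon of the given size."""
    rng = random.Random(seed)
    weights = zipf_weights(lexicon_size)
    words = range(lexicon_size)
    total = 0
    for _ in range(n_runs):
        sample = rng.choices(words, weights=weights, k=n_tokens)
        total += len(set(sample))
    return total / n_runs


def closest_lexicon_size(n_tokens, observed_types, candidates):
    """Candidate lexicon size whose simulated count of different words
    is closest to the observed count."""
    return min(
        candidates,
        key=lambda size: abs(mean_distinct_words(n_tokens, size) - observed_types),
    )


if __name__ == "__main__":
    N_TOKENS = 3000  # hypothetical daily token count, for illustration only
    for size in (400, 800, 1200):
        print(size, mean_distinct_words(N_TOKENS, size))
```

For a fixed number of tokens, larger lexicons yield more different words, which is what makes it possible to work backwards from an observed count of different words to a plausible lexicon size.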

The shaded area on Figure 7 depicts the set of potential coordinates associated with the total vocabulary size of Haggerty’s child and the corresponding CDI-WS scores. The upper and lower edges correspond to the confidence intervals in the estimation of total vocabulary size. The left edge would indicate that none of the extra words (words that belong to the child’s productive vocabulary but were not produced on the day the recording was made) are included in the CDI, while the lower right edge would indicate that all extra words belong to the CDI. We see no solution for narrowing down the shaded area further, and the actual total vocabulary size of that child and the associated CDI score can be anywhere within the box boundaries. Note, however, that this zone is markedly different from the direct count of words on the CDI (dotted line) and provides additional confirmation that the underestimation of vocabulary size becomes large for high raw CDI scores. When the raw CDI score is used as a proxy for total vocabulary size for Haggerty’s child, underestimation errors are in the range of 46% to 64%. In contrast, the data for Haggerty’s (1929) child may be in perfect agreement with the model estimate (the curve described by the model crosses the shaded area). Unfortunately, the sampling in Haggerty’s corpus, of a single day, is too small to narrow the range further.

Table A2 Total vocabulary sizes in comprehension, based on the CDI-WG scores

CDI count  Estimate  Error (%)  |  CDI count  Estimate  Error (%)
5    5.2    3.6   |  205  355.5   73.4
10   10.5   5.2   |  210  367.3   74.9
15   16.1   7.6   |  215  381.2   77.3
20   21.7   8.7   |  220  395.4   79.7
25   27.6   10.3  |  225  407.7   81.2
30   33.6   11.9  |  230  422.4   83.6
35   39.7   13.3  |  235  437.3   86.1
40   45.9   14.8  |  240  452.4   88.5
45   52.6   16.8  |  245  465.5   90.0
50   59.0   18.0  |  250  481.0   92.4
55   66.1   20.3  |  255  496.8   94.8
60   72.8   21.3  |  260  512.7   97.2
65   79.9   22.9  |  265  531.1   100.4
70   87.7   25.2  |  270  547.4   102.7
75   95.2   27.0  |  275  563.9   105.0
80   102.5  28.1  |  280  582.9   108.2
85   110.2  29.6  |  285  602.1   111.3
90   118.4  31.5  |  290  619.0   113.5
95   127.0  33.7  |  295  638.5   116.5
100  135.0  35.0  |  300  660.6   120.2
105  143.5  36.6  |  305  680.4   123.1
110  152.3  38.5  |  310  702.8   126.7
115  161.5  40.5  |  315  725.3   130.3
120  169.9  41.6  |  320  747.9   133.7
125  179.9  43.9  |  325  773.1   137.9
130  188.9  45.3  |  330  798.3   141.9
135  198.3  46.9  |  335  826.1   146.6
140  208.0  48.6  |  340  856.5   151.9
145  218.0  50.3  |  345  886.8   157.1
150  228.3  52.2  |  350  919.6   162.7
155  238.9  54.1  |  355  957.3   169.7
160  249.8  56.1  |  360  994.8   176.3
165  261.0  58.2  |  365  1039.6  184.8
170  270.8  59.3  |  370  1088.9  194.3
175  282.6  61.5  |  375  1149.8  206.6
180  294.7  63.7  |  380  1219.5  220.9
185  305.3  65.0  |  385  1316.4  241.9
190  317.9  67.3  |  390  1462.1  274.9
195  330.8  69.7  |  395  1869.9  373.4
200  342.1  71.1  |

Table A3 Normative data for total productive vocabulary for girls, boys and both sexes combined

Age (m)  |  Girls Mean  Girls SD  |  Boys Mean  Boys SD  |  Both Mean  Both SD
16  |  80.2    82.7   |  67.3    105.5  |  73.8    94.4
17  |  134.9   187.7  |  101.6   110.7  |  117.2   151.7
18  |  211.2   230.3  |  118.5   192.4  |  158.1   213.4
19  |  315.5   335.5  |  213.0   230.5  |  270.5   297.1
20  |  328.2   284.6  |  252.5   272.9  |  291.3   280.4
21  |  394.6   385.9  |  334.9   338.4  |  360.0   358.4
22  |  541.2   404.2  |  375.7   364.4  |  467.6   394.1
23  |  655.1   496.1  |  560.1   409.2  |  607.6   455.1
24  |  666.8   486.6  |  521.6   423.9  |  599.0   462.5
25  |  831.1   592.4  |  541.1   422.6  |  701.0   540.5
26  |  912.0   554.6  |  763.3   556.2  |  833.2   557.6
27  |  1061.6  623.1  |  828.6   642.5  |  939.9   641.2
28  |  1018.5  427.5  |  814.9   623.5  |  914.3   543.4
29  |  1233.9  737.1  |  849.5   529.8  |  1032.1  661.4
30  |  1359.7  579.0  |  1265.0  535.8  |  1313.6  556.9

Table A4 Normative data for total vocabulary in comprehension, for girls, boys and both sexes combined

Age (m)  |  Girls Mean  Girls SD  |  Boys Mean  Boys SD  |  Both Mean  Both SD
8   |  64.8   114.0  |  55.5   84.8   |  59.9   98.9
9   |  57.1   58.3   |  40.2   43.8   |  49.3   52.5
10  |  90.5   109.4  |  64.0   66.2   |  76.5   89.8
11  |  193.9  372.6  |  82.4   82.0   |  135.2  267.6
12  |  132.2  93.3   |  108.3  95.9   |  120.3  95.1
13  |  225.7  262.8  |  174.9  175.6  |  199.9  223.2
14  |  295.2  280.7  |  254.5  181.0  |  275.8  238.2
15  |  367.0  394.9  |  310.5  272.9  |  340.6  342.7
16  |  477.0  368.6  |  296.1  196.7  |  384.4  305.5
17  |  437.4  202.5  |  364.8  188.0  |  397.6  196.7
18  |  637.0  425.2  |  502.0  248.0  |  577.7  362.2

Appendix 2. Mathematical procedure for estimating vocabularies from CDI scores

We detail the procedure for constructing tables of total vocabulary sizes for infants based on CDI scores. It can also be applied to other languages, as well as to lists of a subset of word types (e.g. closed classes, verbs, adjectives).

The information required to constrain the estimate is:

1. Population data concerning the probability that an infant knows any given word on the list for an age group. The data are used to measure the overlap b between individual vocabularies.

2. A detailed diary study of the production of an infant, so as to compare the number of words on the CDI with the total number of words produced. This comparison constrains the factor α, which estimates the number of frequent words omitted from the list.

The procedure involves performing an ‘item-based’ analysis, which can be shown to be equivalent to a ‘subject-based’ analysis, when evaluating mean vocabulary size. The average of individual CDI scores over all infants is equivalent to computing the sum, over all words $w_i$ on the CDI, of the probabilities $p(w_i)$ that words $w_i$ are known by an infant.

\[ \mathrm{Voc}_{\mathrm{est}} = \sum_{i=1}^{W} p(w_i) = \frac{1}{N} \sum_{j=1}^{N} \mathrm{voc}(j) \tag{1} \]

where W is the number of words on the CDI, voc(j) measures the CDI score of infant j and N represents the number of infants. The task is now to provide an estimate of the fraction of infants knowing words that are not included in the CDI. We can then estimate the total vocabulary size in terms of the direct CDI measure, plus a term that corresponds to the underestimate. The underestimate is simply the sum of the probabilities for words that are not listed on the CDI:

\[ \mathrm{Voc}_{\mathrm{real}} = \mathrm{Voc}_{\mathrm{est}} + \sum_{i=W+1}^{W_\infty} p(w_i) \tag{2} \]

with $W_\infty$ being the total number of words in the given language.

We distinguish between two types of words that are not listed on the CDI: low probability words (referred to as idiosyncratic words, estimated via a first correction) and high frequency words, not included when making a stratified selection of words for the CDI (estimated via a second correction).

Table A5 Non-fitted percentiles for total productive vocabulary for girls, boys and both sexes combined

Age (m)  |  Girls: 10th  25th  50th  75th  90th  |  Boys: 10th  25th  50th  75th  90th  |  Both: 10th  25th  50th  75th  90th
16  |  16.9   29.0   57.2    103.7   142.7   |  9.4    11.5   33.7    96.7    131.6   |  10.5   18.0   46.3    98.2    142.7
17  |  18.0   33.7   56.0    156.1   317.9   |  18.0   35.9   64.4    121.3   223.0   |  18.0   34.8   64.4    127.3   269.1
18  |  14.7   46.3   147.5   280.4   597.2   |  11.5   24.5   49.9    141.4   246.3   |  13.6   30.3   83.6    186.7   319.2
19  |  51.2   73.2   185.9   424.4   848.7   |  24.5   46.3   110.5   265.7   538.7   |  27.9   67.1   154.6   388.5   669.0
20  |  46.3   106.3  264.6   425.9   723.9   |  34.8   70.7   144.1   315.4   512.4   |  39.3   78.3   193.6   407.7   683.0
21  |  21.3   82.3   251.6   558.4   974.3   |  39.3   96.7   191.9   475.3   813.4   |  34.8   93.9   211.6   526.3   974.3
22  |  54.8   238.0  492.0   751.1   1062.0  |  30.3   88.8   219.2   533.4   907.3   |  46.3   121.3  388.5   709.5   1042.9
23  |  75.9   245.2  509.0   921.0   1307.8  |  168.1  250.5  451.0   730.2   1230.6  |  108.9  250.5  505.6   868.8   1277.9
24  |  110.5  300.6  574.9   1002.5  1267.9  |  51.2   163.5  403.2   826.6   1139.6  |  75.9   221.1  500.4   914.2   1198.5
25  |  179.3  283.9  776.6   1062.0  1418.1  |  59.6   177.6  488.6   791.7   1002.5  |  103.7  243.1  669.0   990.7   1372.9
26  |  232.9  356.1  955.7   1287.8  1523.8  |  79.9   304.2  757.5   1042.9  1488.6  |  151.8  307.9  774.5   1257.9  1523.8
27  |  229.9  542.2  1069.2  1440.8  1828.1  |  168.1  330.8  649.1   1093.4  1649.2  |  186.7  407.7  833.2   1367.9  1828.1
28  |  384.2  697.2  1074.1  1267.9  1523.8  |  137.4  304.2  675.0   1042.9  1671.7  |  262.4  512.4  833.2   1203.4  1656.7
29  |  250.5  586.0  1191.1  1798.5  1945.7  |  129.7  455.8  848.7   1066.8  1611.6  |  234.9  507.3  941.8   1481.0  1867.4
30  |  560.2  905.0  1400.5  1763.8  2107.4  |  524.6  817.8  1253.0  1686.7  1982.1  |  560.2  873.3  1352.9  1704.1  2032.9

Table A6 Non-fitted percentiles for total vocabulary in comprehension for girls, boys and both sexes combined

Age (m)  |  Girls: 10th  25th  50th  75th  90th  |  Boys: 10th  25th  50th  75th  90th  |  Both: 10th  25th  50th  75th  90th
8   |  3.1    8.4    18.3   43.5   179.9   |  5.2    8.4    25.3   64.4   95.2   |  5.2    8.4    22.8   59.0   138.2
9   |  10.5   19.4   31.2   69.1   145.6   |  6.2    14.9   21.7   54.0   87.7   |  8.4    17.3   27.6   61.7   133.0
10  |  8.4    19.4   47.2   102.5  234.3   |  8.4    19.4   38.6   84.8   141.3  |  8.4    19.4   44.7   95.2   173.6
11  |  36.2   48.5   82.0   153.4  228.3   |  11.6   24.2   51.2   105.9  190.3  |  16.1   39.7   66.1   128.0  212.2
12  |  34.9   69.1   112.0  177.4  223.8   |  18.3   51.2   78.6   129.9  232.8  |  26.3   54.0   93.7   161.5  232.8
13  |  47.2   104.2  168.7  256.1  414.0   |  30.0   59.0   112.0  231.2  330.8  |  38.6   72.8   152.3  249.8  357.5
14  |  55.5   125.0  208.0  371.2  619.0   |  41.2   102.5  237.3  349.7  461.1  |  51.2   115.6  228.3  355.5  501.3
15  |  43.5   96.8   245.1  568.6  735.3   |  77.2   110.2  179.9  448.0  788.2  |  52.6   105.9  231.2  487.7  768.0
16  |  118.4  237.3  441.6  592.5  798.3   |  90.6   158.0  234.3  375.2  575.7  |  118.4  173.6  284.3  531.1  768.0
17  |  173.6  287.8  433.0  563.9  778.1   |  92.1   213.6  359.4  510.4  578.1  |  168.7  228.3  367.3  540.4  648.3
18  |  140.3  317.9  575.7  798.3  1059.3  |  143.5  280.9  510.4  673.0  725.3  |  143.5  319.7  538.1  725.3  1004.8
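As a quick numerical check of the equivalence stated in Equation 1, the sketch below computes the mean CDI score in both ways on a small knowledge matrix (rows are infants, columns are CDI words, 1 means the word is known). The matrix is made up for illustration and the function names are hypothetical.

```python
def subject_based_mean(known):
    """Mean CDI score: average over infants of each infant's word count."""
    return sum(sum(row) for row in known) / len(known)


def item_based_mean(known):
    """Sum over words of the fraction of infants who know each word."""
    n_infants = len(known)
    n_words = len(known[0])
    return sum(
        sum(row[i] for row in known) / n_infants
        for i in range(n_words)
    )


# Toy data: 4 infants, 3 words (not taken from any CDI database).
TOY_MATRIX = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

if __name__ == "__main__":
    print(subject_based_mean(TOY_MATRIX), item_based_mean(TOY_MATRIX))
```

Both functions add up the same matrix entries, only in a different order, which is why the two analyses agree on the mean vocabulary size.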

One can rank words $w_i$ on the list in descending order (i = 1 to i = W), from the words known by most infants (high probability $p(w_i)$) to the words known by a smaller number of infants (low probability $p(w_i)$). We can model the distribution of knowledge using a standard sigmoid function that describes the probability $p(w_i)$ that a word is known given its rank i among other words:

\[ f(w_i) = 100 \left( 1 - \frac{1}{1 + e^{-(i-a)/b}} \right) \approx p(w_i) \tag{3} \]

where $f(w_i)$ is fitted to the raw probabilities $p(w_i)$. This equation has only two free parameters, a and b. Therefore, the optimal parameter values are likely to be unique, and the algorithm for finding them fast and stable. It also reduces the risk of over-fitting the data, since the number of free parameters is much lower than the number of data points used to constrain the optimization. The area under the sigmoidal curve, corresponding to the vocabulary size after applying the first correction, is a simple expression of parameters a and b:

\[ \mathrm{Voc}_{\mathrm{Corr1}} = b \cdot \ln\left(1 + e^{a/b}\right) \tag{4} \]

where ln is the natural logarithm. If parameter b is constant over age groups, the mapping from CDI score to vocabulary size after the first correction is unique. This defines the curve of vocabulary growth as being the set of coordinate pairs $(x(a), y(a))$ where $x(a) = b \cdot \ln\left(\frac{1 + e^{a/b}}{1 + e^{(a-W)/b}}\right)$ and $y(a) = b \cdot \ln\left(1 + e^{a/b}\right)$. The vocabulary size after first correction y(a) as a function of raw CDI score x(a) is then obtained by varying parameter a over a large range of values, negative and positive.
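The closed-form expressions above can be checked numerically. The sketch below is an illustration under stated assumptions: the sigmoid is used as a fraction (Equation 3 without the factor 100), the area is approximated with a simple midpoint rule, W = 680 and b = 180 are the CDI-WS values quoted in Appendix 1, and a = 300 is an arbitrary value chosen for the demo. The function names are hypothetical.

```python
import math


def p_known(i, a, b):
    """Sigmoid of Equation 3 as a fraction (without the factor 100)."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(i - a) / b))


def area_numeric(a, b, upper, steps=20000):
    """Midpoint-rule approximation of the area under the sigmoid
    between rank 0 and rank `upper`."""
    h = upper / steps
    return h * sum(p_known((k + 0.5) * h, a, b) for k in range(steps))


def voc_corr1(a, b):
    """Closed form for the whole area (Equation 4): y(a)."""
    return b * math.log(1.0 + math.exp(a / b))


def cdi_score(a, b, W):
    """Closed form for the area restricted to the W listed words: x(a)."""
    return b * math.log((1.0 + math.exp(a / b)) / (1.0 + math.exp((a - W) / b)))


if __name__ == "__main__":
    a, b, W = 300.0, 180.0, 680.0  # a is arbitrary; b and W as quoted in the text
    print("x(a):", cdi_score(a, b, W), "numeric:", area_numeric(a, b, W))
    print("y(a):", voc_corr1(a, b), "numeric:", area_numeric(a, b, 20000.0))
```

Sweeping a over a wide range and collecting the pairs (cdi_score, voc_corr1) traces the curve of vocabulary size after the first correction against raw CDI score.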

The stratified construction of the CDI means that frequent words are not listed on the CDI. The probability that the CDI lacks a particular word increases with decreasing rank, as the number of potential candidates for inclusion in the CDI increases with decreasing rank. The second underestimate is therefore directly related to the number of words that an infant is reported to know. The fraction of omitted words can be written as:

\[ f_{\mathrm{omission}} = \alpha \cdot \mathrm{Voc}_{\mathrm{Corr1}} \tag{5} \]

The only way to quantify this underestimate is by direct comparison with exhaustive lists of the words that individual infants know, such as that reported by Robinson and Mervis (1999). The correction for the omission of frequent words on the CDI is achieved by fitting the curve defined above to diary data, with the inclusion of the correcting factor α (not to be confused with the sigmoid parameter a): $(x(a), \tilde{y}(a))$, with $\tilde{y}(a) = y(a) \cdot (1 + \alpha \cdot x(a))$.

We can now provide mappings for individual vocabularies or alternatively construct look-up tables:

1. For an individual CDI score $V_j$, find $a_j$ that solves the following relation: $V_j = b \cdot \ln\left(\frac{1 + e^{a_j/b}}{1 + e^{(a_j-W)/b}}\right)$, where ln is the natural logarithm.

2. The total vocabulary size for that infant is $\mathrm{Voc}_j = y(a_j) \cdot (1 + \alpha V_j)$, with $y(a_j) = b \cdot \ln\left(1 + e^{a_j/b}\right)$, where α is the correcting factor of the second correction.
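The two steps above can be sketched as a short Python routine. This is a minimal illustration, not the authors’ implementation: the root of step 1 is found by bisection (any root finder would do), b = 180 and W = 680 are the CDI-WS values quoted in Appendix 1, and the value of the second-correction factor (`alpha` in the code) is a hypothetical placeholder picked so that the output lands close to mid-range entries of Table A1. It is not a value reported in this section.

```python
import math


def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def cdi_score(a, b, W):
    """x(a): expected raw CDI score for sigmoid midpoint a."""
    return b * (softplus(a / b) - softplus((a - W) / b))


def first_correction(a, b):
    """y(a): vocabulary size after the first correction."""
    return b * softplus(a / b)


def solve_a(V, b, W, iterations=200):
    """Step 1: find a such that cdi_score(a) = V, by bisection
    (cdi_score increases monotonically with a)."""
    lo, hi = -50.0 * b, W + 50.0 * b
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cdi_score(mid, b, W) < V:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def total_vocabulary(V, b, W, alpha):
    """Step 2: total vocabulary estimate for a raw CDI score V."""
    if not 0 < V < W:
        raise ValueError("CDI score must lie strictly between 0 and W")
    a = solve_a(V, b, W)
    return first_correction(a, b) * (1.0 + alpha * V)


if __name__ == "__main__":
    B, W = 180.0, 680     # values quoted in the text for the CDI-WS
    ALPHA = 0.00176       # hypothetical placeholder, see the note above
    for score in (5, 100, 300):
        print(score, round(total_vocabulary(score, B, W, ALPHA), 1))
```

A look-up table is then obtained by calling `total_vocabulary` for every CDI score of interest, provided that b, W and the second-correction factor have been estimated for the form and language in question.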
