
Rebalancing corpora Disentangling effects of unstratified sampling and multiple variables in corpus...

Date post: 17-Jan-2018
Transcript
Page 1: Rebalancing corpora Disentangling effects of unstratified sampling and multiple variables in corpus data Sean Wallis Survey of English Usage University.

Rebalancing corpora
Disentangling effects of unstratified sampling and multiple variables in corpus data

Sean Wallis
Survey of English Usage
University College London

Page 5: Motivating questions

• What is meant by the phrase ‘a balanced corpus’?
  – How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data?
• Examples: ICE-GB and DCPSE
  – Should the data have been more sociolinguistically representative, by social class and region?
  – Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?
• Can we compensate for sampling problems in our data analysis?

Page 6: ICE-GB

• British Component of ICE
• Corpus of speech and writing (1990-1992)
  – 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed
• Sampling principles
  – International sampling scheme, including broad range of spoken and written categories
  – But:
    • Adults who had completed secondary education
    • ‘British corpus’ geographically limited
      – speakers mostly from London / SE UK (or sampled there)

Page 7: DCPSE

• Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s)
  – 800,000 words (nominal)
  – London-Lund component annotated as ICE-GB
    • orthographically transcribed and fully parsed
• Created from subsamples of LLC and ICE-GB
  – Matching numbers of texts in text categories
  – Not sampled over equal duration
    • LLC (1958-1977)
    • ICE-GB (1990-1992)
  – Text passages in LLC larger than ICE-GB
    • LLC (5,000 words)
    • ICE-GB (2,000 words)
    • But text passages may include subtexts
      – telephone calls and newspaper articles are frequently short

Page 9: DCPSE

• Representative?
  – Text categories of unequal size
  – Broad range of text types sampled
  – Not balanced by speaker demography

text category              LLC (1960s)      ICE-GB (1990s)   TOTAL
formal face-to-face        46,291 (51)      39,201 (58)      85,492 (109)
informal face-to-face      207,852 (146)    176,244 (398)    384,096 (544)
telephone conversations    25,645 (110)     19,455 (30)      45,100 (140)
broadcast discussions      43,620 (47)      42,002 (101)     85,622 (148)
broadcast interviews       20,359 (12)      21,385 (26)      41,744 (38)
spontaneous commentary     45,765 (50)      48,539 (60)      94,304 (110)
parliamentary language     10,081 (14)      10,226 (58)      20,307 (72)
legal cross-examination    5,089 (4)        4,249 (5)        9,338 (9)
assorted spontaneous       10,111 (8)       10,767 (5)       20,878 (13)
prepared speech            30,564 (14)      32,180 (71)      62,744 (85)
TOTAL                      445,377 (450)    404,248 (818)    849,625 (1,268)

Page 10: A balanced corpus?

• Corpora are reusable experimental datasets
  – Data collection (sampling) should avoid limiting future research goals
  – Samples should be representative
    • What are they representative of?
• Quantity vs. quality
  – Large/lighter annotation vs. small/richer
  – Are larger corpora more (easily) representative?
• Problems for historical corpora
  – Can we add samples to make the corpus more representative?

Page 14: “Representativeness”

• Do we mean representative...
  – of the language? (“random sample”)
    • A sample in the corpus is a genuine random sample of the type of text in the language
  – of text types? (“broad”)
    • Effort made to include examples of all types of language “text types” (including speech contexts)
  – of speaker types? (“stratified”)
    • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category
    • Should subdivide data independently (stratification)

Pages 18-19: Stratified sampling

• Ideal
  – Corpus independently subdivided by each variable
  – Equal subdivisions?
    • Not required
    • Independent variables = constant probability in each subset
      – e.g. proportion of words spoken by women not affected by text genre
      – e.g. same ratio of women:men in age groups, etc.
• What is the reality?
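
The independence criterion on this slide (a constant probability in each subset) can be checked directly from word counts. The following is a minimal sketch under stated assumptions: the counts are invented for illustration and are not ICE-GB or DCPSE figures, and the names `proportions_by_subset` and `is_independent` are hypothetical helpers, not part of any corpus tool.

```python
from typing import Dict, Tuple

# Hypothetical word counts per genre: (words by women, words by men).
# These figures are invented for illustration; they are not corpus data.
counts: Dict[str, Tuple[int, int]] = {
    "conversation": (3000, 6000),
    "lectures": (1000, 2000),
    "letters": (500, 1000),
}


def proportions_by_subset(table: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return the proportion of words produced by women in each subset."""
    return {genre: f / (f + m) for genre, (f, m) in table.items()}


def is_independent(table: Dict[str, Tuple[int, int]], tol: float = 1e-9) -> bool:
    """True if the female proportion is the same (within tol) in every subset."""
    props = list(proportions_by_subset(table).values())
    return max(props) - min(props) <= tol


props = proportions_by_subset(counts)
for genre, p in props.items():
    print(f"{genre}: p(female) = {p:.3f}")
print("independent:", is_independent(counts))
```

Note that the three invented genres differ greatly in size yet still pass the check, which is the point of "Equal subdivisions? Not required": only the ratio has to be constant.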

Page 20: ICE-GB: gender / written-spoken

• Proportion of words in each category spoken by women and men
  – The authors of some texts are unspecified
  – Some written material may be jointly authored
  – female/male ratio varies slightly (φ = 0.02)

[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers/authors for TOTAL, spoken and written]

Page 21: ICE-GB: gender / spoken genres

• Gender variation in spoken subcategories

[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers in each spoken category. Categories: TOTAL spoken; dialogue: private (direct conversations, telephone calls), public (broadcast discussions, broadcast interviews, business transactions, classroom lessons, legal cross-examinations, parliamentary debates); mixed (broadcast news); monologue: scripted (broadcast talks, non-broadcast speeches), unscripted (demonstrations, legal presentations, spontaneous commentaries, unscripted speeches)]

Page 22: ICE-GB: gender / written genres

• Gender variation in written genres

[Figure: bar chart of the proportion p (0 to 1) of words by female authors, male authors and <author unknown/joint> in each written category. Categories: TOTAL written; non-printed: correspondence (business letters, social letters), non-professional writing (student examination scripts, untimed student essays); printed: academic writing (humanities, natural sciences, social sciences, technology), creative writing (novels/stories), instructional writing (administrative/regulatory, skills/hobbies), non-academic writing (humanities, natural sciences, social sciences, technology), persuasive writing (press editorials), reportage (press news reports)]

Page 26: ICE-GB

• Sampling was not stratified across variables
  – Women contribute 1/3 of corpus words
  – Some genres are all male (where specified)
    • speech: spontaneous commentary, legal presentation
    • academic writing: technology, natural sciences
    • non-academic writing: technology, social science
  – Is this representative?
  – When we compare
    • technology writing with creative writing
    • academic writing with student essays
  – are we also finding gender effects?
  – Difficult to compensate for absent data in analysis!

Page 29: Disentangling variables

• When we compare
  – technology writing with creative writing
    • are we also finding gender effects?
• Rebalancing the corpus
  – Subsample the corpus on stratified lines, or mathematically rescale corpus
    • reduces the amount of data
    • what do we do about missing data?
• Rebalancing the dataset

Page 30: Disentangling variables

• When we compare
  – technology writing with creative writing
    • are we also finding gender effects?
• Rebalancing the corpus
• Rebalancing the dataset
• Test contribution of interacting variables
  – Evaluate each independent variable and their interaction in predicting DV
  – cf. analysis of covariance (ANCOVA)
    • but for categorical variables

Page 31: Rebalancing corpora

• Aim: equalise the ratios
  – spoken:written (across m/f)
  – male:female (across sp/w)
• Drawback:
  – throws away information
  – problems with empty subsets
• Methods:
  – random subsampling
  – rescaling
    • counting instances as <1 item

[Figure: 2x2 grid of gender (f, m) against mode (sp, w)]
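
One way the rescaling idea on this slide could be realised is sketched below. This is an illustrative scheme, not necessarily the one the slides intend: each cell is given a weight of at most 1 so that the weighted counts are proportional to (row total × column total), which equalises spoken:written across gender and male:female across mode. The counts are invented, and `rescale_weights` is a hypothetical helper name.

```python
from typing import List


def rescale_weights(table: List[List[float]]) -> List[List[float]]:
    """Per-cell weights (each <= 1) that equalise row and column ratios.

    Target counts are proportional to (row total * column total), i.e. the
    distribution in which the two variables are independent. Weights are
    scaled so the largest is exactly 1, so no instance counts as more than
    one item.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if any(cell == 0 for row in table for cell in row):
        # An empty subset cannot be scaled up: there is no data to reweight.
        raise ValueError("empty subset: cannot rebalance by rescaling")
    ratios = [
        [row_totals[i] * col_totals[j] / cell for j, cell in enumerate(row)]
        for i, row in enumerate(table)
    ]
    largest = max(max(row) for row in ratios)
    return [[r / largest for r in row] for row in ratios]


# Hypothetical word counts (invented): rows = f, m; columns = sp, w.
counts = [[200.0, 100.0], [300.0, 400.0]]
weights = rescale_weights(counts)
rebalanced = [
    [w * n for w, n in zip(w_row, n_row)]
    for w_row, n_row in zip(weights, counts)
]
for label, row in zip(("f", "m"), rebalanced):
    print(label, [round(x, 1) for x in row])
retained = sum(map(sum, rebalanced)) / sum(map(sum, counts))
print("data retained:", round(retained, 3))
```

With these invented counts about a third of the data is discounted, and a zero cell raises an error rather than being silently filled, which mirrors the two drawbacks listed on the slide.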

Page 33: Rebalancing datasets

• Attempting to obtain a balanced corpus is good practice in data-collection
  – avoid zero speakers for each sociolinguistic combination
• But different research questions are likely to obtain different ratios
  – Tensed VP density in DCPSE (Bowie et al 2013)

[Figure: bar chart comparing 1960s and 1990s data by DCPSE text category: formal f-to-f, informal f-to-f, telephone, b discussions, b interviews, commentary, parliament, legal x-exam, assort spont, prepared sp, Total]

Page 36: Accounting for interaction

• Another way of considering the problem
  – We cannot be sure that we are seeing independent effects of two variables
  – Or that the two variables are essentially the same
  – In the worst case the two variables measure the same thing (e.g. m = sp, f = w)

[Figure: diagram relating variables A and B to C]

Page 38: Testing for interaction

• A statistical test checks whether ratios are constant (homogeneity)
  – 2x2 chi-square: χ² = 0
  – Cramér’s φ = √(χ²/kN) = 0
    • k = diagonal - 1
• Can we use χ² to see if an uneven distribution causes the variables to interact?
  – Assume A, B and C are binary variables for simplicity

[Figure: 2x2 grid of gender (f, m) against mode (sp, w)]
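
The homogeneity test on this slide can be sketched as a small function. This is a sketch under stated assumptions rather than the author's implementation: the counts are invented, `chi_square_phi` is a hypothetical name, and "k = diagonal - 1" is read as min(rows, columns) - 1, so k = 1 for a 2x2 table.

```python
import math
from typing import List, Tuple


def chi_square_phi(table: List[List[float]]) -> Tuple[float, float]:
    """Return (chi-square, Cramér's phi) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables are independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1  # assumed reading of "diagonal - 1"
    phi = math.sqrt(chi2 / (k * n))
    return chi2, phi


# Invented counts: rows = f, m; columns = sp, w.
balanced = [[20, 40], [30, 60]]  # sp:w is 1:2 in both rows
skewed = [[40, 20], [30, 60]]    # sp:w differs between rows

chi2_balanced, phi_balanced = chi_square_phi(balanced)
chi2_skewed, phi_skewed = chi_square_phi(skewed)
print(f"balanced: chi2 = {chi2_balanced:.2f}, phi = {phi_balanced:.2f}")
print(f"skewed:   chi2 = {chi2_skewed:.2f}, phi = {phi_skewed:.2f}")
```

When the ratios are constant the observed and expected counts coincide, so both χ² and φ are zero, as the slide states.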

Page 40: Testing for interaction

• We can use χ² to test
  – A → C and B → C

[Figure: two 2x2 tables, reconstructed from the slide; the level labels 1 and 2 are arbitrary]

A × C     A1   A2   total
C1        20   10   30
C2        11   22   33
total     31   32   63
χ² = 6.99, φ = 0.33

B × C     B1   B2   total
C1        21    9   30
C2        10   23   33
total     31   32   63
χ² = 9.91, φ = 0.40
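
The two χ² and φ values on this slide can be reproduced from the 2x2 counts shown (20, 10, 11, 22 for A and 21, 9, 10, 23 for B, with N = 63). The sketch below assumes a plain Pearson χ² with no continuity correction and φ = √(χ²/N) for a 2x2 table; `chi_square` is a hypothetical helper name.

```python
import math


def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


# Counts as printed on the slide.
tables = {"A -> C": [[20, 10], [11, 22]], "B -> C": [[21, 9], [10, 23]]}
results = {}
for name, table in tables.items():
    n = sum(map(sum, table))
    chi2 = chi_square(table)
    phi = math.sqrt(chi2 / n)  # k = 1 for a 2 x 2 table
    results[name] = (chi2, phi)
    print(f"{name}: chi2 = {chi2:.2f}, phi = {phi:.2f}")
# A -> C: chi2 = 6.99, phi = 0.33
# B -> C: chi2 = 9.91, phi = 0.40
```

Both values exceed 3.84, the 5% critical value of χ² with one degree of freedom, so each variable taken alone is significantly associated with C in this example.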

Page 43: Testing for interaction

• We can use χ² to test
  – A → C and B → C
• Now use χ² to test
  – AB → C
• Method
  – Create a 3D table
    • 1 2D ‘layer’ for each value of C
  – Define expected distribution
    • e_abc = n_ab n_c / N
      – expected = no variation across C
      – compensates for uneven sample

[Figure: 3D table of A × B × C, reconstructed from the slide; the level labels 1 and 2 are arbitrary]

Layer C1  B1   B2   total
A1        15    5   20
A2         6    4   10
total     21    9   30 (n_c)

Layer C2  B1   B2   total
A1         4    7   11
A2         6   16   22
total     10   23   33 (n_c)

n_ab      B1   B2   total
A1        19   12   31
A2        12   20   32
total     31   32   63 (N)

Page 45: Testing for interaction

• Method
  – Create a 3D table
    • 1 2D ‘layer’ for each value of C
  – Define expected distribution
    • e_abc = n_ab n_c / N
      – expected = no variation across C
  – Calculate χ² = Σ(o – e)²/e
    • test has single degree of freedom
  – χ² = 13.79, φ = 0.47
  – BUT this tests A or B
    • Subtract χ²(A) and χ²(B)
      – result non-significant (or < 0) → no interaction

[Figure: 3D table of A × B × C. Cells (A1B1, A1B2, A2B1, A2B2): C1 = 15, 5, 6, 4 (n_c = 30); C2 = 4, 7, 6, 16 (n_c = 33); n_ab = 19, 12, 12, 20; N = 63]
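
The method on this slide can be followed step by step in code. The sketch below is an illustration under stated assumptions: the assignment of the eight cell counts to levels of A, B and C is reconstructed from the margins shown in the figure (only the grouping into the two C layers affects the result), and the χ² values for A and B alone are the rounded figures quoted on the earlier slides.

```python
import math

# Observed counts, indexed observed[c][a][b]. The assignment of cells to
# levels is a reconstruction from the slide's margins, not given explicitly.
observed = [
    [[15, 5], [6, 4]],   # layer C1 (n_c = 30)
    [[4, 7], [6, 16]],   # layer C2 (n_c = 33)
]

# Margins: n_c per layer, n_ab summed over both layers, grand total N.
n_c = [sum(sum(row) for row in layer) for layer in observed]
n_ab = [
    [observed[0][a][b] + observed[1][a][b] for b in range(2)]
    for a in range(2)
]
n = sum(n_c)

# Expected e_abc = n_ab * n_c / N: no variation across C, but the uneven
# sampling of A x B combinations is preserved.
chi2_abc = 0.0
for c, layer in enumerate(observed):
    for a in range(2):
        for b in range(2):
            expected = n_ab[a][b] * n_c[c] / n
            chi2_abc += (layer[a][b] - expected) ** 2 / expected

phi_abc = math.sqrt(chi2_abc / n)

# Subtract the separate chi-square scores quoted on the slides.
chi2_a, chi2_b = 6.99, 9.91
remainder = chi2_abc - chi2_a - chi2_b

print(f"chi2(AB -> C) = {chi2_abc:.2f}, phi = {phi_abc:.2f}")
print(f"remainder after subtracting chi2(A) and chi2(B) = {remainder:.2f}")
# chi2(AB -> C) = 13.79, phi = 0.47
# remainder after subtracting chi2(A) and chi2(B) = -3.11
```

The remainder is below zero, which corresponds to the slide's "result non-significant (or < 0)" case: in this example the combination of A and B adds nothing beyond the two variables taken separately.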

Page 47: Conclusions

• Ideal would be that:
  – the corpus was “representative” in all 3 ways:
    • a genuine random sample
    • a broad range of text types
    • a stratified sampling of speakers
  – But these principles are unlikely to be compatible
    • e.g. speaker age and utterance context
• Some compensatory approaches may be employed at research (data analysis) stage
  – what about absent or atypical data?
  – what if we have few speakers/writers?

Page 49: Conclusions

• Data-collection is important
  – Pay attention to stratification in selecting texts/speakers
    • consider replacing texts in outlying categories
  – Justify and document non-inclusion of stratum by evidence
    • e.g. “there are no published articles attributable to authors of this age in this time period”
• But a stratified corpus does not guarantee a stratified dataset
  – need to disentangle effects of variables

Page 50: Conclusions

• Testing for interaction
  – χ² can measure degree to which
    • combination of A and B affects the choice of C
    • use uneven sampling for expected distribution
  – Cramér’s φ is derived from χ²
• Analysis of covariance
  – Subtracting χ² for A and B allows us to test if remaining interaction is significant
    • a significant result means
      – the variables interact to obtain a new result
    • no effect means
      – the variables may be dependent (measure the same thing)

Page 51: References

• Bowie, J., Wallis, S.A., and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) English Modality. Berlin: De Gruyter, 57–94.

Page 52: DCPSE: gender / genre

• DCPSE has a simpler genre categorisation
  – also divided by time

[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers in each DCPSE text category: TOTAL; face-to-face conversations (formal, informal); telephone conversations; broadcast discussions; broadcast interviews; spontaneous commentary; parliamentary language; legal cross-examination; assorted spontaneous; prepared speech]

Page 53: DCPSE: gender / time

• DCPSE has a simpler genre categorisation
  – also divided by time
    • note the gap

[Figure: proportion p (0 to 1) by gender plotted against time, 1958 to 1992]

Page 54: DCPSE: genre / time

• Proportion in each spoken genre, over time
  – sampled by matching LLC and ICE-GB overall
    • this is a ‘stratified sample’ (but only LLC:ICE-GB)
    • uneven sampling over 5-year periods (within LLC)

[Figure: proportion p (0 to 0.6) of each spoken genre plotted against time (1960 to 1990) for informal face-to-face, formal face-to-face, spontaneous commentary, telephone conversations and prepared speech; annotations: ICE-GB, target for LLC]

Page 57: DCPSE

• LLC sampling not stratified
  – Issue not considered, data collected over extended period
  – Some data was surreptitiously recorded
• DCPSE matched samples by ‘genre’
  – Same text category sizes in ICE-GB and LLC
  – But problems in LLC (and ICE) percolate
• No stratification by speaker
  – Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category

