Rebalancing corpora
Disentangling effects of unstratified sampling and multiple variables in corpus data

Sean Wallis
Survey of English Usage
University College London

Motivating questions
• What is meant by the phrase ‘a balanced corpus’?
  – How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data?
• Examples: ICE-GB and DCPSE
  – Should the data have been more sociolinguistically representative, by social class and region?
  – Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?
• Can we compensate for sampling problems in our data analysis?

ICE-GB
• British Component of ICE
• Corpus of speech and writing (1990-1992)
  – 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed
• Sampling principles
  – International sampling scheme, including broad range of spoken and written categories
  – But:
    • Adults who had completed secondary education
    • ‘British corpus’ geographically limited
      – speakers mostly from London / SE UK (or sampled there)

DCPSE
• Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s)
  – 800,000 words (nominal)
  – London-Lund component annotated as ICE-GB
    • orthographically transcribed and fully parsed
• Created from subsamples of LLC and ICE-GB
  – Matching numbers of texts in text categories
  – Not sampled over equal duration
    • LLC (1958-1977)
    • ICE-GB (1990-1992)
  – Text passages in LLC larger than ICE-GB
    • LLC (5,000 words)
    • ICE-GB (2,000 words)
    • But text passages may include subtexts
      – telephone calls and newspaper articles are frequently short

DCPSE
• Representative?
  – Text categories of unequal size
  – Broad range of text types sampled
  – Not balanced by speaker demography

text category             LLC (1960s)      ICE-GB (1990s)   TOTAL
formal face-to-face        46,291 (51)      39,201 (58)      85,492 (109)
informal face-to-face     207,852 (146)    176,244 (398)    384,096 (544)
telephone conversations    25,645 (110)     19,455 (30)      45,100 (140)
broadcast discussions      43,620 (47)      42,002 (101)     85,622 (148)
broadcast interviews       20,359 (12)      21,385 (26)      41,744 (38)
spontaneous commentary     45,765 (50)      48,539 (60)      94,304 (110)
parliamentary language     10,081 (14)      10,226 (58)      20,307 (72)
legal cross-examination     5,089 (4)        4,249 (5)        9,338 (9)
assorted spontaneous       10,111 (8)       10,767 (5)       20,878 (13)
prepared speech            30,564 (14)      32,180 (71)      62,744 (85)
TOTAL                     445,377 (450)    404,248 (818)    849,625 (1,268)
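
The unequal category sizes can be made explicit by converting the word counts in the table into shares of each subcorpus. The following is a minimal Python sketch that uses only the word counts from the table; the names `WORDS`, `SHARES` and `category_shares` are illustrative and do not belong to any corpus tool.

```python
# Word counts per DCPSE text category, copied from the table: (LLC, ICE-GB).
WORDS = {
    "formal face-to-face": (46_291, 39_201),
    "informal face-to-face": (207_852, 176_244),
    "telephone conversations": (25_645, 19_455),
    "broadcast discussions": (43_620, 42_002),
    "broadcast interviews": (20_359, 21_385),
    "spontaneous commentary": (45_765, 48_539),
    "parliamentary language": (10_081, 10_226),
    "legal cross-examination": (5_089, 4_249),
    "assorted spontaneous": (10_111, 10_767),
    "prepared speech": (30_564, 32_180),
}


def category_shares(words):
    """Return {category: (share of LLC words, share of ICE-GB words)}."""
    llc_total = sum(llc for llc, _ in words.values())
    ice_total = sum(ice for _, ice in words.values())
    return {
        cat: (llc / llc_total, ice / ice_total)
        for cat, (llc, ice) in words.items()
    }


SHARES = category_shares(WORDS)
for cat, (llc, ice) in SHARES.items():
    print(f"{cat:<26} LLC {llc:6.1%}   ICE-GB {ice:6.1%}")
```

On these figures, informal face-to-face conversation makes up a little under half of each subcorpus, while legal cross-examination makes up about 1%.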

A balanced corpus?
• Corpora are reusable experimental datasets
  – Data collection (sampling) should avoid limiting future research goals
  – Samples should be representative
    • What are they representative of?
• Quantity vs. quality
  – Large/lighter annotation vs. small/richer
  – Are larger corpora more (easily) representative?
• Problems for historical corpora
  – Can we add samples to make the corpus more representative?

“Representativeness”
• Do we mean representative...
  – of the language? (“random sample”)
    • A sample in the corpus is a genuine random sample of the type of text in the language
  – of text types? (“broad”)
    • Effort made to include examples of all types of language (“text types”), including speech contexts
  – of speaker types? (“stratified”)
    • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category
    • Should subdivide data independently (stratification)

Stratified sampling
• Ideal
  – Corpus independently subdivided by each variable
  – Equal subdivisions?
    • Not required
    • Independent variables = constant probability in each subset
      – e.g. proportion of words spoken by women not affected by text genre
      – e.g. same ratio of women:men in age groups, etc.
• What is the reality?
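
The independence condition on this slide (a constant probability in each subset, with no need for equal subdivisions) can be checked directly from word counts. The sketch below is only an illustration: the counts are hypothetical and are not taken from ICE-GB or DCPSE, and the function names `female_share` and `is_stratified` are invented for this example. It assumes every subset contains at least some words.

```python
def female_share(counts):
    """counts maps subset -> (words by women, words by men)."""
    return {subset: f / (f + m) for subset, (f, m) in counts.items()}


def is_stratified(counts, tolerance=1e-9):
    """True if the female share in every subset equals the overall share."""
    total_f = sum(f for f, _ in counts.values())
    total_m = sum(m for _, m in counts.values())
    overall = total_f / (total_f + total_m)
    return all(
        abs(p - overall) <= tolerance for p in female_share(counts).values()
    )


# Hypothetical word counts, not taken from ICE-GB or DCPSE.
UNEQUAL_BUT_INDEPENDENT = {"genre X": (300, 600), "genre Y": (50, 100)}
SKEWED = {"genre X": (300, 600), "genre Y": (0, 150)}

print(is_stratified(UNEQUAL_BUT_INDEPENDENT))  # True
print(is_stratified(SKEWED))                   # False
```

The first dataset has subsets of very different sizes but the same women:men ratio in each, so it meets the condition; the second contains an all-male genre and does not.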

ICE-GB: gender / written-spoken
• Proportion of words in each category spoken by women and men
  – The authors of some texts are unspecified
  – Some written material may be jointly authored
  – female/male ratio varies slightly (φ = 0.02)
[Figure: proportion p (0 to 1) of words by female and male speakers/writers for TOTAL, spoken and written]

ICE-GB: gender / spoken genres
• Gender variation in spoken subcategories
[Figure: proportion p (0 to 1) of words by female and male speakers in each ICE-GB spoken text category, from TOTAL spoken through the dialogue, mixed and monologue subcategories]

ICE-GB: gender / written genres
• Gender variation in written genres
[Figure: proportion p (0 to 1) of words by female writers, male writers and unknown/joint authors in each ICE-GB written text category, from TOTAL written through the non-printed and printed subcategories]

ICE-GB
• Sampling was not stratified across variables
  – Women contribute 1/3 of corpus words
  – Some genres are all male (where specified)
    • speech: spontaneous commentary, legal presentation
    • academic writing: technology, natural sciences
    • non-academic writing: technology, social science
  – Is this representative?
  – When we compare
    • technology writing with creative writing
    • academic writing with student essays
  – are we also finding gender effects?
  – Difficult to compensate for absent data in analysis!

Disentangling variables
• When we compare
  – technology writing with creative writing
    • are we also finding gender effects?
• Rebalancing the corpus
  – Subsample the corpus on stratified lines, or mathematically rescale corpus
    • reduces the amount of data
    • what do we do about missing data?
• Rebalancing the dataset
• Test contribution of interacting variables
  – Evaluate each independent variable and their interaction in predicting the dependent variable (DV)
  – cf. analysis of covariance (ANCOVA)
    • but for categorical variables

Rebalancing corpora
• Aim: equalise the ratios
  – spoken:written (across m/f)
  – male:female (across sp/w)
• Drawback:
  – throws away information
  – problems with empty subsets
• Methods:
  – random subsampling
  – rescaling
    • counting instances as <1 item
[Figure: 2 × 2 grid of gender (f, m) by mode (sp, w)]
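
The two methods listed on this slide can be sketched in a few lines of Python. This is one simple reading of the slide, not necessarily the procedure used in the talk: it equalises every cell to the size of the smallest cell, which is stronger than strictly needed to equalise the ratios. The cell sizes are hypothetical and the function names are invented for this example.

```python
import random


def rescale_weights(cell_sizes):
    """Weight per cell so that each cell counts as the smallest cell's size.

    Every weight is <= 1, i.e. instances are counted as at most one item.
    """
    smallest = min(cell_sizes.values())
    if smallest == 0:
        # The 'empty subsets' problem: there is nothing to rescale to.
        raise ValueError("empty subset: cannot rebalance without data")
    return {cell: smallest / n for cell, n in cell_sizes.items()}


def random_subsample(cells, seed=0):
    """Randomly cut each cell's list of instances down to the smallest cell."""
    rng = random.Random(seed)
    k = min(len(items) for items in cells.values())
    return {cell: rng.sample(items, k) for cell, items in cells.items()}


# Hypothetical cell sizes for gender (f, m) by mode (sp, w).
SIZES = {("f", "sp"): 200, ("f", "w"): 100, ("m", "sp"): 400, ("m", "w"): 300}
WEIGHTS = rescale_weights(SIZES)

# Hypothetical instances: here just integer identifiers per cell.
CELLS = {cell: list(range(n)) for cell, n in SIZES.items()}
SUBSAMPLE = random_subsample(CELLS)

for cell, weight in WEIGHTS.items():
    print(cell, f"weight = {weight:.2f}", f"subsample size = {len(SUBSAMPLE[cell])}")
```

Both methods discard information: in this example 1,000 instances are reduced to the equivalent of 400, which illustrates the drawback noted on the slide.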

Rebalancing datasets
• Attempting to obtain a balanced corpus is good practice in data-collection
  – avoid zero speakers for any sociolinguistic combination
• But different research questions are likely to obtain different ratios
  – Tensed VP density in DCPSE (Bowie et al. 2013)
[Figure: tensed VP density in each DCPSE text category (formal face-to-face, informal face-to-face, telephone, broadcast discussions, broadcast interviews, commentary, parliament, legal cross-examination, assorted spontaneous, prepared speech, Total), 1960s vs. 1990s]

Accounting for interaction
• Another way of considering the problem
  – We cannot be sure that we are seeing independent effects of two variables
  – Or that the two variables are essentially the same
  – In the worst case the two variables measure the same thing (e.g. m = sp, f = w)
[Figure: diagram of variables A and B in relation to C]

Testing for interaction
• A statistical test checks whether ratios are constant (homogeneity)
  – 2 × 2 chi-square: $\chi^2 = 0$
  – Cramér’s $\phi = \sqrt{\chi^2 / kN} = 0$
    • k = diagonal - 1 (k = 1 for a 2 × 2 table)
• Can we use χ² to see if an uneven distribution causes the variables to interact?
  – Assume A, B and C are binary variables for simplicity
[Figure: 2 × 2 grid of gender (f, m) by mode (sp, w)]

Testing for interaction
• We can use χ² to test
  – A → C and B → C

A × C            values of A    total
values of C       20     10       30
                  11     22       33
total             31     32       63 (N)
χ² = 6.99, φ = 0.33

B × C            values of B    total
values of C       21      9       30
                  10     23       33
total             31     32       63 (N)
χ² = 9.91, φ = 0.40
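
The two results on this slide can be reproduced from the cell frequencies. The sketch below is a plain-Python illustration, assuming the standard chi-square formula with expected values taken from the row and column totals, and φ = √(χ²/kN) with k = 1 for a 2 × 2 table; the function and variable names are invented for this example.

```python
from math import sqrt


def chi_square(table):
    """Chi-square of a 2D contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def cramers_phi(table):
    """phi = sqrt(chi2 / kN), with k = length of the diagonal - 1."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (k * n))


# Rows are the two values of C; columns are the two values of A (or B).
A_BY_C = [[20, 10], [11, 22]]
B_BY_C = [[21, 9], [10, 23]]

for name, table in (("A", A_BY_C), ("B", B_BY_C)):
    print(f"{name}: chi2 = {chi_square(table):.2f}, phi = {cramers_phi(table):.2f}")
# A: chi2 = 6.99, phi = 0.33
# B: chi2 = 9.91, phi = 0.40
```

A table with constant ratios, such as `[[10, 20], [20, 40]]`, gives χ² = 0 and φ = 0, which is the homogeneity case.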

Testing for interaction
• We can use χ² to test
  – A → C and B → C
• Now use χ² to test
  – AB → C
• Method
  – Create a 3D table
    • 1 2D ‘layer’ for each value of C
  – Define expected distribution
    • $e_{abc} = n_{ab} n_c / N$
      – expected = no variation across C
      – compensates for uneven sample
  – Calculate $\chi^2 = \sum (o - e)^2 / e$
    • test has single degree of freedom

3D table (one layer per value of C; columns are the four combinations of A and B):
values of C    a1,b1   a1,b2   a2,b1   a2,b2   n_c
first            15       5       6       4     30
second            4       7       6      16     33
n_ab             19      12      12      20     63 (N)
χ² = 13.79, φ = 0.47

Testing for interaction
• Method
  – Create a 3D table
    • 1 2D ‘layer’ for each value of C
  – Define expected distribution
    • $e_{abc} = n_{ab} n_c / N$
      – expected = no variation across C
  – Calculate $\chi^2 = \sum (o - e)^2 / e$
    • test has single degree of freedom
  – χ² = 13.79, φ = 0.47
  – BUT this tests A or B
    • Subtract χ²(A) and χ²(B)
      – result non-significant (or < 0): no interaction
[Figure: the 3D table of observed frequencies for A, B and C]
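
The method on this slide can be followed step by step in code. The sketch below is an illustration only: the cell frequencies are those shown in the slide's 3D table, the expected distribution is $e_{abc} = n_{ab} n_c / N$ as stated, and φ is computed with k = 1, which is what the slide's figure of 0.47 implies. The function and variable names are invented for this example, and the 3.841 threshold (0.05 level, one degree of freedom) is an assumed choice of significance level.

```python
from math import sqrt

# Observed frequencies OBSERVED[c][a][b]: one 2D layer (A by B) per value of C.
OBSERVED = [
    [[15, 5], [6, 4]],   # first value of C (30 cases)
    [[4, 7], [6, 16]],   # second value of C (33 cases)
]


def chi_square_2d(table):
    """Ordinary chi-square of a 2D contingency table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum(
        (o - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i, r in enumerate(table)
        for j, o in enumerate(r)
    )


def chi_square_ab_c(obs):
    """Chi-square of AB -> C, with expected e_abc = n_ab * n_c / N."""
    n_c = [sum(sum(row) for row in layer) for layer in obs]
    n = sum(n_c)
    total = 0.0
    for a in range(len(obs[0])):
        for b in range(len(obs[0][0])):
            n_ab = sum(layer[a][b] for layer in obs)
            for c, layer in enumerate(obs):
                expected = n_ab * n_c[c] / n
                total += (layer[a][b] - expected) ** 2 / expected
    return total


# Marginal 2D tables (rows = values of C), obtained by summing over B or A.
A_BY_C = [[sum(row) for row in layer] for layer in OBSERVED]
B_BY_C = [[sum(col) for col in zip(*layer)] for layer in OBSERVED]

N = sum(sum(row) for layer in OBSERVED for row in layer)
CHI_ABC = chi_square_ab_c(OBSERVED)
CHI_A = chi_square_2d(A_BY_C)
CHI_B = chi_square_2d(B_BY_C)
REMAINDER = CHI_ABC - CHI_A - CHI_B

CRITICAL_1DF_05 = 3.841  # assumed significance level: 0.05, 1 degree of freedom
print(f"chi2(AB) = {CHI_ABC:.2f}, phi = {sqrt(CHI_ABC / N):.2f}")  # 13.79, 0.47
print(f"remainder = {REMAINDER:.2f}")                              # -3.10
print("interaction" if REMAINDER > CRITICAL_1DF_05 else "no interaction")
```

For these data the remainder is negative, so subtracting the separate effects of A and B leaves nothing to attribute to their interaction.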

Conclusions
• Ideal would be that:
  – the corpus was “representative” in all 3 ways:
    • a genuine random sample
    • a broad range of text types
    • a stratified sampling of speakers
  – But these principles are unlikely to be compatible
    • e.g. speaker age and utterance context
• Some compensatory approaches may be employed at research (data analysis) stage
  – what about absent or atypical data?
  – what if we have few speakers/writers?

Conclusions
• Data-collection is important
  – Pay attention to stratification in selecting texts/speakers
    • consider replacing texts in outlying categories
  – Justify and document non-inclusion of stratum by evidence
    • e.g. “there are no published articles attributable to authors of this age in this time period”
• But a stratified corpus does not guarantee a stratified dataset
  – need to disentangle effects of variables

Conclusions
• Testing for interaction
  – χ² can measure degree to which
    • combination of A and B affects the choice of C
    • use uneven sampling for expected distribution
  – Cramér’s φ is derived from χ²
• Analysis of covariance
  – Subtracting χ² for A and B allows us to test if remaining interaction is significant
    • a significant result means
      – the variables interact to obtain a new result
    • no effect means
      – the variables may be dependent (measure the same thing)

References
• Bowie, J., Wallis, S.A., and Aarts, B. 2013. Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H.J. and van der Auwera, J. (eds.) English Modality. Berlin: De Gruyter, 57–94.

DCPSE: gender / genre
• DCPSE has a simpler genre categorisation
  – also divided by time
[Figure: proportion p (0 to 1) of words by female and male speakers in each DCPSE text category: TOTAL, face-to-face conversations (formal, informal), telephone conversations, broadcast discussions, broadcast interviews, spontaneous commentary, parliamentary language, legal cross-examination, assorted spontaneous, prepared speech]

DCPSE: gender / time
• DCPSE has a simpler genre categorisation
  – also divided by time
    • note the gap
[Figure: proportion p (0 to 1) by gender plotted against time, 1958 to 1992]

DCPSE: genre / time
• Proportion in each spoken genre, over time
  – sampled by matching LLC and ICE-GB overall
    • this is a ‘stratified sample’ (but only LLC:ICE-GB)
    • uneven sampling over 5-year periods (within LLC)
[Figure: proportion p (0 to 0.6) of data in each genre (informal face-to-face, formal face-to-face, spontaneous commentary, telephone conversations, prepared speech) plotted against time, 1960 to 1990; annotations: ICE-GB, target for LLC]

DCPSE
• LLC sampling not stratified
  – Issue not considered, data collected over extended period
  – Some data was surreptitiously recorded
• DCPSE matched samples by ‘genre’
  – Same text category sizes in ICE-GB and LLC
  – But problems in LLC (and ICE) percolate
• No stratification by speaker
  – Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category