Corpus Linguistics: the basics *
Jorge Baptista
Universidade do Algarve, Portugal
Spoken Language Laboratory L2F, INESC [email protected]
* based on McEnery et al. (2006): 3ff.
Université Nationale de Taurida, 15 September 2014
BMU-MID Erasmus+ Programme
Corpus Linguistics: the basics
Plan *
- Past and Present
- What is a corpus?
- Why use computers to study language?
- Corpus-based vs. Intuition-based approach
- Theory vs. Methodology
* Based on McEnery et al. 2006
Corpus Linguistics: Past & Present
- the term «corpus linguistics»: 80's (Leech 1992)
- origins: field linguists (Boas, Sapir, Bloomfield)
  - shoeboxes and paper slips: simple collections of written or transcribed texts
  - not representative
  - methodology essentially corpus-based
- 1950's and Chomsky's criticisms
  - small size, primarily for distinguishing features in Phonetics
  - few linguists used paper-based corpora for grammar (Jespersen, Fries)
  - corpora deemed «skewed»
Corpus Linguistics: Past & Present
- Today
  - technology (personal computer)
  - processing power, massive storage
  - low cost
- First modern corpus
  - Brown corpus (American English), early 60's
- Corpus-based methodology
  - widely popular
  - opens up new, promising areas of research
  - applied in every field of linguistics
What is a corpus?
- A body of currently or naturally occurring language,
  - collected with explicit (linguistic) criteria
  - with a particular purpose in mind
  - and structured in view of its representativeness
- Consensus on:
  (1) machine-readable
  (2) authentic texts
  (3) sampled (from a population of texts)
  (4) representative (of that population)
- Different corpora, different expectations
  - balanced, general corpus
  - specialized corpus
Why use computers to study language?
- speed of processing
- data manipulation
- data storage
- minimal cost
- accurate and consistent processing
- enriched with metadata
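To make the points about speed and consistent processing concrete, here is a minimal sketch of a word-frequency count. The function name `word_frequencies`, the simple regular-expression tokenizer, and the toy text are illustrative assumptions, not part of the original lecture; a real corpus tool would use a more careful tokenizer.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter of lowercased word tokens found in `text`."""
    # Assumption: a token is a run of ASCII letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy text standing in for a corpus file (hypothetical example data).
toy_corpus = "The cat sat on the mat. The dog sat on the log."
freq = word_frequencies(toy_corpus)

# Print the three most frequent word forms with their counts.
print(freq.most_common(3))
```

The same function gives the same counts every time it is run on the same text, which is the kind of consistency that manual counting on paper slips cannot guarantee.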
Corpus-based vs. Intuition-based approach
- intuition-based approach
  - purer instances of linguistic phenomena
  - readily available
  - invented examples are free from language-external influences
  - can include negative examples
- BUT: intuition should be used with much caution!
  - possible influence of dialect/idiolect/sociolect
  - corpus evidence, at least, represents what speakers believe to be acceptable utterances
  - invented examples involve monitoring language production and may not represent typical language use
  - introspection is not directly observable
Corpus-based vs. Intuition-based approach
STILL:
- not all questions can be addressed by a corpus-based methodology
- no corpus can provide (theoretically fundamental) negative examples: absence of evidence is not evidence of absence
- the two methodologies are not mutually exclusive but complementary
- in medio uirtus (virtue lies in the middle)
Methodology vs. Theory
- methodology: a body of techniques and principles, with theoretical status (but not a theory)
- applied in many branches of Linguistics
  - Phonetics
  - Morphology
  - Syntax
  - Semantics
  - etc.
Representativeness, Balance, and Sampling
Representativeness
- (Leech 1991): a corpus is representative of a given language variety if the findings based on its contents can be generalized to that language variety
- (Biber 1993): a corpus is always a sample of a language or language variety (population); sampling is entailed in the compilation of any corpus
- Representativeness is determined by:
  - balance: the range of genres included in the corpus
  - sampling: how the chunks of each genre are selected
Representativeness and text selection
- external criteria (situational): text categories, or GENRES or REGISTERS
- internal criteria (distribution of linguistic features): text types
- change over time: sample / monitor corpus
- general / specialized corpora
  - general language (broad genre coverage)
  - domain-specific or genre-specific (closure)
Closure (or saturation)
- defined for a particular linguistic feature (e.g. lexicon size)
- finite or subject to small variation after a certain point
[Figure: percentage of New words vs. Base words (10% to 100%) across successive samples]
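The idea of closure can be sketched in a few lines: for each successive sample, measure what share of its word types has not been seen in any earlier sample. When that share stays close to zero, the lexicon is approaching closure. The function name `closure_curve` and the three toy samples are hypothetical, chosen only to illustrate the measurement; this is not the procedure used to produce the chart on the slide.

```python
def closure_curve(samples):
    """For each successive sample (a list of tokens), return the share of
    its word types that did not occur in any earlier sample (0.0 to 1.0)."""
    seen = set()   # word types met in earlier samples ("base words")
    curve = []
    for tokens in samples:
        types = set(tokens)
        new_types = types - seen   # "new words" contributed by this sample
        curve.append(len(new_types) / len(types) if types else 0.0)
        seen |= types
    return curve

# Toy samples from a narrow, domain-specific variety (hypothetical data).
samples = [
    "open the file and save the file".split(),
    "save the file and close the file".split(),
    "open the file and close the file".split(),
]
print(closure_curve(samples))
```

In a restricted domain the curve is expected to drop quickly, as in the toy data; in general language it typically falls much more slowly.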
Balance
- range of text categories included in the corpus
- depends on the intended use of the corpus (research question)
  - e.g., a general-purpose corpus must include written and oral texts
  - a domain-specific corpus can be balanced: wide range of text categories pertaining to the domain
- proportional sampling (scaled-down model of the population)
- balance: more an act of faith than a statement of fact
- more important for a sample corpus than for a dynamic corpus
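Proportional sampling, in the sense of a scaled-down model of the population, can be illustrated as follows. The category names and counts are invented for the example, and `proportional_allocation` is a hypothetical helper, not a function from any corpus toolkit.

```python
def proportional_allocation(population_counts, corpus_size):
    """Give each text category a share of the corpus proportional to its
    share of the population. Rounding means the shares may not add up
    exactly to `corpus_size` for arbitrary inputs."""
    total = sum(population_counts.values())
    return {
        category: round(corpus_size * count / total)
        for category, count in population_counts.items()
    }

# Hypothetical population: number of texts available per category.
population = {"press": 5000, "fiction": 3000, "academic": 2000}
allocation = proportional_allocation(population, corpus_size=500)
print(allocation)
```

The result keeps the 5:3:2 ratio of the population, which is what makes the corpus a scaled-down model rather than an arbitrary mix of categories.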
Sampling
- inevitable, but must match our research question
- scaled-down versions of a population
- sampling unit and population boundaries
  - e.g. sampling unit = book
  - population = assembly of sampling units
  - sampling frame = list of sampling units
Sampling
- Population can be defined in terms of
  - language production (demography: sex, age, social status, etc.)
  - language reception (demography)
  - language as product (text category/genre)
- Sampling techniques
  - simple random sampling (all sampling units (s.u.) are numbered and then chosen using a table of random numbers)
  - stratified random sampling (all s.u. are divided into homogeneous strata; then each s.u. in each stratum is numbered and only then chosen using a table of random numbers)
  - demographic sampling (similar to stratified, but using demographic criteria)
- NB: stratified sampling is proportional to the target population
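The two random sampling techniques can be sketched as below. This is only an illustration under stated assumptions: Python's pseudo-random generator stands in for the table of random numbers, the sampling frame is a toy list of (category, id) pairs, and the names `simple_random_sample` and `stratified_random_sample` are hypothetical.

```python
import random

def simple_random_sample(frame, k, seed=0):
    """Choose k sampling units at random from the whole sampling frame."""
    rng = random.Random(seed)   # fixed seed so the draw can be repeated
    return rng.sample(frame, k)

def stratified_random_sample(frame, stratum_of, fraction, seed=0):
    """Divide the frame into strata, then draw the same fraction of units
    at random from each stratum (at least one unit per stratum)."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    chosen = []
    for name in sorted(strata):   # fixed order keeps the draw repeatable
        units = strata[name]
        k = max(1, round(len(units) * fraction))
        chosen.extend(rng.sample(units, k))
    return chosen

# Toy sampling frame: 60 press texts and 40 fiction texts (hypothetical).
frame = [("press", i) for i in range(60)] + [("fiction", i) for i in range(40)]

srs = simple_random_sample(frame, 10)
strat = stratified_random_sample(frame, lambda unit: unit[0], fraction=0.1)
print(len(srs), len(strat))
```

With simple random sampling, the split between press and fiction in the 10 chosen units varies from draw to draw. With stratified sampling at a fixed fraction, each stratum contributes in proportion to its size (here 6 press and 4 fiction texts), which matches the note that stratified sampling is proportional to the target population.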
References
Biber, D. 1993. Representativeness in corpus design. Literary and Linguistic Computing 8/4, pp. 243-257.
Leech, G. 1991. The state of the art in corpus linguistics. In K. Aijmer and B. Altenberg (eds.). English Corpus Linguistics. London: Longman, pp. 8-29.
McEnery, Tony. 2003. Corpus Linguistics. In Mitkov, Ruslan (ed.) 2003, pp. 448-463.
McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies. An advanced resource book. Routledge.
Mitkov, Ruslan (ed.) 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.