
Corpus linguistics the basics

Date post: 07-Dec-2014
Upload: jorge-baptista
Description:
Introductory lecture on Corpus Linguistics. Contents: Corpus linguistics: past and present; What is a corpus?; Why use computers to study language?; Corpus-based vs. intuition-based approach; Theory vs. methodology. This lecture was based on McEnery et al. 2006. Corpus-Based Language Studies. An advanced resource book. Routledge.
Corpus Linguistics: the basics * Jorge Baptista, Universidade do Algarve, Portugal; Spoken Language Laboratory L2F, INESC-ID Lisboa. * Based on McEnery et al. (2006): 3ff. Taurida National University, 15 September 2014, BMU-MID Erasmus+ Programme
Transcript
Page 1: Corpus linguistics the basics

Corpus Linguistics: the basics *

Jorge Baptista
Universidade do Algarve, Portugal
Spoken Language Laboratory L2F, INESC-ID Lisboa

* based on McEnery et al. (2006): 3ff.

Taurida National University, 15 September 2014
BMU-MID Erasmus+ Programme

Page 2: Corpus linguistics the basics

Corpus Linguistics: the basics
Plan *

- Past and present
- What is a corpus?
- Why use computers to study language?
- Corpus-based vs. intuition-based approach
- Theory vs. methodology

* Based on McEnery et al. 2006

Page 3: Corpus linguistics the basics

Corpus Linguistics: Past & Present

- «corpus linguistics»: 1980s (Leech 1992)
- origins: field linguists (Boas, Sapir, Bloomfield)
  - shoeboxes and paper slips: simple collections of written or transcribed texts
  - not representative
  - methodology essentially corpus-based
- 1950s and Chomsky's criticisms
  - small size, primarily for distinguishing features in phonetics
  - few linguists used paper-based corpora for grammar (Jespersen, Fries)
  - corpora deemed «skewed»

Page 4: Corpus linguistics the basics

Corpus Linguistics: Past & Present
Today

- technology (personal computer)
  - processing power, massive storage
  - low cost
- first modern corpus
  - Brown corpus (American English), early 1960s
- corpus-based methodology
  - widely popular
  - opens new, promising areas of research
  - applied in every field of linguistics

Page 5: Corpus linguistics the basics

What is a corpus?

A body of currently or naturally occurring language, collected with explicit (linguistic) criteria, with a particular purpose in mind, and structured in view of its representativeness.

Consensus on:
(1) machine-readable
(2) authentic texts
(3) sampled (from a population of texts)
(4) representative (of that population)

Different corpora, different expectations
- balanced, general corpus
- specialized corpus

Page 6: Corpus linguistics the basics

Why use computers to study language?

- speed of processing
- data manipulation
- data storage
- minimal cost
- accurate and consistent processing
- enriched with metadata
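
As a rough illustration of the speed and consistency points above, the following sketch builds a word-frequency list and a small keyword-in-context (KWIC) view. The sample text, the function names and the tokenization rule are assumptions made for this example; they do not come from the lecture.

```python
import re
from collections import Counter

# Tiny invented sample, purely for illustration (not taken from a real corpus).
TEXT = (
    "The corpus is a sample of the language. "
    "A corpus is machine-readable, and the corpus is authentic."
)

def tokenize(text):
    """Lowercase the text and extract word tokens (hyphenated words stay whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def frequency_list(tokens):
    """Count how often each word form occurs."""
    return Counter(tokens)

def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = tokenize(TEXT)
freq = frequency_list(tokens)
kwic = concordance(tokens, "corpus")
```

The same functions give the same counts every time they are run on the same text, which is the sense of "accurate and consistent processing"; a real study would use a proper tokenizer for the language at hand.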

Page 7: Corpus linguistics the basics

Corpus-based vs. Intuition-based approach

intuition-based approach
- purer instances of linguistic phenomena, readily available
- invented examples are free from language-external influences
- includes negative examples

BUT: intuition should be used with much caution!
- possible influence of dialect/idiolect/sociolect
- at least, corpus evidence represents what speakers believe to be acceptable utterances
- invented examples involve monitoring language production, and may not represent typical language use
- introspection is not directly observable

Page 8: Corpus linguistics the basics

Corpus-based vs. Intuition-based approach

STILL:
- not all questions can be addressed by a corpus-based methodology
- no corpus can provide (theoretically fundamental) negative examples: the absence of proof is not proof of absence
- the two methodologies are not mutually exclusive but complementary

in medio uirtus (virtue lies in the middle)

Page 9: Corpus linguistics the basics

Methodology vs. Theory

- methodology: body of techniques and principles, with theoretical status (but not a theory)
- applied in many branches of linguistics
  - Phonetics
  - Morphology
  - Syntax
  - Semantics
  - etc.

Page 10: Corpus linguistics the basics

Representativeness, Balance, and Sampling

Page 11: Corpus linguistics the basics

Representativeness

- (Leech 1991): a corpus is representative of a given language variety if the findings based on its contents can be generalized to that language variety
- (Biber 1993): a corpus is always a sample of a language or language variety (population); sampling is entailed in the compilation of any corpus
- Representativeness
  - balance: the range of genres included in the corpus
  - sampling: how the chunks of each genre are selected

Page 12: Corpus linguistics the basics

Representativeness and text selection

- external criteria (situational): text categories, or GENRES or REGISTERS
- internal criteria (distribution of linguistic features): text types
- change over time: sample / monitor corpus
- general/specialized corpora
  - general language (broad genre coverage)
  - domain-specific or genre-specific (closure)

Page 13: Corpus linguistics the basics

Closure (or saturation)

- defined for a particular linguistic feature (e.g. lexicon size)
- finite or subject to small variation after a certain point

[Figure: share of new words vs. base words across successive samples; scale from 10% to 100%]
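
One way to make closure concrete for lexicon size is to track, sample by sample, what share of the word types has not been seen before. The sketch below assumes each sample is already tokenized; the 5% threshold, the two-sample window and the sample data are hypothetical choices for illustration, not values from the lecture.

```python
def new_type_ratios(samples):
    """For each successive sample, return the share of its word types
    that did not occur in any earlier sample."""
    seen = set()
    ratios = []
    for tokens in samples:
        types = set(tokens)
        new = types - seen
        ratios.append(len(new) / len(types) if types else 0.0)
        seen |= types
    return ratios

def reached_closure(ratios, threshold=0.05, window=2):
    """Hypothetical criterion: each of the last `window` samples
    contributed a share of new types below `threshold`."""
    return len(ratios) >= window and all(r < threshold for r in ratios[-window:])

# Invented samples from a narrow domain (assumption: already tokenized).
samples = [
    ["open", "file", "save", "file"],
    ["save", "file", "print"],
    ["open", "print", "file"],
    ["save", "open", "file"],
]
ratios = new_type_ratios(samples)
closed = reached_closure(ratios)
```

In a domain-specific corpus the ratios are expected to fall towards zero; in general language they usually keep a long tail of new words, so closure may never be reached.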

Page 14: Corpus linguistics the basics

Balance

- range of text categories included in the corpus
- depends on the corpus's intended use (research question)
  - e.g., a general-purpose corpus must include written and oral texts
- a domain-specific corpus can be balanced: wide range of text categories pertaining to the domain
- proportional sampling (scaled-down model of the population)
- balance: more an act of faith than a statement of fact
- more important for a sample corpus than for a dynamic corpus
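
Proportional sampling, as listed above, can be sketched as a simple allocation: each text category receives a share of the corpus equal to its share of the population. The category names and counts below are invented, and the largest-remainder rounding is one possible choice among several.

```python
def proportional_allocation(population_counts, corpus_size):
    """Split `corpus_size` across categories in proportion to their
    share of the population, using largest-remainder rounding."""
    total = sum(population_counts.values())
    exact = {k: corpus_size * v / total for k, v in population_counts.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    leftover = corpus_size - sum(alloc.values())
    # Give the remaining units to the categories with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc

# Invented figures: number of texts per category in a hypothetical population.
population = {"press": 5000, "fiction": 3000, "academic": 1500, "spoken": 500}
quota = proportional_allocation(population, 200)
```

The allocation is only as good as the population counts behind it, which is one reason balance is described as an act of faith rather than a statement of fact.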

Page 15: Corpus linguistics the basics

Sampling

- inevitable, but must match our research question
- scaled-down versions of a population
- sampling unit and population boundaries
  - e.g. sampling unit = book
  - population = assembly of sampling units
  - sampling frame = list of sampling units

Page 16: Corpus linguistics the basics

Sampling

Population can be defined in terms of
- language production (demography: sex, age, social status, etc.)
- language reception (demography)
- language as product (text category/genre)

Sampling techniques
- simple random sampling (all sampling units (s.u.) are numbered and then chosen using a table of random numbers)
- stratified random sampling (all s.u. are divided into homogeneous strata, then the s.u. in each stratum are numbered and only then chosen using a table of random numbers)
- demographic sampling (similar to stratified, but with demographic criteria)

NB: stratified sampling is proportional to the target population
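
The first two techniques can be sketched in a few lines. Here a seeded pseudo-random generator stands in for the table of random numbers, and the sampling frame of 20 books with two genres is invented for the example; the function names are likewise assumptions.

```python
import random

def simple_random_sample(frame, n, seed=0):
    """Draw n sampling units from the frame, without replacement."""
    rng = random.Random(seed)  # stands in for a table of random numbers
    return rng.sample(frame, n)

def stratified_random_sample(frame, stratum_of, n, seed=0):
    """Group units into strata, then draw from each stratum in proportion
    to its size. Rounding means the total may differ slightly from n."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    chosen = []
    for units in strata.values():
        k = max(1, round(n * len(units) / len(frame)))
        chosen.extend(rng.sample(units, min(k, len(units))))
    return chosen

# Hypothetical sampling frame: (title, genre) pairs, 12 fiction and 8 academic.
frame = [(f"book{i:02d}", "fiction" if i < 12 else "academic") for i in range(20)]
srs = simple_random_sample(frame, 5)
strat = stratified_random_sample(frame, lambda unit: unit[1], 5)
```

Demographic sampling would follow the same pattern as the stratified version, with `stratum_of` returning a demographic attribute of the speaker or writer (for example an age band) instead of a genre.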

Page 17: Corpus linguistics the basics

References

Biber, D. 1993. Representativeness in corpus design. Literary and Linguistic Computing 8/4, pp. 243-257.

Leech, G. 1991. The state of the art in corpus linguistics. In K. Aijmer and B. Altenberg (eds.). English Corpus Linguistics. London: Longman, pp. 8-29.

McEnery, Tony. 2003. Corpus Linguistics. In Mitkov, Ruslan (ed.) 2003, pp. 448-463.

McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies. An advanced resource book. Routledge.

Mitkov, Ruslan (ed.) 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

