Seminar on Corpus Linguistics RED_CORPUS PROJECT 9/7/2010 Amaya Mendikoetxea
Page 1: Seminar on Corpus Linguistics - Presentaciónlingesp.usc.es/sites/default/files/mendikoetxea-present.pdf · confluence of two previously disparate fields: corpus linguistics and SLA

Seminar on Corpus Linguistics

RED_CORPUS PROJECT

9/7/2010

Amaya Mendikoetxea


Background

• Much current SLA research relies on elicited experimental data and disfavours natural language use data.

• This situation, however, is beginning to change thanks to the availability of computerized linguistic databases and learner corpora.

• The area of linguistic inquiry known as learner corpus research has recently come into being as a result of the confluence of two previously disparate fields: corpus linguistics and SLA (see Granger 2002).

General purpose: to discuss the relation between these two fields of study and, in particular, the contribution of corpus linguistics to the field of SLA.


Aims of the talk:

To evaluate the impact of learner corpus research on recent SLA theory:

- The use of corpora as a source of learner data

- The type of research conducted using learner corpora.

To describe the key design principles of a learner corpus (WriCLE) and its exploitation for SLA research purposes.

To point out future challenges for learner corpus research.


PART I

Learner Language

Learner Corpora &

Learner Corpus Research


Learner Language (1)

Until the late 1960s most people regarded second language learners’ speech as an incorrect version of the TL.

Errors were believed to be the result mainly of transfer from L1.

Contrastive Analysis (CA) was the basis for identifying differences between L1 and L2.


Learner Language (2)

As a result of the finding that many aspects of learners’ errors could not be explained by the Contrastive Analysis Hypothesis (CAH), researchers began to take a different approach to analysing learners’ errors.

In the 1970s Error Analysis (EA) replaced CA as the dominant paradigm.

EA was based on the assumption that, like child language, L2 learners’ language was a system in its own right – rule-governed and predictable: an INTERLANGUAGE.

Interlanguages are systematic, but they are also dynamic, continually evolving as learners receive more input and revise their hypotheses about the L2.

An L2 learner, at any particular moment in his learning sequence, is using a language system which is neither the L1, nor the L2. It is a third language, with its own grammar, its own lexicon, etc. which may or may not be related to the learner’s L1: an Interlanguage. (Selinker 1972)


Gathering learner data (1)

SLA research has traditionally drawn on a variety of data types, among which Ellis (1994: 670) distinguishes three major types:

Language use (comprehension & production).

Metalinguistic judgements (e.g. acceptability judgements).

Introspective methods (e.g. self-report).

There is no single prescribed elicitation measure, nor is there a ‘right’ or ‘wrong’ elicitation measure, although many research paradigms have common measures associated with them.

The choice of one measure over another is highly dependent on the research questions asked and may also be related to the theoretical framework within which research is conducted.

Findings in SLA research are highly dependent on the data collection (often known as data elicitation) measures used.


Gathering learner data (2)

Second Language Research, Vol 23 (2007) & Vol 25 (2009):

Natural: 4 out of 20 studies:

2007:

- 46 recordings of 40-90 minutes (no info on coding, etc.) to study tense and aspect;

- transcriptions of lessons (9 learners) for the acquisition of German syntax (annotated with CHAT);

2009:

- 6 sessions of oral data collection (interviews and communicative task) (2 learners) for the study of the role of morphology in English L2;

- spoken corpus (16 learners, 20 natives) for a phonology study.

Experimental: map task (communicative task), written translation task, word association test, sentence-matching, acceptability judgements, picture priming task, neuroimaging, functional magnetic resonance.

Much current SLA research favours experimental, metalinguistic and introspective data, and tends to be dismissive of natural language use data.

[Granger 2002]


Gathering learner data (3)

SLA: interdisciplinary, diverse from a theoretical and empirical perspective (methodology and data)

► Aim of SLA research: To build models of:

The underlying mental representations of learners at a particular stage in the process of L2 learning.

The developmental processes which shape and constrain L2 production.

“The language produced by learners, whether spontaneously or through various elicitation procedures, remains a central source of evidence for these mental processes, and the success of SLA research therefore relies on having access to good quality data.” [Myles 2005: 3724]

Cognitive perspective: generative SLA, competition model, cognitive linguistics (usage-based models)


Gathering learner data (4)

Why are elicitation techniques favoured in SLA research?

1. The particular structure you want to investigate may not occur in natural production: it may be absent or there may not be enough instances.

2. To answer your research question you may need to know what learners rule out as a possible L2 sentence.

Presence of a particular structure/feature in the learners’ natural output does not necessarily indicate that the learners ‘know’ the structure.

Absence of a particular structure/feature in natural language use data does not necessarily indicate that learners do not ‘know’ the structure.

In addition, if learners do not use a form at all, we cannot assume that they do not know that form unless they consistently fail to use it in a required context.


Gathering learner data (5)

3. It is difficult to control the variables that affect learner output in a non-experimental setting (Granger 2002: 6).

Consequence

“As it is difficult to subject a large number of informants to experimentation, SLA research tends to be based on a relatively narrow empirical base, focusing on the language of a very limited number of subjects, which consequently raises questions about the generalizability of the results.” [Granger 2002: 6]


Gathering learner data (6)

►Why do we need corpora in SLA research?

To test current hypotheses on larger and better constructed datasets in order to see if SLA findings can be generalized [as in L1 acquisition]

To find sets of data not normally found in small studies: structures which are crucial to inform current debates [large datasets]

[Myles 2005]

Example: the debate on where optionality/variation is located in L2 interlanguage [computational system or syntax or interfaces between e.g. syntax and discourse]


Gathering learner data (7)

►Why do we need corpora in SLA research?

To discover patterns that may influence the learner (in the input language, the target language, the learners’ L1) or directly in the learner’s output (interlanguage patterns).

To study frequency (particularly interesting for usage-based approaches and input driven models of L2 acquisition).

[Gries 2008]


Gathering learner data (8)

►Is there a dichotomy?

“...there is actually no strict corpora-experiments dichotomy. Rather, just as linguistic data in general form a continuum of naturalness of production/collection, so do corpora: they vary along the above dimensions, which results in a continuum ranging from prototypical corpora via less typical corpora to corpora whose compilation is distinctly experimental in nature.”

[Gilquin & Gries 2009: 6]

Corpora:

- No manipulation of variables.

- The data do not undergo variation at the time the researcher queries the corpus.

- High external validity (naturalness).

- Noise in the data.

Experimental studies:

- Independent variables are manipulated to have an effect on dependent variables.

- There is considerable variation between subjects, stimuli, stimulus presentations.

- High internal validity but (sometimes) lower external validity (artificiality).

- Noise is minimised.


Learner corpora (1): Definition

Rough definition of learner corpora:

Electronic collections of learner data.

This definition is too broad and fuzzy: it leads to the term being used for data types which are in effect not corpora.

Granger’s (2002: 7) definition of learner corpora (based on Sinclair’s 1996 definition of corpora):

Computer learner corpora are electronic collections of authentic FL/SL textual data assembled according to explicit design criteria for a particular SLA/FLT purpose. They are encoded in a standardised and homogeneous way and are documented as to their origin and provenance.


Learner corpora (2): FL/SL data

FL/SL data = produced by learners. BUT what do we mean by “learner” corpora?

• Learner corpora are to be considered as a subtype of non-native corpora.

Like the concept of native speaker corpora (McEnery & Wilson 2001: 29ff.), the concept of learner corpora is thus prototypical, i.e. there are more and less typical learner corpora. [Nesselhauf 2004: 128]

Learner corpus data contain continuous stretches of (oral or written) discourse: not sentences extracted out of a text, or isolated words.


Learner corpora (3): Types

Commercial:

Longman Learners’ Corpus (10 million words).

Cambridge Learner Corpus (16 million words).

Academic:

The Hong-Kong University of Science and Technology Learner Corpus (25 million words).

International Corpus of Learner English (ICLE) (2.5 million words).

French Language Learner Oral Corpus (FLLOC) (2 million words) and Spanish Language Learner Oral Corpus (SPLLOC).

Written Corpus of Learner English (WriCLE) (1 million words).

Corpus Escrito del Español como L2 (CEDEL2) (1 million words).

Monolingual vs. Bilingual (parallel corpora)

General vs. Technical (ESP)

Synchronic vs. Diachronic (longitudinal)

Written vs. Spoken

[Granger 2002: 11]


Learner corpus research (1)

It uses the main principles, tools and methods from corpus linguistics.

It aims to provide improved descriptions of learner language which can be used:

For a wide range of purposes in SLA (hypothesis-testing, hypothesis-finding, frequency, patterns, etc.)

To improve FLT

o Most scholars argue for a strict separation between SLA and language pedagogy.

o The two areas overlap where pedagogy affects (positively) the course of L2 acquisition.

o A well designed learner corpus can serve both a variety of SLA research purposes and pedagogical purposes, but the connection between the two is not straightforward.


Learner corpus research (2)

Although corpora are but one source of evidence among many, complementing rather than replacing other data sources, there is general agreement today that they are “the only reliable source of evidence for features such as frequency” [McEnery & Wilson 1996: 12].

Thus, corpus linguistics has contributed to the discovery of new facts which “have led to far-reaching hypotheses about language, for example about the co-selection of lexis and syntax” [Stubbs 1996: 232].

The major obvious strength of the computer corpus methodology lies in its suitability for conducting quantitative analyses.

Corpus linguistics lends itself mostly to an inductive, exploratory approach to language analysis, but corpora can also be used within deductive/formal approaches for hypothesis-testing.


Learner corpus research (3): Methodology

Gries (2008: sec 4): Three different kinds of corpus linguistics methods for L2 research:

1) Frequency lists and collocate lists or collocations

2) Colligations and collostructions

Co-occurrence of lexical items with a particular grammatical element or structure

3) Concordances of search expressions


Learner corpus research (4): Methodology

Gries (2008: sec 4.1)

1) Frequency lists and collocate lists or collocations

Assumption: it is more useful for learners to learn first those units that are more important (=more frequent) in the target variety.

Practical Applications:

a) Syllabus or curriculum development (irregular verbs, modality markers, progressive aspect...)

b) Using frequency lists to quantify and/or compare the attainment of language proficiency (lexical richness, development of vocabulary...)

More theoretical applications:

a) ‘Approximations’ of the frequencies of elements in the input to the L2 learner or in the target language.

b) To pick experimental stimuli.

c) Using collocations (n-grams) to compare use of formulae in native and non-native language.
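The frequency-based methods listed above can be illustrated with a minimal Python sketch. It builds a frequency list, counts n-grams, and computes a type-token ratio as one crude index of lexical richness. The function names, the simple regular-expression tokenizer, and the toy sentence are assumptions made for illustration; they are not taken from Gries (2008) or from any specific corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()

def ngrams(tokens, n=2):
    """Return a Counter of contiguous n-word sequences."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def type_token_ratio(tokens):
    """Distinct word forms divided by running words (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented toy sentence, not corpus data.
sample = "They feel important in these groups and they feel safe in these groups"
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])        # most frequent word forms
print(ngrams(tokens, 2).most_common(2))  # most frequent bigrams
print(round(type_token_ratio(tokens), 2))
```

Running the same functions over a native and a learner corpus yields the two lists that a comparison of vocabulary or formulae would start from. The type-token ratio is sensitive to text length, so texts of similar size should be compared.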


Learner corpus research (5): Methodology

Gries (2008: sec 4.2)

2) Colligations and collostructions

a) Studying grammatical proficiency (lexicon-syntax interface): e.g. verb subcategorization patterns (Tono 2004).

b) Approximations to input frequencies and target frequencies of lexico-grammatical co-occurrences.

3) Concordances

Syntactic constructions, semantic prosodies, etc.

- Morphemes and words are easy to retrieve; more complex syntactic patterns are more problematic.

- In the absence of well-annotated corpora, searches often rely on retrieval of lexical units and require extensive post-editing.
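A concordance of a search expression can be sketched in a few lines of Python. The example below is a minimal keyword-in-context (KWIC) display over raw, unannotated text, which is the situation described above: retrieval works on lexical units, and the hits still need manual post-editing. The function name `kwic` and its `width` parameter are hypothetical; the two sample sentences are corrected sentences from the INTELeNG table in these slides.

```python
import re

def kwic(text, pattern, width=30):
    """Return one keyword-in-context line per regex match of `pattern`."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

sample = ("Older children take care of the younger ones. "
          "Thousands of children are abandoned every year.")

for line in kwic(sample, r"\bchildren\b", width=20):
    print(line)
```

Because the search is purely lexical, a syntactic pattern such as subject-verb inversion cannot be retrieved this way without part-of-speech or syntactic annotation.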


Learner corpus research (6): Methodology

Contrastive Interlanguage Analysis (CIA)

NS vs. NNS comparisons: non-native vs. native data.

It involves a detailed analysis of linguistic features in native and non-native corpora to uncover and study non-native features in the speech and writing of (advanced) non-native speakers. This includes errors, but it is conceptually wider as it seeks to identify overuse and underuse of certain linguistic features and patterns (Granger 2002: 12-13).

NNS vs. NNS: different non-native data.

By comparing learner data from different L1 backgrounds, we can gain a better understanding of interlanguage processes and features, such as those which are the result of transfer or those which are developmental, common to learners with different L1.

[Granger 1996; Gilquin 2001]
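Overuse and underuse in an NS vs. NNS comparison can be made concrete with a small sketch. The code below normalizes raw counts to a rate per million words and then computes a log-likelihood (G2) score for the difference. The choice of G2 is an assumption: it is a statistic commonly used for corpus comparison, not a method named in these slides. The counts and corpus sizes are invented.

```python
import math

def per_million(count, corpus_size):
    """Normalize a raw count to a rate per million words."""
    return count / corpus_size * 1_000_000

def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 for one item in two corpora; a zero observed count contributes 0."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def compare(count_nns, size_nns, count_ns, size_ns):
    """Label the learner rate as overuse or underuse relative to the native rate."""
    rel_nns = per_million(count_nns, size_nns)
    rel_ns = per_million(count_ns, size_ns)
    if rel_nns > rel_ns:
        direction = "overuse"
    elif rel_nns < rel_ns:
        direction = "underuse"
    else:
        direction = "no difference"
    return direction, log_likelihood(count_nns, size_nns, count_ns, size_ns)

# Invented figures: 150 hits in 100,000 learner words vs. 80 in 100,000 native words.
direction, g2 = compare(150, 100_000, 80, 100_000)
print(direction, round(g2, 2))
```

A G2 value above 3.84 corresponds to p < 0.05 for one degree of freedom. The same function applies to an NNS vs. NNS comparison by passing counts from two learner corpora with different L1 backgrounds.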


Learner corpus research (7): Methodology

Gilquin, G., S. Papp and M. B. Díez-Bedmar (eds.) (2008). Linking up contrastive and learner corpus research. Amsterdam: Rodopi.

CIA: NS vs. NNS; NNS vs. NNS [Granger 2002: 12]


SV vs. VS production in NNS and NS [Lozano & Mendikoetxea 2009]

[Figure: stacked bar chart. Y-axis: Frequency (%) of VS production, 0% to 100%. X-axis: Italian ICLE, Spanish ICLE, Spanish ICLE & WriCLE, French ICLE, LOCNESS. Series: SV, VS. Data labels in order of appearance: SV 97,4; 91,9; 92,9; 97,7; 97,8. VS 2,6; 8,1; 7,1; 2,3; 2,2.]


Learner corpus research (7): Methodology

Computer-aided error analysis [Granger 2002: 5.2]

Selecting an error prone item (word, phrase, syntactic structure) and scanning the corpus to retrieve all instances of misuse of the item with the help of standard text retrieval tools.

Requires a preconceived idea of what is an error.

Devising a standardised system of error-tags and tagging all the errors in a learner corpus or at least all the errors within a category, using an error editor.

Does not require a preconceived idea of what is an error.


The INTELeNG Project: The Database

IDN | Yr | Ess | GrC | ET | Original Error | Correction
1 | 1 | 1 | 2 | GP | I think that is the main reason that *them* became criminals | I think that is the main reason that they became criminals
2 | 1 | 1 | 1 | XNCO | I think that is the main reason *that* them became criminals | I think that is the main reason why they became criminals
3 | 1 | 1 | 1 | XNCO | A person *loved* is a happy one | A person who is loved is (a) happy (one/person)
4 | 1 | 1 | 1 | LSF | *Miles* of children are abandoned every year | Thousands of children are abandoned every year
5 | 1 | 1 | 2 | WM | Older children take care of the *younger* | Older children take care of (the) younger ones
6 | 1 | 1 | 2 | GA | They feel important in *this* groups | They feel important in these groups
7 | 1 | 1 | 5 | GVN | Nowadays violence against women *have* increased in Spain | Nowadays violence against women has increased in Spain
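An error-tagged database of this kind lends itself to simple counting and retrieval. The sketch below loads the seven records shown on this slide into Python, counts errors per tag, and extracts the span marked with asterisks. The class and function names are hypothetical, and the field names simply reuse the column abbreviations as they appear (Yr, Ess, GrC, ET), since the slide does not define them; reading ET as the error-tag column is an assumption.

```python
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class ErrorRecord:
    idn: int         # IDN column
    yr: int          # Yr column, copied as-is
    ess: int         # Ess column, copied as-is
    grc: int         # GrC column, copied as-is
    et: str          # ET column, assumed to hold the error tag
    original: str    # learner sentence, error span marked with *...*
    correction: str  # corrected sentence

RECORDS = [
    ErrorRecord(1, 1, 1, 2, "GP",
                "I think that is the main reason that *them* became criminals",
                "I think that is the main reason that they became criminals"),
    ErrorRecord(2, 1, 1, 1, "XNCO",
                "I think that is the main reason *that* them became criminals",
                "I think that is the main reason why they became criminals"),
    ErrorRecord(3, 1, 1, 1, "XNCO",
                "A person *loved* is a happy one",
                "A person who is loved is (a) happy (one/person)"),
    ErrorRecord(4, 1, 1, 1, "LSF",
                "*Miles* of children are abandoned every year",
                "Thousands of children are abandoned every year"),
    ErrorRecord(5, 1, 1, 2, "WM",
                "Older children take care of the *younger*",
                "Older children take care of (the) younger ones"),
    ErrorRecord(6, 1, 1, 2, "GA",
                "They feel important in *this* groups",
                "They feel important in these groups"),
    ErrorRecord(7, 1, 1, 5, "GVN",
                "Nowadays violence against women *have* increased in Spain",
                "Nowadays violence against women has increased in Spain"),
]

def marked_span(record):
    """Return the text between asterisks in the original, or None if unmarked."""
    match = re.search(r"\*(.+?)\*", record.original)
    return match.group(1) if match else None

def count_by_tag(records):
    """Count how many records carry each error tag."""
    return Counter(r.et for r in records)

for tag, n in count_by_tag(RECORDS).most_common():
    print(tag, n, [marked_span(r) for r in RECORDS if r.et == tag])
```

Note that records 1 and 2 annotate the same sentence twice with different tags, so counts per tag and counts per sentence are not interchangeable.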


Learner corpus research: (8) Types of studies

1) Hypothesis-driven/corpus-based

Using CLC data to test specific hypotheses/ research questions about the nature of IL generated through introspection, SLA theories, or as a result of the analysis of experimental or other non-corpus based sources of data.

2) Hypothesis-finding/corpus-driven

Investigating CLC data in a more exploratory way and initiating analyses that yield patterns of data, which can then be inspected for unusual features. Such features may then be used to generate hypotheses about learner language.

[from Barlow 2005]


Learner corpus research: (9) Types of studies

Altenberg, B. (2002) ‘Using bilingual corpus evidence in learner corpus research’ in Granger et al. (eds).

It studies one type of causative construction (make somebody happy), which Swedish learners overuse, in a parallel corpus.

It explores questions like: how central are these structures in the two corpora?; to what extent are these constructions retained in translation?; what are the main causative types and how are they used?

Its aim is to see how contrastive data can be used to explain Swedish learners’ overuse of the construction. Transfer is explained in terms of prototypicality.

Aijmer, K. (2002) ‘Modality in advanced Swedish learner interlanguage’ in Granger et al. (eds.)

Aijmer uses CLC data to compare the range and frequency of some key modal words in native English writing and L2 English writing of advanced level university students: L1 Swedish and some L1 French and L1 German from ICLE.

The author is interested in the overuse and underuse of some forms and suggests possible pedagogical implications.


Learner corpus research: (10) Types of studies

Housen, A. (2002) ‘A corpus-based study of the L2-acquisition of the English verb system’ in S. Granger et al (eds.)

It aims to investigate the Aspect Hypothesis put forward in the L2 literature: the emergence, early use and subsequent development of verb morphology in L2 is strongly influenced by the inherent semantic properties of the lexical verb which the learner selects to refer to a particular event.

Tono, Y. (2004) ‘Multiple comparisons of IL, L1 and TL corpora: The case of L2 acquisition of verb subcategorization patterns by Japanese learners of English’, in G. Aston et al (eds.)

It investigates the acquisition of verb subcategorization frame patterns by Japanese learners of English by examining the relative influence of factors such as the effect of L1, the amount of exposure to L2 input, and the properties of inherent verb semantics on the use and misuse of verb subcategorization patterns.

Oshita, H. 2000. What is happened may not be what appears to be happening: a corpus study of ‘passive’ unaccusatives in L2 English. Second Language Research 16, 4, 293-324.

Oshita, H. 2004. Is there anything there when there is not there? Null expletives and second language data. Second Language Research 20, 2, 95-130.

[both using the Longman Learners’ Corpus]


Learner corpus research: (11) Evaluation

The majority of CLC studies are hypothesis-finding studies.

There are biases in practice [Barlow 2005]:

The experimental/generative tradition favours hypothesis-driven/corpus-based studies.

Corpus linguists have a preference for a hypothesis-finding/corpus-driven methodology.

On the whole, the contribution of CLC research so far has been much more substantial in description than interpretation of SLA data (Granger 2004, Myles 2005).

Two reasons [Granger 2004: 134-135]:

Learner corpus research has been mainly conducted by corpus linguists, rather than SLA specialists (Hasselgard 1999).

The type of IL CLC researchers have been most interested in (intermediate to advanced) was so poorly described in the literature that they felt the need to establish the facts before launching into theoretical generalizations.


Learner corpus research: (12) Evaluation

Researchers use almost exclusively written L2 corpora. Very little use has been made of oral corpora.

Overwhelming focus on advanced learners (but no formal measure of proficiency provided).

Most L2 corpora are untagged and when they are tagged they use very specific schemes, which makes it difficult to share data.

Most of the studies using corpora make little use of software other than concordancers.

Analysis tools are fairly limited: lexical searches, frequency counts, concordances and manual annotation.

Unsophisticated statistics: raw frequencies.

The developmental dimension is almost lacking.

Most work is rather descriptive: documenting differences between native and non-native English, rather than explaining them.

CLC studies are also not sufficiently informed by SLA theory: little or no reference to current debates and hypotheses in the SLA literature.

Strong pedagogical bias: researchers tend to assume that finding out differences in use between learners and L1 speakers will have direct pedagogical implications, which is not always the case.


Learner corpus research (13): Conclusion

Such research is useful nonetheless, as we need to have good descriptions of learner language in order to inform our understanding of what shapes its development, but it is now time that corpus linguists and SLA specialists work more closely together in order to advance both their agendas [Myles 2005: 381].


PART II

WriCLE: a corpus for L2 research


Learner corpora: Explicit design criteria

The first major step in corpus building is the determination of the criteria on which the texts that form the corpus will be selected: mode (written, oral); type of text (book, journal, letter…); domain (academic, popular); language; date of texts…etc.

For a corpus to be trusted, the structural criteria must be chosen with care, because the concerns of balance and representativeness depend on these choices. [Sinclair 2005]

Design criteria are very important in the case of learner data because there is so much variation in EFL/ESL.

A random collection of heterogeneous learner data is NOT a learner corpus.

The usefulness of a learner corpus is directly proportional to the care that has been exerted in controlling and encoding the variables.

[Granger 2002: 9]


Learner corpora: Explicit design criteria

While some of the criteria are the same as for native corpora, others are specific to learner corpora:

Learner criteria:

• Learning context

• L1

• Other FL

• Level of Proficiency

• …

Task criteria:

• Time limit

• Use of reference tools

• Exam

• Audience/interlocutor

• …
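One way to picture "controlling and encoding the variables" is as a metadata record attached to every text in the corpus. The sketch below encodes the learner and task criteria listed above as Python dataclasses and filters texts by learner variables. All class names, field names, and sample values are hypothetical illustrations; they do not describe the actual metadata scheme of WriCLE or any other corpus.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LearnerProfile:
    learning_context: str   # e.g. "EFL, university"
    l1: str                 # first language
    other_fl: List[str]     # other foreign languages known
    proficiency: str        # level of proficiency

@dataclass
class TaskSetting:
    time_limit_minutes: Optional[int]  # None means untimed
    reference_tools: bool              # dictionaries, grammars, etc. allowed
    exam: bool                         # written under exam conditions
    audience: str                      # audience/interlocutor

@dataclass
class CorpusText:
    text_id: str
    learner: LearnerProfile
    task: TaskSetting
    text: str

def select(texts, **learner_criteria):
    """Return the texts whose learner profile matches every given field value."""
    return [
        t for t in texts
        if all(getattr(t.learner, key) == value
               for key, value in learner_criteria.items())
    ]

# Invented sample records, for illustration only.
texts = [
    CorpusText("T001",
               LearnerProfile("EFL, university", "Spanish", ["French"], "B2"),
               TaskSetting(None, True, False, "teacher"),
               "Sample essay text one."),
    CorpusText("T002",
               LearnerProfile("EFL, university", "Spanish", [], "C1"),
               TaskSetting(60, False, True, "examiner"),
               "Sample essay text two."),
    CorpusText("T003",
               LearnerProfile("EFL, university", "Italian", [], "B2"),
               TaskSetting(None, True, False, "teacher"),
               "Sample essay text three."),
]

print([t.text_id for t in select(texts, l1="Spanish", proficiency="B2")])
```

Encoding the variables explicitly is what makes subcorpora comparable: a query such as "L1 Spanish, B2, untimed" is only possible if each of those values was recorded at collection time.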


WriCLE: A Written Corpus of Learner English (1)

Key design principles:

[Rollinson, P. & A. Mendikoetxea. 2010. Learner corpora and second language acquisition: Introducing WriCLE. In J. L. Bueno Alonso et al. (eds.) Analizar datos > Describir variación / Analysing data > Describing variation, 1-20. Vigo: Universidad de Vigo.]

Principle 1: Focus on written language:

Semi-naturalistic L2 speech data vs. written data: “spontaneous speech produced in face to face interaction is likely to provide more direct evidence about the state of the L2 learner’s underlying interlanguage system.” [Mitchell et al. 2008]


WriCLE: A Written Corpus of Learner English (2)

There are likely to be fewer performance errors in the written language and the errors found are those that escape monitoring, indicating grammatical or lexical gaps in the learners’ mental grammar.

Learners tend to use more complex structures when they are writing, which could be more revealing in terms of their linguistic competence than the simplified language often found in oral language.

Written corpora are often used to study native grammars and are considered to be a good reflection of language competence.

Written corpora are particularly suitable for studying the features of the interlanguage of advanced learners, especially in comparison with similar corpora of native speakers:

Learner corpus research in the ICLE tradition shows that advanced learner texts are a valuable source of data to study aspects such as modality, degree adverbs, tenses, collocations, phraseology, the expression of causativity, information structure, clefts, anaphora, etc.

Written corpora can also be used in hypothesis-testing studies: passivised structures and expletives (Oshita 2000, 2004), the study of subject inversion in L2 English (Lozano & Mendikoetxea 2008, 2009, in press).


WriCLE: A Written Corpus of Learner English (3)

Principle 2: Authenticity

Learners (L1 Spanish, L2 English) contribute argumentative or discussion essays written outside the classroom environment for the Academic Writing component of English Language I and English Language III courses for a degree in English Studies.


Authenticity

What is authentic data in corpora?

“All the material gathered from the genuine communications of people going about their normal business” vs. data gathered “in experimental conditions or in artificial conditions of various kinds!” (Sinclair 1996).

What is authentic data in L2 corpora?

• Learner data is rarely fully natural, especially in the case of EFL learners, who learn English in a classroom.

• Scale of naturalness (Nesselhauf 2004:128)

fully natural – product of teaching process – controlled task – scripted

In general, the more intervention by the researcher, the further away we are from ‘authentic’ data.


‘Authentic’ learner data in a classroom environment = “data resulting from authentic classroom activity” [Granger 2002: 8]

In a foreign language environment, what comes closest to naturally occurring texts are:

- Texts that are produced for pedagogical reasons.

- Texts that are produced for the corpus but that use procedures exerting very little control.

[Nesselhauf 2004: 128].

- Free compositions produced for a certain course.

- Free compositions produced for a corpus.

- A text read aloud in class.

- Oral interview for a corpus.


WriCLE: A Written Corpus of Learner English (4)

Principle 3: Variety of learner levels

The corpus includes learners at 6 different proficiency levels so as to maximize its usefulness for studying development in L2 English.

A standardised proficiency measure is used:

The levels correspond to those of the Common European Framework of Reference for Languages, and they are determined by the score obtained in the Quick Oxford Placement Test taken by each of the learners.


WriCLE: A Written Corpus of Learner English (5)

Principle 4: Documentation

The corpus is fully documented:

Each learner fills in a Learner Profile questionnaire [1 profile per learner] (info about learner)

Each learner fills in an Essay Profile questionnaire [1 profile per essay – a learner may contribute more than 1 essay] (info about task)

This information, together with other information (e.g. proficiency level, year of collection, year of study, etc.), is stored in a database which is made available to researchers with the corpus and allows them to build subcorpora.
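
As a rough illustration of how such a metadata database supports subcorpus building, the sketch below filters essay records by proficiency level and year of study. The field names (`essay_id`, `learner_id`, `cefr_level`, `year_of_study`) and the sample records are hypothetical; they are not taken from the actual WriCLE database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EssayRecord:
    """One row of essay metadata (hypothetical schema, for illustration only)."""
    essay_id: str
    learner_id: str
    cefr_level: str
    year_of_study: int


# Invented sample records; a learner may contribute more than one essay.
RECORDS = [
    EssayRecord("e001", "s01", "B1", 1),
    EssayRecord("e002", "s01", "B2", 3),
    EssayRecord("e003", "s02", "C1", 3),
    EssayRecord("e004", "s03", "B2", 1),
]


def build_subcorpus(records, levels=None, years=None):
    """Return the records matching the given levels and years (None = no restriction)."""
    return [
        r for r in records
        if (levels is None or r.cefr_level in levels)
        and (years is None or r.year_of_study in years)
    ]


b2_essays = build_subcorpus(RECORDS, levels={"B2"})
third_year = build_subcorpus(RECORDS, years={3})
print([r.essay_id for r in b2_essays])
print([r.essay_id for r in third_year])
```

The same filter could be combined with any other documented variable (e.g. year of collection or task type), which is what makes full documentation useful for comparing like with like.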


WriCLE: A Written Corpus of Learner English (6)

Principle 5: Homogeneity

In WriCLE internal criteria guarantee homogeneity (topic, learner type, context of instruction, etc.).

‘Rogue’ texts are rejected by researchers (e.g. plagiarism).

Homogeneity facilitates comparison with similar corpora [crucial for L2 research]:

CEDEL2 ‘Corpus Escrito del Español como L2’

[L1 English- L2 Spanish; native Spanish] (see Lozano 2009, forthcoming)

LOCNESS ‘Louvain Corpus of Native English Essays’ UCL/CECL, Louvain-la-Neuve

ICLE ‘International Corpus of Learner English’

(Granger et al. 2002)

“A corpus should aim for homogeneity in its components while maintaining adequate coverage, and rogue texts should be avoided.”

[Sinclair 2005: criterion 10]


WriCLE: A Written Corpus of Learner English (7)

Principle 6: Annotation

Most learner data are ‘raw’ because of the difficulty of annotating learner language.

Researchers tend to develop their own annotation schemes & software.

Standard annotation in L1 acquisition: CHILDES

Limitations: user-friendliness, suitability for complex written data

UAM CorpusTool (by Michael O’Donnell):

- Flexibility (it can be adapted for each project)

- It allows for semi-automatic annotation.

- It produces an XML-encoded version of the text file, including the features assigned to the segments.

An annotated learner corpus should ideally be based on standardised annotation software in order to ensure comparability of annotated learner corpora with annotated native corpora. [Granger 2002: 10]


WriCLE: A Written Corpus of Learner English (8)

Principle 7: Accessibility

Progress in L1 acquisition research is partly due to data sharing (CHILDES).

If L2 acquisition research is going to experience a similar development, it is essential that researchers share their data in a similar fashion.

The complete corpus and database will be available for research purposes at:

http://www.uam.es/proyectosinv/woslac/Wricle/


WriCLE: A Written Corpus of Learner English (9)


WriCLE: Data (1)

Number of texts: 705

Texts per year:

- 1st yr: 451 (64.0%)
- 3rd yr: 254 (36.0%)


WriCLE: Data (2)

Average length of essay: 933 words

Average length per year:

- 1st yr: 691 words
- 3rd yr: 1360 words

Number of words: 658,000

Words per year:

- 1st yr: 312,000
- 3rd yr: 346,000
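
The size figures on the data slides can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; because the word counts are rounded to the nearest thousand, the recomputed per-year averages should be expected to match the reported ones only approximately.

```python
# Figures as reported on the WriCLE data slides.
TEXTS = {"1st yr": 451, "3rd yr": 254}
WORDS = {"1st yr": 312_000, "3rd yr": 346_000}  # rounded to the nearest thousand

total_texts = sum(TEXTS.values())
total_words = sum(WORDS.values())

for year in TEXTS:
    share = 100 * TEXTS[year] / total_texts   # share of all texts, in %
    mean_length = WORDS[year] / TEXTS[year]   # average essay length in words
    print(f"{year}: {share:.1f}% of texts, about {mean_length:.0f} words per essay")

print(f"overall: {total_texts} texts, about {total_words / total_texts:.0f} words per essay")
```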


WriCLE: Data (3)

Figure: Essays across proficiency levels, 1st-3rd year (x-axis: levels A1, A2, B1, B2, C1, C2; y-axis: 0 to 200; series: Year 1, Year 3).


WriCLE: Data (4)

Figures: proficiency levels represented in Year 1 (A2, B1, B2, C1, C2) and in Year 3 (B1, B2, C1, C2); Year 1 vs. Year 3 at levels B1, B2 and C1.


WriCLE: Annotation


Automatic Segmentation (Using the Stanford Parser)

Automatic NP Segmentation

Automatic Clause Segmentation


Annotation scheme (Corpus Tool)


WriCLE Inf(ormal)

- 1st year students of English

- Broad variety of genres: descriptions, narratives, e-mail, autobiography, blog (diary entries).

- 1,140 texts (up to 8,000 words per text).

- 1,100,000 words (target: 3,000,000).

- Same design principles as WriCLE.


Studies with WriCLE (1): Woslac (Word Order in Second Language Acquisition Research)

MAIN PURPOSE:

To determine the lexicon-syntax and syntax-discourse properties which constrain word order in the interlanguage of L2 learner corpora

L2 English (with L1 Spanish) (WriCLE)

L2 Spanish (with L1 English) (CEDEL2)


Studies with WriCLE (2): Woslac (Word Order in Second Language Acquisition Research)

Verb-Subject order in L2 English (Lozano & Mendikoetxea 2008, 2009, 2010) (L & M)

What are the conditions governing the production of VS structures in L2 English by L1 Spanish and L1 Italian learners? (L & M 2008)

Do learners of L2 English (L1 Spanish) produce inverted subjects (VS) under the same conditions as English natives do, regardless of problems with syntactic encoding (grammaticality)? (L & M 2009, 2010)


Studies with WriCLE (3): Woslac (Word Order in Second Language Acquisition Research)

Hypotheses:

H1 [LEXICON]: Lexicon-syntax interface: Postverbal subjects with unaccusatives (never with unergatives)

H2 [WEIGHT]: Syntax-PF interface: Postverbal subjects: heavy (but preverbal light)

H3 [FOCUS]: Syntax-Discourse interface: Postverbal subjects: focus (but preverbal topic)

A proper analysis of VS structures must take into account not only the properties of V (Unaccusative Hypothesis) but also the properties of the postverbal S.


Corpora: L1 Spanish - L2 English (2 corpora: ICLE + WriCLE)

English natives (LOCNESS: Louvain Corpus of Native English Essays)

Query software: WordSmith v. 4.0 (Scott 2004)

Table 1: Corpora details (number of words)

Learner corpora:

- ICLE-Spanish: 200,376
- WriCLE: 63,836
- Total no. of words: 264,212

Native corpus:

- LOCNESS USarg: 149,574
- LOCNESS USmixed: 18,826
- LOCNESS Alevels: 60,209
- LOCNESS BRsur: 59,568
- Total no. of words: 288,177

Usable concordances by corpus and verb type:

- Learner, Unerg: 181
- Learner, Unac: 820
- Native, Unerg: 185
- Native, Unac: 719
- TOTAL: 1905
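
The totals in the corpora table can be verified by summing the component figures. This is a minimal arithmetic check using only the numbers reported on the slide; the dictionary layout is simply a convenient way of writing them down.

```python
# Word counts per (sub)corpus, as reported in the corpora table.
LEARNER_WORDS = {"ICLE-Spanish": 200_376, "WriCLE": 63_836}
NATIVE_WORDS = {
    "LOCNESS USarg": 149_574,
    "LOCNESS USmixed": 18_826,
    "LOCNESS Alevels": 60_209,
    "LOCNESS BRsur": 59_568,
}

# Usable concordances per corpus and verb type.
CONCORDANCES = {
    ("Learner", "Unerg"): 181,
    ("Learner", "Unac"): 820,
    ("Native", "Unerg"): 185,
    ("Native", "Unac"): 719,
}

learner_total = sum(LEARNER_WORDS.values())
native_total = sum(NATIVE_WORDS.values())
concordance_total = sum(CONCORDANCES.values())

print(f"learner corpora: {learner_total:,} words")
print(f"native corpus:   {native_total:,} words")
print(f"usable concordances: {concordance_total}")
```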


UNACCUSATIVES (semantic class: verbs)

- EXISTENCE: exist, flow, grow, hide, live, remain, rise, settle, spread, survive
- APPEARANCE: appear, arise, awake, begin, break, develop, emerge, flow, follow, happen, occur, rise
- DISAPPEARANCE: die, disappear
- INHERENTLY DIRECTED MOTION: arrive, come, drop, enter

UNERGATIVES (semantic class, semantic subclass: verbs)

- EMISSION, LIGHT EMISSION: beam, burn, flame, flash
- EMISSION, SOUND EMISSION: bang, beat, blast, boom, clash, crack, crash, cry, knock, ring, roll, sing
- EMISSION, SMELL EMISSION: smell
- EMISSION, SUBSTANCE EMISSION: pour, sweat
- COMMUNICATION, MANNER OF SPEAKING: cry (*), shout, sing (*)
- COMMUNICATION, TALK VERBS: speak, talk
- BODILY PROCESSES, BREATHE VERBS: breathe, cough, cry (*), sweat (**)

DATA ANALYSIS: Based on Levin (1993) and Levin & Rappaport-Hovav (1995):

Unergatives: cough, cry, shout, speak, walk, dance… [TOTAL: 41]

Unaccusatives: exist, live, appear, emerge, happen, arrive… [TOTAL: 32]


WordSmith: query searches:

For every lemma (e.g., APPEAR, ARISE), we searched for:

All possible native forms: appear, appears, appearing, appeared

arise, arises, arising, arose, arisen

All possible overregularised and overgeneralised learner forms: arised, arosed, arisened, arosened (“So arised the Sain Inquisition”)

All possible forms with probable L1 transfer of spelling: apear, apears, apearing, apeared

All other possible misspelled forms:

appeard, apeard
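
The search strategy above can be sketched in code: for each lemma, collect the native forms together with the anticipated learner forms and misspellings, then search for any of them as whole words. The study itself used WordSmith; the Python below is only an illustrative stand-in, and the helper names `candidate_forms` and `forms_pattern` are made up for this sketch.

```python
import re


def candidate_forms(native_forms, learner_forms=(), misspellings=()):
    """Merge all search forms for one lemma, dropping duplicates but keeping order."""
    merged = []
    for form in (*native_forms, *learner_forms, *misspellings):
        if form not in merged:
            merged.append(form)
    return merged


def forms_pattern(forms):
    """Compile a case-insensitive pattern matching any of the forms as a whole word."""
    alternatives = "|".join(re.escape(form) for form in forms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


# Forms listed on the slide for the lemmas APPEAR and ARISE.
APPEAR = candidate_forms(
    ["appear", "appears", "appearing", "appeared"],
    learner_forms=["apear", "apears", "apearing", "apeared"],
    misspellings=["appeard", "apeard"],
)
ARISE = candidate_forms(
    ["arise", "arises", "arising", "arose", "arisen"],
    learner_forms=["arised", "arosed", "arisened", "arosened"],
)

sentence = "So arised the Sain Inquisition"
print(forms_pattern(ARISE).findall(sentence))
```

Listing the deviant forms explicitly, rather than relying on a wildcard such as `aris*`, keeps the query from silently missing forms like `arosed` that do not share the expected stem.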


CONCORDANCES: RAW OUTPUT

Thousands of concordances, BUT approx. ¾ were unusable.

Filtering criteria had to be applied manually.


Studies with WriCLE (4): Woslac

Figure: frequency of production (in %) of SV and VS orders with unergative (Unerg) and unaccusative (Unac) verbs, Learners vs. Natives.

Table 1: Frequency of postverbal subjects produced

Corpus, verb type: postverbal subjects / usable concordances (% frequency)

- Learner, Unerg: 0 / 181 (0%)
- Learner, Unac: 58 / 820 (7.1%)
- Native, Unerg: 0 / 185 (0%)
- Native, Unac: 16 / 719 (2.2%)
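
The percentage column follows directly from the counts: postverbal subjects divided by usable concordances. A minimal sketch of that calculation, using only the counts reported in the table (the function name `vs_rate` is an illustrative choice):

```python
# (corpus, verb type, postverbal subjects, usable concordances), as reported in the table.
ROWS = [
    ("Learner", "Unerg", 0, 181),
    ("Learner", "Unac", 58, 820),
    ("Native", "Unerg", 0, 185),
    ("Native", "Unac", 16, 719),
]


def vs_rate(postverbal, usable):
    """Percentage of usable concordances that contain a postverbal subject."""
    return 100 * postverbal / usable


for corpus, verb_type, postverbal, usable in ROWS:
    print(f"{corpus:8} {verb_type:6} {vs_rate(postverbal, usable):.1f}%")
```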


Studies with WriCLE (5): Woslac

GRAMMATICAL

There-insertion:
Natives: there exists a demand for this work to be done…
Learners: There exist positive means of earning money.

AdvP-insertion:
Natives: Thus began the campaign to educate the public…
Learners: …and here emerges the problem.

Locative inversion:
Natives: [no production]
Learners: In the main plot appear the main characters: Volpone and Mosca.

UNGRAMMATICAL

* it-insertion:
Learners: *In the name of religion it had occurred some important events.

* Ø-insertion:
Learners: …*because exist the science technology and the industrialisation.

* XP-insertion:
Learners: *In 1760 occurs the restoration of Charles II in England.

Figure: grammaticality of unaccusative VS structures. Natives: 100.0% grammatical, 0.0% ungrammatical. Learners: 36.2% grammatical, 63.8% ungrammatical.


VS structure types: NNS (L1 Spanish) vs. NS (L & M 2009, 2010)

Figure: frequency of production (in %) by type of preverbal material (*it-insertion, locative inversion, XP-insertion, there-insertion, AdvP-insertion, *Ø-insertion), Learners vs. Natives. Data labels, learners: 41.4%, 15.5%, 13.8%, 10.4%, 10.3%, 8.6%; natives: 43.8%, 37.5%, 18.8%, 0.0%, 0.0%, 0.0%. The starred types are marked as ungrammatical.


Data coding/analysis: EXCEL


VS in NNS: a look at processing

Though the precise nature of processing limitations is not well understood, it could well be that they may be responsible for at least some of the difficulties attested at the interfaces.

They could in principle explain why our L1 Spanish and L1 Italian learners produce mostly ungrammatical VS structures:

Structures requiring the integration of syntactic knowledge and knowledge from other domains require more processing resources than structures requiring only syntactic knowledge.

Learners may be less efficient at integrating multiple types of information in on-line comprehension and production of structures at the syntax-discourse interface.

[See Sorace & Serratrice, to appear and references cited therein]


VS in NNS: a look at processing

Overproduction of VS structures may be a result of processing limitations in L2 learners: VS structures may be regarded as the default unmarked form for presentational purposes.

This is also supported by the fact that the end-weight principle and the end-focus principle reinforce each other.

Learners experience processing difficulties and choose the option which is easier to process.

More advanced learners, however, experience fewer difficulties and are thus able to encode syntactic information more efficiently.

“Natural language syntax must be such that it can be easily acquired by children, rapidly parsed by listeners, and efficiently employed by speakers to express their thoughts.” [Wasow 2002: 57]


Studies with WriCLE (6): Opogram (Optionality and pseudo-optionality in native and non-native grammars)

Objective: to study the factors causing optionality and pseudo-optionality in the non-native grammars of L2 learners of English (L1 Spanish) and L2 learners of Spanish (L1 English) at different proficiency levels.

Hypotheses: 1. Optionality is in part determined by competition between L1 and L2 forms (intermediate stages) and by competition between two L2 forms in apparently free variation in the input which yield different discursive interpretations (advanced and near-native stages).

2. At different proficiency levels variation is located in different areas of the grammar (syntax at beginner and intermediate stages, and the interfaces between syntax and other modules at advanced and near-native stages).

3. Advanced learners exhibit residual optionality in the syntax-discourse interface, as argued by Sorace (2000, 2004, 2005).

To verify these hypotheses we need to consult large datasets (WriCLE, CEDEL2, native corpora) and to improve existing tools for analysis.


Studies with WriCLE (7): TREACLE

Exploration of the changes in grammatical competence of Spanish learners of English (L1 Spanish - L2 English) as they progress in proficiency.

The idea is not just to measure what they do wrong (error analysis) but also to measure which structures they are attempting.

Competence is determined by two measures:

- Difficulty of structures attempted
- Degree of correctness of use of these structures


Studies with WriCLE (8): TREACLE

Measuring what they attempt: automatic annotation of syntactic features of clauses and phrases (with manual correction).

Measuring what they do wrong: manual annotation of errors.

This will feed into better assessment of students, based not only on the errors they make.


PART III

Learner Corpora and SLA research:

The way forward


The way forward (1)

Existing corpora need to be made available to the research community.

Corpora of L2s other than English have to be created.

There is also a need for spoken corpora and for longitudinal corpora to address the developmental dimension of L2 learning, as well as for cross-sectional corpora, with learners at different levels of proficiency.

Such corpora must be compiled according to explicit design criteria which make them useful to conduct SLA research: they must be compiled by SLA researchers (or in collaboration with them).

- Most available corpora are ‘opportunistic’.

- No formal measurement of proficiency is provided.

Corpora must be fully documented, e.g. so that texts can be selected for subcorpora.


The way forward (2)

Analysis tools must be developed which are suitable for learner data and are not reliant on manual tagging (see Gries 2008).

It would be useful to use standard annotation to make it possible for data to be shared: Rutherford & Thomas (2001) advocate the use of CHILDES (but is CHILDES really suitable?)


The way forward (3)

More sophisticated statistical analyses have to be developed: corpus studies have to go beyond raw frequencies (e.g. collocational statistics in Gries, Hampe and Schönefeld 2005, forthcoming; multifactorial analyses in Tono 2004).

- How are ‘overuse’ and ‘underuse’ statistically defined?
- How to determine the ideal span size to investigate collocations?
- When can we say there is variation (how to provide a working operationalisation of optionality)?
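
One common way of giving ‘overuse’ a statistical definition is to compare learner and native frequencies in a 2x2 contingency table. The sketch below applies a Pearson chi-square test (without continuity correction) to the unaccusative counts reported earlier in the talk (58 VS out of 820 learner concordances, 16 out of 719 native ones). The choice of test is an assumption made for illustration; the slides do not say which statistic, if any, was used, and the test treats concordance lines as independent observations, which may not hold when several lines come from the same writer.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Unaccusative concordances: VS vs. non-VS, learners vs. natives.
learner_vs, learner_total = 58, 820
native_vs, native_total = 16, 719

chi2 = chi_square_2x2(
    learner_vs, learner_total - learner_vs,
    native_vs, native_total - native_vs,
)

# Critical value of the chi-square distribution at p = 0.05 with 1 degree of freedom.
CRITICAL_05_DF1 = 3.841
print(f"chi-square = {chi2:.2f}; exceeds the 5% critical value: {chi2 > CRITICAL_05_DF1}")
```

A statistic above the critical value only indicates that the two proportions differ more than chance would predict; whether that difference counts as ‘overuse’ still depends on the comparability of the two corpora.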

Corpus data has to be regarded as one type of data – not the only type. It should be combined with experimental data in search for converging evidence, as has been argued e.g. by Gilquin (2007) and Mönnink (1997, 2000).

“Because the advantages and disadvantages of corpora and experiments are largely complementary, using the two methodologies in conjunction with each other often makes it possible to (i) solve problems that would be encountered if one employed one type of data only and (ii) approach phenomena from a multiplicity of perspectives.” [Gilquin & Gries 2009: 9]


The way forward (4)

There is a need for a clearer relationship between (learner) corpus linguistics and SLA, with more hypothesis-testing, more explanatory studies.

- Corpus linguists tend to use corpora in an exploratory fashion, letting the data speak for themselves.
- SLA researchers tend to start out with explicitly formulated hypotheses and use the corpus data to validate or refute those hypotheses.

This line of research has to be made more visible both in corpus linguistics forums and in SLA forums.

In the area of language pedagogy, the issue of whether corpus frequency-based materials are necessary or useful is still unresolved, as is the issue of authenticity (is exposure to authentic materials better for L2 acquisition/learning?):

“ ... so far even the most ardent defenders of the authenticity/frequency camp ... have not yet provided experimental evidence to bolster their claims as to the utility or even indispensability of corpus based-curricula” [Gries 2008: 425].


Despite all this, it is obvious that corpus-based work has a lot to offer to the fields of SLA and language pedagogy: more and more learner corpora are being created and shared among researchers, with stricter design criteria, and a lot of work is going into annotation and retrieval of linguistic units in learner language.

Thank you!!!

