
Corpus Linguistics

Date post: 05-Mar-2016
Upload: blackpearlmely
Description:
What is corpus linguistics; the design, development, and types of corpora

Transcript
  • WHAT IS CORPUS LINGUISTICS; THE DESIGN, DEVELOPMENT, AND TYPES OF CORPORA

  • CONTENTS

    - Introduction
    - What is a corpus?
    - What is CL?
        - CL: Methodology vs. discipline
    - The development of corpora
        - Historical origins
        - The influence of technology in the development of corpora
    - The many applications of corpus linguistics
        - Language teaching and learning
    - The types of corpora
    - The design of corpora

  • WHAT IS A CORPUS?

    "...any collection of more than one text can be called a corpus: the term corpus is simply the Latin for body, hence a corpus may be defined as any body of text." (McEnery & Wilson, 2001, p. 29)

    A corpus is a set of texts whose size defies analysis by hand and eye alone within any reasonable timeframe. It is the large scale of the data used that explains the use of machine-readable text. (McEnery & Hardie, 2012)

    "...a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description." (Kennedy, 1998, p. 1)

  • WHAT IS A CORPUS?

    In contrast to being simply any body of text, a corpus in modern linguistics can be more accurately described as a finite body of text which can be stored and manipulated using a computer and which is sampled to be maximally representative of a particular variety of language. (ibid.)

    It is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research. (Sinclair, 2004)

  • WHAT IS A CORPUS?

    A corpus is different from a random collection of texts or an archive.

    As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness. (Xiao, 2007)

    Representativeness is a defining feature of a corpus.

    According to McEnery & Wilson (2001), the main characteristics of a corpus are:
    1. sampling and representativeness
    2. finite size
    3. machine-readable form
    4. standard reference

  • WHAT IS CORPUS LINGUISTICS?

    It is the study of linguistic phenomena through large collections of machine-readable texts: corpora. (1)

    Corpus linguistics is the study and analysis of data obtained from a corpus. (3)

    It is the study of language based on real-life use of language. (McEnery and Wilson, 2001)

    Corpus linguistics is a new scholarly enterprise established through the compilation and analysis of the data stored in computerized databases over the last three decades. (Kennedy, 1998)
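
    As a minimal illustration of "analysis of data obtained from a corpus", the sketch below builds a word-frequency list from machine-readable text. The two-sentence toy corpus, the function name `frequency_list` and the simple regular-expression tokeniser are assumptions made for this example; they do not come from any of the cited authors or from a real corpus tool.

```python
import re
from collections import Counter


def frequency_list(texts, top=3):
    """Count word forms across all texts and return the `top` most frequent."""
    counts = Counter()
    for text in texts:
        # Very rough tokenisation: lower-case runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top)


# Toy corpus invented for illustration; a real corpus would be far larger.
toy_corpus = ["The cat sat on the mat.", "The dog sat on the log."]

for word, count in frequency_list(toy_corpus):
    print(word, count)
```

    Even on this tiny sample the function word "the" dominates the list, which is the kind of pattern that frequency counts over real corpora are used to investigate.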

  • WHAT IS CORPUS LINGUISTICS? CL: METHODOLOGY vs. DISCIPLINE

    The corpus linguistics community is divided over whether CL should be considered a methodology or a discipline:

    It is not directly about the study of any particular aspect of language but rather an area whose focus is upon a set of procedures, or methods, for studying language. (McEnery & Hardie, 2012)

    Corpus linguistics is not a linguistic discipline in the same sense as syntax or semantics, for example, as these focus upon describing or explaining some aspects of language use. In contrast, corpus linguistics is a methodology rather than an aspect of language that requires explanation or description. (McEnery & Wilson, 2001)

  • WHAT IS CORPUS LINGUISTICS? CL: METHODOLOGY vs. DISCIPLINE

    Corpus-based vs. corpus-driven linguistics (terms originally introduced by Tognini-Bonelli, 2001)

    - Corpus-based studies typically use corpus data to explore a theory or hypothesis established in the current literature, in order to validate, refute or refine it. The definition of CL as a method buttresses this approach to the use of corpus data in linguistics.
    - Corpus-driven linguistics rejects the characterisation of CL as a method and instead claims that the corpus itself should be the sole source of hypotheses about language. The corpus itself embodies its own theory of language.

    (as cited in McEnery & Hardie, 2012, p. 6)

  • WHAT IS CORPUS LINGUISTICS? CL: METHODOLOGY vs. DISCIPLINE

    Leech promotes computer-based corpus research as a new paradigm of linguistics. He denies that it may be viewed as a mere technique or method:

    "I wish to argue that computer corpus linguistics defines not just a newly emerging methodology for studying language, but a new research enterprise, and in fact a new philosophical approach to the subject." (Leech, 1992: 106-107)

    Leech further states its features:
    - focus on linguistic performance as opposed to competence
    - focus on linguistic description as opposed to linguistic universals
    - focus on both quantitative and qualitative models of language
    - focus on a more empiricist than rationalist view of scientific research

    According to Leech, each of these properties underlines a contrast between the CL paradigm and the Chomskyan paradigm. (as cited in Leon, 2005)

  • WHAT IS CORPUS LINGUISTICS? CL: METHODOLOGY vs. DISCIPLINE

    CORPUS LINGUISTICS IN THE CONTEXT OF EMPIRICISM AND RATIONALISM

    Empiricists vs. rationalists: should linguists rely on naturally occurring observations or on artificially induced observations?

    - A rationalist theory is based on artificial behavioural data and conscious introspective judgements: a native speaker of a language reflects on the language and makes theoretical claims founded on those reflections.
    - Rationalist theories are founded on the development of a theory of mind. The aim is to develop a theory of language which both emulates the external effects of human language processing and actively seeks to make the claim that it represents how the processing is actually undertaken (relates to competence).
    - The empiricist approach relies on the observation of naturally occurring data, typically through the medium of the corpus (relates to performance).

    (Leon, 2005)

  • WHAT IS CORPUS LINGUISTICS? CL: METHODOLOGY vs. DISCIPLINE

    WHAT CHOMSKY SAYS

    "Corpus linguistics does not exist." (Chomsky, in an interview with Bas Aarts, 2000)

    Chomsky argues that any particular corpus of utterances obtained by linguists in their fieldwork cannot be identified with the set of grammatical sentences, inasmuch as the notion of grammaticality involves those of projection, infiniteness and the ideal speaker:

    "Any grammar of a language will project the finite and somewhat accidental corpus of observed utterances to a set (presumably infinite) of grammatical utterances. In this respect, a grammar mirrors the behaviour of the speaker who, on the basis of a finite and accidental experience with language, can produce or understand an indefinite number of new sentences." (Chomsky, 1957: 15)

  • WHAT IS CORPUS LINGUISTICS? CL: METHODOLOGY vs. DISCIPLINE

    WHAT CHOMSKY SAYS

    "Any natural corpus is skewed... If generated, it will produce non-sentences or conversely be incomplete and not provide every grammatical sentence." (the argument as reported by Leech, 1991, 1992)

    "Grammaticality cannot be identified with high statistical approximation." (Chomsky, 1957) Grammaticality and acceptability must be distinguished.

    Chomsky's arguments against the post-Bloomfieldians have been used to explain an alleged gap in corpus production which did not really occur. In fact, it may be assumed that there was no discontinuity between present-day annotated corpora and the vocabulary-count corpora which flourished throughout the early 20th century.

    Chomsky's criticism of corpora and statistical methods, on the other hand, did not concern vocabulary counts. Rather, he appeared to find the use of Markov and word-statistics models quite worthwhile, as long as they did not deal with syntax. Kučera (one of the Brown Corpus's authors) agreed with Chomsky on this point, using Markov's model in a comparative study of phonemes.

  • THE DEVELOPMENT OF CL: HISTORICAL ORIGINS

    The early beginnings of corpus linguistics can be traced back to the thirteenth century, when Biblical scholars were indexing words from the Christian Bible, manually, page by page.

    In 1736, using the Bible as a corpus, Alexander Cruden, a London bookseller and proofreader, produced Cruden's Concordance, which included major content words in the Bible, collocations and function words.

    In 1755, Dr Samuel Johnson's Dictionary of the English Language was published; it represented a large corpus of about 40,000 headword entries, recorded on slips of paper.

  • THE DEVELOPMENT OF CL: HISTORICAL ORIGINS

    The Oxford English Dictionary (OED) was also an example of a corpus on slips of paper, first published in 1928.

    In 1940, Fries's American English Grammar was published, which described social class differences in language usage.

    The SEU corpus (the Survey of English Usage) was assembled specifically for grammatical description. Founded by Quirk, it was published in 1968. Initially, the SEU could not be analysed by computer, but it marked the transition between pre-electronic and modern corpus linguistics.

  • THE INFLUENCE OF TECHNOLOGY ON THE DEVELOPMENT OF CL

    The advance of technology stimulated the development of corpora. Although the first computers were extremely difficult to work with, as they had small memories, their great potential was recognised from an early date.

    Computational work with texts began with Father Busa's Index Thomisticus before 1950.

    The invention of the tape recorder in the late 1950s made it possible to collect spoken data. In the 1960s, the first electronic corpus of written language, the Brown Corpus, was compiled at Brown University by Nelson Francis and Henry Kučera.

  • THE INFLUENCE OF TECHNOLOGY ON THE DEVELOPMENT OF CL

    During the 1970s there was no notable advance in computer technology; nevertheless, this was the period when corpora of a million words were assembled, as well as a spoken corpus with detailed phonological transcription, as a section of the SEU.

    The revolution in hardware and software in the 1980s and 1990s contributed largely to the development of modern linguistics.

    The invention of scanners and OCR software improved access to printed materials, and video and DVD recorders had positive effects on the creation of spoken corpora.

    Data and search results were more easily transferred from scholar to scholar with the growth of the Internet and fast download speeds. (O'Keeffe & McCarthy, 2010)

  • THE MANY APPLICATIONS OF CORPORA

    According to O'Keeffe and McCarthy (2010), there are many areas in which the use of corpora has been adopted, such as:

    - Language teaching and learning
    - Discourse analysis
    - Literary studies and translation studies
    - Forensic linguistics
    - Pragmatics
    - Sociolinguistics, media discourse and political discourse

  • THE MANY APPLICATIONS OF CORPORA

    LANGUAGE TEACHING AND LEARNING

    - Data-driven Learning (DDL): a momentous and revolutionary approach, bolstered by Johns and Tribble (the term was first coined by Johns in 1991)
    - A student-centered method in which learners read large amounts of authentic language and try to discover linguistic patterns and rules on their own; learner autonomy is prompted
    - The corpus becomes the center of knowledge, the students take on the role of questioner and the teacher becomes facilitator/prompter
    - The teacher acts as research director and collaborator and does not transmit information directly or explicitly

  • THE MANY APPLICATIONS OF CORPORA

    LANGUAGE TEACHING AND LEARNING

    Over the years, different investigators and language teachers have taken advantage of DDL for teaching different components of language such as collocations, grammatical points, affixes, etc. (Ball, 1996; Dyck, 1999; Kettemann, 1995; Tribble, 1997).

    Samples of authentic language for preparing DDL exercises are commonly taken from linguistic corpora (spoken and written).

    It is claimed that DDL is advantageous only for advanced students, who are more intelligent and sophisticated. Johns (1986) said that DDL is appropriate for learners who are adult, have enough motivation, and are sophisticated and intelligent.
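
    DDL exercises of this kind are often presented as keyword-in-context (KWIC) concordance lines, from which learners infer patterns such as collocations. The sketch below is a minimal, assumed implementation: the function name `kwic`, the three-word context window and the invented sample sentences are illustrative only and are not taken from Johns, Tribble or any particular concordancer.

```python
import re


def kwic(text, keyword, width=3):
    """Return one line per hit: `width` words of context either side of `keyword`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample sentences standing in for authentic corpus text.
sample_text = (
    "She made a decision quickly. He made a mistake yesterday. "
    "They made an effort to help."
)

for line in kwic(sample_text, "made"):
    print(line)
```

    Reading the hits side by side, a learner can notice that "made" is followed here by "a decision", "a mistake" and "an effort", which is the sort of pattern discovery DDL relies on.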

  • TYPES OF CORPORA

    - General vs. specialised corpora
    - Monitor vs. sample corpora
    - Synchronic vs. diachronic corpora
    - Monolingual vs. multilingual corpora
    - Comparable vs. parallel corpora
    - Written vs. spoken corpora

  • GENERAL vs. SPECIALISED CORPORA

    GENERAL CORPORA
    - designed for unspecified linguistic research
    - used by researchers to answer questions about vocabulary, grammar or discourse structure
    - contain texts from various genres of use, including spoken and written, private and public; they are balanced
    EXAMPLES:
    - The SEU corpus (The Survey of English Usage)
    - The British National Corpus (BNC)

    SPECIALISED CORPORA
    - designed for particular research projects
    - sources of word frequency data and citations for the compilation of modern dictionaries
    - may be small in size and easy to build, e.g. a corpus of French essays written by English high school students
    EXAMPLES:
    - Child Language Data Exchange System (CHILDES): first language acquisition
    - Oxford English Dictionary (OED)

  • TYPES OF CORPORA

    Corpus of the Survey of English Usage

  • MONITOR VS. SAMPLE CORPORA

    MONITOR CORPORA
    - originally proposed by Sinclair (1982)
    - a collection of data which grows in size over time and which contains a variety of materials, as more and more texts are continually added to it
    EXAMPLES:
    - The Bank of English (BoE), from the 1980s
    - The Corpus of Contemporary American English (COCA)
    - The Web

    SAMPLE CORPORA
    - represent a particular type of language over a specific period of time and are constructed according to a specific sampling frame
    - also called snapshot corpora, as they are samples of a given language at a given time
    EXAMPLES:
    - The Brown Corpus, compiled in the 1960s: the original sample corpus, which contains a large number of texts from informative and imaginative prose
    - The Lancaster-Oslo/Bergen corpus (LOB), representing a snapshot of the standard written form of British English in the early 1960s

  • MONITOR VS. SAMPLE CORPORA

    Corpus of Contemporary American English, COCA (1990-2012)

    - The largest freely available corpus of English
    - The only large and balanced corpus of American English
    - Created at Brigham Young University by Mark Davies
    - Contains 440 million words of full-text data (190,000 texts)
    - Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
    http://corpus.byu.edu/coca/

  • SYNCHRONIC vs. DIACHRONIC CORPORA

    SYNCHRONIC CORPORA enable analysis of different varieties of language at the same point in time.
    EXAMPLE: The International Corpus of English (ICE), from 1990; 26 research teams around the world are preparing electronic corpora of their own national or regional variety of English.

    DIACHRONIC CORPORA allow language change to be studied over time across a range of texts from different genres.
    EXAMPLES:
    - The Helsinki Corpus of English Texts, the first specialised diachronic corpus of English
    - The Helsinki Corpus of Older Scots
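
    Comparing a feature across time periods usually means comparing sub-corpora of different sizes, so raw counts are commonly normalised to a rate per million words first. The following is only a sketch of that arithmetic: the helper `per_million`, the period labels and all figures are invented for illustration and are not taken from the Helsinki corpora or any other real corpus.

```python
def per_million(raw_count, corpus_words):
    """Normalise a raw count to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_words


# Hypothetical figures, invented for illustration only.
periods = {
    "period A": {"hits": 120, "words": 400_000},
    "period B": {"hits": 90, "words": 600_000},
}

normalised = {
    name: per_million(data["hits"], data["words"])
    for name, data in periods.items()
}

for name, rate in normalised.items():
    print(f"{name}: {rate:.1f} per million words")
```

    In this invented case the raw counts (120 vs. 90) understate the change: once normalised, the feature is twice as frequent in period A as in period B.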

  • THE DESIGN OF CORPUS

    Some key considerations in designing a corpus (O'Keeffe and McCarthy, 2010):

    - Purpose
    - Size
    - Collecting texts
    - Getting permission to collect texts
    - Representativeness, balance, sampling

  • THE DESIGN OF CORPUS

    PURPOSE

    Before constructing a new corpus we need to answer two basic questions:

    Why do we want another corpus? Plenty of electronic corpora are already readily available online, so this question must be considered very seriously.

    What do we want to achieve with it? The purpose of the corpus must be clear as well as the use of it.

  • THE DESIGN OF CORPUS

    SIZE

    The size of the corpus depends on its purpose, i.e. what it is intended to be used for. Many corpus creators agree that the larger the better, e.g. for lexicographic projects such as dictionaries. However, if we want to create a corpus for pedagogical purposes, e.g. teaching and language use in the classroom, we will aim for a smaller corpus.

  • THE DESIGN OF CORPUS

    SIZE

    So, how large should a corpus be?
    - Leech (1991): Size is not all-important.
    - Krishnamurthy (2001): Size matters.

    Aside from the purpose for which the corpus is intended, there are a number of practical considerations as well:
    - The kind of query that is anticipated from users: are you studying common or rare linguistic features?
    - The methodology they use to study the data: how much work can be done by the machine and how much has to be done by hand?
    - For corpus creators, also the source of data: are the data in electronic form readily available at a reasonable cost? Can copyright permissions be granted easily, if at all?

    (Xiao, 2007)
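
    The question of common versus rare features can be made concrete with simple arithmetic: the expected number of hits is the feature's rate multiplied by the corpus size. The sketch below assumes a feature whose frequency per million words is already known; the function names and the example rate of 2 per million are hypothetical and chosen only to show the calculation.

```python
import math


def expected_hits(freq_per_million, corpus_words):
    """Expected number of occurrences of a feature in a corpus of a given size."""
    return freq_per_million * corpus_words / 1_000_000


def min_corpus_words(freq_per_million, wanted_hits):
    """Smallest corpus size (in words) expected to yield at least `wanted_hits`."""
    return math.ceil(wanted_hits * 1_000_000 / freq_per_million)


# Hypothetical rare feature occurring about twice per million words.
rare_rate = 2

print(expected_hits(rare_rate, 1_000_000))      # a one-million-word corpus
print(expected_hits(rare_rate, 100_000_000))    # a 100-million-word corpus
print(min_corpus_words(rare_rate, 50))          # words needed for about 50 hits
```

    Under these assumptions a one-million-word corpus would be expected to contain only a couple of examples of the feature, which is why studies of rare features tend to call for much larger corpora than studies of common ones.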

  • THE DESIGN OF CORPUS

    COLLECTING TEXTS

    Again, this is closely related to the purpose of the corpus. The texts that are to be collected and included in the corpus must be related to it.

    Texts can be obtained from two basic sources: publicly available sources and privately available sources.

    Publicly available data can be collected from newspapers, journals, magazines and a number of sites on the Internet.

  • THE DESIGN OF CORPUS

    GETTING PERMISSION

    There are texts that are considered public domain and are available for free, but there is also copyrighted material.

    Therefore, prior to collecting texts, we should obtain permission from institutions or parties involved, if there are such requirements.

  • THE DESIGN OF CORPUS

    REPRESENTATIVENESS AND BALANCE

    REPRESENTATIVENESS

    The corpus that is to be built must be representative of the language being investigated.

    There must be a match between the language being examined and the type of material being collected. (Biber, 1993)

    This means that if we want to describe the language of newspaper editorials, a collection of personal letters will not be representative of that language.

    Representativeness is achieved by balance and sampling.

  • THE DESIGN OF CORPUS

    BALANCE

    Balance is achieved by including a range of various genres of texts in the corpus. This depends on the purpose of the corpus and its intended use. For example, a general-purpose corpus must include both spoken and written texts. The British National Corpus (BNC) is considered to be a well-balanced corpus.

  • THE DESIGN OF CORPUS

    BALANCE: BNC

    - Generally accepted as being a balanced corpus
    - Its design has been followed in the construction of a number of corpora
    - 4,124 texts (including transcripts of recordings)
    - ca. 100 million words: 90% written + 10% spoken

    Three criteria for written texts:
    - Domain: the content type (i.e. subject field)
    - Time: the period of text production
    - Medium: the type of text publication (books, periodicals etc.)

    Two criteria for spoken texts:
    - Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
    - Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories
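
    The 90% written / 10% spoken split can be reused as a target when planning a smaller corpus along the same lines. The sketch below only illustrates that proportional arithmetic; the function `allocate`, the one-million-word total and the idea of copying the BNC split are assumptions for the example, not a description of how the BNC itself was compiled.

```python
def allocate(total_words, percentages):
    """Split a word budget across categories according to whole-number percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total_words * pct // 100 for name, pct in percentages.items()}


# Split taken from the slide above: 90% written, 10% spoken.
bnc_like_split = {"written": 90, "spoken": 10}

# Hypothetical one-million-word corpus planned with the same balance.
targets = allocate(1_000_000, bnc_like_split)
print(targets)
```

    The same function could take a finer-grained breakdown (for example by domain or medium), as long as the percentages still add up to 100.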

  • THE DESIGN OF CORPUS

    SAMPLING

    Sampling refers to how the chunks of each genre are selected. Sampling is unavoidable if the corpus is to be representative. A sample is representative if what we find for the sample is also true for the general population.

  • THE DESIGN OF THE CORPUS

    SAMPLING

    In order to obtain a representative sample from a population we need to define the sampling unit, the population and the sampling frame.

    For example, for written text we can have the following:
    - sampling unit: a book, periodical, or newspaper
    - population: the assembly of all sampling units
    - sampling frame: the list of all sampling units

    For example, the population from which samples for the Brown corpus were drawn was written English text published in the United States in 1961 while its sampling frame was a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum.
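
    Once a sampling frame exists as a list, drawing sampling units from it can be sketched as simple random sampling without replacement. The code below is an assumed illustration: the frame of 1,000 placeholder titles, the sample size of 50 and the function `draw_sample` are invented, and this is not the procedure actually used for the Brown Corpus, which sampled within text categories.

```python
import random


def draw_sample(sampling_frame, n, seed=42):
    """Draw n distinct sampling units at random from the sampling frame."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(sampling_frame, n)


# Hypothetical sampling frame: a catalogue of 1,000 placeholder titles.
sampling_frame = [f"title_{i:04d}" for i in range(1, 1001)]

selected_units = draw_sample(sampling_frame, 50)
print(len(selected_units), "units drawn from a frame of", len(sampling_frame))
```

    Every unit in the frame has the same chance of selection here; a design that also aims at balance would typically apply the same draw separately within each genre or category.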

  • REFERENCES

    BOOKS
    - McEnery, T. & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. New York, NY: Cambridge University Press.
    - McEnery, T. & Wilson, A. (2001). Corpus Linguistics: An Introduction. Second Edition. Edinburgh: Edinburgh University Press.
    - Kennedy, D. G. (1998). An Introduction to Corpus Linguistics. Addison Wesley Longman.
    - O'Keeffe, A. & McCarthy, M. (2010). The Routledge Handbook of Corpus Linguistics. New York, NY: Routledge.

    JOURNALS
    - Taylor, C. (n.d.). What Is Corpus Linguistics? What the Data Says. ICAME Journal No. 32. University of Siena. Retrieved from: http://clu.uni.no/icame/ij32/ij32_179_200.pdf
    - Representativeness, Balance and Sampling. (n.d.). Retrieved from: http://www.lancaster.ac.uk/fass/projects/corpus/ZJU/xCBLS/chapters/A02.pdf
    - Leon, J. (2005). Claimed and Unclaimed Sources of Corpus Linguistics. Henry Sweet Society Bulletin. Issue No 44. CNRS, Universite Paris. Retrieved from: http://htl.linguist.univ-paris-diderot.fr/leon/leon.pdf

    PPT SLIDE PRESENTATIONS
    - Xiao, R. (2007). Corpus Design and Types of Corpora. Retrieved from: https://www.google.ba/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CCEQFjAAahUKEwiF2IWbwYvJAhVE6Q4KHYHtC68&url=http%3A%2F%2Fwww.lancaster.ac.uk%2Ffass%2Fprojects%2Fcorpus%2FZJU%2Fxpresentations%2Fsession%25201.ppt&usg=AFQjCNHrK-8YLAVHqQ-1D4vW-7O0HRAahg

    PICTURES
    - https://www.google.ba/search?q=what+is+corpus+linguistics&biw=1024&bih=610&source=lnms&tbm=isch&sa=X&ved=0CAYQ_AUoAWoVChMI0MbdhcKLyQIVxCgPCh2NUQxF

  • THANK YOU!

