Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.

Seminar in Applied Corpus Linguistics: Introduction

APLNG 597A
Xiaofei Lu

August 26, 2009

Overview

What is a corpus
Corpus design and compilation
Corpus annotation
Corpus querying and analysis
Resources
GOLD

What is a corpus?

Leech (1992):
  an unexciting phenomenon, a helluva lot of text, stored on a computer

Sinclair (1991):
  a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

Sinclair (2004):
  a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

Types of corpora

General-purpose vs. specialized corpora
  The British National Corpus
  Michigan Corpus of Academic Spoken English

Native vs. learner corpora
  International Corpus of Learner English

Monolingual vs. parallel & comparable corpora
  The JRC-Acquis Multilingual Parallel Corpus
  The English-Chinese Parallel Concordancer

Corpora representing one or diverse language varieties
  International Corpus of English

Synchronic vs. diachronic corpora

Spoken vs. written corpora

Corpus design

Purpose/orientation, type

External criteria for content selection
  Communicative function of a text
  Mode, medium, interaction, domain, topic

Sampling, size

Representativeness, balance, homogeneity

Design of the BNC

Corpus annotation

Why annotate
Levels of corpus annotation
Difficulties for corpus annotation
Standards and encoding

Why annotate

For linguistic research
  Allow more effective corpus searches

For natural language processing
  Spelling and grammar checking
  Machine translation

Levels of corpus annotation

Sentence and word segmentation
Lemmatization and part-of-speech (POS) tagging
Chunking and syntactic parsing
Semantic, pragmatic, discourse, and stylistic tagging
Learner corpora: error annotation
Project-specific annotation

Difficulties for corpus annotation

Ambiguity
  I saw a pig with binoculars.
  Problems for tagging, parsing, & word sense disambiguation (WSD)

Unknown words
  Identification
  POS tagging
  Semantic annotation

Precision, recall, inter-annotator agreement
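
The three evaluation measures named above can be sketched as follows. This is a minimal illustration, not a method prescribed by the slides: the function names are hypothetical, and Cohen's kappa is used as one common way to quantify agreement between two annotators (other agreement measures exist).

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Precision and recall of predicted annotations against a gold standard.

    Both arguments are collections of hashable items, e.g. (position, tag) pairs.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(precision_recall(gold={1, 2, 3, 4}, predicted={3, 4, 5}))
    print(cohen_kappa(["NN", "VB", "NN", "JJ"], ["NN", "NN", "NN", "JJ"]))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one tag dominates the data.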

Standards and encoding

Useful standards
  Separable
  Documentation
  Linguistically consensual
  Compatibility with existing standards

Encoding
  Simple encoding: present_JJ
  XML-style: <w type="JJ">present</w>
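
To make the relationship between the two encodings concrete, here is a small sketch that converts the simple word_TAG encoding into the XML-style encoding. The function name is hypothetical, and the sketch assumes tokens are separated by whitespace and that the tag follows the last underscore in each token.

```python
from xml.sax.saxutils import escape


def underscore_to_xml(tagged_text):
    """Convert 'word_TAG word_TAG ...' into '<w type="TAG">word</w> ...'."""
    elements = []
    for token in tagged_text.split():
        # Split on the LAST underscore so that words containing "_" survive.
        word, _, tag = token.rpartition("_")
        if not word:
            raise ValueError("token has no word_TAG form: " + token)
        # Escape XML special characters in both the attribute and the text.
        safe_tag = escape(tag, {'"': "&quot;"})
        elements.append('<w type="{}">{}</w>'.format(safe_tag, escape(word)))
    return " ".join(elements)


if __name__ == "__main__":
    print(underscore_to_xml("the_DT present_JJ situation_NN"))
```

The XML form is more verbose, but it keeps the annotation separable from the text and leaves room for further attributes such as a lemma.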

Corpus querying and analysis

Using Windows- or web-based software
  Good for processing raw corpora
  Word frequency, concordances, lexical bundles, and keyword lists
  Examples: AntConc and GOLD

Using natural language processing tools
  Good for processing annotated corpora
  Extracting occurrences of grammatical patterns
  Examples: Stanford parser and Tregex
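
For readers who have not seen a word frequency list or a concordance, the sketch below shows roughly what such software computes on a raw corpus. It is a simplified illustration with hypothetical function names and a deliberately naive tokenizer; it does not reproduce how AntConc or GOLD work internally.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    tokens = tokenize("I saw a pig with binoculars. The pig saw me.")
    print(Counter(tokens).most_common(3))  # word frequency list
    for left, key, right in kwic(tokens, "pig", window=2):
        print("{:>20}  {}  {}".format(left, key, right))
```

Sorting the returned tuples by left or right context gives the familiar sortable KWIC display.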

Interpreting corpus data

Statistical analysis examples

Are frequency differences statistically significant?
  w appears x times in an n-word corpus, and y times in an m-word corpus
  Chi-square test and Fisher's Exact Test
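
The frequency comparison above can be worked through with a Pearson chi-square test on a 2x2 table, using the same symbols (x, n, y, m). This is a sketch under stated assumptions: the function name is hypothetical, no continuity correction is applied, and Fisher's Exact Test, which is usually preferred when counts are small, is not implemented here.

```python
import math


def chi_square_2x2(x, n, y, m):
    """Chi-square test: w occurs x times in n words vs. y times in m words.

    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    a, b = x, n - x  # corpus 1: occurrences of w, all other tokens
    c, d = y, m - y  # corpus 2: occurrences of w, all other tokens
    total = n + m
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = total * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = chi_square_2x2(x=30, n=100, y=10, m=100)
    print(chi2, p)
```

A small p-value suggests that the difference in relative frequency is unlikely to be due to chance alone, although it says nothing about why the corpora differ.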

Collocation analysis
  How strongly are x and y associated?
  Mutual information and t-test
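
Both association measures compare the observed co-occurrence frequency of x and y with the frequency expected if the two words were independent. The sketch below uses the formulations commonly found in corpus linguistics (pointwise mutual information in bits, and the t-score); the function and parameter names are hypothetical, and exact definitions vary between tools.

```python
import math


def collocation_scores(f_xy, f_x, f_y, corpus_size):
    """Return (mutual information, t-score) for the word pair x, y.

    f_xy: co-occurrence frequency of x and y
    f_x, f_y: individual frequencies of x and y
    corpus_size: total number of tokens in the corpus
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0 or corpus_size <= 0:
        raise ValueError("all frequencies must be positive")
    # Expected co-occurrence frequency if x and y were independent.
    expected = f_x * f_y / corpus_size
    mutual_information = math.log2(f_xy / expected)
    t_score = (f_xy - expected) / math.sqrt(f_xy)
    return mutual_information, t_score


if __name__ == "__main__":
    print(collocation_scores(f_xy=8, f_x=16, f_y=32, corpus_size=1024))
```

Mutual information tends to favour rare, exclusive pairings, while the t-score favours frequent ones, so the two are often reported together.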

Resources

Books
  Hunston (2002): Corpora in Applied Linguistics
  McEnery, Xiao & Tono (2006): Corpus-Based Language Studies

Journals
  International Journal of Corpus Linguistics
  Corpus Linguistics and Linguistic Theory
  Corpora

Websites and mailing lists
  Bookmarks for corpus-based linguists
  Linguistic Data Consortium
  The Corpora List

Resources

Corpus annotation and analysis tools
  Stanford Natural Language Processing Group

Places for exploration
  MICASE
  BNC Online

Note on research project design

Purpose of project

Corpus compilation and annotation

Corpus analysis
  Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  Top-down: start with given categories and search for evidence of use and variance

Caution on generalizability

GOLD: Graphic Online Language Diagnostic

One of 10 projects in CALPER
Co-directors: Michael McCarthy & Xiaofei Lu
This is work in progress (2006-2010)

Overview of functions

An online tool for users to
  Build, upload, and update their own corpora
  Share corpora with each other
  Search corpora

Corpus compilation

A user can compile a corpus by
  Directly creating and uploading an XML file
  Using the guided XML creation interface

An uploaded corpus can be easily updated
  Documents can be added or deleted
  The whole corpus can be deleted

Corpus sharing

GOLD facilitates easy data sharing

A corpus may be set to be
  Private, shared, or public

The corpus owner may give others the right to
  View, add, edit, or delete corpora

Metadata information

A corpus should contain informative metadata
  Information about the learner
  Information about the sample

Facilitates contrastive and longitudinal studies

Corpus search

Select one or more corpora to search

Specify key words or phrases
  May use the wildcard character, e.g. book*

Specify contexts
  Size of context window
  Context words and their positions

Specify metadata conditions
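
A search of this kind can be approximated with a regular expression and a context window. The sketch below is only an illustration with hypothetical function names: it assumes that `*` stands for zero or more word characters and that a context word may occur anywhere inside the window, which may differ from how GOLD actually interprets wildcards and context positions. Metadata conditions are left out.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern such as 'book*' into a compiled regular expression."""
    # Escape everything literally, then let each '*' match any run of word characters.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\w*".join(parts), re.IGNORECASE)


def search(tokens, pattern, context_word=None, window=2):
    """Return (position, token) hits, optionally requiring a nearby context word."""
    regex = wildcard_to_regex(pattern)
    hits = []
    for i, token in enumerate(tokens):
        if not regex.fullmatch(token):
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context_word is None or context_word in neighbours:
            hits.append((i, token))
    return hits


if __name__ == "__main__":
    tokens = "she books a room and reads a book about bookkeeping".split()
    print(search(tokens, "book*"))
    print(search(tokens, "book*", context_word="reads", window=2))
```

Restricting a context word to a specific position, such as one word to the left, would replace the `neighbours` slice with a single index lookup.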

Corpus search results

Display of search results
  Sortable KWIC display of search results
  Sortable graphic display of search results

Additional statistics of selected corpora
  Sortable wordlist
  MLS, MLW, Type/Token ratio
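
The three statistics can be computed along the following lines. The slide does not expand the abbreviations, so this sketch assumes that MLS is mean length of sentence (in words) and MLW is mean length of word (in characters); the function name and the naive sentence and word splitting are also assumptions, not a description of GOLD's implementation.

```python
import re


def basic_stats(text):
    """Return MLS, MLW, and type/token ratio for a text (naive segmentation)."""
    # Sentences: split on runs of sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: lowercase alphabetic strings, apostrophes allowed.
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")
    return {
        "MLS": len(words) / len(sentences),               # words per sentence
        "MLW": sum(len(w) for w in words) / len(words),   # characters per word
        "TTR": len(set(words)) / len(words),              # distinct words / all words
    }


if __name__ == "__main__":
    print(basic_stats("The cat sat. The dog sat down!"))
```

Type/token ratio falls as texts get longer, so comparisons are only meaningful between samples of similar size.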

N-gram search

Procedure
  Select one or more corpora to search
  Specify search word
  Specify contexts
  Specify metadata conditions

Search results
  Sortable list of n-grams found in selected corpora
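
The core of an n-gram search for a given word can be sketched as below. The function name is hypothetical, and the sketch covers only the search word and the n-gram size; context and metadata conditions are omitted.

```python
from collections import Counter


def ngrams_containing(tokens, word, n=2):
    """Count all n-grams of length n that contain word, most frequent first."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if word in gram:
            counts[gram] += 1
    return counts.most_common()


if __name__ == "__main__":
    tokens = "the pig saw the pig with the binoculars".split()
    for gram, freq in ngrams_containing(tokens, "pig", n=2):
        print(freq, " ".join(gram))
```

The returned list is sorted by frequency; sorting it alphabetically instead gives the other ordering a sortable display would offer.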

Summary of features

Difference from other online tools
  Can create, share, and search multiple corpora
  Ability to work with any language

With informative metadata, one can
  Compare performance of different learners
  Track development of a learner or a group of learners over time

Challenges

Corpora for benchmarking
Multilingual natural language processing
Suggestions on desirable functions welcome

