Date post: | 04-Jan-2016 |
Upload: | nigel-mccoy |
Overview
  What is a corpus
  Corpus design and compilation
  Corpus annotation
  Corpus querying and analysis
  Resources
  GOLD
What is a corpus?
  Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
  Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
  Sinclair (2004): a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research
Types of corpora
  General-purpose vs. specialized corpora
    The British National Corpus
    Michigan Corpus of Academic Spoken English
  Native vs. learner corpora
    International Corpus of Learner English
  Monolingual vs. parallel & comparable corpora
    The JRC-Acquis Multilingual Parallel Corpus
    The English-Chinese Parallel Concordancer
  Corpora representing one or diverse language varieties
    International Corpus of English
  Synchronic vs. diachronic corpora
  Spoken vs. written corpora
Corpus design
  Purpose/orientation, type
  External criteria for content selection
    Communicative function of a text
    Mode, medium, interaction, domain, topic
  Sampling, size
  Representativeness, balance, homogeneity
  Design of the BNC
Corpus annotation
  Why annotate
  Levels of corpus annotation
  Difficulties for corpus annotation
  Standards and encoding
Why annotate
  For linguistic research
    Allows more effective corpus searches
  For natural language processing
    Spelling and grammar checking
    Machine translation
Levels of corpus annotation
  Sentence and word segmentation
  Lemmatization and part-of-speech (POS) tagging
  Chunking and syntactic parsing
  Semantic, pragmatic, discourse, and stylistic tagging
  Learner corpora: error annotation
  Project-specific annotation
Difficulties for corpus annotation
  Ambiguity
    I saw a pig with binoculars.
    Problems for tagging, parsing, and word sense disambiguation (WSD)
  Unknown words
    Identification
    POS tagging
    Semantic annotation
  Precision, recall, inter-annotator agreement
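The three evaluation measures named on this slide can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: the function names are made up, and Cohen's kappa is assumed here as one common way to quantify inter-annotator agreement (the slides do not say which measure is intended).

```python
from collections import Counter

def precision_recall(gold, predicted):
    """Precision and recall of predicted annotations against a gold standard.

    Both arguments are collections of hashable items, e.g. (position, tag) pairs.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Precision answers "how many of the tagger's decisions were right", recall answers "how many of the gold annotations were found", and kappa corrects raw agreement between two human annotators for agreement expected by chance.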
Standards and encoding
  Useful standards
    Separable
    Documentation
    Linguistically consensual
    Compatibility with existing standards
  Encoding
    Simple encoding: present_JJ
    XML-style: <w type="JJ">present</w>
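The two encodings on this slide carry the same information, so one can be derived from the other mechanically. The sketch below assumes whitespace-separated `word_TAG` tokens and uses a placeholder tag `UNK` (an assumption, not part of any standard) for tokens that carry no tag.

```python
from xml.sax.saxutils import escape, quoteattr

def simple_to_xml(tagged_text):
    """Convert 'word_TAG word_TAG ...' into <w type="TAG">word</w> elements."""
    elements = []
    for token in tagged_text.split():
        # Split on the LAST underscore so words containing '_' stay intact.
        word, separator, tag = token.rpartition("_")
        if not separator:
            word, tag = token, "UNK"  # untagged token; "UNK" is a placeholder
        # escape/quoteattr keep characters such as & and < well-formed in XML.
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)
```

For example, `simple_to_xml("present_JJ")` returns `<w type="JJ">present</w>`, the XML-style form shown on the slide.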
Corpus querying and analysis
  Using Windows- or web-based software
    Good for processing raw corpora
    Word frequency, concordances, lexical bundles, and keyword lists
    Examples: AntConc and GOLD
  Using natural language processing tools
    Good for processing annotated corpora
    Extracting occurrences of grammatical patterns
    Examples: Stanford parser and Tregex
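Two of the raw-corpus analyses listed above, word frequency and concordancing, are simple enough to sketch directly. This is only an illustration of the idea, not how AntConc or GOLD work internally; the tokenizer is a deliberately crude assumption (lowercased alphabetic strings, English only).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return alphabetic tokens (crude, English-only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequencies(text):
    """Map each word form to its number of occurrences."""
    return Counter(tokenize(text))

def concordance(text, keyword, window=3):
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines
```

`word_frequencies(text).most_common(20)` gives a frequency list, and sorting the concordance lines by their left or right context reproduces the familiar sortable KWIC view.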
Interpreting corpus data
  Statistical analysis examples
  Are frequency differences statistically significant?
    w appears x times in an n-word corpus, and y times in an m-word corpus
    Chi-square test and Fisher’s Exact Test
  Collocation analysis
    How strongly are x and y associated?
    Mutual information and t-test
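The measures on this slide can be computed from a handful of counts. The sketch below uses formulations that are common in corpus linguistics, stated here as assumptions since the slides give no formulas: a Pearson chi-square on a 2x2 table without continuity correction, mutual information as log2 of observed over expected co-occurrence, and a t-score that divides observed minus expected by the square root of observed. Function and parameter names are illustrative.

```python
import math

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (1 df, no continuity correction) for a word that
    occurs x times in an n-word corpus and y times in an m-word corpus."""
    a, b = x, n - x  # corpus 1: the word vs. all other tokens
    c, d = y, m - y  # corpus 2: the word vs. all other tokens
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the 2x2 table sums to zero")
    chi2 = (n + m) * (a * d - b * c) ** 2 / denominator
    # Upper-tail probability of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def mutual_information(pair_freq, freq_x, freq_y, corpus_size):
    """log2(observed / expected) co-occurrence; assumes pair_freq > 0."""
    return math.log2(pair_freq * corpus_size / (freq_x * freq_y))

def t_score(pair_freq, freq_x, freq_y, corpus_size):
    """(observed - expected) / sqrt(observed); assumes pair_freq > 0."""
    expected = freq_x * freq_y / corpus_size
    return (pair_freq - expected) / math.sqrt(pair_freq)
```

For a word occurring 30 times in one 1,000-word corpus and 10 times in another, `chi_square_2x2(30, 1000, 10, 1000)` gives a statistic of about 10.2 and a p-value well below 0.01. With very small counts the chi-square approximation becomes unreliable, which is where Fisher's Exact Test (available, for instance, as `scipy.stats.fisher_exact`) is the usual alternative.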
Resources
  Books
    Hunston (2002): Corpora in Applied Linguistics
    McEnery (2006): Corpus-Based Language Studies
  Journals
    International Journal of Corpus Linguistics
    Corpus Linguistics and Linguistic Theory
    Corpora
  Websites and mailing lists
    Bookmarks for corpus-based linguists
    Linguistic Data Consortium
    The Corpora List
Resources
  Corpus annotation and analysis tools
    Stanford Natural Language Processing Group
  Places for exploration
    MICASE
    BNC Online
Note on research project design
  Purpose of project
  Corpus compilation and annotation
  Corpus analysis
    Bottom-up: from observations of recurring patterns to hypotheses and generalizations
    Top-down: start with given categories and search for evidence of use and variance
  Caution on generalizability
GOLD: Graphic Online Language Diagnostic
  One of 10 projects in CALPER
  Co-directors: Michael McCarthy & Xiaofei Lu
  This is work in progress (2006-2010)
Overview of functions
  An online tool for users to
    Build, upload, and update their own corpora
    Share corpora with each other
    Search corpora
Corpus compilation
  A user can compile a corpus by
    Directly creating and uploading an XML file
    Using the guided XML creation interface
  An uploaded corpus can be easily updated
    Documents can be added or deleted
    The whole corpus can be deleted
Corpus sharing
  GOLD facilitates easy data sharing
  A corpus may be set to be private, shared, or public
  The corpus owner may give others the right to view, add, edit, or delete corpora
Metadata information
  A corpus should contain informative metadata
    Information about the learner
    Information about the sample
  Facilitates contrastive and longitudinal studies
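To make the link between metadata and study design concrete, the sketch below attaches learner and sample information to each text and then uses it for a contrastive selection and a longitudinal series. Everything here is hypothetical: the field names (`learner`, `l1`, `week`), the sample texts, and the helper functions are invented for illustration, and the slides do not describe GOLD's actual metadata schema.

```python
# Hypothetical metadata fields; GOLD's actual schema is not given in the slides.
samples = [
    {"learner": "L01", "l1": "Korean", "week": 1,
     "text": "I go school yesterday."},
    {"learner": "L01", "l1": "Korean", "week": 8,
     "text": "I went to school yesterday with my friend."},
    {"learner": "L02", "l1": "Spanish", "week": 1,
     "text": "Yesterday I have gone to the school."},
]

def select(samples, **conditions):
    """Keep samples whose metadata match every key=value condition."""
    return [s for s in samples
            if all(s.get(key) == value for key, value in conditions.items())]

def track(samples, learner, measure):
    """Apply a measure to one learner's samples in chronological order."""
    own = sorted(select(samples, learner=learner), key=lambda s: s["week"])
    return [(s["week"], measure(s["text"])) for s in own]

def word_count(text):
    """A deliberately simple measure: number of whitespace-separated tokens."""
    return len(text.split())
```

`select(samples, week=1)` gathers comparable samples from different learners (a contrastive view), while `track(samples, "L01", word_count)` follows one learner across time (a longitudinal view).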
Corpus search
  Select one or more corpora to search
  Specify key words or phrases
    May use the wildcard character, e.g. book*
  Specify contexts
    Size of context window
    Context words and their positions
  Specify metadata conditions
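The search options above (wildcard keyword, context window, context word and position) can be approximated as follows. This is a sketch under stated assumptions, not GOLD's implementation: `*` is assumed to match any run of word characters, positions are assumed to be token offsets relative to the hit (-1 for immediately left, +1 for immediately right), and all function and parameter names are illustrative.

```python
import re

def wildcard_to_regex(pattern):
    """Compile a keyword containing '*' wildcards into a regular expression."""
    parts = [re.escape(part) for part in pattern.lower().split("*")]
    return re.compile(r"\w*".join(parts))

def search(tokens, pattern, window=3, context_word=None, position=None):
    """Return (index, left context, hit, right context) for matching tokens.

    If context_word is given, a hit is kept only when that word occurs within
    the window, or exactly at the given offset when position is set.
    """
    regex = wildcard_to_regex(pattern)
    target = context_word.lower() if context_word is not None else None
    hits = []
    for i, token in enumerate(tokens):
        if not regex.fullmatch(token.lower()):
            continue
        if target is not None:
            if position is not None:
                j = i + position
                keep = 0 <= j < len(tokens) and tokens[j].lower() == target
            else:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                keep = target in (t.lower() for t in neighbours)
            if not keep:
                continue
        hits.append((i, tokens[max(0, i - window):i], token,
                     tokens[i + 1:i + 1 + window]))
    return hits
```

With this reading of the wildcard, `book*` matches "book", "booked", "bookings", and "bookstore", and adding `context_word="a", position=-1` narrows the hits to occurrences directly preceded by "a".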
Corpus search results
  Display of search results
    Sortable KWIC display of search results
    Sortable graphic display of search results
  Additional statistics of selected corpora
    Sortable wordlist
    MLS, MLW, Type/Token ratio
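The summary statistics listed above are straightforward to compute. The slides do not expand the abbreviations, so the sketch below assumes MLS is mean length of sentence (in words) and MLW is mean length of word (in characters); sentence splitting and tokenization are deliberately simple and English-oriented.

```python
import re

def text_statistics(text):
    """Return MLS, MLW, and type/token ratio for a text.

    Assumed definitions: MLS = words per sentence, MLW = characters per word,
    TTR = distinct lowercased word forms divided by total words.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text contains no sentences or no words")
    types = {word.lower() for word in words}
    return {
        "MLS": len(words) / len(sentences),
        "MLW": sum(len(word) for word in words) / len(words),
        "TTR": len(types) / len(words),
    }
```

Because the type/token ratio falls as texts get longer, it is safest to compare it only across samples of similar length.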
N-gram search
  Procedure
    Select one or more corpora to search
    Specify search word
    Specify contexts
    Specify metadata conditions
  Search results
    Sortable list of n-grams found in selected corpora
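An n-gram search of this kind amounts to sliding a window of n tokens over the corpus and keeping the windows that contain the search word. The sketch below is an illustration under that assumption; the function name and the sort order (most frequent first, then alphabetical) are choices made here, not taken from GOLD.

```python
from collections import Counter

def ngrams_with_word(tokens, word, n=2):
    """Count the n-grams that contain `word`, most frequent first."""
    word = word.lower()
    tokens = [token.lower() for token in tokens]
    counts = Counter(
        tuple(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if word in tokens[i:i + n]
    )
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Calling it with `n=3` or `n=4` on a larger corpus surfaces recurring multi-word sequences around the search word, which is how lexical bundles are typically found.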
Summary of features
  Difference from other online tools
    Can create, share, and search multiple corpora
    Ability to work with any language
  With informative metadata, one can
    Compare performance of different learners
    Track development of a learner or a group of learners over time