The <richer> the people, the <bigger> the crates
the more <literal> the <better>
the less <elaborate> you can be the <better>
The <slower> the <better>
the <higher> the temperature the <darker> the malt
The <older> a database is, the <richer> it becomes
The <higher> the clay content, the more <sticky> the soil
The <fresher> the soot, the <longer> it needed
the <higher> the number, the <greater> the protection
The <older> you are the <greater> the chance
The <smaller> the mesh the more <expensive> the net
Linguistic Institute 2017
Corpus Linguistics Annotation
Amir Zeldes [email protected] Nathan Schneider [email protected]
in rented accommodation forever. Hi Bruno. <Long time no speak>. I prefer your new identity to old ones 20 April, 2006 at 1:03 PM Hello <Long time no browse>, they have banned the "google translated" version
Linguistic Institute 2017 - Corpus Linguistics
Some corrections – <s>
All text must be within some <s> …</s>
✗ <s>This is a sentence</s>…
✓ <s>This is a sentence…</s>
We mark full utterances, not clauses:
✗ <s>I know,</s> <s>that’s silly</s>
✓ <s>I know, that’s silly</s>
Motivation: Maximal predictability!
<mantra>a ‘stupid’ 100% predictable guideline is often better than a clever but inconsistent one</mantra>
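A 100%-predictable guideline is also easy to check by machine. The sketch below flags any character data that falls outside an <s> element; the <text> root element is an assumption for illustration, not part of the guideline.

```python
# Check that all character data lies inside <s> elements.
import xml.etree.ElementTree as ET

def stray_text(xml_string):
    """Return text found outside any <s> element (whitespace ignored)."""
    root = ET.fromstring(xml_string)
    stray = []
    def walk(elem, inside_s):
        inside = inside_s or elem.tag == "s"
        if elem.text and elem.text.strip() and not inside:
            stray.append(elem.text.strip())
        for child in elem:
            walk(child, inside)
            # tail text belongs to the parent's context, not the child's
            if child.tail and child.tail.strip() and not inside:
                stray.append(child.tail.strip())
    walk(root, root.tag == "s")
    return stray

good = "<text><s>This is a sentence.</s><s>So is this.</s></text>"
bad = "<text><s>This is a sentence.</s>loose words</text>"
print(stray_text(good))  # []
print(stray_text(bad))   # ['loose words']
```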
Last time
Corpora allow us to:
Find similarities and differences in context (smart=?clever)
Model unpredictable alternations probabilistically (of or ’s?)
Compare types of language (Gulliver’s Travels vs. hotdogs)
Many differences are quantitative:
No categorical phenomenon makes a text “native(-like)”
Working with probabilities means counting
Counting means categorizing = qualitative analysis
Discussion – Fillmore (1992)
Some open questions:
How could grammaticality judgments be untrue?
Why would corpus results be uninteresting?
Is a dichotomy armchair/corpus (still) justified?
Does use of corpora mandate/preclude certain types of theories?
Armchair of Charles Dickens (Wikimedia)
What do I do if I need a corpus?
First of all we need: a research question
The research question will determine the type of corpus you need
A specially tailored corpus created for your purpose? (expensive, potential pitfalls)
An existing corpus created for similar/other purposes? (less effort [but still some], potentially less suitable)
What kind of corpus would you need to study...
The loss of English thou in favor of you?
What learners of English find hard to acquire?
Differences between literary English and spoken varieties?
American/Canadian/British/Australian English? Other World Englishes?
Real spoken language and performed spoken language in media or subtitles?
Differences in political opinions in social media?
Parameters
These questions all refer to different parameters:
Time
Native language
Medium
Difficulty(?)
Politics
These parameters must be varied and coded across the corpus – how do we recognize target varieties?
Coding scheme must be documented, reproducible
Sampling data
A corpus in the philological sense can be exhaustive: no chance of getting more data
The complete works of Shakespeare = the language of Shakespeare[an plays]
Gothic = the Bible of Wulfila (really?)
In the context of linguistics, a corpus is always a sample
Sampling data has been studied in depth in the field of statistics
Representativeness
Some populations are infinite (e.g. all utterances of the English language, past, present and future)
Can’t examine ‘entire’ population – we need a sample
A sample should be representative of the population it is drawn from:
US Americans are: (census data)
33.9% under 25
51.1% female
50% earn $46,326 or less per household …
Therefore a representative sample of American English should have the same distribution among speakers/writers
Representativeness
But what parameters should we check for?
Everything: how tall they are, what they weigh, hair color....
But we can't get a group of 100 people split exactly like the population
And there are ∞ parameters
Are they really all relevant?
Representativeness is always discussed with respect to a subset of parameters and applications (marketing toothpaste)
The residual bias should be eliminated by random sampling – but chance of error is never 0%
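Random sampling itself is a one-liner in most languages; a fixed seed keeps the draw reproducible, which matters for documentation. The speaker IDs below are invented toy data.

```python
# Simple random sample without replacement over a toy population.
import random

rng = random.Random(0)  # fixed seed: the draw is reproducible, documentable
population = [f"speaker-{i}" for i in range(1000)]
sample = rng.sample(population, 50)   # no speaker drawn twice
print(len(sample), len(set(sample)))  # 50 50
```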
Open questions
Are reference corpora really representative?
What are the proportions?
Example: British National Corpus (BNC)
10% spoken data (expensive) : 90% written (cheaper)
Internal divisions, e.g. spoken: demographic part (age, region, sex); context part (educational, business, institutional, leisure, others) – dialogues and monologues
What do the proportions correspond to?
Should a corpus attempt to be proportional?
Variability
How can we ensure our target variety is fully represented?
“Representativeness refers to the extent to which a sample includes the full range of variability in a population.”
Biber (1993), see also Hunston (2008) [recommended readings in Canvas!]
We still need to decide on the population our results should generalize to
Be explicit about our assumptions
Proportional sampling – text types
if we collected all the language produced and received for a week by residents of a city in the U.S., we could identify the actual proportions of language varieties that these people experienced: probably something like 80% conversation, 10% television shows, 1% newspapers, 1% novels, 2% meetings, 2% radio broadcasts, 2% texts that they wrote (memos, email messages, letters), and 2% other texts (signs, instructions, specialist written texts, etc.). (Biber & Jones 2009:1288)
Good model of what a person is likely to hear in a day?
But minimal chances to encounter many registers and constructions (scientific English?)
On some level, we want balance because a reference corpus should have enough examples of "everything"
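The percentages in the quotation can be written down and checked directly; the short genre labels below are mine, not Biber & Jones's, and the 1,000-text corpus size is an arbitrary illustration.

```python
# Proportional design: scale the quoted percentages to a corpus size.
proportions = {
    "conversation": 0.80, "television": 0.10, "newspapers": 0.01,
    "novels": 0.01, "meetings": 0.02, "radio": 0.02,
    "own writing": 0.02, "other": 0.02,
}
assert abs(sum(proportions.values()) - 1.0) < 1e-9  # percentages sum to 100%

# Texts per genre in a proportional 1,000-text corpus: only 10 novels!
print({genre: round(share * 1000) for genre, share in proportions.items()})
```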
Proportional sampling – text types
Most 'general purpose' corpora attempt to include some amount from as many genres as possible
Sampling within each subgroup (stratified)
Really good reference corpora have both written and spoken language (expensive to collect!)
Some corpora repeat existing designs for comparability:
Brown (US) / LOB (UK) – 1960s
Frown / F-LOB – 1990s (same text types)
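The "some amount from as many genres as possible" idea can be sketched as stratified sampling. The catalog, genre labels, and sizes below are invented toy data; the point is the per-stratum draw with a documented seed.

```python
# Stratified sampling: draw a fixed quota from each genre stratum.
import random

def stratified_sample(catalog, per_genre, seed=42):
    """Draw per_genre documents from each genre (fewer if a genre is small)."""
    rng = random.Random(seed)  # fixed seed -> reproducible, documentable
    by_genre = {}
    for doc in catalog:
        by_genre.setdefault(doc["genre"], []).append(doc)
    sample = []
    for genre, docs in sorted(by_genre.items()):
        sample.extend(rng.sample(docs, min(per_genre, len(docs))))
    return sample

catalog = (
    [{"id": f"news-{i}", "genre": "news"} for i in range(50)]
    + [{"id": f"fic-{i}", "genre": "fiction"} for i in range(30)]
    + [{"id": f"conv-{i}", "genre": "conversation"} for i in range(5)]
)
picked = stratified_sample(catalog, per_genre=10)
# Each large genre contributes 10 documents; the small one all 5.
print({g: sum(d["genre"] == g for d in picked)
       for g in ("news", "fiction", "conversation")})
```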
Reception and production
This time written down:
<controversial-statement> We do not necessarily use the same grammar for reception and production </controversial-statement>
Some texts are certainly more received than others:
The Washington Post
Game of Thrones
Some of us produce lots of text that others never produce, or only receive:
SMS, Twitter, e-mail, …
Lectures, State of the Union, best man speech, …
What is representative English writing?
Variation – Brown corpus (Kučera & Francis 67)
Nouns with striking frequency difference in Brown (60s)/Frown (90s) “Learned” (J) section:
rank  word       Brown J  Frown J  LL
2     formula    29       467      -460.56
6     woman      21       215      -182.78
8     model      38       218      -137.35
14    X          14       111      -84.27
15    household  10       100      -84.22
33    signal     8        65       -49.94
35    logic      5        56       -49.26
38    brain      6        56       -45.83
40    network    5        52       -44.48
42    wife       5        51       -43.29
43    conflict   13       71       -43.25
…
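A sketch of a signed log-likelihood (G2) keyness score of the kind shown in an LL column, following Dunning (1993). The 160,000-token section sizes are assumed round numbers, so the output only approximates the published figure for formula.

```python
# Signed log-likelihood (G2) for a word's frequency in two (sub)corpora.
import math

def log_likelihood(a, b, corpus1_size, corpus2_size):
    """Signed G2: negative when the word is relatively rarer in corpus 1."""
    total = corpus1_size + corpus2_size
    e1 = corpus1_size * (a + b) / total  # expected count in corpus 1
    e2 = corpus2_size * (a + b) / total  # expected count in corpus 2
    g2 = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)
    return g2 if a / corpus1_size >= b / corpus2_size else -g2

# 'formula': 29 hits vs. 467 hits in two equal-sized toy sections
print(round(log_likelihood(29, 467, 160_000, 160_000), 2))
```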
Variation – Brown corpus (Kučera & Francis 67)
Nouns that only occur in Frown J (Freiburg, 90s):
rank  word               freq (Frown J)
11    caption            55
12    DNA                52
14    Gender             50
24    Opiate             37
28    retrieval          33
38    African-Americans  28
39    document           28
40    EC                 28
41    galaxies           28
…
Reddit – 2013-17 vs. 2005-09
iPad 130:0
Skyrim 141:0
Kickstarter 61:0
Streamer 41:0
so unless we get the streamer to comment on his political views we've got no real data (2014)
It was orchestrated by another streamer who coordinated his followers… (2014; different user)
Reddit – 2013-17 vs. 2005-09
Borat 29:0
Enterprisey 19:0
found a bunch of meaningless enterprisey IT consultant bullshit. (2006)
Britannica 17:0
Pwned 14:0
the first girl who commented would probably have pwned you seriously. (2009)
Interim conclusion
What you put in (co-)determines what you get out
In many cases it makes sense to be probabilistically representative (no need to seek out the weirdest texts)
In some cases you want more texts from rare varieties
Generally want to maximize variability within the relevant domain (e.g. reddit -> random subreddits)
Doing this well entails:
Determining your parameters
Filling up all possible combinations of categories
Being explicit about your criteria (re-sampling same author OK?...)
Preprocessing - a quick primer
Before we can search in corpora, we need to prepare them:
Structural markup
Tokenization
Linguistic annotation
Preprocessing
So far we've talked about corpora as massive collections of text
We've taken for granted that we can search for sensible things like words and sentences
Massive collections of text initially look like this:
..._bye_now_Stan_21_JUNE_2014_at_12:04_PM_Hi_Stan,_I_also_thikn_...
This is not just a practical problem – there are important theoretical issues involved
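A first step from raw character streams toward words: a deliberately naive regex tokenizer. A real pipeline needs rules for clitics, abbreviations, URLs, and so on; this sketch only separates word characters from punctuation.

```python
# Naive regex tokenization: word runs or single punctuation marks.
import re

def tokenize(text):
    """Split into word tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hi Stan, I also thikn so."))
# ['Hi', 'Stan', ',', 'I', 'also', 'thikn', 'so', '.']
```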
Preprocessing as analysis
Whether you do your own preprocessing or use an existing corpus:
Preprocessing is part of the analysis
Every step is a linguistic decision, e.g. normalization:
thikn > think //reason: “I know better”
No ads, images //reason: “not interesting”
Like everything else in methodology – it must be documented
General possibilities:
Manual (accurate?) vs. automatic (predictable?)
Stochastic vs. rule-based vs. hybrid (different types of errors)
More or less linguistic knowledge (opportunities and pitfalls)
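The thikn > think style of normalization can be made documentable by logging every change instead of silently applying it. The one-entry correction table below is invented for illustration.

```python
# Normalization as a documented, auditable mapping.
CORRECTIONS = {"thikn": "think"}  # reason: obvious typo ("I know better")

def normalize(tokens):
    """Return (normalized tokens, log of changes) so decisions stay auditable."""
    out, log = [], []
    for tok in tokens:
        fixed = CORRECTIONS.get(tok, tok)
        if fixed != tok:
            log.append((tok, fixed))
        out.append(fixed)
    return out, log

tokens, changes = normalize(["I", "also", "thikn", "so"])
print(tokens)   # ['I', 'also', 'think', 'so']
print(changes)  # [('thikn', 'think')]
```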
Getting to plain text
In many simple corpus architectures, we want to reduce the data to plain text
No pictures
No formatting
…
Pros:
- Easy to search
- Easy to count
- Normalization
- Automatic processing
Cons:
- Loss of information
- Decisions are tricky
- Some types of study precluded
Back to our texts
If your text has a short_name metadatum and all text is inside <s> sentences, then your data should validate.
If not, there may be problems… We will fix them next!
Validating is important!
Ensures data was annotated uniformly
Helps you to find what you’re looking for later
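A minimal stand-in for the 'validate' button, using only Python's standard library. This checks well-formedness, not conformance to a specific schema, so it is weaker than a real validator; the parser error message does include a line number, as noted later for the homework.

```python
# Well-formedness check: does the XML parse at all?
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError as err:
        print(f"Problem: {err}")  # message includes the offending line number
        return False

print(is_well_formed("<text><s>Fine.</s></text>"))  # True
print(is_well_formed("<text><s>Oops.</text>"))      # False
```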
Back to our texts
Find your document and click on Edit
XML Mark-up
Mark-up means denoting categories that are not part of the text via textual means, typically in < brackets >
Non-textual elements and information about formatting*:
<photo>
<heading>Wikinews</heading>
*not real examples
Tags and attributes
The material between < > is called an XML tag; a tag pair together with its content forms an XML element
Tags can also have attributes:
<figure description="Photo of York Minster">
</figure>
<hi rend="bold, yellow background">Important!</hi>
Closing tags have no attributes
Tip: never use ‘smart quotes’ with tags!
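One way to avoid quoting mistakes entirely is to build tags programmatically. The snippet below rebuilds the two examples with xml.etree.ElementTree, which handles attribute quoting and escaping itself, so 'smart quotes' can never creep in.

```python
# Build elements with attributes; serialization handles the quoting.
import xml.etree.ElementTree as ET

fig = ET.Element("figure", {"description": "Photo of York Minster"})
hi = ET.Element("hi", {"rend": "bold, yellow background"})
hi.text = "Important!"

print(ET.tostring(fig, encoding="unicode"))
# <figure description="Photo of York Minster" />
print(ET.tostring(hi, encoding="unicode"))
# <hi rend="bold, yellow background">Important!</hi>
```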
Let’s add some attributes!
Go through your sentences – for each sentence add a type attribute
Some possible common values:
decl – declarative
wh – wh question
q – yes/no question, or any other type of question
frag – NP/PP fragment, no VP (The only one., By car!)
sub – ‘subjunctive’, including would (I’d go there), could, and conditional subjunctives (I’d go there if I could), but not declarative conditionals (if you hit someone they hit back) or indicative "you can …"
Let’s add some attributes!
Less common types:
imp – Imperative – Show me! includes negative Don’t!
inf – infinitive, including modifiers (to be or not to be, how to dance)
ger – gerund (Hitting rock bottom.)
intj – interjection-headed (greeting hi, expletive darn, answer particle yeah)
other (e.g. verbless predication: nice, that; coordination of multiple types: find it and I don't care how!; method 2 to the rescue!)
<s type="imp">Don’t go!</s>
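Once every sentence carries a type attribute, tallying the values becomes trivial. The three-sentence document below is invented for illustration.

```python
# Count sentence-type attribute values across all <s> elements.
import xml.etree.ElementTree as ET
from collections import Counter

doc = ET.fromstring(
    '<text><s type="imp">Don\'t go!</s>'
    '<s type="decl">I know, that\'s silly</s>'
    '<s type="decl">This is a sentence.</s></text>'
)
print(Counter(s.get("type") for s in doc.iter("s")))
# Counter({'decl': 2, 'imp': 1})
```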
My text
Example:
Escaping
What happens if your text actually contains angle brackets?
Use XML escapes with &…;
&gt; = > (greater than)
&lt; = < (less than)
What if the text contains an ampersand??
&amp; = &
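Python's standard library implements exactly these three escapes, so you rarely need to write them by hand:

```python
# XML character escaping/unescaping for <, > and & from the standard library.
from xml.sax.saxutils import escape, unescape

print(escape("3 < 5 & 5 > 3"))  # 3 &lt; 5 &amp; 5 &gt; 3
print(unescape("3 &lt; 5"))     # 3 < 5
```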
What else should we encode?
Vast literature on coding various categories
What do you lose by going to plain text?
What would you worry about losing?
Any added value categories you are interested in?
For this exercise we will encode some implicit information:
Paragraphs – already included in data ✓
Sentence boundaries ✓
Orthographic and other errors ([sic!])
Dates and times: interesting for tracking change
Quotations: studying intertextuality, metalanguage
Other important formatting: bold, italics, bullets
For Friday
Finish assigning sentence types
Go through your file and add tags for errors:
<sic> for errors: carries more <sic>wait</sic> in the end
If words are misspelled together, also add a space and <w>: Theywant -> <sic><w>They want</w></sic>
Use the ‘validate’ button to make sure everything checks out
Contact me if you can’t figure out validation error messages (note: the message will usually tell you the line number containing a problem)