Date post: 02-Aug-2020

The <richer> the people, the <bigger> the crates

the more <literal> the <better>

the less <elaborate> you can be the <better>

The <slower> the <better>

the <higher> the temperature the <darker> the malt

The <older> a database is, the <richer> it becomes

The <higher> the clay content, the more <sticky> the soil

The <fresher> the soot, the <longer> it needed

the <higher> the number, the <greater> the protection

The <older> you are the <greater> the chance

The <smaller> the mesh the more <expensive> the net

Linguistic Institute 2017

Corpus Linguistics Annotation

Amir Zeldes, Nathan Schneider

in rented accommodation forever. Hi Bruno. <Long time no speak>. I prefer your new identity to old ones 20 April, 2006 at 1:03 PM Hello <Long time no browse>, they have banned the "google translated" version

Linguistic Institute 2017 - Corpus Linguistics

Some corrections – <s>

All text must be within some <s>…</s>

Wrong: <s>This is a sentence</s>…

Right: <s>This is a sentence…</s>

We mark full utterances, not clauses:

Wrong: <s>I know,</s> <s> that’s silly</s>

Right: <s>I know, that’s silly</s>

Motivation: Maximal predictability!

<mantra>a ‘stupid’ 100% predictable guideline is often better than a clever but inconsistent one</mantra>
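
The first rule above can be checked mechanically. A minimal sketch, assuming the document is well-formed XML wrapped in a single root element (the <text> root and the function name are illustrative, not part of the course tooling); it lists any text that is not inside some <s> element:

```python
import xml.etree.ElementTree as ET

def text_outside_s(xml_string):
    """Return stretches of non-whitespace text that are not inside any <s>."""
    stray = []

    def walk(elem, inside_s):
        inside = inside_s or elem.tag == "s"
        # elem.text is the text directly after the opening tag
        if not inside and elem.text and elem.text.strip():
            stray.append(elem.text.strip())
        for child in elem:
            walk(child, inside)
            # child.tail is the text after the child's closing tag,
            # so it belongs to the current element, not to the child
            if not inside and child.tail and child.tail.strip():
                stray.append(child.tail.strip())

    walk(ET.fromstring(xml_string), False)
    return stray

bad = "<text><s>This is a sentence</s>…</text>"
good = "<text><s>This is a sentence…</s></text>"
print(text_outside_s(bad))   # the stray ellipsis is reported
print(text_outside_s(good))  # nothing is reported
```

A check like this is 'stupid' in the sense of the mantra: it knows nothing about sentences, only whether the guideline was followed.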

Last time

Corpora allow us to:

Find similarities and differences in context (smart=?clever)

Model unpredictable alternations probabilistically (of or ’s?)

Compare types of language (Gulliver’s Travels vs. hotdogs)

Many differences are quantitative:

No categorical phenomenon makes a text “native(-like)”

Working with probabilities means counting

Counting means categorizing = qualitative analysis

Discussion – Fillmore (1992)

Some open questions:

How could grammaticality judgments be untrue?

Why would corpus results be uninteresting?

Is a dichotomy armchair/corpus (still) justified?

Does use of corpora mandate/preclude certain types of theories?

Armchair of Charles Dickens (Wikimedia)

What do I do if I need a corpus?

First of all we need: a research question

The research question will determine the type of corpus you need

A specially tailored corpus created for your purpose? (expensive, potential pitfalls)

An existing corpus created for similar/other purposes? (less effort [but still some], potentially less suitable)

What kind of corpus would you need to study...

The loss of English thou in favor of you?

What learners of English find hard to acquire?

Differences between literary English and spoken varieties?

American/Canadian/British/Australian English? Other World Englishes?

Real spoken language and performed spoken language in media or subtitles?

Differences in political opinions in social media?

Parameters

These questions all refer to different parameters:

Time

Native language

Medium

Difficulty(?)

Politics

These parameters must be varied and coded across the corpus – how do we recognize target varieties?

Coding scheme must be documented, reproducible

Sampling data

A corpus in the philological sense can be exhaustive: no chance of getting more data

The complete works of Shakespeare = the language of Shakespeare[an plays]

Gothic = the Bible of Wulfila (really?)

In the context of linguistics, a corpus is always a sample

Sampling data has been studied in depth in the field of statistics

Representativeness

Some populations are infinite (e.g. all utterances of the English language, past, present and future)

Can’t examine ‘entire’ population – we need a sample

A sample should be representative of the population it represents:

US Americans are: (census data)

33.9% under 25

51.1% female

50% earn $46,326 or less per household …

Therefore a representative sample of American English should have the same distribution among speakers/writers

Representativeness

But what parameters should we check for?

Everything: how tall they are, what they weigh, hair color....

But we can't get a group of 100 people split exactly like the population

And there are ∞ parameters

Are they really all relevant?

Representativeness is always discussed with respect to a subset of parameters and applications (marketing toothpaste)

The residual bias should be eliminated by random sampling – but chance of error is never 0%

Open questions

Are reference corpora really representative?

What are the proportions?

Example: British National Corpus (BNC)

10% spoken data (expensive) : 90% written (cheaper)

Internal divisions, e.g. spoken: demographic part (age, region, sex); context part (educational, business, institutional, leisure, others) – dialogues and monologues

What do the proportions correspond to?

Should a corpus attempt to be proportional?

Variability

How can we ensure our target variety is fully represented?

“Representativeness refers to the extent to which a sample includes the full range of variability in a population.”

Biber (1993), see also Hunston (2008) [recommended readings in Canvas!]

We still need to decide on the population our results should generalize to

Be explicit about our assumptions

Proportional sampling – text types

if we collected all the language produced and received for a week by residents of a city in the U.S., we could identify the actual proportions of language varieties that these people experienced probably something like 80% conversation, 10% television shows, 1% newspapers, 1% novels, 2% meetings, 2% radio broadcasts, 2% texts that they wrote (memos, email messages, letters), and 2% other texts (signs, instructions, specialist written texts, etc.). (Biber & Jones 2009:1288)

Good model of what a person is likely to hear in a day?

But minimal chances to encounter many registers and constructions (scientific English?)

On some level, we want balance because a reference corpus should have enough examples of "everything"
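
To make the consequence of proportional sampling concrete, here is a small sketch that turns the proportions quoted above into word counts. The corpus size of one million words is an assumption chosen for illustration; the shares are the ones quoted from Biber & Jones.

```python
# Shares of language experienced, as quoted from Biber & Jones (2009)
SHARES = {
    "conversation": 80,
    "television shows": 10,
    "newspapers": 1,
    "novels": 1,
    "meetings": 2,
    "radio broadcasts": 2,
    "texts they wrote": 2,
    "other texts": 2,
}

def proportional_allocation(total_words, shares):
    """Split a word budget across registers according to percentage shares."""
    assert sum(shares.values()) == 100
    return {register: total_words * pct // 100 for register, pct in shares.items()}

# Hypothetical corpus size, chosen only for illustration
allocation = proportional_allocation(1_000_000, SHARES)
for register, words in allocation.items():
    print(f"{register:18} {words:>9,}")
```

Under these assumptions newspapers and novels receive 10,000 words each, which leaves very little room for rarer registers such as scientific English.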

Proportional sampling – text types

Most 'general purpose' corpora attempt to include some amount from as many genres as possible

Sampling within each subgroup (stratified)

Really good reference corpora have both written and spoken language (expensive to collect!)

Some corpora repeat existing designs for comparability:

Brown (US) / LOB (UK) – 1960s

Frown / F-LOB – 1990s (same text types)
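
Stratified sampling, as mentioned above, can be sketched as follows. The genre names, pool sizes and the number of texts per genre are hypothetical; the point is only that each subgroup is sampled at random on its own, with a fixed seed so the selection can be reproduced and documented.

```python
import random

def stratified_sample(texts_by_genre, per_genre, seed=0):
    """Draw the same number of texts at random from every genre (stratum)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for genre, texts in sorted(texts_by_genre.items()):
        k = min(per_genre, len(texts))  # small strata contribute all they have
        sample[genre] = rng.sample(texts, k)
    return sample

# Hypothetical pool of text identifiers per genre
pool = {
    "news": [f"news_{i}" for i in range(500)],
    "fiction": [f"fiction_{i}" for i in range(200)],
    "academic": [f"academic_{i}" for i in range(30)],
}
sample = stratified_sample(pool, per_genre=20)
print({genre: len(texts) for genre, texts in sample.items()})
```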

Reception and production

This time written down:

<controversial-statement> We do not necessarily use the same grammar for reception and production </controversial-statement>

Some texts are certainly more received than others:

The Washington Post

Game of Thrones

Some of us produce lots of text that others don't produce, or only receive:

SMS, Twitter, e-mail, …

Lectures, State of the Union, best man speech, …

What is representative English writing?

Variation – Brown corpus (Kučera & Francis 1967)

Nouns with striking frequency difference in Brown (60s)/Frown (90s) “Learned” (J) section:

rank  word       brown j  frown j  LL
2     formula    29       467      -460.56
6     woman      21       215      -182.78
8     model      38       218      -137.35
14    X          14       111      -84.27
15    household  10       100      -84.22
33    signal     8        65       -49.94
35    logic      5        56       -49.26
38    brain      6        56       -45.83
40    network    5        52       -44.48
42    wife       5        51       -43.29
43    conflict   13       71       -43.25
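
The LL column is a log-likelihood score for the frequency difference. Below is a minimal sketch of one common formulation (the G2 statistic for one word's frequency in two corpora). The section sizes are placeholders, since the slide does not give them, so the result is not expected to reproduce the table exactly. Reading a negative sign as "relatively more frequent in the second corpus" is also an assumption about the table's convention.

```python
import math

def log_likelihood(freq1, freq2, size1, size2):
    """Signed G2 score for one word's frequency in two corpora.

    Negative values mean the word is relatively more frequent in corpus 2
    (an assumed sign convention).
    """
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # 0 * log(0) is treated as 0
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    return -g2 if freq1 / size1 < freq2 / size2 else g2

# Placeholder sizes: both sections assumed to have 160,000 tokens
score = log_likelihood(29, 467, 160_000, 160_000)
print(round(score, 2))
```

With equal placeholder sizes the score for "formula" lands near the value in the table, which suggests the two sections are of similar but not identical size.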

Variation – Brown corpus (Kučera & Francis 1967)

Nouns that only occur in Frown J (Freiburg, 90s):

rank  word               freq frown j
11    caption            55
12    DNA                52
14    Gender             50
24    Opiate             37
28    retrieval          33
38    African-Americans  28
39    document           28
40    EC                 28
41    galaxies           28
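
A list like this one can be produced by comparing two frequency lists. The sketch below uses toy token lists in place of the real corpus sections, and the function name is illustrative:

```python
from collections import Counter

def only_in_second(tokens_a, tokens_b, min_freq=1):
    """Words attested in corpus B but never in corpus A, most frequent first."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    unique = {w: n for w, n in counts_b.items()
              if n >= min_freq and counts_a[w] == 0}
    return sorted(unique.items(), key=lambda item: (-item[1], item[0]))

# Toy token lists standing in for the two corpus sections
older = "the model of the signal in the brain".split()
newer = "the DNA model and the DNA retrieval network".split()
print(only_in_second(older, newer))
```

Note that function words such as "and" show up here only because the toy data is tiny; in real corpora a part-of-speech filter (nouns only, as on the slide) would be applied first.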

Reddit – 2013-17 vs. 2005-09

iPad 130:0

Skyrim 141:0

Kickstarter 61:0

Streamer 41:0

so unless we get the streamer to comment on his political views we've got no real data (2014)

It was orchestrated by another streamer who coordinated his followers… (2014; different user)

Reddit – 2013-17 vs. 2005-09

Borat 29:0

Enterprisey 19:0

found a bunch of meaningless enterprisey IT consultant bullshit. (2006)

Britannica 17:0

Pwned 14:0

the first girl who commented would probably have pwned you seriously. (2009)

Interim conclusion

What you put in (co-)determines what you get out

In many cases it makes sense to be probabilistically representative (no need to seek out the weirdest texts)

In some cases you want more texts from rare varieties

Generally want to maximize variability within the relevant domain (e.g. reddit -> random subreddits)

Doing this well entails:

Determining your parameters

Filling up all possible combinations of categories

Being explicit about your criteria (re-sampling same author OK?...)
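
"Filling up all possible combinations of categories" can be made explicit by enumerating the cells of the design. The parameters and values below are hypothetical examples, not a recommended design:

```python
from itertools import product

# Hypothetical parameters and values for a corpus design
PARAMETERS = {
    "period": ["1960s", "1990s"],
    "variety": ["US", "UK"],
    "medium": ["written", "spoken"],
}

def design_cells(parameters):
    """Every combination of parameter values, i.e. every cell to be filled."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]

cells = design_cells(PARAMETERS)
print(len(cells))   # 2 * 2 * 2 combinations
print(cells[0])
```

The number of cells grows multiplicatively with every added parameter, which is one practical reason to settle on the relevant parameters first.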

Preprocessing - a quick primer

Before we can search in corpora, we need to prepare them:

Structural markup

Tokenization

Linguistic annotation

Preprocessing

So far we've talked about corpora as massive collections of text

We've taken for granted that we can search for sensible things like words and sentences

Massive collections of text initially look like this:

..._bye_now_Stan_21_JUNE_2014_at_12:04_PM_Hi_Stan,_I_also_thikn_...

This is not just a practical problem – there are important theoretical issues involved
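
A minimal tokenizer sketch shows where such decisions hide. The pattern below is one possible rule set, chosen for illustration (for example, it keeps a clock time like 12:04 as a single token); it is not the tokenization used in the course.

```python
import re

# One possible tokenization rule: runs of word characters (optionally joined
# by ":" as in clock times), or any single punctuation mark.
TOKEN = re.compile(r"\w+(?::\w+)*|[^\w\s]")

def tokenize(text):
    """Split raw text into tokens; each choice in the pattern is a decision."""
    return TOKEN.findall(text)

raw = "bye now Stan 21 JUNE 2014 at 12:04 PM Hi Stan, I also thikn"
tokens = tokenize(raw)
print(tokens)
```

Whether "12:04" is one token or three, and whether the comma is split off from "Stan", changes every count made on the corpus afterwards.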

Preprocessing as analysis

Whether you do your own preprocessing or use an existing corpus:

Preprocessing is part of the analysis

Every step is a linguistic decision, e.g. normalization:

thikn > think //reason: “I know better”

No ads, images //reason: “not interesting”

Like everything else in methodology – it must be documented

General possibilities:

Manual (accurate?) vs. automatic (predictable?)

Stochastic vs. rule-based vs. hybrid (different types of errors)

More or less linguistic knowledge (opportunities and pitfalls)
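
One way to keep normalization documented is to apply it from an explicit table and log every change. A sketch, assuming a hypothetical table format of original form mapped to normalized form plus reason:

```python
# Hypothetical normalization table: original form -> (normalized form, reason)
NORMALIZATION = {
    "thikn": ("think", "typo"),
}

def normalize(tokens, table):
    """Normalize tokens and keep a log of every change that was made."""
    normalized, log = [], []
    for position, token in enumerate(tokens):
        if token in table:
            new_form, reason = table[token]
            log.append({"position": position, "orig": token,
                        "norm": new_form, "reason": reason})
            normalized.append(new_form)
        else:
            normalized.append(token)
    return normalized, log

normalized, log = normalize(["I", "also", "thikn"], NORMALIZATION)
print(normalized)
print(log)
```

Because the original form is kept in the log, the decision stays reversible and can be reported in the methodology.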

Getting to plain text

In many simple corpus architectures, we want to reduce the data to plain text

No pictures

No formatting

Pros:

Easy to search

Easy to count

Normalization

Automatic processing

Cons:

Loss of information

Decisions are tricky

Some types of study precluded
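
As an illustration of the reduction to plain text, the sketch below strips all markup from a small HTML fragment using Python's standard html.parser module. The fragment and the class name are made up for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data and ignore all tags, including images."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)

page = '<p>Hi <b>Stan</b>, <img src="ad.png">I also thikn</p>'
print(to_plain_text(page))
```

The result is easy to search and count, but the bold formatting and the image are gone, which is exactly the loss of information listed under the cons.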

Back to our texts

If your text has a short_name metadatum and all text is inside <s> sentences, then your data should validate:

If not, there may be problems… We will fix them next!

Validating is important!

Ensures data was annotated uniformly

Helps you to find what you’re looking for later

Back to our texts

Find your document and click on Edit

XML Mark-up

Mark-up means denoting categories that are not part of the text via textual means, typically in < brackets >

Non-textual elements and information about formatting*:

<photo>

<heading>Wikinews</heading>

*not real examples

Tags and attributes

The text between < and > is called an XML tag; an opening tag, its content and the matching closing tag together form an XML element

Tags can also have attributes:

<figure description="Photo of York Minster">

</figure>

<hi rend="bold, yellow background">Important!</hi>

Closing tags have no attributes

Tip: never use ‘smart quotes’ with tags!
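
A short sketch of how a parser sees tags and attributes, using Python's standard xml.etree.ElementTree module. It also shows why the tip about smart quotes matters: curly quotes around an attribute value make the document ill-formed. The helper name is illustrative.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<hi rend="bold, yellow background">Important!</hi>')
print(elem.tag)             # element name
print(elem.attrib["rend"])  # attribute value
print(elem.text)            # text content

def is_well_formed(xml_string):
    """True if the string parses as XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

# Curly 'smart quotes' around an attribute value are not valid XML
print(is_well_formed('<hi rend=“bold”>Important!</hi>'))
```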

Let’s add some attributes!

Go through your sentences – for each sentence add a type attribute

Some possible common values:

decl – declarative

wh – wh question

q – yes/no question, or any other type of question

frag – NP/PP fragment, no VP (The only one., By car!)

sub – ‘subjunctive’, including would (I’d go there), could, and conditional subjunctives (I’d go there if I could), but not declarative conditionals (if you hit someone they hit back) or indicative "you can …"

Let’s add some attributes!

Less common types:

imp – Imperative – Show me! includes negative Don’t!

inf – infinitive, including modifiers (to be or not to be, how to dance)

ger – gerund (Hitting rock bottom.)

intj – interjection headed (greeting hi, expletive darn, answer particle yeah)

other (e.g. verbless predication: nice, that; coordination of multiple types: find it and I don't care how!; method 2 to the rescue!)
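
Once sentences carry a type attribute, the annotation can be counted and checked. A minimal sketch, assuming the document has a single root element (here a hypothetical <text>) and using the type values from the guidelines:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Type values from the annotation guidelines
ALLOWED_TYPES = {"decl", "wh", "q", "frag", "sub",
                 "imp", "inf", "ger", "intj", "other"}

def sentence_types(xml_string):
    """Count type values on <s> elements and collect unknown or missing ones."""
    counts, problems = Counter(), []
    for s in ET.fromstring(xml_string).iter("s"):
        value = s.get("type")
        if value in ALLOWED_TYPES:
            counts[value] += 1
        else:
            problems.append(value)
    return counts, problems

doc = ('<text><s type="imp">Show me!</s><s type="frag">By car!</s>'
       '<s type="imp">Do not go!</s><s type="question">Why?</s><s>Hi.</s></text>')
counts, problems = sentence_types(doc)
print(counts)
print(problems)
```

Unknown values such as "question" and sentences with no type at all end up in the problem list rather than silently in the counts.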

<s type="imp">Don’t go!</s>

My text

Example:

Escaping

What happens if your text actually contains angle brackets?

Use XML escapes with &…;

&gt; = > (greater than)

&lt; = < (less than)

What if the text contains an ampersand??

&amp; = &
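
When preparing text automatically, escaping does not have to be done by hand. A small sketch using the escape and unescape helpers from Python's standard xml.sax.saxutils module (the sample string is made up):

```python
from xml.sax.saxutils import escape, unescape

raw = "if a < b & b > c"
escaped = escape(raw)
print(escaped)
print(unescape(escaped) == raw)
```

The ampersand has to be escaped too, because it is the character that introduces every escape.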

What else should we encode?

Vast literature on coding various categories

What do you lose by going to plain text?

What would you worry about losing?

Any added value categories you are interested in?

For this exercise we will encode some implicit information:

Paragraphs – already included in data ✓

Sentence boundaries ✓

Orthographic and other errors ([sic!])

Dates and times: interesting for tracking change

Quotations: studying intertextuality, metalanguage

Other important formatting: bold, italics, bullets
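
Some of these categories can be pre-annotated automatically. As a sketch for dates, the pattern below covers only the "day month, year" format seen in the sample data, and the <date> tag with a when attribute is a hypothetical encoding chosen for this example, not a requirement of the exercise:

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DATE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r"),? (\d{4})\b")

def tag_dates(text):
    """Wrap dates like '20 April, 2006' in a <date> tag with an ISO value."""
    def wrap(match):
        day, month, year = match.groups()
        iso = f"{year}-{MONTHS.index(month) + 1:02d}-{int(day):02d}"
        return f'<date when="{iso}">{match.group(0)}</date>'
    return DATE.sub(wrap, text)

print(tag_dates("old ones 20 April, 2006 at 1:03 PM Hello"))
```

A normalized value makes dates written in different styles comparable, which is what tracking change over time needs; other date formats would need further patterns or manual annotation.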

For Friday

Finish assigning sentence types

Go through your file and add tags for errors:

<sic> for errors: carries more <sic>wait</sic> in the end

If words are misspelled together, also add a space and <w>: Theywant -> <sic><w>They want</w></sic>

Use the ‘validate’ button to make sure everything checks out

Contact me if you can’t figure out validation error messages (note: the message will usually tell you the line number containing a problem)
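
To see how a line number ends up in an error message, here is a sketch of a plain well-formedness check with Python's standard XML parser. It is not the validator behind the 'validate' button, only an illustration of the same idea; the sample document and function name are made up.

```python
import xml.etree.ElementTree as ET

def validation_error(xml_string):
    """Return None if the XML is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_string)
        return None
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)

doc = "<text>\n<s>carries more <sic>wait</sic> in the end</s>\n<s>Oops</text>"
print(validation_error(doc))
```

The unclosed <s> on the third line is reported with its line number, which is usually the quickest way to locate the problem.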
