Corpus Linguistics L615/415: Annotation - Week...

Date post: 29-Mar-2018
Page 1: Corpus Linguistics L615/415: Annotation - Week 2 · cl.indiana.edu/~obscrivn/docs/AnnotationOverview-w2.pdf · Corpus Linguistics L615/415: Annotation - Week 2 Olga Scrivner ... Morse

Corpora

Annotation

Corpus Linguistics L615/415: Annotation - Week 2

Olga Scrivner

Review: Corpus Characteristics

Prototypical Corpus

1. Machine-readable (Unicode/ASCII) text files

2. Representative

3. Balanced

4. Data from natural communicative settings

http://images.clipartpanda.com/row-of-books-clipart-5384556-pile-of-books--vector-illustration.jpg

Unicode vs ASCII

Character encoding - translating a character to a number

Morse code → character to tone

ASCII - 7-bit encoding and only 128 characters (American English)

Unicode - characters encoded in 8-, 16-, or 32-bit code units (UTF-8, UTF-16, UTF-32)

A bit (binary unit) can hold only one of two values: 0 or 1

Eight bits make a byte
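
The points above can be checked with a short Python sketch. It is only an illustration: the helper name `describe` and the three sample characters are assumptions, not part of the slides. The little-endian codec names are used so that no byte-order mark is counted.

```python
# Compare how a single character is stored under different encodings.
def describe(ch):
    """Return the code point, its binary form, and encoded sizes in bytes."""
    info = {
        "code_point": ord(ch),           # the number the character maps to
        "bits": format(ord(ch), "b"),    # that number written in binary
        "is_ascii": ord(ch) < 128,       # ASCII covers code points 0..127 only
    }
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        info[encoding] = len(ch.encode(encoding))  # size in bytes
    return info

# "A", "e" with acute accent (U+00E9), and the euro sign (U+20AC)
for ch in ("A", "\u00e9", "\u20ac"):
    print(repr(ch), describe(ch))
```

"A" fits in 7 bits and takes one byte in UTF-8, while the two non-ASCII characters need two and three bytes in UTF-8. UTF-32 uses four bytes for every character.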

Unicode

https://www.branah.com/unicode-converter

http://online-toolz.com/tools/hex-binary-convertor.php

What Are the Differences?

Documentary-linguistic corpora

Small corpus with audio/video recordings, designed to provide an overview of an endangered language (unbalanced)

Prototypical corpora

Balanced corpus from natural communicative settings

Experimental corpora

Corpus that violates the natural communicative setting: subjects' behavior is controlled with carefully developed experimental stimuli

Annotation

Process of assigning a label to a tokenized word, identifying the part of speech of the word

Process of marking each word with its base (dictionary) form

Initial segmentation process (words, numbers, punctuation)

Annotation with a phrase-structure representation or dependency-tree representation

Annotation of senses of word forms

Annotation of the set of sounds

Annotation of features such as tone units, pause, stress

Nonverbal annotation
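
The initial segmentation step in the list above (words, numbers, punctuation) can be sketched with a regular expression. This is a minimal illustration, not the tokenizer used in the course: the pattern `TOKEN_RE` and the function `tokenize` are assumptions, and real tokenizers handle many more cases (abbreviations, clitics, URLs).

```python
import re

# Three alternatives, tried in order:
#   numbers such as 3 or 3.50, words with optional internal ' or -,
#   and any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Corpus linguistics is my favorite class!"))
# ['Corpus', 'linguistics', 'is', 'my', 'favorite', 'class', '!']
```

Every later annotation layer (part of speech, lemma, parse) attaches its labels to the tokens produced at this step.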

Lemmatization vs Stemming

Lemma → base form

Stem → truncation

1 worked working works

2 managed manager manageable managing

3 apples apple
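
The contrast can be sketched on the three word groups above. Both parts are toy assumptions made for illustration: `crude_stem` is a simple suffix stripper (not the Porter algorithm), and `TOY_LEMMAS` is a hand-built lookup table (not the WordNet lemmatizer).

```python
# Suffixes are tried in this order; only the first match is removed.
SUFFIXES = ("ing", "ed", "able", "er", "es", "s")

def crude_stem(word):
    """Truncate a word: strip one known suffix, then a trailing 'e'."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

# Hypothetical dictionary of base forms for the slide's words only.
TOY_LEMMAS = {
    "worked": "work", "working": "work", "works": "work",
    "managed": "manage", "managing": "manage",
    "manager": "manager", "manageable": "manageable",
    "apples": "apple", "apple": "apple",
}

def lemma(word):
    """Look up the base (dictionary) form; unknown words are returned as is."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

for w in ("worked", "manager", "manageable", "apples"):
    print(w, "stem:", crude_stem(w), "lemma:", lemma(w))
```

The stemmer collapses "manager" and "manageable" into the non-word "manag" and turns "apples" into "appl", while the lemma lookup always returns a dictionary form and keeps "manager" apart from "manage".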

Stemmer: http://9ol.es/porter_js_demo.html

Lemmatizer: http://textanalysisonline.com/nltk-wordnet-lemmatizer

POS Tagging

Task: POS tag the following sentence:

Corpus linguistics is my favorite class!

http://textanalysisonline.com/nltk-pos-tagging
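
As a rough sketch of what tagged output looks like, the snippet below pairs each token of the task sentence with a Penn Treebank tag. The tags in `TOY_LEXICON` are assigned by hand for this one sentence and the function `toy_tag` is hypothetical. A trained tagger such as the one behind the link above may choose differently (for example, a proper-noun tag for the capitalized first word).

```python
# Hand-assigned Penn Treebank tags (an illustrative lexicon, not tagger output).
TOY_LEXICON = {
    "corpus": "NN",       # noun, singular
    "linguistics": "NN",  # noun, singular (takes the singular verb "is")
    "is": "VBZ",          # verb, 3rd person singular present
    "my": "PRP$",         # possessive pronoun
    "favorite": "JJ",     # adjective
    "class": "NN",        # noun, singular
    "!": ".",             # sentence-final punctuation
}

def toy_tag(tokens, default="NN"):
    """Look each token up in the lexicon; unknown tokens get the default tag."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default)) for tok in tokens]

tokens = ["Corpus", "linguistics", "is", "my", "favorite", "class", "!"]
print(" ".join(f"{word}/{tag}" for word, tag in toy_tag(tokens)))
```

A lookup like this ignores context, which is why real taggers rely on the surrounding words to resolve ambiguous forms.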

POS Tagging - Results

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Parsing

Parse: It is very hot today.

Parsing - Results

Identify: a) phrase-structure parsing and b) dependency parsing
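
To make the two representations concrete, here is a sketch of "It is very hot today." in both forms. The analyses are hand-built assumptions (Penn Treebank style labels for the phrase structure, Universal Dependencies style relations for the dependencies), not the output shown on the original slide, and other analyses are possible.

```python
# a) Phrase structure: nested constituents, (label, child, ...) with (POS, word) leaves.
PHRASE_STRUCTURE = (
    "S",
    ("NP", ("PRP", "It")),
    ("VP",
        ("VBZ", "is"),
        ("ADJP", ("RB", "very"), ("JJ", "hot")),
        ("NP", ("NN", "today"))),
    (".", "."),
)

# b) Dependencies: one (dependent, relation, head) triple per word, no phrase nodes.
DEPENDENCIES = [
    ("It", "nsubj", "hot"),
    ("is", "cop", "hot"),
    ("very", "advmod", "hot"),
    ("hot", "root", "ROOT"),
    ("today", "obl:tmod", "hot"),
    (".", "punct", "hot"),
]

def is_leaf(tree):
    """A leaf is a (POS, word) pair whose single child is a string."""
    return len(tree) == 2 and isinstance(tree[1], str)

def leaves(tree):
    """Collect the words of a phrase-structure tree from left to right."""
    if is_leaf(tree):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]

def bracketed(tree):
    """Render the tree in labeled-bracket notation."""
    if is_leaf(tree):
        return f"({tree[0]} {tree[1]})"
    return f"({tree[0]} " + " ".join(bracketed(c) for c in tree[1:]) + ")"

print(bracketed(PHRASE_STRUCTURE))
for dependent, relation, head in DEPENDENCIES:
    print(f"{relation}({head}, {dependent})")
```

The phrase-structure tree introduces extra nodes (S, NP, VP, ADJP) that group words into constituents. The dependency analysis contains only the words themselves, each linked to exactly one head.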

Semantic Annotation: Word Sense Disambiguation

“drop me a line when you get there”

http://wordnetweb.princeton.edu/perl/webwn

Semantic Annotation

“Semantic annotation is an extremely time- and resource-consuming task”
