
1/25 LELA 30922 Lecture 5 Corpus annotation and SGML See esp. R Garside, G Leech & A McEnery (eds)...

Date post: 20-Dec-2015

1/25

LELA 30922 Lecture 5

Corpus annotation and SGML

See esp.
R Garside, G Leech & A McEnery (eds) Corpus Annotation, London (1997) Longman, ch. 1 “Introduction” by G Leech; something similar available at http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf
CM Sperberg-McQueen and L Burnard (eds) Guidelines for Electronic Text Encoding and Interchange, ch. 2 “A Gentle Introduction to SGML”, available at http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html

2/25

Annotation

• Difference between a corpus and a “mere collection of texts” is mainly due to the value added by annotation

• Includes generic information about the text, usually stored in a “header”

• But more significantly, annotations within the text itself

3/25

Why annotate?

• Adds information
• Reflects some analysis of text
  – Inasmuch as this may reflect commitment to some theoretical approach, this can sometimes be a barrier (but see later)
• Increases usefulness/reusability of text
• Multi-functionality
  – May make corpus usable for something not originally foreseen by its compilers

4/25

Golden rules of annotation

• Recoverability
  – It should always be possible to ignore the annotation and reconstruct the corpus in its raw form
• Extricability
  – Correspondingly, annotations should be easily accessible so they can be stored separately if necessary (“before and after” versions)
• Transparency: documentation
  – Purpose and meaning of annotations
  – How (eg manually or automatically), where and by whom annotations were done
    • If automatic, information about the programs used
  – Quality indication
    • Annotations almost inevitably include some errors or inconsistencies
    • To what extent have annotations been checked?
    • What is the measured accuracy rate, and against what benchmark?
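Recoverability and extricability can be illustrated with a minimal Python sketch. It assumes the simple word_TAG convention used in the LOB example later in these slides; the function name split_annotation is hypothetical and not part of any corpus toolkit.

```python
def split_annotation(tagged_text, sep="_"):
    """Split 'word_TAG' tokens into raw words and a parallel list of tags."""
    words, tags = [], []
    for token in tagged_text.split():
        if sep in token:
            # rpartition splits on the LAST separator, so a word that
            # itself contains the separator is still handled.
            word, _, tag = token.rpartition(sep)
        else:
            # Token carries no annotation at all.
            word, tag = token, None
        words.append(word)
        tags.append(tag)
    return words, tags


tagged = "hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN"
words, tags = split_annotation(tagged)

# Recoverability: the raw text can be reconstructed by ignoring the tags.
raw_text = " ".join(words)

# Extricability: the tags can be stored separately, aligned by position.
print(raw_text)
print(tags)
```

The two lists stay aligned by position, so the "before" and "after" versions can be kept apart and recombined when needed.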

5/25

Theory-neutrality

• Schools of thought
  – Annotations may reflect a particular theoretical approach, and this should be acknowledged
• Consensus
  – corpus annotations which are more (rather than less) theory-neutral will be more widely used
  – given the amount of work involved, it pays to be aware of the descriptive traditions of the relevant field
• Standards
  – There are very few absolute standards, but some schemes can become de facto standards through widespread use
  – For example, BNC designers were aware of the likely side effects of any decisions (regarding annotation) that they took

6/25

Types of annotation

• Plain corpus: it appears in its existing raw state of plain text

• Corpus marked up for formatting attributes e.g. page breaks, paragraphs, font sizes

• Corpus annotated with identifying information, such as title, author, genre, register, edition date

• Corpus annotated with linguistic information

• Corpus annotated with additional interpretive information, eg error analysis in learner corpus

7/25

Levels of linguistic annotation

• Paragraph and sentence-boundary disambiguation
  – The naive rule (full stop + space + capital letter) is unreliable for genuine texts
  – May also involve distinguishing titles/headings from running text
• Tokenization: identification of lexical units
  – multi-word units, cliticised words (eg can’t)
• Lemmatisation: identification of lemmas (or lexemes)
  – Makes variants of lexemes accessible for more generic searches
  – May involve some disambiguation (eg rose: the noun, or the past tense of rise)
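Why the naive sentence-boundary rule is unreliable can be seen in a small Python sketch. The regular expression and the sample sentence are illustrative assumptions, not taken from any particular tokenizer.

```python
import re


def naive_sentence_split(text):
    # Split wherever '.', '!' or '?' is followed by whitespace
    # and then a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


text = "Dr. Smith arrived at 5 p.m. on Monday. He met Mr. Jones."
sentences = naive_sentence_split(text)

for s in sentences:
    print(repr(s))
```

The text contains two sentences, but the rule returns four pieces because it also splits after the abbreviations "Dr." and "Mr.". It happens to get "p.m." right only because the next word is lower case.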

8/25

Levels of linguistic annotation

• POS tagging (grammatical tagging)
  – assigning to each lexical unit a code indicating its part of speech
  – most basic type of linguistic corpus annotation and forms an essential foundation for further forms of analysis
• Parsing (treebanking)
  – Identification of syntactic relationships between words
• Semantic tagging
  – Marking of word senses (sense resolution)
  – Marking of semantic relationships eg agent, patient
  – Marking with semantic categories eg human, animate

9/25

Levels of linguistic annotation

• Discourse annotation
  – especially for transcribed speech
  – Identifying discourse function of text eg apology, greeting
  – or other pragmatic aspects, eg politeness level
• Anaphoric annotation
  – Identification of pronoun reference
  – and other anaphoric links (eg different references to the same entity)
• Phonetic transcription (only in spoken language corpora)
  – Indication of details of pronunciation not otherwise reflected in transcription eg weak forms
  – Explicit indication of accent/dialect features eg vowel qualities, allophonic variation
• Prosodic annotation (only in spoken language corpora)
  – Suprasegmental information, eg stress, intonation, rhythm

10/25

Some examples

PROSODIC ANNOTATION, LONDON-LUND CORPUS:
well ^very nice of you to ((come and)) _spare the !t\/ime and #^come and !t\alk # -^tell me a’bout the - !pr\oblems#And ^incidentally# .^I [@:] ^do ^do t\ell me#^anything you ‘want about the :college in ”!g\eneral

Source: Leech chapter in Garside et al. 1997

11/25

EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH CORPUS:
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]

Source: http://ucrel.lancs.ac.uk/annotation.html
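The labelled brackets in the skeleton parse can be read mechanically with a stack. The Python sketch below assumes only what is visible in the example: "[X" opens a constituent, "X]" closes one, and words carry a "_TAG" suffix. The helper name constituents is hypothetical, and the closing label is not checked against the opening one.

```python
import re

# One token is an opening bracket with optional label, a closing bracket
# with optional label, or a word_TAG item.
TOKEN = re.compile(r"\[[A-Za-z]*|[A-Za-z]*\]|[^\s\[\]]+")


def constituents(parse):
    """Return (label, words) for each bracket pair, innermost first."""
    stack, found = [], []
    for tok in TOKEN.findall(parse):
        if tok.startswith("["):
            stack.append((tok[1:], []))          # open a new constituent
        elif tok.endswith("]"):
            label, words = stack.pop()           # close the innermost one
            found.append((label, words))
            if stack:
                stack[-1][1].extend(words)       # pass words up to the parent
        else:
            stack[-1][1].append(tok.rsplit("_", 1)[0])  # strip the POS tag
    return found


# Fragment copied from the Spoken English Corpus example above.
fragment = ("[P at_II [N his_APP$ new_JJ home_NN1 [P in_II "
            "[N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]")

result = constituents(fragment)
for label, words in result:
    print(label or "(unlabelled)", words)
```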

EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS:
hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_! the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD comparatively_RB little_AP to_TO sing_VB '_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD Rollinson_NP ._.


12/25

ANAPHORIC ANNOTATION OF AP NEWSWIRE

S.1 (0) The state Supreme Court has refused to release {1 [2 Rahway State Prison 2] inmate 1}} (1 James Scott 1) on bail . S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed robbery conviction . S.3 (1 Scott 1) had asked for freedom while <1 he waits for an appeal decision . S.4 Meanwhile , [3 <1 his promoter 3] , {{3 Murad Muhammed 3} , said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] 's nationally televised light heavyweight fight against {5 ranking contender 5}} (5 Yaqui Lopez 5) last Saturday 4) . S.5 (4 The fight , in which [1 Scott 1] won a unanimous decision over (5 Lopez 5) 4) , grossed $135,000 for [6 [3 Muhammed 3] 's firm 6], {{6 Triangle Productions of Newark 6} , <3 he said .

S.1 The state Supreme Court has refused to release Rahway State Prison inmate James Scott on bail. S.2 The fighter is serving 30-40 years for a 1975 armed robbery conviction. S.3 Scott had asked for freedom while he waits for an appeal decision. S.4 Meanwhile, his promoter, Murad Muhammed, said Wednesday he netted only $15,250 for Scott's nationally televised light heavyweight fight against ranking contender Yaqui Lopez last Saturday. S.5 The fight, in which Scott won a unanimous decision over Lopez, grossed $135,000 for Muhammed's firm, Triangle Productions of Newark, he said.

Source: http://ucrel.lancs.ac.uk/annotation.html

13/25

SGML

• Although none of the examples just shown use it, for all but the simplest of mark-up schemes, SGML is widely recommended and used

• SGML = Standard Generalized Markup Language

• Actually suitable for all sorts of things, including web pages (HTML up to version 4 is defined as an SGML application)

14/25

What is a mark-up language?

• Mark-up historically referred to printer’s marks on a manuscript to indicate typesetting requirements.
• Now covers all sorts of codes inserted into electronic texts to govern formatting, printing, or other information.
• Mark-up, or (synonymously) encoding, is defined as any means of making explicit an interpretation of a text.
• By “mark-up language” we mean a set of mark-up conventions used together for encoding texts. Language must specify
  – what mark-up is allowed
  – what mark-up is required
  – how mark-up is to be distinguished from text
  – what the mark-up means
• SGML provides the means for doing the first three
• Separate documentation/software is required for the last
  – eg (1) difference between identifying something as <emph> and how that appears in print; (2) why something may or may not be tagged as a “relative clause”

15/25

Rules of SGML

• SGML allows us to define
  – Elements
  – Specific features of elements
  – Hierarchical/structural relations between elements

• These are specified in a “document type definition” (DTD)

• DTD allows software to be written to
  – Help annotators annotate consistently
  – Explore marked-up documents

16/25

Elements in SGML

• Have a (unique) name
• Semantics of name are application dependent
  – up to designer to choose appropriate name, but nothing automatically follows from the choice of any particular name
• Each element must be explicitly marked or tagged in some way
  – Most usual is with <element> and </element> pairs, called start- and end-tags
  – Much SGML-compliant software seems to allow start-only tags
  – &element; (esp. useful for single words or characters)
  – _tag suffix

17/25

Attributes

• Elements can have named attributes with associated values

• When defined, values can be identified as– #REQUIRED: must be specified

– #IMPLIED: optional

– #CURRENT: inferred to be the same as the last specified value for that attribute

• Values can be from a predefined list, or can be of a general type (string, integer, etc)

18/25

DTD (Document type definition)

• Helps to impose uniformity over the corpus

• Defines the (expected or to-be-imposed) structure of the document

• For each element, defines
  – How it appears (whether end tags are required)
  – What its substructure is, ie what elements, how many of them, whether compulsory or not

19/25

Example of DTD

<!ELEMENT anthology - - (poem+)>
<!ELEMENT poem - - (title?, (stanza+ | couplet+))>
<!ELEMENT title - O (#PCDATA)>
<!ELEMENT stanza - O (line+)>
<!ELEMENT couplet - O (cline, cline)>
<!ELEMENT (line | cline) O O (#PCDATA)>

• The two characters after the element name say whether the start tag and the end tag, respectively, are necessary (-) or optional (O)
• Anthology consists of 1 or more poems
• Poem has an optional title, then 1 or more stanzas or 1 or more couplets
• Title consists of “parsed character data”, ie normal text
• Stanza has one or more lines, couplet has two lines
• Both lines and clines have the same definition: normal text
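The content models in the example DTD behave much like regular expressions over child element names. The Python sketch below makes that analogy concrete. It is an illustration only, not how an SGML parser works internally; the table CONTENT_MODELS and the function children_valid are hypothetical names, and the poem model follows the reading given in the bullets (optional title, then stanzas or couplets).

```python
import re

# Content models from the example DTD, rewritten as regular expressions
# over a sequence of child element names, each followed by one space.
CONTENT_MODELS = {
    "anthology": r"(poem )+",
    "poem": r"(title )?((stanza )+|(couplet )+)",
    "stanza": r"(line )+",
    "couplet": r"cline cline ",
}


def children_valid(parent, children):
    """Check a list of child element names against the parent's model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None


print(children_valid("poem", ["title", "stanza", "stanza"]))  # allowed
print(children_valid("poem", ["stanza", "couplet"]))          # mixes both kinds
print(children_valid("couplet", ["cline"]))                   # needs two clines
```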

20/25

Attributes

• DTD defines the attributes expected/required for each element

• A poem has an id and a status

• Value of id is any identifier, and is optional

• Status is one of three values, default draft

<!ATTLIST poem
  id     ID                            #IMPLIED
  status (draft | revised | published) draft >
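What the ATTLIST declaration implies for software can be sketched in Python: an optional id, and a status restricted to three values with draft supplied when none is given. The dictionary POEM_ATTRIBUTES and the function resolve_attributes are hypothetical names used for illustration, and the uniqueness that the ID type also requires is not checked here.

```python
# Mirror of the ATTLIST for <poem>: "allowed" is None when any value is
# accepted, and "default" is None when the attribute may simply be absent.
POEM_ATTRIBUTES = {
    "id": {"allowed": None, "default": None},  # ID, #IMPLIED
    "status": {"allowed": {"draft", "revised", "published"},
               "default": "draft"},
}


def resolve_attributes(given, spec=POEM_ATTRIBUTES):
    """Validate the given attributes and fill in declared defaults."""
    unknown = set(given) - set(spec)
    if unknown:
        raise ValueError(f"undeclared attribute(s): {sorted(unknown)}")
    resolved = {}
    for name, rule in spec.items():
        value = given.get(name, rule["default"])
        if (value is not None and rule["allowed"] is not None
                and value not in rule["allowed"]):
            raise ValueError(f"{name}={value!r} is not an allowed value")
        resolved[name] = value
    return resolved


print(resolve_attributes({"id": "12", "status": "revised"}))
print(resolve_attributes({"id": "13"}))  # status falls back to the default
```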

21/25

<anthology>
<poem id=12 status=revised>
<title>It’s a grand old team</title>
<stanza>
<line>It’s a grand old team to play for
<line>It’s a grand old team to support
<line>And if you know your history
<line>It’s enough to make your heart go Whoooooah
</stanza>
</poem>
<poem id=13>...</poem>
</anthology>
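A marked-up document like this can be explored by software even with the optional </line> end tags omitted. The Python sketch below uses the standard library's html.parser, which is not an SGML parser and does not read the DTD, so declared defaults such as status=draft are not filled in. The class name LineCollector is hypothetical, and the sample is shortened to two lines of the stanza.

```python
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collect the attributes and <line> contents of each <poem>."""

    def __init__(self):
        super().__init__()
        self.poems = []
        self._in_line = False

    def handle_starttag(self, tag, attrs):
        if tag == "poem":
            self.poems.append({"attrs": dict(attrs), "lines": []})
        # Any new start tag implicitly closes an open <line>.
        self._in_line = (tag == "line" and bool(self.poems))
        if self._in_line:
            self.poems[-1]["lines"].append("")

    def handle_endtag(self, tag):
        # Any end tag (here </stanza>) also closes an open <line>.
        self._in_line = False

    def handle_data(self, data):
        if self._in_line:
            self.poems[-1]["lines"][-1] += data


sgml = ("<anthology><poem id=12 status=revised>"
        "<title>It’s a grand old team</title>"
        "<stanza><line>It’s a grand old team to play for"
        "<line>It’s a grand old team to support</stanza></poem>"
        "<poem id=13>...</poem></anthology>")

parser = LineCollector()
parser.feed(sgml)
parser.close()

for poem in parser.poems:
    print(poem["attrs"], poem["lines"])
```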

22/25

Mark-up exemplified

RAW TEXT:
Two men retained their marbles, and as luck would have it they're both roughie-toughie types as well as military scientists - a cross between Albert Einstein and Action Man!

TOKENIZED TEXT:
<w orth=CAP>Two</w> <w>men</w> <w>retained</w> <w>their</w> <w>marbles</w><c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w>'re</w> <w>both</w> <w>roughie-toughie</w> <w>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>

23/25

LEMMATIZED TEXT:
<w orth=CAP>Two</w> <w lem=man>men</w> <w lem=retain>retained</w> <w>their</w> <w lem=marble>marbles</w><c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w lem=be>'re</w> <w>both</w> <w>roughie-toughie</w> <w lem=type>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w lem=scientist>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>

24/25

POS TAGGED TEXT:
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w><c PUN>,</c> <w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w> <w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w> <w AV0>as</w> <w AV0>well</w> <w CJS>as</w> <w AJ0>military</w> <w NN2>scientists</w> <c PUN>&mdash;</c> <w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <w NP0>Albert</w> <w NP0>Einstein</w> <w CJC>and</w> <w NN1>Action</w> <w NN1-NP0>Man</w><c PUN>!</c>

25/25

POS TAGGED TEXT with idioms and named entities:
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <phrase type=idiom><w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w></phrase><c PUN>,</c> <w CJC>and</w> <phrase type=idiom><w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w></phrase> <w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w> <phrase type=compound pos=CJS><w AV0>as</w> <w AV0>well</w> <w CJS>as</w></phrase> <phrase type=compound pos=NN2><w AJ0>military</w> <w NN2>scientists</w></phrase> <c PUN>&mdash;</c> <w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <phrase type=compound pos=NP0><w NP0>Albert</w> <w NP0>Einstein</w></phrase> <w CJC>and</w> <phrase type=compound pos=NP0><w NN1>Action</w> <w NN1-NP0>Man</w></phrase><c PUN>!</c>
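Once the text is POS tagged in this style, word/tag pairs can be pulled out with very little code. The Python sketch below uses a regular expression rather than an SGML parser. It assumes every <w> element has an explicit end tag and that the POS code is the only bare (name-less) value in the start tag; the function name word_tag_pairs is hypothetical.

```python
import re

# A <w ...> start tag with at least one attribute, the word, and </w>.
W_ELEMENT = re.compile(r"<w\s+([^>]*)>([^<]*)</w>")


def word_tag_pairs(markup):
    """Return (word, POS code) for each <w> element in the markup."""
    pairs = []
    for attrs, word in W_ELEMENT.findall(markup):
        # Named attributes look like lem=man; the POS code has no '='.
        bare = [a for a in attrs.split() if "=" not in a]
        pairs.append((word, bare[0] if bare else None))
    return pairs


sample = ("<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> "
          "<w VVD lem=retain>retained</w> <w DPS>their</w>")

pairs = word_tag_pairs(sample)
print(pairs)

# A generic search becomes simple: all plural common nouns (NN2).
plural_nouns = [w for w, tag in pairs if tag == "NN2"]
print(plural_nouns)
```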

