Prague Dependency Treebank 1.0
CD-ROM PRESENTATION, Dec 18, 2000
Functional Generative Description
A theoretical framework based on the findings of European structural linguistics, especially the classical Prague School, and on the methodological requirements of a formal description.
Levels:
  tectogrammatical (underlying) representations (TRs) with dependency-based syntax
  morphemics
  phonemics and phonetics
TRs: see Sgall, Hajičová and Panevová (1986); formally specified by Petkevič, also in a declarative way.
Dependency tree
My younger brother arrived there yesterday.
Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
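The one-to-one relation between the tree and its linearized form can be illustrated with a minimal sketch. The `Node` class and the `linearize` function are hypothetical helpers, not part of the PDT tools. The sketch assumes that a dependent standing to the left of its head carries its functor after the closing parenthesis, and a dependent to the right carries it after the opening parenthesis, as in the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a dependency tree (hypothetical helper class)."""
    label: str                      # lexical label, optionally with grammatemes
    functor: str = ""               # relation to the head; empty for the root
    left: List["Node"] = field(default_factory=list)   # dependents before the head
    right: List["Node"] = field(default_factory=list)  # dependents after the head


def linearize(node: Node, side: str = "") -> str:
    """Return the bracketed string for the subtree rooted at `node`.

    `side` is "L" or "R" for a dependent left or right of its head,
    and "" for the root, which is written without parentheses.
    """
    parts = [linearize(child, "L") for child in node.left]
    parts.append(node.label)
    parts += [linearize(child, "R") for child in node.right]
    body = " ".join(parts)
    if side == "L":
        return f"({body}){node.functor}"   # functor at the closing parenthesis
    if side == "R":
        return f"({node.functor} {body})"  # functor at the opening parenthesis
    return body


brother = Node("brother", "Act",
               left=[Node("I", "Appurt"), Node("younger", "Rstr")])
tree = Node("arrive.Pret.Indic",
            left=[brother],
            right=[Node("there", "Dir"), Node("yesterday", "Temp")])

# Prints the linearized form given on the slide.
print(linearize(tree))
```

Because the functor always sits at the parenthesis nearer to the head, the string can be read back into a tree without ambiguity.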
Dependency Tree
labels: lexical meanings (abstract symbols) with indices
  functors: subscripts at parentheses, oriented towards the head
  grammatemes: values of morphological categories (Tense, Modality, Number, Definiteness, etc.)
projectivity
valency
  arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
  obligatory and optional with a given head, deletable or not
Dependency Tree
participants (arguments) of verbs
  Actor/Bearer (underlying subject)
  Objective (Patient, underlying direct object)
  Addressee (underlying indirect object)
  Effect ('second' object: to choose so. as sth.)
  Origin (to make sth. out of sth.)
adjuncts
  Locative, several Directional and Temporal modifications
  Condition, Means, Manner, etc.
Dependency Tree
Complementations dependent mainly on nouns
inner participants
  Material (Partitive): two baskets of sth.
  Identity: the river Danube; the notion of operator
free modifications
  Possession (Appurtenance): my table; Jim's brother
  Restrictive: rich man
  Descriptive: the Swedes, who are a Scandinavian nation
Dependency Tree
syntactic grammatemes
  Loc, Dir: in, on, under, between...
  Regard: with, without
operational (testable) criteria for distinguishing arguments from adjuncts, and arguments from each other
  deletability (dialogue test)
Simplified valency frames
read     V  Act Addr Obj
change   V  Act Obj Orig Eff
give     V  Act Addr Obj
brother  N  Appurt
man      N
glass    N  Material
full     A  Material
(obligatory complementations were shown in blue on the original slide)
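As a rough illustration, the simplified frames can be stored as a lookup table keyed by lemma and word class. The names `FRAMES` and `frame_coverage` are hypothetical. Because the colour marking of obligatory complementations is not recoverable from the text, this sketch only reports which frame slots a given node fills and which stay open, and makes no claim about obligatoriness.

```python
# Simplified valency frames, copied from the slide (lemma, word class) -> functors.
FRAMES = {
    ("read", "V"): ["Act", "Addr", "Obj"],
    ("change", "V"): ["Act", "Obj", "Orig", "Eff"],
    ("give", "V"): ["Act", "Addr", "Obj"],
    ("brother", "N"): ["Appurt"],
    ("man", "N"): [],
    ("glass", "N"): ["Material"],
    ("full", "A"): ["Material"],
}


def frame_coverage(lemma, word_class, functors):
    """Split the frame of (lemma, word_class) into filled and open slots.

    `functors` lists the functors of the dependents actually present.
    Functors outside the frame (free modifications such as Temp) are ignored.
    """
    frame = FRAMES[(lemma, word_class)]
    filled = [f for f in frame if f in functors]
    open_slots = [f for f in frame if f not in functors]
    return filled, open_slots


# "give" with an Actor, an Objective and a temporal adjunct:
print(frame_coverage("give", "V", ["Act", "Obj", "Temp"]))
```

An open slot is a candidate for a deleted item to be restored in the tectogrammatical tree. Whether it must be restored depends on the obligatoriness information that this sketch leaves out.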
Topic-focus articulation
contextual boundness
  main verb: CB/NB (T/F); dependents to the left/right
communicative dynamism
  left-right (mother, sisters, transitive) partial ordering
underlying word order
  left-right linear ordering
The left-to-right order of nodes, together with the index T or (prototypically) F, indicates the TFA of the sentence (of the TR).
[Figure: dependency tree; surviving labels: young, there, T]
Topic-focus articulation
TFA - one of the basic aspects of underlying structures
[Figure: dependency tree; surviving labels: young, there (T), yesterday (F)]
Complex sentence
a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause
My brother, whom you know, arrived there yesterday.
Complex sentence
function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words - prepositions and articles to nouns, conjunctions and auxiliaries to verbs
Martin came there late, since he had to accompany his sick mother.
Complex sentence
Martin arrived late to the session, since he had to accompany his sick mother.
schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.
dot - close connection of morphemes ('semes')
deleted items restored
order of items: difference between 'underlying' and surface (morphemic) word order
transductive components: Panevová, Oliva, Borota
coordination (multidimensional)
  Jim and Mary, who have two children, went to Boston.
  the linearized notation is adequate:
  ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)
structures close to Boolean, i.e. no complex 'innate properties' specific to natural language are needed.
Prague Dependency Treebank - corpus annotation
an intermediate level: 'analytical' representations
  dependency trees, not always projective
  nodes for all word tokens, even for punctuation marks
tectogrammatical tree:
  coordinating conjunction as the head
Morphological Layer
ACKNOWLEDGEMENTS
ANNOTATED CORPORA
PDT version 1.0, 2000 (1996 - 2000)
Penn Treebank, release 3, 1999 (1989 - 1999)
TAG SETs
Czech - ambiguous inflective language
  nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, …
  novější, novějšího, novějšímu, novějším, …
  nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, …
English - language with poor inflection
  work, works, worked, working
TEXT SOURCES
PDT:
  Lidové noviny
  Mladá Fronta Dnes
  Vesmír
  Českomoravský Profit
  ...taken from Czech National Corpus
Penn Treebank:
  '88, '89 WSJ articles
  Air Travel Information System transcripts
  Brown Corpus
  Switchboard transcripts
ANNOTATION STRATEGY - Penn Treebank
TEXT
Ken Church's stochastic tagger, Eric Brill's transformation tagger
corrections by annotator (GNU Emacs Lisp based package)
ANNOTATION STRATEGY - PDT
Automatic Morphological Analyzer (AMA)
two independent annotators; Linux, Win tools
differences resolved by third annotator
comparison with the current AMA; manual resolution; Win tools
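The step in which differences between the two independent annotators go to a third annotator can be sketched as a position-by-position comparison of the two tag sequences. The function name `disagreements` and the example tags are illustrative assumptions; the actual Linux and Windows tools used in the project are not described here.

```python
def disagreements(tags_a, tags_b):
    """Return (position, tag_a, tag_b) for every token the annotators tagged differently.

    Both sequences must cover the same tokens in the same order.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations cover a different number of tokens")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(tags_a, tags_b))
            if a != b]


# Illustrative input: annotator B chose nominative instead of accusative
# for the third token.
annotator_a = ["NNIS1-----A----", "RR--4----------", "NNIS4-----A----"]
annotator_b = ["NNIS1-----A----", "RR--4----------", "NNIS1-----A----"]

for position, tag_a, tag_b in disagreements(annotator_a, annotator_b):
    print(position, tag_a, tag_b)  # items left for the third annotator
```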
INTERNAL FORMAT
PDT: SGML coding, csts DTD
Penn Treebank: word/tag(|tag)*
SAMPLES
<s id="ln95040:020-p1s1"><f>Pokus<l>pokus<t>NNIS1-----A----<f>o<l>o<t>RR--4----------<f>zázrak<l>zázrak<t>NNIS4-----A----<d>.<l>.<t>Z:-------------
The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
CONVERSION
SGML coding -> word/tag, word/lemma/tag
scripts: pdt2wsj.pl, pdt2wsjFLT.pl
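The idea of the conversion can be sketched in a few lines. This is not the code of pdt2wsj.pl or pdt2wsjFLT.pl, whose options and exact output are not described here. It assumes the simplified markup of the PDT sample, in which each token is an `<f>` (word form) or `<d>` (punctuation) element followed by `<l>` (lemma) and `<t>` (tag). The names `parse_tokens`, `to_word_tag` and `to_word_lemma_tag` are hypothetical.

```python
import re

# The PDT sample sentence in SGML coding.
SAMPLE = ('<s id="ln95040:020-p1s1"><f>Pokus<l>pokus<t>NNIS1-----A----'
          '<f>o<l>o<t>RR--4----------<f>zázrak<l>zázrak<t>NNIS4-----A----'
          '<d>.<l>.<t>Z:-------------')

# One token: <f> or <d> with the form, then <l> with the lemma, then <t> with the tag.
TOKEN = re.compile(r"<[fd]>([^<]*)<l>([^<]*)<t>([^<]*)")


def parse_tokens(sgml):
    """Return a list of (form, lemma, tag) triples found in one sentence."""
    return [match.groups() for match in TOKEN.finditer(sgml)]


def to_word_tag(tokens):
    """Render the tokens in the word/tag format."""
    return " ".join(f"{form}/{tag}" for form, _lemma, tag in tokens)


def to_word_lemma_tag(tokens):
    """Render the tokens in the word/lemma/tag format."""
    return " ".join(f"{form}/{lemma}/{tag}" for form, lemma, tag in tokens)


tokens = parse_tokens(SAMPLE)
print(to_word_tag(tokens))
print(to_word_lemma_tag(tokens))
```

Real csts files carry more elements and attributes than this sample, so a production converter would need an SGML-aware parser instead of a single regular expression.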
DATA SIZE
                          # word tokens   # sentences
PDT 1.0                   1 730K          112K
Penn Treebank release 3   4 600K          350K
DATA SETs of MORPHOLOGICALLY ANNOTATED DATA
                                   # tokens / sentences
for tagging only
  training data                    1 470K / 95K
  development test data            130K / 8K
  evaluation test data             127K / 8K
for parsing (preprocessing step)
  training data                    475K / 29K
  development test data            130K / 8K
  evaluation test data             127K / 8K
TOOLS
Automatic Morphological Analyser/Generator of Czech
  HMAnalyze.pl, HMGenerate.pl
  Dictionary: CZE_a
  Remote Access
Czech Taggers
  HMM
  Exponential
Analytical Layer in PDT
Introduction
Input: morphologically tagged sentences
Graph Editor: “user-friendly” software
Output: ATS structure
  "surface" syntax tree structure
  nodes labelled by the analytical functions
Two stages (chronologically)
(A) manual "analytic" annotation (ATS); training data for (B)(a)
(B) (a) semiautomatic procedure (Collins' parser)
    (b) manual correction of (B)(a)
Constraints and limitations
any string has a node of its own
  word-form, punctuation mark, etc.
  AuxV, AuxP, AuxC, AuxX, AuxG…
reflecting the coordination and apposition relations
  the so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
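A small sketch shows how a label of the form X_Co, X_Ap or X_Pa can be split into the analytic function X and the suffix that encodes the "third dimension". The function `split_afun` is a hypothetical helper. Reading `Co` as coordination and `Ap` as apposition follows the slide; reading `Pa` as parenthesis is an assumption based on common PDT usage.

```python
# Suffixes that mark membership in a non-dependency relation.
# The gloss for "Pa" is an assumption, not stated on the slide.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(afun):
    """Split an analytic function label into (base function, suffix or None)."""
    base, separator, suffix = afun.rpartition("_")
    if separator and suffix in SUFFIXES:
        return base, suffix
    # No recognised suffix: the whole label is the analytic function.
    return afun, None


for label in ["Sb_Co", "Obj_Ap", "Adv_Pa", "AuxV", "Ex_D"]:
    print(label, split_afun(label))
```

Checking the suffix against a closed set keeps labels that merely contain an underscore, such as Ex_D, intact.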
Constraints and limitations
no missing nodes (on the surface) can be added
  the analytic function Ex_D is used
relation between the semi-automatic and the manual procedure
  80% of edges are established correctly automatically
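A figure such as "80% of edges established correctly" can be read as the share of nodes whose automatically assigned head matches the head in the manually corrected tree. The sketch below computes that share under this reading, which is an assumption since the slide does not define the measure. The function `edge_accuracy` and the toy data are hypothetical.

```python
def edge_accuracy(gold_heads, parsed_heads):
    """Fraction of nodes whose parsed head equals the manually corrected head.

    Each list gives, for node i, the index of its head (0 for the root).
    """
    if len(gold_heads) != len(parsed_heads):
        raise ValueError("trees have a different number of nodes")
    if not gold_heads:
        raise ValueError("empty tree")
    correct = sum(g == p for g, p in zip(gold_heads, parsed_heads))
    return correct / len(gold_heads)


# Toy five-node sentence: the parser attached the last node to the wrong head.
gold = [2, 0, 2, 3, 4]
parsed = [2, 0, 2, 3, 3]
print(edge_accuracy(gold, parsed))  # 4 of 5 edges agree
```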
Project organization
team consisting of 5-6 annotators
handbook for ATS structure annotation
1999: 100000 sentences on ATS
tectogrammatical annotation follows
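[Figure: analytical tree of the sentence below; surviving analytic function labels: Adv, AuxT]
První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
(The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.)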
From the Analytical towards the Tectogrammatical layer
Introduction
ATS annotation
  nodes: word forms, punctuation, graphical symbols
  edges: surface relations
TGTS annotation
  nodes: autosemantic words, deletions
  edges: deep layer functions
Annotation process
Input Czech sentence
-> Tokenization
-> Morphological tagging and lexical disambiguation
-> Syntactic parsing and analytic function assignment
-> ATS (PDT 1.0)
-> Tree structure pruning
-> Attribute assignments
-> TGTS
Transition procedure
  deterministic procedure operating on trees
  macro language for Graph Editor (C++ like)
  automatic changes & tools for annotators
Requirements
  new attributes for tectogrammatical layer
  ATS is recoverable from TGTS
  automatized to a maximally high degree
New attributes
trlemma: lemma of the original node or lemma composed of joined nodes
morphological grammatemes
  gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality
position of the node
functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent
Tree Structure Pruning
U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.
For those who actually start at zero, the tax outcome for the state is not substantial.
Tree Structure Pruning
[Figure: tree for the same sentence, with a node labelled REG]
Verbal Nodes
… podnikatelé by měli mít daně …
… entrepreneurs should have (their) taxes …
PRED
verbmod=CDN
deontmod=HRT
Attribute Assignments
prepositions stored as fw attribute
quoted words
  clause in quotes -> DSP
  one pair of quotes in the sentence -> DSPP
  string in quotes -> QUOT
gender, number, tense, degcmp, aspect
default values
Macros for Annotators
keyboard shortcuts (in Graph Editor)
structure changes
  hide/recover nodes
  merge nodes
  add new nodes
functor assignments
Manual annotation
structure checking
functors
deletions of obligatory modifications
feedback for formulating the handbook for annotators
Tectogrammatical Layer
Jirka se včera opil do němoty a Honza dneska.
(word for word: George himself yesterday drank to silence and Honza today.)
Attributes of Coreferential relations
only in MC
attribute   values
coref       the lemma of the antecedent
corsnt      NIL - in the same sentence
            PREV1 ... PREVi - position of the sentence which includes the antecedent
grammatical coreference
antec       the functor of the antecedent
Example
Honza slíbil přijít včas.
Honza promised to come in time.
coref: Honza
corsnt: NIL
cornum: 1
antec: ACT
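The coreference attributes of the example can be held in a small record, and `corsnt` can be turned into a sentence index. The class `Coref` and the function `antecedent_sentence` are hypothetical. The sketch assumes that PREVi means "i sentences before the current one", which is one reading of "position of the sentence which includes the antecedent". The meaning of `cornum` is not spelled out here, so the value is stored without interpretation.

```python
import re
from dataclasses import dataclass


@dataclass
class Coref:
    """Coreference attributes of one node (hypothetical container)."""
    coref: str    # lemma of the antecedent
    corsnt: str   # "NIL" or "PREV1" ... "PREVi"
    cornum: int   # stored as given; its meaning is not defined in this sketch
    antec: str    # functor of the antecedent


def antecedent_sentence(current_index, corsnt):
    """Index of the sentence holding the antecedent.

    Assumption: PREVi points i sentences back from the current sentence.
    """
    if corsnt == "NIL":
        return current_index          # antecedent is in the same sentence
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    return current_index - int(match.group(1))


# Attributes of the example "Honza slíbil přijít včas."
example = Coref(coref="Honza", corsnt="NIL", cornum=1, antec="ACT")
print(antecedent_sentence(7, example.corsnt))
```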