Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft [CLSfS]
Treebanks – An Overview
Erhard W. Hinrichs and Sandra Kübler
SfS-CL
Eberhard-Karls-Universität Tübingen
ESSLLI-05 – p.1
What are Treebanks
Treebanks provide annotations of natural language corpora at various levels of structure:
  the word level (part-of-speech, and in some cases inflectional morphology),
  the phrase level,
  the sentence level,
  in some cases also grammatical functions (e.g., subject, object, indirect object, adjunct)
Uses of Treebanks
training material for data-driven approaches to natural language processing (e.g., statistical parsing)
gold standard for evaluation of NLP tools
grammar extraction
data-driven linguistic research
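To make the "gold standard" use concrete, here is a minimal sketch of labeled bracket scoring in the style of PARSEVAL-type metrics. It assumes each tree has already been reduced to a list of (label, start, end) spans; the function name `bracket_scores` and the toy spans are illustrative assumptions, not part of the slides or of any particular evaluation tool.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[str, int, int]  # (constituent label, start token index, end token index)

def bracket_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float, float]:
    """Return labeled bracket precision, recall and F1 for one sentence."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a predicted bracket counts as correct only
    # as often as the same bracket occurs in the gold annotation.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical spans for a five-token sentence (illustrative data only).
gold_spans = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
parser_spans = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall, f1 = bracket_scores(gold_spans, parser_spans)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Three of the five predicted brackets also occur in the gold annotation, so precision is 3/5 and recall is 3/4. Real evaluation tools usually add conventions this sketch omits, such as ignoring punctuation.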
Penn WSJ Treebank – Example
( (S (NP-SBJ (NP Pierre Vinken)
             ,
             (ADJP (NP 61 years)
                   old)
             ,)
     (VP will
         (VP join
             (NP the board)
             (PP-CLR as
                     (NP a nonexecutive director))
             (NP-TMP Nov. 29)))
     .))
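The bracketing above can be read mechanically. The following is a minimal sketch of a reader for this notation, written from scratch rather than taken from any treebank toolkit; the helper names (`tokenize`, `parse`, `leaves`, `productions`) are assumptions made for illustration. It follows the slide's simplified tree, in which words appear without part-of-speech nodes, and it also reads off the local rewrite rules as a small instance of grammar extraction.

```python
from typing import List, Tuple

SENTENCE = """
( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,)
     (VP will (VP join (NP the board)
                       (PP-CLR as (NP a nonexecutive director))
                       (NP-TMP Nov. 29)))
     .))
"""

def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens: List[str], pos: int = 0) -> Tuple[tuple, int]:
    """Parse one constituent starting at tokens[pos]; return (tree, next position).

    A tree is a pair (label, children); each child is a tree or a word string.
    The outermost wrapper bracket has an empty label.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    label = ""
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1

def leaves(tree: tuple) -> List[str]:
    """Collect the words of the tree in left-to-right order."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def productions(tree: tuple) -> List[Tuple[str, tuple]]:
    """List one (lhs, rhs) rule per node, top-down."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    rules = [(label, rhs)]
    for child in children:
        if isinstance(child, tuple):
            rules.extend(productions(child))
    return rules

tree, _ = parse(tokenize(SENTENCE))
print(" ".join(leaves(tree)))
for lhs, rhs in productions(tree):
    print(lhs or "ROOT", "->", " ".join(rhs))
```

Counting such rules over a whole treebank is one simple way to obtain a grammar from it. The sketch does no error recovery, so unbalanced brackets will raise an exception.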
Treebanks for English
Penn Treebank
The Penn-Helsinki Parsed Corpus of Middle English
Susanne Corpus and Christine Project
International Corpus of English (ICE)
Lancaster Treebank
The Redwoods HPSG Treebank
Treebanks Projects
Basque
  Eus3LB project
Bulgarian
  HPSG-based Syntactic Treebank of Bulgarian (BulTreeBank)
Catalan
  CAT3LB project
Chinese
  The Chinese Treebank Project
Czech
  Prague Dependency Treebank
Treebanks Projects
Danish
  Danish Dependency Treebank
Dutch
  The Alpino Treebank
French
  Project TALANA
German
  NeGra Project - NeGra Corpus
  Project TIGER
  Verbmobil Treebank of Spoken German (TüBa-D/S)
  The Tübingen Treebank of Written German (TüBa-D/Z)
Characteristics of Spontaneous Speech
Fragmentary Utterances
Repetitions
False starts
Speech errors (with correction)
Interruptions
Parentheticals
Discourse markers
Hesitation noises
Annotation Principles
Longest Match Principle
  as many daughter nodes as possible are combined into a single mother node, provided that the resulting construction is syntactically as well as semantically well-formed.
Speech errors, repetitions, corrections, and hesitations are structured as much as possible, but are not typically connected to surrounding constituents as a whole.
Interruptions
[Tree diagram: sieben/CARD Uhr/NN fünf/CARD am/APPRART; phrase nodes: ADJX, NX; edge labels: HD, -]
Parentheticals
[Tree diagram: da/ADV können/VMFIN wir/PPER uns/PRF auf/APPR das/ART Hotel/NN ,/$, glaube/VVFIN ich/PPER ,/$, einigen/VVINF ./$. ("then we can, I believe, agree on the hotel."); two SIMPX nodes, one with the fields LK and MF, the other with the fields VF, LK, MF and VC; phrase nodes: ADVX, VXFIN, VXINF, NX, PX; edge labels: HD, MOD, ON, OA, OV, FOPP, -]
Treebanks Projects
Italian
  Turin University Treebank (TUT)
Portuguese
  The Floresta Sintá(c)tica project
Slovene
  Slovene Dependency Treebank
Spanish
  UAM Treebank of Spanish
Swedish
  Swedish Treebank
Turkish
  METU treebank
Treebank Management
Treebanking is extremely labor-intensive (i.e. costly).
Good planning is therefore necessary.
Good tools are crucial.
For annotation, I recommend the tool Annotate.
A detailed stylebook is essential.
Every time you hire a well-trained linguist, your treebank will get better.
The Annotation Scheme
Should the annotation scheme be dependent on a particular theory?
Theory-neutrality is a fiction. Every annotation scheme is at least implicitly theory-dependent.
Grounding an annotation scheme in a linguistic theory tends to improve consistency of annotations.
Theory-dependent Treebanks
Prague Dependency Treebank
  based on Dependency Grammar
The Redwoods HPSG Treebank
  based on Head-Driven Phrase Structure Grammar
CCGbank
  translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations
Theory-neutral Treebanks
do not adhere to any particular linguistic theory
encode those grammatical properties that are distinguished by many, if not all, grammatical frameworks
advantage: more widely usable and less dependent on whatever version of a particular grammatical theory existed at the time when the treebank annotation scheme was determined.
Examples: Penn Treebank, NeGra treebank, Tübingen treebanks.