Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft [CL SfS]

Treebanks – An Overview

Erhard W. Hinrichs and Sandra Kübler

SfS-CL

Eberhard-Karls-Universität Tübingen

ESSLLI-05 – p.1

What are Treebanks

Treebanks provide annotations of natural language corpora at various levels of structure:

the word level (part-of-speech, and in some cases inflectional morphology)

the phrase level,

the sentence level,

in some cases also grammatical functions (e.g., subject, object, indirect object, adjunct)
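
The annotation levels listed above can be pictured as a small data structure. The following is only a minimal sketch, not the storage format of any particular treebank: the class names `Token` and `Phrase` and the helper `split_label` are hypothetical, the label `NP-SBJ` is taken from the Penn WSJ example in these slides, and the `NNP` tags are assumed from the standard Penn tag set.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Token:
    """Word level: surface form, part-of-speech tag, optional morphology."""
    word: str
    pos: str
    morph: Optional[str] = None


@dataclass
class Phrase:
    """Phrase or sentence level: category plus optional grammatical function."""
    category: str
    function: Optional[str] = None
    children: List[Union["Phrase", Token]] = field(default_factory=list)


def split_label(label):
    """Split a label such as 'NP-SBJ' into category and function.

    Simplification: only the first hyphen is treated as a separator.
    """
    category, _, function = label.partition("-")
    return category, (function or None)


def words(node):
    """Collect the surface words below a node, left to right."""
    if isinstance(node, Token):
        return [node.word]
    collected = []
    for child in node.children:
        collected.extend(words(child))
    return collected


# Hypothetical illustration: the subject noun phrase "Pierre Vinken".
subject = Phrase(*split_label("NP-SBJ"),
                 children=[Token("Pierre", "NNP"), Token("Vinken", "NNP")])
subject_words = words(subject)
```

Keeping the grammatical function in its own field makes it possible to query for all subjects regardless of their phrase category.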

Uses of Treebanks

training material for data-driven approaches to natural language processing (e.g. statistical parsing)

gold standard for evaluation of NLP tools

grammar extraction

data-driven linguistic research
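
As an illustration of grammar extraction, the sketch below reads context-free productions off a single tree and estimates rule probabilities by relative frequency. It is one simple way to do this, not the procedure of any specific parser. The tuple encoding `(label, children)` and the function names are assumptions made for this example; the tree is the verb phrase from the Penn WSJ example in these slides.

```python
from collections import Counter

# A tree is (label, children); leaves are plain strings.
# This is the VP of the Penn WSJ example sentence, encoded by hand.
TREE = ("VP", ["will",
               ("VP", ["join",
                       ("NP", ["the", "board"]),
                       ("PP-CLR", ["as",
                                   ("NP", ["a", "nonexecutive", "director"])]),
                       ("NP-TMP", ["Nov.", "29"])])])


def productions(tree):
    """Yield one (lhs, rhs) rule per node, visiting the tree top-down."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


# Count every rule, then count how often each left-hand side occurs.
rule_counts = Counter(productions(TREE))
lhs_counts = Counter()
for (lhs, _), count in rule_counts.items():
    lhs_counts[lhs] += count

# Relative-frequency estimate: count(lhs -> rhs) / count(lhs).
rule_probs = {rule: count / lhs_counts[rule[0]]
              for rule, count in rule_counts.items()}

for (lhs, rhs), prob in sorted(rule_probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{prob:.2f}]")
```

With a real treebank the same counts would be accumulated over all trees; a single tree only shows the mechanics.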

Penn WSJ Treebank – Example

( (S (NP-SBJ (NP Pierre Vinken)
             ,
             (ADJP (NP 61 years)
                   old)
             ,)
     (VP will
         (VP join
             (NP the board)
             (PP-CLR as
                     (NP a nonexecutive director))
             (NP-TMP Nov. 29)))
     .))
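
The bracketed notation above can be read into a nested structure with a few lines of code. The following is a minimal sketch in plain Python rather than the reader of any particular toolkit; the function names `tokenize`, `parse`, and `leaves` are hypothetical, and the input is the skeleton bracketing shown on the slide (without part-of-speech tags).

```python
TEXT = ("( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,) "
        "(VP will (VP join (NP the board) "
        "(PP-CLR as (NP a nonexecutive director)) "
        "(NP-TMP Nov. 29))) .))")


def tokenize(text):
    """Pad brackets with spaces so split() separates them from labels."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one bracketed constituent into (label, children)."""
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = ""  # the outermost Penn bracket carries no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word or punctuation mark
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the terminal strings of a tree from left to right."""
    _, children = tree
    result = []
    for child in children:
        if isinstance(child, tuple):
            result.extend(leaves(child))
        else:
            result.append(child)
    return result


tree = parse(tokenize(TEXT))
sentence = tree[1][0]        # the S node inside the unlabeled outer bracket
terminals = leaves(tree)
```

The recursion mirrors the bracket nesting directly: every opening bracket starts a new constituent, and every closing bracket returns it to its mother node.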

Treebanks for English

Penn Treebank

The Penn-Helsinki Parsed Corpus of Middle English

Susanne Corpus and Christine Project

International Corpus of English (ICE)

Lancaster Treebank

The Redwoods HPSG Treebank

Treebank Projects

Basque

Eus3LB project

Bulgarian

HPSG-based Syntactic Treebank of Bulgarian (BulTreeBank)

Catalan

CAT3LB project

Chinese

The Chinese Treebank Project

Czech

Prague Dependency Treebank

Treebank Projects

Danish

Danish Dependency Treebank

Dutch

The Alpino Treebank

French

Project TALANA

German

NeGra Project - NeGra Corpus

Project TIGER

Verbmobil Treebank of Spoken German (TüBa-D/S)

The Tübingen Treebank of Written German (TüBa-D/Z)

Characteristics of Spontaneous Speech

Fragmentary Utterances

Repetitions

False starts

Speech errors (with correction)

Interruptions

Parentheticals

Discourse markers

Hesitation noises

Annotation Principles

Longest Match Principle: as many daughter nodes as possible are combined into a single mother node, provided that the resulting construction is syntactically as well as semantically well-formed.

Speech errors, repetitions, corrections, and hesitations are structured as much as possible, but are not typically connected to surrounding constituents as a whole.

Interruptions

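[Figure: tree diagram for the interrupted fragment "sieben Uhr fünf am" (roughly "seven oh five on the"). Tokens and tags: sieben/CARD, Uhr/NN, fünf/CARD, am/APPRART. Phrase nodes: ADJX, NX. Edge labels: HD, -.]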

Parentheticals

[Figure: tree diagram for the sentence "da können wir uns auf das Hotel, glaube ich, einigen." (roughly "there we can, I think, agree on the hotel."). Tokens and tags: da/ADV, können/VMFIN, wir/PPER, uns/PRF, auf/APPR, das/ART, Hotel/NN, ,/$, glaube/VVFIN, ich/PPER, ,/$, einigen/VVINF, ./$. Phrase nodes: ADVX, VXFIN, VXINF, NX, PX. Edge labels: HD, MOD, ON, OA, OV, FOPP, -. Topological fields: VF, LK, MF, VC. Clause nodes: SIMPX (twice, one for the main clause and one for the parenthetical).]

Treebank Projects

Italian

Turin University Treebank TUT

Portuguese

The Floresta Sintá(c)tica project

Slovene

Slovene Dependency Treebank

Spanish

UAM Treebank of Spanish

Swedish

Swedish Treebank

Turkish

METU treebank

Treebank Management

Treebanking is extremely labor-intensive (i.e. costly).

Good planning is therefore necessary.

Good tools are crucial.

For annotation, I recommend the tool Annotate.

A detailed stylebook is essential.

Every time you hire a well-trained linguist, your treebank will get better.

The Annotation Scheme

Should the annotation scheme be dependent on a particular theory?

Theory-neutrality is a fiction. Every annotation scheme is at least implicitly theory-dependent.

Grounding an annotation scheme in a linguistic theory tends to improve consistency of annotations.

Theory-dependent Treebanks

Prague Dependency Treebank

based on Dependency Grammar

The Redwoods HPSG Treebank

based on Head-Driven Phrase Structure Grammar

CCGbank

translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations
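
To make the contrast with phrase-structure annotation concrete, the sketch below shows the basic shape of a dependency analysis: every word points to exactly one head, and the head pointers must form a tree. This is a generic illustration only. The head choices and the relation labels (`nmod`, `subj`, `aux`, `obj`, `det`, `root`) are hypothetical and do not reproduce the Prague Dependency Treebank scheme; the function name `is_tree` is likewise an assumption.

```python
# Each token is (form, head_index, relation). Indices are 1-based;
# head_index 0 stands for an artificial root above the sentence.
SENTENCE = [
    ("Pierre", 2, "nmod"),   # hypothetical labels, for illustration only
    ("Vinken", 4, "subj"),
    ("will",   4, "aux"),
    ("join",   0, "root"),
    ("the",    6, "det"),
    ("board",  4, "obj"),
]


def is_tree(tokens):
    """Check that the head pointers form a single rooted tree."""
    n = len(tokens)
    heads = [head for _, head, _ in tokens]
    if any(h < 0 or h > n for h in heads):
        return False            # a head index points outside the sentence
    if heads.count(0) != 1:
        return False            # exactly one word may attach to the root
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:        # walk upwards until the root is reached
            if node in seen:
                return False    # revisiting a word means there is a cycle
            seen.add(node)
            node = heads[node - 1]
    return True


# The same words with "Pierre" and "Vinken" heading each other: not a tree.
CYCLIC = [("Pierre", 2, "nmod"), ("Vinken", 1, "subj")] + SENTENCE[2:]

valid = is_tree(SENTENCE)
invalid = is_tree(CYCLIC)
```

A well-formedness check of this kind is a typical consistency test during annotation, independent of which dependency theory defines the labels.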

Theory-neutral Treebanks

do not adhere to any particular linguistic theory

encode those grammatical properties that are distinguished by many, if not all, grammatical frameworks

advantage: more widely usable and less dependent on whatever version of a particular grammatical theory may have existed at the time when the treebank annotation scheme was determined

Examples: Penn Treebank, NeGra treebank, Tübingen treebanks.
