A probabilistic approach to language structure

Annarita Felici and Paul Pal
Royal Holloway, University of London

Helsinki 2-4 June 2008

2-4 June 2008 QITL3

Outline

Field of investigation
Research goals
Data
Probabilistic analysis
Information Theory
Entropy results

Field of investigation

Repetitive language structure in multilingual legal text

EU normative statements in translation

Languages of investigation: English, French, German and Italian

Field of investigation: legal norms

Deontic norms (from the Greek deon = duty): obligations, prohibitions, permissions and authorizations

Constitutive performatives: the uttering of a performative is, or is part of, the doing of a certain kind of action or speech act (Austin 1962)

Uttering a sentence = doing things

Other norm types

Logical necessity: necessary requirements or competences

Non-binding norms: guidelines, correct procedure

Research goals

1. To evaluate the degree of prescriptive standardization in French, German and Italian with reference to English

2. To predict translation equivalents in French, German and Italian

Under the conditions that:

English legal drafting is highly standardized

The EU and the main English drafting manuals recommend modal verbs for prescriptive norms (Coode 1843, Driedger 1976, Dickerson 1975, Thornton 1996)

Text types under investigation are repetitive and reusable

Text types under investigation can be more or less binding

Data

Multilingual parallel corpus

Origin: EU
Corpus size: 1,404,723 words
Text type: normative
Type of docs: Secondary Legislation (Regulations, Decisions, Directives, Recommendations)
Years: 2001-04
Languages: English, French, German, Italian

Probabilistic Analysis

Information Theory: to measure the number of linguistic alternatives when translating a repetitive normative statement from English into French, German and Italian

= Quantifying information by reducing uncertainty

more alternatives = more uncertainty (high entropy)
fewer alternatives = more standardization, certainty (low entropy)

Probabilistic Variables

Categories of expression
Linguistic forms

English modals (entry point for parallel retrieval): shall, must, may, can, should

Categories of expression

Constitutive norms and performatives

Logical necessity

Permissions and authorizations

Capability

Non-binding norms

Linguistic forms

Indicative (pres.)
Modal verbs (mv)
Verbal periphrasis (vp)
Lexicalized modal expressions (le)
Ellipses (0-correspondence)

Linguistic forms

Linguistic equivalents used in constitutive and performative norms (Regulations). English: shall, 1382 occurrences.

| Form | Italian raw no. | % | French raw no. | % | German raw no. | % |
|---|---|---|---|---|---|---|
| Indicative - Others | 1192 | 86.3 | 1223 | 88.5 | 1125 | 81.4 |
| MV | 81 | 5.86 | 58 | 4.2 | 94 | 6.8 |
| Verbal periphrasis | 6 | 0.43 | 8 | 0.58 | 71 | 5.14 |
| Modal expressions | 51 | 3.69 | 43 | 3.11 | 46 | 3.33 |
| Ellipsis | 52 | 3.76 | 50 | 3.62 | 46 | 3.33 |
| TOT | 1382 | | 1382 | | 1382 | |

Equivalents found for each form:

| Form | Italian | French | German |
|---|---|---|---|
| MV | dovere (58); potere (23), 9 neg. | devoir (32); pouvoir (26), 14 neg. | muessen (34); sollen (3); duerfen (33), 9 neg.; koennen (24) |
| Verbal periphrasis | va + past part. (1); essere tenuto (5) | être à (1); être tenu (7) | sein…zu (61); haben…zu (10) |
| Modal expressions | vietare (1); essere obbligatorio (15); soggetto a obbligo (1); avere il potere (1); avere il diritto (12); consentire, autorizzare, …; occorre (3); spettare (2) | interdire (1); être obligatoire (15); soumis à obligation (1); avoir le droit (11); il importe (1); autoriser, octroyer, …; il importe (1) | untersagen (1); verbindlich sein (15); verpflichten (8); das Recht/Anspruch haben (10); befuegt sein (2); bewilligt/zugelassen/erlaubt sein; gewaehrt sein (3) |

Linguistic forms

Linguistic equivalents used to convey permissions and authorizations (Regulations). English MV: may, 294 occurrences.

| Form | Italian raw no. | % | French raw no. | % | German raw no. | % |
|---|---|---|---|---|---|---|
| Indicative - Others | 34 | 11.56 | 36 | 12.24 | 31 | 10.54 |
| MV | 246 | 83.67 | 250 | 85.03 | 247 | 84.01 |
| Verbal periphrasis | 0 | 0 | 0 | 0 | 1 | 0.34 |
| Oth. modal express. | 6 | 2.041 | 2 | 0.68 | 7 | 2.381 |
| Ellipsis | 8 | 2.721 | 6 | 2.041 | 8 | 2.721 |
| TOT | 294 | | 294 | | 294 | |

Equivalents found for each form:

| Form | Italian | French | German |
|---|---|---|---|
| MV | potere (45); dovere (1) neg. | pouvoir (249); devoir (1) neg. | koennen (218); duerfen (27); sollen (1); moegen (1) |
| Verbal periphrasis | | | sind…zu (1) |
| Oth. modal express. | avere facoltà (3); essere consentito/ammesso; essere abilitato (1) | permettre (1); être habilité (1) | zulaessig sein (4); berechtigt/… |

Given the English system of modality, what is the relative probability of choosing an equivalent modal verb in the translation of may or must, and a different linguistic form as the equivalent of shall?

Is the probability of a choice in a system affected by a choice in another?

Information Theory

The information value or content h(p) depends on the probability of occurrence (p) of an event (Shannon 1949)

h(p) = - log (p) = log (1/p)

Entropy: degree of uncertainty (= shortage of information due to the large number of alternatives)
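
A minimal sketch of the information content h(p) defined above. The slide does not state the base of the logarithm; base 2 (bits) is assumed here because the entropy formula later in the talk uses log2. The function name `information_content` is illustrative only, and the two probabilities are the rounded French values for constitutive norms from the frequency table later in this talk.

```python
import math


def information_content(p: float, base: float = 2.0) -> float:
    """Return h(p) = -log(p) = log(1/p) for an event of probability p.

    Assumption: base 2 by default, so the result is expressed in bits.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log(1.0 / p, base)


# Rounded probabilities of two French forms for constitutive norms.
common = information_content(0.58)   # present indicative (frequent form)
rare = information_content(0.003)    # verbal periphrasis (rare form)

print(f"h(0.58)  = {common:.3f} bits")
print(f"h(0.003) = {rare:.3f} bits")
```

The rarer a form, the more information its occurrence carries, which is why a system with many comparably likely alternatives ends up with high entropy.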

Probabilistic analysis

The frequency of occurrence (ni) of each linguistic form is associated with a category

A probability variable (pi) is derived from the estimated proportion of a particular linguistic form

Probabilistic analysis

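In English:

$$p_1 = p_{mv \to \text{shall}} = \frac{n_{\text{shall}}}{n}; \quad p_2 = p_{mv \to \text{must}} = \frac{n_{\text{must}}}{n}; \quad p_3 = p_{mv \to \text{should}} = \frac{n_{\text{should}}}{n}; \quad p_4 = p_{mv \to \text{can}} = \frac{n_{\text{can}}}{n}; \quad p_5 = p_{mv \to \text{may}} = \frac{n_{\text{may}}}{n}$$

In French, German and Italian:

$$p_1 = p_{\text{indicative}} + p_{mv} + p_{vp} + p_{me} + p_{\text{ellipses}}; \quad p_2 = p_{\text{indicative}} + p_{mv} + p_{vp} + p_{me} + p_{\text{ellipses}}$$

and so on.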
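
As an illustration of how a probability variable is estimated from raw frequencies, the sketch below divides each count n_i by a total n. The helper `proportions` and its optional `n` argument are hypothetical names. By default it normalises within the occurrences of one English modal, which reproduces the % column of the table for may; the slide's p values divide by the overall number of modal occurrences instead, which can be passed explicitly as `n`.

```python
from typing import Dict, Optional


def proportions(counts: Dict[str, int], n: Optional[int] = None) -> Dict[str, float]:
    """Estimate p_i = n_i / n from raw frequencies n_i.

    Assumption: if n is not given, the sum of the counts is used,
    i.e. proportions are taken within a single category.
    """
    total = sum(counts.values()) if n is None else n
    if total <= 0:
        raise ValueError("total number of observations must be positive")
    return {form: n_i / total for form, n_i in counts.items()}


# French equivalents of English "may" in the Regulations (raw counts from the table).
french_may = {"indicative": 36, "mv": 250, "vp": 0, "me": 2, "ellipsis": 6}
p_french = proportions(french_may)

for form, p in p_french.items():
    print(f"{form:<10} {100 * p:6.2f} %")
```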

Linguistic forms and frequencies of occurrences in the EU Regulation for the selected categories of 1) constitutive norms and 2) permissions and authorization

| Category | Form | English | French | German | Italian |
|---|---|---|---|---|---|
| a) Constitutive norms and performatives | Pres. Ind. | | 0.58 | 0.53 | 0.56 |
| | mv | 0.655 | 0.03 | 0.04 | 0.04 |
| | vp | | 0.003 | 0.02 | 0.002 |
| | me | | 0.02 | 0.02 | 0.02 |
| | ellipses | | 0.02 | 0.02 | 0.02 |
| c) Permission and authorization | Pres. Ind. | | 0.017 | 0.014 | 0.016 |
| | mv | 0.14 | 0.118 | 0.117 | 0.116 |
| | vp | | 0 | 0.0004 | 0 |
| | me | | 0.0009 | 0.002 | 0.0028 |
| | ellipses | | 0.0028 | 0.004 | 0.004 |

Probabilistic approach

The sum of these probabilities produces different information values

The expected information content of a system is the sum of the information contents weighted by the probabilities for each possible outcome:

$$H = -\sum_{i=1}^{5} p_i \log_2 (p_i)$$
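
The weighted sum above can be sketched in a few lines. The function name `entropy` is illustrative. It does not check that the probabilities sum to 1, because the per-category values reported in this talk are fractions of the whole modal system. Terms with p = 0 are skipped, following the usual convention that 0 log 0 = 0. The input is the rounded French row for constitutive norms from the frequency table.

```python
import math
from typing import Iterable


def entropy(probabilities: Iterable[float]) -> float:
    """Sum of -p * log2(p) over the given probabilities, in bits.

    Zero probabilities contribute nothing (limit of p * log p as p -> 0).
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Rounded French probabilities for constitutive norms:
# pres. ind., mv, vp, me, ellipses.
french_constitutive = [0.58, 0.03, 0.003, 0.02, 0.02]
h_french = entropy(french_constitutive)

print(f"H = {h_french:.3f} bits")
```

With these rounded inputs the result lands close to the 0.857 reported for French on the comparison slide; small differences are to be expected from rounding.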

Entropy: extrema

Variations in the language-specific p(i) values of linguistic forms produce distribution profiles reflecting the characteristics of the corresponding language.

Mathematically it can be shown that

If all the p(i) values are equal (equi-probable situation), the profile is a uniform distribution and results in maximum entropy.

If one probability p(i) takes its maximum value of 1 and the remaining p(i) values are zero, the entropy is minimum (e.g. English).

All other distributions lie between these two limits (e.g. French, German and Italian)
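
For the five linguistic forms considered in this talk, the two limits can be written out explicitly. This is the standard result for a distribution over five outcomes, using the convention that 0 log 0 = 0:

```latex
\begin{aligned}
\text{uniform: } p_i = \tfrac{1}{5} \ \text{for all } i
  \;&\Rightarrow\; H_{\max} = -\sum_{i=1}^{5} \tfrac{1}{5} \log_2 \tfrac{1}{5} = \log_2 5 \approx 2.32 \\
\text{single form: } p_k = 1,\ p_{i \neq k} = 0
  \;&\Rightarrow\; H_{\min} = -1 \cdot \log_2 1 = 0
\end{aligned}
```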

A concrete example

Regulation document in English, French, German and Italian + a fictitious language.

One category of expression: e.g. the constitutive norms.

5 linguistic forms for this category.

Total number of modal verbs and alternatives: 2075.

Constitutive norm

| Form | English | French | German | Italian | Fictitious |
|---|---|---|---|---|---|
| mv | 1382 | 58 | 94 | 81 | 276 |
| ind | 0 | 1223 | 1125 | 1192 | 276 |
| vp | 0 | 8 | 71 | 6 | 276 |
| me | 0 | 43 | 46 | 51 | 276 |
| el | 0 | 50 | 46 | 52 | 276 |

Frequency of occurrences of expression modes in 4 real languages and one fictitious language
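
The frequencies above can be turned into entropies with a short script. This is a sketch under one stated assumption: probabilities are normalised within the category (each count divided by its column total), so that they sum to 1 for every language. The names `COUNTS`, `category_entropy` and `ENTROPIES` are illustrative.

```python
import math
from typing import Dict, List

# Row order of the frequency table: mv, ind, vp, me, el.
COUNTS: Dict[str, List[int]] = {
    "English":    [1382, 0, 0, 0, 0],
    "French":     [58, 1223, 8, 43, 50],
    "German":     [94, 1125, 71, 46, 46],
    "Italian":    [81, 1192, 6, 51, 52],
    "Fictitious": [276, 276, 276, 276, 276],
}


def category_entropy(counts: List[int]) -> float:
    """Entropy in bits of the forms within one category.

    Assumption: p_i = n_i / (column total); zero counts are skipped.
    """
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


ENTROPIES = {lang: category_entropy(c) for lang, c in COUNTS.items()}

# Print the languages from most to least standardized.
for lang, h in sorted(ENTROPIES.items(), key=lambda kv: kv[1]):
    print(f"{lang:<10} {h:.3f}")
```

Under this normalisation English, with a single form, sits at the minimum of 0 and the fictitious language at the maximum of log2(5), about 2.32, with French, Italian and German in between in that order. The absolute values for the real languages will not match the figures quoted in the talk exactly, since those appear to be computed from probabilities taken over the whole modal system rather than within the category.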

Histogram of 5 modes of expression

[Figure: histogram plot of frequencies (axis 0 to 1600) for English, French, German, Italian and Fictitious, with one series per mode of expression: pi, mv, vp, me, el]

Comparison based on Entropy

Computed Entropy of Constitutive norm

EN H = 0 + Hmv + 0 + 0 + 0 = 0.405
FR H = Hind + Hmv + Hvp + Hme + Hel = 0.857
DE H = Hind + Hmv + Hvp + Hme + Hel = 1.08
IT H = Hind + Hmv + Hvp + Hme + Hel = 0.88
FI H = Hind + Hmv + Hvp + Hme + Hel = 2.32

Computed Entropy of constitutive norms (English, French, German, Italian and Fictitious)

[Figure: bar chart "Entropy of real and fictitious languages (constitutive norm)", axis 0 to 2.5, with one bar each for English, French, German, Italian and Fictitious]

Entropy results

1. In the EU Regulation according to the 5 categories of expression (1. Constitutive and performative norms, 2. Logical necessity, 3. Permissions and authorizations, 4. Capability, 5. Non-binding norms)

2. In the EU Secondary Legislation overall according to the 4 types of documents (Regulations, Decisions, Directives, Recommendations)

Entropy in the EU Regulation

[Figure: bar chart of entropy (axis 0 to 1.2) by category of expression (const./perf. norms, logical necessity, perm./author., capability, non-binding norms), with one series per language: English, Italian, French, German]

Entropy results
EU Regulation

Logical necessity, permissions and authorizations, and capability (lower entropy)

Quite standardized in the 4 languages = almost equivalent translations

Constitutive and performative norms (higher entropy)

Translation is more difficult to predict
Definitions, constitutive statements, obligations
FR: lower entropy than IT
DE: higher entropy (VP sein/haben…zu)

Entropy results
EU Regulation

Non-binding norms

A fair amount of variation among the 4 languages
FR/IT: higher entropy
DE: lower entropy (should is most likely translated with sollen: Soll-Vorschriften)

Entropy overall in the 4 EU document types

[Figure: bar chart of entropy (axis 0 to 3.5) for English, Italian, French and German, with one series per document type: Decisions, Directives, Regulations, Recommendations]

Entropy results
EU Secondary Legislation

Regulations and Decisions (lower entropy)

Direct applicability of the norms = more precision and standardization
FR looks more standardized than IT and DE

Directives (higher entropy than Regulations and Decisions)

Binding only as to the result to be achieved

Recommendations (higher entropy)

Non-binding: more freedom
DE: sollen

Conclusions

Given certain conditions, it is possible to predict with some certainty the occurrence of a particular factor

If applied to repetitive texts, entropy analysis can enhance research in language testing, evaluation and the development of automated translation tools

References

Austin, J. L. 1962. How to do things with words. Oxford: Oxford University Press.

Coode, G. 1843. Legislative Expressions. Appendix to the Report of the Poor Law Commissioners on Local Taxation. Published separately 1845, 2nd Ed. 1852.

Driedger, E. A. 1976. The Composition of legislation. Legislative forms and precedents (2nd Ed.). Ottawa: The Department of Justice.

Shannon, C. and W. Weaver. 1963 (1949). The mathematical theory of communication. Urbana: University of Illinois Press.

Thornton, G. C. 1996. Legislative Drafting (4th Ed.). London: Butterworths.

http://publications.europa.eu/code/en/en-6000000.htm

