
Frequency, Chunks & Hesitations: An Empirical Analysis of Bybee’s Exemplar Model

Dr. Ulrike Schneider


Linguistisches Kolloquium Mainz, 19 January 2015


Hesitations

• uh I don’t agree

• and uh fortunately we agreed

• and when they say you know [pause] buy one get one free it’s hard to resist

• we have a tremendous amount of um sunny days



Hesitation Placement



Proposed Explanations


§ Intonation Units
• Filled and unfilled pauses are preferentially placed after the first word in a phonemic clause (Boomer 1965)
• Filled pauses are most likely to occur at intonation unit boundaries (Clark & Fox Tree 2002)

§ Constituents
• Hesitations are preferably placed at constituent boundaries (e.g. Maclay & Osgood 1959; Clark & Clark 1977; Swerts 1998; Biber et al. 1999)
• Major planning points (Clark & Clark 1977):
a. Grammatical junctures [clause boundaries]
b. Other constituent boundaries
c. Before the 1st content word within a constituent



New Ideas


§ Lounsbury (1954):
Hypothesis 1: Hesitation pauses correspond to the points of highest statistical uncertainty in the sequencing of units of any given order.
Hypothesis 2: [These points] correspond to the beginning of units of encoding.

§ Goldman-Eisler (1968):
Speech is often extremely complex but still fluent.
The conception of ready-made sentence schemata, models of sentences or modules implies that they are selected in one piece so to speak, that they are not constructed from individual lexical elements – and this would account for the fluency of speakers irrespective of their complexity, in the same way as efficiency in mass production is a matter of use of prefabricated units



Usage-Based Models



Usage-Based Theories

“Prefabricated Units” ... are the basic units of grammar

§ No separation between grammar and the lexicon
§ Both concrete units (like words) and abstract units (like constructions) are stored in the mental lexicon
§ Even compositional units can be stored in the lexicon
§ Grammatical structure emerges because speakers combine several ready-made units
§ What is mentally stored and how strongly it is represented is determined by the individual speaker’s experience (i.e. usage)



“Prefabricated Units”


§ Constructions
abstract
e.g. ‘time’-away construction (Jackendoff 1997):
• twistin’ the night away
• danced the night away
• while the day away
• VERB the TIME away

§ Chunks
concrete
sequences of units, often words



Chunks


[Image; source: http://www.madeleineshaw.com.au/]

Example chunks: of course; how are you?; on the other hand; sorry to keep you waiting


Chunking



1. What Causes Chunking?


§ Frequency of use – frequently used sequences = chunks (e.g. Bybee 2002, 2010)
Examples: it’s, that’s, don’t
? and I, in the

§ Likelihood of co-occurrence – determined e.g. by means of transitional probabilities or the MI score (e.g. Gries 2008; Hilpert 2013; Wiechmann 2008)
Examples: can’t, willing to, wind up, oh dear, I suppose
? aesthetically pleasing, collapsible sailboat, juvenile delinquents



2. Is Chunking Abrupt or Gradual?


§ Threshold Approach
We have two distinct classes of multi-word sequences: chunks and non-chunks. Any sequence which the mind deems sufficiently frequent is stored as a chunk (e.g. Pawley & Syder 1983; Erman & Warren 2000).

§ Continuous Approach
Chunking is a gradual phenomenon. There aren’t chunks and non-chunks, but more or less chunky sequences (e.g. Langacker 1987; Bybee 2002, 2010; Arnon & Snider 2010).



3. How are Chunks Stored?


§ Holistic Storage
Chunks receive a separate entry in the mental lexicon. Chunking strength (should the model require it) is reflected by stronger representations (e.g. Arnon & Snider 2010).

§ Network
Chunks are not stored holistically, but instead as connections between the representations of their components. Chunking strength is reflected by stronger connections (e.g. McClelland & Rumelhart 1981).



Bybee’s (2010) Model



1. What Causes Chunking?

Bybee’s Exemplar Model

§ Co-occurrence Frequency
§ Some combinations that receive a strong chunkiness rating based on co-occurrence frequency actually receive a low rating based on probabilistic measures of co-occurrence (e.g. in the).
§ The formulae to calculate probabilistic measures (such as transitional probabilities) always contain co-occurrence frequency. This means that the other factors in the formula “devalue” the frequency rating.
§ Bybee argues that this does not happen in the mind: The mind does not “devalue” frequent combinations.
§ See Bybee’s (2010) discussion of Collostruction Analysis (Stefanowitsch & Gries 2003).
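
The point about co-occurrence frequency being "contained" in the probabilistic measures can be made concrete with the usual textbook definitions. The slides do not state which exact variants were used, so the formulas below are an assumption: f(·) is a corpus frequency, N the corpus size in tokens, and the labels TPD and TPB are borrowed from the predictor names that appear later in the talk.

```latex
\begin{align*}
\mathrm{TPD}(w_1 \rightarrow w_2) &= \frac{f(w_1 w_2)}{f(w_1)} \\[4pt]
\mathrm{TPB}(w_1 \leftarrow w_2)  &= \frac{f(w_1 w_2)}{f(w_2)} \\[4pt]
\mathrm{MI}(w_1, w_2)             &= \log_2 \frac{N \cdot f(w_1 w_2)}{f(w_1)\, f(w_2)}
\end{align*}
```

In each case the bigram frequency f(w1 w2) sits in the numerator and is divided by the frequency of one or both component words. A very frequent bigram made of very frequent words (such as in the) can therefore end up with a modest score, which is the "devaluing" effect described above.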



2. Is Chunking Abrupt or Gradual?


§ Continuous Approach
Chunking is a gradual phenomenon. From the first encounter, sequences of any length are mentally stored.
The more often a sequence is used, the chunkier it becomes.



3. How are Chunks Stored?


§ Holistic Storage
Holistic storage at the first encounter
“Certainly words that have never been experienced together do not constitute a chunk, but otherwise there is a continuum from words that have been experienced together only once and fairly recently, which will constitute a weak chunk whose internal parts are stronger than the whole, to more frequent chunks such as lend a hand and pick and choose which are easily accessible as wholes while still maintaining connections to their parts.” (Bybee 2010)

§ Network
“[I]tems that are used together frequently will form tighter bonds than items that occur together less often” (Bybee 2007)



Chunking and Constituents


§ Chunks are not units of planning that speakers can revert to in addition to constituents.

§ Chunks do not even result from constituents, but:
§ “Sequentiality is more basic than hierarchy” (Bybee 2010)
• Chunks can be combined
• Smaller chunks can occur within larger ones
• From these combinatorial possibilities and the varying chunking strengths within a string thus created emerges the hierarchical structure of language
• Concrete surface sequences are primary.
• Abstract hierarchical phrase structure is derived.

§ Not all of the abstractions that linguists have made (e.g. certain phrase boundaries) should be rethought based on the frequency data we now have.



Strong Chunks


§ frequent sequences: string frequency, not frequency of the individual components
§ strongly represented: strong, easily accessible holistic representation
§ unit-like appearance in speech (+ writing): fluent pronunciation, phonetic reduction, uninterrupted
§ form the basis of constituents


Hypotheses

1. Co-occurrence frequency should be a better predictor of hesitation placement than transitional probabilities and similar probabilistic measures.

2. The frequency of a sequence and its chance of being interrupted by a hesitation should be inversely related.

3. Co-occurrence frequency should be a better predictor of hesitation placement than phrase structure.


Chunking in the PP



Contexts

§ Prepositional phrases

1. Prep N: about baseball
2. Prep Det N: of the cowboys
3. Prep N N: of Princess Di
4. Prep Det N N: through a fax machine
5. Prep Adj N: with stiff penalties
6. Prep Det Adj N: in a nice neighbourhood


Data

§ SWITCHBOARD NXT
§ Telephone conversations between strangers (1990/91)
§ Spoken American English
§ 830,000 words
§ annotated: Part-of-Speech, phrases etc.
§ time-aligned


Hesitations

§ Unfilled pauses (0.2 - 1 sec.)
§ Filled pauses (uh, um)
§ Discourse markers (well, like, you know, I mean)


Hesitations

§ Prepositional phrases
§ n = 4,724 data points

1. Prep N: about baseball (n = 1,231)
2. Prep Det N: of the cowboys (n = 1,440)
3. Prep N N: of Princess Di (n = 346)
4. Prep Det N N: through a fax machine (n = 218)
5. Prep Adj N: with stiff penalties (n = 254)
6. Prep Det Adj N: in a nice neighbourhood (n = 575)


Possible Positions

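Example: and in the movie

Position 1: hesitation before the preposition (and uh in the movie)
Position 2: hesitation before the determiner (and in uh the movie)
Position 3: hesitation before the noun (and in the uh movie)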


[Figure 4.1: Distribution of hesitations across prepositional phrase types. White bars indicate unfilled pauses, ruled bars indicate filled pauses and grey bars indicate discourse markers.
Panels: Prep N; Prep Det N; Prep N N; Prep Det N N; Prep Adj N; Prep Det Adj N.
x-axis: hesitation position (before Prep, before Det, before Adj, before N, or before N1 and before N2, depending on the phrase type); y-axis: Total Amount of Hesitations.
Source page header: Hesitation Placement in Prepositional Phrases, p. 96.]

Chunking ‘Grain Size’


§ Bigram: 2 consecutive words, though not across sentence boundaries

§ Word: Word form + POS-Tag, separated by spaces from other word forms
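
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the extraction procedure used in the study: the function name `bigrams_within_sentences`, the toy sentences and the Penn-style tags are all assumptions made for the example.

```python
from collections import Counter

def bigrams_within_sentences(sentences):
    """Yield pairs of adjacent tokens, never pairing tokens across a sentence boundary.

    Each token is a (word form, POS tag) tuple, so the same form with a
    different tag counts as a different "word".
    """
    for sentence in sentences:
        for left, right in zip(sentence, sentence[1:]):
            yield (left, right)

# Hypothetical toy data: two sentences of (form, tag) tokens.
sentences = [
    [("in", "IN"), ("the", "DT"), ("movie", "NN")],
    [("about", "IN"), ("baseball", "NN")],
]

bigram_counts = Counter(bigrams_within_sentences(sentences))
```

Because bigrams are built sentence by sentence, the pair movie + about is never counted even though the two tokens are adjacent in the token stream.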



Predictors

§ Bigram frequencies
§ Direct transitional probability
§ Backwards transitional probability
§ Mutual Information Score (MI)
§ Lexical Gravity G
§ Word frequencies
§ Hesitation type
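
To show how several of these predictors can be derived from the same counts, here is a minimal Python sketch. It assumes the common textbook definitions (bigram frequency divided by the frequency of the first or second word, and pointwise MI with the token count as corpus size); the slides do not state the exact formulas used. Lexical Gravity G and hesitation type are left out, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def association_scores(sentences):
    """Compute bigram frequency, direct/backwards TP and MI for every bigram."""
    word_freq = Counter()
    bigram_freq = Counter()
    for sentence in sentences:
        word_freq.update(sentence)
        # Bigrams are formed within a sentence only.
        bigram_freq.update(zip(sentence, sentence[1:]))
    n_tokens = sum(word_freq.values())

    scores = {}
    for (w1, w2), f12 in bigram_freq.items():
        scores[(w1, w2)] = {
            "bi_freq": f12,
            "tpd": f12 / word_freq[w1],  # direct: P(w2 | w1)
            "tpb": f12 / word_freq[w2],  # backwards: P(w1 | w2)
            "mi": math.log2(f12 * n_tokens / (word_freq[w1] * word_freq[w2])),
        }
    return scores

# Hypothetical toy corpus (already split into sentences and words).
toy = [
    ["in", "the", "movie"],
    ["in", "the", "house"],
    ["of", "the", "movie"],
    ["in", "a", "movie"],
]
scores = association_scores(toy)
```

In this toy corpus in the is the most frequent bigram, yet its MI score is lower than that of the house, because both in and the are frequent on their own. That is the kind of disagreement between raw frequency and probabilistic measures that the comparison of predictors is meant to probe.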



Lounsbury’s Hypothesis




§ Lounsbury (1954):
Hypothesis 1: Hesitation pauses correspond to the points of highest statistical uncertainty in the sequencing of units of any given order.



Distribution of Lowest TPD (direct transitional probability) by position

Phrase Type      Pos. 1   Pos. 2   Pos. 3   Pos. 4   % at Lowest   p
Prep N           180      1,050                      59.9%         p<.001
Prep Det N       195      102      1,140             32.4%         -
Prep N N         23       519      4                 58.4%         p<.001
Prep Det N N     26       9        433      18       39.5%         p<.001
Prep Adj N       21       382      27                59.5%         p<.001
Prep Det Adj N   57       9        424      81       33.3%         p<.001






Distribution of Lowest MI by position

Phrase Type      Pos. 1   Pos. 2   Pos. 3   Pos. 4   % at Lowest   p
Prep N           805      415                        63.9%         p<.001
Prep Det N       696      577      168               48.7%         p<.001
Prep N N         305      239      3                 47.2%         p<.001
Prep Det N N     201      216      29       0        35.3%         p<.001
Prep Adj N       192      231      5                 51.9%         p<.001
Prep Det Adj N   254      247      74       0        39.0%         p<.001






Results

§ The different measures of association make very different assessments concerning the location of the point of highest statistical uncertainty.

§ The hypothesis is not confirmed in its strongest form: Hesitations are not always placed at the point of highest statistical uncertainty.

§ BUT: More hesitations are placed at the point of highest statistical uncertainty than expected by chance. This holds for all measures tested except backwards transitional probability.
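
The logic of this test can be sketched as follows: for each phrase, find the transition with the lowest association score (the "point of highest statistical uncertainty") and check whether the hesitation was placed there. The sketch below is only an illustration under stated assumptions. The data are invented, the function names are hypothetical, and the chance baseline of one divided by the number of positions is an assumption, since the slides do not say how chance was computed.

```python
def weakest_position(transition_scores):
    """Return the 1-based position with the lowest score (first one wins on ties)."""
    index = min(range(len(transition_scores)), key=transition_scores.__getitem__)
    return index + 1

def share_at_weakest(observations):
    """Proportion of hesitations placed at the weakest transition of their phrase."""
    hits = sum(
        1 for scores, hes_position in observations
        if weakest_position(scores) == hes_position
    )
    return hits / len(observations)

# Hypothetical Prep Det N tokens: (scores for positions 1-3, observed hesitation position).
observations = [
    ([0.02, 0.40, 0.10], 1),
    ([0.30, 0.50, 0.05], 3),
    ([0.01, 0.60, 0.20], 3),
    ([0.15, 0.45, 0.03], 3),
]

share = share_at_weakest(observations)
chance = 1 / 3  # assumed baseline: three equally likely positions
```

With these invented values three of the four hesitations fall on the weakest transition, so `share` is 0.75 against an assumed chance level of one third. A significance test on real counts would be needed before drawing any conclusion.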




Multifactorial Analysis



Conditions


§ Multinomial outcomes
§ Multifactorial
§ Partially correlated/collinear predictors


CART-Trees


§ Classification and Regression Trees
§ Algorithm ‘grows’ trees through recursive binary partitioning
§ Can handle multinomial outcomes, complex interactions & collinear predictors
§ ctree function from the party package for R (Hothorn et al. 2006)
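
A single step of binary partitioning can be sketched in Python as below. Note the swap: ctree selects splits with conditional inference (permutation-based significance tests), whereas this sketch uses the simpler Gini impurity known from classic CART, purely to illustrate what "finding a binary split" means. The data and function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'.

    Returns None if all values are identical and no split is possible.
    """
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical data: bigram frequency and the position where the hesitation occurred.
bigram_freq = [5, 12, 40, 300, 350, 900]
hes_position = [2, 2, 2, 1, 1, 1]

threshold, impurity = best_binary_split(bigram_freq, hes_position)
```

Applying the same search again to each of the two resulting subsets, over all predictors, is what makes the partitioning recursive.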



CART-Trees


[Figure: conditional inference tree. Each terminal node shows a bar plot of the proportion (0 to 1) of hesitations at positions 1, 2 and 3. Nesting below is inferred from the node numbering.]

Node 1: bi0.freq.NXT, p < 0.001 (≤ 336 | > 336)
  Node 2: MI0.NXT, p < 0.001 (≤ 2.873 | > 2.873)
    Node 3: G2.NXT, p < 0.001 (≤ -0.549 | > -0.549)
      Node 4 (n = 202)
      Node 5: MI1.NXT, p = 0.007 (≤ 2.502 | > 2.502)
        Node 6: MI0.NXT, p = 0.043 (≤ 1.835 | > 1.835)
          Node 7 (n = 295)
          Node 8 (n = 69)
        Node 9 (n = 252)
    Node 10: G0.NXT, p < 0.001 (≤ -0.507 | > -0.507)
      Node 11 (n = 292)
      Node 12: TPD.bi0.NXT, p < 0.001 (≤ 0.261 | > 0.261)
        Node 13: bi0.freq.NXT, p = 0.045 (≤ 176 | > 176)
          Node 14 (n = 163)
          Node 15 (n = 15)
        Node 16 (n = 115)
  Node 17 (n = 37)


Random Forests


§ reliance on a single tree may be problematic
• only locally optimal splits
• variable predictions
• some predictors never appear
• importance of predictors hard to assess

§ reliance on several thousand trees (here: 3,000)
§ random subset of predictors
§ random subset of data points
§ cforest command from the party package in R (Hothorn et al. 2006, Strobl et al. 2007, Strobl et al. 2008)
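
The two sources of randomness listed above can be sketched in Python. This is a hedged illustration, not a description of how cforest is implemented: the function name and parameters are hypothetical, sampling is done with replacement here (implementations may instead subsample without replacement, depending on settings), and the predictor subset is drawn once per tree for brevity, whereas random forest implementations commonly redraw it at every split.

```python
import random

def draw_tree_inputs(n_rows, predictors, n_try, rng):
    """Draw the random inputs for one tree of a forest.

    Returns the in-bag row indices (sampled with replacement), the
    out-of-bag row indices (rows never drawn) and a random predictor subset.
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    candidate_predictors = rng.sample(predictors, n_try)
    return in_bag, out_of_bag, candidate_predictors

# Hypothetical predictor names, loosely following the labels used in the talk.
predictors = ["bi.freq", "MI", "TPD", "TPB", "G"]
rng = random.Random(0)
in_bag, out_of_bag, candidates = draw_tree_inputs(10, predictors, 2, rng)
```

The out-of-bag rows are the data points a given tree never saw during fitting, which is why predictions on them give a more conservative accuracy estimate than predictions on the full training data.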



Random Forests - Variable Importance


[Figure: dot plot of sort(PDN.varimp). x-axis: variable importance (0.000 to 0.025). Predictors in the order listed on the slide: MI2.NXT, TPB.bi2.NXT, w2.freq.NXT, bi1.freq.NXT, G1.NXT, bi2.freq.NXT, TPD.bi2.NXT, w3.freq.NXT, TPB.bi1.NXT, MI1.NXT, TPD.bi1.NXT, hes.type, TPB.bi0.NXT, G2.NXT, w1.freq.NXT, w0.freq.NXT, G0.NXT, bi0.freq.NXT, TPD.bi0.NXT, MI0.NXT.]

What causes chunking?



Results (Random Forests)


Phrase Type      Correct Pred.   Sig. Level   Residuals
Prep N           82.7%           p<.001       15.2     -15.69
Prep Det N       71.9%           p<.001       6.46     -7.72
Prep N N         72.7%           p<.001       4.84     -5.59
Prep Det N N     65.4%           p<.001       10.35    -7.94
Prep Adj N       71.0%           p<.001       3.47     -4.1
Prep Det Adj N   63.1%           p<.001       7.86     -6.68


Results (Out-of-Bag)

Phrase Type      Correct Pred.   Sig. Level   Residuals
Prep N           69.9%           p<.001       8.93     -9.22
Prep Det N       64.4%           p<.001       2.78     -3.33
Prep N N         59.5%           non-sig.     -        -
Prep Det N N     44.1%           p<.01        2.59     -1.98
Prep Adj N       57.3%           non-sig.     -        -
Prep Det Adj N   48.2%           p<.001       2.38     -2.02




Performance of Predictors


[Figure: variable importance by predictor type. x-axis: Predictor (1 TPB, 2 w.freq, 3 bi.freq, 4 MI, 5 TPD, 6 G, 7 hes.type); y-axis: Variable Importance (ticks at 0.005, 0.010, 0.015).]



No, co-occurrence frequency performs on par with the other predictors.

No evidence that co-occurrence frequency is the sole cause of chunking.

But: No sign of highly frequent sequences being ‘devalued’.







Is Chunking Abrupt or Gradual?

Yes, there is not a single split in the CART-trees which suggests the opposite.

We always find: The higher the score of a bigram, the less likely the speaker is to interrupt the speech flow at this transition.

Splits in the trees are made across the spectrum and based on all predictors.



Yes, frequency-derived measures are far better predictors of hesitation placement than phrase structure.

Chunking across the prepositional phrase is possible and, in fact, common.




Chunking in Violation of the PP Boundary


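[Figure: conditional inference tree. Each terminal node shows a bar plot of the proportion (0 to 1) of hesitations at positions 1, 2 and 3. Nesting below is inferred from the node numbering.]

Node 1: [split variable not recoverable] (≤ 0.13 | > 0.13)
  Node 2: hes.type, p = 0.022 (u | {dm, pause})
    Node 3 (n = 85)
    Node 4 (n = 181)
  Node 5: G0.NXT, p = 0.002 (≤ 3.11 | > 3.11)
    Node 6 (n = 61)
    Node 7 (n = 104)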


§ Quantifier + of
It would be great to have some of those [pause] organisations [...]
Examples: one of, many of, (a) lot of
n = 289
Hesitation before the preposition: 7.3 % (rest: 47.17 %)

§ Further of-Collocates
Examples: sort(s) of, kind(s) of, out of, terms of
n = 121
Hesitation before the preposition: 4.1 % (rest: 44.6 %)

§ Ctree models perform above average on these structures
§ Characterised by positive or high MI score and high direct transitional probability






[Figure: Backwards Transitional Probability and G for 'Support Noun+of' Bigrams. x-axis: Backwards Transitional Probability (log scaled, 0.0001 to 1); y-axis: G (-10 to 15).]

[Figure: MI and Direct Transitional Probability for 'Support Noun+of' Bigrams. x-axis: MI (-5 to 15); y-axis: Direct Transitional Probability (log scaled, 0.0001 to 1). Labelled bigrams: kind of, kinds of, sort of, sorts of, type of, types of, form of, forms of.]

[Figure: Backwards Transitional Probability and G for 'out of' & 'terms of'. x-axis: Backwards Transitional Probability (log scaled, 0.0001 to 1); y-axis: G (-10 to 15).]

[Figure: MI and Direct Transitional Probability for 'out of' & 'terms of'. x-axis: MI (-5 to 15); y-axis: Direct Transitional Probability (log scaled, 0.0001 to 1). Labelled bigrams: out of, terms of.]

[Figure: variable importance by predictor type. x-axis: Predictor (1 TPB, 2 bi.freq, 3 MI, 4 TPD, 5 G); y-axis: Variable Importance (0.000 to 0.020).]




Chunking strengths across the phrase boundary vary + are immensely important for the model.




Summary






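Hypotheses

Hypothesis 1 (co-occurrence frequency vs. probabilistic measures): ✘
Hypothesis 2 (inverse relation between frequency and interruption): ✔
Hypothesis 3 (co-occurrence frequency vs. phrase structure): ✔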


How are Chunks Stored?


§ Holistic Storage
Holistic storage at the first encounter



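[Diagram: storage of the sequence a lot of people. Stored units shown: the words a, lot, of, people; the bigrams a lot, lot of, of people; and the whole sequence a lot of people. Units are repeated to indicate multiple stored tokens. Label: semantic filter.]

[Diagram: network of the words a, lot, of, people.]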


Conclusions Concerning the Mental Model


§ Measures like the MI score tend to rate sequences which form a semantic unit much higher than sequences which do not form a semantic unit
• The good performance of the MI score could be interpreted as a semantic filter being at work
• BUT: MI is no better predictor than frequency

§ Word frequencies are poor predictors
• In an exemplar model, we would expect competition between the parts and the whole, for which we find no evidence in the data

§ A very simple network model of chunking suffices to explain a processing phenomenon like hesitation placement.


Well, thank you for uh your attention.



Sources: http://www.freidok.uni-freiburg.de/volltexte/9793/
