Cognitive Modeling: How Humans Learn Complex Linguistic Systems
Lisa Pearl, UC Irvine
March 10, 2008
AIML Seminar Series, Center for Machine Learning & Intelligent Systems, UC Irvine
ML, AI, & Cognitive Modeling

Machine Learning: development of algorithms and techniques that allow machines to learn, motivated by the capabilities of computers.

Artificial Intelligence & Learning: development of algorithms and techniques that allow machines to learn like humans, motivated by human behavior.

Cognitive Modeling: development of models that allow understanding of how humans learn, attempting to simulate human behavior by using the techniques humans use.

Examples of cognitive modeling:
Extraction (word segmentation): Swingley 2005; Goldwater, Griffiths, & Johnson 2007
Categorization (phonemes): Vallabha et al. 2007
Semi-supervised learning (inductive biases in causation): Mansinghka et al. 2006
Cognitive Modeling of Language
Different problems: more and less easily discernible from data

Categorization/Clustering
Ex: What are the contrastive sounds of a language?

Vowel categories in English & Japanese (Vallabha et al. 2007)
Hypothesis space: 3 dimensions of variation
English relevant dimensions: 1 and 2
Japanese relevant dimensions: 2 and 3
Cognitive Modeling of Language
Different problems: more and less easily discernible from data

Extraction
Ex: Where are words in fluent speech?

Assumption from experimental work: the relevant unit of word segmentation for infants is the syllable (Gambell & Yang 2006; Swingley 2005).

"Who's afraid of the big bad wolf?"
Candidate segmentations of the syllable stream:
who 'sa frai dof the big bad wolf
who'sa fraidofthe big bad wolf
who 'sa fraid of the bigbadwolf
who's afraid of the big bad wolf (correct)
Cognitive Modeling of Language
Different problems: more and less easily discernible from data

Mapping
Ex: What are the word affixes that signal meaning (e.g. past tense in English)?

Regularity: blink~blinked, ping~pinged, confide~confided
Irregularity: drink~drank, sing~sang, hide~hid, think~thought
Cognitive Modeling of Language
Different problems: more and less easily discernible from data

Complex systems
Ex: What is the generative system that creates the observed (structured) data of language (ex: syntax, metrical phonology)?

Observable data: word order
Generative system: syntax

English: Subject Verb Object
German: Subject Verb t_Subject Object t_Verb
Kannada: Subject t_Object Verb Object
Cognitive Modeling of Language
Different problems: more and less easily discernible from data

Complex systems
Ex: What is the generative system that creates the observed (structured) data of language (ex: syntax, metrical phonology)?

Observable data: stress contour (EMphasis)
Generative system: metrical phonology

Candidate metrical analyses of "EMphasis":
EM pha sis   ( H L ) H
EM pha sis   ( S S ) S
EM pha sis   ( S S S )
EM pha sis   ( H L L )

Today's focus
Road Map
Introduction to complex linguistic systems: general problems; parametric systems; parametric metrical phonology
Learnability of complex linguistic systems: general learnability framework; case study: English metrical phonology (available data & associated woes; unconstrained probabilistic learning; constrained probabilistic learning)
Where next? Implications & Extensions
General Problems with Learning Complex Linguistic Systems

What children encounter: the output of the generative linguistic system (EMphasis).

What children must learn: the components of the system that combine to generate this observable output (EM pha sis):
Are syllables differentiated?
Are all syllables included?
Which syllable of a larger unit is stressed?

Why this is tricky: There is often a non-transparent relationship between the observable form of the data and the underlying system that produced it. It is hard to know what parameters of variation to consider. Moreover, data are often ambiguous, even if the parameters of variation are known.

Levels of abstract structure:
( H L ) H    EM pha sis
( S S S )    EM pha sis
General Problems with Learning Complex Linguistic Systems

A hypothesis for a language consists of a combination of generalizations about that language (a grammar). But this leads to a theoretically infinite hypothesis space:

Are syllables differentiated? {No, Yes-2 distinctions, Yes-3 distinctions, ...}
Are all syllables included? {Yes, No-not leftmost, No-not rightmost, ...}
Which syllable of a larger unit is stressed? {Leftmost, Rightmost, Second from Left, ...}
Rhyming matters? {No, Yes-every other, ...}

Observation: Languages only differ from each other in constrained ways. Not all generalizations are possible.

Idea: Children's hypotheses are constrained so they only consider generalizations that are possible in the world's languages:

Are syllables differentiated? {No, Yes-2 distinctions, Yes-3 distinctions}
Are all syllables included? {Yes, No-not leftmost, No-not rightmost}
Which syllable of a larger unit is stressed? {Leftmost, Rightmost}

Linguistic parameters = finite (if large) hypothesis space of possible grammars (Chomsky 1981; Halle & Vergnaud 1987)
Learning Parametric Linguistic Systems

Linguistic parameters give the benefit of a finite hypothesis space. Still, the hypothesis space can be quite large: assuming there are n binary parameters, there are 2^n core grammars to choose from (Clark 1994). The hypothesis space grows exponentially in the number of parameters.
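The 2^n combinatorics can be sketched directly. The parameter names below are illustrative stand-ins, not the deck's exact inventory:

```python
from itertools import product

# Each binary parameter doubles the number of candidate grammars.
parameters = {
    "quantity_sensitive": [False, True],
    "extrametricality": [False, True],
    "feet_from_left": [False, True],
    "bounded_feet": [False, True],
    "foot_head_left": [False, True],
}

# Enumerate every core grammar: one value choice per parameter.
grammars = [dict(zip(parameters, values))
            for values in product(*parameters.values())]

print(len(grammars))  # 2**5 = 32
```

With 9 parameter choices, as in the English case study later, the space already holds hundreds of candidate grammars.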
Parametric Metrical Phonology

Metrical phonology: what tells you to put the EMphasis on a particular SYLlable.

Process speakers use:
Basic input unit: syllables.
Larger units formed: metrical feet. The way these are formed varies from language to language. Only syllables in metrical feet can be stressed.
Stress assigned within metrical feet. The way this is done also varies from language to language.

Observable data: stress contour of the word
em pha sis -> (em pha) sis -> (EM pha) sis -> EMphasis

The system's parameters of variation are to be determined by the learner from the available data.

The metrical phonology system here: 5 main parameters, 4 sub-parameters (adapted from Dresher 1999 and Hayes 1995). Sub-parameters are options that become available if a main parameter takes a certain value. Most parameters involve metrical foot formation. All combine to generate the stress contour output.
A Brief Tour of Parametric Metrical Phonology

Are syllables differentiated?

No: the system is quantity-insensitive (QI).
  lu   di   crous
  CVV  CV   CCVC
  S    S    S

Yes: the system is quantity-sensitive (QS).
  Only allowed method: syllables differ by rime weight. (Syllable = onset + rime; rime = nucleus + coda. For "crous" /krəs/: onset "kr", rime "əs".)
  Only allowed number of divisions: 2, Heavy vs. Light. VV is always Heavy; V is always Light.
  (narrowing of hypothesis space)

  Option 1: VC Heavy (QS-VC-H)
    lu   di   crous
    CVV  CV   CCVC
    H    L    H

  Option 2: VC Light (QS-VC-L)
    lu   di   crous
    CVV  CV   CCVC
    H    L    L
A Brief Tour of Parametric Metrical Phonology

Are all syllables included in metrical feet?

Yes: the system has no extrametricality (Em-None).
  af  ter  noon
  VC  VC   VV
  L   L    H
  (   ...     )

No: the system has extrametricality (Em-Some).
  Only allowed number of exclusions: 1.
  Only allowed exclusions: Leftmost or Rightmost syllable.
  (narrowing of hypothesis space)

  Leftmost syllable excluded (Em-Left):
    a  gen  da
    V  VC   V
    L  H    L
    a  ( ... )

  Rightmost syllable excluded (Em-Right):
    lu  di  crous
    VV  V   VC
    H   L   H
    ( ... )  crous
A Brief Tour of Parametric Metrical Phonology

What direction are metrical feet constructed from? Two logical options.

From the left: metrical feet are constructed from the left edge of the word (Ft Dir Left).
From the right: metrical feet are constructed from the right edge of the word (Ft Dir Right).

  lu  di  crous
  VV  V   VC
  Ft Dir Left:   ( H  L  H
  Ft Dir Right:    H  L  H )
A Brief Tour of Parametric Metrical Phonology

Are metrical feet unrestricted in size?

Yes: Metrical feet are unrestricted, delimited only by Heavy syllables if there are any (Unbounded).
  Ft Dir Left:   L L L H L  ->  ( L L L )( H L )
  Ft Dir Right:  L L L H L  ->  ( L L L H )( L )
  Ft Dir Left/Right, no Heavy syllables: S S S S S  ->  ( S S S S S )

No: Metrical feet are restricted (Bounded). (narrowing of hypothesis space)

  The size is restricted to 2 options: 2 or 3.
    2 units per foot (Bounded-2), Ft Dir Left: x x x x  ->  ( x x )( x x )
    3 units per foot (Bounded-3), Ft Dir Left: x x x x  ->  ( x x x )( x )

  The counting units are restricted to 2 options: syllables or moras.
    Count by syllables (Bounded-Syllabic), Ft Dir Left, Bounded-2:
      H L L H  ->  ( H L )( L H )
    Count by moras (Bounded-Moraic), Ft Dir Left, Bounded-2:
      Moras (unit of weight): H = 2 moras (xx), L = 1 mora (x)
      H L L H  ->  ( H )( L L )( H )

  Compare: the same word H L L H is footed ( H L )( L H ) when counting syllables, but ( H )( L L )( H ) when counting moras.
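The Bounded-2 syllable/mora contrast above can be sketched in code. This is an illustrative simplification, assuming greedy left-to-right foot building, not the deck's full system:

```python
def feet_bounded2(word, unit):
    """Group syllables into 2-unit feet from the left (Ft Dir Left, Bounded-2).

    unit='syllable': every syllable counts as 1 unit.
    unit='mora':     H counts as 2 moras, L as 1 mora.
    """
    size = {"H": 2, "L": 1} if unit == "mora" else {"H": 1, "L": 1}
    feet, current, count = [], [], 0
    for syll in word:
        current.append(syll)
        count += size[syll]
        if count >= 2:               # foot is full: close it
            feet.append(tuple(current))
            current, count = [], 0
    if current:                      # leftover material forms a final foot
        feet.append(tuple(current))
    return feet

word = ["H", "L", "L", "H"]
print(feet_bounded2(word, "syllable"))  # [('H', 'L'), ('L', 'H')]
print(feet_bounded2(word, "mora"))      # [('H',), ('L', 'L'), ('H',)]
```

The two calls reproduce the slide's comparison: identical weight strings, different footings depending on the counting unit.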
A Brief Tour of Parametric Metrical Phonology

Within a metrical foot, which syllable is stressed? Two options (another hypothesis space restriction).

Leftmost: stress the leftmost syllable of each foot (Ft Hd Left): ( H )( L L )( H ), with stress on the first syllable of each foot.
Rightmost: stress the rightmost syllable of each foot (Ft Hd Right): ( H )( L L )( H ), with stress on the last syllable of each foot.
Generating a Stress Contour

Process the speaker uses to generate the stress contour of "emphasis":

em pha sis (VC CV CVC)

Are syllables differentiated? Yes. VC syllables are Heavy: H L H
Are any syllables extrametrical? Yes. The rightmost syllable is not included in a metrical foot: H L ( ... ) excludes the final H
Which direction are feet constructed from? From the right: H L ) H
Are feet unrestricted? No. 2 syllables per foot: ( H L ) H
Which syllable of the foot is stressed? Leftmost: ( H L ) H -> EM pha sis

Result: EMphasis.

Learner's task: figure out which parameter values were used to generate this contour.
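The five steps above can be run as a toy generator. Everything here is a simplification for illustration: the syllable-shape test and the hard-coded parameter values (QS with VC Heavy, Em-Right, Ft Dir Right, Bounded-2, Bounded-Syllabic, Ft Hd Left) come from the walkthrough, and the function is not a general implementation of the system:

```python
def stress_contour(syllables):
    """Toy generator fixed to the walkthrough's parameter values:
    QS (VC Heavy), Em-Right, Ft Dir Right, Bounded-2 (syllabic), Ft Hd Left."""
    # Step 1: quantity sensitivity -- VC and VV rimes are Heavy.
    # (Shown for step 1; with syllabic counting, weight does not change foot size here.)
    weights = ["H" if s in ("VC", "VV", "CVC", "CVV", "CCVC") else "L"
               for s in syllables]
    # Step 2: extrametricality (Em-Right) -- exclude the rightmost syllable.
    footable = list(range(len(syllables) - 1))
    # Steps 3-4: build 2-syllable feet from the right edge.
    feet, i = [], len(footable)
    while i > 0:
        feet.insert(0, footable[max(0, i - 2):i])
        i -= 2
    # Step 5: Ft Hd Left -- stress the leftmost syllable of each foot.
    stressed = {foot[0] for foot in feet}
    return ["1" if i in stressed else "0" for i in range(len(syllables))]

print(stress_contour(["VC", "CV", "CVC"]))  # ['1', '0', '0'] -> EM pha sis
```

Running it on "em pha sis" (VC CV CVC) yields stress on the first syllable only, matching the slide's derivation of EMphasis.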
Road Map
Introduction to complex linguistic systems: general problems; parametric systems; parametric metrical phonology
Learnability of complex linguistic systems: general learnability framework; case study: English metrical phonology (available data & associated woes; unconstrained probabilistic learning; constrained probabilistic learning)
Where next? Implications & Extensions
Choosing among grammars

Human learning seems to be gradual and somewhat robust to noise, so it needs some probabilistic learning component.

Since grammars are parameterized, the child can make use of this information to constrain the hypothesis space: learn over parameters, not entire sets of parameter values (probabilistic learning over parameter values).
A caveat about learning parameters separately

Parameters are system components that combine together to generate output. The choice of one parameter may influence the choice of subsequent parameters.

Point: The order in which parameters are set may determine whether they are set correctly from the data. (Dresher 1999)
The learning framework: 3 components

(1) Hypothesis space: the competing parameter values, each pair starting equiprobable (0.5 vs. 0.5).
(2) Data: the input data points the learner encounters.
(3) Update procedure: after processing data, the parameter value probabilities shift away from 0.5 (e.g. 0.3 vs. 0.7, 0.6 vs. 0.4).
Key point for cognitive modeling: psychological plausibility

Any probabilistic update procedure must, at the very least, be incremental/online. Why? Humans (especially human children) don't have infinite memory. It is unlikely that human children can hold a whole corpus worth of data in their minds for analysis later on. Models that do this are AI (not cognitive modeling): they can simulate human behavior, but not necessarily the way humans produce it (ex: Foraker et al. 2007, Goldwater et al. 2007).
Two psychologically plausible probabilistic update procedures

Naïve Parameter Learner (NParLearner), Yang (2002):
Probabilistic generation & testing of parameter value combinations (incremental).
Hypothesis update: linear reward-penalty (Bush & Mosteller 1951).

Bayesian Learner (BayesLearner):
Probabilistic generation & testing of parameter value combinations (incremental).
Hypothesis update: Bayesian updating (Chew 1971: binomial distribution).
Case study: English metrical phonology

Adult English system values: QS, QS-VC-H, Em-Some, Em-Right, Ft Dir Right, Bounded, Bounded-2, Bounded-Syllabic, Ft Hd Left

Estimate of child input: caretaker speech to children between the ages of 6 months and 2 years (CHILDES [Brent & Bernstein-Ratner corpora]: MacWhinney 2000). Total words: 540,505; mean length of utterance: 3.5.

Words were parsed into syllables using the MRC Psycholinguistic Database (Wilson 1988) and assigned likely stress contours using the American English CALLHOME database of telephone conversation (Canavan et al. 1997).
English Data

[Figure: proportions of the English input data compatible with each parameter value, e.g. QI vs. QS, Em-None vs. Em-Some, Ft Dir Left vs. Ft Dir Rt, Bounded vs. Unbounded, Bounded-2 vs. Bounded-3, Bounded-Syl vs. Bounded-Mor, Ft Hd Left.]
Case study: English metrical phonology

English is a non-trivial language, full of exceptions. The data are noisy: 27% of the data are incompatible with the correct English grammar on at least one parameter value.

Adult English system values: QS, QS-VC-H, Em-Some, Em-Right, Ft Dir Right, Bounded, Bounded-2, Bounded-Syllabic, Ft Hd Left
Exceptions instantiate: QI, QS-VC-L, Em-None, Ft Dir Left, Unbounded, Bounded-3, Bounded-Moraic, Ft Hd Right

Hard - therefore interesting!
Probabilistic learning for English
Probabilistic generation and testing of parameter values (Yang 2002)

For each parameter, the learner associates a probability with each of the competing parameter values. Initially all are equiprobable:

QI = 0.5            QS = 0.5
QSVCL = 0.5         QSVCH = 0.5
Em-Some = 0.5       Em-None = 0.5
Em-Left = 0.5       Em-Right = 0.5
Ft Dir Left = 0.5   Ft Dir Rt = 0.5
Bounded = 0.5       Unbounded = 0.5
Bounded-2 = 0.5     Bounded-3 = 0.5
Bounded-Syl = 0.5   Bounded-Mor = 0.5
Ft Hd Left = 0.5    Ft Hd Rt = 0.5

For each data point encountered (e.g. AFterNOON), the learner probabilistically generates a set of parameter values (a grammar): QI or QS? ... if QS, QSVCL or QSVCH? Em-None or Em-Some? ...

Sample generated grammar: QS, QSVCL, Em-None, Ft Dir Right, Bounded, Bounded-2, Bounded-Syl, Ft Hd Right
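A minimal sketch of this generation step. It assumes independent binary parameters with illustrative names, and models only one dependency (the QSVCL/QSVCH sub-parameter is dropped when QI is chosen), which is a simplification of the full system:

```python
import random

# Probability of the first-listed value of each binary parameter.
probs = {
    "QS": 0.5, "QSVCH": 0.5, "Em-Some": 0.5, "Em-Right": 0.5,
    "Ft Dir Rt": 0.5, "Bounded": 0.5, "Bounded-2": 0.5,
    "Bounded-Syl": 0.5, "Ft Hd Left": 0.5,
}
alternatives = {
    "QS": "QI", "QSVCH": "QSVCL", "Em-Some": "Em-None",
    "Em-Right": "Em-Left", "Ft Dir Rt": "Ft Dir Left",
    "Bounded": "Unbounded", "Bounded-2": "Bounded-3",
    "Bounded-Syl": "Bounded-Mor", "Ft Hd Left": "Ft Hd Rt",
}

def sample_grammar():
    """Draw one parameter value per parameter, weighted by current probabilities."""
    grammar = {}
    for value, p in probs.items():
        grammar[value] = value if random.random() < p else alternatives[value]
    # Sub-parameter dependency: QSVCL/QSVCH only exists if QS was chosen.
    if grammar["QS"] == "QI":
        del grammar["QSVCH"]
    return list(grammar.values())

print(sample_grammar())
```

Each call produces one candidate grammar, like the sample grammar on the slide; as the probabilities move away from 0.5, the sampled grammars concentrate on the favored values.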
Probabilistic learning for English
The learner then uses this grammar to generate a stress contour for theobserved data point.
Probabilistic generation and testing of parameter values (Yang 2002)
AFterNOON
QSQS, , QSVCLQSVCL, , Em-NoneEm-None, Ft Dir RightFt Dir Right, BoundedBounded, Bounded-2Bounded-2, Bounded-SylBounded-Syl, Ft Hd RightFt Hd Right
If the generated stress contour matches the observed stresscontour, the grammar successfully “parses” the data point. Allparticipating parameter values are rewarded.
((LL) ) ((L L HH))
AF ter NOON
VC CVC CVVC
reward all
Probabilistic learning for English
The learner then uses this grammar to generate a stress contour for theobserved data point.
Probabilistic generation and testing of parameter values (Yang 2002)
AFterNOONQSQS, , QSVCLQSVCL, , Em-NoneEm-None,Ft Dir RightFt Dir Right, BoundedBounded,Bounded-2Bounded-2, Bounded-SylBounded-Syl,Ft Hd RightFt Hd Right
((LL) ) ((L L HH))
AF ter NOON VC CVC CVVC
If the generated stress contour does not match the observed stress contour, thegrammar does not successfully “parse” the data point. All participatingparameter values are punished.
reward all
QS, QSVCL, Em-None, Ft Dir Left, Bounded, Bounded-2, Bounded-Syl, Ft Hd Right
(L L) (H)
af TER NOON
VC CVC CVVC
punish all
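The generate-and-test loop above can be sketched in Python. This is a hypothetical illustration, not Yang's (2002) implementation: `generate_contour`, `reward`, and `punish` are assumed callbacks, and the parameter grid is abbreviated.

```python
import random

# Abbreviated parameter grid: each parameter opposes two values;
# the stored probability belongs to the first value of the pair.
probs = {("QI", "QS"): 0.5, ("Em-None", "Em-Some"): 0.5,
         ("Ft Dir Left", "Ft Dir Right"): 0.5, ("Ft Hd Left", "Ft Hd Right"): 0.5}

def sample_grammar(probs):
    """Probabilistically choose one value per parameter."""
    return {pair: (pair[0] if random.random() < p else pair[1])
            for pair, p in probs.items()}

def learn_one(word, observed_contour, generate_contour, reward, punish):
    """Sample a grammar, generate a contour for the word, and test it
    against the observed contour, rewarding or punishing all
    participating parameter values."""
    grammar = sample_grammar(probs)
    matched = generate_contour(word, grammar) == observed_contour
    for pair, value in grammar.items():
        (reward if matched else punish)(pair, value)
    return matched
```

Every value that participated in generation is updated together, which is what makes a single ambiguous data point reward or punish whole bundles of parameter values at once.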
Probabilistic learning for English
Probabilistic generation and testing of parameter values (Yang 2002)
Update parameter value probabilities
NParLearner (Yang 2002): Linear Reward-Penalty
Learning rate γ: small = small changes; large = large changes
Parameter values v1 vs. v2
reward v1: pv1 = pv1 + γ(1 − pv1);  pv2 = 1 − pv1
punish v1: pv1 = (1 − γ)pv1;  pv2 = 1 − pv1
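The Linear Reward-Penalty update can be written out directly (a minimal sketch; the function names are mine, the formulas are the slide's):

```python
def lrp_reward(p_v1, gamma):
    """Reward value v1: move its probability toward 1 by learning rate gamma."""
    p_v1 = p_v1 + gamma * (1 - p_v1)
    return p_v1, 1 - p_v1  # (p_v1, p_v2)

def lrp_punish(p_v1, gamma):
    """Punish value v1: shrink its probability toward 0 by learning rate gamma."""
    p_v1 = (1 - gamma) * p_v1
    return p_v1, 1 - p_v1  # (p_v1, p_v2)
```

A small γ (e.g. 0.01) lets each data point nudge the probabilities only slightly; a large γ lets single data points move them substantially.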
BayesLearner: Bayesian update of binomial distribution (Chew 1971)
pv = (α + 1 + successes) / (α + β + 2 + total data seen)
Parameter value v
reward: successes + 1;  punish: successes + 0
Parameters α, β:
α = β: initial bias at p = 0.5
α, β < 1: initial bias toward endpoints (p = 0.0, 1.0)
here: α = β = 0.5
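The BayesLearner's estimate follows directly from the slide's formula (a sketch; the function name is mine, the defaults α = β = 0.5 are the slide's):

```python
def bayes_prob(successes, total_seen, alpha=0.5, beta=0.5):
    """Probability of parameter value v after Bayesian updating of a
    binomial distribution: pv = (alpha + 1 + successes) /
    (alpha + beta + 2 + total_seen)."""
    return (alpha + 1 + successes) / (alpha + beta + 2 + total_seen)
```

With α = β = 0.5 the initial estimate (no data yet) is p = 0.5, and a value rewarded on most of the data seen drifts toward 1.0.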
Probabilistic learning for English
Probabilistic generation and testing of parameter values (Yang 2002)
Update parameter value probabilities
After learning: expect probabilities of parameter values to converge near endpoints (above/below some threshold).
QI = 0.3   QS = 0.7
QSVCL = 0.6   QSVCH = 0.4
Em-Some = 0.1   Em-None = 0.9
…
Once set, a parameter value is always used during generation,since its probability is 1.0.
Em-None = 1.0
QI/QS? … if QS, QSVCL or QSVCH? Em-None …
QS, QSVCL, Em-None, Ft Dir Right, Bounded, Bounded-2, Bounded-Syl, Ft Hd Right
Probabilistic learning for English
Goal: Converge on English values after learning period is over.
Learning Period Length: 1,160,000 words (based on estimates of words heard in a 6-month period, using Akhtar et al. (2004)).
QS, QSVCH, Em-Some, Em-Right, Ft Dir Right, Bounded, Bounded-2, Bounded-Syllabic, Ft Hd Left
Model                              Success rate (1000 runs)
NParLearner, 0.01 ≤ γ ≤ 0.05       1.2%
BayesLearner                       0.0%
Examples of incorrect target grammars
NParLearner: Em-None, Ft Hd Left, Unb, Ft Dir Left, QI; QS, Em-None, QSVCH, Ft Dir Rt, Ft Hd Left, B-Mor, Bounded, Bounded-2
BayesLearner: QS, Em-Some, Em-Right, QSVCH, Ft Hd Left, Ft Dir Rt, Unb; Bounded, B-Syl, QI, Ft Hd Left, Em-None, Ft Dir Left, B-2
Probabilistic learning for English: Modifications
Probabilistic generation and testing of parameter values (Yang 2002)
Update parameter value probabilities
Batch-learning (for very small batch sizes): smooth out some of the irregularities in the data.
Implementation (Yang 2002):
Success = increase parameter value’s batch counter by 1
Failure = decrease parameter value’s batch counter by 1
Invoke the update procedure (Linear Reward-Penalty or Bayesian Updating) when batch limit b is reached. Then, reset the parameter’s batch counters.
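One reading of this batch scheme, sketched as a wrapper around either update procedure (class and method names are illustrative; in particular, I assume the limit is reached when the counter hits ±b):

```python
class BatchCounter:
    """Accumulate +1 (success) / -1 (failure) outcomes for one parameter
    value and invoke the update procedure only when the batch limit b is
    reached, then reset the counter."""
    def __init__(self, b, update):
        self.b = b
        self.update = update  # called with +1 (net success) or -1 (net failure)
        self.counter = 0

    def success(self):
        self.counter += 1
        self._maybe_update()

    def failure(self):
        self.counter -= 1
        self._maybe_update()

    def _maybe_update(self):
        if abs(self.counter) == self.b:
            self.update(1 if self.counter > 0 else -1)
            self.counter = 0  # reset the batch counter
```

Because successes and failures cancel inside a batch, isolated noisy data points never trigger an update on their own.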
Probabilistic generation and testing of parameter values (Yang 2002)
Update parameter value probabilities + Batch Learning
NParLearner (Yang 2002): Linear Reward-Penalty
Invoke when the batch counter for pv1 or pv2 equals b.
Parameter values v1 vs. v2
reward v1: pv1 = pv1 + γ(1 − pv1);  pv2 = 1 − pv1
punish v1: pv1 = (1 − γ)pv1;  pv2 = 1 − pv1
BayesLearner: Bayesian update of binomial distribution (Chew 1971)
pv = (α + 1 + successes) / (α + β + 2 + total data seen)
Parameter value v
reward: successes + 1;  punish: successes + 0
Invoke when the batch counter for pv1 or pv2 equals b.
Note: total data seen + 1
Probabilistic learning for English

Model                                               Success rate (1000 runs)
NParLearner, 0.01 ≤ γ ≤ 0.05                        1.2%
BayesLearner                                        0.0%
NParLearner + Batch, 0.01 ≤ γ ≤ 0.05, 2 ≤ b ≤ 10    0.8%
BayesLearner + Batch, 2 ≤ b ≤ 10                    1.0%
Probabilistic learning for English: Modifications
Probabilistic generation and testing of parameter values (Yang 2002)
Update parameter value probabilities + Batch Learning
Learner bias: metrical phonology relies in part on knowledge of rhythmical properties of the language.
Human infants may already have knowledge of Ft Hd Left (Jusczyk, Cutler, & Redanz 1993) and QS (Turk, Jusczyk, & Gerken 1995).
Build this bias into the model: set probability of QS = Ft Hd Left = 1.0. These will always be chosen during generation.
QS … QSVCL or QSVCH? … Ft Hd Left
QS, QSVCL, Em-None, Ft Dir Right, Bounded, Bounded-2, Bounded-Syl, Ft Hd Left
Probabilistic learning for English

Model                                                      Success rate (1000 runs)
NParLearner, 0.01 ≤ γ ≤ 0.05                               1.2%
BayesLearner                                               0.0%
NParLearner + Batch, 0.01 ≤ γ ≤ 0.05, 2 ≤ b ≤ 10           0.8%
BayesLearner + Batch, 2 ≤ b ≤ 10                           1.0%
NParLearner + Batch + Bias, 0.01 ≤ γ ≤ 0.05, 2 ≤ b ≤ 10    5.0%
BayesLearner + Batch + Bias, 2 ≤ b ≤ 10                    1.0%
The best isn’t so great.
Where else can we modify?
(1) Hypothesis space
(2) Data
(3) Update procedure
[Diagram: input data points d shift each parameter’s value probabilities from 0.5/0.5 toward endpoint values (e.g. 0.3/0.7, 0.6/0.4)]
Linear Reward-Penalty, Bayesian, Batch…
Prior knowledge, biases: QS, Ft Hd Left known…
What about the data the learner uses?
Data Intake Filtering: “Selective Learning”
“Equal Opportunity” Intuition: Use all available data to uncover a full range of systematicity, and allow the probabilistic model enough data to converge.
[Diagram: under “equal opportunity”, the intake is all of the input data]
“Selective” Intuition: Use the really good data only.
One instantiation of “really good” = highly informative.
One instantiation of “highly informative” = data viewed by the learner as unambiguous (Fodor, 1998; Dresher, 1999; Lightfoot, 1999; Pearl & Weinberg, 2007).
Data intake filter
[Diagram: the input data are filtered down to a smaller intake]
Practical matters: Feasibility of unambiguous data
“It is unlikely that any example … would show the effect of only a single parameter value; rather, each example is the result of the interaction of several different principles and parameters”
(S S) (S) af ter noon
AFterNOON
(L L) (H) af ter noon
(L) (L H) af ter noon
Clark 1994
Existence?
Even if unambiguous data existed, how could a child identify them?
Identification?
What’s the same here, other than the output?
Existence? Depends on data set (empirically determined).
Identification?
Identifying unambiguous data: Cues (Dresher 1999; Lightfoot 1999): heuristic pattern-matching to the observable form of the data. Cues are available for each parameter value, known already by the learner.
S…S af ter noon Em-None
Parsing (Fodor 1998; Sakas & Fodor 2001): extract necessary parameter values from all successful parses of a data point.
(QI, Em-None, Ft Dir Left, Ft Hd Left, B, B-2, B-Syl)
(QS, QSVCL, Em-None, Ft Dir Left, Ft Hd Left, B, B-2, B-Syl)
af ter noon
Shared values: Em-None, Ft Dir Left, Ft Hd Left, Bounded, Bounded-2, Bounded-Syl
Both operate over a single data point at a time:compatible with incremental learning
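The parsing approach can be sketched as a set intersection over the successful parses of a data point (a hypothetical illustration; the parse sets in the usage mirror the afternoon example):

```python
def unambiguous_values(parses):
    """Given every successful parse of a data point, each represented as
    the set of parameter values it uses, return only the values required
    by all parses: these are what the data point unambiguously supports."""
    if not parses:
        return set()
    required = set(parses[0])
    for parse in parses[1:]:
        required &= set(parse)
    return required
```

For the two afternoon parses, the intersection keeps Em-None, Ft Dir Left, Ft Hd Left, B, B-2, and B-Syl, and discards the QI/QS values on which the parses disagree.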
Probabilistic learning from unambiguous data (Pearl 2008)
Each parameter has 2 values.
Advantage in data: How much more unambiguous data there is for one value over the other in the data distribution.
Assumption (Yang 2002): The value with the greater advantage will be the one a probabilistic learner will converge on over time.
[Diagram: of the two intakes of unambiguous data drawn from the input, the larger one has the advantage]
Allows us to be fairly agnostic about the exact nature of the probabilistic learning, provided it has this behavior.
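Under that assumption, predicting the learner's outcome for a parameter reduces to comparing unambiguous-data counts (a sketch; the function name and counts are illustrative):

```python
def advantage_winner(counts):
    """Given unambiguous-data counts for the two values of one parameter,
    return the value with the advantage, i.e. the one the probabilistic
    learner is assumed to converge on; return None on a tie."""
    (v1, n1), (v2, n2) = counts.items()
    if n1 == n2:
        return None
    return v1 if n1 > n2 else v2
```

This is what makes the analysis tractable: the advantage can be computed from the data distribution alone, without simulating any particular update rule.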
Probabilistic learning from unambiguous data (Pearl 2008)
The order in which parameters are set may determine if they are set correctly from the data.
Dresher 1999
Parsing:
Group 1: QS, Ft Hd Left, Bounded
Group 2: Ft Dir Right, QS-VC-Heavy
Group 3: Em-Some, Em-Right, Bounded-2, Bounded-Syl
The parameters are freely ordered w.r.t. each other within each group.
Cues:
(a) QS-VC-Heavy before Em-Right
(b) Em-Right before Bounded-Syl
(c) Bounded-2 before Bounded-Syl
The rest of the parameters are freely ordered w.r.t. each other.
Success guaranteed as long as parameter-setting order constraints are followed.
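Checking whether a proposed parameter-setting order obeys such “X before Y” constraints is straightforward (a sketch; the constraint list encodes the cue-based constraints (a)-(c) above):

```python
def respects_order(order, constraints):
    """Return True iff every (earlier, later) constraint is satisfied by
    the positions of the parameters in the proposed setting order."""
    pos = {param: i for i, param in enumerate(order)}
    return all(pos[earlier] < pos[later] for earlier, later in constraints)

# Cue-based constraints (a)-(c):
CUE_CONSTRAINTS = [("QS-VC-Heavy", "Em-Right"),
                   ("Em-Right", "Bounded-Syl"),
                   ("Bounded-2", "Bounded-Syl")]
```

Any order satisfying the constraints is acceptable; the constraints define a partial order, not a unique sequence.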
Road Map
Introduction to complex linguistic systems: General problems; Parametric systems; Parametric metrical phonology
Learnability of complex linguistic systems: General learnability framework; Case study: English metrical phonology (Available data & associated woes; Unconstrained probabilistic learning; Constrained probabilistic learning)
Where next? Implications & Extensions
Where we are now
Cognitive modeling: aimed at understanding how humans solve problems, generating human behavior by using psychologically plausible methods.
Language: learning complex systems is difficult. Success comes from integrating biases into probabilistic learning models.
Bias on hypothesis space: linguistic parameters already known, some values already known.
Bias on data: interpretive bias to use highly informative data.
[Diagram: input filtered to intake; parameter-value probabilities shift toward endpoints (e.g. 0.7/0.3, 0.8/0.2)]
Where we can go
(1) Interpretive bias: How successful on other difficult learning cases (noisy data sets, other complex systems)? Are there other methods of implementing interpretive biases that lead to successful learning? How necessary is an interpretive bias? Are there cleverer probabilistic learning methods that can succeed?
+ biases?
(2) Hypothesis space bias: Is it possible to infer the correct parameters of variation given less structured information a priori (e.g. larger units than syllables are required)? [Model Selection]
+ fewer biases?
(3) Informing AI/ML: Can we import the necessary biases for learning complex systems into language applications (e.g. speech generation)?
necessary biases
The big idea
Complex linguistic systems may well require something beyond probabilistic methods in order to be learned, and learned as well as humans learn them.
What this likely is: learner biases in hypothesis space and data intake (how to deploy probabilistic learning).
What we can do: take insights from cognitive modeling and apply them to problems in artificial intelligence and machine learning, & vice versa.
Thank You
Amy Weinberg, Jeff Lidz, Bill Idsardi, Charles Yang, Bill Sakas, Janet Fodor
The audiences at
University of California, Los Angeles Linguistics Department
University of Southern California Linguistics Department
BUCLD 32
UC Irvine Language Learning Group
UC Irvine Department of Cognitive Sciences
CUNY Psycholinguistics Supper Club
UDelaware Linguistics Department
Yale Linguistics Department
UMaryland Cognitive Neuroscience of Language Lab