LING 696B: Categorical perception, perceptual magnets and neural nets
From last time: a quick overview of phonological acquisition
Prenatal to newborn:
4-day-old French babies prefer listening to French over Russian (Mehler et al, 1988)
Prepared to learn any phonetic contrast in human languages (Eimas, 71)
A quick overview of phonological acquisition
6 months: effects of experience begin to surface (Kuhl, 92)
6-12 months: from a universal perceiver to a language-specific one (Werker & Tees, 84)
A quick overview of phonological acquisition
6-9 months: infants know the sequential patterns of their language
English vs. Dutch (Jusczyk, 93): “avoid” vs. “waardig”
Weird English vs. better English (Jusczyk, 94): “mim” vs. “thob”
Impossible Dutch vs. Dutch (Friederici, 93): “bref” vs. “febr”
A quick overview of phonological acquisition
6 months: segment words out of the speech stream based on transitional probabilities (Saffran et al, 96): pabikutibudogolatupabikudaropi…
8 months: remember words heard in stories read to them by graduate students (Jusczyk and Hohne, 97)
A quick overview of phonological acquisition
From 12-18 months and above: vocabulary growth from 50 words to a “burst”; lexical representation?
Toddlers: learning morpho-phonology
Example: the U-shaped curve, went --> goed --> wented/went (correct irregular first, then overregularization, then recovery)
Next few weeks: learning phonological categories
“The innovation of phonological structure requires us to think of the system of vocalizations as combinatorial, in that the concatenation of inherently meaningless phonological units leads to an intrinsically unlimited class of words ... It is a way of systematizing existing vocabulary items and being able to create new ones.”
-- Ray Jackendoff, “Foundations of Language” (also citing C. Hockett)
Categorical perception
Categorization
Discrimination
Eimas et al (1971)
Enduring result: tiny (1- to 4-month-old) babies can hear almost anything
Stimuli: 3 pairs along the synthetic /p/-/b/ continuum, each pair differing by 20 ms in voice onset time
Eimas et al (1971)
Babies get:
very excited when a change crosses the category boundary
somewhat excited when the change stays within the boundary
bored when there is no change
Eimas et al (1971)
Their conclusion: [voice] feature detection is innate
But similar things were shown for chinchillas (Kuhl and Miller, 78) and monkeys (Ramus et al)
So maybe this is a mammalian auditory system thing…
Sensitivity to distributions
Perceptual magnet effects (Kuhl, 91): tokens near prototypes are harder to discriminate
Perhaps a fancier term is “perceptual space warping”
Categorical perception vs. perceptual magnets
Gradient effects: Kuhl (91-95) tried hard to demonstrate that these are not experimental errors, but systematic
[Figure: reference stimulus and stimuli 1-4]
Categorical perception vs. perceptual magnets
PM is based on a notion of perceptual space:
higher dimensions
warping at non-category-boundary locations (around prototypes)
Kuhl et al (1992)
American English has /i/, but no /y/; Swedish has /y/, but no /i/
Test whether infants can detect differences between a prototype and its neighbors
[Figure: percent of trials on which prototype and neighbors are treated as the same]
Kuhl et al (92)’s interpretation
Linguistic experience changes perception of native and non-native sounds (also see Werker & Tees, 84)
Influence of input starts before any phonemic contrast is learned
Information is stored as prototypes
Modeling attempt: Guenther and Gjaja (96)
Claim: CP/PM is a matter of the auditory cortex, and does not have to involve higher-level cognition
Methodology: build a connectionist model inspired by neurons
Self-organizing map in Guenther and Gjaja
Input coding: for each (F1, F2) pair, use 4 input nodes
This seems to be an artifact of the mechanism (with regard to scaling), but it is standard practice in the field
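A minimal sketch of one way to implement this coding, assuming the common paired scheme in which each formant drives one node that rises and one that falls with its value (the formant ranges below are placeholders, not G&G's actual values):

```python
import numpy as np

# Hypothetical formant ranges in Hz -- placeholders for illustration only
F1_RANGE = (200.0, 1000.0)
F2_RANGE = (500.0, 3000.0)

def encode_formant(f, lo, hi):
    """Code one formant with two nodes: one rising, one falling with f."""
    s = (f - lo) / (hi - lo)          # normalize to [0, 1]
    return np.array([s, 1.0 - s])    # the pair sums to 1, easing scaling

def encode_input(f1, f2):
    """Map an (F1, F2) pair onto the 4 input nodes."""
    return np.concatenate([encode_formant(f1, *F1_RANGE),
                           encode_formant(f2, *F2_RANGE)])
```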
Self-organizing map in Guenther and Gjaja
[Figure: map architecture; m_j = activation of map cell j, lines = connection weights]
Self-organizing map: learning stage
m_j: activation
Connection weights: initialized to be random
Learning rule: update = learning rate × activation × mismatch, i.e. Δw_ij = α · m_j · (x_i − w_ij)
Self-organizing map: learning stage
Crucially, only update connections to the N most active cells (the “inhibitory connection”)
N goes down to 1 over time
Intuition: each cell selectively encodes a certain kind of input; otherwise the weights are pulled in different directions
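Putting the last two slides together, a minimal numpy sketch of the learning stage (the normalized-dot-product activation is one common choice for SOM-style models, not necessarily G&G's exact equation):

```python
import numpy as np

rng = np.random.default_rng(0)
N_CELLS, DIM = 500, 4                  # DIM matches the 4-node input coding

# Connection weights: initialized to be random
W = rng.uniform(0.0, 1.0, size=(N_CELLS, DIM))

def activations(x, W):
    """m_j: each map cell's response to input x (normalized dot product)."""
    return (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + 1e-12)

def train(samples, W, steps=20000, alpha=0.05, n0=20):
    """Learning stage: pull winners' weights toward each input by
    alpha * activation * mismatch; only the n most active cells update,
    and n shrinks to 1 over time, as in the slides."""
    for t in range(steps):
        x = samples[rng.integers(len(samples))]
        m = activations(x, W)
        n = max(1, round(n0 * (1 - t / steps)))   # N goes down to 1
        winners = np.argsort(m)[-n:]              # the n most active cells
        W[winners] += alpha * m[winners, None] * (x - W[winners])
    return W
```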
Self-organizing map: testing stage
Population vector: take a weighted sum of individual votes
Inspired by neural firing patterns in monkey reaching
How to compute F_vote? X+, X- must point in the same direction as Z+, Z-
Algebra exercise: derive a formula to calculate F_vote (homework Problem 1)
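For orientation only, since the homework asks for the actual derivation: the generic Georgopoulos-style readout is an activation-weighted average of each cell's preferred value. A sketch, reusing activations(), F1_RANGE, and F2_RANGE from the code above:

```python
def decode_formant(pair, lo, hi):
    """Invert the paired coding: the formant value a weight pair votes for."""
    s = pair[0] / (pair[0] + pair[1] + 1e-12)
    return lo + s * (hi - lo)

def population_vote(x, W):
    """F_vote: activation-weighted average of every cell's preferred (F1, F2)."""
    m = activations(x, W)
    votes = np.array([[decode_formant(w[:2], *F1_RANGE),
                       decode_formant(w[2:], *F2_RANGE)] for w in W])
    return (m @ votes) / m.sum()
```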
The distribution of F_vote after learning
[Figure: training stimuli and the resulting distribution of F_vote]
Recall: the activation m_j is what weights each cell’s vote
Results shown in the paper
[Figure: Kuhl’s data vs. their model]
Results shown in the paper
With some choice of parameters to fit the “percent generalization”
Local summary
Captures a type of pre-linguistic / pre-lexical learning from distributions (next time)
Distribution learning can happen in a short time (Maye & Gerken, 00)
G&G: no need to have categories in order to get PM effects; also see Lacerda (95)’s exemplar model
Levels of view: cognition or cortex? Maybe Guenther is right …
Categorization / discrimination training (Guenther et al, 99)
1. Space the stimuli according to JND (just-noticeable difference)
2. Test whether listeners can hear the distinctions in the control and training regions
3. TRAIN in some way (45 minutes!)
4. Test again, as in 2.
Stimuli: band-passed white noise
Guenther et al, 99
Categorization training:
Tell people that sounds from the training region belong to a “prototype”
Present them sounds from the training region and the band edges, and ask whether each is the prototype
Discrimination training:
People say “same” or “different” and get feedback; made harder by tighter spacing over time
Guenther et al, 99
Result of distribution learning depends on what people do with it
[Figure: results in the control vs. training regions]
Guenther’s neural story for this effect
Propose to modify G&G’s architecture to simulate the effects of the different training procedures
Did some brain scans to verify the hypothesis (Guenther et al, 04)
New model forthcoming? (A+ guaranteed if you turn one in)
The untold story of Guenther and Gjaja
Size of the model matters a lot
Larger models give more flexibility, but no free lunch: they take longer to learn, the learning rate parameter must be tuned, and the activation neighborhood must be tuned
[Figure: maps trained with 100 cells, 500 cells, 1500 cells]
Randomness in the outcome
[Figure: the nicest run, alongside some other runs]
A technical view of G&G’s model
“Self-organization”: a type of unsupervised learning
The model receives inputs with no information on which categories they belong to
Also referred to as “clustering” (next time)
“Emergent behavior”: spatial input patterns get coded as connection weights that realize the non-linear mapping
A technical view of G&G’s model
A non-parametric approximation to the non-linear input-percept mapping
Lots of parameters, but individually they do not correspond to categories
Will look at a parametric approach to clustering next time
Difficulties: tuning of parameters, rates of convergence, formal analysis (more later)
Other questions for G&G
If G&G is a model of the auditory cortex, then where do F1 and F2 come from? (see Adam’s presentation)
What if we expose G&G to noisy data?
How does it deal with speech segmentation? More next time.
Homework problems related to perceptual magnets
Problem 2: warping in one dimension predicted by perceptual magnets
Problem 3: program a simplified G&G-type model, or any other SOM, to simulate the effect shown above
[Figure: before / training data / after?]
Commercial: CogSci colloquium this year
Many big names in connectionism this semester
A digression into the history of neural nets
40’s -- study of neurons (Hebb, Hodgkin & Huxley)
50’s -- self-programming computers, biological inspirations
ADALINE from Stanford (Widrow & Hoff)
Perceptron (Rosenblatt, 60)
Perceptron
Input: x1, …, xn
Output: {0, 1}
Weights: w1, …, wn
Bias: a
How it works: output 1 if w1·x1 + … + wn·xn + a > 0, otherwise 0
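In code, the whole machine is a one-line sketch, with numpy standing in for the weighted sum:

```python
import numpy as np

def perceptron(x, w, a):
    """Fire (output 1) iff the weighted sum of inputs plus bias is positive."""
    return 1 if np.dot(w, x) + a > 0 else 0
```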
What can perceptrons do?
A linear machine: finding a plane that separates the 0 examples from the 1 examples
Perceptron learning rule
Start with random weights
Take an input and check the output; when there is an error (error = target − output):
weight = weight + α · input · error
bias = bias + α · error
Repeat until there are no more errors
Recall G&G’s learning rule
See demo
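A runnable sketch of the rule, trained on AND as a stand-in for the in-class demo (AND is linearly separable, so the loop terminates):

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, max_epochs=100, seed=0):
    """Perceptron learning rule; converges iff the data are linearly separable."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, X.shape[1])      # start with random weights
    a = rng.uniform(-1, 1)                  # and a random bias
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, y):
            out = 1 if np.dot(w, x) + a > 0 else 0
            err = target - out              # +1, 0, or -1
            if err:
                w += alpha * err * x        # weight = weight + alpha*input*error
                a += alpha * err            # bias   = bias   + alpha*error
                errors += 1
        if errors == 0:
            break                           # repeat until no more errors
    return w, a

# AND is linearly separable, so the learned plane separates 0s from 1s
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(train_perceptron(X, np.array([0, 0, 0, 1])))
```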
Criticism from traditional AI (Minsky & Papert, 69)
A simple perceptron is not capable of learning some simple boolean functions
More sophisticated machines are hard to understand
Example: XOR (see demo)
Revitalization of neural nets: connectionism
Minsky and Papert’s book stopped work on neural nets for 10 years …
But as computers got faster and cheaper:
CMU parallel distributed processing (PDP) group
UC San Diego Institute of Cognitive Science
Networks get complicated: more layers
A 3-layer perceptron can solve the XOR problem
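To see why, here is a hand-wired sketch (these particular weights are mine, not from the lecture): one hidden unit computes OR, the other AND, and the output fires for “OR but not AND”, which is XOR.

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

# Hand-picked weights -- one illustrative solution, not the unique one
W_hidden = np.array([[1.0, 1.0],     # hidden unit 1: OR
                     [1.0, 1.0]])    # hidden unit 2: AND
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -2.0])        # output: OR and not AND
b_out = -0.5

def xor_net(x):
    h = step(W_hidden @ x + b_hidden)
    return step(w_out @ h + b_out)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0
```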
Networks get complicated: combining non-linearities
The decision surface can be non-linear
Networks get complicated: more connections
Bi-directional connections, inhibitory connections, feedback loops
Key to the revitalization of connectionism
Back-propagation: an algorithm that can be used to train a variety of networks by propagating the error signal backward through the layers
Variants developed for other architectures
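A toy numpy sketch of back-propagation on the XOR task from the previous slide (squared error, sigmoid units; the hidden-layer size and constants are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR: the problem a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# One hidden layer of 4 sigmoid units
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

lr = 1.0
for _ in range(5000):
    # Forward pass
    h = sig(X @ W1 + b1)
    out = sig(h @ W2 + b2)
    # Backward pass: push the squared-error gradient back layer by layer
    d_out = (out - y) * out * (1 - out)      # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)       # error signal at the hidden layer
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # heads toward [0, 1, 1, 0]; some seeds stall in local minima
```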
Two kinds of learning with neural nets
Supervised:
Classification: patterns --> {1,…,N}
Adaptive control: control variables --> observed measurements
Time series prediction: past events --> future events
Unsupervised:
Clustering
Compression / feature extraction
Statistical insights from the neural nets
Overtraining / early stopping
Local minima (of the error)
Model complexity vs. training error tradeoff
Limits of learning: generalization bounds
Dimension reduction, feature selection
Many of these problems became standard topics in machine learning
Big question: neural nets and symbolic computing
Grammar: harmony theory (Smolensky, and later Optimality Theory)
Representation: can neural nets discover phonemes?
Next week
Reading will be posted on the course webpage