What do cows drink?
Symbolic AI
ISA(cow, mammal)
ISA(mammal, animal)
Rule 1: IF animal(X) AND thirsty(X) THEN lack_water(X)
Rule 2: IF lack_water(X) THEN drink_water(X)
Conclusion: Cows drink water.
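The inference chain above can be sketched as a tiny forward-chaining program. This is a hedged illustration only; the `facts` set and the `isa` and `what_do_they_drink` helpers are hypothetical names, not part of any real AI system.

```python
# Minimal sketch of the symbolic inference: ISA facts plus two
# IF-THEN rules, chained to answer "What do cows drink?"
facts = {("isa", "cow", "mammal"), ("isa", "mammal", "animal")}

def isa(x, y, facts):
    """True if x ISA y, following the transitive ISA chain."""
    if ("isa", x, y) in facts:
        return True
    return any(isa(z, y, facts) for (_, a, z) in facts if a == x)

def what_do_they_drink(x, facts, thirsty=True):
    # Rule 1: IF animal(X) AND thirsty(X) THEN lack_water(X)
    if not (isa(x, "animal", facts) and thirsty):
        return None
    # Rule 2: IF lack_water(X) THEN drink_water(X)
    return "water"

print(what_do_they_drink("cow", facts))  # water
```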
What do cows drink? Connectionism:
COW → MILK → DRINK (~100 ms)
This contrast illustrates what interests Connectionism versus what interests Symbolic AI.
These neurons are activated without the word “milk” ever having been heard.
Artificial Neural Networks
“Systems that are deliberately constructed to make use of some of the organizational principles that are felt to be used in the human brain.” (Anderson & Rosenfeld, 1990, Neurocomputing, p. xiii)
The Origin of Connectionist Networks: Major Dates
• William James (1892): the idea of a network of associations in the brain
• McCulloch & Pitts (1943, 1947): the “logical” neuron
• Hebb (1949): The Organization of Behavior: Hebbian learning and the formation of cell assemblies
• Hodgkin and Huxley (1952): description of the chemistry of neuron firing
• Rochester, Holland, Haibt, & Duda (1956): first real neural network computer model
• Rosenblatt (1958, 1962): the perceptron
• Minsky and Papert (1969): bring the walls down on perceptrons
• Hopfield (1982, 1984): the Hopfield network, settling to an attractor
• Kohonen (1982): unsupervised learning network
• Rumelhart & McClelland and the PDP Research Group (1986): backpropagation, etc.
• Elman (1990): the simple recurrent network
• Hinton (1980–present): just about everything else...
McCulloch & Pitts (1943, 1947)
[Figure: a McCulloch & Pitts neuron with binary inputs, a threshold T, and a binary output.]
The real neuron was far, far more complex, but they felt that they had captured its essence. Neurons were the biological equivalent of logic gates.
Conclusion: Collections of neurons, appropriately wired together, can do logical calculus. Cognition is just a complex logical calculus.
[Figure: the McCulloch & Pitts representation of the “essential” neuron as a logic gate (here an AND gate), with binary inputs and a single output.]
Hebb (1949): Connecting changes in neurons to cognition
Hebb asked: What changes at the neuronal level might make possible our acquisition of high-level (semantic) information? His answer: Learning rule of synaptic reinforcement (Hebbian learning).
When neuron A fires and is followed immediately by the firing of neuron B, the synapse between the two neurons is strengthened, i.e., the next time A fires, it will be easier for B to fire.
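The rule above can be sketched in a few lines. This is a hedged illustration; the learning rate of 0.1 and the activity values are arbitrary choices, and `hebbian_update` is a hypothetical helper, not Hebb's own formulation.

```python
# Sketch of Hebbian learning: when pre- and post-synaptic neurons
# are active together, the weight between them is strengthened.
def hebbian_update(w, pre, post, lr=0.1):
    """Return the new synaptic weight after one Hebbian step."""
    return w + lr * pre * post

w = 0.2
# A fires (1) and B fires (1): the A->B synapse is strengthened by 0.1.
w = hebbian_update(w, pre=1.0, post=1.0)
# A fires but B stays silent (0): the weight is unchanged.
w = hebbian_update(w, pre=1.0, post=0.0)
```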
Connecting neural function to behavior
[Figure: levels of modeling, from high-level models of human cognition and behavior, through neuronal population coding models, down to low-level models of single neurons and even lower-level models of synapses and ion channels. The “Hebbian Gap” separates the high-level models from the low-level ones.]
Cell assemblies: Closing the Hebbian Gap
Cell assemblies at the neuronal level give rise to categories at the semantic level.
The formation of cell assemblies involves:
• persistence of activity without external input
• overlap: cell assemblies can overlap, e.g., the cell assembly associated with “dog” will overlap with those associated with “wolf”, “cat”, etc.
• recruitment: creation of a new cell assembly (via Hebbian learning) corresponding to a new concept
• fractionation: creation of new cell assemblies from an old one, corresponding to the refinement of a concept
A Hebbian Cell Assembly
By means of the Hebbian Learning Rule, a circuit of continuously firing neurons could be learned by the network.
The continuing activation in this cell assembly does not require external input.
The activation of the neurons in this circuit would correspond to the perception of a concept.
Rochester, Holland, Haibt, & Duda (1956)
• First real simulation that attempted to implement the principles outlined by Hebb in real computer hardware
• Attempted to simulate the emergence of cell assemblies in a small network of 69 neurons. They found that everything became active in their network.
• They decided that they needed to include inhibitory synapses. (Hebb only discussed excitatory synapses). This worked and cell assemblies did, indeed, form.
• Probably the earliest example in neural network modeling of a network which made a prediction (i.e., inhibitory synapses are needed to form cell assemblies), that was later confirmed in real brain circuitry.
Rosenblatt (1958, 1962): The Perceptron
• Rosenblatt’s perceptron could learn to associate inputs with outputs.
• He believed this was how the visual system learned to associate low-level visual input with higher level concepts.
• He introduced a learning rule (weight-change algorithm) that allowed the perceptron to learn associations.
The elementary perceptron
Consists of:
• two layers of nodes (one layer of weights)
• only feedforward connections
• a threshold function on each output unit
• a linear summation of the weights times inputs
[Figure: an elementary perceptron with inputs x1, x2, weights w1, w2, output y (the actual output), a desired output (“teacher”) t, and Threshold = T.]

IF Σi wi xi ≥ Threshold THEN y = 1, ELSE y = 0
The perceptron (Widrow-Hoff) learning rule (weight-change rule) is:

w_new = w_old + η(t − y)x

where η is the learning constant, 0 < η ≤ 1.

[Figure: a perceptron whose inputs xi are the pixels of an image of two crossed lines, connected by weights wi to an output node for the character “X”.]
This perceptron learns to associate the visual input of two crossed straight lines with the character “X”. In other words, the output of the network will be the character “X”.
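The learning rule can be sketched on a simpler, linearly separable task. This is a hedged illustration, not Rosenblatt's implementation: the AND function stands in for the crossed-lines task, and the learning constant, threshold, and zero initial weights are arbitrary choices.

```python
# Sketch of the Widrow-Hoff rule: w_new = w_old + eta * (t - y) * x,
# trained here on the (linearly separable) AND function.
def step(total, threshold=0.5):
    return 1 if total >= threshold else 0

def train(patterns, eta=0.1, epochs=25):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, t in patterns:
            y = step(w[0] * x[0] + w[1] * x[1])
            # change each weight in proportion to the error (t - y)
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train(AND)
print([step(w[0] * x[0] + w[1] * x[1]) for x, _ in AND])  # [0, 0, 0, 1]
```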
Generalization
[Figure: the same perceptron (inputs xi, weights wi, output “X”) presented with a degraded image of an “X”.]
The real image in the world is degraded, but if the network has already learned to correctly identify the original complete “X”, it will recognize the degraded X as being an “X”.
Fundamental limitations of the perceptron
Minsky & Papert (1969) showed that the Rosenblatt two-layer perceptron had some fundamental limitations: They could only classify linearly separable sets.
[Figure: two scatter plots of X’s and Y’s. In the first (“This”), the X’s and Y’s can be separated by a straight line; in the second (“But not this”), no straight line separates them.]
The (infamous) XOR problem
• Minsky and Papert showed there were a number of extremely simple patterns that no perceptron could learn, including the logic function XOR.
• Since cognition supposedly required elementary logical operations, this severely weakened the perceptron’s claim to be able to do general cognition.
Input      Output
x1  x2     y
0   0      0
0   1      1
1   0      1
1   1      0
There is no set of weights w1 and w2 and a threshold T such that the perceptron below can learn the above XOR function.
[Figure: a two-input perceptron labeled XOR, with inputs x1, x2, weights w1, w2, Threshold = T, actual output y, and desired output (“teacher”) t.]
The activation arriving at the output node is w1x1 + w2x2. If w1x1 + w2x2 ≥ T, then we output 1, otherwise 0. But w1x1 + w2x2 = T is a straight line if we consider x1 and x2 to be the axes of a coordinate system.
[Figure: the four XOR points in the (x1, x2) plane: (0,1) and (1,0) must output 1, while (0,0) and (1,1) must output 0.]
No values of w1, w2, and T will form a straight line w1x1 + w2x2 = T with (0,1) and (1,0) on one side and (0,0) and (1,1) on the other. NO!
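The claim can be checked numerically. This grid search is an illustration, not a proof (only a geometric argument rules out all real-valued weights); the sampled range and step size are arbitrary.

```python
# Brute-force check: no (w1, w2, T) on a coarse grid lets a single
# threshold unit compute XOR.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def solves_xor(w1, w2, T):
    return all((1 if w1 * x1 + w2 * x2 >= T else 0) == t
               for (x1, x2), t in XOR)

vals = [v / 10 for v in range(-20, 21)]   # -2.0 .. 2.0 in steps of 0.1
found = any(solves_xor(w1, w2, T)
            for w1 in vals for w2 in vals for T in vals)
print(found)  # False
```

The geometric proof is short: outputting 0 at (0,0) forces T > 0; outputting 1 at (0,1) and (1,0) forces w1 ≥ T and w2 ≥ T; but then w1 + w2 ≥ 2T > T, so (1,1) would also output 1, contradicting XOR.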
The Revival of the (Multi-layered) Perceptron: The Connectionist Revolution (1985) and the Statistical Nature of Cognition
By the early 1980s Symbolic AI had hit a wall. “Simple” tasks that humans do (almost) effortlessly (face, word, and speech recognition; retrieving information from incomplete cues; generalizing; etc.) proved to be notoriously hard for symbolic AI.
• Minsky (1967): “Within a generation the problem of creating ‘artificial intelligence’ will be substantially solved.”
• Minsky (1982): “The AI problem is one of the hardest ever undertaken by science.”
By the early 1980s the statistical nature of much of cognition had become ever more apparent.
Three factors contributed to the revival of the perceptron:
• the radical failure of AI to achieve the goals announced in the 1960s
• the growing awareness of the statistical and “fuzzy” nature of cognition
• the development of improved perceptrons, capable of overcoming the linear separability problems brought to light by Minsky & Papert.
Advantages of Connectionist Models compared to Symbolic AI
• Learning: Specifically designed to learn.
• Pattern completion of familiar patterns.
• Generalization: Can generalize to novel patterns based on previously learned patterns.
• Retrieval with partial information: Can retrieve information in memory based on nearly any attribute of the representation.
• Massive parallelism.
• 100-step processing constraint (Feldman & Ballard, 1982): Neural hardware is too slow and too unreliable for sequential models of processing. We can do very complex processing in a few hundred ms, but transmission across a synapse (~10⁻⁶ in.) takes about 1 ms. Thus, complex tasks must be accomplished in no more than a few hundred serial steps, which is impossible for conventional sequential models.
• Graceful degradation: when they are damaged, their performance degrades gradually.
Real Brains and Connectionist Networks
Some characteristics of real brains that serve as the basis of ANN design:
• Neurons receive input from many other neurons.
• Massive parallelism: neurons are slow, but there are lots of them.
• Learning involves modifying the strength of synaptic connections.
• Neurons communicate with one another via activation or inhibition.
• Connections in the brain have a clear geometric and topological structure.
• Information is continuously available to the brain.
• Graceful degradation of performance in the face of damage and information overload.
• Control is distributed, not central (i.e., there is no central executive).
• One primary way of understanding what the brain does is relaxation to attractors.
General principles of all connectionist networks
• a set of processing units
• a state of activation defined over all of the units
• an output function (“squashing function”) for each unit: transforms unit activation into outgoing activation
• a connectivity pattern with two features:
  - weights of the connections
  - locations of the connections
• an activation rule for combining inputs impinging on a unit to produce a total activation for the unit
• a learning rule, by which the connectivity pattern is changed
• an environment in which the system operates (i.e., how the i/o is represented and given to/taken from the system)
Knowledge storage and Learning
• Knowledge storage: Knowledge is stored exclusively in the pattern of strengths of the connections (weights) between units. The network stores multiple patterns in the SAME set of connections.
• Learning: The system learns by automatically adjusting the strengths of these weights as it receives information from its environment.
There are no high-level rules programmed into the system. Because all patterns are stored in the same set of connections, generalization, graceful degradation, etc. are relatively easy in connectionist networks. It is also what makes planning, logic, etc. so hard.
Two major classes of networks
• Supervised: Includes all error-driven learning algorithms. The error between the desired output and the actual output determines how to change the weights. This error is gradually decreased by the learning algorithm.
• Unsupervised: There is no error feedback signal. The network automatically clusters the input into categories.
Example: if the network is presented with 100 patterns, half of which are different kinds of ellipses and half of which are different types of rectangles, it would automatically group these patterns into the two appropriate categories. There is no feedback to tell the network explicitly “this is a rectangle” or “this is an ellipse.”
So, how did they solve the problem of linear separability?
ANSWER:
i) By adding another “hidden” layer to the perceptron between the input and output layers,
ii) introducing a differentiable squashing function and
iii) discovering a new learning rule (the “generalized delta rule”)
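Fix (ii) can be made concrete. This is a hedged sketch assuming the logistic sigmoid as the squashing function (other differentiable functions work too); its smooth derivative is what makes a gradient-based learning rule possible where the hard threshold was not differentiable.

```python
import math

# The logistic sigmoid: a differentiable "squashing" function that
# maps any real activation into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # the derivative can be written in terms of the sigmoid itself
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_deriv(0.0))  # 0.25
```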
“Concurrent” learning
Learning a series of patterns: if each pattern in the series is learned to criterion (i.e., completely) before moving on to the next, the learning of the new patterns will erase the learning of the previously learned patterns. This is why concurrent learning must be used; otherwise, catastrophic forgetting may occur.
1 epoch:
- 1st pattern presented to the network: change its weights a little to reduce the error on that pattern
- 2nd pattern: change its weights a little to reduce the error on that pattern
- etc.
- last pattern: change its weights a little to reduce the error on that pattern
- REPEAT until the error for all patterns is below criterion
Concurrent learning
Backpropagation
[Figure: a three-layer backpropagation network: an input layer (nodes subscripted with k) receives input from the environment; a hidden layer (nodes subscripted with j) forms the hidden-layer representation; an output layer (nodes subscripted with i) produces the actual output, which is compared with the desired output (“teacher”). The error is propagated back through the weights wij and wjk.]
Training of a backpropagation network:
i) Feedforward activation pass with activation “squashed” at the hidden layer.
ii) The output is compared with the desired output (= error signal).
iii) This error signal is “backpropagated” through the network to change the network’s weights (with gradient descent).
iv) When the overall error is below a predefined criterion, learning stops.
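The four steps can be sketched for a tiny 2-2-1 network on XOR. This is a hedged, minimal illustration: the layer sizes, learning rate, epoch count, and weight initialization are arbitrary choices, and convergence on any particular run is not guaranteed.

```python
import math, random

random.seed(0)
sig = lambda x: 1.0 / (1.0 + math.exp(-x))
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden weights (incl. bias)
W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output weights (incl. bias)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x):
    # i) feedforward pass, activation squashed at the hidden layer
    h = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
    y = sig(W2[0] * h[0] + W2[1] * h[1] + W2[2])
    return h, y

def epoch(lr=0.5):
    err = 0.0
    for x, t in data:
        h, y = forward(x)
        err += (t - y) ** 2                        # ii) error signal
        d_out = (t - y) * y * (1 - y)              # iii) backpropagate the error
        d_hid = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
        for j in range(2):
            W2[j] += lr * d_out * h[j]
            for k in range(2):
                W1[j][k] += lr * d_hid[j] * x[k]
            W1[j][2] += lr * d_hid[j]
        W2[2] += lr * d_out
    return err

last = epoch()
for _ in range(2000):       # iv) in practice, repeat until error < criterion
    last = epoch()
```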
...but they also suffer from catastrophic interference.
[Figure: two panels plotting proportion correct (0 to 1.0) on the A-B and A-C lists against learning trials/epochs on the A-C list. Humans: performance on the A-B list declines gradually over learning trials on the A-C list. Backpropagation networks: performance on the A-B list collapses almost immediately once training on the A-C list begins.]
... but they have trouble learning sequences.
Much of our cognition involves learning sequences of patterns. Standard BP networks are fine for learning input-output patterns, but they cannot be used effectively to learn sequences of patterns.
Consider the sequence: A B C D E F G H I. For this sequence we could train a network to associate the following input-output pairs:
A→B, B→C, C→D, D→E, E→F, F→G, G→H, H→I
If we give the network A as its “seed”, it would produce B on output,
which we would feed back into the network to produce C on output, and so on. Thus, we could reproduce the original sequence.
But what about context-dependent sequences?
But what if the sequence were:
A B C D E F C H I. Here C is repeated. The technique above would give the pairs:
A→B, B→C, C→D, D→E, E→F, F→C, C→H, H→I
But the network could not learn this sequence since it has no context to distinguish the two different outputs associated with C (for the first occurrence, D; for the second, H).
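The failure can be made concrete with a lookup table standing in for the pair-associator (a hedged simplification: a real network learns a smooth mapping, but the one-successor-per-input limitation is the same).

```python
# Why a plain pair-associator fails on A B C D E F C H I:
# a one-letter lookup can hold only one successor per letter,
# so the second occurrence of C overwrites C->D with C->H.
seq = list("ABCDEFCHI")
table = {}
for cur, nxt in zip(seq, seq[1:]):
    table[cur] = nxt          # later pairs overwrite earlier ones

out = ["A"]                   # seed the sequence with A
while out[-1] in table and len(out) < len(seq):
    out.append(table[out[-1]])
print("".join(out))  # ABCHI -- not the original sequence
```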
A “sliding window” solution
Consider a “sliding window” solution to provide the context. Instead of having the network learn single-letter inputs, it will learn two-letter inputs, thus:
AB→C, BC→D, CD→E, DE→F, EF→G, FG→H, GH→I
Now the network is fed AB (here, “A” serves as “context” for “B”) as its seed and it can reproduce the sequence with the repeated C without difficulty. But what if we needed more than one letter’s worth of context, as in a sequence like this:
A B C D E B C H I
Now the network needs another context letter... and so on. Conclusion: the sliding-window technique doesn’t work in general.
Elman’s solution (1990): The Simple Recurrent Network
[Figure: the SRN architecture: input units and context units feed into the hidden units, which feed the output units; at each time step, the hidden-unit activations are copied back into the context units.]
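The key architectural move (the copy of the hidden state into context units) can be sketched structurally. This is a hedged illustration: the layer sizes, the tanh squashing, and the weight values are arbitrary, and `srn_step` is a hypothetical helper, not Elman's code.

```python
import math

def srn_step(x, context, W_in, W_ctx):
    # hidden activation combines the current input with the context
    # units (a copy of the previous hidden state)
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row_in, x)) +
                        sum(wc * ci for wc, ci in zip(row_ctx, context)))
              for row_in, row_ctx in zip(W_in, W_ctx)]
    return hidden, hidden[:]   # new hidden state, and its copy for the next step

W_in  = [[0.5, -0.3], [0.1, 0.4]]   # 2 input units -> 2 hidden units
W_ctx = [[0.2, 0.0],  [0.0, 0.2]]   # 2 context units -> 2 hidden units
context = [0.0, 0.0]
for x in [[1, 0], [0, 1]]:          # present a two-step input sequence
    hidden, context = srn_step(x, context, W_in, W_ctx)
```

Because the context units carry the previous hidden state forward, the same input letter can produce different hidden states (and thus different outputs) depending on what came before it, which is exactly what the repeated-C sequence requires.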
SRN Bilingual language learning (French, 1998; French & Jacquet, 2004)
BOY LIFTS TOY MAN SEES PEN GIRL PUSHES BALL BOY PUSHES BOOK FEMME SOULEVE STYLO FILLE PREND STYLO GARÇON TOUCHE LIVRE FEMME POUSSE BALLON FILLE SOULEVE JOUET WOMAN PUSHES TOY.... (Note: absence of markers between sentences and between languages.)
Input to the SRN:
- Two “micro” languages, Alpha & Beta, 12 words each
- An SVO grammar for each language
- Unpredictable language switching
The network tries each time to predict the next element.
We do a cluster analysis of its internal (hidden-unit) representations after having seen 20,000 sentences.
Attempted Prediction
Clustering of the internal representations formed by the SRN
[Figure: hierarchical clustering of the SRN’s hidden-unit representations. The words of the two micro-languages, Alpha and Beta, separate into two main clusters: one containing LIVRE, STYLO, BALLON, JOUET, VOIT, PREND, POUSSE, SOULEVE, HOMME, FEMME, FILLE, GARCON, and the other containing BOOK, PEN, BALL, TOY, PUSHES, TAKES, SEES, LIFTS, WOMAN, MAN, GIRL, BOY.]
N.B. It also works for micro languages with 768 words each
Unsupervised learning: Kohonen networks
Kohonen networks cluster inputs in an unsupervised manner. There are no activation-spreading or summing processes here: Kohonen networks adjust weight vectors to match input vectors.
[Figure: a Kohonen network with a six-node input layer (nodes 1–6) fully connected by weights (e.g., w11, w12, w52, w62) to two output nodes.]
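The weight-matching idea can be sketched with a single winner-take-all step. This is a hedged simplification (a full Kohonen map also updates the winner's neighbors on a grid, omitted here); the sizes, initial weights, and learning rate are arbitrary.

```python
# One step of competitive learning: the output node whose weight
# vector is closest to the input wins, and its vector is nudged
# toward the input.
def kohonen_step(weights, x, lr=0.5):
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    win = dists.index(min(dists))          # winner-take-all
    weights[win] = [wi + lr * (xi - wi) for wi, xi in zip(weights[win], x)]
    return win

weights = [[0.0, 0.0], [1.0, 1.0]]   # two output nodes, 2-D weight vectors
win = kohonen_step(weights, [0.9, 1.0])
print(win)  # 1 -- node 1 was closest and moves toward the input
```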
The next frontier...
Computational neuroscience: using spiking neurons, and variables such as their connection density and their firing timing and synchrony, to better understand human cognitive functions.
We are almost at a point where the population dynamics of large networks of these kinds of simulated neurons can realistically be studied.
Further in the future, neuronal models with Hodgkin-Huxley equations of membrane potentials and neuronal firing will be incorporated into our computational models of cognition.
Ultimately...
Gradually, neural network models and the computers they run on will become good enough to give us a deep understanding of neurophysiological processes and their behavioral counterparts and to make precise predictions about them.
They will be used to study epilepsy, Alzheimer’s disease, and the effects of various kinds of stroke, without requiring the presence of human patients.
They will be, in short, like the models used in all of the other hard sciences. Neural modeling and neurobiology will then have achieved a truly symbiotic relationship.