Mathematics for Artificial Intelligence - Reading Course. Elena Agliari, Dipartimento di Matematica, Sapienza Università di Roma
(TENTATIVE) PLAN OF THE COURSE
Introduction
Chapter 1: Basics of statistical mechanics; the Curie-Weiss model
Chapter 2: Neural networks for associative memory and pattern recognition
Chapter 3: The Hopfield model: low-load regime and solution via log-constrained entropy; self-averaging, spurious states, phase diagram; high-load regime and solution via stochastic stability
Chapter 4: Beyond the Hebbian paradigm
Chapter 5: A gentle introduction to machine learning
Chapter 7: A few remarks on deep learning, "complex" patterns, and outlooks; multilayered Boltzmann machines and deep learning; mapping restricted Boltzmann machines and Hopfield networks
Seminars: Numerical tools for machine learning; non-mean-field neural networks; (bio-)logic gates; maximum entropy approach; Hamilton-Jacobi techniques for mean-field models; …
The machine starts from scratch; it takes 24 h of training. Input layer: the current screen; output layer: the button to push. When nothing changes for too long, the program is stopped and started again.
A skilled player…
The neuronal interaction from an electrical perspective
A neuron transports its information by way of a nerve impulse called an action potential.
When an action potential arrives at the synapse's presynaptic terminal button, it
may stimulate the release of neurotransmitters. These are released into the synaptic cleft to bind onto the receptors
of the postsynaptic membrane and influence another cell, either in an inhibitory or
excitatory way.
Neurons interact at contact points called synapses: a junction between two nerve cells, consisting of a minute gap across which impulses pass by diffusion of a neurotransmitter.
The arrival of an action potential causes synaptic vesicles to release neurotransmitter molecules. These molecules diffuse from the presynaptic terminal across the synaptic cleft and bind to their receptor sites on the ligand-gated sodium ion (Na+) channels. This causes the ligand-gated sodium ion channels to open, and sodium ions diffuse into the cell, making the membrane potential more positive. If the membrane potential reaches the threshold level, an action potential will be produced.
There exist different kinds of neurotransmitters, each associated with different functions and possible pathologies
This model, also known as a spiking neuron model, is a mathematical description of the properties of neurons (and other cells in the nervous system) that generate sharp electrical potentials.
Biological neuron model
Biological neuron models aim to explain the mechanisms underlying the operation of the nervous system for the purpose of restoring lost control capabilities. Unlike "artificial neuron" models, biological neuron models allow experimental validation and the use of physical units to describe the experimental procedure associated with the model predictions.
As for the relationship between neuronal membrane currents at the input stage and membrane voltage at the output stage, the most extensive experimental inquiry was made by Hodgkin and Huxley in the early 1950s, using an experimental setup that punctured the cell membrane and allowed them to impose a specific membrane voltage/current (Nobel Prize in Physiology or Medicine, 1963).
It models each neuron as a leaky capacitor with membrane resistance Rm, membrane capacitance Cm and resting potential EL. Below the action potential threshold, the voltage of this capacitor decays (or “leaks”) to the resting level EL:
Integrate-and-fire model
Indeed, the exact shape of the action potential does not matter here: since all action potentials sent down the axon are to a good approximation identical, the only informative feature of a neuron’s spiking is the times at which the action potentials occur.
where I is the injection current. Realistic values for the parameters are EL=-70 mV, Rm=10 MΩ, and Cm= 50 μF, V(t=0)=EL. To model the spiking of the neuron when it reaches threshold, one assumes that when the membrane potential reaches Vth=-55 mV, the neuron fires a spike and then resets its membrane potential to Vreset=-75 mV.
Cm dVm(t)/dt = (EL − Vm(t))/Rm + I
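A minimal numerical sketch of this dynamics (forward Euler with the threshold-and-reset rule; the injected current, time step and the nF-scale capacitance below are illustrative choices, since the quoted 50 μF would give a membrane time constant Rm·Cm of about 500 s):

```python
import numpy as np

# Parameters from the text (volts, ohms, farads, amps)
E_L, R_m = -70e-3, 10e6
V_th, V_reset = -55e-3, -75e-3
C_m = 1e-9          # illustrative nF-scale value (the quoted 50 uF gives R_m*C_m = 500 s)
I = 2e-9            # injected current, chosen so that E_L + R_m*I exceeds V_th

dt, T = 1e-5, 1.0   # integration step and total simulated time [s]
V, spike_times = E_L, []
for step in range(int(T / dt)):
    # forward-Euler step of  C_m dV/dt = (E_L - V)/R_m + I
    V += dt / C_m * ((E_L - V) / R_m + I)
    if V >= V_th:                       # threshold reached: register a spike, reset
        spike_times.append(step * dt)
        V = V_reset
print(f"{len(spike_times)} spikes in {T:.0f} s")
```

With these values the neuron fires regularly at a few tens of hertz; setting I below (Vth − EL)/Rm = 1.5 nA suppresses firing altogether, which is the behaviour summarized in the figure caption below.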
(a) Leaky integrate-and-fire neuron circuit model. (b) For input I < Ith, Vm(t) never exceeds Vth, hence the neuron never spikes; for I ≥ Ith, the neuron fires when Vm(t) ≥ Vth and is immediately reset, i.e. Vm(t) = EL. (c) With higher input (I ≥ Ith) the firing rate (frequency) increases, like in a biological neuron, while for low input (I < Ith) the frequency is zero. The output frequency fO versus input is the signature neuronal function to be mimicked artificially.
Analytical insight into the firing activity of the noisy neuron: estimate of the spike density, role of the topology, mean time taken to reach an absorbing boundary, etc.
The overall input current to a neuron is assumed to be a Poissonian process: NI,E = number of active (inhibitory/excitatory) synapses connected to the neuron, λI,E = their firing rates, wI,E = magnitude of the inputs.
HC Tuckwell, Introduction to theoretical Neurobiology, (Cambridge University Press, Cambridge, 1988). HC Tuckwell, Stochastic Processes in the Neurosciences, CBMS-NSF Conference Series in App. Math. (1989).
A. Schematic illustration for the network model: individual cells are connected via excitatory (red) and inhibitory (blue) synaptic connections. B. Synaptic connectivity matrix. Weights are randomly distributed around a mean value g=−10mV/Hz. C. Sample network activity. D. Power spectral density of the network mean activity. 
A. Hutt, A. Mierau, J. Lefebvre, PLoS ONE (2016)
Stein’s model
Ornstein-Uhlenbeck (OU) processes
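Only the setup is sketched in the slides; the following is a minimal simulation of Stein's model under the assumptions above (excitatory and inhibitory Poisson input trains; every numerical value below is an illustrative choice, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 20e-3                    # membrane time constant [s]
N_E, N_I = 100, 25             # active excitatory / inhibitory synapses
lam_E, lam_I = 20.0, 20.0      # firing rate per synapse [Hz]
w_E, w_I = 0.5e-3, 1.0e-3      # jump amplitudes [V]
V_th = 15e-3                   # firing threshold, measured from rest [V]

dt, T_tot = 1e-4, 5.0
V, n_spikes = 0.0, 0
for _ in range(int(T_tot / dt)):
    # Poisson numbers of excitatory / inhibitory arrivals in this time bin
    k_E = rng.poisson(N_E * lam_E * dt)
    k_I = rng.poisson(N_I * lam_I * dt)
    # Stein's model: leaky decay plus instantaneous synaptic jumps
    V += -V / tau * dt + w_E * k_E - w_I * k_I
    if V >= V_th:              # fire and reset
        n_spikes += 1
        V = 0.0
print(f"mean firing rate ~ {n_spikes / T_tot:.1f} Hz")
```

In the limit of many small jumps this jump process reduces to the Ornstein-Uhlenbeck diffusion mentioned above.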
One of the central goals of research in neuroscience is to understand how the biophysical properties of neurons and neuronal organization combine to provide such impressive computing power and speed. An understanding of biological computation may also lead to solutions for related problems in robotics and data processing using non-biological hardware and software.
Conventional silicon integrated circuits: each logic gate typically obtains inputs from two or three others, and a huge number of independent binary decisions are made in the course of a computation.
Neural computation circuits: each non-linear neural processor (neuron) gets input from hundreds or thousands of others, and a collective solution is computed on the basis of the simultaneous interaction of thousands of devices.
Each amplifier j has an input resistor ρj leading to a reference ground and an input capacitor Cj.
Amplifiers have sigmoid, monotonic input-output relations. The function Vj = gj(uj) characterizes this input-output relation: it gives the output voltage Vj of amplifier j due to an input voltage uj.
The processing elements, or "neurons", are modeled as amplifiers in conjunction with feedback circuits comprised of wires, resistors and capacitors, organized so as to model the most basic computational features of neurons, i.e., axons, dendritic arborization, and synapses connecting different neurons.
In order to provide for both excitatory and inhibitory synaptic connections between neurons, each amplifier is given two outputs, a normal (+) output and an inverted (-) output
A synapse between two neurons is defined by a conductance Tij which connects one of the two outputs of amplifier j to the input of amplifier i. This connection is made with a resistor of value Rij = 1/|Tij|. If the synapse is excitatory (Tij > 0), this resistor is connected to the normal (+) output of amplifier j, and vice versa.
The net input current to any neuron i (and hence the input voltage ui) is the sum of the currents flowing through the set of resistors connecting its input to the outputs of the other neurons.
The circuit also includes an externally supplied input current Ii for each neuron. These inputs can be used to set the general level of excitability of the network through constant biases, which effectively shift the input-output relation along the ui axis.
The equations describing the time evolution of this circuit are
Ci dui(t)/dt = ∑j=1..N Tij Vj − ui/Ri + Ii,   with Vj = gj(uj)
where Ri is the parallel combination of ρi and the Rij:
1/Ri = 1/ρi + ∑j=1..N 1/Rij
For simplicity, set Ci = C, i.e., independent of i (but this is not necessary). Posing Tij → Tij/C and Ii → Ii/C, the equations become
dui(t)/dt = ∑j=1..N Tij Vj − ui/(Ri C) + Ii
For an "initial-value" problem, this equation provides a full description of the time evolution of the state of the circuit. Integration of this equation allows any hypothetical network to be simulated.
For a network with symmetric connections (Tij = Tji) these equations always lead to a convergence to stable states, in which the outputs of all neurons remain constant (Hopfield, 1984). Also, when the width of the amplifier gain curve g(u) is narrow - the high-gain limit - the stable states of a network comprised of N neurons are the local minima of the quantity
The state space over which the circuit operates is the interior of the N-dimensional hypercube defined by 0 ≤ Vi ≤ 1. However, in the high-gain limit the minima only occur at corners of this space → the stable states correspond to those locations in the discrete space consisting of the 2^N corners of this hypercube which minimize the cost function E.
E = −1/2 ∑i,j=1..N Tij Vi Vj − ∑i=1..N Vi Ii
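A small numerical illustration of these statements: random symmetric couplings, a steep sigmoid standing in for the high-gain limit, forward-Euler integration of the circuit equation above, and E monitored along the trajectory (all sizes and parameter values below are arbitrary choices; in the high-gain limit the neglected integral term of the full Lyapunov function is small, so E should settle at a local minimum with the Vi close to 0 or 1):

```python
import numpy as np

rng = np.random.default_rng(1)
N, gain = 16, 50.0                     # steep sigmoid ~ high-gain limit

T = rng.normal(size=(N, N))            # random symmetric couplings, zero diagonal
T = (T + T.T) / 2
np.fill_diagonal(T, 0.0)
I = rng.normal(scale=0.1, size=N)      # external bias currents
R, C = 1.0, 1.0                        # input resistance and capacitance (same for all units)

def g(u):
    """Sigmoid input-output relation mapping u onto (0, 1)."""
    return 0.5 * (1.0 + np.tanh(gain * u))

def energy(V):
    """E = -1/2 sum_ij T_ij V_i V_j - sum_i V_i I_i"""
    return -0.5 * V @ T @ V - V @ I

u = rng.normal(scale=0.01, size=N)
dt = 1e-3
for step in range(20001):
    V = g(u)
    # C du_i/dt = sum_j T_ij V_j - u_i/R + I_i
    u += dt / C * (T @ V - u / R + I)
    if step % 5000 == 0:
        print(f"t = {step * dt:5.1f}   E = {energy(V):+.3f}")
print("final V:", np.round(g(u), 2))    # close to a corner of the hypercube
```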
A. Energy-terrain contour map for the flow map shown in B. B. Typical flow map of neural dynamics for the circuit considered, with symmetric connections (Tij = Tji) C. More complicated dynamics that can occur for unrestricted Tij. Limit cycles are possible.
high-gain limit
E. Agliari, A. Barra, L. Dello Schiavo, A. Moro, Complete integrability of information processing by biochemical reactions, Sci. Rep. (2016)
E. Agliari et al., Notes on stochastic (bio)-logic gates: the role of allosteric cooperativity, Sci. Rep. (2015)
E. Agliari et al., Collective behaviours: from biochemical kinetics to electronic circuits, Sci. Rep. (2013)
From the '60s to the '90s, "universality" has been a keyword in the statistical-mechanics (SM) literature on phase transitions, meant to highlight the robust, structural analogies that several (very disparate) systems share "close to criticality". In recent years, with the extension of the applicability range of SM (covering subjects as widespread as biological networks, economic problems, material sciences, etc.), we are discovering a novel class of universal behaviors: the main patterns through which systems process information seem to be very similar.
“Universality reloaded”
b) ferromagnet: external field → magnetization (self-consistency)
c) cortical neuron: afferent current → spike intensity (response function)
d) chemical reaction
… towards (bio)-logical stochastic computation…
J as a black box storing information. Let us consider a neural network made of N = 4 neurons and P = 2 patterns given by
ξ1 = (-1, +1, +1, -1) ξ2 = (+1, -1, +1, -1)
and, recalling Jij = ∑μ ξiμ ξjμ, we get (neglecting normalization)
that is, the Hebbian matrix looks like
This corresponds to two flip-flops connected by an AND gate: J is storing information in terms of constraints (this is why the compression P < N is possible).
Clause 1: σ1 misaligned (i.e., anti-correlated) with σ2 AND
Clause 2: σ3 misaligned (i.e., anti-correlated) with σ4
J = [ J11 J12 J13 J14 ; J21 J22 J23 J24 ; J31 J32 J33 J34 ; J41 J42 J43 J44 ]
  = [ 0 −1 0 0 ; −1 0 0 0 ; 0 0 0 −1 ; 0 0 −1 0 ]
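This matrix can be checked directly; a quick sketch (the rescaling by P in the last line is only there to match the ±1 entries displayed above, since the text neglects normalization):

```python
import numpy as np

xi = np.array([[-1, +1, +1, -1],     # pattern xi^1
               [+1, -1, +1, -1]])    # pattern xi^2
P = xi.shape[0]

# Hebbian prescription J_ij = sum_mu xi_i^mu xi_j^mu, no self-couplings
J = xi.T @ xi
np.fill_diagonal(J, 0)
print(J)        # -2 at (1,2), (2,1), (3,4), (4,3): sigma1/sigma2 and sigma3/sigma4 anti-correlated
print(J // P)   # rescaled by the number of patterns, reproducing the -1/0 entries above
```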
The TSP problem
"Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?" np-hard problem Solution consists of an ordered list of n cities such that the total path length d of the closed tour is the lowest possible.
To “map” this problem onto the computational network, we require a representation scheme which allows the digital output states of the neurons (operating in the high-gain limit) to be decoded into this list.
n cities → n² neurons
Example n=5
Sequence: C, A, E, B, D Total length: d = dCA+ dAE+dEB+dBD+dDC
There are n! states of this form; an n-fold degeneracy (choice of the initial city) and a 2-fold degeneracy (tour orientation)
⇒ n!/(2n) distinct paths for closed TSP routes
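For concreteness, this counting (and the brute-force baseline that the network is meant to avoid) can be checked on a small random instance; the coordinates and the instance size below are illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 5
cities = rng.random((n, 2))                                       # random city coordinates
D = np.linalg.norm(cities[:, None] - cities[None, :], axis=-1)    # distance matrix d_XY

def tour_length(order):
    return sum(D[order[i], order[(i + 1) % n]] for i in range(n))

# Fix the starting city and keep one orientation per tour: n!/(2n) distinct closed tours
tours = [(0,) + p for p in itertools.permutations(range(1, n)) if p[0] < p[-1]]
print(len(tours), "distinct tours for n =", n)                    # 12 = 5!/(2*5)
best = min(tours, key=tour_length)
print("shortest tour:", best, "length:", round(tour_length(best), 3))
```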
E = A/2 ∑X ∑i ∑j≠i VXi VXj   (non-null if there are two or more non-null entries in the same city row)
  + B/2 ∑i ∑X ∑Y≠X VXi VYi   (non-null if there are two or more non-null entries in the same position column)
  + C/2 (∑X ∑i VXi − n)²   (non-null if any city or any position is not covered)
  + D/2 ∑X ∑Y≠X ∑i dXY VXi (VY,i+1 + VY,i−1)   (grows with the overall path distance)
Identifying this with the generic circuit energy E = −1/2 ∑ TXi,Yj VXi VYj − ∑ VXi IXi gives
TXi,Yj = −A δXY (1 − δij) − B δij (1 − δXY) − C − D dXY (δj,i+1 + δj,i−1)
IXi = +C n   (excitation bias)
where the −C term provides a global inhibition and the −D dXY term is the data term.
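A sketch of how the reconstructed connection matrix and bias translate into code (the penalty weights and the distance matrix are illustrative placeholders, and tour positions are taken modulo n):

```python
import numpy as np

def tsp_network(D, A=500.0, B=500.0, C=200.0, Dw=500.0):
    """Hopfield-Tank weights T[(X,i),(Y,j)] and biases I[(X,i)] for an n-city TSP.

    D          : (n, n) matrix of inter-city distances d_XY
    A, B, C, Dw: penalty weights (illustrative values, to be tuned)
    """
    n = D.shape[0]
    d = np.eye(n)                         # Kronecker delta
    T = np.zeros((n, n, n, n))            # indexed as T[X, i, Y, j]
    for X in range(n):
        for i in range(n):
            for Y in range(n):
                for j in range(n):
                    T[X, i, Y, j] = (
                        -A * d[X, Y] * (1 - d[i, j])            # one position per city (row term)
                        - B * d[i, j] * (1 - d[X, Y])           # one city per position (column term)
                        - C                                      # global inhibition
                        - Dw * D[X, Y] * (d[j, (i + 1) % n]
                                          + d[j, (i - 1) % n])   # data term: tour length
                    )
    I = np.full((n, n), C * n)            # excitation bias I_Xi = +C n
    return T.reshape(n * n, n * n), I.reshape(n * n)
```

Feeding these T and I into the analog dynamics of the previous slides (one sigmoid amplifier per neuron) is what produces the convergence shown in the figure described below.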
The convergence of the 10-city analog circuit to a tour. The linear dimension of each square is proportional to the value of VXi. a, b, c: intermediate times; d: the final state. The indices in d illustrate how the final state is decoded into a tour (solution of the TSP).
Outlooks
Retrieval: improve the performance of the network; role of the neural network topology; relax hypotheses towards a more realistic model.
Learning: quest for a more rigorous and fundamental understanding; solving tasks that are easy for people to perform but hard for people to describe formally, e.g., (informal) language translation.
H. Sompolinsky (1986) Neural networks with nonlinear synapses and a static noise, Phys. Rev. A
B. Wemmenhove, A.C.C. Coolen (2003) Finite connectivity attractor neural networks, J. Phys. A
M. Mezard (2017) Mean-field message-passing equations in the Hopfield model and its generalizations, Phys. Rev. E
J. Tubiana, R. Monasson (2017) Emergence of Compositional Representations in Restricted Boltzmann Machines, Phys. Rev. Lett.
E. Agliari, A. Barra, A. Galluzzi, F. Guerra, F. Moauro (2012) Multitasking Associative Networks, Phys. Rev. Lett.
E. Agliari et al. (2015) Retrieval Capabilities of Hierarchical Networks: From Dyson to Hopfield, Phys. Rev. Lett.
Extensions of the Hebbian kernel
M. Griniasty, M.V. Tsodyks, D.J. Amit (1993) Conversion of Temporal Correlations Between Stimuli to Spatial Correlations Between Attractors, Neur. Comp.
D.J. Amit, N. Brunel, M.V. Tsodyks (1994) Correlations of Cortical Hebbian Reverberations: Theory versus Experiment, J. Neurosci.
L. Cugliandolo, M.V. Tsodyks (1994) Capacity of networks with correlated attractors, J. Phys. A
E. Agliari et al. (2013) Parallel retrieval of correlated patterns: From Hopfield networks to Boltzmann machines, Neur. Net.
Pattern correlation. The Hebbian coupling Jij = ξi·ξj = ∑μ ξiμ ξjμ can be generalized in order to include possibly more complex combinations among patterns. For instance
Jij = (1/N) ∑μ,ν ξiμ Xμν ξjν
where X is a symmetric P × P matrix; of course, by taking X = I we recover the Hebb coupling. A particular choice was introduced to account for temporal correlations (Griniasty-Tsodyks-Amit):
Jij = (1/N) ∑μ (ξiμ ξjμ + a ξiμ ξjμ+1 + a ξiμ+1 ξjμ),   a ∈ ℝ+
mμ = 0, ∀μ
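A sketch of these kernels in code (the temporal-correlation choice follows the cited Griniasty-Tsodyks-Amit form with an open chain of patterns; patterns, sizes and the strength a are illustrative):

```python
import numpy as np

def generalized_hebb(xi, X):
    """J_ij = (1/N) sum_{mu,nu} xi_i^mu X_{mu nu} xi_j^nu ; X = identity recovers Hebb."""
    N = xi.shape[1]
    return xi.T @ X @ xi / N

def temporal_X(P, a):
    """Symmetric X coupling each pattern to its successor with strength a."""
    X = np.eye(P)
    for mu in range(P - 1):
        X[mu, mu + 1] = X[mu + 1, mu] = a
    return X

rng = np.random.default_rng(3)
P, N = 5, 200
xi = rng.choice([-1, +1], size=(P, N))          # binary patterns
J = generalized_hebb(xi, temporal_X(P, a=0.5))  # temporally correlated kernel
print(J.shape, np.allclose(J, J.T))             # (200, 200) True
```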
A more biological topology: introduce a metric and arrange neurons in a modular way (still connected!); modulate the coupling matrix according to the metric.
Different modules can perform different tasks simultaneously
Sequential Parallel retrieval
Getting closer to biology we enrich the emergent phenomenology in the right way!
E. Agliari, A. Barra, A. Galluzzi, F. Guerra, F. Moauro (2012) Multitasking Associative Networks, Phys. Rev. Lett. E. Agliari et al. (2015) Retrieval capabilities of Hierarchical Networks: from Dyson to Hopfield, Phys. Rev. Lett.
Dyson network: deterministically and recursively built; a complete, weighted graph endowed with a metric.
[Figure omitted: panels for k finite and k → ∞ on the Dyson hierarchical network]
Jij = ∑l=dij..k (…)
Experience: a set of examples (x, y) drawn from an unknown distribution q(x, y). Learning: adjusting the weights {Jij} so that, for a given input x, we can get information about y according to (an approximation of) q(x, y).
Most effective RBMs display a Gaussian layer
Two-layer Boltzmann machine: ask and read from the visible layer
E. Agliari, A. Barra (2011) A Hebbian approach to complex-network generation, Europhys. Lett.
E. Agliari, A. Barra, A. De Antoni, A. Galluzzi (2013) Parallel retrieval of correlated patterns: From Hopfield networks to Boltzmann machines, Neur. Net.
E. Agliari, A. Barra, A. Galluzzi, D. Tantari, F. Tavani (2014) A Walk in the Statistical Mechanical Formulation of Neural Networks - Alternative Routes to Hebb Prescription, NCTA
Equivalence of Hopfield nets and restricted Boltzmann machines
Bipartite spin glass. Digital visible neurons σi = ±1, ∀ i = 1, …, N. Analog hidden neurons zμ, Gaussian, ∀ μ = 1, …, P. Interlayer couplings ξiμ ~ 1/2 [δ(ξiμ − 1) + δ(ξiμ + 1)].
Hopfield model on a complete graph Digital neurons σi = ±1, ∀ i = 1, …, N
Hebbian coupling
Jij = (1/N) ∑μ=1..P ξiμ ξjμ
The set of couplings {ξiμ} encodes the learnt patterns {ξμ}. There exists a performance limit for RBMs: N > P.
HRBM(σ, z | ξ) = −(1/√N) ∑i=1..N ∑μ=1..P ξiμ σi zμ
Marginalizing over the Gaussian hidden neurons zμ gives back the Hopfield Hamiltonian with Hebbian couplings, −(1/2N) ∑i,j ∑μ ξiμ ξjμ σi σj.
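A tiny numerical check of this equivalence, under one explicit convention (unit-variance Gaussian prior on the hidden units and the 1/√N scaling of the Hamiltonian above; the hidden units are integrated out numerically on a grid and compared with the Hopfield Boltzmann factor, self-couplings kept so the match is exact):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
N, P = 4, 2
xi = rng.choice([-1, +1], size=(P, N))           # interlayer couplings / patterns

z = np.linspace(-10, 10, 4001)                   # grid for one Gaussian hidden unit
dz = z[1] - z[0]

def weight_rbm(sigma):
    """Boltzmann weight with the Gaussian hidden units integrated out numerically."""
    w = 1.0
    for mu in range(P):
        b = xi[mu] @ sigma / np.sqrt(N)
        w *= np.sum(np.exp(b * z - z**2 / 2)) * dz / np.sqrt(2 * np.pi)
    return w

def weight_hopfield(sigma):
    """Hopfield Boltzmann factor with Hebbian couplings J = xi^T xi / N."""
    J = xi.T @ xi / N
    return np.exp(0.5 * sigma @ J @ sigma)

for s in product([-1, 1], repeat=N):
    s = np.array(s)
    assert np.isclose(weight_rbm(s), weight_hopfield(s), rtol=1e-6)
print("Gaussian-hidden-layer RBM and Hopfield weights coincide on all 2^N states")
```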
The analog neurons in the hidden layer change continuously in time, and their activity can be described by an SDE, in analogy with the integrate-and-fire model:
T dzμ(t)/dt = −zμ(t) + (1/√N) ∑i ξiμ σi + noise
whose stationary distribution, at fixed σ, is Gaussian, centred on the input (1/√N) ∑i ξiμ σi received from the visible layer, and factorizes over the hidden units:
Pr(z | σ, ξ) = ∏μ=1..P Pr(zμ | σ, ξ)
The activity of digital neurons in the visible layer follows a Glauber dynamics, with conditional distribution
Pr(σ | z, ξ) = ∏i=1..N exp[β σi (1/√N) ∑μ=1..P ξiμ zμ] / {2 cosh[β (1/√N) ∑μ=1..P ξiμ zμ]}
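Taken together, the two conditional distributions suggest the usual alternating (block-Gibbs) sampling of the bipartite network; a minimal sketch, with β, sizes and patterns chosen only for illustration (at low temperature and low load one typically sees one Mattis overlap condense close to ±1):

```python
import numpy as np

rng = np.random.default_rng(5)
N, P, beta = 100, 3, 2.0
xi = rng.choice([-1, +1], size=(P, N))           # stored patterns / interlayer couplings

sigma = rng.choice([-1, +1], size=N)
for sweep in range(200):
    # hidden layer: z_mu | sigma is Gaussian, centred on (1/sqrt(N)) sum_i xi_i^mu sigma_i
    z = xi @ sigma / np.sqrt(N) + rng.normal(size=P) / np.sqrt(beta)
    # visible layer: Glauber/Gibbs update with local fields h_i = (1/sqrt(N)) sum_mu xi_i^mu z_mu
    h = xi.T @ z / np.sqrt(N)
    sigma = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-2.0 * beta * h)), 1, -1)

print(np.round(xi @ sigma / N, 2))               # Mattis overlaps with the stored patterns
```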
The joint probability can be found by exploiting Bayes' formula: Pr(σ, z | ξ) = Pr(σ | z, ξ) Pr(z | ξ) = Pr(z | σ, ξ) Pr(σ | ξ).
In particular, marginalizing over the hidden layer z yields a Boltzmann distribution for σ with pairwise Hebbian couplings ∝ ∑μ ξiμ ξjμ, i.e., the Hopfield model.
[Figure: pattern overlaps m1, …, m6 versus the dilution d]
E. Agliari et al. (2012) Multitasking attractor networks with neuronal threshold noise, Neur. Net.
E. Agliari et al. (2013) Multi-tasking capabilities at medium load, J. Phys. A
E. Agliari et al. (2013) Multitasking capabilities near saturation, J. Phys. A
E. Agliari et al. (2017) Retrieving Infinite Numbers of Patterns in a Spin-Glass Model of Immune Networks, Europhys. Lett.
Once ξ1 is retrieved (m1 = 1 − d), it is convenient to coordinate the free spins so that they align with the next pattern, say ξ2, instead of letting them align randomly.
P(ξiμ = +1) = P(ξiμ = −1) = (1 − d)/2,  P(ξiμ = 0) = d,  with d > 0 and finite
When the dilution scales with N, d = 1 − c/N^γ, the topology of the resulting network and the retrieval capacity are affected.
Below the percolation threshold: the graph is fragmented into cliques; each clique corresponds to a different pattern, i.e. to a different clone.
NT = 10^4, α = 0.1, δ = γ; γ = 0.9 (left panel) and γ = 0.8 (right panel). Isolated nodes (8856 and 8913, respectively) are omitted.
Above the percolation threshold: the graph forms complex components; different T cells share several B cells → signal interference.
NT = 10^4, α = 0.1, δ = 1 (here γ < δ); γ = 0.9 (left panel) and γ = 0.8 (right panel). Isolated nodes (6277 and 6487, respectively) are omitted.
As complex components emerge, modularity progressively decays and a giant component eventually appears, hindering parallel retrieval.
Bipartite graph G2, made up of NT and NB nodes; the limit of NB/NT as NT → ∞ determines the load regime.
The couplings in G2 are provided by the {ξiμ}, i.i.d. from
P(ξiμ | d) = (1 − d)/2 [δξiμ,+1 + δξiμ,−1] + d δξiμ,0
After marginalization one obtains a monopartite graph G1, with NT nodes that interact pairwise through the coupling matrix
Jij = ∑μ=1..NB ξiμ ξjμ
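A sketch of this construction (patterns drawn from the diluted distribution above with d = 1 − c/NT^γ, couplings obtained by marginalization, and the component structure of G1 inspected; all sizes are illustrative and much smaller than in the figures):

```python
import numpy as np

rng = np.random.default_rng(6)
N_T, gamma, c, delta = 2000, 0.9, 1.0, 0.9
N_B = int(N_T ** delta)
d = 1.0 - c / N_T ** gamma                     # dilution of the pattern entries

# Diluted patterns: P(+1) = P(-1) = (1 - d)/2, P(0) = d
xi = rng.choice([-1.0, 0.0, 1.0], size=(N_B, N_T), p=[(1 - d) / 2, d, (1 - d) / 2])

# Marginalization over the B layer: J_ij = sum_mu xi_i^mu xi_j^mu
J = xi.T @ xi
np.fill_diagonal(J, 0.0)

# Connected components of G1 (an edge wherever J_ij != 0), via a simple BFS
adj = J != 0
unvisited = set(range(N_T))
sizes = []
while unvisited:
    stack, size = [unvisited.pop()], 1
    while stack:
        nbrs = set(np.flatnonzero(adj[stack.pop()])) & unvisited
        unvisited -= nbrs
        size += len(nbrs)
        stack.extend(nbrs)
    sizes.append(size)

print(f"N_B = {N_B}, components (incl. isolated nodes): {len(sizes)}, largest: {max(sizes)}")
```

Raising δ above γ (e.g. δ = 1, as in the "above the percolation threshold" panels) should make a giant component appear, consistently with the statements above.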
d = 1 − c/NT^γ
Low/medium storage case: NB ~ NT^δ, δ < 1
From a graph-theory perspective: for γ ≥ δ, G1 is fragmented into multiple disconnected components, each forming a clique or a collection of cliques connected via a bridge. Each clique corresponds to a pattern → simultaneous recall of multiple patterns is allowed. For γ < δ, G1 can exhibit a giant component, which prevents the system from simultaneous pattern recall.
From a statistical-mechanics perspective: as the load (i.e., NB/NT) grows, i.e. when δ grows, a source of non-Gaussian interference noise appears, which is non-negligible for γ ≤ δ. If δ = γ the system is still able to retrieve all the patterns, but with a decreasing recall overlap.
High storage case: NB = α NT, α > 0 (i.e., δ = 1 in NB ~ NT^δ); the borderline case is δ = γ. For δ = 1 there is Gaussian noise due to non-condensed patterns, and this is found to destroy the retrieval states: the system behaves as a spin glass, since the extreme dilution of G1 is insufficient to sustain such a high pattern load.
αc² < 1: typical components in G1 are finite-sized and form cliques whose occurrence frequency decays exponentially with their size. Each clique corresponds to a pattern; this arrangement allows for the simultaneous recall of multiple patterns. αc² > 1: G1 exhibits a giant component, which can compromise the system's parallel-processing ability.
RS ansatz → critical surface Tc(α, c) separating two distinct phases. T > Tc: each subsystem behaves as a paramagnet. T < Tc: each subsystem retrieves one particular pattern (or its inverse), representing parallel retrieval (perfect at zero temperature) of an extensive number of patterns. Tc(α, 1/√α) = 0 ∀ α ≥ 0, so for αc² > 1 no transition at finite temperature away from this phase is possible.
Each new layer improves the performance of the neural network, focusing on finer details.
A two-layer RBM corresponds to a Hopfield model with two- and one-body interactions (ME: first and second moments) and accomplishes learning once ⟨si sj⟩ and ⟨si⟩ are recovered.
Idea… A p-layer RBM corresponds to a Hopfield model with up to p-body interactions (ME: up to the p-th moment) and accomplishes learning once ⟨si1 si2 … sip⟩ are recovered; therefore we can describe a richer (more than Gaussian) reality!