Deep Learning and Physics -- 2019
Deep Random Neural Field
Shun-ichi Amari, RIKEN Center for Brain Science; Araya
Brief History of AI and NN
First Boom: start
1956~: AI / neural networks (perceptron)
Dartmouth Conf. / Perceptron
symbols, universal computation, logic / learning machines
Dark period (late 1960s~1970s): stochastic gradient descent learning (1967) for MLP
Perceptron: F. Rosenblatt, Principles of Neurodynamics, 1961
McCulloch-Pitts neuron: 0/1 binary; learning
multilayer; lateral & feedback connections
[Figure: network mapping input x to output z]
Deep Neural Networks
Rosenblatt: multilayer perceptron
$z = f(\mathbf{x}, W)$
$L(W) = |y - f(\mathbf{x}, W)|^2$
$W \to W + \Delta W, \qquad \Delta W = -c\,\frac{\partial L}{\partial W}$
differentiable: analog neurons make learning of hidden neurons possible
[Figure: multilayer network from input x through hidden weights w]
stochastic gradient learning: Amari, Tsypkin, 1966~67; error back-prop, 1976
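To make the update rule concrete, here is a minimal sketch of stochastic gradient descent on a one-hidden-layer analog-neuron network in numpy; the tanh nonlinearity, layer sizes, toy target, and learning rate are illustrative assumptions, not taken from the slides.

import numpy as np

# Minimal sketch: SGD on z = f(x, W) with squared loss L = |y - f(x, W)|^2.
# Architecture and data are illustrative assumptions.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 20
W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))
w2 = rng.normal(0, 1 / np.sqrt(n_hid), n_hid)
c = 0.05                                # learning rate

for step in range(1000):
    x = rng.normal(size=n_in)
    y = np.sin(x.sum())                 # toy target
    h = np.tanh(W1 @ x)                 # hidden analog neurons
    z = w2 @ h                          # output f(x, W)
    err = z - y
    grad_w2 = 2 * err * h               # dL/dw2 by the chain rule (backprop)
    grad_W1 = 2 * err * np.outer(w2 * (1 - h**2), x)
    w2 -= c * grad_w2                   # Delta W = -c dL/dW
    W1 -= c * grad_W1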
Information Theory II -- Geometrical Theory of Information
Shun-ichi Amari, University of Tokyo
Kyoritsu Press, Tokyo, 1968
First stochastic descent learning of MLP (1967; 1968)
$f(\mathbf{x}) = v_1 \max\{\mathbf{w}_1\cdot\mathbf{x},\; \mathbf{w}_2\cdot\mathbf{x}\} + v_2 \min\{\mathbf{w}_3\cdot\mathbf{x},\; \mathbf{w}_4\cdot\mathbf{x}\} - \theta$
[Figure: input x through weights $\mathbf{w}_1, \ldots, \mathbf{w}_4$ into max/min units, combined with $v_1, v_2$ into output y]
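For concreteness, a direct numpy transcription of the reconstructed 1967 unit; the specific weights, output coefficients, and threshold below are arbitrary illustrative values.

import numpy as np

# f(x) = v1*max(w1.x, w2.x) + v2*min(w3.x, w4.x) - theta,
# as reconstructed above; all numbers are illustrative.
def f(x, w, v, theta):
    a = max(w[0] @ x, w[1] @ x)     # max unit
    b = min(w[2] @ x, w[3] @ x)     # min unit
    return v[0] * a + v[1] * b - theta

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 3))         # weight vectors w_1, ..., w_4
v = np.array([1.0, -0.5])
print(f(rng.normal(size=3), w, v, theta=0.1))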
Second Boom
1970~ AI 1980~ neural networks expert system MLP (backprop)(MYCIN) associative memory
stochastic inference (Bayes) chess (1997)
Third Boom: 2010~, deep learning
Stochastic inference (graphical models; Bayesian; WATSON)
Deep learning: pattern recognition (vision, audition), sentence analysis, machine translation, AlphaGo
Language processing; sequences and dynamics (word2vec, deep learning with recurrent nets)
Integration of (symbol, logic) vs (pattern, dynamics)
Deep Learning: Self-Organization + Supervised Learning
RBM (Restricted Boltzmann Machine); auto-encoder; recurrent net
Dropout; contrastive divergence; convolution; ResNet; ReLU; adversarial nets
Victory of Deep Neural Networks
Hinton 2005, 2006 ~ 2012; many others
visual patterns, auditory patterns; the game of Go; sentence analysis, machine translation
adversarial networks, pattern generation
Mathematical Neuroscience: searches for the principles
mathematical studies using simple, idealized models (not realistic)
Computational neuroscience / AI: technological realization
Mathematical Neuroscience and the Brain
The brain has found and implemented the principles through evolution (random search), under historical and material restrictions.
Very complex (not smartly designed)
Theoretical Problems of Learning: 1
Local solutions and the global solution
Simulated annealing; quantum annealing
[Figure: loss landscape $L(\Theta)$ over parameter space $\Theta$]
Theoretical Problems of Learning: 2
Training loss and generalization loss: overtraining
$y = f(x, \theta) + \varepsilon$
$L_{\mathrm{emp}} = \frac{1}{N}\sum_i \big|y_i - f(x_i, \theta)\big|^2$
$L_{\mathrm{gen}} = E\big[\,|y - f(x, \theta)|^2\,\big]$
$L_{\mathrm{gen}} \approx L_{\mathrm{emp}} + \frac{P}{N}$
[Figure: learning curves of training loss and generalization loss]
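A toy numeric illustration of overtraining, assuming polynomial least-squares regression as the model family (an illustrative stand-in, not the slides' model): as the number of parameters P grows toward the sample size N, the training loss keeps falling while a held-out estimate of the generalization loss deteriorates.

import numpy as np

# Overtraining demo: fit P polynomial coefficients to N noisy samples
# and compare training loss with a held-out estimate of L_gen.
rng = np.random.default_rng(2)
N = 20
x_tr = rng.uniform(-1, 1, N)
y_tr = np.sin(np.pi * x_tr) + 0.1 * rng.normal(size=N)
x_te = rng.uniform(-1, 1, 2000)             # large held-out set
y_te = np.sin(np.pi * x_te) + 0.1 * rng.normal(size=2000)

for P in (3, 10, 18):
    coef = np.polyfit(x_tr, y_tr, deg=P - 1)
    L_emp = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    L_gen = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"P={P:2d}  L_emp={L_emp:.4f}  L_gen={L_gen:.4f}")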
Extremely wide network, $P \to \infty$: $P \gg N$
Local minimum = global minimum (Kawaguchi, 2019)
Learning curve, $P \gg N$
Double descent
Belkin et al., 2019; Hastie et al., 2019
Random Neural Network
Random is excellent!! Random is magic!!
Statistical dynamics; random codes
Random Deep Networks: Poole et al., 2016; Schoenholz et al., 2017; ...
Signal propagation; error back-propagation
Jacot et al.: neural tangent kernel
$y = f(x, \theta); \qquad l(x, \theta) = \tfrac{1}{2}\,\{y - f(x, \theta)\}^2; \qquad e(x, \theta) = f(x, \theta) - f(x, \theta^*)$
$\partial_\theta l = \partial_\theta f(x, \theta)\,\big(f(x, \theta) - f(x, \theta^*)\big)$
$\theta_{t+1} = \theta_t - \eta\,\partial_\theta l$
$\partial_t f(x, \theta) = -\eta\,\partial_\theta f(x, \theta)\cdot\partial_\theta f(x', \theta)\; e(x', \theta)$
$K$: Gaussian kernel
$K(x, x'; \theta) = \partial_\theta f(x, \theta)\cdot\partial_\theta f(x', \theta)$
$\partial_t f(x, \theta) = -\eta\,\big\langle K(x, x'; \theta)\, e(x', \theta)\big\rangle$
$K(x, x'; \theta_t) \approx K(x, x'; \theta_{\mathrm{ini}}) \approx K(x, x')$: Gaussian kernel at initialization
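A sketch of the empirical kernel $K(x, x'; \theta) = \partial_\theta f(x)\cdot\partial_\theta f(x')$ for a one-hidden-layer network; the architecture and the $1/\sqrt{n}$ output scaling are illustrative assumptions. For large width, the kernel computed this way stays nearly constant along the training trajectory.

import numpy as np

# Empirical NTK of f(x) = (1/sqrt(n)) v . tanh(W x); illustrative model.
rng = np.random.default_rng(3)
n_in, n = 4, 5000                           # very wide hidden layer
W = rng.normal(size=(n, n_in))
v = rng.normal(size=n)

def grad_f(x):
    """Gradient of f with respect to all parameters (W, v), flattened."""
    h = np.tanh(W @ x)
    dv = h / np.sqrt(n)
    dW = np.outer(v * (1 - h**2), x) / np.sqrt(n)
    return np.concatenate([dW.ravel(), dv])

x1, x2 = rng.normal(size=n_in), rng.normal(size=n_in)
K12 = grad_f(x1) @ grad_f(x2)               # kernel entry at initialization
print(K12)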
Theorem ($P \gg N$): the optimal solution lies near a random network. (Bailey et al., 2019)
$w_{ij} = O\big(1/\sqrt{n}\big)$: random
$\Delta w_{ij} = O\big(1/n\big)$
Random Neural Field
$u_l(z') = \int w_l(z', z)\, x_{l-1}(z)\, dz + b_l(z')$
$x_l(z') = \varphi\big(u_l(z')\big)$
$w_l(z', z)$: random (zero-mean Gaussian; correlated)
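A minimal discretization of the field equation above on a grid, assuming a squared-exponential correlation for the zero-mean Gaussian connections $w(z', z)$ and a tanh nonlinearity (both illustrative choices).

import numpy as np

# One layer of a random neural field: u(z') = \int w(z', z) x(z) dz + b(z').
rng = np.random.default_rng(4)
m = 200                                     # grid points on [0, 1]
z = np.linspace(0.0, 1.0, m)
dz = z[1] - z[0]

# Correlated zero-mean Gaussian connections: covariance over z' decays
# with distance (squared-exponential, length scale 0.05 -- an assumption).
cov = np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * 0.05**2))
Lc = np.linalg.cholesky(cov + 1e-9 * np.eye(m))
w = Lc @ rng.normal(size=(m, m))            # columns correlated over z'
b = 0.1 * (Lc @ rng.normal(size=m))         # correlated random bias

x = np.sin(2 * np.pi * z)                   # input activity pattern x(z)
u = w @ x * dz + b                          # discretized integral
x_next = np.tanh(u)                         # x_{l+1}(z') = phi(u(z'))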
Statistical Neurodynamics
microdynamics: $\mathbf{x}_{t+1} = T_W(\mathbf{x}_t) = \mathrm{sgn}(W\mathbf{x}_t)$
macrodynamics: $X_{t+1} = F(X_t)$, $\quad X$: macrostate
$X_2 = X(\mathbf{x}_2) = X(T_W\,\mathbf{x}_1)$
$X_3 = X(\mathbf{x}_3) = X(T_W T_W\,\mathbf{x}_1)$?
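A quick check of the micro-to-macro reduction, taking the overlap $C_t = \mathbf{x}_t\cdot\mathbf{y}_t/n$ of two trajectories as the macrostate. For sign neurons the annealed macro map is $C_{t+1} = \frac{2}{\pi}\arcsin C_t$ (a standard Gaussian identity, used here as an illustration); whether it remains exact when the same $W$ is reused at every step is precisely the question mark above.

import numpy as np

# Microdynamics x_{t+1} = sgn(W x_t) for two correlated states under one
# random W; macrostate: overlap C_t. Compare with C' = (2/pi) arcsin(C).
rng = np.random.default_rng(5)
n, T = 3000, 8
W = rng.normal(0, 1 / np.sqrt(n), (n, n))
x = np.sign(rng.normal(size=n))
y = np.where(rng.random(n) < 0.9, x, -x)    # initial overlap ~ 0.8

C_micro = [x @ y / n]
C_macro = [C_micro[0]]
for t in range(T):
    x, y = np.sign(W @ x), np.sign(W @ y)   # microdynamics
    C_micro.append(x @ y / n)
    C_macro.append(2 / np.pi * np.arcsin(C_macro[-1]))
print(np.round(C_micro, 3))
print(np.round(C_macro, 3))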
Statistical Neurodynamics
Rozonoer (1969); Amari (1969, 1971, 1973)
Sompolinsky; Amari et al. (2013); Toyoizumi et al. (2015); Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); Karakida et al. (2019); Jacot et al. (2019); ...
$w_{ij} \sim N(0, 1)$
Macroscopic behaviors are common to almost all (typical) networks.
Random Deep Networks
$x_i^{l+1} = \varphi\Big(\sum_j w_{ij}^l\, x_j^l + w_i^0\Big)$
$A^l = \frac{1}{n_l}\sum_i \big(x_i^l\big)^2, \qquad A^{l+1} = F\big(A^l\big)$
$w_{ij} \sim N\big(0,\, \sigma_w^2/n_l\big), \qquad w_i^0 \sim N\big(0,\, \sigma_b^2\big)$
Macroscopic variables
activity: $A^l = \frac{1}{n}\sum_i \big(x_i^l\big)^2$
distance: $D^l = D\big[\mathbf{x}^l : \mathbf{x}'^l\big]$
$A^{l+1} = F\big(A^l\big), \qquad D^{l+1} = K\big(D^l\big)$
metric, curvature & Fisher information
Dynamics of Activity: law of large numbers
$x_i' = \varphi\Big(\sum_k w_{ik} x_k + b_i\Big) = \varphi(u_i), \qquad \mathbf{u} = W\mathbf{x} + \mathbf{b}, \qquad u_i \sim N\big(0,\, \tilde{A}\big), \quad \tilde{A} = \sigma_w^2 A + \sigma_b^2$
$A' = \frac{1}{n}\sum_i \varphi(u_i)^2 = E\big[\varphi(u)^2\big] = F(A)$
$A^{l+1} = F\big(A^l\big) = \int \varphi\big(\sqrt{\tilde{A}}\,v\big)^2\, Dv, \qquad v \sim N(0, 1)$
$\chi(0) = \sigma_w^2\,\varphi'(0)^2 > 1$
$A^l \to A_0$: converges
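A numeric sketch of the activity law, assuming $\varphi = \tanh$ and illustrative variances: the mean-field map $F$, evaluated by quadrature over the Gaussian measure $Dv$, is iterated and compared with one simulated wide network.

import numpy as np

# A^{l+1} = F(A^l) = \int tanh(sqrt(s_w2*A + s_b2) v)^2 Dv  vs. simulation.
rng = np.random.default_rng(6)
s_w2, s_b2, n, L = 2.0, 0.05, 2000, 15

def F(A, m=201):
    v, dv = np.linspace(-8, 8, m, retstep=True)
    Dv = np.exp(-v**2 / 2) / np.sqrt(2 * np.pi) * dv   # Gaussian measure
    return np.sum(np.tanh(np.sqrt(s_w2 * A + s_b2) * v) ** 2 * Dv)

A_mf = [1.0]                                 # mean-field iteration
for l in range(L):
    A_mf.append(F(A_mf[-1]))

x = rng.normal(size=n)                       # microscopic simulation
x *= np.sqrt(n / (x @ x))                    # normalize so A^0 = 1
A_sim = [x @ x / n]
for l in range(L):
    u = rng.normal(0, np.sqrt(s_w2 / n), (n, n)) @ x \
        + rng.normal(0, np.sqrt(s_b2), n)
    x = np.tanh(u)
    A_sim.append(x @ x / n)

print(np.round(A_mf, 3))                     # the two agree
print(np.round(A_sim, 3))                    # (law of large numbers)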
Pullback Metric & Curvature
$ds^2 = \sum_{i,j} g^l_{ij}\, dx^i dx^j = \frac{1}{n}\, d\mathbf{x}^l \cdot d\mathbf{x}^l, \qquad \mathbf{x}^l = \varphi\big(W\mathbf{x}^{l-1}\big)$
Basis vectors (Jacobian):
$dx_i^l = \varphi'\big(u_i^l\big)\sum_j W_{ij}\, dx_j^{l-1} = \sum_j B_{ij}^l\, dx_j^{l-1}, \qquad B_{ij}^l = \varphi'\big(u_i^l\big)\, W_{ij}^l$
$\mathbf{e}_a^l = B^l\, \mathbf{e}_a^{l-1}, \qquad B = B^l B^{l-1} \cdots B^1$
$g_{ab}^l = \frac{1}{n}\, \mathbf{e}_a^l \cdot \mathbf{e}_b^l$
Dynamics of Metric
$dx'_a = \sum_k B_{ak}\, dx_k, \qquad B_{ak} = \varphi'(u_a)\, w_{ak}$
$g'_{ab} = \frac{1}{n}\,\mathbf{e}'_a \cdot \mathbf{e}'_b = \sum_{k,j} E\big[\varphi'(u)^2\, w_k w_j\big]\, e_a^k e_b^j$
mean-field approximation: $E\big[\varphi'(u)^2\, w w\big] \approx E\big[\varphi'(u)^2\big]\, E[w w]$
$g'_{ab} = \chi\, g_{ab}, \qquad \chi(A) = \sigma_w^2 \int \varphi'\big(\sqrt{\tilde{A}}\, v\big)^2\, Dv$
Metric
$g_{ab}^l = \frac{1}{n}\,\big(B^l \mathbf{e}_a^{l-1}\big)\cdot\big(B^l \mathbf{e}_b^{l-1}\big), \qquad ds^2 = \sum_{a,b} g_{ab}^l\, dx^a dx^b$
$\big(B^{l\top} B^l\big)_{ii'} = \sum_k \varphi'\big(u_k^l\big)^2\, w_{ki} w_{ki'} \approx \sigma_w^2\, E\big[\varphi'(u)^2\big]\,\delta_{ii'}$
$\chi^l = \sigma_w^2\, E\big[\varphi'\big(u^l\big)^2\big]$
$g_{ab}^l = \chi^{l-1}\, g_{ab}^{l-1}$
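A one-layer numeric check of the conformal law $g'_{ab} \approx \chi\, g_{ab}$, assuming $\varphi = \tanh$ and illustrative sizes; the two tangent vectors below are arbitrary.

import numpy as np

# Pullback of tangent vectors by the Jacobian B = diag(phi'(u)) W,
# compared with the mean-field prediction chi = s_w2 * E[phi'(u)^2].
rng = np.random.default_rng(7)
n, s_w2 = 3000, 1.5
x = rng.normal(size=n)
W = rng.normal(0, np.sqrt(s_w2 / n), (n, n))
u = W @ x
B = (1 - np.tanh(u) ** 2)[:, None] * W

e1 = rng.normal(size=n)                      # arbitrary tangent vectors
e2 = 0.5 * e1 + rng.normal(size=n)
chi = s_w2 * np.mean((1 - np.tanh(u) ** 2) ** 2)
print((B @ e1) @ (B @ e1) / n, chi * (e1 @ e1) / n)   # ~ equal
print((B @ e1) @ (B @ e2) / n, chi * (e1 @ e2) / n)   # ~ equal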
Law of large numbers
$g_{ij}^l(x) = \prod_{l'} \chi\big(A^{l'}(x)\big)\; g_{ij}^1(x)$: conformal geometry
conformal transformation! $\quad g_{ij}(x) \to A(x)\, g_{ij}(x)$
$g_{ij}^1 = \chi\,\delta_{ij} \;\Rightarrow\; g_{ij}^l = \prod_{l'} \chi^{l'}\,\delta_{ij}$
rotation, expansion
Domino Theorem
$B^l = \frac{\partial \mathbf{x}^{l+1}}{\partial \mathbf{x}^l}, \qquad \frac{\partial \mathbf{x}^{L+1}}{\partial \mathbf{x}^1} = B^L B^{L-1} \cdots B^1$
$E\big[B^{l\top} B^l\big] = \chi^l\,\delta$
$E\big[\big(B^L \cdots B^1\big)^{\top} \big(B^L \cdots B^1\big)\big] = \chi^L \chi^{L-1} \cdots \chi^1\,\delta$
(the expectation of the whole product factorizes layer by layer, like a row of dominoes)
Dynamics of Curvature
$\mathbf{H}_{ab} = \nabla_a \mathbf{e}_b = \partial_a \partial_b\, \mathbf{x}$
$H_{ab}^i = \varphi''\big(u^i\big)\,\big(\mathbf{w}^i\cdot\mathbf{e}_a\big)\big(\mathbf{w}^i\cdot\mathbf{e}_b\big) + \varphi'\big(u^i\big)\,\mathbf{w}^i\cdot\partial_a \mathbf{e}_b$
$\mathbf{H}_{ab} = \mathbf{H}_{ab}^{\perp} + \mathbf{H}_{ab}^{\parallel}$
$\chi_2 = \int \varphi''\big(\sqrt{\tilde{A}}\,v\big)^2\, Dv$
$\big(H_{ab}^{l+1}\big)^2 = \chi^2\, \big(H_{ab}^l\big)^2 + \frac{1}{n}\,\chi_2\,(\cdots)$
$\chi > 1$: exponential expansion! Creation (the new term) is small!
Poole et al. (2016): deep neural networks
Distance
$D[\mathbf{x}, \mathbf{y}] = \frac{1}{n}\sum_i (x_i - y_i)^2$
Dynamics of Distance (Amari, 1974)
$D(\mathbf{x}, \mathbf{x}') = \frac{1}{n}\sum_i \big(x_i - x_i'\big)^2$
$C(\mathbf{x}, \mathbf{x}') = \frac{1}{n}\,\mathbf{x}\cdot\mathbf{x}'$
$D = A + A' - 2C$
$u_i = \mathbf{w}_i\cdot\mathbf{x}, \quad u_i' = \mathbf{w}_i\cdot\mathbf{x}', \qquad (u, u') \sim N(0, V), \quad V = \begin{pmatrix} A & C \\ C & A' \end{pmatrix}$
$C' = E\big[\varphi(u)\,\varphi(u')\big] = E\Big[\varphi\big(\sqrt{C}\,\varepsilon + \sqrt{A - C}\,\nu\big)\,\varphi\big(\sqrt{C}\,\varepsilon + \sqrt{A' - C}\,\nu'\big)\Big]$
$D^{l+1} = K\big(D^l\big), \qquad \frac{dD^{l+1}}{dD^l} = \chi \;(> 1)$: Problem!
$D^l(\mathbf{x}, \mathbf{x}') \to \bar{D} \quad (l \to \infty)$: equi-distance property
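A simulation of the equi-distance property, assuming $\varphi = \tanh$ with $\sigma_w^2 = 4$ (an illustrative chaotic-regime choice): pairs of inputs with very different initial distances are driven toward the same $\bar{D}$.

import numpy as np

# Propagate pairs through a deep random tanh net and track D^l.
rng = np.random.default_rng(8)
n, L, s_w2 = 1000, 30, 4.0

def distances(x, y):
    D = [np.mean((x - y) ** 2)]
    for l in range(L):
        W = rng.normal(0, np.sqrt(s_w2 / n), (n, n))
        x, y = np.tanh(W @ x), np.tanh(W @ y)
        D.append(np.mean((x - y) ** 2))
    return D

x = rng.normal(size=n)
for eps in (0.1, 0.5, 2.0):                  # very different starting distances
    y = x + eps * rng.normal(size=n)
    print(np.round(distances(x, y), 3)[-3:]) # all end near the same D-bar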
dynamics of distance: $\lim_{L\to\infty} D^L(\mathbf{x}, \mathbf{y})$
$\lim_{n\to\infty}\,\lim_{L\to\infty} D^L(\mathbf{x}, \mathbf{y}) \;\neq\; \lim_{L\to\infty}\,\lim_{n\to\infty} D^L(\mathbf{x}, \mathbf{y})$
Feedback Path
Error backprop; Fisher information
Stochastic model: parameter space = manifold of probability distributions
$y = \varphi\big(W \varphi(W \cdots \varphi(W\mathbf{x}) \cdots)\big) + \varepsilon, \qquad \varepsilon \sim N(0, 1)$
$p(y, \mathbf{x}; W) = c\,\exp\Big\{-\tfrac{1}{2}\big(y - \varphi(\mathbf{x}; W)\big)^2\Big\}\, q(\mathbf{x})$
$G = E\big[\nabla_W \log p(y, \mathbf{x}; W)\;\nabla_W \log p(y, \mathbf{x}; W)\big]$
$ds^2 = dW\, G\, dW$
Learning: stochastic gradient descent
Steepest direction: natural gradient
$\nabla l = \Big(\frac{\partial l}{\partial \theta_1}, \ldots, \frac{\partial l}{\partial \theta_n}\Big), \qquad \tilde{\nabla} l = G^{-1}\,\nabla l$
$\Delta\boldsymbol{\theta}_t = -\eta_t\, \tilde{\nabla} l\big(x_t, y_t;\, \boldsymbol{\theta}_t\big)$
[Figure: loss surface $l(\theta)$ with update step $d\theta$]
Natural Gradient
$dl = l(\boldsymbol{\theta} + d\boldsymbol{\theta}) - l(\boldsymbol{\theta})$
maximize $dl$ under $\mathrm{KL}\big[p(x, \boldsymbol{\theta}) : p(x, \boldsymbol{\theta} + d\boldsymbol{\theta})\big] = \varepsilon^2$
$\tilde{\nabla} l = G^{-1}\,\nabla l$
$\Delta\boldsymbol{\theta}_t = -\eta_t\, \tilde{\nabla} l\big(x_t, y_t;\, \boldsymbol{\theta}_t\big)$
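A minimal natural gradient sketch on a model whose Fisher information is known in closed form, a 1-D Gaussian $N(\mu, \sigma^2)$ with $G = \mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$; the data stream and learning rate are illustrative assumptions standing in for the MLP case.

import numpy as np

# Natural gradient = G^{-1} grad, with G known exactly for the Gaussian.
rng = np.random.default_rng(9)
data = rng.normal(3.0, 2.0, size=5000)
mu, sigma, eta = 0.0, 1.0, 0.05

for t in range(2000):
    x = data[t % len(data)]
    g_mu = -(x - mu) / sigma**2              # grad of -log p per sample
    g_sigma = 1 / sigma - (x - mu) ** 2 / sigma**3
    mu -= eta * sigma**2 * g_mu              # premultiply by G^{-1}
    sigma -= eta * (sigma**2 / 2) * g_sigma
print(mu, sigma)                             # approaches (3, 2)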
Information Geometry of MLP
Natural Gradient Learning: S. Amari; H. Y. Park
$\Delta\boldsymbol{\theta} = -\eta\, \hat{G}^{-1}\,\frac{\partial l}{\partial \boldsymbol{\theta}}$
$\hat{G}^{-1}_{t+1} = (1 + \varepsilon)\,\hat{G}^{-1}_t - \varepsilon\, \hat{G}^{-1}_t\, \nabla f\, (\nabla f)^{\mathrm{T}}\, \hat{G}^{-1}_t$
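A sketch of the adaptive inverse-Fisher recursion reconstructed above, fed a synthetic gradient stream whose true $G$ is the identity (an illustrative assumption), so the running estimate should hover near $I$.

import numpy as np

# G^{-1}_{t+1} = (1 + eps) G^{-1}_t - eps G^{-1}_t (grad f)(grad f)^T G^{-1}_t
rng = np.random.default_rng(10)
P, eps = 10, 0.01
G_inv = np.eye(P)

for t in range(5000):
    grad_f = rng.normal(size=P)              # stand-in for d f / d theta
    v = G_inv @ grad_f                       # G_inv stays symmetric, so
    G_inv = (1 + eps) * G_inv - eps * np.outer(v, v)  # v v^T = G_inv g g^T G_inv

print(np.round(G_inv[:3, :3], 2))            # stays near the identity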
Fisher Information
$G = E\Big[\frac{\partial \varphi}{\partial W}\,\frac{\partial \varphi}{\partial W}\Big]$
$\frac{\partial \varphi}{\partial W^l} = \frac{\partial \varphi}{\partial \mathbf{x}^m}\,\frac{\partial \mathbf{x}^m}{\partial W^l} = \varphi'\, B^{m-1} B^{m-2} \cdots B^{l+1}\, \mathbf{x}^l$
$G\big(W^l, W^l\big) = \Big(\prod_m \chi^m\Big)\, E\big[\varphi'^{\,2}\, \mathbf{x}\,\mathbf{x}\big] + O(1/n)$
$G\big(W^l, W^m\big) = O(1/n), \quad l \neq m$
$G_{ij} = O(1/n), \quad i \neq j$
Unitwise natural gradient
$\Delta W = -\eta\, G^{*-1}\, \nabla_W l$
Y. Ollivier; Marceau-Caron
Good news and bad news
$G^*$: unitwise-diagonal matrix
$G^* \to G\,$?, $\quad G^{*-1} \to G^{-1}\,$? $\quad (n \to \infty)$
Karakida theory: eigenvalues of $G$
$\frac{1}{n}\sum_i \lambda_i = O(1), \qquad \frac{1}{n}\sum_i \lambda_i^2 = O(n)$
$G$: distorted Riemannian metric
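A small empirical check in the spirit of this result (illustrative sizes; empirical Fisher information of one random tanh network): the spectrum of $G$ is strongly skewed, with a few large eigenvalues and many near zero.

import numpy as np

# Empirical Fisher information G = (1/N) sum_t grad f grad f^T.
rng = np.random.default_rng(11)
n_in, n_hid, N = 10, 50, 1000
W = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))
v = rng.normal(0, 1 / np.sqrt(n_hid), n_hid)

def grad_f(x):
    h = np.tanh(W @ x)
    dW = np.outer(v * (1 - h**2), x)
    return np.concatenate([dW.ravel(), h])   # grads w.r.t. (W, v)

grads = np.stack([grad_f(rng.normal(size=n_in)) for _ in range(N)])
G = grads.T @ grads / N
lam = np.linalg.eigvalsh(G)
print(lam[-3:], np.median(lam))              # few large, most near zero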
References:
Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); ...
S. Amari, R. Karakida & M. Oizumi, Statistical neurodynamics of deep networks: Geometry of signal spaces. arXiv:1808.07169v1, 2018.
S. Amari, R. Karakida & M. Oizumi, Fisher information and natural gradient learning of random deep networks. arXiv:1808.07172v1, 2018 (AISTATS 2019).
R. Karakida, S. Akaho & S. Amari, Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv:1806.01316, 2018 (AISTATS 2019).