Deep Learning and Physics -- 2019
Deep Random Neural Field
Shun-ichi Amari, RIKEN Center for Brain Science; Araya
Brief History of AI and NN
First Boom: start
1956~: AI / neural networks (perceptron)
Dartmouth Conf. / Perceptron
symbols, universal computation, logic / learning machines
Dark period (late 1960s~1970s): stochastic gradient descent learning (1967) for MLP
Perceptron: F. Rosenblatt, Principles of Neurodynamics, 1961
McCulloch-Pitts neuron: 0/1 binary; learning
multilayer; lateral & feedback connections
[Figure: network mapping input x to output z]
Deep Neural Networks
Rosenblatt: multilayer perceptron
$z = f(\mathbf{x}, W)$
$L(W) = |y - f(\mathbf{x}, W)|^2$
$W \to W + \Delta W, \qquad \Delta W = -c\,\frac{\partial L}{\partial W}$
differentiable: analog neurons make learning of hidden neurons possible
[Figure: multilayer network from input x through hidden weights w]
stochastic gradient learning: Amari, Tsypkin, 1966~67; error back-prop, 1976
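To make the update rule concrete, here is a minimal sketch of stochastic gradient descent on a one-hidden-layer analog-neuron network in numpy; the tanh nonlinearity, layer sizes, toy target, and learning rate are illustrative assumptions, not taken from the slides.

import numpy as np

# Minimal sketch: SGD on z = f(x, W) with squared loss L = |y - f(x, W)|^2.
# Architecture and data are illustrative assumptions.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 20
W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))
w2 = rng.normal(0, 1 / np.sqrt(n_hid), n_hid)
c = 0.05                                # learning rate

for step in range(1000):
    x = rng.normal(size=n_in)
    y = np.sin(x.sum())                 # toy target
    h = np.tanh(W1 @ x)                 # hidden analog neurons
    z = w2 @ h                          # output f(x, W)
    err = z - y
    grad_w2 = 2 * err * h               # dL/dw2 by the chain rule (backprop)
    grad_W1 = 2 * err * np.outer(w2 * (1 - h**2), x)
    w2 -= c * grad_w2                   # Delta W = -c dL/dW
    W1 -= c * grad_W1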
Information Theory II -- Geometrical Theory of Information
Shun-ichi Amari, University of Tokyo
Kyoritsu Press, Tokyo, 1968
First stochastic descent learning of MLP (1967; 1968)
$f(\mathbf{x}) = v_1 \max\{\mathbf{w}_1\cdot\mathbf{x},\; \mathbf{w}_2\cdot\mathbf{x}\} + v_2 \min\{\mathbf{w}_3\cdot\mathbf{x},\; \mathbf{w}_4\cdot\mathbf{x}\} - \theta$
[Figure: input x through weights $\mathbf{w}_1, \ldots, \mathbf{w}_4$ into max/min units, combined with $v_1, v_2$ into output y]
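For concreteness, a direct numpy transcription of the reconstructed 1967 unit; the specific weights, output coefficients, and threshold below are arbitrary illustrative values.

import numpy as np

# f(x) = v1*max(w1.x, w2.x) + v2*min(w3.x, w4.x) - theta,
# as reconstructed above; all numbers are illustrative.
def f(x, w, v, theta):
    a = max(w[0] @ x, w[1] @ x)     # max unit
    b = min(w[2] @ x, w[3] @ x)     # min unit
    return v[0] * a + v[1] * b - theta

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 3))         # weight vectors w_1, ..., w_4
v = np.array([1.0, -0.5])
print(f(rng.normal(size=3), w, v, theta=0.1))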
Second Boom
1970~ AI 1980~ neural networks expert system MLP (backprop)(MYCIN) associative memory
stochastic inference (Bayes) chess (1997)
Third Boom: 2010~, deep learning
Stochastic inference (graphical models; Bayesian; WATSON)
Deep learning: pattern recognition (vision, audition), sentence analysis, machine translation, AlphaGo
Language processing; sequences and dynamics (word2vec, deep learning with recurrent nets)
Integration of (symbol, logic) vs (pattern, dynamics)
Deep Learning: Self-Organization + Supervised Learning
RBM (Restricted Boltzmann Machine); auto-encoder; recurrent net
Dropout; contrastive divergence; convolution; ResNet; ReLU; adversarial nets
Victory of Deep Neural Networks
Hinton 2005, 2006 ~ 2012; many others
visual patterns, auditory patterns; the game of Go; sentence analysis, machine translation
adversarial networks, pattern generation
Mathematical Neuroscience: searches for the principles
mathematical studies using simple, idealized models (not realistic)
Computational neuroscience / AI: technological realization
Mathematical Neuroscience and the Brain
The brain has found and implemented the principles through evolution (random search), under historical and material restrictions.
Very complex (not smartly designed)
Theoretical Problems of Learning: 1
Local solutions and the global solution
Simulated annealing; quantum annealing
[Figure: loss landscape $L(\Theta)$ over parameter space $\Theta$]
Theoretical Problems of Learning: 2
Training loss and generalization loss: overtraining
$y = f(x, \theta) + \varepsilon$
$L_{\mathrm{emp}} = \frac{1}{N}\sum_i \big|y_i - f(x_i, \theta)\big|^2$
$L_{\mathrm{gen}} = E\big[\,|y - f(x, \theta)|^2\,\big]$
$L_{\mathrm{gen}} \approx L_{\mathrm{emp}} + \frac{P}{N}$
[Figure: learning curves of training loss and generalization loss]
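A toy numeric illustration of overtraining, assuming polynomial least-squares regression as the model family (an illustrative stand-in, not the slides' model): as the number of parameters P grows toward the sample size N, the training loss keeps falling while a held-out estimate of the generalization loss deteriorates.

import numpy as np

# Overtraining demo: fit P polynomial coefficients to N noisy samples
# and compare training loss with a held-out estimate of L_gen.
rng = np.random.default_rng(2)
N = 20
x_tr = rng.uniform(-1, 1, N)
y_tr = np.sin(np.pi * x_tr) + 0.1 * rng.normal(size=N)
x_te = rng.uniform(-1, 1, 2000)             # large held-out set
y_te = np.sin(np.pi * x_te) + 0.1 * rng.normal(size=2000)

for P in (3, 10, 18):
    coef = np.polyfit(x_tr, y_tr, deg=P - 1)
    L_emp = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    L_gen = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"P={P:2d}  L_emp={L_emp:.4f}  L_gen={L_gen:.4f}")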
Extremely wide network, $P \to \infty$: $P \gg N$
Local minimum = global minimum (Kawaguchi, 2019)
Learning curve, $P \gg N$
Double descent
Belkin et al., 2019; Hastie et al., 2019
Random Neural Network
Random is excellent!! Random is magic!!
Statistical dynamics; random codes
Random Deep Networks: Poole et al., 2016; Schoenholz et al., 2017; ...
Signal propagation; error back-propagation
Jacot et al.: neural tangent kernel
$y = f(x, \theta); \qquad l(x, \theta) = \tfrac{1}{2}\,\{y - f(x, \theta)\}^2; \qquad e(x, \theta) = f(x, \theta) - f(x, \theta^*)$
$\partial_\theta l = \partial_\theta f(x, \theta)\,\big(f(x, \theta) - f(x, \theta^*)\big)$
$\theta_{t+1} = \theta_t - \eta\,\partial_\theta l$
$\partial_t f(x, \theta) = -\eta\,\partial_\theta f(x, \theta)\cdot\partial_\theta f(x', \theta)\; e(x', \theta)$
$K$: Gaussian kernel
$K(x, x'; \theta) = \partial_\theta f(x, \theta)\cdot\partial_\theta f(x', \theta)$
$\partial_t f(x, \theta) = -\eta\,\big\langle K(x, x'; \theta)\, e(x', \theta)\big\rangle$
$K(x, x'; \theta_t) \approx K(x, x'; \theta_{\mathrm{ini}}) \approx K(x, x')$: Gaussian kernel at initialization
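A sketch of the empirical kernel $K(x, x'; \theta) = \partial_\theta f(x)\cdot\partial_\theta f(x')$ for a one-hidden-layer network; the architecture and the $1/\sqrt{n}$ output scaling are illustrative assumptions. For large width, the kernel computed this way stays nearly constant along the training trajectory.

import numpy as np

# Empirical NTK of f(x) = (1/sqrt(n)) v . tanh(W x); illustrative model.
rng = np.random.default_rng(3)
n_in, n = 4, 5000                           # very wide hidden layer
W = rng.normal(size=(n, n_in))
v = rng.normal(size=n)

def grad_f(x):
    """Gradient of f with respect to all parameters (W, v), flattened."""
    h = np.tanh(W @ x)
    dv = h / np.sqrt(n)
    dW = np.outer(v * (1 - h**2), x) / np.sqrt(n)
    return np.concatenate([dW.ravel(), dv])

x1, x2 = rng.normal(size=n_in), rng.normal(size=n_in)
K12 = grad_f(x1) @ grad_f(x2)               # kernel entry at initialization
print(K12)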
Theorem ($P \gg N$): the optimal solution lies near a random network. (Bailey et al., 2019)
$w_{ij} = O\big(1/\sqrt{n}\big)$: random
$\Delta w_{ij} = O\big(1/n\big)$
Random Neural Field
$u_l(z') = \int w_l(z', z)\, x_{l-1}(z)\, dz + b_l(z')$
$x_l(z') = \varphi\big(u_l(z')\big)$
$w_l(z', z)$: random (zero-mean Gaussian; correlated)
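A minimal discretization of the field equation above on a grid, assuming a squared-exponential correlation for the zero-mean Gaussian connections $w(z', z)$ and a tanh nonlinearity (both illustrative choices).

import numpy as np

# One layer of a random neural field: u(z') = \int w(z', z) x(z) dz + b(z').
rng = np.random.default_rng(4)
m = 200                                     # grid points on [0, 1]
z = np.linspace(0.0, 1.0, m)
dz = z[1] - z[0]

# Correlated zero-mean Gaussian connections: covariance over z' decays
# with distance (squared-exponential, length scale 0.05 -- an assumption).
cov = np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * 0.05**2))
Lc = np.linalg.cholesky(cov + 1e-9 * np.eye(m))
w = Lc @ rng.normal(size=(m, m))            # columns correlated over z'
b = 0.1 * (Lc @ rng.normal(size=m))         # correlated random bias

x = np.sin(2 * np.pi * z)                   # input activity pattern x(z)
u = w @ x * dz + b                          # discretized integral
x_next = np.tanh(u)                         # x_{l+1}(z') = phi(u(z'))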
Statistical Neurodynamics
microdynamics: $\mathbf{x}_{t+1} = T_W(\mathbf{x}_t) = \mathrm{sgn}(W\mathbf{x}_t)$
macrodynamics: $X_{t+1} = F(X_t)$, $\quad X$: macrostate
$X_2 = X(\mathbf{x}_2) = X(T_W\,\mathbf{x}_1)$
$X_3 = X(\mathbf{x}_3) = X(T_W T_W\,\mathbf{x}_1)$?
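A quick check of the micro-to-macro reduction, taking the overlap $C_t = \mathbf{x}_t\cdot\mathbf{y}_t/n$ of two trajectories as the macrostate. For sign neurons the annealed macro map is $C_{t+1} = \frac{2}{\pi}\arcsin C_t$ (a standard Gaussian identity, used here as an illustration); whether it remains exact when the same $W$ is reused at every step is precisely the question mark above.

import numpy as np

# Microdynamics x_{t+1} = sgn(W x_t) for two correlated states under one
# random W; macrostate: overlap C_t. Compare with C' = (2/pi) arcsin(C).
rng = np.random.default_rng(5)
n, T = 3000, 8
W = rng.normal(0, 1 / np.sqrt(n), (n, n))
x = np.sign(rng.normal(size=n))
y = np.where(rng.random(n) < 0.9, x, -x)    # initial overlap ~ 0.8

C_micro = [x @ y / n]
C_macro = [C_micro[0]]
for t in range(T):
    x, y = np.sign(W @ x), np.sign(W @ y)   # microdynamics
    C_micro.append(x @ y / n)
    C_macro.append(2 / np.pi * np.arcsin(C_macro[-1]))
print(np.round(C_micro, 3))
print(np.round(C_macro, 3))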
Statistical Neurodynamics
Rozonoer (1969); Amari (1969, 1971, 1973)
Sompolinsky; Amari et al. (2013); Toyoizumi et al. (2015); Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); Karakida et al. (2019); Jacot et al. (2019); ...
$w_{ij} \sim N(0, 1)$
Macroscopic behaviors are common to almost all (typical) networks.
Random Deep Networks
$x_i^{l+1} = \varphi\Big(\sum_j w_{ij}^l\, x_j^l + w_i^0\Big)$
$A^l = \frac{1}{n_l}\sum_i \big(x_i^l\big)^2, \qquad A^{l+1} = F\big(A^l\big)$
$w_{ij} \sim N\big(0,\, \sigma_w^2/n_l\big), \qquad w_i^0 \sim N\big(0,\, \sigma_b^2\big)$
Macroscopic variables
activity: $A^l = \frac{1}{n}\sum_i \big(x_i^l\big)^2$
distance: $D^l = D\big[\mathbf{x}^l : \mathbf{x}'^l\big]$
$A^{l+1} = F\big(A^l\big), \qquad D^{l+1} = K\big(D^l\big)$
metric, curvature & Fisher information
Dynamics of Activity: law of large numbers
$x_i' = \varphi\Big(\sum_k w_{ik} x_k + b_i\Big) = \varphi(u_i), \qquad \mathbf{u} = W\mathbf{x} + \mathbf{b}, \qquad u_i \sim N\big(0,\, \tilde{A}\big), \quad \tilde{A} = \sigma_w^2 A + \sigma_b^2$
$A' = \frac{1}{n}\sum_i \varphi(u_i)^2 = E\big[\varphi(u)^2\big] = F(A)$
$A^{l+1} = F\big(A^l\big) = \int \varphi\big(\sqrt{\tilde{A}}\,v\big)^2\, Dv, \qquad v \sim N(0, 1)$
$\chi(0) = \sigma_w^2\,\varphi'(0)^2 > 1$
$A^l \to A_0$: converges
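A numeric sketch of the activity law, assuming $\varphi = \tanh$ and illustrative variances: the mean-field map $F$, evaluated by quadrature over the Gaussian measure $Dv$, is iterated and compared with one simulated wide network.

import numpy as np

# A^{l+1} = F(A^l) = \int tanh(sqrt(s_w2*A + s_b2) v)^2 Dv  vs. simulation.
rng = np.random.default_rng(6)
s_w2, s_b2, n, L = 2.0, 0.05, 2000, 15

def F(A, m=201):
    v, dv = np.linspace(-8, 8, m, retstep=True)
    Dv = np.exp(-v**2 / 2) / np.sqrt(2 * np.pi) * dv   # Gaussian measure
    return np.sum(np.tanh(np.sqrt(s_w2 * A + s_b2) * v) ** 2 * Dv)

A_mf = [1.0]                                 # mean-field iteration
for l in range(L):
    A_mf.append(F(A_mf[-1]))

x = rng.normal(size=n)                       # microscopic simulation
x *= np.sqrt(n / (x @ x))                    # normalize so A^0 = 1
A_sim = [x @ x / n]
for l in range(L):
    u = rng.normal(0, np.sqrt(s_w2 / n), (n, n)) @ x \
        + rng.normal(0, np.sqrt(s_b2), n)
    x = np.tanh(u)
    A_sim.append(x @ x / n)

print(np.round(A_mf, 3))                     # the two agree
print(np.round(A_sim, 3))                    # (law of large numbers)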
Pullback Metric & Curvature
$ds^2 = \sum_{i,j} g^l_{ij}\, dx^i dx^j = \frac{1}{n}\, d\mathbf{x}^l \cdot d\mathbf{x}^l, \qquad \mathbf{x}^l = \varphi\big(W\mathbf{x}^{l-1}\big)$
Basis vectors (Jacobian):
$dx_i^l = \varphi'\big(u_i^l\big)\sum_j W_{ij}\, dx_j^{l-1} = \sum_j B_{ij}^l\, dx_j^{l-1}, \qquad B_{ij}^l = \varphi'\big(u_i^l\big)\, W_{ij}^l$
$\mathbf{e}_a^l = B^l\, \mathbf{e}_a^{l-1}, \qquad B = B^l B^{l-1} \cdots B^1$
$g_{ab}^l = \frac{1}{n}\, \mathbf{e}_a^l \cdot \mathbf{e}_b^l$
Dynamics of Metric
$dx'_a = \sum_k B_{ak}\, dx_k, \qquad B_{ak} = \varphi'(u_a)\, w_{ak}$
$g'_{ab} = \frac{1}{n}\,\mathbf{e}'_a \cdot \mathbf{e}'_b = \sum_{k,j} E\big[\varphi'(u)^2\, w_k w_j\big]\, e_a^k e_b^j$
mean-field approximation: $E\big[\varphi'(u)^2\, w w\big] \approx E\big[\varphi'(u)^2\big]\, E[w w]$
$g'_{ab} = \chi\, g_{ab}, \qquad \chi(A) = \sigma_w^2 \int \varphi'\big(\sqrt{\tilde{A}}\, v\big)^2\, Dv$
Metric
$g_{ab}^l = \frac{1}{n}\,\big(B^l \mathbf{e}_a^{l-1}\big)\cdot\big(B^l \mathbf{e}_b^{l-1}\big), \qquad ds^2 = \sum_{a,b} g_{ab}^l\, dx^a dx^b$
$\big(B^{l\top} B^l\big)_{ii'} = \sum_k \varphi'\big(u_k^l\big)^2\, w_{ki} w_{ki'} \approx \sigma_w^2\, E\big[\varphi'(u)^2\big]\,\delta_{ii'}$
$\chi^l = \sigma_w^2\, E\big[\varphi'\big(u^l\big)^2\big]$
$g_{ab}^l = \chi^{l-1}\, g_{ab}^{l-1}$
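A one-layer numeric check of the conformal law $g'_{ab} \approx \chi\, g_{ab}$, assuming $\varphi = \tanh$ and illustrative sizes; the two tangent vectors below are arbitrary.

import numpy as np

# Pullback of tangent vectors by the Jacobian B = diag(phi'(u)) W,
# compared with the mean-field prediction chi = s_w2 * E[phi'(u)^2].
rng = np.random.default_rng(7)
n, s_w2 = 3000, 1.5
x = rng.normal(size=n)
W = rng.normal(0, np.sqrt(s_w2 / n), (n, n))
u = W @ x
B = (1 - np.tanh(u) ** 2)[:, None] * W

e1 = rng.normal(size=n)                      # arbitrary tangent vectors
e2 = 0.5 * e1 + rng.normal(size=n)
chi = s_w2 * np.mean((1 - np.tanh(u) ** 2) ** 2)
print((B @ e1) @ (B @ e1) / n, chi * (e1 @ e1) / n)   # ~ equal
print((B @ e1) @ (B @ e2) / n, chi * (e1 @ e2) / n)   # ~ equal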
Law of large numbers
$g_{ij}^l(x) = \prod_{l'} \chi\big(A^{l'}(x)\big)\; g_{ij}^1(x)$: conformal geometry
conformal transformation! $\quad g_{ij}(x) \to A(x)\, g_{ij}(x)$
$g_{ij}^1 = \chi\,\delta_{ij} \;\Rightarrow\; g_{ij}^l = \prod_{l'} \chi^{l'}\,\delta_{ij}$
rotation, expansion
Domino Theorem
$B^l = \frac{\partial \mathbf{x}^{l+1}}{\partial \mathbf{x}^l}, \qquad \frac{\partial \mathbf{x}^{L+1}}{\partial \mathbf{x}^1} = B^L B^{L-1} \cdots B^1$
$E\big[B^{l\top} B^l\big] = \chi^l\,\delta$
$E\big[\big(B^L \cdots B^1\big)^{\top} \big(B^L \cdots B^1\big)\big] = \chi^L \chi^{L-1} \cdots \chi^1\,\delta$
(the expectation of the whole product factorizes layer by layer, like a row of dominoes)
Dynamics of Curvature
$\mathbf{H}_{ab} = \nabla_a \mathbf{e}_b = \partial_a \partial_b\, \mathbf{x}$
$H_{ab}^i = \varphi''\big(u^i\big)\,\big(\mathbf{w}^i\cdot\mathbf{e}_a\big)\big(\mathbf{w}^i\cdot\mathbf{e}_b\big) + \varphi'\big(u^i\big)\,\mathbf{w}^i\cdot\partial_a \mathbf{e}_b$
$\mathbf{H}_{ab} = \mathbf{H}_{ab}^{\perp} + \mathbf{H}_{ab}^{\parallel}$
$\chi_2 = \int \varphi''\big(\sqrt{\tilde{A}}\,v\big)^2\, Dv$
$\big(H_{ab}^{l+1}\big)^2 = \chi^2\, \big(H_{ab}^l\big)^2 + \frac{1}{n}\,\chi_2\,(\cdots)$
$\chi > 1$: exponential expansion! Creation (the new term) is small!
Poole et al. (2016): deep neural networks
Distance
$D[\mathbf{x}, \mathbf{y}] = \frac{1}{n}\sum_i (x_i - y_i)^2$
Dynamics of Distance (Amari, 1974)
$D(\mathbf{x}, \mathbf{x}') = \frac{1}{n}\sum_i \big(x_i - x_i'\big)^2$
$C(\mathbf{x}, \mathbf{x}') = \frac{1}{n}\,\mathbf{x}\cdot\mathbf{x}'$
$D = A + A' - 2C$
$u_i = \mathbf{w}_i\cdot\mathbf{x}, \quad u_i' = \mathbf{w}_i\cdot\mathbf{x}', \qquad (u, u') \sim N(0, V), \quad V = \begin{pmatrix} A & C \\ C & A' \end{pmatrix}$
$C' = E\big[\varphi(u)\,\varphi(u')\big] = E\Big[\varphi\big(\sqrt{C}\,\varepsilon + \sqrt{A - C}\,\nu\big)\,\varphi\big(\sqrt{C}\,\varepsilon + \sqrt{A' - C}\,\nu'\big)\Big]$
$D^{l+1} = K\big(D^l\big), \qquad \frac{dD^{l+1}}{dD^l} = \chi \;(> 1)$: Problem!
$D^l(\mathbf{x}, \mathbf{x}') \to \bar{D} \quad (l \to \infty)$: equi-distance property
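A simulation of the equi-distance property, assuming $\varphi = \tanh$ with $\sigma_w^2 = 4$ (an illustrative chaotic-regime choice): pairs of inputs with very different initial distances are driven toward the same $\bar{D}$.

import numpy as np

# Propagate pairs through a deep random tanh net and track D^l.
rng = np.random.default_rng(8)
n, L, s_w2 = 1000, 30, 4.0

def distances(x, y):
    D = [np.mean((x - y) ** 2)]
    for l in range(L):
        W = rng.normal(0, np.sqrt(s_w2 / n), (n, n))
        x, y = np.tanh(W @ x), np.tanh(W @ y)
        D.append(np.mean((x - y) ** 2))
    return D

x = rng.normal(size=n)
for eps in (0.1, 0.5, 2.0):                  # very different starting distances
    y = x + eps * rng.normal(size=n)
    print(np.round(distances(x, y), 3)[-3:]) # all end near the same D-bar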
dynamics of distance: $\lim_{L\to\infty} D^L(\mathbf{x}, \mathbf{y})$
$\lim_{n\to\infty}\,\lim_{L\to\infty} D^L(\mathbf{x}, \mathbf{y}) \;\neq\; \lim_{L\to\infty}\,\lim_{n\to\infty} D^L(\mathbf{x}, \mathbf{y})$
Feedback Path
Error backprop; Fisher information
Stochastic model: parameter space = manifold of probability distributions
$y = \varphi\big(W \varphi(W \cdots \varphi(W\mathbf{x}) \cdots)\big) + \varepsilon, \qquad \varepsilon \sim N(0, 1)$
$p(y, \mathbf{x}; W) = c\,\exp\Big\{-\tfrac{1}{2}\big(y - \varphi(\mathbf{x}; W)\big)^2\Big\}\, q(\mathbf{x})$
$G = E\big[\nabla_W \log p(y, \mathbf{x}; W)\;\nabla_W \log p(y, \mathbf{x}; W)\big]$
$ds^2 = dW\, G\, dW$
Learning: stochastic gradient descent
Steepest direction: natural gradient
$\nabla l = \Big(\frac{\partial l}{\partial \theta_1}, \ldots, \frac{\partial l}{\partial \theta_n}\Big), \qquad \tilde{\nabla} l = G^{-1}\,\nabla l$
$\Delta\boldsymbol{\theta}_t = -\eta_t\, \tilde{\nabla} l\big(x_t, y_t;\, \boldsymbol{\theta}_t\big)$
[Figure: loss surface $l(\theta)$ with update step $d\theta$]
Natural Gradient
$dl = l(\boldsymbol{\theta} + d\boldsymbol{\theta}) - l(\boldsymbol{\theta})$
maximize $dl$ under $\mathrm{KL}\big[p(x, \boldsymbol{\theta}) : p(x, \boldsymbol{\theta} + d\boldsymbol{\theta})\big] = \varepsilon^2$
$\tilde{\nabla} l = G^{-1}\,\nabla l$
$\Delta\boldsymbol{\theta}_t = -\eta_t\, \tilde{\nabla} l\big(x_t, y_t;\, \boldsymbol{\theta}_t\big)$
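A minimal natural gradient sketch on a model whose Fisher information is known in closed form, a 1-D Gaussian $N(\mu, \sigma^2)$ with $G = \mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$; the data stream and learning rate are illustrative assumptions standing in for the MLP case.

import numpy as np

# Natural gradient = G^{-1} grad, with G known exactly for the Gaussian.
rng = np.random.default_rng(9)
data = rng.normal(3.0, 2.0, size=5000)
mu, sigma, eta = 0.0, 1.0, 0.05

for t in range(2000):
    x = data[t % len(data)]
    g_mu = -(x - mu) / sigma**2              # grad of -log p per sample
    g_sigma = 1 / sigma - (x - mu) ** 2 / sigma**3
    mu -= eta * sigma**2 * g_mu              # premultiply by G^{-1}
    sigma -= eta * (sigma**2 / 2) * g_sigma
print(mu, sigma)                             # approaches (3, 2)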
Information Geometry of MLP
Natural Gradient Learning: S. Amari; H. Y. Park
$\Delta\boldsymbol{\theta} = -\eta\, \hat{G}^{-1}\,\frac{\partial l}{\partial \boldsymbol{\theta}}$
$\hat{G}^{-1}_{t+1} = (1 + \varepsilon)\,\hat{G}^{-1}_t - \varepsilon\, \hat{G}^{-1}_t\, \nabla f\, (\nabla f)^{\mathrm{T}}\, \hat{G}^{-1}_t$
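A sketch of the adaptive inverse-Fisher recursion reconstructed above, fed a synthetic gradient stream whose true $G$ is the identity (an illustrative assumption), so the running estimate should hover near $I$.

import numpy as np

# G^{-1}_{t+1} = (1 + eps) G^{-1}_t - eps G^{-1}_t (grad f)(grad f)^T G^{-1}_t
rng = np.random.default_rng(10)
P, eps = 10, 0.01
G_inv = np.eye(P)

for t in range(5000):
    grad_f = rng.normal(size=P)              # stand-in for d f / d theta
    v = G_inv @ grad_f                       # G_inv stays symmetric, so
    G_inv = (1 + eps) * G_inv - eps * np.outer(v, v)  # v v^T = G_inv g g^T G_inv

print(np.round(G_inv[:3, :3], 2))            # stays near the identity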
Fisher Information
$G = E\Big[\frac{\partial \varphi}{\partial W}\,\frac{\partial \varphi}{\partial W}\Big]$
$\frac{\partial \varphi}{\partial W^l} = \frac{\partial \varphi}{\partial \mathbf{x}^m}\,\frac{\partial \mathbf{x}^m}{\partial W^l} = \varphi'\, B^{m-1} B^{m-2} \cdots B^{l+1}\, \mathbf{x}^l$
$G\big(W^l, W^l\big) = \Big(\prod_m \chi^m\Big)\, E\big[\varphi'^{\,2}\, \mathbf{x}\,\mathbf{x}\big] + O(1/n)$
$G\big(W^l, W^m\big) = O(1/n), \quad l \neq m$
$G_{ij} = O(1/n), \quad i \neq j$
Unitwise natural gradient
$\Delta W = -\eta\, G^{*-1}\, \nabla_W l$
Y. Ollivier; Marceau-Caron
Good news and bad news
$G^*$: unitwise-diagonal matrix
$G^* \to G\,$?, $\quad G^{*-1} \to G^{-1}\,$? $\quad (n \to \infty)$
Karakida theory: eigenvalues of $G$
$\frac{1}{n}\sum_i \lambda_i = O(1), \qquad \frac{1}{n}\sum_i \lambda_i^2 = O(n)$
$G$: distorted Riemannian metric
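A small empirical check in the spirit of this result (illustrative sizes; empirical Fisher information of one random tanh network): the spectrum of $G$ is strongly skewed, with a few large eigenvalues and many near zero.

import numpy as np

# Empirical Fisher information G = (1/N) sum_t grad f grad f^T.
rng = np.random.default_rng(11)
n_in, n_hid, N = 10, 50, 1000
W = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))
v = rng.normal(0, 1 / np.sqrt(n_hid), n_hid)

def grad_f(x):
    h = np.tanh(W @ x)
    dW = np.outer(v * (1 - h**2), x)
    return np.concatenate([dW.ravel(), h])   # grads w.r.t. (W, v)

grads = np.stack([grad_f(rng.normal(size=n_in)) for _ in range(N)])
G = grads.T @ grads / N
lam = np.linalg.eigvalsh(G)
print(lam[-3:], np.median(lam))              # few large, most near zero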
References:
Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); ...
S. Amari, R. Karakida & M. Oizumi, Statistical neurodynamics of deep networks: Geometry of signal spaces. arXiv:1808.07169v1, 2018.
S. Amari, R. Karakida & M. Oizumi, Fisher information and natural gradient learning of random deep networks. arXiv:1808.07172v1, 2018 (AISTATS 2019).
R. Karakida, S. Akaho & S. Amari, Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv:1806.01316, 2018 (AISTATS 2019).