Deep Random Neural Field (kabuto.phys.sci.osaka-u.ac.jp/~koji/workshop/DLAP2019/)


Deep Learning and Physics -- 2019

Deep Random Neural Field

Shun-ichi Amari, RIKEN Center for Brain Science; Araya

Brief History of AI and NN
First Boom: start

1956~: AI and neural networks (perceptron)

AI: Dartmouth Conference; symbols, logic, universal computation
Neural networks: Perceptron; learning machine

Dark period (late 1960s ~ 1970s); stochastic gradient descent learning (1967) for MLP

Perceptron: F. Rosenblatt, Principles of Neurodynamics, 1961

McCulloch-Pitts neurons: 0/1 binary; learning
Multilayer; lateral & feedback connections

[figure: perceptron diagram, input x to output z]

Deep Neural Networks
Rosenblatt: multilayer perceptron

z = f(x, W)

differentiable: analog neurons

L(W) = \frac{1}{2}\,\|y - f(x, W)\|^2

W \to W + \Delta W, \qquad \Delta W = -c\,\frac{\partial L(W)}{\partial W}

learning of hidden neurons (analog neurons)

[figure: multilayer network from input x to output z with hidden weights w]

stochastic gradient learning: Amari, Tsypkin, 1966~67; error back-prop, 1976
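A minimal sketch of this stochastic gradient descent rule for a two-layer network of differentiable (analog) neurons. The sizes, tanh nonlinearity, toy teacher signal, and learning constant c are illustrative assumptions, not the original 1967 setup.

```python
import numpy as np

# Sketch of stochastic gradient descent for a small analog-neuron MLP.
# Sizes, tanh units, the toy teacher, and the constant c are assumptions.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 20
W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))   # hidden-layer weights
w2 = rng.normal(0, 1 / np.sqrt(n_hid), n_hid)          # output weights
c = 0.1                                                 # learning constant

for t in range(2000):
    x = rng.normal(size=n_in)
    y = np.sin(x.sum())                                 # toy teacher signal
    h = np.tanh(W1 @ x)                                 # hidden activities
    z = w2 @ h                                          # z = f(x, W)
    err = z - y                                         # dL/dz for L = (1/2)(y - z)^2
    # Delta W = -c * dL/dW, including the hidden (analog) neurons
    w2 -= c * err * h
    W1 -= c * np.outer(err * w2 * (1 - h ** 2), x)
```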

Information Theory II--Geometrical Theory of Information

Shun-ichi Amari, University of Tokyo

Kyoritu Press, Tokyo, 1968

First stochastic gradient descent learning of MLP (1967; 1968)

f(\mathbf{x}, \theta) = v_1 \max\{\mathbf{w}_1\cdot\mathbf{x},\ \mathbf{w}_2\cdot\mathbf{x}\} + v_2 \min\{\mathbf{w}_3\cdot\mathbf{x},\ \mathbf{w}_4\cdot\mathbf{x}\}

[figure: network with input x, weight vectors w_1, ..., w_4 feeding max/min units, combined with v_1, v_2 into the output y]

Second Boom

AI (1970~): expert systems (MYCIN); stochastic inference (Bayes); chess (1997)
Neural networks (1980~): MLP (backprop); associative memory

Third Boom (2010~)

Deep learning; stochastic inference (graphical models; Bayesian; WATSON)

Deep learning: pattern recognition (vision, auditory), sentence analysis, machine translation, AlphaGo

Language processing; sequences and dynamics (word2vec, deep learning with recurrent nets)

Integration of (symbol, logic) vs (pattern, dynamics)

Deep Learning: Self-Organization + Supervised Learning
RBM (Restricted Boltzmann Machine), Auto-Encoder, Recurrent Net

Dropout, contrastive divergence, convolution, ResNet, ReLU, adversarial nets

Victory of Deep Neural Networks
Hinton 2005, 2006 ~ 2012, and many others

visual patterns, auditory patterns, the game of Go, sentence analysis, machine translation

adversarial networks, pattern generation

Mathematical Neuroscience searches for the principles: mathematical studies using simple, idealistic (not realistic) models

Computational neuroscience; AI: technological realization

Mathematical Neuroscience and the Brain

The brain has found and implemented the principles through evolution (random search), under historical and material restrictions

Very complex (not smartly designed)

Theoretical Problems on Learning: 1
Local solutions and global solutions

Simulated annealing; quantum annealing

[figure: loss landscape L(\Theta) over parameters \Theta with local and global minima]

Theoretical Problem of Learning: 2
Training loss and generalization loss: overtraining

y = f(x, \theta) + \varepsilon

L_{emp}(\theta) = \frac{1}{N}\sum_i |y_i - f(x_i, \theta)|^2

L_{gen}(\theta) = E\big[\,|y - f(x, \theta)|^2\,\big]

L_{gen} \approx L_{emp} + \frac{P}{N}

[figure: training loss vs generalization loss over the course of learning]

Extremely wide network, P \to \infty: P \gg N

Local minimum = global minimum (Kawaguchi, 2019)

Learning curve for P \gg N

Double descent

Belkin et al., 2019; Hastie et al., 2019
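A small numerical illustration of the double-descent picture, under assumptions not on the slide: random-feature regression fitted by the minimum-norm least-squares solution with toy Gaussian data. Only the qualitative shape (test loss peaking near P ≈ N and descending again for P ≫ N) is the point.

```python
import numpy as np

# Double descent sketch: random-feature regression, minimum-norm solution.
# The data, feature map, and sizes are illustrative assumptions.
rng = np.random.default_rng(0)
N, d = 40, 10                                     # training samples, input dim
w_star = rng.normal(size=d)                       # teacher
X, X_te = rng.normal(size=(N, d)), rng.normal(size=(2000, d))
y = X @ w_star + 0.3 * rng.normal(size=N)         # noisy training targets
y_te = X_te @ w_star

def losses(P):
    """Fit P random features by minimum-norm least squares; return (L_emp, L_gen estimate)."""
    V = rng.normal(size=(d, P)) / np.sqrt(d)      # fixed random first layer
    Phi, Phi_te = np.tanh(X @ V), np.tanh(X_te @ V)
    theta = np.linalg.pinv(Phi) @ y               # minimum-norm solution
    return np.mean((Phi @ theta - y) ** 2), np.mean((Phi_te @ theta - y_te) ** 2)

for P in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
    tr, te = losses(P)
    print(f"P={P:5d}  L_emp={tr:.3f}  L_gen={te:.3f}")
# L_gen typically peaks near P = N and descends again for P >> N (double descent)
```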

Random Neural Network

Random is excellent !! Random is magic!!

Statistical dynamics; random codes

Random Deep Networks: Poole et al., 2016; Schoenholz et al., 2017; ...

Signal propagation; error back-propagation

Jacot et al.: neural tangent kernel


y = f(x, \theta), \qquad l(x, \theta) = \frac{1}{2}\,\big(y - f(x, \theta)\big)^2

l(x, \theta) = \frac{1}{2}\,\{e(x, \theta)\}^2, \qquad e(x, \theta) = f(x, \theta) - f(x, \theta^*)

\partial_\theta l(x, \theta) = e(x, \theta)\,\partial_\theta f(x, \theta), \qquad \partial_t \theta = -\eta\,\partial_\theta l

\partial_t f(x', \theta) = \partial_\theta f(x', \theta)\cdot\partial_t\theta = -\eta\,\partial_\theta f(x', \theta)\cdot\partial_\theta f(x, \theta)\,e(x, \theta)

K: Gaussian kernel

K(x, x'; \theta) = \partial_\theta f(x, \theta)\cdot\partial_\theta f(x', \theta)

\partial_t f(x, \theta) = -\eta\,\big\langle K(x, x'; \theta)\,e(x', \theta)\big\rangle

K(x, x'; \theta_t) \approx K(x, x'; \theta_{ini}) \approx K(x, x'): \ \text{Gaussian kernel (initial)}
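A sketch of the empirical tangent kernel K(x, x'; θ) = ∂_θ f(x)·∂_θ f(x') for a one-hidden-layer random network, checking that the kernel barely moves while gradient descent drives f(x, θ) toward a target. The width, tanh units, target value, and step size are assumptions.

```python
import numpy as np

# Neural tangent kernel sketch: K(x, x'; theta) = grad_theta f(x) . grad_theta f(x').
# Width, tanh nonlinearity, target, and learning rate are assumptions.
rng = np.random.default_rng(1)
n = 500
W1 = rng.normal(0, 1 / np.sqrt(2), (n, 2))
w2 = rng.normal(0, 1 / np.sqrt(n), n)

def f_and_grad(x, W1, w2):
    """Return f(x, theta) and the flattened gradient with respect to theta = (W1, w2)."""
    h = np.tanh(W1 @ x)
    return w2 @ h, np.concatenate([np.outer(w2 * (1 - h ** 2), x).ravel(), h])

def ntk(x, xp, W1, w2):
    return f_and_grad(x, W1, w2)[1] @ f_and_grad(xp, W1, w2)[1]

x, xp = np.array([1.0, 0.5]), np.array([-0.3, 0.8])
K0 = ntk(x, xp, W1, w2)

eta, target = 0.05, 1.0
for _ in range(300):
    f, g = f_and_grad(x, W1, w2)
    step = eta * (f - target) * g                 # d theta = -eta * e * grad_theta f
    W1 -= step[: W1.size].reshape(W1.shape)
    w2 -= step[W1.size:]

print(K0, ntk(x, xp, W1, w2))                     # K(x, x'; theta_t) stays close to its initial value
```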

Theorem (P \gg N): the optimal solution lies near a random network. (Bailey et al., 2019)

w_{ij} = O\!\left(\frac{1}{\sqrt{n}}\right) \ \text{(random)}, \qquad \Delta w_{ij} = O\!\left(\frac{1}{n}\right)

Random Neural Field

u^l(z') = \int w(z, z')\, x^l(z)\, dz + b(z'), \qquad x^{l+1}(z') = \varphi\big(u^l(z')\big)

w(z, z'): random (zero-mean Gaussian; correlated)

Statistical Neurodynamics

microdynamics:

\mathbf{x}(t+1) = T_W\,\mathbf{x}(t) = \mathrm{sgn}\big(W\mathbf{x}(t)\big)

macrodynamics (X: macrostate):

X_{t+1} = F(X_t)

X_2 = X(\mathbf{x}_2) = X(T_W\,\mathbf{x}_1), \qquad X_3 = X(\mathbf{x}_3) = X(T_W T_W\,\mathbf{x}_1)\ ?

(does the macroscopic law X_{t+1} = F(X_t) remain valid at later steps?)
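A toy simulation of the micro/macro picture. The network size, the ±1 state coding, and the overlap macrovariable are my choices for illustration: two nearby states evolve under x(t+1) = sgn(W x(t)) and the macroscopic overlap follows almost the same curve for different random W.

```python
import numpy as np

# Microdynamics x(t+1) = sgn(W x(t)); macrostate: overlap C_t = (1/n) x_t . x'_t.
# Network size, initial overlap, and the macrovariable choice are assumptions.
def overlap_trajectory(n=2000, steps=8, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 1 / np.sqrt(n), (n, n))
    x = np.sign(rng.normal(size=n))
    xp = x.copy()
    flip = rng.choice(n, size=n // 10, replace=False)
    xp[flip] *= -1                                # second state with 90% overlap
    C = []
    for _ in range(steps):
        C.append(x @ xp / n)
        x, xp = np.sign(W @ x), np.sign(W @ xp)   # microscopic update
    return np.round(C, 3)

# different microscopic networks, nearly identical macroscopic behavior
print(overlap_trajectory(seed=0))
print(overlap_trajectory(seed=1))
```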

Statistical Neurodynamics

Rozonoer (1969); Amari (1969, 1971, 1973)

Sompolinsky; Amari et al. (2013); Toyoizumi et al. (2015); Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); Karakida et al. (2019); Jacot et al. (2019); ...

w_{ij} \sim N(0, 1)

Macroscopic behaviors common to almost all (typical) networks

Random Deep Networks

x_i^{l+1} = \varphi\Big(\sum_j w_{ij}^l\, x_j^l + b_i^l\Big)

w_{ij}^l \sim N\big(0,\ \sigma_w^2/n\big), \qquad b_i^l \sim N\big(0,\ \sigma_b^2\big)

A^l = \frac{1}{n}\sum_i \big(x_i^l\big)^2, \qquad A^{l+1} = F(A^l)

Macroscopic variables

activity: \quad A^l = \frac{1}{n}\sum_i \big(x_i^l\big)^2

distance: \quad D^l = D[\mathbf{x} : \mathbf{x}']

A^{l+1} = F(A^l), \qquad D^{l+1} = K(D^l)

metric, curvature & Fisher information

Dynamics of Activity: law of large numbers

x_i^{l+1} = \varphi\Big(\sum_k w_{ik} x_k^l + b_i\Big) = \varphi(u_i), \qquad \mathbf{u} = W\mathbf{x}^l + \mathbf{b}, \qquad u_i \sim N\big(0,\ \sigma_w^2 A^l + \sigma_b^2\big)

A^{l+1} = \frac{1}{n}\sum_i \varphi(u_i)^2 = E\big[\varphi(u)^2\big] = F(A^l) = \int \varphi\big(\sqrt{\sigma_w^2 A^l + \sigma_b^2}\; v\big)^2\, Dv, \qquad v \sim N(0, 1)

A^l \to A^*: \ \text{the activity converges}; \qquad \chi_0 = \sigma_w^2\,\varphi'(0)^2 > 1 \ \Rightarrow\ A^* > 0
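A sketch comparing the activity A^l of an actual random deep network with the mean-field map F(A) evaluated by Monte Carlo. Width, depth, tanh units, and the values of (σ_w, σ_b) are assumptions.

```python
import numpy as np

# Activity dynamics A^{l+1} = F(A^l): simulation versus mean-field prediction.
# Width, depth, tanh units, and (sigma_w, sigma_b) are assumptions.
rng = np.random.default_rng(0)
n, L, sigma_w, sigma_b = 2000, 10, 1.3, 0.1
phi = np.tanh

def F(A, n_mc=100000):
    """Mean-field map: F(A) = E[phi(u)^2], u ~ N(0, sigma_w^2 A + sigma_b^2)."""
    u = rng.normal(0, np.sqrt(sigma_w ** 2 * A + sigma_b ** 2), n_mc)
    return np.mean(phi(u) ** 2)

x = rng.normal(size=n)
A_net, A_mf = [x @ x / n], [x @ x / n]
for l in range(L):
    W = rng.normal(0, sigma_w / np.sqrt(n), (n, n))
    b = rng.normal(0, sigma_b, n)
    x = phi(W @ x + b)                            # one random layer
    A_net.append(x @ x / n)                       # empirical activity
    A_mf.append(F(A_mf[-1]))                      # law-of-large-numbers prediction

print(np.round(A_net, 3))
print(np.round(A_mf, 3))                          # the two sequences agree for large n
```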

Pullback Metric & Curvature

\mathbf{x}^l = \varphi\big(W^l \mathbf{x}^{l-1}\big)

ds^2 = \sum_{ij} g_{ij}^l\, dx^i dx^j = \frac{1}{n}\, d\mathbf{x}^l\cdot d\mathbf{x}^l

Basis vectors (Jacobian):

dx_i^l = \varphi'(u_i^l)\sum_j W_{ij}^l\, dx_j^{l-1} = \sum_j B_{ij}^l\, dx_j^{l-1}, \qquad B_{ij}^l = \varphi'(u_i^l)\, W_{ij}^l

d\mathbf{x}^L = B^L B^{L-1}\cdots B^1\, d\mathbf{x}^0

\mathbf{e}_a = \frac{\partial \mathbf{x}^l}{\partial x_a}, \qquad g_{ab}^l = \frac{1}{n}\,\mathbf{e}_a\cdot\mathbf{e}_b

Dynamics of Metric

dx_a^l = \sum_k B_{ak}\, dx_k^{l-1}, \qquad B_{ak} = \varphi'(u_a)\, w_{ak}

\mathbf{e}_a^l = B^l\,\mathbf{e}_a^{l-1}, \qquad g_{ab}^l = \frac{1}{n}\,\mathbf{e}_a^l\cdot\mathbf{e}_b^l

E\big[\varphi'(u_a)^2\, w_{ak} w_{aj}\big] \approx E\big[\varphi'(u_a)^2\big]\, E\big[w_{ak} w_{aj}\big] \quad \text{(mean-field approximation)}

\chi(A) = \sigma_w^2 \int \varphi'\big(\sqrt{A}\, v\big)^2\, Dv, \qquad g_{ab}^l = \chi(A)\, g_{ab}^{l-1}

Metric

g_{ab}^l = \frac{1}{n}\,\mathbf{e}_a^l\cdot\mathbf{e}_b^l, \qquad ds^2 = \sum_{ab} g_{ab}^l\, dx^a dx^b

E\big[\varphi'(u_i^l)^2\, w_{ia}^l\, w_{ib}^l\big] \approx \sigma_w^2\, E\big[\varphi'(u^l)^2\big]\,\frac{\delta_{ab}}{n} \qquad \text{(law of large numbers)}

\chi^l = \sigma_w^2\, E\big[\varphi'(u^l)^2\big]

g_{ij}^l(\mathbf{x}) = \prod_{l'=1}^{l} \chi\big(A^{l'}(\mathbf{x})\big)\; g_{ij}(\mathbf{x}): \quad \text{conformal geometry}

conformal transformation!

\tilde g_{ij}(\mathbf{x}) = \chi\big(A(\mathbf{x})\big)\, g_{ij}(\mathbf{x}) \qquad \text{(rotation, expansion)}

g_{ij} = \delta_{ij} \ \Rightarrow\ g_{ij}^l = \prod_{l'=1}^{l}\chi^{l'}\,\delta_{ij}
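A numerical check of the conformal (pure expansion) picture, under assumed width, depth, σ_w, and tanh units: a few orthonormal tangent vectors are pushed through the layer Jacobians B^l = diag(φ'(u^l)) W^l, and their Gram matrix comes out close to (∏_l χ^l) times the identity.

```python
import numpy as np

# Metric dynamics sketch: push tangent vectors through B^l = diag(phi'(u)) W^l
# and compare their Gram matrix with the conformal factor prod_l chi(A^l).
# Width, depth, sigma_w, and the tanh nonlinearity are assumptions (no bias).
rng = np.random.default_rng(0)
n, L, k, sigma_w = 2000, 5, 3, 1.2
phi = np.tanh
dphi = lambda u: 1.0 / np.cosh(u) ** 2

x = rng.normal(size=n)
E, _ = np.linalg.qr(rng.normal(size=(n, k)))      # k orthonormal tangent vectors e_a
chi_prod = 1.0
for l in range(L):
    W = rng.normal(0, sigma_w / np.sqrt(n), (n, n))
    u = W @ x
    A = x @ x / n
    # chi(A) = sigma_w^2 E[phi'(u)^2] with u ~ N(0, sigma_w^2 A)
    chi = sigma_w ** 2 * np.mean(dphi(rng.normal(0, np.sqrt(sigma_w ** 2 * A), 200000)) ** 2)
    chi_prod *= chi
    E = dphi(u)[:, None] * (W @ E)                # e_a <- B^l e_a
    x = phi(u)

print(round(chi_prod, 3))
print(np.round(E.T @ E, 3))                       # approx. chi_prod * identity: conformal expansion
```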

Domino Theorem

B_{ij}^l = \frac{\partial x_i^l}{\partial x_j^{l-1}}, \qquad \frac{\partial \mathbf{x}^L}{\partial \mathbf{x}^0} = B^L B^{L-1}\cdots B^1

E\Big[\sum_i B_{ia}^l B_{ib}^l\Big] = \chi^l\,\delta_{ab}

E\Big[\big(B^L\cdots B^1\big)^{\!\top}\big(B^L\cdots B^1\big)\Big]_{ab} = \chi^L \chi^{L-1}\cdots\chi^1\,\delta_{ab}

(the expectations factorize layer by layer, like a row of falling dominoes)

Dynamics of Curvature

H_{ab}^i = \nabla_a \mathbf{e}_b = \partial_a\partial_b\, x^i

H_{ab}^i = \varphi''(u_i)\,(\mathbf{w}_i\cdot\mathbf{e}_a)(\mathbf{w}_i\cdot\mathbf{e}_b) + \varphi'(u_i)\,\mathbf{w}_i\cdot\partial_a\mathbf{e}_b

H_{ab} = H_{ab}^{\perp} + H_{ab}^{\parallel}, \qquad \text{curvature: } |H_{ab}^{\perp}|

\chi_2(A) = \sigma_w^2 \int \varphi''\big(\sqrt{A}\,v\big)^2\, Dv, \qquad v \sim N(0, 1)

\big(H_{ab}^{l+1}\big)^2 = \frac{1}{n}\,\chi_2\, A^l\,(\cdots) + \chi_1^2\,\big(H_{ab}^l\big)^2

exponential expansion (\chi_1 > 1)!  creation is small!

Poole et al. (2016): deep neural networks

Distance

D[\mathbf{x}, \mathbf{y}] = \frac{1}{n}\sum_i (x_i - y_i)^2

Dynamics of Distance (Amari, 1974)

D(\mathbf{x}, \mathbf{x}') = \frac{1}{n}\sum_i (x_i - x_i')^2

C(\mathbf{x}, \mathbf{x}') = \frac{1}{n}\,\mathbf{x}\cdot\mathbf{x}', \qquad D = A + A' - 2C

u = \mathbf{w}\cdot\mathbf{x}, \quad u' = \mathbf{w}\cdot\mathbf{x}', \qquad (u, u') \sim N(0, V), \quad V = \begin{pmatrix} A & C \\ C & A' \end{pmatrix}

\bar C = E\big[\varphi\big(\sqrt{C}\,\varepsilon + \sqrt{A - C}\,\nu\big)\,\varphi\big(\sqrt{C}\,\varepsilon + \sqrt{A' - C}\,\nu'\big)\big]

D^{l+1} = K(D^l), \qquad \frac{dD^{l+1}}{dD^l} = \chi_1 > 1
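A sketch comparing the empirical distance D^l on a random network with the mean-field recursion through (A, A', C). Width, depth, σ_w, tanh units, and the initial angle between the two inputs are assumptions.

```python
import numpy as np

# Distance dynamics sketch: empirical D^l on a random network versus the
# mean-field recursion through (A, A', C).
rng = np.random.default_rng(2)
n, L, sigma_w = 2000, 12, 1.5
phi = np.tanh

def mf_step(A, Ap, C, n_mc=200000):
    """One layer of the (A, A', C) map with jointly Gaussian pre-activations."""
    eps, nu, nup = rng.normal(size=(3, n_mc))
    u = sigma_w * (np.sqrt(C) * eps + np.sqrt(A - C) * nu)
    up = sigma_w * (np.sqrt(C) * eps + np.sqrt(Ap - C) * nup)
    return np.mean(phi(u) ** 2), np.mean(phi(up) ** 2), np.mean(phi(u) * phi(up))

x = rng.normal(size=n); x /= np.sqrt(x @ x / n)          # unit activity
z = rng.normal(size=n); z -= (z @ x) / (x @ x) * x; z /= np.sqrt(z @ z / n)
xp = np.cos(0.3) * x + np.sin(0.3) * z                   # nearby second input

A, Ap, C = x @ x / n, xp @ xp / n, x @ xp / n
D_net, D_mf = [], []
for l in range(L):
    W = rng.normal(0, sigma_w / np.sqrt(n), (n, n))
    x, xp = phi(W @ x), phi(W @ xp)
    D_net.append(np.mean((x - xp) ** 2))                 # empirical D^l
    A, Ap, C = mf_step(A, Ap, C)
    D_mf.append(A + Ap - 2 * C)                          # D = A + A' - 2C

print(np.round(D_net, 3))
print(np.round(D_mf, 3))   # distances approach a common value D* (equi-distance)
```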

Problem!

D^l(\mathbf{x}, \mathbf{x}') \to D^*\quad (l \to \infty): \ \text{equi-distance property}

D^{l+1} = K(D^l)

dynamics of distance: \ \lim_{L\to\infty} D^L(\mathbf{x}, \mathbf{y})

\lim_{n\to\infty}\lim_{L\to\infty} D^L(\mathbf{x}, \mathbf{y}) \ \ne\ \lim_{L\to\infty}\lim_{n\to\infty} D^L(\mathbf{x}, \mathbf{y})

Feedback Path

Error backprop; Fisher information

Stochastic model: the parameter space is a manifold of probability distributions

y = \varphi\big(W_L\,\varphi(W_{L-1}\cdots\varphi(W_1\mathbf{x})\cdots)\big) + \varepsilon, \qquad \varepsilon \sim N(0, 1)

p(y, \mathbf{x}; W) = c\,\exp\Big\{-\frac{1}{2}\big(y - \varphi(\mathbf{x}; W)\big)^2\Big\}\, q(\mathbf{x})

G = E\big[\nabla_W \log p(y, \mathbf{x}; W)\;\nabla_W \log p(y, \mathbf{x}; W)^{\!\top}\big], \qquad ds^2 = dW^{\!\top} G\, dW

Learning: stochastic gradient descent
Steepest direction: natural gradient

\nabla l = \Big(\frac{\partial l}{\partial\theta_1}, \ldots, \frac{\partial l}{\partial\theta_n}\Big), \qquad \tilde\nabla l = G^{-1}\nabla l

\Delta\theta_t = -\eta_t\,\nabla l(x_t, y_t; \theta_t)

[figure: loss l(\theta) over the parameter space, with update step d\theta]

Natural Gradient

\max\ dl = l(\theta + d\theta) - l(\theta) = \nabla l\cdot d\theta \quad \text{subject to } \mathrm{KL}\big[p(x, \theta) : p(x, \theta + d\theta)\big] = \varepsilon^2

\Rightarrow\ \tilde\nabla l = G^{-1}\nabla l

\Delta\theta_t = -\eta_t\,\tilde\nabla l(x_t, y_t; \theta_t)
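A sketch of one natural-gradient learning loop for a small regression network with Gaussian output noise, where G is estimated as E[∇_θ f ∇_θ f^T]. The architecture, toy data, damping term, and step size are assumptions.

```python
import numpy as np

# Natural gradient sketch: Delta theta = -eta * G^{-1} grad l, with the Fisher
# information of a unit-variance Gaussian output model, G = E[grad f grad f^T].
rng = np.random.default_rng(3)
n_in, n_hid = 3, 10
P = n_hid * n_in + n_hid

def f_and_grad(x, theta):
    W1, w2 = theta[: n_hid * n_in].reshape(n_hid, n_in), theta[n_hid * n_in:]
    h = np.tanh(W1 @ x)
    return w2 @ h, np.concatenate([np.outer(w2 * (1 - h ** 2), x).ravel(), h])

theta = rng.normal(0, 0.5, P)
X = rng.normal(size=(200, n_in))
y = np.sin(X.sum(axis=1))                         # toy regression targets
eta, damping = 0.2, 1e-3

for step in range(100):
    out = [f_and_grad(x, theta) for x in X]
    preds = np.array([f for f, _ in out])
    grads = np.array([g for _, g in out])
    grad_l = grads.T @ (preds - y) / len(X)       # ordinary gradient of (1/2) E[(f - y)^2]
    G = grads.T @ grads / len(X)                  # Fisher information estimate
    theta -= eta * np.linalg.solve(G + damping * np.eye(P), grad_l)   # natural-gradient step

print(np.mean((preds - y) ** 2))                  # training loss after natural-gradient learning
```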

Information Geometry of MLP

Natural Gradient Learning: S. Amari; H. Y. Park

\Delta\theta_t = -\eta\,\hat G_t^{-1}\,\frac{\partial l}{\partial\theta}

\hat G_{t+1}^{-1} = (1 + \varepsilon_t)\,\hat G_t^{-1} - \varepsilon_t\,\hat G_t^{-1}\nabla f\,(\nabla f)^{\!\top}\hat G_t^{-1}
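A sketch of the adaptive (online) estimate of the inverse Fisher matrix from the slide, applied to a toy one-neuron model. The model, data, and the constants η and ε are placeholder assumptions.

```python
import numpy as np

# Online natural gradient with the adaptive inverse-Fisher recursion from the slide:
#   G^{-1}_{t+1} = (1 + eps) G^{-1}_t - eps (G^{-1}_t grad_f)(G^{-1}_t grad_f)^T
# The toy model f(x, theta) = tanh(theta . x), the teacher, and eta, eps are assumptions.
rng = np.random.default_rng(4)
P = 20
theta = rng.normal(0, 0.3, P)
G_inv = np.eye(P)
eta, eps = 0.05, 0.01

def f_and_grad(x, theta):
    s = np.tanh(theta @ x)
    return s, (1 - s ** 2) * x                    # f and grad_theta f

for t in range(3000):
    x = rng.normal(size=P)
    y = np.tanh(0.5 * x.sum())                    # teacher signal
    f, df = f_and_grad(x, theta)
    v = G_inv @ df
    G_inv = (1 + eps) * G_inv - eps * np.outer(v, v)   # adaptive inverse-Fisher update
    theta -= eta * G_inv @ ((f - y) * df)              # Delta theta = -eta * G^{-1} grad l
```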

Fisher Information

G^{lm} = E\Big[\frac{\partial\varphi}{\partial W^l}\Big(\frac{\partial\varphi}{\partial W^m}\Big)^{\!\top}\Big]

\frac{\partial\varphi}{\partial W^l} = \varphi'(u^l)\; B^L B^{L-1}\cdots B^{l+1}\;\mathbf{x}^{l-1}

G\big(W^l, W^l\big) = E[\cdots] + O_p(1/\sqrt{n})

G\big(w_i^l, w_j^m\big) \approx 0: \ O_p(1/\sqrt{n}) \quad (l \ne m)

G\big(w_i^l, w_j^l\big) \approx 0: \ O_p(1/\sqrt{n}) \quad (i \ne j)

Unitwise natural gradient

\Delta W = -\eta\, G^{-1}\nabla_W l

Y. Ollivier; Marceau-Caron

Good news and bad news

G*: unitwise-diagonal matrix

G \to G^*: \quad n \to \infty

G^{-1} \to G^{*-1}: \quad n \to \infty

Karakida theory: eigenvalues of G

\frac{1}{P}\sum_i \lambda_i = O(1/n), \qquad \frac{1}{P}\sum_i \lambda_i^2 = O(1)

distorted Riemannian metric G
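A quick empirical look at the eigenvalue spread of G for a small random network (sizes, tanh units, and the Gaussian input distribution are assumptions): most eigenvalues are tiny while a few are large, which is the distortion the slide refers to.

```python
import numpy as np

# Eigenvalue spectrum of the Fisher information G = E[grad f grad f^T] for a
# random one-hidden-layer network.  Sizes and the input distribution are assumptions.
rng = np.random.default_rng(5)
n_in, n_hid = 10, 30
W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))
w2 = rng.normal(0, 1 / np.sqrt(n_hid), n_hid)

def grad_f(x):
    h = np.tanh(W1 @ x)
    return np.concatenate([np.outer(w2 * (1 - h ** 2), x).ravel(), h])

grads = np.array([grad_f(x) for x in rng.normal(size=(2000, n_in))])
lam = np.linalg.eigvalsh(grads.T @ grads / len(grads))
print(f"mean eigenvalue {lam.mean():.4f}, max eigenvalue {lam.max():.2f}")
print(f"fraction below 1% of the max: {np.mean(lam < 0.01 * lam.max()):.2f}")
# most eigenvalues are close to zero, a few are large: a distorted Riemannian metric
```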

References:

Poole, ..., Ganguli (2016); Schoenholz et al. (2017); Yang & Schoenholz (2017); ...

S. Amari, R. Karakida & M. Oizumi, Statistical neurodynamics of deep networks: Geometry of signal spaces. arXiv:1808.07169v1, 2018.

S. Amari, R. Karakida & M. Oizumi, Fisher information and natural gradient learning of random deep networks. arXiv:1808.07172v1, 2018 (AISTATS 2019).

R. Karakida, S. Akaho & S. Amari, Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv:1806.01316, 2018 (AISTATS 2019).