1

Weak Convergence of Random Free Energy in Information Theory

Sumio Watanabe, Tokyo Institute of Technology

2

Contents

1. Background

2. Main Theorem

3. Outline of Proof

4. Applications and Future Study

Identification Problem ≡ Math. Phys. with Random Hamiltonian

3

Example : Classical Spin System

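Background (1)

[Figure: two spin systems, labeled "Unknown" and "Learner", each with visible units x and hidden units; spins s_i and s_j are coupled by a weight w_ij. The unknown system emits samples and the learner learns from them.]

$p(x|w) = \frac{1}{Z(w)} \sum_{\mathrm{Hidden}} \exp\Bigl( -\sum_{(i,j)} w_{ij} s_i s_j \Bigr)$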
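The marginal over hidden spins can be evaluated by brute force for a very small system. The following is a minimal sketch, assuming spins take values in {-1, +1}, the pair sum runs over i < j, and the first `n_visible` spins are the visible ones; the function names and the example couplings are hypothetical and not from the slides.

```python
import itertools
import math

def energy(spins, weights):
    """Sum over pairs i < j of w_ij * s_i * s_j (assumed pair convention)."""
    n = len(spins)
    return sum(weights[i][j] * spins[i] * spins[j]
               for i in range(n) for j in range(i + 1, n))

def visible_distribution(weights, n_visible):
    """Return {x: p(x|w)} by summing exp(-energy) over all hidden states."""
    n_hidden = len(weights) - n_visible
    unnormalized = {}
    for x in itertools.product((-1, 1), repeat=n_visible):
        unnormalized[x] = sum(
            math.exp(-energy(x + h, weights))
            for h in itertools.product((-1, 1), repeat=n_hidden))
    z = sum(unnormalized.values())  # partition function Z(w)
    return {x: value / z for x, value in unnormalized.items()}

# Hypothetical couplings: visible spins 0 and 1, hidden spin 2.
example_weights = [[0.0, 0.0, -1.0],
                   [0.0, 0.0, -1.0],
                   [0.0, 0.0, 0.0]]
example_p = visible_distribution(example_weights, n_visible=2)
```

With these couplings the hidden spin favors aligned visible spins, so `example_p[(1, 1)]` exceeds `example_p[(1, -1)]`. The enumeration costs 2^(number of spins) terms, so it only serves as an illustration.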

4

Identification Problem

Background (2)

Classical unknown information source: $q(x)$

Observation: $X_1, X_2, \dots, X_n$

Learning system: $p(x|w)$, $\varphi(w)$

Estimated distribution: $p(x \mid X_1, X_2, \dots, X_n)$

Relative entropy: $D(q\|p) \equiv \int q\,[\log q - \log p]\,dx = \;?$

5

Random Free Energy and Relative Entropy

Background (3)

Definition. Random Free Energy = log-Likelihood of System

$F(X_1, X_2, \dots, X_n) \equiv -\log \int p(X_1|w)\, p(X_2|w) \cdots p(X_n|w)\, \varphi(dw) + \sum_{i=1}^{n} \log q(X_i)$

Relation between F and D(q||p)

$D\bigl( q(X_{n+1}) \,\|\, p(X_{n+1} \mid X_1, X_2, \dots, X_n) \bigr) = E_{X_{n+1}}\bigl[ F(X_1, X_2, \dots, X_{n+1}) \bigr] - F(X_1, X_2, \dots, X_n)$
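The relation between F and D(q||p) can be checked in a few lines. This is a sketch under two assumptions: the estimated distribution is the Bayes predictive distribution, and F includes the term $\sum_i \log q(X_i)$.

```latex
\begin{align*}
p(x \mid X_1,\dots,X_n)
  &= \frac{\int p(x|w)\,\prod_{i=1}^{n} p(X_i|w)\,\varphi(dw)}
          {\int \prod_{i=1}^{n} p(X_i|w)\,\varphi(dw)}, \\
F(X_1,\dots,X_{n+1}) - F(X_1,\dots,X_n)
  &= \log q(X_{n+1}) - \log p(X_{n+1} \mid X_1,\dots,X_n), \\
E_{X_{n+1}}\bigl[F(X_1,\dots,X_{n+1})\bigr] - F(X_1,\dots,X_n)
  &= \int q(x)\,\bigl[\log q(x) - \log p(x \mid X_1,\dots,X_n)\bigr]\,dx .
\end{align*}
```

The second line follows because the ratio of the two marginal likelihoods is exactly the predictive density at $X_{n+1}$; averaging over $X_{n+1} \sim q$ gives the relative entropy in the third line.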

6

Identifiability and Singularities

Background (4)

A learning system $p(x|w)$ is called identifiable if

$p(x|w_1) = p(x|w_2)\ (\forall x) \Rightarrow w_1 = w_2$

Remark. A system which identifies the structure is non-identifiable.

$W = \{w\}$, $w_1 \sim w_2 \Leftrightarrow p(x|w_1) = p(x|w_2)\ (\forall x)$

$W/\!\sim$ is not a manifold because $\{ w \,;\, p(x|w) = p(x|w_0) \}$ is an analytic set with singularities.

7

Mathematical Definitions

Main Theorem (1)

$X$ : a random variable on $\mathbb{R}^N$ with p.d.f. $q(x)$.

$W$ : a real d-dimensional manifold.

$L^2(q) = \{ f \,;\, \int f(x)^2 q(x)\,dx < \infty \}$ : real Hilbert space.

$\varphi(w)$ : a p.d.f. on $W$, a $C_0^\infty$-class function.

$\varphi(w)\,dw$ : probability distribution on $W$.

8

Mathematical Definitions

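Main Theorem (2)

Given $X_1, X_2, \dots, X_n$ : i.i.d., the Random Free Energy is

$F = -\log \int \exp\Bigl( -\sum_{i=1}^{n} H(X_i, w) \Bigr) \varphi(w)\,dw$

$H(\cdot, w)$ : an $L^2(q)$-valued real analytic function on $W$ s.t.

$E_X[e^{-H(X,w)}] = 1\ (\forall w)$. [ $\Rightarrow K(w) \equiv E_X[H(X,w)] \ge 0$ ]

e.g. $H(x,w) = \log q(x) - \log p(x|w)$

$W_0 \equiv \{ w \in \mathrm{supp}\,\varphi \,;\, K(w) = 0 \} \ne \emptyset$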

9

Gel’fand’s Zeta function

Main Theorem (3)

The zeta function

$\zeta(z) = \int K(w)^z \varphi(w)\,dw$ : holomorphic in $\mathrm{Re}(z) > 0$.

Difficulty : $\{ w \,;\, K(w) = 0 \}$ is an analytic set with singularities.

Theorem (Atiyah, Sato, Bernstein, Björk, Kashiwara, 1970-1980)

(1) $\zeta(z)$ can be analytically continued to a meromorphic function on the entire complex plane.

(2) All poles are real, negative, and rational numbers.

Poles: $0 > -\lambda_1 > -\lambda_2 > -\lambda_3 > \cdots$

Orders: $m_1, m_2, m_3, \dots$
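A one-dimensional worked example may help fix ideas. It assumes $K(w) = w^2$ and, for simplicity, a uniform density $\varphi(w) = 1/2$ on $[-1, 1]$; this prior is not of class $C_0^\infty$, so the example is only an illustration and not a case covered by the stated definitions.

```latex
\begin{align*}
\zeta(z) &= \int_{-1}^{1} (w^2)^z \,\tfrac{1}{2}\,dw
          = \int_{0}^{1} w^{2z}\,dw
          = \frac{1}{2z+1}
          \qquad (\mathrm{Re}(z) > -\tfrac{1}{2}).
\end{align*}
```

The right-hand side is meromorphic on the whole complex plane with a single pole at $z = -1/2$ of order 1, so in this example $\lambda_1 = 1/2$ and $m_1 = 1$.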

10

Main Theorem

Main Theorem (4)

$F - \lambda_1 \log n + (m_1 - 1) \log\log n \to F^* \quad (n \to \infty)$

The convergence in law holds, where $F^*$ can be represented by a limit process of an empirical process on $W_0$.

Corollary. If $E[D(q\|p)]$ has an asymptotic expansion, then

$E[D(q\|p)] = \frac{\lambda_1}{n} + o\Bigl(\frac{1}{n}\Bigr)$
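The scaling of F with log n can be observed numerically in a simple regular case. The sketch below assumes a hypothetical model $p(x|w) = N(w, 1)$, true distribution $q = N(0, 1)$, and a uniform prior on $[-1, 1]$ (which is not $C_0^\infty$), so that $H(x,w) = w^2/2 - xw$, $K(w) = w^2/2$, $\lambda_1 = 1/2$ and $m_1 = 1$. The integral is approximated by the trapezoidal rule on a grid; none of these choices come from the slides.

```python
import math
import random

def free_energy(samples, grid_size=20001):
    """F = -log of the integral of exp(-sum_i H(X_i, w)) * phi(w) over [-1, 1]."""
    n = len(samples)
    total = sum(samples)
    step = 2.0 / (grid_size - 1)
    # For p(x|w) = N(w, 1) and q = N(0, 1):
    # sum_i H(X_i, w) = n * w^2 / 2 - w * sum_i X_i.
    exponents = []
    for j in range(grid_size):
        w = -1.0 + j * step
        exponents.append(-(n * w * w / 2.0) + w * total)
    peak = max(exponents)  # subtracted for numerical stability
    values = [math.exp(e - peak) for e in exponents]
    # Trapezoidal rule; the uniform prior density on [-1, 1] is 1/2.
    integral = step * (sum(values) - 0.5 * (values[0] + values[-1])) * 0.5
    return -(peak + math.log(integral))

rng = random.Random(0)
n = 10000
samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Subtract lambda_1 * log n with lambda_1 = 1/2; no log log n term since m_1 = 1.
centered = free_energy(samples) - 0.5 * math.log(n)
```

In this regular case Laplace's method suggests that `centered` is close to $-\log(\sqrt{2\pi}/2) - S^2/(2n)$ with $S = \sum_i X_i$, an order-one random quantity, which is the kind of limit variable the theorem describes. Singular models would need the resolution step and are not covered by this sketch.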

11

Hironaka Resolution Theorem

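Proof Outline (1)

[Figure: the resolution map $g$ from $U$ to $W$. In $W$, $K(w)$ vanishes on $W_0$; the corresponding set in $U$ is $U_0$.]

Locally, $K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$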

12

Resolution Theorem

Proof Outline (2)

H. Hironaka (1964), M. F. Atiyah (1970)

Let $K(w) \ge 0$ be a real analytic function defined in a neighborhood of $0 \in W \subset \mathbb{R}^d$. Then there exist an open set $W$, a real analytic manifold $U$, and a proper analytic map $g : U \to W$ such that

(1) $g : U - U_0 \to W - W_0$ is an isomorphism.

(2) For each $P \in U$, there are local coordinates $(u_1, u_2, \dots, u_d)$ centered at $P$ so that locally near $P$

$K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$

where $a(u) > 0$ is an analytic function and each $s_i \ge 0$ is an integer.

13

Division of Partition Function

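Proof Outline (3)

Because $\mathrm{supp}\,\varphi$ is compact and $g$ is a proper map, we can assume $W = \bigcup_\alpha U_\alpha$ (a finite union; the overlaps have measure zero).

In each $U_\alpha$,

$K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$

$\varphi(u) = \sum b(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$

(Both $s_i$ and $k_i$ depend on $\alpha$.)

$\exp(-F) = \sum_\alpha \int_{U_\alpha} \exp\Bigl[ -\sum_{i=1}^{n} H(X_i, g(u)) \Bigr] \varphi(u)\,du$

Hereafter, $\alpha$ is omitted and $K(u) \equiv K(g(u))$ is used.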

14

B-function

Proof Outline (4)

The zeta function $\zeta(z) = \int K(w)^z \varphi(w)\,dw$

$\exists P(w, \partial_w, z),\ \exists b(z)$ s.t. $P(w, \partial_w, z)\, K(w)^{z+1} = b(z)\, K(w)^z$

Analytic continuation is carried out using b-function.

If K(w) is a polynomial, then there exists an algorithm to calculate b(z). (Oaku, 1997).
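As a small illustration of how the b-function yields the continuation, consider the assumed one-dimensional case $K(w) = w^2$ with a compactly supported smooth $\varphi$. The operator $P$ and the polynomial $b(z)$ below are worked out by hand for this example only.

```latex
\begin{align*}
P = \tfrac{1}{4}\,\partial_w^2 : \qquad
\tfrac{1}{4}\,\partial_w^2\, (w^2)^{z+1}
  &= \tfrac{1}{4}(2z+2)(2z+1)\, w^{2z}
   = (z+1)\bigl(z+\tfrac{1}{2}\bigr)\,(w^2)^{z}, \\
b(z) &= (z+1)\bigl(z+\tfrac{1}{2}\bigr), \\
\zeta(z) &= \frac{1}{b(z)} \int \bigl(P\,K(w)^{z+1}\bigr)\,\varphi(w)\,dw
          = \frac{1}{b(z)} \int K(w)^{z+1}\,\bigl(P^{*}\varphi\bigr)(w)\,dw .
\end{align*}
```

The last integral is holomorphic wherever a zeta function with exponent $z+1$ is, so each application moves the domain one unit to the left. Candidate poles therefore sit at the zeros of $b(z)$ and their shifts by negative integers; some candidates may cancel.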

15

Ideals of Local Analytic functions

Proof Outline (5)

Lemma 1. Let $u \mapsto H(\cdot, u)$ be a real analytic function in $U$. There exist an open set $U' \subset U$ and a finite set of analytic functions $\{ g_j(u), h_j(\cdot, u) \,;\, j = 1, 2, \dots, J \}$ in $U'$ s.t.

(1) $T(u) \ge I\ (\forall u \in U')$, where $T_{jk}(u) \equiv \int h_j(x,u)\, h_k(x,u)\, q(x)\,dx$

(2) $H(x,u) = \sum_{j=1}^{J} g_j(u)\, h_j(x,u)$

16

Decomposition of Hamiltonian

Proof Outline (6)

Random Hamiltonian:

$\sum_{i=1}^{n} H(X_i, u) = nK(u) + (nK(u))^{1/2} \sigma_n(u)$

$\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$

$r(x,u) \equiv \frac{H(x,u) - K(u)}{K(u)^{1/2}}$

By Lemma 1 and $K(u) = \int \{ H(x,u) + e^{-H(x,u)} - 1 \}\, q(x)\,dx$, $r(x,u)$ is well defined even if $K(u) = 0$.

17

Donsker’s Empirical Process

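Proof Outline (7)

Central limit theorem in Banach space

$\sigma_n(\cdot) \to \sigma(\cdot)$ (empirical process $\to$ tight Gaussian process)

$\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$

$E_{X_1, X_2, \dots, X_n}[ f(\sigma_n) ] \to E_\sigma[ f(\sigma) ]$

($\forall f$ : a bounded continuous functional on $L^\infty(\mathrm{supp}\,\varphi)$)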

18

Poles of Zeta function

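Proof Outline (8)

$\zeta(z) = \sum \int K(u)^z \varphi(u)\,du$

$K(u) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$

$\varphi(u) = \sum b(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$

$\lambda = \min_j \frac{k_j + 1}{2 s_j}$

$m = \#\Bigl\{ j \,;\, \lambda = \frac{k_j + 1}{2 s_j} \Bigr\}$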
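For a single monomial term in a single chart, the pole and its order follow from the exponents alone, since $\int_0^1 u^{2sz+k}\,du = 1/(2sz+k+1)$ has its pole at $z = -(k+1)/(2s)$. The helper below is a minimal sketch of that bookkeeping; the function name is hypothetical, coordinates with $s_j = 0$ are skipped because they produce no pole, and the overall $\lambda$ would still require taking the minimum over all terms and charts.

```python
from fractions import Fraction

def leading_pole(s, k):
    """Return (lambda, m) for K(u) = a(u) * prod_j u_j^(2 s_j) and a prior
    term proportional to prod_j u_j^(k_j) in one local chart."""
    # Each coordinate with s_j > 0 contributes a pole at z = -(k_j + 1) / (2 s_j).
    ratios = [Fraction(kj + 1, 2 * sj) for sj, kj in zip(s, k) if sj > 0]
    if not ratios:
        raise ValueError("all s_j are zero: K does not vanish in this chart")
    lam = min(ratios)
    m = ratios.count(lam)  # number of coordinates attaining the minimum
    return lam, m

# K(u) = u1^2 * u2^2 with a constant prior term: lambda = 1/2, m = 2.
example_equal = leading_pole(s=(1, 1), k=(0, 0))
# K(u) = u1^2 * u2^4 with a constant prior term: lambda = 1/4, m = 1.
example_unequal = leading_pole(s=(1, 2), k=(0, 0))
```

Exact rational arithmetic is used so that ties between coordinates are detected reliably when counting the order m.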

19

Zeta function and State Density

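Proof Outline (9)

$u = (u, v)$, where the first block is $u = (u_j)_{j \in J}$ and $J$ is the set of indices attaining the min.

$L(u) \equiv \prod_{j \in J} u_j^{2 s_j}$

Partial zeta: $\iint L(u)^z\, \varphi(u,v)\,du\,dv$ : pole $-\lambda$, order $m$

Inverse Mellin transform

State density: $\iint \delta(t - L(u))\, \varphi(u,v)\,du\,dv = t^{\lambda-1} (-\log t)^{m-1} \int \varphi(0,v)\,dv \quad (t \to 0)$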

20

Partition function and Empirical Process

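Proof Outline (10)

Partition function ← State Density ← Zeta function

$Z = \iint \exp\bigl( -nK(u,v) + (nK(u,v))^{1/2} \sigma_n(u,v) \bigr)\, \varphi(u,v)\,du\,dv$

$\to \iint \Bigl(\frac{t}{n}\Bigr)^{\lambda-1} \Bigl(-\log\frac{t}{n}\Bigr)^{m-1} \varphi(0,v)\, \exp\bigl[ -tK(0,v) + (tK(0,v))^{1/2} \sigma(0,v) \bigr]\, dv\, \frac{dt}{n}$

Characteristic function of F : for sufficiently small $\varepsilon > 0$,

$E\Bigl[ \Bigl\{ \frac{n^\lambda}{(\log n)^{m-1}}\, Z \Bigr\}^{i\varepsilon} \Bigr] \to \mathrm{const.} \quad (n \to \infty)$

Q.E.D.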

21

Information Science & Mathematical Physics

Applications and Future Study (1)

Identification of Unknown Information Source

= Statistical Physics with Random Hamiltonian

Identification of Hidden Structure

= Hamiltonian has Singularities

⇒ Singularities make the state density singular.

22

Model Identification


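[Figure: F plotted against the model $p(x|w)$, $\varphi(w)$; one point is labeled "True".]

$F = F(p, \varphi, X_1, X_2, \dots, X_n)$

F is computed from the samples; the true distribution is then identified.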

23

Poles and orders of Zeta function

1. If φ(w)>0 at W0, then 0<λ≦d/2.

2. 1≦m≦d.

3. If φ(w) is Jeffreys’ prior, λ≧d/2.

4. If ζ(z) has a pole –λ’, then λ≦λ’.


24

Concrete Learning Systems

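1. Neural Networks, True H0, Model H.

$p(y|x,w) = \frac{1}{(2\pi)^{1/2}} \exp\Bigl( -\frac{1}{2} \bigl\| y - \sum_h a_h f(b_h \cdot x + c_h) \bigr\|^2 \Bigr)$

$2\lambda \le H_0 (M + N + 1) + (H - H_0) \min(M + 1, N)$

2. Gaussian Mixtures, True H0, Model H.

$p(x|w) = \sum_h a_h \exp\Bigl( -\frac{1}{2} \| x - b_h \|^2 \Bigr)$

$2\lambda \le H_0 + (M - 1) H / 2 + (M - 3)/2$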


25

Future Study


1. Testing hypothesis ⇒ q(x)=p(x|w0) ; w0 near singularity.

2. Large System : Thermodynamic limit.

3. Replica Method : f(z) = E[ exp( zF) ].

4. Generalization to Non-commutative System.
