Weak Convergence of Random Free Energy in Information Theory
Sumio Watanabe, Tokyo Institute of Technology
Contents
1. Background
2. Main Theorem
3. Outline of Proof
4. Applications and Future Study
Identification Problem ≡ Math. Phys. with Random Hamiltonian
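Background (1)
Example: Classical Spin System
[Figure: an unknown system and a learner, each a spin network with visible units x and hidden units s_i, s_j coupled by weights w_ij; the learner learns from samples of the unknown system.]
$p(x|w) = \frac{1}{Z(w)} \sum_{\text{hidden}} \exp\Big( -\sum_{(i,j)} w_{ij}\, s_i s_j \Big)$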
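For a very small network, the marginal p(x|w) on this slide can be evaluated by brute-force enumeration. The sketch below is illustrative only and rests on assumptions not stated in the slides: spins take values ±1, the pair sum runs over i < j across visible and hidden units together, and Z(w) sums over all configurations. The coupling matrix `W` and the function names are hypothetical.

```python
import itertools
import math

def unnormalized(x, w, n_hidden):
    """Sum exp(-sum_{i<j} w[i][j] s_i s_j) over all hidden spin configurations."""
    total = 0.0
    for h in itertools.product((-1, 1), repeat=n_hidden):
        s = tuple(x) + h  # full configuration: visible spins first, then hidden spins
        energy = sum(w[i][j] * s[i] * s[j]
                     for i in range(len(s)) for j in range(i + 1, len(s)))
        total += math.exp(-energy)
    return total

def p_visible(x, w, n_visible, n_hidden):
    """Marginal probability of the visible configuration x."""
    z = sum(unnormalized(v, w, n_hidden)
            for v in itertools.product((-1, 1), repeat=n_visible))
    return unnormalized(x, w, n_hidden) / z

# Hypothetical couplings for 2 visible spins and 1 hidden spin (upper triangle used).
W = [[0.0, 0.3, -0.5],
     [0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0]]
probs = {v: p_visible(v, W, 2, 1) for v in itertools.product((-1, 1), repeat=2)}
```

The cost grows as 2^(number of spins), so this is only a way to check the formula on toy cases.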
Background (2)
Identification Problem
Unknown information source (classical): q(x)
Observation: X_1, X_2, …, X_n
Learning system: p(x|w), φ(w)
Estimated distribution: p(x | X_1, X_2, …, X_n)
Relative entropy: $D(q\|p) \equiv \int dx\, q\,[\log q - \log p] = ?$
Background (3)
Random Free Energy and Relative Entropy
Definition. Random free energy = minus log-likelihood of the system:
$F(X_1, X_2, \dots, X_n) \equiv -\log \int p(X_1|w)\, p(X_2|w) \cdots p(X_n|w)\, \varphi(dw) + \sum_{i=1}^{n} \log q(X_i)$
Relation between F and D(q||p):
$D\big( q(X_{n+1}) \,\|\, p(X_{n+1} | X_1, X_2, \dots, X_n) \big) = F(X_1, X_2, \dots, X_{n+1}) - F(X_1, X_2, \dots, X_n)$
where the right-hand side is averaged over $X_{n+1}$.
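The relation between F and D(q||p) can be checked numerically on a toy case. The sketch below assumes a setup that is not in the slides: a Bernoulli model p(x|w) = w^x (1-w)^(1-x), a prior φ that is uniform on a finite grid of w values, and a true source q = Bernoulli(0.3). It takes F to include the term Σ log q(X_i) and averages the increment F(n+1) - F(n) over X_{n+1} drawn from q.

```python
import math

# Hypothetical toy setup (not from the slides).
GRID = [(k + 0.5) / 50 for k in range(50)]  # support of the discrete prior
Q1 = 0.3                                    # q(X = 1)

def log_q(x):
    return math.log(Q1 if x == 1 else 1.0 - Q1)

def lik(xs, w):
    """Likelihood p(X_1|w) ... p(X_n|w) of a 0/1 sample."""
    return math.prod(w if x == 1 else 1.0 - w for x in xs)

def free_energy(xs):
    """F = -log of the prior-averaged likelihood + sum_i log q(X_i)."""
    marginal = sum(lik(xs, w) for w in GRID) / len(GRID)
    return -math.log(marginal) + sum(log_q(x) for x in xs)

def predictive(x_new, xs):
    """p(x_new | X_1, ..., X_n) under the posterior."""
    num = sum(lik(xs + [x_new], w) for w in GRID)
    den = sum(lik(xs, w) for w in GRID)
    return num / den

xs = [1, 0, 0, 1, 0, 0, 0, 1]
# Relative entropy between q and the predictive distribution.
kl = sum(math.exp(log_q(x)) * (log_q(x) - math.log(predictive(x, xs)))
         for x in (0, 1))
# Increment of F, averaged over X_{n+1} ~ q.
avg_increment = sum(math.exp(log_q(x)) * (free_energy(xs + [x]) - free_energy(xs))
                    for x in (0, 1))
```

In this sketch `kl` and `avg_increment` agree up to rounding, because F(n+1) - F(n) equals log q(X_{n+1}) - log p(X_{n+1} | X_1, …, X_n) term by term.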
Background (4)
Identifiability and Singularities
A learning system p(x|w) is called identifiable if
p(x|w_1) = p(x|w_2) (∀x) ⇒ w_1 = w_2.
Remark. A system which identifies the structure is non-identifiable.
W = {w}, w_1 ~ w_2 ⇔ "p(x|w_1) = p(x|w_2) (∀x)"
W/~ is not a manifold because { w ; p(x|w) = p(x|w_0) } is an analytic set with singularities.
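A minimal illustration of non-identifiability, using a hypothetical one-hidden-unit regression function a·tanh(b·x) that is not taken from the slides: every parameter on the lines a = 0 or b = 0 realises the same zero function, and the two lines cross at the origin, which is a singular point of the set of equivalent parameters.

```python
import math

def model_mean(x, a, b):
    """Regression function of a hypothetical one-hidden-unit network."""
    return a * math.tanh(b * x)

xs = [-2.0, -0.5, 0.0, 0.7, 3.0]
# Distinct parameters (a, b) with a*b = 0: all give the zero function.
same_function = [(0.0, 1.5), (0.0, -4.0), (2.0, 0.0), (0.0, 0.0)]
outputs = [[model_mean(x, a, b) for x in xs] for a, b in same_function]
```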
Main Theorem (1)
Mathematical Definitions
X : a random variable on $\mathbb{R}^N$ with p.d.f. q(x).
W : a real d-dimensional manifold.
$L^2(q) = \{ f ; \int f(x)^2 q(x)\, dx < \infty \}$ : a real Hilbert space.
φ(w) : a p.d.f. on W, a $C_0^\infty$-class function.
φ(w) dw : a probability distribution on W.
Main Theorem (2)
Mathematical Definitions
H(·, w) : an $L^2(q)$-valued real analytic function on W, s.t.
$E_X[e^{-H(X,w)}] = 1$ (∀w). [ ⇒ $K(w) \equiv E_X[H(X,w)] \geq 0$ ]
e.g. $H(x,w) = \log q(x) - \log p(x|w)$
$W_0 \equiv \{ w \in \mathrm{supp}\,\varphi ; K(w) = 0 \} \neq \emptyset$
Given $X_1, X_2, \dots, X_n$ : i.i.d., the random free energy is
$F = -\log \int \exp\Big( -\sum_{i=1}^{n} H(X_i, w) \Big) \varphi(w)\, dw$
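The bracketed implication is an instance of Jensen's inequality. A short derivation, using only the normalization condition on H stated on this slide:

```latex
1 = E_X\!\left[ e^{-H(X,w)} \right] \;\ge\; e^{-E_X[H(X,w)]} = e^{-K(w)}
\quad\Longrightarrow\quad K(w) \ge 0 .
```

For the example H(x,w) = log q(x) - log p(x|w), the normalization holds whenever p(·|w) puts no mass outside the support of q, and K(w) is then the relative entropy D(q || p(·|w)).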
Main Theorem (3)
Gel'fand's Zeta function
The zeta function
$\zeta(z) = \int K(w)^z \varphi(w)\, dw$ : holomorphic in Re(z) > 0.
Difficulty : {w ; K(w) = 0} is an analytic set with singularities.
Theorem (Atiyah, Sato, Bernstein, Bjork, Kashiwara, 1970-1980)
(1) ζ(z) can be analytically continued to a meromorphic function on the entire complex plane.
(2) All poles are real, negative, and rational numbers.
Poles: $0 > -\lambda_1 > -\lambda_2 > -\lambda_3 > \cdots$
Orders: $m_1, m_2, m_3, \dots$
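A toy case makes the pole structure concrete. Assume, purely for illustration, K(w) = w^2 with φ uniform on [0, 1] (this φ is not a smooth compactly supported function, so it sits outside the stated hypotheses). Then ζ(z) = 1/(2z + 1) for Re(z) > -1/2, and the continuation has a single simple pole at z = -1/2, which is real, negative, and rational. The sketch compares a midpoint-rule value with the closed form at a few real z.

```python
def zeta_numeric(z, steps=20_000):
    """Midpoint-rule value of the integral of (w^2)^z over [0, 1], for real z >= 0."""
    h = 1.0 / steps
    return sum(((i + 0.5) * h) ** (2.0 * z) for i in range(steps)) * h

def zeta_closed(z):
    """Closed form 1 / (2z + 1); its only pole is at z = -1/2."""
    return 1.0 / (2.0 * z + 1.0)

gaps = [abs(zeta_numeric(z) - zeta_closed(z)) for z in (0.5, 1.0, 2.0)]
```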
Main Theorem (4)
Main Theorem
$F - \lambda_1 \log n + (m_1 - 1) \log\log n \to F^*$ (n → ∞)
The convergence in law holds, where F* can be represented by a limit process of an empirical process on W_0.
Corollary. If E[D(q||p)] has an asymptotic expansion, then
$E[D(q\|p)] = \frac{\lambda_1}{n} + o\Big(\frac{1}{n}\Big)$
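A regular one-parameter example, where everything is available in closed form, shows what the theorem says in the simplest case. The setup is an assumption of this sketch, not part of the slides: q = N(0,1), p(x|w) = N(w,1), and a standard normal prior φ (not compactly supported, chosen only because the integral is explicit). Then H(x,w) = -xw + w^2/2, K(w) = w^2/2, λ_1 = 1/2, m_1 = 1, and with S = ΣX_i one gets F = (1/2) log(n+1) - S^2 / (2(n+1)). The code checks the closed form against direct quadrature.

```python
import math
import random

def free_energy_quadrature(xs, half_width=8.0, steps=40_000):
    """F = -log of the integral of exp(-sum_i H(X_i, w)) phi(w), by the midpoint rule."""
    n, s = len(xs), sum(xs)
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        w = -half_width + (i + 0.5) * h
        # sum_i H(X_i, w) = -s*w + n*w^2/2; the last two terms are log phi(w).
        log_integrand = (s * w - 0.5 * n * w * w
                         - 0.5 * w * w - 0.5 * math.log(2.0 * math.pi))
        total += math.exp(log_integrand)
    return -math.log(total * h)

def free_energy_closed(xs):
    n, s = len(xs), sum(xs)
    return 0.5 * math.log(n + 1) - s * s / (2.0 * (n + 1))

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
f_num = free_energy_quadrature(xs)
f_exact = free_energy_closed(xs)
# F - lambda_1 log n with lambda_1 = 1/2 and m_1 = 1 (no log log n term).
centered = f_exact - 0.5 * math.log(len(xs))
```

In this toy case `centered` equals (1/2) log((n+1)/n) - S^2 / (2(n+1)), which converges in law to minus one half of a chi-square variable with one degree of freedom, since S / sqrt(n) is standard normal.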
Proof Outline (1)
Hironaka Resolution Theorem
[Figure: a map g from U to W; U_0 in U is sent to W_0 in W, where K(w) = 0.]
Locally, $K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$
Proof Outline (2)
Resolution Theorem
H. Hironaka (1964), M. F. Atiyah (1970)
Let K(w) ≥ 0 be a real analytic function defined in a neighborhood of 0 ∈ W ⊂ R^d. Then there exist an open set W, a real analytic manifold U, and a proper analytic map g : U → W such that
(1) g : U − U_0 → W − W_0 is an isomorphism.
(2) For each P ∈ U, there are local coordinates (u_1, u_2, …, u_d) centered at P so that, locally near P,
$K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$
where a(u) > 0 is an analytic function and each s_i ≥ 0 is an integer.
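As a hedged illustration of the normal-crossing form, take the hypothetical function K(w_1, w_2) = w_1^2 + w_2^2 (not from the slides) and one chart of the blow-up of the origin, g(u_1, u_2) = (u_1, u_1 u_2). In that chart K(g(u)) = (1 + u_2^2) u_1^2, so a(u) = 1 + u_2^2 > 0 and (s_1, s_2) = (1, 0). The sketch checks the identity at random points.

```python
import random

def K(w1, w2):
    return w1 * w1 + w2 * w2

def g(u1, u2):
    """One chart of the blow-up of the origin: (u1, u2) -> (u1, u1*u2)."""
    return u1, u1 * u2

def normal_crossing(u1, u2):
    a = 1.0 + u2 * u2   # a(u) > 0 everywhere
    return a * u1 ** 2  # exponents 2*s = (2, 0)

random.seed(1)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]
max_gap = max(abs(K(*g(u1, u2)) - normal_crossing(u1, u2)) for u1, u2 in points)
```

A full resolution also needs the second chart (u_1 u_2, u_2); by symmetry it gives the same form with the roles of u_1 and u_2 exchanged.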
Proof Outline (3)
Division of Partition Function
Because supp φ is compact and g is a proper map, we can assume $W = \bigcup_\alpha U_\alpha$ (a finite union whose overlaps have measure zero).
In each $U_\alpha$,
$K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$
$\varphi(u) = \sum b(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$
(Both s_i and k_i depend on α.)
$\exp(-F) = \sum_\alpha \int_{U_\alpha} \exp\Big[ -\sum_{i=1}^{n} H(X_i, g(u)) \Big] \varphi(u)\, du$
Hereafter, α is omitted and K(u) ≡ K(g(u)) is used.
Proof Outline (4)
B-function
The zeta function $\zeta(z) = \int K(w)^z \varphi(w)\, dw$
$\exists P(w, \partial_w, z)\ \exists b(z)$ s.t. $P(w, \partial_w, z)\, K(w)^{z+1} = b(z)\, K(w)^z$
Analytic continuation is carried out using the b-function.
If K(w) is a polynomial, then there exists an algorithm to calculate b(z) (Oaku, 1997).
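A worked one-dimensional case shows how the b-function drives the continuation. The choice K(w) = w^2 is an illustration, not an example from the slides; φ is taken to be smooth with compact support, so the boundary terms in the two integrations by parts vanish.

```latex
K(w) = w^2,\quad P = \partial_w^2 :\qquad
\partial_w^2\, (w^2)^{z+1} = (2z+2)(2z+1)\,(w^2)^{z}
\;\Longrightarrow\; b(z) = (2z+2)(2z+1).

\zeta(z) = \int (w^2)^{z}\,\varphi(w)\,dw
= \frac{1}{b(z)} \int \partial_w^2 (w^2)^{z+1}\,\varphi(w)\,dw
= \frac{1}{b(z)} \int (w^2)^{z+1}\,\varphi''(w)\,dw .
```

The last integral is holomorphic for Re(z) > -3/2, so ζ(z) extends to that half-plane with possible poles only at the zeros of b(z), namely z = -1/2 and z = -1. Repeating the step extends ζ further to the left.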
Proof Outline (5)
Ideals of Local Analytic Functions
Lemma 1. Let u → H(·, u) be a real analytic function in U. There exist an open set U' ⊂ U and a finite set of analytic functions { g_j(u), h_j(·, u) ; j = 1, 2, …, J } in U' s.t.
(1) T(u) ≥ I (∀u ∈ U'), where $T_{jk}(u) \equiv \int h_j(x,u)\, h_k(x,u)\, q(x)\, dx$
(2) $H(x,u) = \sum_{j=1}^{J} g_j(u)\, h_j(x,u)$
Proof Outline (6)
Decomposition of Hamiltonian
Random Hamiltonian:
$\sum_{i=1}^{n} H(X_i, u) = nK(u) + (nK(u))^{1/2}\, \sigma_n(u)$
$\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$
$r(x,u) \equiv \frac{H(x,u) - K(u)}{K(u)^{1/2}}$
By Lemma 1 and $K(u) = \int \{ H(x,u) + e^{-H(x,u)} - 1 \}\, q(x)\, dx$, r(x,u) is well defined even if K(u) = 0.
Proof Outline (7)
Donsker's Empirical Process
Central limit theorem in Banach space:
$\sigma_n(\cdot) \to \sigma(\cdot)$ (empirical process → tight Gaussian process)
$\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$
$E_{X_1, X_2, \dots, X_n}[ f(\sigma_n) ] \to E_\sigma[ f(\sigma) ]$
(∀ f : a bounded continuous functional on $L^\infty(\mathrm{supp}\,\varphi)$)
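Proof Outline (8)
Poles of Zeta function
$\zeta(z) = \sum \int K(u)^z \varphi(u)\, du$
$K(u) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$
$\varphi(u) = \sum b(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$
$\lambda = \min_j \frac{k_j + 1}{2 s_j}$
$m = \#\Big\{ j ; \lambda = \frac{k_j + 1}{2 s_j} \Big\}$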
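Once the exponents s_j and k_j of a chart are known, λ and m follow by simple arithmetic. The sketch below uses exact fractions; the function name is hypothetical, and skipping coordinates with s_j = 0 (which do not appear in K(u) and therefore contribute no pole) is an assumption of this sketch. Across several charts, one would take the smallest λ found.

```python
from fractions import Fraction

def pole_and_order(s, k):
    """Return (lambda, m) for K(u) ~ prod u_j^(2 s_j) and phi(u) ~ prod u_j^(k_j)."""
    # Coordinates with s_j = 0 are skipped: they give no vanishing factor in K(u).
    ratios = [Fraction(kj + 1, 2 * sj) for sj, kj in zip(s, k) if sj > 0]
    if not ratios:
        raise ValueError("K(u) has no vanishing factor in this chart")
    lam = min(ratios)
    return lam, ratios.count(lam)

# K(u) = u1^2 u2^2 with a constant weight: a double pole at -1/2.
lam_a, m_a = pole_and_order(s=(1, 1), k=(0, 0))
# K(u) = a(u) u1^2 with weight |u1|, e.g. after a blow-up with Jacobian |u1|.
lam_b, m_b = pole_and_order(s=(1, 0), k=(1, 0))
```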
Proof Outline (9)
Zeta function and State Density
u = (u, v), u = (u_j) ; j ∈ J : attains min.
$L(u) \equiv \prod_{j \in J} u_j^{2s_j}$
Partial zeta: $\iint L(u)^z \varphi(u,v)\, du\, dv$ : pole −λ, order m
Inverse Mellin transf.
State density: $\iint \delta(t - L(u))\, \varphi(u,v)\, du\, dv = t^{\lambda-1} (-\log t)^{m-1} \int \varphi(0,v)\, dv$ ( t → 0 )
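The link between the pole of the partial zeta function and the small-t behaviour of the state density can be seen in a one-dimensional toy case. The choices L(u) = u^2 and φ = 1 on [0, 1] are assumptions for illustration, with no v variable.

```latex
v(t) = \int_0^1 \delta(t - u^2)\, du = \frac{1}{2\sqrt{t}} = \tfrac{1}{2}\, t^{\lambda - 1},
\qquad \lambda = \tfrac{1}{2},\ m = 1 \quad (0 < t < 1),

\zeta(z) = \int_0^1 t^{z}\, v(t)\, dt = \tfrac{1}{2} \int_0^1 t^{z - 1/2}\, dt = \frac{1}{2z + 1} .
```

The Mellin transform of the state density reproduces the zeta function, whose only pole is at z = -1/2 with order 1. In this toy case the t^{λ-1} law carries a constant factor 1/2, so the relation should be read as a leading-order statement as t → 0.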
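Proof Outline (10)
Partition function and Empirical Process
Partition function ← State Density ← Zeta function
$Z = \iint \exp\big( -nK(u,v) + (nK(u,v))^{1/2} \sigma_n(u,v) \big)\, \varphi(u,v)\, du\, dv$
$\to \iint \Big(\frac{t}{n}\Big)^{\lambda-1} \Big(-\log\frac{t}{n}\Big)^{m-1} \varphi(0,v)\, \exp\big[ -tK(0,v) + (tK(0,v))^{1/2} \sigma(0,v) \big]\, \frac{dt}{n}\, dv$ ( n → ∞ )
Characteristic function of F : for sufficiently small ε > 0,
$E\Big[ \Big\{ \frac{n^{\lambda}}{(\log n)^{m-1}}\, Z \Big\}^{i\varepsilon} \Big] \to \text{const.}$
Q.E.D.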
Applications and Future Study (1)
Information Science & Mathematical Physics
Identification of Unknown Information Source
= Statistical Physics with Random Hamiltonian
Identification of Hidden Structure
= Hamiltonian has Singularities
⇒ Singularities make the state density singular.
Applications and Future Study (2)
Model Identification
[Figure: axes labelled F and p(x|w), φ(w); the point "True" is marked.]
F = F(p, φ, X_1, X_2, …, X_n)
If F is obtained from samples, then the true distribution is identified.
Applications and Future Study (3)
Poles and orders of Zeta function
1. If φ(w) > 0 at W_0, then 0 < λ ≤ d/2.
2. 1 ≤ m ≤ d.
3. If φ(w) is Jeffreys' prior, λ ≥ d/2.
4. If ζ(z) has a pole −λ', then λ ≤ λ'.
Applications and Future Study (4)
Concrete Learning Systems
1. Neural Networks, True H_0, Model H.
$p(y|x,w) = \frac{1}{(2\pi)^{1/2}} \exp\Big( -\frac{\| y - \sum a_h f(b_h \cdot x + c_h) \|^2}{2} \Big)$
$2\lambda \leq H_0 (M+N+1) + (H - H_0) \min(M+1, N)$
2. Gaussian Mixtures, True H_0, Model H.
$p(x|w) = \sum a_h \exp\Big( -\frac{\| x - b_h \|^2}{2} \Big)$
$2\lambda \leq H_0 + (M-1)H/2 + (M-3)/2$
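The neural-network bound can be compared with the parameter count d to see how far below d/2 the pole may lie. The sketch assumes that M is the input dimension, N the output dimension, and H, H_0 the numbers of hidden units in the model and in the true system; the slides do not define these symbols, and the function names are hypothetical.

```python
def nn_two_lambda_bound(M, N, H, H0):
    """Upper bound on 2*lambda from the slide: H0(M+N+1) + (H-H0) min(M+1, N)."""
    if not 0 <= H0 <= H:
        raise ValueError("expected 0 <= H0 <= H")
    return H0 * (M + N + 1) + (H - H0) * min(M + 1, N)

def nn_parameter_count(M, N, H):
    """d under the assumed reading: each hidden unit has a_h (N), b_h (M), c_h (1)."""
    return H * (M + N + 1)

bound = nn_two_lambda_bound(M=2, N=1, H=3, H0=1)  # 1*4 + 2*min(3, 1)
d = nn_parameter_count(M=2, N=1, H=3)             # 3*4
```

With these hypothetical sizes the bound on 2λ is 6 while d is 12, so λ is at most half of the regular-model value d/2.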