A day around Ron DeVore, Sep. 13, 2019
Approximation with tree tensor networks
Anthony Nouy
Centrale Nantes, Laboratoire de Mathématiques Jean Leray
Statistical learning
Two typical tasks of statistical learning are to
approximate a random variable Y by a function of a set of variables X = (X_1, ..., X_d), from samples of the pair Z = (X, Y) (supervised learning)
approximate the probability distribution of a random vector Z = (Z_1, ..., Z_d) from samples of the distribution (unsupervised learning)
Risk
A classical approach is to introduce a risk functional R(v) whose minimizer over the set of functions v is the target function u, and such that
R(v)−R(u)
measures some distance between the target u and the function v .
The risk is defined as an expectation
R(v) = E(γ(v ,Z))
where γ is called a contrast (or loss) function.
For least-squares regression in supervised learning, R(v) = E((Y − v(X))²), u(X) = E(Y | X), and

R(v) − R(u) = E((u(X) − v(X))²) = ‖u − v‖²_{L²_µ}, with X ∼ µ.

For unsupervised learning with the L² loss, R(v) = E(‖v‖²_{L²_µ} − 2 v(Z)) and

R(v) − R(u) = ‖u − v‖²_{L²_µ}

is the L² distance between v and the probability density u of Z with respect to a reference measure µ.
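A quick numerical illustration of the least-squares identity above (a minimal sketch; the target u, the candidate v and the noise level are arbitrary choices, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

u = lambda x: np.sin(np.pi * x)            # target: u(X) = E[Y | X]
v = lambda x: x                            # candidate approximation
n = 10**6

X = rng.uniform(-1.0, 1.0, n)              # X ~ mu, uniform on [-1, 1]
Y = u(X) + 0.1 * rng.standard_normal(n)    # Y = u(X) + noise

risk_v = np.mean((Y - v(X))**2)            # R(v) = E[(Y - v(X))^2]
risk_u = np.mean((Y - u(X))**2)            # R(u), the minimal risk
l2_dist_sq = np.mean((u(X) - v(X))**2)     # ||u - v||^2 in L^2_mu

# the excess risk and the squared L^2 distance agree up to Monte Carlo error
print(risk_v - risk_u, l2_dist_sq)
```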
Risk
Variational methods for PDEs [Eigel et al 2018]: with Z uniformly distributed on D = (0, 1)^d and the risk

R(v) = E( |∇v(Z)|² − 2 v(Z) f(Z) ),

the target function u in H¹₀(D) is such that

−∆u = f on D,

and R(v) − R(u) = ‖v − u‖²_{H¹₀}.
Empirical risk minimization
Given i.i.d. samples {z_i}_{i=1}^n of Z, an approximation u_F^n of u is obtained by minimization of the empirical risk

R_n(v) = (1/n) ∑_{i=1}^n γ(v, z_i)

over a certain model class F.

Denoting by u_F the minimizer of the risk over F, the error decomposes as

R(u_F^n) − R(u) = [ R(u_F^n) − R(u_F) ] (estimation error) + [ R(u_F) − R(u) ] (approximation error)

For a given sample, when taking larger and larger model classes, the approximation error decreases while the estimation error increases.

Methods should be proposed for the selection of a model class taking the best from the available information.
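The tradeoff can be observed on a toy least-squares problem where the model classes are polynomial spaces of increasing degree: the empirical risk keeps decreasing while the risk, estimated on an independent sample, eventually increases. A minimal sketch (target, noise level and degrees are arbitrary illustrative choices, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.cos(3 * x) + 0.2 * rng.standard_normal(n)   # Y = u(X) + noise
    return x, y

x_train, y_train = sample(50)        # small training sample
x_test, y_test = sample(10**5)       # large sample to estimate the true risk

for degree in [1, 3, 5, 10, 15]:
    # empirical risk minimization over F = polynomials of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_risk = np.mean((y_train - np.polyval(coeffs, x_train))**2)
    test_risk = np.mean((y_test - np.polyval(coeffs, x_test))**2)
    print(f"degree {degree:2d}: empirical risk {train_risk:.4f}, estimated risk {test_risk:.4f}")
```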
Outline
1 Tree tensor networks
2 Estimation error
3 Approximation error
4 Learning algorithms
Tensor ranks
We consider a space H of functions defined on the set X = X_1 × ... × X_d, equipped with a product measure µ = µ_1 ⊗ ... ⊗ µ_d.

For a subset of variables α ⊂ {1, ..., d} := D, a function v ∈ H can be identified with a bivariate function

v(x_α, x_{α^c})

where x_α and x_{α^c} are complementary groups of variables.

The canonical rank of this bivariate function is called the α-rank of v, denoted rank_α(v). It is the minimal integer r_α such that

v(x) = ∑_{k=1}^{r_α} v_k^α(x_α) w_k^{α^c}(x_{α^c})
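On a tensor-product grid, the α-rank can be computed as the matrix rank of the matricization grouping the variables in α as rows and those in αᶜ as columns. A minimal NumPy sketch on a synthetic low-rank example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5                                   # 4 variables, 5 grid points each

# build a tensor of the form sum_k v_k(x_1, x_2) w_k(x_3, x_4) with 3 terms,
# so its alpha-rank for alpha = {1, 2} is (at most) 3
v = rng.standard_normal((3, n, n))
w = rng.standard_normal((3, n, n))
u = np.einsum('kab,kcd->abcd', v, w)          # u(x1, x2, x3, x4)

def alpha_rank(tensor, alpha):
    """Rank of the matricization with the variables in alpha as rows."""
    alpha_c = [i for i in range(tensor.ndim) if i not in alpha]
    rows = int(np.prod([tensor.shape[i] for i in alpha]))
    mat = np.transpose(tensor, alpha + alpha_c).reshape(rows, -1)
    return np.linalg.matrix_rank(mat)

print(alpha_rank(u, [0, 1]))   # 3, by construction
print(alpha_rank(u, [0, 2]))   # generically much larger (up to 25) for this split
```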
Tensor formats
For T ⊂ 2^D a collection of subsets of D and a given tuple r = (r_α)_{α∈T}, a tensor format is defined by

T_r^T(H) = {v ∈ H : rank_α(v) ≤ r_α, α ∈ T}.

In the particular case where T is a dimension partition tree, T_r^T is a tree-based tensor format. Three classical examples over D = {1, ..., 5}:

Tucker: T = {{1,2,3,4,5}, {1}, {2}, {3}, {4}, {5}}.

Tensor train: the linear tree T = {{1,2,3,4,5}, {1,2,3,4}, {1,2,3}, {1,2}, {1}, {2}, {3}, {4}, {5}}.

Hierarchical Tucker: a balanced tree, e.g. T = {{1,2,3,4,5}, {1,2,3}, {4,5}, {1}, {2,3}, {2}, {3}, {4}, {5}}.
Tree-based formats as tensor networks
Consider a tensor space H = H_1 ⊗ ... ⊗ H_d of functions in L²_µ(X), and let {φ^ν_{i_ν} : i_ν ∈ I^ν} be a basis of H_ν ⊂ L²_{µ_ν}(X_ν), typically polynomials, wavelets...

A function v in T_r^T(H) = {v ∈ H : rank_T(v) ≤ r} admits an explicit representation

v(x) = ∑_{i_α ∈ I^α, α ∈ L(T)} ∑_{1 ≤ k_β ≤ r_β, β ∈ T} ∏_{α ∈ T\L(T)} a^α_{(k_β)_{β∈S(α)}, k_α} ∏_{α ∈ L(T)} a^α_{i_α, k_α} φ^α_{i_α}(x_α)

where each parameter a^α is in a tensor space R^{K_α}, S(α) denotes the children of a node α, and L(T) the leaves of T.

[Tensor network diagram for d = 5: parameter tensors a^{1,2,3,4,5}, a^{1,2,3}, a^{1}, a^{2,3}, a^{2}, a^{3}, a^{4,5}, a^{4}, a^{5} attached to the nodes of the tree.]

Representation complexity:

C(T, r, H) = ∑_{α ∈ T\L(T)} r_α ∏_{β ∈ S(α)} r_β + ∑_{α ∈ L(T)} r_α dim(H_α).
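A minimal sketch of how such a parametrization is evaluated and of its representation complexity, for a small tree T = {{1,2,3}, {1,2}, {1}, {2}, {3}} over d = 3 variables; the ranks, polynomial bases and random parameter tensors are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# polynomial feature maps Phi_nu(x_nu) = (1, x, x^2, x^3) for each of the 3 variables
p = 4                                    # dim(H_nu)
phi = lambda x: x ** np.arange(p)

r1, r2, r3, r12 = 2, 2, 3, 2             # ranks attached to the nodes {1}, {2}, {3}, {1,2}

# one parameter tensor per node of the tree
a1 = rng.standard_normal((p, r1))        # leaf {1}
a2 = rng.standard_normal((p, r2))        # leaf {2}
a3 = rng.standard_normal((p, r3))        # leaf {3}
a12 = rng.standard_normal((r1, r2, r12)) # interior node {1,2}
aroot = rng.standard_normal((r12, r3))   # root {1,2,3} (root rank 1)

def v(x):
    z1 = a1.T @ phi(x[0])                          # vector in R^{r1}
    z2 = a2.T @ phi(x[1])                          # vector in R^{r2}
    z3 = a3.T @ phi(x[2])                          # vector in R^{r3}
    z12 = np.einsum('abk,a,b->k', a12, z1, z2)     # vector in R^{r12}
    return np.einsum('kc,k,c->', aroot, z12, z3)   # scalar v(x)

print(v(np.array([0.1, -0.3, 0.7])))

# representation complexity C(T, r, H): interior nodes (incl. root) + leaves
C = r12 * r1 * r2 + 1 * r12 * r3 + (r1 + r2 + r3) * p
print("C(T, r, H) =", C)
```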
Tree-based tensor format as a deep neural network
By identifying a tensor a^{(α)} ∈ R^{n_1 × ... × n_s × r_α} with an R^{r_α}-valued multilinear function

f^{(α)} : R^{n_1} × ... × R^{n_s} → R^{r_α},

a function v in T_r^T admits a representation as a tree-structured composition of multilinear functions {f^{(α)}}_{α∈T}:

v(x) = f^D( f^{1,2,3}( f^1(Φ^1(x_1)), f^{2,3}( f^2(Φ^2(x_2)), f^3(Φ^3(x_3)) ) ), f^{4,5}( f^4(Φ^4(x_4)), f^5(Φ^5(x_5)) ) )

where Φ^ν(x_ν) = (φ^ν_{i_ν}(x_ν))_{i_ν ∈ I^ν} ∈ R^{#I^ν}.

It corresponds to a deep network with a sparse architecture (given by T), a depth bounded by d − 1, and a width at level ℓ related to the α-ranks of the nodes α at level ℓ.
Estimation error
Given a model class F, a minimizer u_F of the risk R over F and a minimizer u_F^n of the empirical risk R_n over F, the estimation error satisfies

R(u_F^n) − R(u_F) ≤ R(u_F^n) − R_n(u_F^n) + R_n(u_F) − R(u_F),

so that

E( R(u_F^n) − R(u_F) ) ≤ E( sup_{v∈F} |R_n(v) − R(v)| ) = E( sup_{v∈F} | (1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z)) | )
Estimation error
Assume that F is compact in L^∞, and that for all v, w ∈ F, the contrast is uniformly bounded and Lipschitz,

|γ(v, Z)| ≤ M,    |γ(v, Z) − γ(w, Z)| ≤ L ‖v − w‖_{L^∞}.

Then, using a standard concentration inequality (here Hoeffding, applied to each fixed v and combined with a union bound over a covering of F), we obtain

P( sup_{v∈F} | (1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z)) | ≥ εM ) ≤ 2 N_{εM/(2L)} e^{−nε²/2} := η(n, ε, F)

where N_{εM/(2L)} = N(εM/(2L), F, ‖·‖_{L^∞}) is the covering number of F.

We deduce that for any ε,

E( R(u_F^n) − R(u_F) ) ≤ εM + 2M η(n, ε, F)

and a bound can be obtained by taking the infimum over ε.
Metric entropy of tree-based tensor formats [with Bertrand Michel]

Assume H ⊂ L^∞(X), with basis functions {φ_i}_{i∈I} normalized in L^∞(X).

For any representation of v ∈ T_r^T(H) with parameters {f^α : α ∈ T}, the multilinearity of the parametrization implies

‖v‖_{L^∞} ≤ ∏_α ‖f^α‖_{α,∞},

with a suitable choice of norms

‖f^α‖_{α,∞} = sup_{‖z^β‖_∞ ≤ 1} ‖f^α((z^β)_{β∈S(α)})‖_∞.

Considering the model class

F = R F_1,    F_1 = {v ∈ T_r^T(H) : max_{α∈T} ‖f^α‖_{α,∞} ≤ 1},

we obtain an upper bound on the metric entropy

log N(ε, F, ‖·‖_{L^∞}) ≤ C(T, r, H) log(3 ε^{−1} R #T),

where C(T, r, H) is the representation complexity of elements of F.
Coming back to estimation error
We conclude that

E( R(u_F^n) − R(u_F) ) ≤ εM + 2Mη

for

n ≥ 2 ε^{−2} ( log(2 η^{−1}) + C(T, r, H) log(6 ε^{−1} R L M^{−1} #T) ),

and

E( R(u_F^n) − R(u_F) ) ≲ M √( C(T, r, H) / n )

up to log factors.
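For a rough feel of the numbers, the prescribed sample size and the resulting rate can be evaluated for given constants; all values below (M, L, R, #T, C, ε, η) are arbitrary illustrative choices:

```python
import math

# illustrative constants: contrast bound M, Lipschitz constant L, radius R of the
# model class, number of tree nodes #T, and representation complexity C(T, r, H)
M, L, R, nT, C = 1.0, 1.0, 1.0, 9, 500
eps, eta = 0.1, 0.01

# n >= 2 eps^-2 ( log(2/eta) + C(T,r,H) log(6 eps^-1 R L M^-1 #T) )
n = 2 / eps**2 * (math.log(2 / eta) + C * math.log(6 / eps * R * L / M * nT))
print(f"n >= {n:.0f}")

# resulting rate (up to log factors): E(R(u_F^n) - R(u_F)) <~ M sqrt(C(T,r,H)/n)
print("rate ~", M * math.sqrt(C / n))
```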
Improving estimation error
Improved bounds may be obtained using refined concentration inequalities for suprema of empirical processes

sup_{v∈F} | (1/n) ∑_{i=1}^n γ(v, z_i) − E(γ(v, Z)) |

A better performance can be obtained by using as empirical risk

R_n(v) = (1/n) ∑_{i=1}^n w_i γ(v, z_i)

where the z_i are i.i.d. samples from a measure dP̃_Z = w(z) dP_Z and the weights are w_i = w(z_i)^{−1}.

The choice of sampling measure should be adapted to the risk and model class, and may be deduced from concentration inequalities. See [Cohen and Migliorati 2017] for least-squares regression and linear models.

For tree tensor networks, use a different sample for each parameter. Link to empirical PCA [Nouy 2019] [Haberstich, Nouy, Perrin 2020].
Approximation properties of tree tensor networks
We want to quantify the approximation error

min_{v ∈ T_r^T} R(v) − R(u)

for a target function u in a given function class, i.e. study the expressive power of tree tensor networks and compare it with other approximation tools.
Approximation properties of tree tensor networks
We consider least-squares regression and L² density estimation, where

min_{v ∈ T_r^T} R(v) − R(u) = min_{v ∈ T_r^T} ‖u − v‖²_{L²_µ} := e_{T,r}(u)².

Since

T_r^T = ⋂_{α∈T} {v : rank_α(v) ≤ r_α},

a lower bound is given by

e_{T,r}(u) ≥ max_{α∈T} e_{α,r_α}(u),

where

e_{α,r_α}(u) = min_{rank_α(v) ≤ r_α} ‖u − v‖_{L²_µ}
Singular value decomposition
u(x_1, ..., x_d) can be identified with a bivariate function u(x_α, x_{α^c}) in L²_{µ_α ⊗ µ_{α^c}}(X_α × X_{α^c}), which admits a singular value decomposition

u(x_α, x_{α^c}) = ∑_{k=1}^{rank_α(u)} σ^α_k v^α_k(x_α) v^{α^c}_k(x_{α^c}).

The problem of best approximation of u by a function with α-rank r_α admits as a solution the truncated singular value decomposition u_{r_α} of u,

u_{r_α}(x_α, x_{α^c}) = ∑_{k=1}^{r_α} σ^α_k v^α_k(x_α) v^{α^c}_k(x_{α^c}),

where {v^α_1, ..., v^α_{r_α}} are the r_α α-principal components of u, and

e_{α,r_α}(u) = √( ∑_{k > r_α} (σ^α_k)² ).
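On a grid, the best approximation with α-rank r_α and the error e_{α,r_α}(u) can be read off the singular values of the α-matricization. A minimal NumPy sketch with a random tensor and uniform grid weights:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal((6, 6, 6, 6))      # values of u on a tensor-product grid

# matricization grouping alpha = {1, 2} (the first two axes) as rows
mat = u.reshape(36, 36)

U, s, Vt = np.linalg.svd(mat, full_matrices=False)

r = 5                                      # target alpha-rank
u_r = (U[:, :r] * s[:r]) @ Vt[:r]          # truncated SVD: best alpha-rank-r approximation

err = np.linalg.norm(mat - u_r)            # Frobenius = discrete L^2 norm (uniform weights)
tail = np.sqrt(np.sum(s[r:]**2))           # e_{alpha,r}(u) = sqrt(sum_{k>r} sigma_k^2)
print(err, tail)                           # identical up to round-off
```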
Linear widths of multivariate functions
The subspace of principal components U_α = span{v^α_1, ..., v^α_{r_α}} is a solution of

min_{dim(U_α) = r_α} ∫ ‖u(·, x_{α^c}) − P_{U_α} u(·, x_{α^c})‖²_{L²_{µ_α}} dµ_{α^c}(x_{α^c}) = e_{α,r_α}(u)²,

where P_{U_α} is the orthogonal projection onto U_α.

Considering the set of partial evaluations of u,

K_α(u) = {u(·, x_{α^c}) : x_{α^c} ∈ X_{α^c}} ⊂ L²_{µ_α}(X_α),

we have

e_{α,r_α}(u) ≤ min_{dim(U_α) = r_α} sup_{v ∈ K_α(u)} ‖v − P_{U_α} v‖_{L²_{µ_α}} = d_{r_α}(K_α(u))_{L²_{µ_α}},

the upper bound being the Kolmogorov r_α-width of K_α(u).

Furthermore, since e_{α,r_α}(u) = e_{α^c,r_α}(u), we have

e_{α,r_α}(u) ≤ min{ d_{r_α}(K_α(u))_{L²_{µ_α}}, d_{r_α}(K_{α^c}(u))_{L²_{µ_{α^c}}} }
Upper bound through higher order singular value decomposition
Given the spaces of principal components U_α, α ∈ T, we can define an approximation

u_r = ∏_{α∈T} P_{U_α} u ∈ T_r^T

obtained by successive orthogonal projections (suitably ordered), such that [Grasedyck 2010]

e_{T,r}(u)² ≤ ‖u − u_r‖² ≤ ∑_{α∈T} e_{α,r_α}(u)².

Then error bounds can be obtained from information on the α-singular values of u, or on the Kolmogorov widths of the sets K_α(u) and K_{α^c}(u) of partial evaluations of u.
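The bound can be checked numerically in the simplest (Tucker) case, where the projections are onto the leading left singular subspaces of the mode unfoldings. A sketch under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 3, 8, 3
u = rng.standard_normal((n, n, n))

def mode_unfold(t, k):
    return np.moveaxis(t, k, 0).reshape(t.shape[k], -1)

# principal subspaces U_alpha for alpha = {1}, {2}, {3} and the tail errors e_{alpha,r}(u)^2
projs, tails = [], []
for k in range(d):
    U, s, _ = np.linalg.svd(mode_unfold(u, k), full_matrices=False)
    projs.append(U[:, :r] @ U[:, :r].T)    # orthogonal projection onto U_alpha
    tails.append(np.sum(s[r:]**2))         # e_{alpha,r}(u)^2 = sum of discarded sigma_k^2

# successive projections give u_r = prod_alpha P_{U_alpha} u
u_r = u.copy()
for k in range(d):
    u_r = np.moveaxis(np.tensordot(projs[k], u_r, axes=([1], [k])), 0, k)

err2 = np.linalg.norm(u - u_r)**2
print(err2, sum(tails))                    # err2 <= sum of the e_{alpha,r}(u)^2
```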
Expressive power of tree tensor networks
For standard regularity classes, they perform almost as well as standard approximation tools.

For example, for u ∈ H^k_mix((0,1)^d), K_α(u) ⊂ H^k_mix((0,1)^{#α}) for any α. From bounds on Kolmogorov widths of Sobolev balls,

e_{α,r_α}(u) ≤ d_{r_α}(K_α(u)) ≲ r_α^{−k} log(r_α)^{k(#α−1)},

we obtain that the complexity to achieve a precision ε (with binary trees) is

C(ε) ≲ ε^{−3/k} log(ε^{−1})^d d^{1+3/(2k)}, up to powers of log(ε^{−1}).

Performs almost as well as hyperbolic cross approximation (sparse tensors).

Similar results in [Schneider and Uschmajew 2014] using results on bilinear approximation [Temlyakov 1989]. See also [Griebel and Harbrecht 2019]
Expressive power of tree tensor networks [with M. Bachmayr and R. Schneider]

But they can perform much better for non-standard classes of functions, e.g. a tree-structured composition of regular functions {f_α : α ∈ T}, see [Mhaskar, Liao, Poggio 2016] for deep neural networks:

f_{1,2,3,4}( f_{1,2}(f_1(x_1), f_2(x_2)), f_{3,4}(f_3(x_3), f_4(x_4)) )

with dimension tree {1,2,3,4} → {1,2}, {3,4} → {1}, {2}, {3}, {4}.

Assuming that the functions f_α ∈ W^{k,∞} with ‖f_α‖_{L^∞} ≤ 1 and ‖f_α‖_{W^{k,∞}} ≤ B, the complexity to achieve an accuracy ε is

C(ε) ≲ ε^{−3/k} (L + 1)³ B^{3L} d^{1+3/(2k)}

with L = log₂(d) for a balanced tree and L + 1 = d for a linear tree.

• Bad influence of the depth through the norm B of the functions f_α (roughness).
• For B ≤ 1 (and even for 1-Lipschitz functions), the complexity only scales polynomially in d: no curse of dimensionality!
Expressive power of tree tensor networks
A function in canonical format (shallow network)

u(x) = ∑_{k=1}^r u^1_k(x_1) ... u^d_k(x_d)

can be represented in tree-based format with a similar complexity.

Conversely, a typical function in tree-based format T_r^T has a canonical rank depending exponentially on d.

Deep is better!

For a balanced or linear binary tree T, the subset of tensors v in T_r^T(R^{n×...×n}) with canonical rank less than min{n, r}^{d/2} is of Lebesgue measure 0 [Cohen et al. 2016, Khrulkov et al 2018].

But a typical function in T_r^T may admit a representation complexity exponential in d when using another tree.
Influence of the tree
As an example, consider the probability distribution f(x) = P(X = x) of a Markov chain X = (X_1, ..., X_d) given by

f(x) = f_1(x_1) f_{2|1}(x_2|x_1) ... f_{d|d−1}(x_d|x_{d−1}),

where the bivariate functions f_{i|i−1} have a rank bounded by r.

With the linear tree T containing interior nodes {1,2}, {1,2,3}, ..., {1, ..., d−1}, f admits a representation in tree-based format with storage complexity in r⁴.

The canonical rank of f is exponential in d.

But when considering the linear tree T_σ = {σ(α) : α ∈ T} obtained by applying the permutation σ = (1, 3, ..., d−1, 2, 4, ..., d) to the tree T, the storage complexity in tree-based format is also exponential in d.
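This effect can be observed numerically by assembling the joint probability tensor of a short Markov chain and comparing matricization ranks for a consecutive group of variables and for an interleaved one; the transition kernel below is an arbitrary random choice:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 4                                     # chain length and state-space size

p0 = rng.random(n); p0 /= p0.sum()              # initial distribution f_1
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # transition kernel f_{i|i-1}

# joint pmf f(x_1, ..., x_6) = f_1(x_1) f_{2|1}(x_2|x_1) ... f_{6|5}(x_6|x_5)
f = p0.copy()
for _ in range(d - 1):
    f = np.einsum('...i,ij->...ij', f, P)

def alpha_rank(t, alpha):
    alpha_c = [i for i in range(t.ndim) if i not in alpha]
    rows = int(np.prod([t.shape[i] for i in alpha]))
    return np.linalg.matrix_rank(np.transpose(t, alpha + alpha_c).reshape(rows, -1))

print(alpha_rank(f, [0, 1, 2]))   # consecutive split {1,2,3}: rank <= n (Markov property)
print(alpha_rank(f, [0, 2, 4]))   # interleaved split {1,3,5}: rank typically n^3 = 64
```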
Approximation properties of tree tensor networks
Choosing a good tree (architecture of network) is a crucial but combinatorial problem...
[Figure: four example dimension trees over {1, ..., 8} with different leaf orderings.]
Learning algorithm for tree tensor networks
A function v in the model class T_r^T(H) has a representation v(x) = Ψ(x)((a^α)_{α∈T}), where each parameter a^α is in a tensor space R^{K_α} and Ψ(x) is a multilinear map.

The empirical risk minimization problem over the nonlinear model class T_r^T,

min_{(a^α)_{α∈T}} (1/n) ∑_{i=1}^n γ( Ψ(·)((a^α)_{α∈T}), z_i ),

can be solved using an alternating minimization algorithm, solving at each step an empirical risk minimization problem with a linear model

Ψ(x)((a^α)_{α∈T}) = ∑_{k∈K_α} Ψ^α_k(x) a^α_k

with functions Ψ^α_k(x) depending on the fixed parameters a^β, β ≠ α.

In an L² setting, a re-parametrization is possible so that the functions Ψ^α_k(x) are orthonormal.

Sparsity in the tensors a^α can be exploited in different ways, e.g. by proposing different sparsity patterns and using a model selection technique (e.g. based on validation).

For a leaf node ν, the approximation space H_ν can be selected from a candidate sequence of spaces H_ν^0 ⊂ ... ⊂ H_ν^L ⊂ ...
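A minimal sketch of the alternating strategy, for the simplest case of a rank-one separable model v(x) = ∏_ν a^ν · Φ^ν(x_ν) in d = 3 variables: fixing all parameters except a^ν makes the empirical least-squares problem linear in a^ν. The target function, bases and sample below are arbitrary illustrative choices, not the full adaptive algorithm described on the following slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 3, 5, 2000

u = lambda X: np.exp(-np.sum(X**2, axis=1))        # target function (illustrative)
X = rng.uniform(-1, 1, (n, d))
y = u(X) + 0.01 * rng.standard_normal(n)

Phi = [X[:, nu:nu + 1] ** np.arange(p) for nu in range(d)]   # n x p polynomial features
a = [rng.standard_normal(p) for _ in range(d)]               # one parameter vector per leaf

for sweep in range(20):
    for nu in range(d):
        # with the other parameters fixed, v(x_i) = (Phi[nu][i] @ a[nu]) * c_i is linear in a[nu]
        c = np.prod([Phi[mu] @ a[mu] for mu in range(d) if mu != nu], axis=0)
        A = Phi[nu] * c[:, None]                   # design matrix of the linear subproblem
        a[nu] = np.linalg.lstsq(A, y, rcond=None)[0]

v = np.prod([Phi[nu] @ a[nu] for nu in range(d)], axis=0)
print("empirical risk:", np.mean((y - v)**2))
```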
Learning algorithm for tree tensor networks
Selecting an optimal model class T_r^T(H) is a combinatorial problem.

An algorithm is proposed in [Grelier, Nouy, Chevreuil 2018] that performs adaptations of the tree T (architecture), the rank r (widths) and the approximation space H.

Start with an initial tree T and learn an approximation v ∈ T_r^T(H) with rank r = (1, ..., 1). Then repeat:

Increase some ranks r_α based on estimates of the truncation errors min_{rank_α(v) ≤ r_α} R(v) − R(u).

Learn an approximation v in T_r^T(H), with adaptive selection of H.

Optimize the tree to reduce the storage complexity of v (stochastic algorithm using a suitable distribution over the set of trees): min_T C(T, rank_T(v), H).
Example in supervised learning: composition of functions
Consider a tree-structured composition of functions

u(X) = h( h( h(X_1, X_2), h(X_3, X_4) ), h( h(X_5, X_6), h(X_7, X_8) ) ),

where h(t, s) = (2 + ts)²/9 is a bivariate function and where the d = 8 random variables X_1, ..., X_8 are independent and uniform on [−1, 1].

[Figure: binary tree of compositions, with h applied pairwise to (X_1, X_2), (X_3, X_4), (X_5, X_6), (X_7, X_8), then to the intermediate results.]

We use polynomial approximation spaces H (with adaptive selection of the degree), so that the function u could (in principle) be recovered exactly for any choice of tree with a sufficiently high rank.
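A sketch of this target function and of a training data generation (the observations are taken noise-free here, an assumption; the talk does not specify an observation noise):

```python
import numpy as np

rng = np.random.default_rng(0)

h = lambda t, s: (2 + t * s)**2 / 9          # bivariate building block

def u(X):
    # tree-structured composition over d = 8 variables
    x1, x2, x3, x4, x5, x6, x7, x8 = X.T
    return h(h(h(x1, x2), h(x3, x4)), h(h(x5, x6), h(x7, x8)))

n = 10**5
X = rng.uniform(-1, 1, (n, 8))               # X_1, ..., X_8 i.i.d. uniform on [-1, 1]
y = u(X)                                     # noise-free evaluations of u(X)
print(X.shape, y[:3])
```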
Example in supervised learning: composition of functions
We consider the tree T¹ coinciding with the structure of u, for which

C(T¹, rank_{T¹}(u), H) = 2427.

[Figure: (a) tree T¹ with leaves ordered 1, ..., 8; (b) tree T¹_σ with leaves ordered 8, 1, 6, 4, 7, 2, 3, 5.]

By considering a permutation T¹_σ = {σ(α) : α ∈ T¹} of T¹, with σ = (8, 1, 6, 4, 7, 2, 3, 5), we have a complexity

C(T¹_σ, rank_{T¹_σ}(u), H) ≥ 9 · 10⁶.
Example in supervised learning: composition of functions
We consider a linear tree T² and start the algorithm from a tree T²_σ = {σ(α) : α ∈ T²} obtained by applying a random permutation σ to T².

[Figure: (e) linear tree T² with leaves ordered 1, ..., 8; (f) tree T²_σ with leaves ordered 6, 8, 4, 5, 1, 7, 2, 3, i.e. σ = (6, 8, 4, 5, 1, 7, 2, 3).]
Example in supervised learning: composition of functions
Behavior of the algorithm with a sample size n = 10⁵. [At each iteration, the original slides also display the current tree; only the table is reproduced here.]

Iteration | rank r | ε_test(v) | C(T, r, H)
1  | (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | 3.38 10⁻²  | 79
2  | (1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1) | 2.95 10⁻²  | 100
3  | (1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1) | 2.45 10⁻²  | 121
4  | (1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1) | 1.85 10⁻²  | 142
5  | (1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2) | 8.97 10⁻³  | 163
6  | (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2) | 8.89 10⁻³  | 188
7  | (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2) | 8.87 10⁻³  | 188
8  | (1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2) | 3.97 10⁻³  | 188
9  | (1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 3) | 1.55 10⁻⁴  | 308
10 | (1, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3) | 1.18 10⁻⁴  | 364
11 | (1, 3, 4, 3, 4, 2, 4, 3, 4, 2, 4, 3, 4, 4, 4) | 6.65 10⁻⁶  | 520
12 | (1, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 3, 5, 5, 5) | 1.19 10⁻⁶  | 723
13 | (1, 4, 5, 4, 5, 3, 5, 4, 5, 3, 5, 4, 5, 5, 5) | 1.72 10⁻⁷  | 865
14 | (1, 4, 6, 4, 6, 3, 6, 4, 6, 3, 6, 4, 6, 6, 6) | 1.47 10⁻⁸  | 1113
15 | (1, 5, 6, 5, 6, 3, 6, 5, 6, 3, 6, 5, 6, 6, 6) | 7.02 10⁻⁹  | 1311
16 | (1, 5, 7, 5, 7, 3, 7, 5, 7, 3, 7, 5, 7, 7, 7) | 1.27 10⁻¹⁰ | 1643
17 | (1, 5, 8, 5, 8, 3, 8, 5, 8, 3, 8, 5, 8, 8, 8) | 3.87 10⁻¹² | 2015
18 | (1, 5, 9, 5, 9, 3, 9, 5, 9, 3, 9, 5, 9, 9, 9) | 2.95 10⁻¹⁴ | 2427
Example in supervised learning: composition of functions
Behavior of the algorithm for different sample sizes n.
n | P(T = T¹) | ε_test(v) | C(T, r, H)
10³ | 90%  | [1.75 10⁻⁵, 1.75 10⁻⁴]  | [360, 1062]
10⁴ | 90%  | [2.15 10⁻⁸, 4.10 10⁻³]  | [185, 2741]
10⁵ | 100% | [4.67 10⁻¹⁵, 8.92 10⁻³] | [163, 2594]

Table: training sample size n, estimated probability of obtaining the ideal tree T¹, and ranges (over the 10 trials) for the test error and the storage complexity.
Example in unsupervised learning
We consider a truncated normal distribution with zero mean and covariance matrix Σ. Its support is X = ×_{ν=1}^6 [−5σ_ν, 5σ_ν], with σ_ν² = Σ_νν, and its density (with respect to the Lebesgue measure) is

f(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ) 1_{x∈X}.

We consider polynomial approximation spaces H_ν (with adaptive selection of the degree).
Example in unsupervised learning
Consider

Σ =
[ 2    0    0.5  1    0    0.5 ]
[ 0    1    0    0    0.5  0   ]
[ 0.5  0    2    0    0    1   ]
[ 1    0    0    3    0    0   ]
[ 0    0.5  0    0    1    0   ]
[ 0.5  0    1    0    0    2   ]

After the permutation (3, 6, 1, 4, 2, 5) of its rows and columns, it becomes the matrix

[ 2    1    0.5  0    0    0   ]
[ 1    2    0.5  0    0    0   ]
[ 0.5  0.5  2    1    0    0   ]
[ 0    0    1    3    0    0   ]
[ 0    0    0    0    1    0.5 ]
[ 0    0    0    0    0.5  1   ]

(X_1, X_3, X_4, X_6) and (X_2, X_5) are independent, as well as X_4 and (X_3, X_6), so that

f(x) = f_{1,3,4,6}(x_1, x_3, x_4, x_6) f_{2,5}(x_2, x_5) = f_{4,1}(x_4, x_1) f_{1,3,6}(x_1, x_3, x_6) f_{2,5}(x_2, x_5)

with rank_{{2,5}}(f) = rank_{{1,3,4,6}}(f) = 1, and rank_{{1,4}}(f) = rank_{{1}}(f_{1,3,6}) = rank_{{3,6}}(f_{1,3,6}) = rank_{{3,6}}(f).
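The independence structure can be read from the permuted covariance matrix, which becomes block diagonal; a quick NumPy check of the zero blocks:

```python
import numpy as np

Sigma = np.array([
    [2.0, 0.0, 0.5, 1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
    [0.5, 0.0, 2.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 1.0, 0.0, 0.0, 2.0],
])

perm = [2, 5, 0, 3, 1, 4]                  # variables reordered as (X3, X6, X1, X4, X2, X5)
S = Sigma[np.ix_(perm, perm)]
print(S)

# the (X3, X6, X1, X4) block is decoupled from the (X2, X5) block ...
print(np.allclose(S[:4, 4:], 0))           # True
# ... and X4 is uncorrelated with (X3, X6)
print(S[3, :2])                            # zeros
```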
Example in unsupervised learning

f(x) = f_{4,1}(x_4, x_1) f_{1,3,6}(x_1, x_3, x_6) f_{2,5}(x_2, x_5)

n | Risk × 10⁻² | L²-error | T | C(T, r, H)
10² | [−5.50, 119]    | [0.53, 4.06] | Fig. (a) | [311, 311]
10³ | [−7.29, −5.93]  | [0.22, 0.47] | Fig. (b) | [311, 637]
10⁴ | [−7.60, −6.85]  | [0.11, 0.33] | Fig. (c) | [521, 911]
10⁵ | [−7.68, −7.66]  | [0.04, 0.07] | Fig. (c) | [911, 1213]
10⁶ | [−7.70, −7.69]  | [0.01, 0.01] | Fig. (c) | [1283, 1546]

Table: Ranges over 10 trials

Figure (a), best tree over 10 trials for n = 10²: interior nodes {2,3,4,5,6}, {2,4,5,6}, {2,4,5}, {4,5}, plus the root and the leaves {1}, ..., {6}.

Figure (b), best tree over 10 trials for n = 10³: interior nodes {1,3,4,6}, {3,4,6}, {3,6}, {2,5}, plus the root and the leaves {1}, ..., {6}.

Figure (c), best tree over 10 trials for n = 10⁴, 10⁵, 10⁶: interior nodes {1,3,4,6}, {1,4}, {3,6}, {2,5}, plus the root and the leaves {1}, ..., {6}.

Convergence rate close to O(n⁻¹/²), the minimax rate for the estimation of analytic densities.
Concluding remarks
Open questions for a complete learning theory...
The theoretical results obtained are related to the minimizer u_F^n of the empirical risk over the model class, but available algorithms do not guarantee finding a solution of

min_{v ∈ T_r^T} R_n(v)

Algorithms generate a sequence of estimations in different model classes. Model selection strategies should be proposed (for selecting the tree, the ranks and the background approximation space) that guarantee oracle inequalities.

Convexification of tree tensor networks?
Concluding remarks
A fundamental problem would be to characterize approximation classes A^γ of tree tensor networks, i.e. those functions for which tree tensor networks give a certain performance

inf_{v ∈ T_r^T(H)} R(v) − R(u) ≲ γ(C)⁻¹

for some growth function γ, with C a measure of complexity of T_r^T(H).

These are not standard regularity classes (see [Dahmen et al 2016, Bachmayr and Dahmen 2015]).

For an approximation class A^γ, we would like to devise (black box) algorithms that select a model class T_r^T(H) and provide an estimation û ∈ T_r^T(H) such that

E( R(û) − R(u) ) ≲ γ(C)⁻¹,

possibly with γ replaced by another growth function.
Thank you for your attention
References
A. Falcó, W. Hackbusch, and A. Nouy. Tree-based tensor formats. SeMA Journal, Oct 2018.

E. Grelier, A. Nouy, M. Chevreuil. Learning with tree-based tensor formats. arXiv e-prints, Nov. 2018.

A. Nouy. Higher-order principal component analysis for the approximation of tensors in tree-based low-rank formats. Numerische Mathematik, 141(3):743–789, Mar 2019.

E. Grelier, R. Lebrun, A. Nouy. Learning high-dimensional probability distributions using tree tensor networks. Coming soon...