Dimensionality reduction techniques for large-scale optimization
Coralia Cartis (University of Oxford)
Joint with Jari Fowkes, Estelle Massart, Adilet Otemissov, Zhen Shao (Oxford), Lindon Roberts (ANU Canberra), Jan Fiala (NAG Ltd)
Research supported by the Alan Turing Institute for Data Science, NAG Ltd and NPL
Workshop on Mathematical Foundations of Optimization in Data Science November 24, 2020 (online)
Cantab Capital Institute for the Mathematics of Information
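Johnson-Lindenstrauss Lemma and Random Embeddings
$A \in \mathbb{R}^{n \times d}$, $S \in \mathbb{R}^{m \times n}$, $SA \in \mathbb{R}^{m \times d}$
Let $A$ be a(ny) real $n \times d$ matrix, $n \gg d$, $\epsilon_s \in (0,1]$.
Then the $m \times n$ matrix $S$ is an $\epsilon_s$-subspace embedding for $A$ if
$$(1-\epsilon_s)\|Ax\|_2^2 \le \|SAx\|_2^2 \le (1+\epsilon_s)\|Ax\|_2^2 \quad \text{for all } x \in \mathbb{R}^d.$$
Johnson-Lindenstrauss Lemma [Woodruff ’14]:
If $S$ is a scaled Gaussian matrix with $m = \mathcal{O}(d\,|\log\delta_s|\,\epsilon_s^{-2})$, then $S$ is an (oblivious) $\epsilon_s$-subspace embedding for $A$ with probability at least $1-\delta_s$.
But note the high cost of forming $SA$ $\Longrightarrow$ $\mathcal{O}(nd^2)$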
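The subspace-embedding property can be checked numerically on a small example. The following is a minimal sketch, assuming NumPy; the helper names `gaussian_sketch` and `embedding_distortion` and the sizes `n`, `d`, `m` are illustrative choices, not taken from the slides. It uses the fact that, for $A = QR$ with orthonormal $Q$, the ratios $\|SAx\|_2^2/\|Ax\|_2^2$ range between the smallest and largest squared singular values of $SQ$.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """Scaled Gaussian matrix with iid N(0, 1/m) entries, so that E[S^T S] = I."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def embedding_distortion(A, S):
    """Smallest eps with (1-eps)||Ax||^2 <= ||SAx||^2 <= (1+eps)||Ax||^2 for all x.

    Assumes A has full column rank.
    """
    Q, _ = np.linalg.qr(A)                          # orthonormal basis of range(A)
    sv = np.linalg.svd(S @ Q, compute_uv=False)     # singular values of S restricted to range(A)
    return max(abs(sv.max() ** 2 - 1.0), abs(1.0 - sv.min() ** 2))

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 400                             # illustrative sizes with n >> m >> d
A = rng.standard_normal((n, d))
S = gaussian_sketch(m, n, rng)
eps = embedding_distortion(A, S)
print(f"observed distortion eps = {eps:.3f}")

# Spot-check the two-sided inequality on one random vector x
x = rng.standard_normal(d)
ratio = np.sum((S @ (A @ x)) ** 2) / np.sum((A @ x) ** 2)
print(f"||SAx||^2 / ||Ax||^2 = {ratio:.3f}")
```

Forming `S @ A` here costs a dense matrix product, which is the expense the slide points to as motivation for sparse sketches.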
Sparse Random Embeddings
Moving away from Gaussian sketching: uniformly sampling rows of $A$ $\longrightarrow$ fast, preserves sparsity.
$A \in \mathbb{R}^{n \times d}$ $\longrightarrow$ $SA \in \mathbb{R}^{m \times d}$
But it does not work sometimes...
[Figure: matrix $A$ with entries labelled $\mathcal{O}(1)$ and $\mathcal{O}(10^{-6})$. Chance of missing first row: $(n-m)/n$]
Sampling provides an embedding when $A$ has low coherence
Definition [Leverage score, coherence]: Given $A = U\Sigma V^T$, the leverage score of row $i$ is $\|U_i\|_2$ (row norm). The coherence of $A$, $\mu(A)$, is the maximum of the leverage scores; $d/n \le \mu(A) \le 1$.
If $\mu(A) \ll 1$, the $\|U_i\|_2$ are similar in magnitude; intuitively, the rows of $A$ are similarly important in determining the solution.
[Drineas et al ’10, ’11; Tropp ’11]: If $S$ is a random sampling matrix with $m = \mathcal{O}(\mu(A)^2 d \log d\,|\log\delta_s|\,\epsilon_s^{-2})$,
then $S$ is an $\epsilon_s$-subspace embedding for $A$ with probability at least $1-\delta_s$.
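To make the role of coherence concrete, the following hedged sketch computes $\mu(A)$ for two matrices, assuming NumPy. The function name `coherence` and both test matrices are illustrative assumptions: one matrix is Gaussian (low coherence), the other has a single row of size $\mathcal{O}(1)$ above rows of size $\mathcal{O}(10^{-6})$, which is one way to realise the failure case for uniform row sampling.

```python
import numpy as np

def coherence(A):
    """Largest row norm of U in the thin SVD A = U Sigma V^T (A assumed full column rank)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm(U, axis=1).max()

rng = np.random.default_rng(0)
n, d, m = 1000, 5, 50

# Low coherence: all rows carry a similar share of the information
A_incoherent = rng.standard_normal((n, d))

# High coherence (assumed example): first row O(1), all other rows O(1e-6)
A_coherent = 1e-6 * rng.standard_normal((n, d))
A_coherent[0, :] = 1.0

mu_low = coherence(A_incoherent)
mu_high = coherence(A_coherent)
print(f"mu(A_incoherent) = {mu_low:.3f}, mu(A_coherent) = {mu_high:.6f}")

# Uniform sampling of m rows without replacement misses row 0 with probability (n - m)/n
miss_probability = (n - m) / n
rows = rng.choice(n, size=m, replace=False)
print(f"P(miss first row) = {miss_probability:.2f}, sampled it this time: {0 in rows}")
```

When the dominant row is missed, the sampled matrix only sees the $\mathcal{O}(10^{-6})$ rows, so no rescaling can recover the geometry of the range of `A_coherent`.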
Sparse Random Embeddings
Hashing: sparse sketching for dense and sparse matrices
$A \in \mathbb{R}^{n \times d}$ $\longrightarrow$ $SA \in \mathbb{R}^{m \times d}$
Sampling: one non-zero per row
$$S = \begin{pmatrix} 0 & \cdots & 0 & 1 & 0 & 0 \\ 0 & 0 & \cdots & \cdots & \cdots & 1 \\ 0 & 0 & 1 & 0 & \cdots & 0 \\ 0 & \cdots & \cdots & 1 & 0 & 0 \end{pmatrix}$$
Hashing: one non-zero per column
$$S = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & 1 & \cdots & 0 \end{pmatrix}$$
Action of Hashing $S$:
$$SA = \sum_{i=1}^{n} s_i a_i, \qquad s_i: \text{$i$th column of } S, \quad a_i: \text{$i$th row of } A.$$
Compared to sampling, Hashing uses every row of $A$.
• Expect better robustness
Can also consider $s$ non-zeros per column: $s$-hashing. More robustness is achieved.
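A small construction of an $s$-hashing matrix, and a check of the column-times-row expansion of $SA$, can be sketched as follows (NumPy assumed). The function name `s_hashing_matrix` is hypothetical, and the choice of non-zero values $\pm 1/\sqrt{s}$ with random signs is a common convention assumed here rather than something stated on the slide; a dense array is used only for readability, whereas a practical implementation would store $S$ in a sparse format.

```python
import numpy as np

def s_hashing_matrix(m, n, s, rng):
    """m x n matrix with exactly s non-zeros per column, each equal to +-1/sqrt(s)."""
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)            # s distinct rows for column j
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

rng = np.random.default_rng(0)
n, d, m, s = 200, 4, 40, 2                                     # illustrative sizes
A = rng.standard_normal((n, d))
S = s_hashing_matrix(m, n, s, rng)

SA = S @ A
# Same product written as a sum of outer products: (i-th column of S) times (i-th row of A)
SA_sum = sum(np.outer(S[:, i], A[i, :]) for i in range(n))

nonzeros_per_column = np.count_nonzero(S, axis=0)
print(np.allclose(SA, SA_sum), nonzeros_per_column.min(), nonzeros_per_column.max())
```

Because every column of `S` is non-zero, every row of `A` contributes to `SA`, in contrast to row sampling.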
Sparse Random Embeddings
Sketching with hashing matrices: theoretical results. [Shao, C, Fiala ’20]
Sampling has better embedding properties when coherence of $A$ is low. Is this true for hashing?
When $\mu(A)$ is sufficiently small, hashing provides an $\epsilon_s$-subspace embedding with an optimal dimensionality reduction bound, $\mathcal{O}(d)$, better than the $\mathcal{O}(d \log d)$ bound for sampling.
Result                   Coherence of $A$              Size of sketching $S$
[Meng & Mahoney ’13]     -                             $\Theta(d^2\,|\log\delta_s|\,\epsilon_s^{-2})$
[Bourgain et al ’15]     $\mathcal{O}(\log^{-3} d)$    $\mathcal{O}(d \log^2 d\,|\log\delta_s|\,\epsilon_s^{-2})$
[C, Fiala & Shao ’20]    $\mathcal{O}(d^{-1})$         $\mathcal{O}(d\,|\log\delta_s|\,\epsilon_s^{-2})$
Throughout, $d$ can be replaced by rank of $A$.
Using sketching for optimization?
How can we use sketching for improving efficiency and scalability of optimization algorithms?
Sketching in the observational domain (subsampling, batch) reduces the number of observations/measurements/data points
• linear least squares solver (Solver: Ski-LLS [C, Fiala, Shao ’20])
• nonlinear least squares - derivative-based Gauss-Newton methods [C, Scheinberg ’20]
• nonlinear least squares - derivative-free Gauss-Newton methods [C, Ferguson, Roberts ’20]
Sketching in the variable domain (block-coordinate, subspace methods) reduces the number of parameters/variables
• Gauss-Newton variants for derivative-based and derivative-free
• Functions with low effective dimensionality, global optimization
Today
Nonlinear least squares: derivative-based methods
Gauss-Newton method for Non-linear Least Squares (NNLS)
$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{2}\|r(x)\|_2^2 = \frac{1}{2}\sum_{i=1}^{n} (r_i(x))^2$$
where $r : \mathbb{R}^d \to \mathbb{R}^n$ is smooth and possibly nonconvex; $J$ is the $n \times d$ Jacobian matrix of first derivatives of $r$.
Gauss-Newton method(s): state-of-the-art for NNLS
At iterate $x_k$, calculate direction $s_k \in \mathbb{R}^d$ by approximately minimizing a regularized/constrained/unconstrained variant of the convex quadratic local model, over $s \in \mathbb{R}^d$,
$$q_k(s) = \frac{1}{2}\|J(x_k)s + r(x_k)\|_2^2 = f(x_k) + \langle J(x_k)^T r(x_k), s\rangle + \frac{1}{2}\langle s, J(x_k)^T J(x_k) s\rangle.$$
Regularization, trust-region and linesearch variants have been successfully developed.
We will look at: Sketching in the variable domain (subspace methods)
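Randomised Subspace Gauss-Newton (R-SGN) methods: variable sketching
R-SGN with quadratic regularisation/trust region: at iteration $k$, [C, Fowkes, Shao ’20]
• randomly draw a $p \times d$ sketching matrix $S_k$; calculate the subspace-Jacobian $J(x_k)S_k^T$ and the reduced local quadratic model, for $\hat{s} \in \mathbb{R}^p$, $p \ll d$,
$$\hat{q}_k(\hat{s}) = \frac{1}{2}\|J(x_k)S_k^T\hat{s} + r(x_k)\|_2^2 = f(x_k) + \langle S_k J(x_k)^T r(x_k), \hat{s}\rangle + \frac{1}{2}\langle \hat{s}, S_k J(x_k)^T J(x_k) S_k^T \hat{s}\rangle.$$
• solve the reduced subproblem (inexactly) to find $\hat{s}_k \in \mathbb{R}^p$,
$$\min_{\hat{s} \in \mathbb{R}^p} \hat{q}_k(\hat{s}) \left(+ \frac{\sigma_k}{2}\|S_k^T\hat{s}\|^2\right) \quad \text{or} \quad \left(\text{such that } \|S_k^T\hat{s}\| \le \Delta_k\right).$$
• compute the ratio
$$\rho_k = \frac{f(x_k) - f(x_k + S_k^T\hat{s}_k)}{f(x_k) - \hat{q}_k(\hat{s}_k)}.$$
• set $x_{k+1} = x_k + S_k^T\hat{s}_k$ and $\sigma_{k+1} < \sigma_k$ ($\Delta_{k+1} > \Delta_k$) if $\rho_k \ge \eta_1$; else, $x_{k+1} = x_k$ and $\sigma_{k+1} > \sigma_k$ ($\Delta_{k+1} < \Delta_k$).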
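The R-SGN iteration with quadratic regularisation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rsgn`, the parameter values (`eta1`, `gamma`, initial `sigma`), the exact (rather than inexact) subproblem solve, and the test problem $r(x) = Ax + 0.1\sin(x) - b$ are all hypothetical choices. For simplicity the full Jacobian is formed and then sketched; a practical code would compute the $n \times p$ sketched Jacobian directly.

```python
import numpy as np

def rsgn(r, J, x0, p, iters, rng, sigma=1.0, eta1=0.1, gamma=2.0):
    """Randomised subspace Gauss-Newton with quadratic regularisation (illustrative)."""
    x = x0.copy()
    d = x.size
    for _ in range(iters):
        S = rng.standard_normal((p, d)) / np.sqrt(p)   # scaled Gaussian sketch, p x d
        rk = r(x)
        fk = 0.5 * rk @ rk
        JS = J(x) @ S.T                                # subspace Jacobian, n x p
        # Minimise 0.5*||JS s + rk||^2 + 0.5*sigma*||S^T s||^2 exactly over s in R^p
        H = JS.T @ JS + sigma * (S @ S.T)
        s_hat = np.linalg.solve(H, -JS.T @ rk)
        q_hat = 0.5 * np.sum((JS @ s_hat + rk) ** 2)   # reduced model value, no regulariser
        x_trial = x + S.T @ s_hat
        r_trial = r(x_trial)
        predicted = fk - q_hat
        actual = fk - 0.5 * r_trial @ r_trial
        rho = actual / predicted if predicted > 0 else -np.inf
        if rho >= eta1:                                # successful: accept, decrease sigma
            x, sigma = x_trial, max(sigma / gamma, 1e-8)
        else:                                          # unsuccessful: reject, increase sigma
            sigma *= gamma
    return x

# Hypothetical test problem with n = d = 20
rng = np.random.default_rng(0)
d = 20
A = np.eye(d) + 0.1 * rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
r = lambda x: A @ x + 0.1 * np.sin(x) - b
J = lambda x: A + 0.1 * np.diag(np.cos(x))             # Jacobian of r
f = lambda x: 0.5 * np.sum(r(x) ** 2)

x0 = np.zeros(d)
x_final = rsgn(r, J, x0, p=5, iters=200, rng=rng)
print(f"f(x0) = {f(x0):.3e}, f(x_final) = {f(x_final):.3e}")
```

Each iteration solves only a $p \times p$ linear system, and accepted steps always decrease $f$ because the ratio test requires positive actual reduction.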
Global rates of convergence for R-SGN methods
Assumptions:
• $r$, $J$ are Lipschitz continuous; [smoothness]
• Let $\epsilon_s, \delta_s \in (0,1)$. At each iterate $x_k$, with probability at least $1-\delta_s$,
$$\|S_k \nabla f(x_k)\|_2^2 \ge (1-\epsilon_s)\|\nabla f(x_k)\|_2^2, \quad \text{and} \quad \|S_k\|_2 \le S_{\max}.$$ [sketching accuracy]
• Typical inexact model minimization conditions for quadratic regularisation/trust-region
Theorem [R-SGN]: Let $\epsilon > 0$, and $\delta \in (0,1)$ such that $(1-\delta_s)\delta > c$, where $c$ is a user-chosen parameter*. Then the R-SGN algorithm takes at most
$$N \le \left[(1-\delta_s)\delta - c\right]^{-1} \mathcal{O}\left(f(x_0)(1-\epsilon_s)^{-1}\epsilon^{-2}\right)$$
iterations and evaluations of the residual and sketched Jacobian such that $\min_{k \le N}\|\nabla f(x_k)\|_2 \le \epsilon$, with probability at least $1 - e^{-\frac{(1-\delta)^2}{2}(1-\delta_s)N}$.
[* The constant $c$ connects the updates of the regularisation/trust-region parameter]
This bound matches deterministic complexity bounds for first-order and Gauss-Newton methods despite having only partial Jacobian information available at each iteration. [C, Fowkes, Shao ’20]
Global rates of convergence for R-SGN methods
Proof idea: uses techniques from probabilistic models complexity analyses [Gratton et al ’18; C, Scheinberg ’18]
• True/false iterations, successful/unsuccessful iterations
• There can be at most $C[f(x_0) - f^*]\epsilon^{-2}$ true and successful iterations (from sufficient decrease condition) and $f^* = 0$.
• Sketching accuracy assumption gives: $\mathbb{P}(T < \delta N) \le e^{-(1-\delta)^2 N}$ for any $\delta \in (0,1)$, where $T$ and $N$ are the total number of true and total iterations, respectively.
Global rates of convergence for R-SGN methods
Satisfying the sketching accuracy assumption
• $S_k$ $p \times d$ matrix with iid scaled Gaussian entries, with $p = \mathcal{O}(|\log\delta_s|\,\epsilon_s^{-2})$
• $S_k$ $p \times d$ $s$-hashing matrix, with $p = \mathcal{O}(|\log\delta_s|\,\epsilon_s^{-2})$
Sufficient for each $S_k$ to be a (one-sided) $\epsilon_s$-subspace embedding for one-dimensional vectors, so that the gradient can be embedded correctly.
• $S_k$ $p \times d$ sampling matrix: need non-uniformity dependent subspace embeddings for vectors $y$ with $\|y\|_\infty \cdot \|y\|_2^{-1} \le \nu_s$. Then $p = \mathcal{O}(d\,\nu_s^2\,|\log\delta_s|\,\epsilon_s^{-2})$. This implies that sampling $S_k$ embeds correctly the gradient whenever $\|\nabla f(x)\|_\infty \cdot \|\nabla f(x)\|_2^{-1} \le \nu_s$, that is, when the gradient components are similar in magnitude.
Comparison with probabilistic models
Our sketching assumption is weaker than probabilistic model conditions [Bandeira, Scheinberg, Vicente ’13]: one-sided length preservation of gradient; not required to embed subspace; numerical example [C, Fowkes, Shao ’20]
Block-Coordinate Gauss-Newton (BC-GN) methods
BC-GN = R-SGN with sampling matrix $S_k$
When $S_k$ is a sampling matrix, $J(x_k)S_k^T$ is a random subset of size $p$ of the columns $\partial r/\partial x_i$ of $J(x_k)$.
Theorem [R-SGN] $\Longrightarrow$ global rate of convergence of BC-GN (with quadratic regularisation or trust region) with high probability, provided the gradient has similar components in magnitude
Under more general assumptions, we can obtain a weaker global rate analysis for BC-GN with fixed and arbitrary block size. Assume that each coordinate block $B_k$ of size $p$ is drawn with probability $P_k$ (with replacement or from a partition). Then
$$\mathbb{E}_{B_k}\|\nabla_{B_k} f(x_k)\|^2 \ge P_{\min} R\,\|\nabla f(x_k)\|^2, \qquad [\nabla_{B_k} f(x_k) = J_{B_k}(x_k)^T r(x_k)]$$
where each coordinate appears $R$ times in the set of all possible blocks.
Theorem [BC-GN]: Assume $r$, $J$ Lipschitz continuous. Then the number of BC-GN iterations/evaluations until $\mathbb{E}(\|\nabla f(x_k)\|^2) \le \epsilon^2$ is at most $\mathcal{O}\left(f(x_0)(P_{\min}R)^{-1}\epsilon^{-2}\right)$. In particular, when blocks are drawn uniformly at random, such as from a partition, then $P_{\min}R = p\,d^{-1}$.
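The block-gradient bound can be checked exactly in the uniform-partition case used in the remark above, where blocks are drawn uniformly from a partition and the bound holds with equality and constant $p/d$. The sketch below assumes NumPy; the data `J`, `r` and the sizes `n`, `d`, `p` are arbitrary illustrative choices. It also shows that a sampling matrix applied to the Jacobian simply selects columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 30, 12, 3                        # illustrative sizes, with p dividing d
J = rng.standard_normal((n, d))            # stands in for J(x_k)
r = rng.standard_normal(n)                 # stands in for r(x_k)
grad = J.T @ r                             # full gradient J^T r

# Partition of the coordinates into d/p blocks of size p, each drawn with probability p/d
blocks = np.arange(d).reshape(d // p, p)
block_grad_sq = [np.sum((J[:, B].T @ r) ** 2) for B in blocks]
expected_block_grad_sq = np.mean(block_grad_sq)   # expectation over a uniformly drawn block

# A sampling matrix S (rows of the identity) picks out the columns of J in block B
B = blocks[1]
S = np.eye(d)[B, :]                        # p x d sampling matrix
print(np.allclose(J @ S.T, J[:, B]))
print(expected_block_grad_sq, (p / d) * grad @ grad)
```

Here each coordinate lies in exactly one block, so $R = 1$ and $P_{\min} = p/d$.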
R-SGN/BC-GN methods: numerical experiments [C, Fowkes, Shao ’20; WPaper 20]
BC-GN with TR on logistic regression for chemotherapy dataset (Python code)
R-SGN/BC-GN methods: numerical experiments [C, Fowkes, Shao ’20]
BC-GN with TR on logistic regression on gisette dataset
Nonlinear least squares: derivative-free methods
Subspace derivative-free Gauss-Newton methods for NNLS
Sketching DFO-GN/DFO-LS in $d$ (number of variables/size of interpolation set)
Fewer evaluations and lower linear algebra cost per iteration. Global efficiency?
[Roberts, PhD Thesis ’19; C, Roberts ’20]
Use interpolation set $\{x_k, y_1, \ldots, y_p\}$ for $p < d$, then solve
$$\begin{pmatrix} (y_1 - x_k)^T \\ \vdots \\ (y_p - x_k)^T \end{pmatrix} J_k^T = \begin{pmatrix} (r(y_1) - r(x_k))^T \\ \vdots \\ (r(y_p) - r(x_k))^T \end{pmatrix}$$
Underdetermined system $\Longrightarrow$ take minimal norm solution.
Computational cost = factorization + solve = $\mathcal{O}(dp^2) + \mathcal{O}(np^2) \approx \mathcal{O}(np^2)$
Evaluation cost: only need $p$ evaluations of $r$ on first iteration and a small number/multiple of $p$ subsequently
Choose $p$ based on computational resources/evaluation cost
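The minimal-norm solution of the underdetermined interpolation system can be illustrated as follows, assuming NumPy. The function name `interpolate_jacobian` and the linear test residual are hypothetical; `np.linalg.lstsq` (which returns the minimum-norm solution for underdetermined systems) is swapped in here for the factorization-plus-solve approach mentioned on the slide.

```python
import numpy as np

def interpolate_jacobian(r, xk, Y):
    """Minimal-norm J_k (n x d) with J_k (y_i - xk) = r(y_i) - r(xk) for each row y_i of Y."""
    W = Y - xk                                      # p x d, rows (y_i - xk)^T
    R = np.array([r(y) for y in Y]) - r(xk)         # p x n, rows (r(y_i) - r(xk))^T
    Jt, *_ = np.linalg.lstsq(W, R, rcond=None)      # d x n, minimum-norm solution of W J^T = R
    return Jt.T

# Hypothetical linear residual r(x) = M x + c, so the exact Jacobian is M
rng = np.random.default_rng(0)
n, d, p = 8, 20, 4
M = rng.standard_normal((n, d))
c = rng.standard_normal(n)
r = lambda x: M @ x + c

xk = rng.standard_normal(d)
Y = xk + 0.1 * rng.standard_normal((p, d))          # p interpolation points near xk
Jk = interpolate_jacobian(r, xk, Y)

W = Y - xk
projector = np.linalg.pinv(W) @ W                   # orthogonal projector onto span{y_i - xk}
print(np.allclose(Jk @ W.T, M @ W.T), np.allclose(Jk, M @ projector))
```

For a linear residual the interpolated `Jk` equals the true Jacobian restricted to the $p$-dimensional subspace spanned by the interpolation directions, and is zero on its orthogonal complement.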
Subspace derivative-free Gauss-Newton methods for NNLS
DFBGN (Derivative-Free Block Gauss-Newton) Algorithm
• Build low-dimensional model and calculate trust-region step $\hat{s}_k \in \mathbb{R}^p$,
$$\min_{\hat{s} \in \mathbb{R}^p} \frac{1}{2}\|r(x_k) + \hat{J}_k\hat{s}\|^2 \quad \text{s.t.} \quad \|\hat{s}\| \le \Delta_k$$
• Evaluate $f(x_k + Q_k\hat{s}_k)$, accept/reject step, and update $\Delta_k$ (usual DFO choices) (where $Q_k$ is basis of interpolation set $\mathcal{Y}_k = \{y_1 - x_k, \ldots, y_p - x_k\}$)
• Add $x_k + Q_k\hat{s}_k$ to interpolation set and remove $p_{\text{drop}} \ge 2$ points from the interpolation set
• Add random orthogonal directions $x_k + \Delta_k d$ for $d \perp \mathcal{Y}_k$ until we have $p + 1$ interpolation points
Comments:
• Linear algebra cost $\mathcal{O}(np^2 + dp^2 + p^3)$ vs full space method $\mathcal{O}(nd^2 + d^3)$
• Choosing points to remove uses Lagrange polynomials (geometry-aware)
• Choice of $p_{\text{drop}}$: $p_{\text{drop}} = 2$ on successful iterations, $p/10$ on unsuccessful iterations
Subspace derivative-free Gauss-Newton methods for NNLS
Numerical results for DFBGN algorithm
Choose test set CUTEst with $d \approx 1000$, max 12hrs per problem
DFBGN outperforms DFO-LS for low accuracy solutions ... because it does not time out!
[Figure: relative accuracy = 0.1 vs budget. x-axis: budget / min budget of any solver (1 to 32); y-axis: proportion of problems solved (0 to 1). Curves: DFO-LS, DFO-LS (init n/100), DFBGN (p = n), DFBGN (p = n/2), DFBGN (p = n/10), DFBGN (p = n/100). In figures, n → d.]
Solver and timeout:
DFO-LS           93%
DFBGN (d/100)    35%
DFBGN (d/10)     74%
DFBGN (d/2)      82%
Subspace derivative-free Gauss-Newton methods for NNLS
Numerical results for DFBGN algorithm
Other advantage: DFBGN makes progress after $p \ll d$ evaluations (especially important when $d$ is large)
[Figures: normalized objective reduction vs. # evaluations, 12hr timeout. Two panels: ARWHDNE, d=2000 and CHANDHEQ, d=2000. x-axis: budget (in gradients), 0 to 1; y-axis: normalized objective value (log scale). Curves: DFO-LS, DFO-LS (init n/100), p = n, p = n/2, p = n/10, p = n/100. In figures, n → d.]
Random embeddings for global optimization
Global optimization of functions with low effective dimensionality
$$\min_{x} f(x) \quad \text{subject to} \quad x \in \mathcal{X} = [-1,1]^d$$
Global optimization is generally NP-hard. Can global optimization algorithms be made efficient for 'simpler' problems? What is problem/data 'simplicity'? Can algorithms adapt to data (without knowing it a priori)?
Problem simplicity: Functions which do not vary along certain linear subspaces.
Alternative names: low effective dimensionality, (multi-)ridge, planar waves, active subspaces
Applications: hyper-parameter optimization; complex engineering simulations; parametric, stochastic PDEs; over-parametrized DNNs?
Global optimization of functions with low effective dimensionality
Challenging set-up: The objective function is black box. The orientation of the important subspace is not known.
Solution: Random embeddings [Ziyu Wang et al. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1), 2016.]
Random embeddings $\longrightarrow$ lower dimensional problems
Replace $f(x)$ by $f(S^T y)$, where $S$ is a $p \times d$ Gaussian matrix, and $p \ll d$.
Functions with low effective dimensionality [Wang et al. ’13]: $f : \mathbb{R}^d \to \mathbb{R}$ has effective dimensionality $d_e \le d$ if there exists a linear subspace $\mathcal{T}$ of dimension $d_e$ such that $f(x_\top + x_\perp) = f(x_\top)$ for all vectors $x_\top$ in $\mathcal{T}$ and $x_\perp$ in $\mathcal{T}^\perp$. [$d_e$ is the smallest integer satisfying these properties]. Dimensions of interest: $d_e \le p \le d$.
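A toy function with low effective dimensionality makes the definition and the random-embedding idea concrete. The sketch below assumes NumPy; the quadratic test function, the subspace basis `B`, the target `z_star` and the sizes are all hypothetical choices, and the box constraints are ignored. It checks that $f$ is constant along $\mathcal{T}^\perp$ and that, for $p \ge d_e$, a Gaussian embedding contains a point $S^T y$ attaining the global minimum value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_e, p = 100, 2, 5                                  # ambient, effective, embedding dimensions

# Hypothetical test function: f varies only along T = span of the columns of B
B, _ = np.linalg.qr(rng.standard_normal((d, d_e)))     # orthonormal basis of T
z_star = np.array([0.3, -0.2])                         # minimiser in the coordinates of T

def f(x):
    return np.sum((B.T @ x - z_star) ** 2)             # global minimum value f* = 0

# Invariance: f(x_top + x_perp) = f(x_top)
x = rng.standard_normal(d)
x_top = B @ (B.T @ x)                                  # component in T
x_perp = x - x_top                                     # component in the orthogonal complement
print(np.isclose(f(x_top + x_perp), f(x_top)))

# Random embedding: B^T S^T is d_e x p and has full row rank almost surely when p >= d_e
S = rng.standard_normal((p, d))
y_star = np.linalg.pinv(B.T @ S.T) @ z_star            # minimal 2-norm y with f(S^T y) = f*
print(f(S.T @ y_star))
```

The reduced problem has only `p` variables, yet it still contains a global minimiser of the original function.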
Global optimization of functions with low effective dimensionality
Reduced optimization problems [C, Otemissov ’20; C, Massart, Otemissov ’20]
(R): $\min_{y \in \mathbb{R}^p} f(S^T y + u)$ s.t. $y \in Y = [-a, a]^p$
(AR): $\min_{y \in \mathbb{R}^p} f(S^T y + u)$ s.t. $S^T y + u \in \mathcal{X}$
REGO algorithm (single random embedding): $u = 0$; solves (R) once (using any global solver); $f(S^T y^*) \approx f^*$, with $S^T y^*$ an unconstrained solution
AREGO algorithm (multiple random embeddings): solves (AR) multiple times (with any global solver); $f(S^T y^*) \approx f^*$, $S^T y^* \in \mathcal{X}$; updates $u$ to best point found so far
Assumption: $p \ge d_e$.
Theoretical analysis
• REGO: best-known probability of success of (R) and suitable choices of $a$; depends only on $p$, $d_e$, not on ambient dimension $d$
• AREGO: probability of success of (AR) and convergence of AREGO; depends on $d$ algebraically, not exponentially
Numerical experiments: confirm the theoretical findings; include replacing global solvers with local ones
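Global optimization of functions with low effective dimensionality
(R) reduced subproblem and REGO algorithm
$$\min_{y \in \mathbb{R}^p} f(S^T y) \quad \text{s.t.} \quad y \in Y = [-a, a]^p, \qquad y_2^* = \arg\min_{y:\, f(S^T y) = f^*} \|y\|_2.$$
We can show that
$$\frac{\|x_\top^*\|_2^2}{\|y_2^*\|_2^2} \sim \chi^2_{p - d_e + 1} \;\Longrightarrow\; \mathbb{P}(\text{(R) is successful}) \ge 1 - C(q)\left(1 + \frac{q}{2}e^{-c^2/2}\right)\left(\frac{c^2}{2}\right)^{q/2}$$
where $q = p - d_e + 1$, $c = \|x_\top^*\|/a$.
[Figure (R): BARON on GO problems with low effective dimensionality. In figures, D → d and d → p.]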
Global optimization of functions with low effective dimensionality
(AR) reduced subproblem and AREGO algorithm
$$\min_{y \in \mathbb{R}^p} f(S^T y + u) \quad \text{s.t.} \quad S^T y + u \in \mathcal{X}$$
$$\mathbb{P}(\text{(AR) is successful}) \ge \mathbb{P}(-1 \le S^T y_2^* + u \le 1) > \tau(d) > 0 \quad \longrightarrow \text{t-distribution}$$
Convergence of AREGO, with prob one, proved to a neighbourhood of global minimum of original problem; multiple embeddings $S_k$ used.
[Figures (AR): BARON and Local KNITRO. Same tests as for REGO, functions with low effective dimensionality. In figures, D → d and d → p.]