Large-scale Learning with Kernels & libSkylark
Vikas Sindhwani, IBM Research, NY
MMDS 2014: Algorithms for Modern Massive Data Sets, UC Berkeley, June 19th 2014
What do you see?
Motivation
• Train Lasso on (x1, y1) . . . (xn, yn), xi ∈ R^d, with n = 1 Billion, d = 10K. What do you see?
  – Systems: 80 terabytes ⇒ MapReduce script on a Hadoop cluster?
  – NLA: argmin_{‖w‖₁ ≤ 1} ‖Xw − y‖₂² ⇒ argmin_{‖w‖₁ ≤ 1} ‖SXw − Sy‖₂²
  – Statistician: with high probability 1 − δ, a training set of size n = O((1/ε²) log(2d²/δ)) is enough for the generalization error to be within ε of the best linear model with ‖ · ‖₁ ≤ 1 ⇒ don't need n = 1 Billion
• Designing a big-data Machine Learning stack requires conversations across these perspectives.
• Avoid strong assumptions (linearity, sparsity, . . .) upfront to avoid saturation ⇒ non-parametric models (models that grow with data)
• This setting needs both Distributed computation and Randomization
Outline
• Non-parametric modeling with kernel methods, their scalability problems, and Random Fourier Features (Rahimi & Recht, 2007)
• Kernel methods match DNNs on (knowledge-free) speech recognition benchmarks (ICASSP, 2014): parallelization + randomization
• Recent efforts towards improving scalability:
  – Practical implementations of distributed ADMM: handling large numbers of examples and large random feature spaces.
  – Quasi-Monte Carlo Feature Maps (ICML 2014)
• libskylark: open-source software library instantiating sketching primitives and randomized Numerical Linear Algebra (NLA) techniques for large-scale Machine Learning in distributed-memory environments.
Acknowledgements and References
• XDATA Skylark team: Ken Clarkson (PI), Haim Avron, Costas Bekas, Christos Boutsidis, Ilse Ipsen, Yves Ineichien, Anju Kambadur, Giorgios Kollias, Michael Mahoney, Vikas Sindhwani, David Woodruff
• High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, with Haim Avron, 2014
• Quasi-Monte Carlo Feature Maps for Shift Invariant Kernels, with Jiyan Yang, Haim Avron, and Michael Mahoney, ICML 2014
• Kernel Methods Match Deep Neural Networks on TIMIT, with P. Huang, H. Avron, T. Sainath and B. Ramabhadran, ICASSP 2014
• Random Laplace Feature Maps for Semigroup Kernels on Histograms, with J. Yang, Q. Fan, H. Avron, M. Mahoney, CVPR 2014
Kernel Methods (Aronszajn, 1950)
• Symmetric positive definite function k(x, z) on input domain X ⊂ R^d
• k ⇔ a rich Reproducing Kernel Hilbert Space (RKHS) Hk of real-valued functions, with inner product 〈·, ·〉k and norm ‖ · ‖k
• Regularized Risk Minimization ⇔ linear models in an implicit high-dimensional (often infinite-dimensional) feature space:

  f⋆ = argmin_{f ∈ Hk} (1/n) Σ_{i=1}^n V(yi, f(xi)) + λ ‖f‖²_Hk,   xi ∈ R^d

• Representer Theorem: f⋆(x) = Σ_{i=1}^n αi k(x, xi)
The Issue of Scalability
• Regularized Least Squares (kernel form):

  (K + λI) α = Y   ⇒   O(n²) storage, O(n³ + n²d) training, O(nd) test speed

  Hard to parallelize when working directly with Kij = k(xi, xj)

• Linear kernels: k(x, z) = x^T z, f⋆(x) = x^T w (w = X^T α):

  (X^T X + λI) w = X^T Y   ⇒   O(nd) storage, O(nd²) training, O(d) test speed
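To make the cost comparison concrete, here is a minimal numpy sketch of the two solves above; the Gaussian kernel, the regularization value, and the toy data sizes are illustrative assumptions, not part of the slides.

import numpy as np

# Toy data (sizes chosen for illustration only)
n, d, lam, sigma = 2000, 50, 1e-2, 1.0
X = np.random.randn(n, d)
Y = np.random.randn(n, 1)

# Exact kernel RLS with a Gaussian kernel: O(n^2) storage for K, O(n^3) solve
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
alpha = np.linalg.solve(K + lam * np.eye(n), Y)           # (K + lam*I) alpha = Y

# Linear-kernel RLS: O(nd) storage, O(nd^2) training
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # (X^T X + lam*I) w = X^T Y

# Prediction on a new point: O(nd) for the kernel model, O(d) for the linear model
x = np.random.randn(d)
f_kernel = float(np.exp(-np.sum((X - x)**2, axis=1) / (2 * sigma**2)) @ alpha)
f_linear = float(x @ w)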
Randomized Algorithms: Definitions
• Data-oblivious explicit feature map Ψ : R^d → C^s such that

  k(x, z) ≈ 〈Ψ(x), Ψ(z)〉_{C^s}   ⇒   (Z(X)^T Z(X) + λI) w = Z(X)^T Y   ⇒   O(ns) storage, O(ns²) training, O(s) test speed

• Shift-invariant kernels: k(x, z) = ψ(x − z) for some complex-valued positive definite function ψ on R^d
  – Given any set of m points z1 . . . zm ∈ R^d, the m × m matrix Aij = ψ(zi − zj) is positive definite.
Randomized Algorithms: Bochner’s Theorem

Theorem 1 (Bochner, 1932-33). A complex-valued function ψ : R^d → C is positive definite if and only if it is the Fourier Transform of a finite non-negative measure µ on R^d, i.e.,

  ψ(x) = µ̂(x) = ∫_{R^d} e^{−i x^T w} dµ(w),   x ∈ R^d.
Random Fourier Features (Rahimi & Recht, 2007)
• One-to-one correspondence between k and a density p such that

  k(x, z) = ψ(x − z) = ∫_{R^d} e^{−i(x−z)^T w} p(w) dw

  Gaussian kernel: k(x, z) = e^{−‖x−z‖₂²/(2σ²)}  ⇐⇒  p = N(0, σ⁻² Id)

• Monte Carlo approximation to the integral representation:

  k(x, z) ≈ (1/s) Σ_{j=1}^s e^{−i(x−z)^T wj} = 〈ΨS(x), ΨS(z)〉_{C^s}

  ΨS(x) = (1/√s) [e^{−i x^T w1} . . . e^{−i x^T ws}] ∈ C^s,   S = [w1 . . . ws] ∼ p
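As a quick sanity check on the Monte Carlo approximation above, a small numpy sketch (bandwidth, dimension, and seeds are arbitrary assumptions) comparing the exact Gaussian kernel to 〈ΨS(x), ΨS(z)〉 as s grows.

import numpy as np

d, sigma = 10, 1.0
rng = np.random.default_rng(0)
x, z = rng.standard_normal(d), rng.standard_normal(d)

k_exact = np.exp(-np.sum((x - z)**2) / (2 * sigma**2))     # Gaussian kernel value

for s in [100, 1000, 10000]:
    S = rng.standard_normal((d, s)) / sigma                 # columns w_j ~ N(0, sigma^{-2} I)
    psi_x = np.exp(-1j * (x @ S)) / np.sqrt(s)              # Psi_S(x) in C^s
    psi_z = np.exp(-1j * (z @ S)) / np.sqrt(s)
    k_approx = np.real(np.vdot(psi_z, psi_x))               # <Psi_S(x), Psi_S(z)>, error ~ O(s^{-1/2})
    print(s, abs(k_exact - k_approx))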
DNNs vs Kernel Methods on TIMIT (Speech)
[Figure: classification error (%) vs. number of random features (s)/10000 on TIMIT (n = 2M, d = 440, k = 147); curves: DNN (440-4k-4k-147), Random Fourier, Exact Kernel (n = 100k, 75 GB).]
% Training
G = randn(size(X,2), s);
Z = exp(i*X*G);
alpha = (lambda*eye(size(Z,2)) + Z'*Z)\(Z'*y(:));
% Testing
ztest = exp(i*xtest*G)*alpha;
Learning in High-dimensional Random Feature Spaces
Kernel Methods Match Deep Neural Networks on TIMIT, ICASSP 2014
[Figure: classification error (%) vs. number of random features (s)/10000 on TIMIT (n = 2M, d = 440, k = 147), with s up to 400K; curves: DNN (440-4k-4k-147), Random Fourier, Exact Kernel (n = 100k, 75 GB). Phone error rate: 21.3% < 22.3% (DNN).]
• Distributed solvers (Z ∈ R^{2M×400K} ≈ 6 terabytes)
• More effective feature maps?
Distributed Learning with ADMM

• Alternating Direction Method of Multipliers (1950s; Boyd et al, 2013)

  argmin_{x ∈ R^n, z ∈ R^m} f(x) + g(z)   subject to   Ax + Bz = c

• Several variations: row-splitting if examples are split across processors.

  argmin_{x ∈ R^d} Σ_{i=1}^R fi(x) + g(x)   ⇒   min Σ_{i=1}^R fi(xi) + g(z)  s.t.  xi = z        (1)

  xi^{k+1} = argmin_x fi(x) + (ρ/2) ‖x − z^k + νi^k‖₂²                                           (2)
  z^{k+1} = prox_{g/(Rρ)} [ x̄^{k+1} + ν̄^k ]   (averages over i)                                  (3)
  νi^{k+1} = νi^k + xi^{k+1} − z^{k+1}                                                           (4)

  where prox_f[x] = argmin_y (1/2) ‖x − y‖₂² + f(y)

• Note: extra consensus and dual variables need to be managed.
• Closed-form updates, extensibility, code reuse, parallelism.
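A minimal single-process numpy sketch of the row-splitting updates (2)-(4) for squared loss with a ridge penalty g(z) = (λ/2)‖z‖²; the data partitioning, ρ, λ, and iteration count are illustrative assumptions, and in a real implementation each xi-update would run on its own MPI process.

import numpy as np

def consensus_admm_ridge(Xs, ys, lam=1e-2, rho=1.0, iters=100):
    """Row-splitting ADMM for sum_i 0.5*||X_i x - y_i||^2 + (lam/2)*||z||^2 with x_i = z."""
    R, d = len(Xs), Xs[0].shape[1]
    z = np.zeros(d)
    xs, nus = np.zeros((R, d)), np.zeros((R, d))
    # Cache the per-block matrices used by the closed-form x-update (2)
    invs = [np.linalg.inv(X.T @ X + rho * np.eye(d)) for X in Xs]
    for _ in range(iters):
        for i in range(R):                                   # each i would live on its own MPI process
            xs[i] = invs[i] @ (Xs[i].T @ ys[i] + rho * (z - nus[i]))   # update (2)
        v = (xs + nus).mean(axis=0)
        z = (R * rho / (R * rho + lam)) * v                  # update (3): prox of (lam/2)||.||^2 scaled by 1/(R*rho)
        nus += xs - z                                        # update (4)
    return z

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((200, 5)) for _ in range(4)]
w_true = rng.standard_normal(5)
ys = [X @ w_true + 0.01 * rng.standard_normal(200) for X in Xs]
print(consensus_admm_ridge(Xs, ys))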
ADMM Block Splitting and Hybrid Parallelism
• Implicit optimization problems:

  argmin_{W ∈ R^{s×m}} Σ_{i=1}^n V(yi, W^T z(xi)) + λ r(W)

• Distributed memory (MPI) across R nodes, T cores each (OpenMP)
• Implicit R × C block partitioning of Z: MPI process i holds (Yi, Xi), and its T threads generate the row block [Zi1 Zi2 . . . ZiC], so Z is partitioned into blocks Zij, i = 1 . . . R, j = 1 . . . C.
• ADMM coordinates local models on Zij = T[Xi, j], built on-the-fly.
ADMM Block Splitting and Hybrid Parallelism

• ADMM block-splitting formulation (follows Parikh and Boyd, 2013):

  min  Σ_{i=1}^R li(Oi) + λ Σ_{j=1}^C rj(Wj) + Σ_{ij} I[Oij = Zij Wij]
  s.t. Wj = Wij,   Oi = Σ_j Oij      (local-global consistency)

  – Variables live at three levels: MPI-process local, thread local, and master process.

• Key steps: prox_f, prox_r (parallel; closed-form) and Graph Projection:

  proj_{Zij}[(Y, X)] = argmin_{V = Zij U} (1/2) ‖V − Y‖²_fro + (1/2) ‖U − X‖²_fro
  ⇒  U = [Zij^T Zij + λI]⁻¹ (X + Zij^T Y)  (the bracketed factor is cached),   V = Zij U      (5)
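A minimal numpy/scipy sketch of the graph-projection step (5), with the factorization cached across ADMM iterations; the class name, the Cholesky-based solve, and the use of λ exactly as written in (5) are illustrative assumptions.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

class GraphProjector:
    """Projects (Y, X) onto {(V, U) : V = Z U}, caching the factorization of Z^T Z + lam*I."""
    def __init__(self, Z, lam=1.0):
        self.Z = Z
        self.chol = cho_factor(Z.T @ Z + lam * np.eye(Z.shape[1]))   # computed once, reused every iteration

    def __call__(self, Y, X):
        U = cho_solve(self.chol, X + self.Z.T @ Y)    # U = (Z^T Z + lam*I)^{-1} (X + Z^T Y)
        return self.Z @ U, U                          # V = Z U

# Usage: project a random (Y, X) pair for one block Z_ij
Z = np.random.randn(500, 40)
proj = GraphProjector(Z, lam=1.0)
V, U = proj(np.random.randn(500, 3), np.random.randn(40, 3))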
Making the implementation practical
• As C increases: cache memory per block ↓, per-block computation ↓, shared-memory parallelism ↑.
• However, the local variables Oij grow as ni·C·m (e.g., 335K examples with C = 64 and m = 100 exhausts 16 GB). Fortunately, this can be avoided using incremental updates, shared-memory access, and the structure of the graph projection (update rules follow Parikh and Boyd, 2013).
• If C = s/d, the per-iteration complexity is linear in n and s.
Scalability and Effect of Splitting
[Figure, left: MNIST strong scaling on the Triloka cluster: speedup vs. number of MPI processes (t = 6 threads/process) against the ideal line, with classification accuracy (%) overlaid. Right: TIMIT column splitting: accuracy vs. time (secs) for splitting values 50, 100, 200, 400, 800, 1000.]
Table 1: Comparison on TIMIT-binary (n = 100k, d = 440, s = 100k)

                   Libsvm    PSVM (p = n^0.5)   BlockADMM
  Training Time    80355     47                 42
  Testing Time     1295      259                2.9
  Accuracy         85.41%    73.1%              83.47%
Revisit Efficiency of Approximate Integration
  k(x, z) = ∫_{R^d} e^{−i(x−z)^T w} p(w) dw  ≈  (1/s) Σ_{j=1}^s e^{−i(x−z)^T wj}

• Consider the error in approximating an integral on the unit cube:

  εS[f] = | ∫_{[0,1]^d} f(x) dx − (1/s) Σ_{w ∈ S} f(w) |

• The Monte Carlo approach draws S from U([0,1]^d), with convergence rate

  (ES[εS[f]²])^{1/2} = σ[f] s^{−1/2},   where σ[f]² = Var_{X ∼ U([0,1]^d)}[f(X)]

  O(s^{−1/2}) ⇒ a 4-fold increase in s only cuts the error in half.

Can we do better with a different S?
Low-discrepancy Quasi-Monte Carlo Pointsets
[Figure: s points in [0, 1]²; left: Uniform (MC) points, right: Halton (QMC) points.]
• Deterministic, correlated QMC points avoid the clustering and clumping effects seen in MC point sets.
• Hierarchical structure: sample the integrand from coarse to fine as s increases.
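To see the rate difference in practice, a small sketch (using scipy's Halton generator; the product integrand and sizes are arbitrary assumptions) comparing MC and Halton points on a unit-cube integral with known value 2^{-d}.

import numpy as np
from scipy.stats import qmc

d = 5
exact = 0.5 ** d                                  # integral of prod_j x_j over [0,1]^5
f = lambda W: np.prod(W, axis=1)

rng = np.random.default_rng(0)
for s in [128, 1024, 8192]:
    mc_err = abs(f(rng.random((s, d))).mean() - exact)                    # i.i.d. uniform (MC) points
    h_err = abs(f(qmc.Halton(d=d, seed=0).random(s)).mean() - exact)      # low-discrepancy Halton points
    print(f"s={s:5d}  MC error={mc_err:.2e}  Halton error={h_err:.2e}")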
Star Discrepancy

Integration error depends on the variation of f and the uniformity of S.

Theorem 2 (Koksma-Hlawka inequality, 1941/1961).

  εS[f] ≤ D⋆(S) V_HK[f],   where   D⋆(S) = sup_{x ∈ [0,1]^d} | vol(Jx) − |{i : wi ∈ Jx}| / s |
Quasi-Monte Carlo Sequences
• Low-discrepancy point sets achieve D⋆({w1, . . . , ws}) = O((log s)^d / s), conjectured to be optimal.
  – Halton, Sobol’, Faure, Niederreiter sequences . . . we will treat these as black boxes.
  – Implementations available, e.g. Matlab haltonset, sobolset
  – Usually very cheap to generate.
• For fixed d, asymptotically, the QMC rate (log s)^d / s beats the MC rate s^{−1/2}.
  – Note: dimension dependence.
  – Empirically, QMC is better even for very high-dimensional integration.
  – Modern analysis: worst-case analysis for a nice space of integrands (RKHS), or average-case analysis assuming a distribution over integrands.
QMC Fourier Features: Algorithm
• Transform variables:

  ∫_{R^d} e^{−i(x−z)^T w} p(w) dw = ∫_{[0,1]^d} e^{−i(x−z)^T Φ⁻¹(t)} dt

• QMC feature maps for shift-invariant kernels, [YSAM] 2014:
  1. Given k, compute the density p (= ∏_{j=1}^d pj) via the inverse Fourier transform.
  2. Generate a low-discrepancy sequence t1 . . . ts in [0,1]^d.
  3. Transform: wi = (Φ1⁻¹(ti1), . . . , Φd⁻¹(tid)) and set S = [w1 . . . ws].
  4. Compute Z′ = XS.
  5. Compute Zij = (1/√s) e^{−i Z′ij}.
  6. Run a linear method on Z.
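A minimal sketch of steps 2-6 for the Gaussian kernel, where Φ⁻¹ is the inverse CDF of N(0, σ⁻²) applied coordinate-wise; the Halton generator, the bandwidth, and the ridge solve in step 6 are assumptions for illustration, not the library's implementation.

import numpy as np
from scipy.stats import norm, qmc

def qmc_fourier_features(X, s, sigma=1.0, seed=0):
    """QMC analogue of random Fourier features for the Gaussian kernel (steps 2-5)."""
    n, d = X.shape
    T = qmc.Halton(d=d, seed=seed).random(s)           # step 2: low-discrepancy points in [0,1]^d
    S = norm.ppf(T, loc=0.0, scale=1.0 / sigma).T      # step 3: w_i = Phi^{-1}(t_i); S is d x s
    Zp = X @ S                                         # step 4: Z' = X S
    return np.exp(-1j * Zp) / np.sqrt(s)               # step 5: Z_ij = e^{-i Z'_ij} / sqrt(s)

# Step 6: run a linear method on Z, e.g. regularized least squares
rng = np.random.default_rng(0)
X, y = rng.standard_normal((500, 10)), rng.standard_normal(500)
Z = qmc_fourier_features(X, s=2000)
lam = 1e-2
w = np.linalg.solve(Z.conj().T @ Z + lam * np.eye(Z.shape[1]), Z.conj().T @ y)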
How do standard QMC sequences perform?

[Figure: relative error on ‖K‖₂ vs. number of random features; left panel: USPST (n = 1506), right panel: CPU (n = 6554); curves: MC, Halton, Sobol’, Digital net, Lattice.]
• QMC methods consistently provide better Gram matrix approximations.
• Why are some QMC sequences better than others?
• Can we learn sequences even better adapted to our problem class?
Characterization via Box Discrepancy

• Define F□b = { f(w) = e^{−i(x−z)^T w} : −b ≤ x − z ≤ b, x, z ∈ X }

Theorem 3 (Expected integration error w.r.t. p over f ∈ F□b).

  E_{f ∼ U(F□b)} [ εS,p[f]² ]  ∝  D□p(S)²                                              (6)

• Below, h(u, v) = sinc_b(u, v) = π^{−d} ∏_{j=1}^d sin(bj (uj − vj)) / (uj − vj):

  D□p(S)² = ∫_{R^d} ∫_{R^d} h(ω, φ) p(ω) p(φ) dω dφ
            − (2/s) Σ_{l=1}^s ∫_{R^d} h(wl, ω) p(ω) dω      [alignment with p (wl ≈ ω)]
            + (1/s²) Σ_{l=1}^s Σ_{j=1}^s h(wl, wj)           [non-uniformity of S]      (7)

  The integrals above can be computed in closed form for the Gaussian density.
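The Gaussian closed forms are not reproduced here; instead, a rough numpy sketch that estimates the three terms of (7) by Monte Carlo sampling from p = N(0, σ⁻²I); the box parameter b, sample sizes, and function names are assumptions for illustration.

import numpy as np

def h_box(U, V, b):
    """h(u, v) = pi^{-d} * prod_j sin(b_j (u_j - v_j)) / (u_j - v_j) for all row pairs of U and V."""
    D = U[:, None, :] - V[None, :, :]
    small = np.abs(D) < 1e-12
    Ds = np.where(small, 1.0, D)                          # avoid 0/0 on the diagonal
    vals = np.where(small, b, np.sin(b * Ds) / Ds)        # sin(b*t)/t -> b as t -> 0
    return vals.prod(axis=-1) / np.pi ** U.shape[1]

def box_discrepancy_mc(S, b=1.0, sigma=1.0, n_mc=500, seed=0):
    """Monte Carlo estimate of D_box,p(S)^2 in (7); S is s x d, p = N(0, sigma^{-2} I)."""
    rng = np.random.default_rng(seed)
    s, d = S.shape
    omega = rng.normal(scale=1.0 / sigma, size=(n_mc, d))
    phi = rng.normal(scale=1.0 / sigma, size=(n_mc, d))
    term1 = h_box(omega, phi, b).mean()                   # int int h(w, v) p(w) p(v) dw dv
    term2 = h_box(S, omega, b).mean(axis=1).sum()         # sum_l int h(w_l, w) p(w) dw
    term3 = h_box(S, S, b).sum()                          # sum_{l,j} h(w_l, w_j)
    return term1 - (2.0 / s) * term2 + term3 / s ** 2

# e.g. score a set of MC frequencies; QMC-transformed frequencies can be compared at equal s
S_mc = np.random.default_rng(1).normal(size=(100, 21))
print(box_discrepancy_mc(S_mc, b=1.0))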
Does Box Discrepancy explain performance differences?

[Figure: D□(S)² vs. number of samples on the CPU dataset (d = 21); curves: Digital net, MC (expected), Halton, Sobol’, Lattice.]
Learning Adaptive QMC Sequences
Unlike the star discrepancy, the box discrepancy admits numerical optimization:

  S∗ = argmin_{S = (w1 . . . ws) ∈ R^{d×s}} D□(S)                                       (8)

[Figure: adaptive QMC on the CPU dataset (s = 100); normalized D□(S)², maximum squared error, mean squared error, and ‖K̃ − K‖₂/‖K‖₂ plotted against the optimization iteration.]
libskylark: sketching-based matrix computations for ML
http://xdata-skylark.github.io/libskylark/
• ML layer [Python]: randomized kernel methods (Support Vector Machines, Multinomial Logistic Regression, Robust Regression, Regularized Least Squares) with Gaussian, Laplacian, Polynomial, and Semigroup kernels; Matrix Completion, Multitask Learning, PCA and CCA.
• NLA layer: faster least-squares solvers (sketch-and-solve; sketch-based preconditioning as in Blendenpik and LSRN), low-rank approximations, randomized SVD; distributed optimization via block-splitting ADMM.
• High-performance sketching layer: oblivious subspace embeddings, applied column-wise (SA) or row-wise (AS’); transforms include JLT, FJLT, CT, FCT, RFT, CWT, WZT, MMT.
• Built on Elemental/CombBLAS, with MPI communication.
• Input/output matrix types:
  – LocalMatrix: numpy.ndarray, scipy sparse matrices, elem::Matrix<double>
  – DistributedDense: elem::DistMatrix (1D, 2D)
  – DistributedSparse: SpParMat (CombBLAS/KDT)
  – Streaming, for out-of-core problems
Flavor
A = rand(m, n);
b = rand(m, 1);
% Gaussian random matrix for JLT (t x m)
S = randn(t, m);
% Sketch A and b
SA = S*A;
Sb = S*b;
% Sketch and Solve
X = SA\Sb;
• Sparse-vs-Dense
• 1D/2D distributions
• Input-output combinations
• Row-wise or Column-wise
• SUMMA-based Distributed GEMM
• Comm-free random matrices
• Counter-based PRNGs (Random123)
import elem
from skylark import sketch, elemhelper
from mpi4py import MPI
import numpy as np
# Set up the random regression problem.
A = elem.DistMatrix_d_VR_STAR()
elem.Uniform(A, m, n)
b = elem.DistMatrix_d_VR_STAR()
elem.Uniform(b, m, 1)
# Create transform with output type = "LocalMatrix".
S = sketch.JLT(m, t, defouttype="LocalMatrix")
# Sketch A and b (note: uses a specialized distributed GEMM)
SA = S * A
Sb = S * b
# SA and Sb reside on rank zero, so solve there.
if (MPI.COMM_WORLD.Get_rank() == 0):
# Solve using NumPy
[x, res, rank, s] = np.linalg.lstsq(SA, Sb)
else:
x = None
Implementation References
• Sketching layer:
  – JLT: Johnson-Lindenstrauss Transform (Johnson and Lindenstrauss, 1984)
  – FJLT: Fast Johnson-Lindenstrauss Transform (Ailon and Chazelle, 2009)
  – CT: Cauchy Transform (Sohler and Woodruff, 2011)
  – MMT: Meng-Mahoney Transform (Meng and Mahoney, 2013)
  – CWT: Clarkson-Woodruff Transform (Clarkson and Woodruff, 2013)
  – WZT: Woodruff-Zhang Transform (Woodruff and Zhang, 2013)
  – PPT: Pham-Pagh Transform (Pham and Pagh, 2013)
  – ESRLT: Random Laplace Transform (Yang et al., 2014)
  – LRFT: Laplacian Random Fourier Transform (Rahimi and Recht, 2007)
  – GRFT: Gaussian Random Fourier Transform (Rahimi and Recht, 2007)
  – FGRFT: Fast Gaussian Random Fourier Transform (Le, Sarlos and Smola, 2013)
• Avron, H., Maymounkov, P., and Toledo, S., Supercharging LAPACK's Least Squares Solver, 2010
• Meng, X., Saunders, M. A., and Mahoney, M. W., LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems, 2012
• Halko, N., Martinsson, P. G., and Tropp, J., Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 53(2), pp. 217-288, 2011
• Parikh, N. and Boyd, S., Block splitting for distributed optimization, Math. Prog. Comp., October 2013
• Sindhwani, V. and Avron, H., High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, 2014
Conclusion
• High-performance implementations of randomized algorithms and distributed optimization, with emphasis on scaling up non-parametric models.
• Scalable kernel methods may be a promising alternative to Deep Neural Networks.
  – Incorporating prior knowledge (e.g., invariances) and new forms of kernel learning
• libskylark: http://xdata-skylark.github.io/libskylark/
Thank you.
The Machine Learning Group at IBM Research, NY is hiring Research Staff Members and Postdocs!