Mathematical Methods for Data Analysis
Massimiliano Pontil
Istituto Italiano di Tecnologia and
Department of Computer Science, University College London
Learning from data
Let µ be a probability measure on a set Z
µ is unknown, but we can sample from it
z1, ..., zm ∼ µ
Goal: learn “properties” of µ from the data:
✓ Density estimation
✓ Study “low dimensional” representations of the data
✓ Supervised learning (prediction): Z = X × Y
Supervised learning
Z = X × Y , given data (x1, y1), ..., (xn, yn) ∼ µ, find
\[
\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2}_{\text{empirical error}}
\]
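As a toy illustration (not from the slides), here is a minimal sketch of this empirical risk minimization problem with F taken to be linear functions, one simple choice; all data and names are made up:

```python
import numpy as np

# Minimal ERM sketch for the squared loss with F = linear functions
# f(x) = w . x (an illustrative choice of function class).
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                  # inputs x_1, ..., x_n
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy targets y_i

# argmin_w (1/n) sum_i (y_i - w . x_i)^2 is ordinary least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
empirical_error = np.mean((y - X @ w_hat) ** 2)
print(empirical_error)
```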
Three key problems:
Function representation/approximation: which F ?
Typically F = {f : Ω(f) ≤ α}, with Ω e.g. a norm in a function space
Numerical optimization: iterative schemes to find f̂ (gradient descent, proximal-gradient methods, stochastic optimization)
Statistical analysis: derive a high probability bound
\[
\mathbb{E}\big(y - \hat f(x)\big)^2 \le \min_{f \in \mathcal{F}} \mathbb{E}\big(y - f(x)\big)^2 + \varepsilon(n, \delta, \mathcal{F})
\]
Regularization
Difficulty: high dimensional data / complex tasks
Increasing need for methods which can impose sophisticated forms of prior knowledge
General approach in machine learning and statistics:
\[
\operatorname*{minimize}_{f} \; \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \underbrace{\Omega(f)}_{\text{regularizer}}
\]
Three predominant assumptions:
smoothness: Ω is the norm in a RKHS
sparsity: non-differentiable penalties (e.g. ℓ₁ norm)
shared representations: needs multiple “tasks”
Regularization in reproducing kernel Hilbert spaces
[Aronszajn, 1950; Wahba, 1990; Cucker & Smale, 2002; Schölkopf & Smola, 2002; ...]
Choose a feature map φ : X → ℓ₂ and solve:
\[
\operatorname*{minimize}_{w \in \mathcal{H}} \; \sum_{i=1}^{n}\big(\langle w, \phi(x_i)\rangle - y_i\big)^2 + \lambda \|w\|_2^2
\]
Regularizer favors smooth functions, e.g. small Sobolev norms
Define the kernel function K(x, x′) = ⟨φ(x), φ(x′)⟩
e.g. the Gaussian kernel: K(x, x′) = exp(−β‖x − x′‖²)
Solution has the form
\[
f(x) = \sum_{i=1}^{n} c_i K(x_i, x)
\]
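A minimal sketch of this method (kernel ridge regression with the Gaussian kernel); the data, β and λ below are made-up illustrative choices:

```python
import numpy as np

# Kernel ridge regression sketch: by the representer theorem the solution
# is f(x) = sum_i c_i K(x_i, x) with c solving (K + lam I) c = y.
def gaussian_kernel(A, B, beta=1.0):
    # K[i, j] = exp(-beta * ||A[i] - B[j]||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
K = gaussian_kernel(X, X)
c = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
f_test = gaussian_kernel(X_test, X) @ c      # predictions at new points
print(f_test)
```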
Linear regression and sparsity
[Bickel, Ritov, Tsybakov, 2009; Bühlmann & van de Geer, 2011; Candès & Tao, 2006]
Consider the model y = Xw∗ + ξ
y ∈ R^n is a vector of observations
X is a prescribed n × d data matrix
ξ ∈ R^n is a noise vector (e.g. i.i.d. Gaussian)
w∗ ∈ R^d is assumed to be sparse
Goal:
estimate w∗ (or its sparsity pattern or its prediction error) from y
efficient computational schemes for:
\[
\operatorname*{minimize}_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2 + \lambda\, \Omega(w)
\]
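One such computational scheme is proximal gradient descent (ISTA) for the ℓ₁-regularized case; the sketch below is illustrative, with made-up data, λ and step size:

```python
import numpy as np

# ISTA for min_w ||X w - y||^2 + lam * ||w||_1: a gradient step on the
# data term followed by the soft-thresholding prox of the l1 penalty.
rng = np.random.default_rng(0)
n, d, k = 50, 100, 5
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:k] = rng.normal(size=k)              # sparse ground truth
y = X @ w_star + 0.1 * rng.normal(size=n)

lam = 1.0
step = 0.5 / np.linalg.norm(X, 2) ** 2       # 1/L, L = 2 ||X||_2^2
w = np.zeros(d)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y)
    z = w - step * grad
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox step

print(np.count_nonzero(np.abs(w) > 1e-6))    # recovered sparsity level
```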
Regularizers for structured sparsity
[Maurer & P., 2012; Micchelli, Morales, P., 2013; McDonald, P., Stamos, 2015]
Exploit additional knowledge on sparsity pattern of w∗:
\[
\Omega(w) = \sqrt{\inf_{\theta \in \Theta} \sum_{i=1}^{d} \frac{w_i^2}{\theta_i}}
\]
Constraint set Θ ⊆ R^d_{++}, convex and bounded
Example: Θ = {θ > 0 : θ_1 + ··· + θ_d ≤ 1} yields the ℓ₁ norm
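A quick numerical check of this example (illustrative): the infimum over the simplex is attained at θ_i = |w_i| / ‖w‖₁, so Ω(w) equals the ℓ₁ norm:

```python
import numpy as np

# Verify that sqrt( inf_theta sum_i w_i^2 / theta_i ) over the simplex
# recovers ||w||_1, using the closed-form optimal theta_i = |w_i|/||w||_1.
w = np.array([0.5, -2.0, 0.0, 1.5])
theta = np.abs(w) / np.abs(w).sum()
mask = theta > 0                          # zero coordinates contribute 0 in the limit
omega = np.sqrt(np.sum(w[mask] ** 2 / theta[mask]))
print(omega, np.abs(w).sum())             # both equal ||w||_1 = 4.0
```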
Focus on:
efficient optimization methods (e.g. proximal gradient methods)
statistical estimation bounds (e.g. using Rademacher averages)
ongoing applications in neuroimaging
Multi-task learning
\[
\min_{w_1,\ldots,w_T} \; \frac{1}{T}\sum_{t=1}^{T} \underbrace{\|X_t w_t - y_t\|^2}_{\text{error task } t} \;+\; \lambda\, \underbrace{\Omega(w_1, \ldots, w_T)}_{\text{joint regularizer}}
\]
X_t : n × d data matrix
Typical scenario: many tasks but only few examples per task: n ≪ d
If the tasks are related, learning them jointly should perform better than learning each task independently
Several applications: computer vision, neuroimaging, NLP, robotics, user modeling, etc.
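To make the objective concrete, here is a toy sketch with one simple choice of joint regularizer, a quadratic "variance" penalty Ω(w₁, ..., w_T) = Σ_t ‖w_t − w̄‖²; all sizes and constants are made up:

```python
import numpy as np

# Multi-task least squares with a variance regularizer that pulls the
# task vectors toward their mean, solved by plain gradient descent.
rng = np.random.default_rng(0)
T, n, d = 10, 20, 50
w_common = rng.normal(size=d)
Xs = [rng.normal(size=(n, d)) for _ in range(T)]
ys = [X @ (w_common + 0.1 * rng.normal(size=d)) for X in Xs]  # related tasks

lam, lr = 1.0, 1e-3
W = np.zeros((T, d))                           # row t holds w_t
for _ in range(2000):
    grad = np.stack([(2.0 / T) * Xs[t].T @ (Xs[t] @ W[t] - ys[t])
                     for t in range(T)])
    grad += 2 * lam * (W - W.mean(axis=0))     # d/dW of sum_t ||w_t - w_bar||^2
    W -= lr * grad
```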
Multitask regularizers
Quadratic: encourage similarities between tasks (e.g. small variance); can be made more general using RKHSs of vector-valued functions
[Caponnetto et al., 2008; Carmeli, De Vito, Toigo, 2006]
Row sparsity: few common variables (provably better than the Lasso [Lounici, P., Tsybakov, van de Geer, 2011])
Spectral: few common linear features (low rank matrix) [Srebro & Shraibman, 2005; Argyriou, Evgeniou, P., 2006; Maurer and P., 2013]
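The three regularizer families above are easy to state as functions of the stacked task matrix; the sketch below gives illustrative implementations (shapes and data are made up):

```python
import numpy as np

# Three multitask regularizers on W = [w_1, ..., w_T] (columns = tasks).
def quadratic_variance(W):
    # sum_t ||w_t - w_bar||^2 : encourages similar task vectors
    return np.sum((W - W.mean(axis=1, keepdims=True)) ** 2)

def row_sparsity(W):
    # l_{2,1} norm: sum over variables of the l2 norm across tasks
    # (few active rows => few common variables)
    return np.sum(np.linalg.norm(W, axis=1))

def spectral(W):
    # trace (nuclear) norm: sum of singular values, the standard convex
    # surrogate for low rank (few shared linear features)
    return np.sum(np.linalg.svd(W, compute_uv=False))

W = np.random.default_rng(0).normal(size=(50, 10))  # d=50 variables, T=10 tasks
print(quadratic_variance(W), row_sparsity(W), spectral(W))
```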
Matrix completion
Learn a matrix from a subset of its entries (possibly noisy); see e.g. [Srebro, 2004; Candès & Tao, 2008]
Special case of the above when the rows of X_t are elements of the standard basis e₁, ..., e_d
\[
\min_{W} \sum_{(i,t) \in S} \big(Y_{i,t} - W_{i,t}\big)^2 + \lambda\, \Omega(W)
\]
Ongoing project on online (binary) matrix completion
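A minimal sketch of this problem with Ω the trace norm, solved by proximal gradient (the prox of λ‖·‖_* is singular-value soft-thresholding); data, λ and iteration count are illustrative:

```python
import numpy as np

# Trace-norm matrix completion: gradient step on the observed entries,
# then soft-threshold the singular values.
rng = np.random.default_rng(0)
d, T, r = 30, 30, 3
Y_full = rng.normal(size=(d, r)) @ rng.normal(size=(r, T))  # low-rank truth
mask = rng.random((d, T)) < 0.5                             # observed set S

lam, step = 1.0, 0.5
W = np.zeros((d, T))
for _ in range(100):
    grad = 2 * mask * (W - Y_full)           # gradient of the data term on S
    Z = W - step * grad
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    W = U @ np.diag(np.maximum(s - step * lam, 0.0)) @ Vt   # prox step

print(np.linalg.matrix_rank(W, tol=1e-6))    # recovered rank is low
```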
Lifelong learning
Human intelligence relies on transferring knowledge learned from previous tasks to learn new tasks
Online approach: see one task at a time, train on past tasks, test on the next task
Interactive learning, e.g. active learning: choose which entries to sample, choose which tasks to learn next
Nonlinear extension: φ : X → ℓ₂ a prescribed mapping
\[
\operatorname*{minimize}_{w_1,\ldots,w_T \in \ell_2} \; \sum_{i=1}^{n}\sum_{t=1}^{T} \ell\big(y_{ti}, \langle w_t, \phi(x_{ti})\rangle\big) + \lambda \big\|[w_1, \ldots, w_T]\big\|
\]
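A toy sketch of the online protocol described above; tasks arrive one at a time and a shared low-dimensional representation, estimated here from past task vectors via an SVD (one simple choice, not the method of any particular paper), is reused for the next task:

```python
import numpy as np

# Lifelong learning toy: tasks share a hidden K-dimensional subspace;
# after enough tasks, regress new tasks within the learned subspace.
rng = np.random.default_rng(0)
d, K, n = 50, 3, 15
basis = np.linalg.qr(rng.normal(size=(d, K)))[0]   # hidden shared features

past_ws = []
for t in range(20):
    w_t = basis @ rng.normal(size=K)               # new task in the subspace
    X = rng.normal(size=(n, d))
    y = X @ w_t + 0.1 * rng.normal(size=n)
    if len(past_ws) >= K:
        # transfer: top-K left singular vectors of the past task matrix
        U = np.linalg.svd(np.stack(past_ws).T)[0][:, :K]
        a, *_ = np.linalg.lstsq(X @ U, y, rcond=None)
        w_hat = U @ a                              # few examples suffice here
    else:
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    past_ws.append(w_hat)
```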
Vector-valued learning
Choose a class of vector-valued functions:
\[
\mathcal{F} \circ \mathcal{G} = \big\{\, x \in \ell_2 \mapsto f(g(x)) \in \mathbb{R}^T : f \in \mathcal{F},\; g \in \mathcal{G} \,\big\},
\]
where g : H → R^K and f : R^K → R^T, found by the method
\[
\operatorname*{minimize}_{f \in \mathcal{F},\, g \in \mathcal{G}} \; \sum_{i=1}^{N} \ell\big(f \circ g(x_i), y_i\big) + \Omega(f, g)
\]
Includes neural networks with shared hidden layers (“deep nets”)
Loss function includes multitask and multi-category learning
Includes nuclear or factorization norms [Jameson, 1987]
Current focus on Rademacher complexity bounds:
\[
\frac{1}{N}\, \mathbb{E} \sup_{f, g} \sum_{i=1}^{N} \varepsilon_i\, \ell\big(f \circ g(x_i), y_i\big)
\]
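To make this quantity concrete, here is a crude Monte Carlo estimate of the empirical Rademacher complexity for a tiny illustrative class (linear g composed with a fixed nonlinearity f); the supremum is approximated by a random search, and everything here is a made-up toy:

```python
import numpy as np

# Monte Carlo estimate of (1/N) E sup_{f,g} sum_i eps_i * loss_i.
rng = np.random.default_rng(0)
N, d = 40, 5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

def loss_vector(w):
    pred = np.tanh(X @ w)            # f(g(x)) with g linear, f = tanh
    return (pred - y) ** 2           # squared loss per example

candidates = [rng.normal(size=d) for _ in range(200)]  # proxy for sup over g
vals = []
for _ in range(100):                 # average over Rademacher draws eps
    eps = rng.choice([-1.0, 1.0], size=N)
    vals.append(max(eps @ loss_vector(w) for w in candidates) / N)
print(np.mean(vals))                 # estimated Rademacher complexity
```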
Multilinear models
[Gandy et al., 2011; Kolda & Bader, 2009; ...]
General problem: Learning a tensor from a set of linear measurements
Examples:
Tensor completion
Video denoising/completion
3D scanning denoising/completion
Context-aware recommendation
Entity-relationship learning (NLP)
Multilinear multitask learning
Multilinear multitask learning
[Romera-Paredes et al., 2013]
Tasks are referenced by multiple indices
E.g. an index pair such as ( · , Food)
Problem modelling
Want to encourage low rank tensors
\[
\operatorname*{argmin}_{\mathcal{W}} \; E(\mathcal{W}) + \gamma \sum_{n=1}^{N} \operatorname{rank}\big(\mathcal{W}_{(n)}\big)
\]
W_(n) is the n-th matricization of the tensor, e.g.:
[Figure: the matricizations W_(1) and W_(3) of a 3-way tensor]
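A minimal sketch of matricization in code (the column ordering here is one standard convention and may differ from the exact definition in Kolda & Bader; shapes are made up):

```python
import numpy as np

# Mode-n matricization (unfolding): mode-n fibers become the columns of W_(n).
def matricize(W, n):
    # move mode n to the front, then flatten the remaining modes
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

W = np.arange(24).reshape(2, 3, 4)   # a small 3-way tensor
W1 = matricize(W, 0)                 # W_(1): shape (2, 12)
W3 = matricize(W, 2)                 # W_(3): shape (4, 6)
print(W1.shape, W3.shape)

# rank(W_(n)) in the objective above is the rank of these unfoldings
print(np.linalg.matrix_rank(W1), np.linalg.matrix_rank(W3))
```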
Research interests / PhD projects
Supervised learning: support vector machines and reproducing kernels
Study of regularizers for structured sparsity
Multitask and transfer learning: study assumptions on task relatedness (e.g. learning shared representations)
Online learning and mistake bounds - connection to lifelong learning
Statistical learning theory (e.g. study of Rademacher bounds) for competitive vector-valued function classes
Multilinear models: modelling low rank tensors and convex relaxations
Sparse coding / dictionary learning (not covered today, ask me if interested)
Transfer in reinforcement learning (not covered today, ask me if interested)
Plan
Focus on a specific project for the first 6 months
Converge to a PhD topic within 9 months
Can propose your own project
Interact with postdocs in the group and colleagues at DIMA/DIBRIS/IIT
Reading groups on specific topics
1 year abroad (at UCL or visiting other collaborators)
Collaborators (mostly ongoing)
Mark Herbster (UCL) online learning
Theodoros Evgeniou (INSEAD) user modelling
Cecilia Mascolo (Cambridge) user modelling
Nadia Bianchi-Berthouze (UCL) affective computing
Janaina Mourão-Miranda (UCL) ML in neuroimaging
Alexandre Tsybakov (ENSAE ParisTech) statistical estimation
Andreas Maurer (Munich) statistical learning theory
Sara van de Geer (ETH Zurich) sparse estimation
Patrick Combettes (Paris 6) numerical optimization
Raphael Hauser (Oxford) numerical optimization
Charles Micchelli (SUNY Albany) kernel methods, mathematics