Artificial Intelligence as Statistical Learning
▶ Before we talk about GNNs, we need to specify what we mean by learning and intelligence
  ⇒ Statistical Risk Minimization and Empirical Risk Minimization
Observations and Information
▶ An Artificial Intelligence (AI) extracts information from observations ⇒ Sorry to disappoint you
▶ E.g., this image is an observation. The intelligence tells you this is your professor, Alejandro

  Observations → [ Artificial Intelligence ] → Information

▶ Names vary across communities. My own professional bias is to talk of signals and signal processing
  ⇒ And to call Observations = Input(s) and Information = Output(s)
Statistical Models and Statistical Estimation
▶ Observations (inputs) $x$ and information (outputs) $y$ are related by a statistical model $p(x, y)$

  $x \in \mathbb{R}^n$ → [ $p(x, y)$ ] → $y \in \mathbb{R}^p$

▶ Given that the universe (nature) associates inputs $x$ and outputs $y$ according to the distribution $p(x, y)$
  ⇒ The AI should predict $y$ from $x$ with the conditional distribution ⇒ $y \sim p(y \mid x)$
  ⇒ Or, if we want a deterministic output, with the conditional expectation ⇒ $y = \mathbb{E}[\, y \mid x \,]$
▶ There is a lot to say about statistical estimation, but it is beyond the scope of this course
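As a concrete illustration (an addition, not from the original slides): when $x$ and $y$ are jointly Gaussian, this conditional expectation has a closed form and is linear in $x$:

```latex
% Jointly Gaussian (x, y) with means \mu_x, \mu_y and covariance blocks
% \Sigma_{xx}, \Sigma_{yx}: the optimal deterministic prediction is linear in x
\mathbb{E}\big[\, y \mid x \,\big]
    \;=\; \mu_y \;+\; \Sigma_{yx}\, \Sigma_{xx}^{-1} \big( x - \mu_x \big)
```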
Estimation Loss
▶ AI is not perfect. Nature and the AI may produce different outputs when presented with the same input

  Nature relates $x$ and $y$ with the distribution $p(x, y)$:  $x$ → [ $p(x, y)$ ] → $y$
  The AI relates $x$ and $\hat y$ with the function $\Phi(x)$:  $x \in \mathbb{R}^n$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$

▶ A loss function $\ell(y, \hat y) = \ell(y, \Phi(x))$ measures the cost of predicting $\hat y = \Phi(x)$ when the actual output is $y$
  ⇒ In estimation problems we often use the quadratic loss ⇒ $\ell(y, \hat y) = \| y - \hat y \|_2^2$
  ⇒ In classification problems we often use the hit loss ⇒ $\ell(y, \hat y) = \| y - \hat y \|_0 = \#(y \neq \hat y)$
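A minimal numpy sketch of these two losses (an illustration added here; the function names are hypothetical):

```python
import numpy as np

def quadratic_loss(y, y_hat):
    """Squared 2-norm loss, common in estimation problems."""
    return np.sum((y - y_hat) ** 2)

def hit_loss(y, y_hat):
    """Zero-norm (hit) loss: counts the entries where prediction and truth differ."""
    return np.count_nonzero(y != y_hat)

y     = np.array([1, -1, 1, -1])
y_hat = np.array([1,  1, 1, -1])
print(quadratic_loss(y, y_hat))  # 4: one entry misses by 2, squared
print(hit_loss(y, y_hat))        # 1: exactly one entry differs
```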
Statistical Risk Minimization
▶ Average the loss $\ell(y, \Phi(x))$ over nature's distribution $p(x, y)$ and choose the best estimator/classifier

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \Big[ \ell\big( y, \Phi(x) \big) \Big] $$

▶ Predict $\Phi(x)$. Nature draws $y$. Evaluate the loss $\ell$. Take the loss expectation over the distribution $p(x, y)$
  ⇒ The optimal estimator is the function with minimum average cost over all possible estimators
▶ This optimization program is called the statistical risk minimization (SRM) problem
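For the quadratic loss, this minimization can be solved pointwise, recovering the conditional expectation introduced earlier (a standard derivation added here for completeness):

```latex
% Minimize the conditional risk separately for each x; setting the gradient
% with respect to \hat{y} to zero gives  \hat{y} = E[ y | x ]
\Phi^*(x) \;=\; \operatorname*{argmin}_{\hat y} \;
    \mathbb{E}\big[\, \| y - \hat y \|_2^2 \;\big|\; x \,\big]
\;=\; \mathbb{E}\big[\, y \mid x \,\big]
```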
Training / Learning
▶ Learning, or training, is the process of solving the statistical risk minimization problem

  Learning / Training: compare the AI, $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$, against nature, $x$ → [ $p(x, y)$ ] → $y$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ The outcome of learning is the function $\Phi^*$ with minimum average statistical loss ⇒ We learn to estimate $y$
  ⇒ During execution time, we just evaluate $\Phi^*(x)$ to predict the output associated with input $x$
A Word on Models
▶ We have seen how to formulate learning as a mathematical program
▶ Our formulation requires access to a model (a probability distribution). This is easier said than done
We Need Access to a Model
▶ We have reduced learning to the solution of a statistical risk minimization (SRM) problem
  ⇒ Requires access to the distribution $p(x, y)$ ⇒ A model of how $x$ and $y$ are jointly generated

  Learning / Training: compare $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$ against $x$ → [ $p(x, y)$ ] → $y$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ This block diagram does not work without a model ⇒ Where is the model coming from?
Systems’ Modeling
▶ Maybe we know the laws that relate inputs and outputs ⇒ Indeed, we very often do
▶ Do not underestimate models ⇒ This is how we design the vast majority of marvels around you
Systems’ Identification
▶ Or, we acquire data pairs $(x_q, y_q) \sim p(x, y)$ to estimate the model ⇒ We learn the distribution
▶ Very powerful too ⇒ What we do not design with models, we design with systems identification
Machine Learning
▶ Bypass the learning of the distribution ⇒ Go straight to the learning of the estimation map $\Phi(x)$

  Learning / Training: compare $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$ against data samples $(x, y)$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ Very powerful ⇒ Recent impressive transformations in speech processing and computer vision
Empirical Risk Minimization
▶ Learning bypasses models. It tries to imitate observations. Let us formulate this mathematically
Artificial Intelligence (AI) / Machine Learning (ML)
▶ AI and ML in this course refer to the pipeline where we learn from data samples, not distributions

  Learning / Training: compare $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$ against data samples $(x, y)$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ The AI learns to imitate input-output pairs observed in nature
Approximating the Statistical Risk
▶ Statistical risk minimization works on the cost averaged over the distribution of inputs and outputs

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big] $$

▶ This expectation can be approximated with data
  ⇒ Acquire a training set $\mathcal{T}$ with $Q$ pairs $(x_q, y_q)$ drawn independently from the distribution $p(x, y)$
  ⇒ For sufficiently large $Q$ we can approximate

$$ \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big] \;\approx\; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

  ⇒ This is just the law of large numbers. True under very mild conditions
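A quick numerical check of this approximation, on an assumed linear-Gaussian model where the statistical risk is known in closed form (a sketch; all names, dimensions, and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, Q, sigma = 5, 10_000, 0.1

# Assumed model (for illustration only): y = A x + w with white Gaussian noise w.
A = rng.normal(size=(n, n))
X = rng.normal(size=(Q, n))
Y = X @ A.T + sigma * rng.normal(size=(Q, n))

# Candidate estimator Phi(x) = A x, evaluated with the quadratic loss.
empirical_risk = np.mean(np.sum((Y - X @ A.T) ** 2, axis=1))

# For this Phi the statistical risk is E||w||^2 = n * sigma^2 exactly.
print(empirical_risk, n * sigma**2)  # the two numbers agree for large Q
```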
Empirical Risk Minimization (ERM)
▶ Replace statistical risk minimization (SRM) with empirical risk minimization (ERM)

$$ \Phi^*_S \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big]
   \quad\Rightarrow\quad
   \Phi^*_E \;=\; \operatorname*{argmin}_{\Phi} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ Since the objectives are close, one would think the optima are close ⇒ $\Phi^*_S \approx \Phi^*_E$
  ⇒ Alas, this is not true ⇒ $\Phi^*_S \not\approx \Phi^*_E$ ⇒ Statistical and empirical risk minimizers need not be close
▶ In fact, the solution of ERM is trivial ⇒ Make $\Phi(x_q) = y_q$ for all pairs in the training set
▶ As trivial as nonsensical ⇒ It yields no information about observations outside the training set
ERM with Learning Parametrizations
▶ Our first attempt at learning from data led to an ERM problem that does not make sense
▶ The search for a problem that makes sense brings us to the notion of learning parametrizations
Learning Parametrization
▶ A sensible ERM problem requires the introduction of a function class $\mathcal{C}$

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ For example, we can select the class of linear functions $\Phi(x) = Hx$ and solve for

$$ H^* \;=\; \operatorname*{argmin}_{H} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, H x_q \big) $$

▶ This choice of parametrization may be good or bad. But at least it is sensible
  ⇒ Good or bad, having $H^*$ allows estimates $\hat y = H^* x$ for observations $x$ outside the training set
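For the quadratic loss, this linear ERM problem is a least-squares problem with a closed-form solution; a sketch with numpy on synthetic data (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, Q = 4, 3, 200

# Training set drawn from a linear model that is unknown to the learner.
A = rng.normal(size=(m, n))
X = rng.normal(size=(Q, n))                   # rows are inputs x_q
Y = X @ A.T + 0.05 * rng.normal(size=(Q, m))  # rows are outputs y_q

# ERM over linear functions Phi(x) = H x with quadratic loss is least squares:
# H* = argmin_H (1/Q) sum_q || y_q - H x_q ||^2
H_star = np.linalg.lstsq(X, Y, rcond=None)[0].T

# The learned H* produces estimates for inputs outside the training set.
x_new = rng.normal(size=n)
print(H_star @ x_new)
```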
Statistical and Empirical Risk Minimization with Learning Parametrizations
▶ Selecting $\mathcal{C}$ to contain sufficiently smooth functions makes SRM and ERM close

$$ \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big]
   \;\approx\;
   \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ Fundamental theorem of statistical learning ⇒ ERM is a valid approximation of SRM
▶ We need to identify the appropriate function class $\mathcal{C}$ ⇒ But this problem is unavoidable
Learning / Training (For Real)
▶ SRM learns from a model ⇒ Parametrized ERM learns from data ⇒ Three differences:
  ⇒ The distribution is unknown ⇒ We have access to a training set of data samples $(x_q, y_q) \in \mathcal{T}$
  ⇒ The nonparametric ERM problem is nonsensical ⇒ We restrict the function class to $\Phi \in \mathcal{C}$
  ⇒ The statistical risk ⇒ Is replaced by the empirical risk

  Learning / Training: compare $x_q$ → [ $\Phi(x_q) \in \mathcal{C}$ ] → $\hat y_q = \Phi(x_q)$ against samples $(x_q, y_q) \in \mathcal{T}$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi \in \mathcal{C}} \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big)$
The Meaning of ML or AI in this Course
▶ Here, Machine Learning (ML) ≡ Artificial Intelligence (AI) ≡ Empirical Risk Minimization (ERM)

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi \in \mathcal{C}} \sum_{(x,y) \in \mathcal{T}} \ell\big( y, \Phi(x) \big)
   \;=\; \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ The components of ERM are a dataset, a loss function and, most importantly, a function class
▶ To make the parametrization more explicit ⇒ Use a parameter $H \in \mathbb{R}^p$ to span the function class $\Phi(x; H)$

$$ H^* \;=\; \operatorname*{argmin}_{H} \sum_{(x,y) \in \mathcal{T}} \ell\big( y, \Phi(x; H) \big) $$

▶ Designing an ML / AI system means selecting the appropriate function class $\mathcal{C}$ ⇒ What else?
  ⇒ The function class determines generalization from inputs in the training set to unseen inputs
Stochastic Gradient Descent (SGD)
▶ SGD is the customary method for the minimization of the empirical risk
Training ≡ Minimization of Empirical Risk
▶ We have seen that the training of an estimator requires minimization of the empirical risk

$$ H^* \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^p} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q; H) \big)
   \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^p} \; L(H) $$

▶ This is minimization of the average loss function $L(H)$, defined as the average of the pointwise losses

$$ L(H) \;:=\; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q; H) \big) $$

▶ Its particular form notwithstanding, this is just a minimization ⇒ Use the gradient descent algorithm
Gradient Descent
▶ The gradient $g(H) = \nabla L(H)$ is perpendicular to the level sets of the loss $L(H)$
▶ Thus, negative gradients point towards the minimum. Not directly, but within an angle of less than $\pi/2$
  ⇒ $- g^T(H) \, (H - H^*) \;\geq\; 0$
▶ We can then use gradients in a gradient descent algorithm

$$ H_{t+1} \;=\; H_t \;-\; \varepsilon \, g(H_t) $$

▶ Converges to the optimum $H^*$ if the stepsize $\varepsilon$ is sufficiently small

  [Figure: level sets of $L(H)$ with negative gradients $-g(H)$ pointing towards the minimizer $H^*$]
Gradient Descent in Empirical Risk Minimization
▶ The gradient of the average loss function is the average of the gradients of the pointwise losses

$$ g(H) \;=\; \nabla L(H) \;=\; \frac{1}{Q} \sum_{q=1}^{Q} \nabla_H \, \ell\big( y_q, \Phi(x_q; H) \big) $$

▶ Equipped with gradients, we write the gradient descent method as the recursion

$$ H_{t+1} \;=\; H_t - \varepsilon \, g(H_t) \;=\; H_t - \frac{\varepsilon}{Q} \sum_{q=1}^{Q} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ This is all good, but those gradients are costly to compute ⇒ An average of $Q$ pointwise gradients
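A sketch of this recursion for the quadratic loss and the linear parametrization $\Phi(x; H) = Hx$, where the pointwise gradient is $-(y_q - H x_q)\,x_q^T$ (an illustration added here; function names and defaults are hypothetical):

```python
import numpy as np

def full_gradient(H, X, Y):
    """Average over all Q samples of the pointwise gradients of (1/2)||y_q - H x_q||^2."""
    residuals = Y - X @ H.T            # row q holds y_q - H x_q
    return -(residuals.T @ X) / len(X)

def gradient_descent(X, Y, eps=0.1, steps=500):
    """H_{t+1} = H_t - eps * g(H_t); every step touches the whole dataset."""
    H = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(steps):
        H = H - eps * full_gradient(H, X, Y)
    return H
```

Note that each step costs $Q$ pointwise gradient evaluations, which is exactly the expense the slide points out.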
Stochastic Gradient Descent (SGD)
▶ At iteration $t$, select a batch $\mathcal{T}_t$ of $Q_t \ll Q$ samples, chosen at random from the dataset $\mathcal{T}$
▶ Define the stochastic gradient as the average over the batch

$$ \hat g(H_t) \;=\; \frac{1}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ SGD ≡ Replace gradients with stochastic gradients in gradient descent

$$ H_{t+1} \;=\; H_t - \varepsilon \, \hat g(H_t) \;=\; H_t - \frac{\varepsilon}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ This is cheaper to implement because the average runs over a smaller number of pointwise gradients
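The same recursion with minibatches; only the gradient computation changes (a self-contained sketch for the linear least-squares case, with illustrative parameter choices):

```python
import numpy as np

def sgd(X, Y, eps=0.1, batch_size=32, steps=2000, seed=0):
    """Minibatch SGD for the linear least-squares ERM problem."""
    rng = np.random.default_rng(seed)
    Q = len(X)
    H = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(steps):
        # Batch T_t: Q_t << Q samples chosen at random from the dataset T.
        idx = rng.choice(Q, size=batch_size, replace=False)
        residuals = Y[idx] - X[idx] @ H.T
        g_hat = -(residuals.T @ X[idx]) / batch_size  # stochastic gradient
        H = H - eps * g_hat                           # H_{t+1} = H_t - eps * g_hat(H_t)
    return H
```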
Stochastic Gradients are Unbiased Estimators of Gradients
▶ If samples are chosen independently and with equal probability ⇒ $\mathbb{E}\big[ \hat g(H_t) \big] = g(H_t)$
▶ Stochastic gradients point in the right direction on average
▶ We move towards the optimum more often than not
▶ The expected angle is acute ⇒ $\mathbb{E}\big[ - \hat g^T(H) \, (H - H^*) \big] \;\geq\; 0$
▶ We can build a submartingale and prove convergence

  [Figure: level sets of $L(H)$ with stochastic gradients $-\hat g(H_t)$ pointing towards $H^*$ on average]
Stochastic Gradient Descent Memorabilia
▶ I covered SGD briefly because there are some things I wanted you to know
SGD Memorabilia
▶ GD converges because negative gradients point towards the optimum ⇒ $- g^T(H) \, (H - H^*) \geq 0$
▶ SGD converges because stochastic gradients do so in expectation ⇒ $\mathbb{E}\big[ - \hat g^T(H) \, (H - H^*) \big] \geq 0$
▶ Computing stochastic gradients is much cheaper ($Q_t \ll Q$) than computing gradients

$$ \hat g(H_t) = \frac{1}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big)
   \quad \text{vs} \quad
   g(H_t) = \frac{1}{Q} \sum_{(x_q, y_q) \in \mathcal{T}} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ This cost difference is the difference between a method that works and one that does not
Convergence of Stochastic Gradient Descent
▶ Convergence means that as the iteration index $t$ grows ⇒ $\liminf_{t \to \infty} \big\| H_t - H^* \big\|^2 \leq O\big( \varepsilon / \sqrt{Q_t} \big)$
  ⇒ We do not converge exactly ⇒ We approach the optimum and hover around it
  ⇒ The size of the hover region is proportional to the stepsize
  ⇒ The size of the hover region is inversely proportional to the square root of the batch size
▶ For large batch size $Q_t$ we have $\hat g(H_t) \approx g(H_t)$ ⇒ Not needed ⇒ Mistakes are corrected in subsequent steps
When Objectives are not Convex
▶ The plots illustrate, the comments apply, and the results hold for convex functions
  ⇒ Not always (rarely!) true ⇒ Notably, not true for neural networks, convolutional or not
▶ Gradients may move the iterates towards
  ⇒ The global minimum $H^*$
  ⇒ A local minimum $H^\dagger$
  depending on the initial condition

  [Figure: a nonconvex loss where negative gradients $-g(H)$ lead to either the global minimum $H^*$ or a local minimum $H^\dagger$]

▶ We may converge to local minima but we won't care ⇒ We implicitly assume that $H^\dagger$ is optimal
Stochastic Gradient Descent is Finicky
▶ Stochastic gradient descent is not a great algorithm. It is just the one we have
▶ Convergence speed, and convergence itself, are very sensitive to the choice of parameters
▶ It requires trying different stepsizes and different batch sizes. Maybe different initial conditions
  ⇒ Small changes in any of these parameters may have large effects on convergence
The Importance of Learning Parametrizations
▶ AI reduces to ERM. And in ERM all we have to do is choose a parametrization
▶ Not an easy choice ⇒ The parametrization controls generalization. Make or break
▶ The parametrization is a model of how outputs are related to inputs ⇒ It has to be accurate
Data Generation Models
▶ To illustrate the effect of learning parametrizations, we generate synthetic data following models we specify
▶ A linear model with inputs $x \in \mathbb{R}^n$ and outputs $y \in \mathbb{R}^m$ related by ⇒ $y = Ax + w$
  ⇒ For some matrix $A \in \mathbb{R}^{m \times n}$. The noise $w \in \mathbb{R}^m$ is white Gaussian, independent of $x$, with mean $\mathbb{E}(w) = 0$
▶ A nonlinear model postprocessing the linear model with a sign function ⇒ $y = \operatorname{sign}(Ax + w)$
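A sketch of the two generators (the scaling of $A$ and the noise level are illustrative choices, not specified in the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = m = 100   # input/output dimensions (10^2, as in the upcoming experiments)
Q = 1000      # number of training samples (10^3)

A = rng.normal(size=(m, n)) / np.sqrt(n)  # fixed matrix defining the model
X = rng.normal(size=(Q, n))               # inputs x_q, one per row
W = 0.1 * rng.normal(size=(Q, m))         # white Gaussian noise with E(w) = 0

Y_linear = X @ A.T + W           # linear model:    y = A x + w
Y_sign   = np.sign(X @ A.T + W)  # nonlinear model: y = sign(A x + w)
```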
Statistical Risk Minimization
▶ Given that we know the models, we can compute the statistical risk minimizer (SRM) ⇒ "The AI"
▶ For instance, if we use the squared 2-norm loss to penalize AI estimation errors

$$ \Phi^*_S \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \Big[ \frac{1}{2} \big\| y - \Phi(x) \big\|_2^2 \Big] $$

▶ Using the given model $y = Ax + w$ and taking derivatives, the AI is ⇒ $\Phi^*_S(x) = Ax$
▶ Literally, the AI mimics nature
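The computation is short once we recall that the quadratic loss is minimized by the conditional expectation; with $w$ independent of $x$ and zero mean:

```latex
\Phi^*_S(x) \;=\; \mathbb{E}\big[\, y \mid x \,\big]
           \;=\; \mathbb{E}\big[\, Ax + w \mid x \,\big]
           \;=\; Ax + \mathbb{E}\big[\, w \,\big]
           \;=\; Ax
```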
Parametrized Empirical Risk Minimization
▶ Suppose the model is unknown ⇒ Instead, we have access to $Q$ data pairs $(x_q, y_q)$ in a training set $\mathcal{T}$
▶ Hypothesize a linear parametrization $\Phi(x) = Hx$ and formulate the parametrized ERM problem

$$ H^* \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^{m \times n}} \; \frac{1}{Q} \sum_{q=1}^{Q} \Big[ \frac{1}{2} \big\| y_q - H x_q \big\|_2^2 \Big]
   \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^{m \times n}} \; L(H) $$

▶ Solve with SGD, where the pointwise gradient is $-(y_q - H x_q)\,x_q^T$:

$$ H_{t+1} \;=\; H_t - \varepsilon \, \hat g(H_t) \;=\; H_t + \frac{\varepsilon}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \big( y_q - H_t x_q \big) \, x_q^T $$

▶ We can use the linear parametrization irrespective of the actual model relating inputs $x_q$ to outputs $y_q$
  ⇒ But it will work well only if the parametrization matches the unknown model
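Putting the generator and this recursion together gives a compact version of the experiments on the next slides (a sketch; the stepsize, batch size, and iteration count are illustrative assumptions, not the values used for the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n = m = 100; Q = 1000; eps = 0.1; Qt = 50  # illustrative choices

A = rng.normal(size=(m, n)) / np.sqrt(n)
X = rng.normal(size=(Q, n))
Y = X @ A.T + 0.1 * rng.normal(size=(Q, m))  # matched case: linear data model

H = np.zeros((m, n))
for t in range(3000):
    idx = rng.choice(Q, size=Qt, replace=False)
    R = Y[idx] - X[idx] @ H.T              # batch residuals y_q - H_t x_q
    H = H + (eps / Qt) * (R.T @ X[idx])    # H_{t+1} = H_t + (eps/Q_t) sum (y_q - H_t x_q) x_q^T

mse = np.mean((Y - X @ H.T) ** 2)
print(mse)  # approaches the noise variance 0.1^2 when model and parametrization match
```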
Parametrization and Model are Matched
▶ Data generated by the linear model with dimensions $m = n = 10^2$. Number of samples $Q = 10^3$
▶ ERM learning with the linear parametrization ⇒ The SGD trajectory reduces the loss in the training set (left)
▶ Live operation tested outside of the training set ⇒ The loss is also reduced in the test set (right)

  [Figure: MSE over the dataset vs. number of gradient steps, decreasing from about 0.011 to 0.004; training set (left) and test set (right)]

▶ The model is linear. The parametrization is linear. The parametrization learns the model
Parametrization and Model are Mismatched
▶ Data generated by the sign model with dimensions $m = n = 10^2$. Number of samples $Q = 10^3$
▶ ERM learning with the linear parametrization ⇒ The SGD trajectory reduces the loss in the training set (left)
▶ But we converge to a high loss. We do not learn ⇒ The situation is just as bad in the test set (right)

  [Figure: MSE over the dataset vs. number of gradient steps, plateauing between roughly 1.00 and 0.65; training set (left) and test set (right)]

▶ The model is NOT linear. The parametrization is linear. The parametrization DOES NOT learn the model
Parametrization and Model are Matched but There is Insufficient Data
▶ Data generated by the linear model with dimensions $m = n = 10^2$. Number of samples $Q = 10^2$
▶ ERM learning with the linear parametrization ⇒ The SGD trajectory reduces the loss in the training set (left)
▶ Live operation tested outside of the training set ⇒ The loss is NOT reduced in the test set (right)

  [Figure: MSE over the dataset vs. number of gradient steps; the loss decreases over the training set (left) but not over the test set (right)]

▶ The model is linear. The parametrization is linear. There is not enough data to learn the model ⇒ There never is
Machine Learning is Model Free but Not Model Free
▶ Machine learning does not require a model relating inputs $x$ to outputs $y$
  ⇒ For example, we don't need to know the matrix $A$
▶ But we do need to know a class of functions to which the model belongs
  ⇒ For example, we need to know that the model relating inputs to outputs is linear
▶ The model also needs to be sufficiently simple to operate with insufficient data
  ⇒ This is where we leverage structure using convolutional architectures such as CNNs and GNNs