Artificial Intelligence as Statistical Learning
▶ Before we talk about GNNs, we need to specify what we mean by learning and intelligence
  ⇒ Statistical Risk Minimization and Empirical Risk Minimization
Observations and Information
▶ An Artificial Intelligence (AI) extracts information from observations ⇒ Sorry to disappoint you
▶ E.g., this image is an observation. The intelligence tells you this is your professor, Alejandro

  Observations → [ Artificial Intelligence ] → Information

▶ Names vary across communities. My own professional bias is to talk of signals and signal processing
  ⇒ And to call Observations = Input(s) and Information = Output(s)
Statistical Models and Statistical Estimation
▶ Observations (inputs) $x$ and information (outputs) $y$ are related by a statistical model $p(x, y)$

  $x \in \mathbb{R}^n$ → [ $p(x, y)$ ] → $y \in \mathbb{R}^p$

▶ Given that the universe (nature) associates inputs $x$ and outputs $y$ according to the distribution $p(x, y)$
  ⇒ The AI should predict $y$ from $x$ with the conditional distribution ⇒ $y \sim p(y \mid x)$
  ⇒ Or, if we want a deterministic output, with the conditional expectation ⇒ $y = \mathbb{E}[\, y \mid x \,]$
▶ There is a lot to say about statistical estimation, but it is beyond the scope of this course
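As a concrete illustration (an addition, not from the original slides): when $x$ and $y$ are jointly Gaussian, this conditional expectation has a closed form and is linear in $x$:

```latex
% Jointly Gaussian (x, y) with means \mu_x, \mu_y and covariance blocks
% \Sigma_{xx}, \Sigma_{yx}: the optimal deterministic prediction is linear in x
\mathbb{E}\big[\, y \mid x \,\big]
    \;=\; \mu_y \;+\; \Sigma_{yx}\, \Sigma_{xx}^{-1} \big( x - \mu_x \big)
```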
Estimation Loss
▶ AI is not perfect. Nature and the AI may produce different outputs when presented with the same input

  Nature relates $x$ and $y$ with the distribution $p(x, y)$:  $x$ → [ $p(x, y)$ ] → $y$
  The AI relates $x$ and $\hat y$ with the function $\Phi(x)$:  $x \in \mathbb{R}^n$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$

▶ A loss function $\ell(y, \hat y) = \ell(y, \Phi(x))$ measures the cost of predicting $\hat y = \Phi(x)$ when the actual output is $y$
  ⇒ In estimation problems we often use the quadratic loss ⇒ $\ell(y, \hat y) = \| y - \hat y \|_2^2$
  ⇒ In classification problems we often use the hit loss ⇒ $\ell(y, \hat y) = \| y - \hat y \|_0 = \#(y \neq \hat y)$
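A minimal numpy sketch of these two losses (an illustration added here; the function names are hypothetical):

```python
import numpy as np

def quadratic_loss(y, y_hat):
    """Squared 2-norm loss, common in estimation problems."""
    return np.sum((y - y_hat) ** 2)

def hit_loss(y, y_hat):
    """Zero-norm (hit) loss: counts the entries where prediction and truth differ."""
    return np.count_nonzero(y != y_hat)

y     = np.array([1, -1, 1, -1])
y_hat = np.array([1,  1, 1, -1])
print(quadratic_loss(y, y_hat))  # 4: one entry misses by 2, squared
print(hit_loss(y, y_hat))        # 1: exactly one entry differs
```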
Statistical Risk Minimization
▶ Average the loss $\ell(y, \Phi(x))$ over nature's distribution $p(x, y)$ and choose the best estimator/classifier

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \Big[ \ell\big( y, \Phi(x) \big) \Big] $$

▶ Predict $\Phi(x)$. Nature draws $y$. Evaluate the loss $\ell$. Take the loss expectation over the distribution $p(x, y)$
  ⇒ The optimal estimator is the function with minimum average cost over all possible estimators
▶ This optimization program is called the statistical risk minimization (SRM) problem
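For the quadratic loss, this minimization can be solved pointwise, recovering the conditional expectation introduced earlier (a standard derivation added here for completeness):

```latex
% Minimize the conditional risk separately for each x; setting the gradient
% with respect to \hat{y} to zero gives  \hat{y} = E[ y | x ]
\Phi^*(x) \;=\; \operatorname*{argmin}_{\hat y} \;
    \mathbb{E}\big[\, \| y - \hat y \|_2^2 \;\big|\; x \,\big]
\;=\; \mathbb{E}\big[\, y \mid x \,\big]
```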
Training / Learning
▶ Learning, or training, is the process of solving the statistical risk minimization problem

  Learning / Training: compare the AI, $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$, against nature, $x$ → [ $p(x, y)$ ] → $y$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ The outcome of learning is the function $\Phi^*$ with minimum average statistical loss ⇒ We learn to estimate $y$
  ⇒ During execution time, we just evaluate $\Phi^*(x)$ to predict the output associated with input $x$
A Word on Models
▶ We have seen how to formulate learning as a mathematical program
▶ Our formulation requires access to a model (a probability distribution). This is easier said than done
We Need Access to a Model
▶ We have reduced learning to the solution of a statistical risk minimization (SRM) problem
  ⇒ Requires access to the distribution $p(x, y)$ ⇒ A model of how $x$ and $y$ are jointly generated

  Learning / Training: compare $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$ against $x$ → [ $p(x, y)$ ] → $y$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ This block diagram does not work without a model ⇒ Where is the model coming from?
Systems’ Modeling
▶ Maybe we know the laws that relate inputs and outputs ⇒ Indeed, we very often do
▶ Do not underestimate models ⇒ This is how we design the vast majority of marvels around you
Systems’ Identification
▶ Or, we acquire data pairs $(x_q, y_q) \sim p(x, y)$ to estimate the model ⇒ We learn the distribution
▶ Very powerful too ⇒ What we do not design with models, we design with systems identification
Machine Learning
▶ Bypass the learning of the distribution ⇒ Go straight to the learning of the estimation map $\Phi(x)$

  Learning / Training: compare $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$ against data samples $(x, y)$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ Very powerful ⇒ Recent impressive transformations in speech processing and computer vision
Empirical Risk Minimization
▶ Learning bypasses models. It tries to imitate observations. Let us formulate this mathematically
Artificial Intelligence (AI) / Machine Learning (ML)
▶ AI and ML in this course refer to the pipeline where we learn from data samples, not distributions

  Learning / Training: compare $x$ → [ $\Phi(x)$ ] → $\hat y = \Phi(x)$ against data samples $(x, y)$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi} \mathbb{E}\big[ \ell( y, \Phi(x) ) \big]$

▶ The AI learns to imitate input-output pairs observed in nature
Approximating the Statistical Risk
▶ Statistical risk minimization works on the cost averaged over the distribution of inputs and outputs

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big] $$

▶ This expectation can be approximated with data
  ⇒ Acquire a training set $\mathcal{T}$ with $Q$ pairs $(x_q, y_q)$ drawn independently from the distribution $p(x, y)$
  ⇒ For sufficiently large $Q$ we can approximate

$$ \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big] \;\approx\; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

  ⇒ This is just the law of large numbers. True under very mild conditions
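A quick numerical check of this approximation, on an assumed linear-Gaussian model where the statistical risk is known in closed form (a sketch; all names, dimensions, and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, Q, sigma = 5, 10_000, 0.1

# Assumed model (for illustration only): y = A x + w with white Gaussian noise w.
A = rng.normal(size=(n, n))
X = rng.normal(size=(Q, n))
Y = X @ A.T + sigma * rng.normal(size=(Q, n))

# Candidate estimator Phi(x) = A x, evaluated with the quadratic loss.
empirical_risk = np.mean(np.sum((Y - X @ A.T) ** 2, axis=1))

# For this Phi the statistical risk is E||w||^2 = n * sigma^2 exactly.
print(empirical_risk, n * sigma**2)  # the two numbers agree for large Q
```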
Empirical Risk Minimization (ERM)
▶ Replace statistical risk minimization (SRM) with empirical risk minimization (ERM)

$$ \Phi^*_S \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big]
   \quad\Rightarrow\quad
   \Phi^*_E \;=\; \operatorname*{argmin}_{\Phi} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ Since the objectives are close, one would think the optima are close ⇒ $\Phi^*_S \approx \Phi^*_E$
  ⇒ Alas, this is not true ⇒ $\Phi^*_S \not\approx \Phi^*_E$ ⇒ Statistical and empirical risk minimizers need not be close
▶ In fact, the solution of ERM is trivial ⇒ Make $\Phi(x_q) = y_q$ for all pairs in the training set
▶ As trivial as nonsensical ⇒ It yields no information about observations outside the training set
ERM with Learning Parametrizations
▶ Our first attempt at learning from data led to an ERM problem that does not make sense
▶ The search for a problem that makes sense brings us to the notion of learning parametrizations
Learning Parametrization
▶ A sensible ERM problem requires the introduction of a function class $\mathcal{C}$

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ For example, we can select the class of linear functions $\Phi(x) = Hx$ and solve for

$$ H^* \;=\; \operatorname*{argmin}_{H} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, H x_q \big) $$

▶ This choice of parametrization may be good or bad. But at least it is sensible
  ⇒ Good or bad, having $H^*$ allows estimates $\hat y = H^* x$ for observations $x$ outside the training set
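For the quadratic loss, this linear ERM problem is a least-squares problem with a closed-form solution; a sketch with numpy on synthetic data (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, Q = 4, 3, 200

# Training set drawn from a linear model that is unknown to the learner.
A = rng.normal(size=(m, n))
X = rng.normal(size=(Q, n))                   # rows are inputs x_q
Y = X @ A.T + 0.05 * rng.normal(size=(Q, m))  # rows are outputs y_q

# ERM over linear functions Phi(x) = H x with quadratic loss is least squares:
# H* = argmin_H (1/Q) sum_q || y_q - H x_q ||^2
H_star = np.linalg.lstsq(X, Y, rcond=None)[0].T

# The learned H* produces estimates for inputs outside the training set.
x_new = rng.normal(size=n)
print(H_star @ x_new)
```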
Statistical and Empirical Risk Minimization with Learning Parametrizations
▶ Selecting $\mathcal{C}$ to contain sufficiently smooth functions makes SRM and ERM close

$$ \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \mathbb{E}_{p(x,y)} \big[ \ell( y, \Phi(x) ) \big]
   \;\approx\;
   \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ Fundamental theorem of statistical learning ⇒ ERM is a valid approximation of SRM
▶ We need to identify the appropriate function class $\mathcal{C}$ ⇒ But this problem is unavoidable
Learning / Training (For Real)
▶ SRM learns from a model ⇒ Parametrized ERM learns from data ⇒ Three differences:
  ⇒ The distribution is unknown ⇒ We have access to a training set of data samples $(x_q, y_q) \in \mathcal{T}$
  ⇒ The nonparametric ERM problem is nonsensical ⇒ We restrict the function class to $\Phi \in \mathcal{C}$
  ⇒ The statistical risk ⇒ Is replaced by the empirical risk

  Learning / Training: compare $x_q$ → [ $\Phi(x_q) \in \mathcal{C}$ ] → $\hat y_q = \Phi(x_q)$ against samples $(x_q, y_q) \in \mathcal{T}$,
  and output $\Phi^* = \operatorname{argmin}_{\Phi \in \mathcal{C}} \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big)$
The Meaning of ML or AI in this Course
▶ Here, Machine Learning (ML) ≡ Artificial Intelligence (AI) ≡ Empirical Risk Minimization (ERM)

$$ \Phi^* \;=\; \operatorname*{argmin}_{\Phi \in \mathcal{C}} \sum_{(x,y) \in \mathcal{T}} \ell\big( y, \Phi(x) \big)
   \;=\; \operatorname*{argmin}_{\Phi \in \mathcal{C}} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q) \big) $$

▶ The components of ERM are a dataset, a loss function and, most importantly, a function class
▶ To make the parametrization more explicit ⇒ Use a parameter $H \in \mathbb{R}^p$ to span the function class $\Phi(x; H)$

$$ H^* \;=\; \operatorname*{argmin}_{H} \sum_{(x,y) \in \mathcal{T}} \ell\big( y, \Phi(x; H) \big) $$

▶ Designing an ML / AI system means selecting the appropriate function class $\mathcal{C}$ ⇒ What else?
  ⇒ The function class determines generalization from inputs in the training set to unseen inputs
Stochastic Gradient Descent (SGD)
▶ SGD is the customary method for the minimization of the empirical risk
Training ≡ Minimization of Empirical Risk
▶ We have seen that the training of an estimator requires minimization of the empirical risk

$$ H^* \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^p} \; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q; H) \big)
   \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^p} \; L(H) $$

▶ This is minimization of the average loss function $L(H)$, defined as the average of the pointwise losses

$$ L(H) \;:=\; \frac{1}{Q} \sum_{q=1}^{Q} \ell\big( y_q, \Phi(x_q; H) \big) $$

▶ Its particular form notwithstanding, this is just a minimization ⇒ Use the gradient descent algorithm
Gradient Descent
▶ The gradient $g(H) = \nabla L(H)$ is perpendicular to the level sets of the loss $L(H)$
▶ Thus, negative gradients point towards the minimum. Not directly, but within an angle of less than $\pi/2$
  ⇒ $- g^T(H) \, (H - H^*) \;\geq\; 0$
▶ We can then use gradients in a gradient descent algorithm

$$ H_{t+1} \;=\; H_t \;-\; \varepsilon \, g(H_t) $$

▶ Converges to the optimum $H^*$ if the stepsize $\varepsilon$ is sufficiently small

  [Figure: level sets of $L(H)$ with negative gradients $-g(H)$ pointing towards the minimizer $H^*$]
Gradient Descent in Empirical Risk Minimization
▶ The gradient of the average loss function is the average of the gradients of the pointwise losses

$$ g(H) \;=\; \nabla L(H) \;=\; \frac{1}{Q} \sum_{q=1}^{Q} \nabla_H \, \ell\big( y_q, \Phi(x_q; H) \big) $$

▶ Equipped with gradients, we write the gradient descent method as the recursion

$$ H_{t+1} \;=\; H_t - \varepsilon \, g(H_t) \;=\; H_t - \frac{\varepsilon}{Q} \sum_{q=1}^{Q} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ This is all good, but those gradients are costly to compute ⇒ An average of $Q$ pointwise gradients
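A sketch of this recursion for the quadratic loss and the linear parametrization $\Phi(x; H) = Hx$, where the pointwise gradient is $-(y_q - H x_q)\,x_q^T$ (an illustration added here; function names and defaults are hypothetical):

```python
import numpy as np

def full_gradient(H, X, Y):
    """Average over all Q samples of the pointwise gradients of (1/2)||y_q - H x_q||^2."""
    residuals = Y - X @ H.T            # row q holds y_q - H x_q
    return -(residuals.T @ X) / len(X)

def gradient_descent(X, Y, eps=0.1, steps=500):
    """H_{t+1} = H_t - eps * g(H_t); every step touches the whole dataset."""
    H = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(steps):
        H = H - eps * full_gradient(H, X, Y)
    return H
```

Note that each step costs $Q$ pointwise gradient evaluations, which is exactly the expense the slide points out.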
Stochastic Gradient Descent (SGD)
▶ At iteration $t$, select a batch $\mathcal{T}_t$ of $Q_t \ll Q$ samples, chosen at random from the dataset $\mathcal{T}$
▶ Define the stochastic gradient as the average over the batch

$$ \hat g(H_t) \;=\; \frac{1}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ SGD ≡ Replace gradients with stochastic gradients in gradient descent

$$ H_{t+1} \;=\; H_t - \varepsilon \, \hat g(H_t) \;=\; H_t - \frac{\varepsilon}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ This is cheaper to implement because the average runs over a smaller number of pointwise gradients
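The same recursion with minibatches; only the gradient computation changes (a self-contained sketch for the linear least-squares case, with illustrative parameter choices):

```python
import numpy as np

def sgd(X, Y, eps=0.1, batch_size=32, steps=2000, seed=0):
    """Minibatch SGD for the linear least-squares ERM problem."""
    rng = np.random.default_rng(seed)
    Q = len(X)
    H = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(steps):
        # Batch T_t: Q_t << Q samples chosen at random from the dataset T.
        idx = rng.choice(Q, size=batch_size, replace=False)
        residuals = Y[idx] - X[idx] @ H.T
        g_hat = -(residuals.T @ X[idx]) / batch_size  # stochastic gradient
        H = H - eps * g_hat                           # H_{t+1} = H_t - eps * g_hat(H_t)
    return H
```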
Stochastic Gradients are Unbiased Estimators of Gradients
▶ If samples are chosen independently and with equal probability ⇒ $\mathbb{E}\big[ \hat g(H_t) \big] = g(H_t)$
▶ Stochastic gradients point in the right direction on average
▶ We move towards the optimum more often than not
▶ The expected angle is acute ⇒ $\mathbb{E}\big[ - \hat g^T(H) \, (H - H^*) \big] \;\geq\; 0$
▶ We can build a submartingale and prove convergence

  [Figure: level sets of $L(H)$ with stochastic gradients $-\hat g(H_t)$ pointing towards $H^*$ on average]
Stochastic Gradient Descent Memorabilia
▶ I covered SGD briefly because there are some things I wanted you to know
SGD Memorabilia
▶ GD converges because negative gradients point towards the optimum ⇒ $- g^T(H) \, (H - H^*) \geq 0$
▶ SGD converges because stochastic gradients do so in expectation ⇒ $\mathbb{E}\big[ - \hat g^T(H) \, (H - H^*) \big] \geq 0$
▶ Computing stochastic gradients is much cheaper ($Q_t \ll Q$) than computing gradients

$$ \hat g(H_t) = \frac{1}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big)
   \quad \text{vs} \quad
   g(H_t) = \frac{1}{Q} \sum_{(x_q, y_q) \in \mathcal{T}} \nabla_H \, \ell\big( y_q, \Phi(x_q; H_t) \big) $$

▶ This cost difference is the difference between a method that works and one that does not
Convergence of Stochastic Gradient Descent
▶ Convergence means that as the iteration index $t$ grows ⇒ $\liminf_{t \to \infty} \big\| H_t - H^* \big\|^2 \leq O\big( \varepsilon / \sqrt{Q_t} \big)$
  ⇒ We do not converge exactly ⇒ We approach the optimum and hover around it
  ⇒ The size of the hover region is proportional to the stepsize
  ⇒ The size of the hover region is inversely proportional to the square root of the batch size
▶ For large batch size $Q_t$ we have $\hat g(H_t) \approx g(H_t)$ ⇒ Not needed ⇒ Mistakes are corrected in subsequent steps
When Objectives are not Convex
▶ The plots illustrate, the comments apply, and the results hold for convex functions
  ⇒ Not always (rarely!) true ⇒ Notably, not true for neural networks, convolutional or not
▶ Gradients may move the iterates towards
  ⇒ The global minimum $H^*$
  ⇒ A local minimum $H^\dagger$
  depending on the initial condition

  [Figure: a nonconvex loss where negative gradients $-g(H)$ lead to either the global minimum $H^*$ or a local minimum $H^\dagger$]

▶ We may converge to local minima but we won't care ⇒ We implicitly assume that $H^\dagger$ is optimal
Stochastic Gradient Descent is Finicky
▶ Stochastic gradient descent is not a great algorithm. It is just the one we have
▶ Convergence speed, and convergence itself, are very sensitive to the choice of parameters
▶ It requires trying different stepsizes and different batch sizes. Maybe different initial conditions
  ⇒ Small changes in any of these parameters may have large effects on convergence
The Importance of Learning Parametrizations
▶ AI reduces to ERM. And in ERM all we have to do is choose a parametrization
▶ Not an easy choice ⇒ The parametrization controls generalization. Make or break
▶ The parametrization is a model of how outputs are related to inputs ⇒ It has to be accurate
Data Generation Models
▶ To illustrate the effect of learning parametrizations, we generate synthetic data following models we specify
▶ A linear model with inputs $x \in \mathbb{R}^n$ and outputs $y \in \mathbb{R}^m$ related by ⇒ $y = Ax + w$
  ⇒ For some matrix $A \in \mathbb{R}^{m \times n}$. The noise $w \in \mathbb{R}^m$ is white Gaussian, independent of $x$, with mean $\mathbb{E}(w) = 0$
▶ A nonlinear model postprocessing the linear model with a sign function ⇒ $y = \operatorname{sign}(Ax + w)$
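A sketch of the two generators (the scaling of $A$ and the noise level are illustrative choices, not specified in the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = m = 100   # input/output dimensions (10^2, as in the upcoming experiments)
Q = 1000      # number of training samples (10^3)

A = rng.normal(size=(m, n)) / np.sqrt(n)  # fixed matrix defining the model
X = rng.normal(size=(Q, n))               # inputs x_q, one per row
W = 0.1 * rng.normal(size=(Q, m))         # white Gaussian noise with E(w) = 0

Y_linear = X @ A.T + W           # linear model:    y = A x + w
Y_sign   = np.sign(X @ A.T + W)  # nonlinear model: y = sign(A x + w)
```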
Statistical Risk Minimization
▶ Given that we know the models, we can compute the statistical risk minimizer (SRM) ⇒ "The AI"
▶ For instance, if we use the squared 2-norm loss to penalize AI estimation errors

$$ \Phi^*_S \;=\; \operatorname*{argmin}_{\Phi} \; \mathbb{E}_{p(x,y)} \Big[ \frac{1}{2} \big\| y - \Phi(x) \big\|_2^2 \Big] $$

▶ Using the given model $y = Ax + w$ and taking derivatives, the AI is ⇒ $\Phi^*_S(x) = Ax$
▶ Literally, the AI mimics nature
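The computation is short once we recall that the quadratic loss is minimized by the conditional expectation; with $w$ independent of $x$ and zero mean:

```latex
\Phi^*_S(x) \;=\; \mathbb{E}\big[\, y \mid x \,\big]
           \;=\; \mathbb{E}\big[\, Ax + w \mid x \,\big]
           \;=\; Ax + \mathbb{E}\big[\, w \,\big]
           \;=\; Ax
```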
Parametrized Empirical Risk Minimization
▶ Suppose the model is unknown ⇒ Instead, we have access to $Q$ data pairs $(x_q, y_q)$ in a training set $\mathcal{T}$
▶ Hypothesize a linear parametrization $\Phi(x) = Hx$ and formulate the parametrized ERM problem

$$ H^* \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^{m \times n}} \; \frac{1}{Q} \sum_{q=1}^{Q} \Big[ \frac{1}{2} \big\| y_q - H x_q \big\|_2^2 \Big]
   \;=\; \operatorname*{argmin}_{H \in \mathbb{R}^{m \times n}} \; L(H) $$

▶ Solve with SGD, where the pointwise gradient is $-(y_q - H x_q)\,x_q^T$:

$$ H_{t+1} \;=\; H_t - \varepsilon \, \hat g(H_t) \;=\; H_t + \frac{\varepsilon}{Q_t} \sum_{(x_q, y_q) \in \mathcal{T}_t} \big( y_q - H_t x_q \big) \, x_q^T $$

▶ We can use the linear parametrization irrespective of the actual model relating inputs $x_q$ to outputs $y_q$
  ⇒ But it will work well only if the parametrization matches the unknown model
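Putting the generator and this recursion together gives a compact version of the experiments on the next slides (a sketch; the stepsize, batch size, and iteration count are illustrative assumptions, not the values used for the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n = m = 100; Q = 1000; eps = 0.1; Qt = 50  # illustrative choices

A = rng.normal(size=(m, n)) / np.sqrt(n)
X = rng.normal(size=(Q, n))
Y = X @ A.T + 0.1 * rng.normal(size=(Q, m))  # matched case: linear data model

H = np.zeros((m, n))
for t in range(3000):
    idx = rng.choice(Q, size=Qt, replace=False)
    R = Y[idx] - X[idx] @ H.T              # batch residuals y_q - H_t x_q
    H = H + (eps / Qt) * (R.T @ X[idx])    # H_{t+1} = H_t + (eps/Q_t) sum (y_q - H_t x_q) x_q^T

mse = np.mean((Y - X @ H.T) ** 2)
print(mse)  # approaches the noise variance 0.1^2 when model and parametrization match
```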
Parametrization and Model are Matched
▶ Data generated by the linear model with dimensions $m = n = 10^2$. Number of samples $Q = 10^3$
▶ ERM learning with the linear parametrization ⇒ The SGD trajectory reduces the loss in the training set (left)
▶ Live operation tested outside of the training set ⇒ The loss is also reduced in the test set (right)

  [Figure: MSE over the dataset vs. number of gradient steps, decreasing from about 0.011 to 0.004; training set (left) and test set (right)]

▶ The model is linear. The parametrization is linear. The parametrization learns the model
Parametrization and Model are Mismatched
▶ Data generated by the sign model with dimensions $m = n = 10^2$. Number of samples $Q = 10^3$
▶ ERM learning with the linear parametrization ⇒ The SGD trajectory reduces the loss in the training set (left)
▶ But we converge to a high loss. We do not learn ⇒ The situation is just as bad in the test set (right)

  [Figure: MSE over the dataset vs. number of gradient steps, plateauing between roughly 1.00 and 0.65; training set (left) and test set (right)]

▶ The model is NOT linear. The parametrization is linear. The parametrization DOES NOT learn the model
Parametrization and Model are Matched but There is Insufficient Data
▶ Data generated by the linear model with dimensions $m = n = 10^2$. Number of samples $Q = 10^2$
▶ ERM learning with the linear parametrization ⇒ The SGD trajectory reduces the loss in the training set (left)
▶ Live operation tested outside of the training set ⇒ The loss is NOT reduced in the test set (right)

  [Figure: MSE over the dataset vs. number of gradient steps; the loss decreases over the training set (left) but not over the test set (right)]

▶ The model is linear. The parametrization is linear. There is not enough data to learn the model ⇒ There never is
Machine Learning is Model Free but Not Model Free
▶ Machine learning does not require a model relating inputs $x$ to outputs $y$
  ⇒ For example, we don't need to know the matrix $A$
▶ But we do need to know a class of functions to which the model belongs
  ⇒ For example, we need to know that the model relating inputs to outputs is linear
▶ The model also needs to be sufficiently simple to operate with insufficient data
  ⇒ This is where we leverage structure using convolutional architectures such as CNNs and GNNs