Lecture 5: Optimization Formulations of Data Analysis Problems

March 18 - 25, 2020

1 Setup

Practical data sets are often extremely messy. Data may be mislabeled, noisy, incomplete, or otherwise corrupted. Dasu and Johnson claim that "80% of data analysis is spent on the process of cleaning and preparing the data."

The data set $\mathcal{D}$ in a typical analysis problem consists of $m$ objects:

$$\mathcal{D} = \{(a_j, y_j),\ j = 1, 2, \dots, m\},$$

where $a_j$ is a vector (or matrix) of features and $y_j$ is a label or observation.

The analysis task then consists of discovering a function $\phi$ such that

$$\phi(a_j) \approx y_j$$

holds for most $j = 1, \dots, m$. The process of discovering the mapping $\phi$ is often called "learning" or "training."

The problem of identifying $\phi$ is usually a data-fitting problem:

Find the parameters $x$ defining $\phi$ such that $\phi(a_j) \approx y_j$, $j = 1, \dots, m$, in some optimal sense. Once we come up with a definition of the term "optimal," we have an optimization problem.

Many such optimization formulations have objective functions of the "summation" type

$$L_{\mathcal{D}}(x) = \sum_{j=1}^{m} \ell(a_j, y_j; x),$$

where the $j$th term $\ell(a_j, y_j; x)$ is a measure of the mismatch between $\phi(a_j)$ and $y_j$, and $x$ is the vector of parameters that determines $\phi$. The optimization problem is

$$\min_x \ L_{\mathcal{D}}(x).$$

Uses of $\phi$: (1) prediction; (2) feature selection; (3) revealing data structure; (4) many others.

Examples of labels $y_j$ include the following:

(1) A real number, leading to a regression problem.

(2) A label, say $y_j \in \{1, 2, \dots, M\}$, indicating that $a_j$ belongs to one of $M$ classes. This is a classification problem.

(3) Null (i.e., no labels). Unsupervised learning: clustering (grouping), PCA.

Other issues:

(1) Noisy and/or corrupted data: a robust $\phi$ is required.

(2) Parts of $a_j$ and/or $y_j$ are not known: how should missing data be treated?

(3) Streaming data (NOT available all at once): an online $\phi$ is required.

(4) Overfitting (bad): $\phi$ is too sensitive to the particular sample $\mathcal{D}$.

Generalization or regularization comes to the rescue.

2 Least squares

The data points $(a_j, y_j)$ lie in $\mathbb{R}^n \times \mathbb{R}$, and we solve

$$\min_x \ \left\{ \frac{1}{2m} \sum_{j=1}^{m} (a_j^T x - y_j)^2 = \frac{1}{2m} \|Ax - y\|_2^2 \right\},$$

where $A$ is the matrix whose rows are $a_j^T$, $j = 1, \dots, m$, and

$$y = \begin{bmatrix} y_1 & y_2 & \cdots & y_m \end{bmatrix}^T.$$

The function $\phi$ is defined by

$$\phi(a) = a^T x.$$

We could also introduce a nonzero intercept by adding an extra parameter $\beta \in \mathbb{R}$ and defining $\phi(a) = a^T x + \beta$.

Statistically, when the observations $y_j$ are contaminated with i.i.d. Gaussian noise, the least squares solution $x$ is the maximum likelihood estimate.

Impose desirable structure on $x$:

(1) Tikhonov regularization with a squared $\ell_2$-norm,

$$\min_x \ \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_2^2, \quad \lambda > 0,$$

yields a solution $x$ with less sensitivity to perturbations in the data $(a_j, y_j)$.

(2) The LASSO formulation,

$$\min_x \ \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_1, \quad \lambda > 0,$$

tends to yield solutions $x$ that are sparse, that is, containing relatively few nonzero components.

The $\ell_1$ norm promotes sparsity, compared with the $\ell_2$ norm.

LASSO performs feature selection: to gather a new data vector $a$ for prediction, we need to find only the "selected" features.
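A minimal sketch (not part of the lecture; it assumes NumPy and scikit-learn are available) comparing the plain least squares, Tikhonov (ridge), and LASSO formulations above. Note that scikit-learn's Lasso uses the same $1/(2m)$ scaling as the lecture, while its Ridge omits that factor, so the regularization weights are not directly comparable.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:3] = [2.0, -1.0, 0.5]            # only 3 relevant features
y = A @ x_true + 0.1 * rng.standard_normal(m)

ls = LinearRegression(fit_intercept=False).fit(A, y)
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(A, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(A, y)

# LASSO should drive most irrelevant coefficients exactly to zero
for name, model in [("least squares", ls), ("ridge", ridge), ("LASSO", lasso)]:
    print(name, "nonzeros:", int(np.sum(np.abs(model.coef_) > 1e-6)))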

3 Matrix completion

Suppose $A_j \in \mathbb{R}^{n \times p}$. We seek $X \in \mathbb{R}^{n \times p}$ that solves

$$\min_X \ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, X \rangle_F - y_j)^2,$$

where $\langle A, B \rangle_F = \mathrm{tr}(A^T B)$.

A regularized version, leading to solutions $X$ that are low-rank, is

$$\min_X \ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, X \rangle_F - y_j)^2 + \lambda \|X\|_*, \quad \lambda > 0,$$

where $\|X\|_*$ is the nuclear norm (the sum of singular values of $X$).

Rank-$r$ version: Let $X = LR^T$ with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and $r \ll \min(n, p)$. We solve

$$\min_{L,R} \ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, LR^T \rangle_F - y_j)^2.$$

In this formulation, the rank $r$ is "hard-wired" into the definition of $X$ via two "thin-tall" matrices $L$ and $R$.

The objective function is nonconvex.

Advantage: the total number of elements in $L$ and $R$ is $(n + p)r$, which is much less than $np$.

Minimum-rank version:

$$\min_X \ \mathrm{rank}(X) \quad \text{s.t.} \quad \langle A_j, X \rangle_F = y_j, \ j = 1, \dots, m.$$
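A toy sketch (my own illustration, not from the lecture) of the nonconvex rank-$r$ formulation in the matrix-completion setting, where each $A_j$ picks out a single observed entry, so $\langle A_j, LR^T \rangle_F$ is one entry of $LR^T$. Alternating least squares over the factors $L$ and $R$ is one simple way to decrease the objective.

import numpy as np

rng = np.random.default_rng(1)
n, p, r = 50, 40, 3
Y = rng.standard_normal((n, r)) @ rng.standard_normal((p, r)).T   # exactly rank r
mask = rng.random((n, p)) < 0.3                                   # ~30% of entries observed

L = rng.standard_normal((n, r))
R = rng.standard_normal((p, r))
for it in range(20):
    # update each row of L by least squares over its observed entries, then each row of R
    for i in range(n):
        obs = mask[i]
        L[i] = np.linalg.lstsq(R[obs], Y[i, obs], rcond=None)[0]
    for k in range(p):
        obs = mask[:, k]
        R[k] = np.linalg.lstsq(L[obs], Y[obs, k], rcond=None)[0]

err = np.linalg.norm(mask * (L @ R.T - Y)) / np.linalg.norm(mask * Y)
print("relative error on observed entries:", err)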

4 Nonnegative matrix factorization

Applications in computer vision, chemometrics, and document clustering require us to find factors $L \in \mathbb{R}^{n \times r}$ and $R \in \mathbb{R}^{p \times r}$ with all elements nonnegative.

If the full matrix $Y \in \mathbb{R}^{n \times p}$ is observed, this problem has the form

$$\min_{L,R} \ \|LR^T - Y\|_F^2 \quad \text{subject to} \quad L \ge 0, \ R \ge 0.$$
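A short sketch (assuming scikit-learn is available; not part of the lecture) of this factorization. In scikit-learn's notation, fit_transform returns the nonnegative factor playing the role of $L$, and components_ plays the role of $R^T$.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
Y = rng.random((100, 60))                  # nonnegative data matrix

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
L = model.fit_transform(Y)                 # shape (100, 5), entries >= 0
RT = model.components_                     # shape (5, 60),  entries >= 0

print("residual norm ||L R^T - Y||_F:", np.linalg.norm(L @ RT - Y))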

5 Sparse inverse covariance estimation

In this problem, the labels $y_j$ are null, and the vectors $a_j \in \mathbb{R}^n$ are viewed as independent observations of a random vector $a \in \mathbb{R}^n$, which has zero mean.

The sample covariance matrix constructed from these observations is

$$S = \frac{1}{m-1} \sum_{j=1}^{m} a_j a_j^T.$$

The element $S_{il}$ is an estimate of the covariance between the $i$th and $l$th elements of the random vector $a$.

Our interest is in calculating an estimate $X$ of the inverse covariance matrix that is sparse.

The structure of $X$ yields important information about $a$. In particular, if $X_{il} = 0$, we can conclude that the $i$th and $l$th components of $a$ are conditionally independent. (That is, they are independent given knowledge of the values of the other $n - 2$ components of $a$.)

One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix $X$ is the following:

$$\min_{X \in \mathcal{S}^{n \times n},\, X \succ 0} \ \langle S, X \rangle_F - \log \det(X) + \lambda \|X\|_1, \quad \lambda > 0,$$

where $\mathcal{S}^{n \times n}$ is the set of $n \times n$ symmetric matrices, $X \succ 0$ indicates that $X$ is positive definite, and $\|X\|_1 = \sum_{i,l=1}^{n} |X_{il}|$.
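A brief sketch (not from the lecture, assuming scikit-learn) using GraphicalLasso, which solves essentially this penalized log-determinant problem; by default it penalizes only the off-diagonal entries of $X$.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
n, m = 10, 500
X_true = np.eye(n)                         # a sparse precision (inverse covariance) matrix
X_true[0, 1] = X_true[1, 0] = 0.4
X_true[2, 5] = X_true[5, 2] = -0.3
Sigma = np.linalg.inv(X_true)
samples = rng.multivariate_normal(np.zeros(n), Sigma, size=m)

model = GraphicalLasso(alpha=0.05).fit(samples)
X_hat = model.precision_                   # estimated sparse inverse covariance
off_diag = np.abs(X_hat[np.triu_indices(n, 1)])
print("estimated nonzero off-diagonal pairs:", int(np.sum(off_diag > 1e-3)))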

6 Sparse principal components

We have a sample covariance matrix $S$ that is estimated from a number of observations of some underlying random vector.

The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.

It is often of interest to find a sparse principal component: an approximation to the leading eigenvector that also contains few nonzeros.

An explicit optimization formulation of this problem is

$$\max_{v \in \mathbb{R}^n} \ v^T S v \quad \text{s.t.} \quad \|v\|_2 = 1, \ \|v\|_0 \le k,$$

where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$.

A relaxation: replace $vv^T$ by a positive semidefinite proxy $M \in \mathcal{S}^{n \times n}$:

$$\max_{M \in \mathcal{S}^{n \times n}} \ \langle S, M \rangle_F \quad \text{s.t.} \quad M \succeq 0, \ \langle I, M \rangle_F = 1, \ \|M\|_1 \le \rho,$$

for some parameter $\rho > 0$ that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite programming problem.

More generally, one may need to find the leading $r > 1$ sparse principal components: approximations to the leading $r$ eigenvectors that also contain few nonzeros.

Ideally, we would obtain these from a matrix $V \in \mathbb{R}^{n \times r}$ whose columns are mutually orthogonal and have at most $k$ nonzeros each.

The optimization formulation is

$$\max_{V \in \mathbb{R}^{n \times r}} \ \langle S, VV^T \rangle_F \quad \text{s.t.} \quad V^T V = I, \ \|v_i\|_0 \le k, \ i = 1, \dots, r.$$

We can write a convex relaxation of this problem, once again a semidefinite program, as

$$\max_{M \in \mathcal{S}^{n \times n}} \ \langle S, M \rangle_F \quad \text{s.t.} \quad 0 \preceq M \preceq I, \ \langle I, M \rangle_F = r, \ \|M\|_1 \le \rho.$$

A more compact (but nonconvex) formulation is

$$\max_{F \in \mathbb{R}^{n \times r}} \ \langle S, FF^T \rangle_F \quad \text{s.t.} \quad \|F\|_2 \le 1, \ \|F\|_{2,1} \le R,$$

where $\|F\|_{2,1} = \sum_{i=1}^{n} \|F_i\|_2$.
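An illustrative sketch (my own, not from the lecture) of a simple truncated power iteration for the single-component problem $\max v^T S v$ s.t. $\|v\|_2 = 1$, $\|v\|_0 \le k$: after each multiplication by $S$, keep only the $k$ largest-magnitude entries and renormalize.

import numpy as np

def sparse_pc(S, k, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = S @ v
        keep = np.argsort(np.abs(w))[-k:]   # indices of the k largest entries
        v = np.zeros(n)
        v[keep] = w[keep]
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(4)
B = rng.standard_normal((200, 30))
S = B.T @ B / 200                           # a sample covariance matrix
v = sparse_pc(S, k=5)
print("nonzeros:", np.count_nonzero(v), " objective v^T S v:", float(v @ S @ v))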

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed $n \times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is

$$\min_{M,S} \ \|M\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad Y = M + S,$$

where $\|S\|_1 = \sum_{i,j} |S_{ij}|$.

Compact nonconvex formulations that allow noise in the observations include the following ($L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, $S \in \mathbb{R}^{n \times p}$). Fully observed:

$$\min_{L,R,S} \ \frac{1}{2} \|LR^T + S - Y\|_F^2.$$

Partially observed:

$$\min_{L,R,S} \ \frac{1}{2} \|P_\Phi(LR^T + S - Y)\|_F^2,$$

where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.

One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.

Another application is to foreground-background separation in video processing. Here each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.

8 Subspace identification

In this application, the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.

The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n \times r}$.

If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n \times m$ matrix $A = [a_j]_{j=1}^m$ and take $X$ to be the leading $r$ left singular vectors.

In interesting variants of this problem, however, the vectors $a_j$ may arrive in streaming fashion and may be only partly observed, for example in indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that

$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m.$$
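A small sketch (not from the lecture) of the fully observed case: the leading $r$ left singular vectors of the $n \times m$ matrix $A$ give a basis for the best-fit $r$-dimensional subspace.

import numpy as np

rng = np.random.default_rng(5)
n, m, r = 30, 200, 4
X_true = np.linalg.qr(rng.standard_normal((n, r)))[0]     # orthonormal basis of the true subspace
A = X_true @ rng.standard_normal((r, m)) + 0.01 * rng.standard_normal((n, m))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
X_hat = U[:, :r]                                          # estimated basis

# cosines of the principal angles between the true and estimated subspaces (all near 1)
print(np.round(np.linalg.svd(X_true.T @ X_hat, compute_uv=False), 4))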

9 Support vector machines

Classification via support vector machines (SVM) is a classical paradigm in machine learning.

This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that

$$a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$

$$a_j^T x - \beta \le -1 \quad \text{when } y_j = -1.$$

Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$ that separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$.

Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the hyperplanes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)

We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form

$$H(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), 0\} \ge 0.$$

Note that the $j$th term in this summation is zero if the conditions on the previous page are satisfied, and positive otherwise; $\min_{x,\beta} H(x, \beta) = 0$ means that a separating hyperplane exists.

Regularized version:

$$H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), 0\} + \frac{\lambda}{2} \|x\|_2^2.$$

If $\lambda$ is sufficiently small (but positive), and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane. (Consider the limit $\lim_{x \to x_0} \frac{H(x, \beta) - H(x_0, \beta_0)}{\|x_0\|_2^2 - \|x\|_2^2}$.)

The maximum-margin property is consistent with the goals of generalizability and robustness.

Figure: Linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).

The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then

$$\min_{x, \beta, s} \ \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2$$

subject to

$$s_j \ge 1 - y_j(a_j^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$

where $\mathbf{1} = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^T \in \mathbb{R}^m$.

Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
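A hedged sketch (my own illustration, not the lecture's method) of minimizing the regularized hinge-loss objective $H_\lambda(x, \beta)$ directly by subgradient descent, rather than via the quadratic program above.

import numpy as np

rng = np.random.default_rng(6)
m, n = 200, 2
shift = np.where(rng.random(m) < 0.5, 2.0, -2.0)[:, None]
A = rng.standard_normal((m, n)) + shift          # two clouds of points
y = np.where(shift[:, 0] > 0, 1.0, -1.0)         # labels +1 / -1

lam, step = 0.01, 0.1
x, beta = np.zeros(n), 0.0
for it in range(2000):
    margins = y * (A @ x - beta)
    active = margins < 1                         # terms with nonzero hinge loss
    # subgradient of (1/m) sum_j max{1 - y_j(a_j^T x - beta), 0} + (lam/2)||x||_2^2
    g_x = -(y[active, None] * A[active]).sum(axis=0) / m + lam * x
    g_beta = y[active].sum() / m
    x -= step * g_x
    beta -= step * g_beta

print("training accuracy:", float(np.mean(np.sign(A @ x - beta) == y)))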

One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. Then the conditions

$$\zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$

$$\zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1,$$

lead to

$$H_{\zeta,\lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(\zeta(a_j)^T x - \beta), 0\} + \frac{\lambda}{2} \|x\|_2^2.$$

When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.

The problem of minimizing $H_{\zeta,\lambda}(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then

$$\min_{x, \beta, s} \ \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2$$

subject to

$$s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$

where $\mathbf{1} = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^T \in \mathbb{R}^m$.

The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (Convex Quadratic Functions).

The dual problem, in $m$ variables:

$$\min_{z \in \mathbb{R}^m} \ \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda}, \ y^T z = 0,$$

where $Q_{kl} = y_k y_l \zeta(a_k)^T \zeta(a_l)$.

Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K: \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick."

A particularly popular choice of kernel is the Gaussian kernel,

$$K(a_k, a_l) = \exp\left(-\|a_k - a_l\|^2 / (2\sigma)\right),$$

where $\sigma$ is a positive parameter.
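A short sketch (assuming scikit-learn; not from the lecture) of kernel SVM classification with a Gaussian (RBF) kernel. scikit-learn parametrizes the kernel as $\exp(-\gamma \|a_k - a_l\|^2)$, so $\gamma$ corresponds to $1/(2\sigma)$ here, and its constant C plays the role of $1/(m\lambda)$.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
m = 300
A = rng.standard_normal((m, 2))
y = np.where(np.linalg.norm(A, axis=1) < 1.0, 1, -1)   # circular boundary: not linearly separable

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(A, y)
print("training accuracy:", clf.score(A, y))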

10 Logistic regression

We seek an "odds function" $p \in (0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:

$$p(a; x) = (1 + \exp(a^T x))^{-1},$$

and aim to choose the parameter $x$ so that

$$p(a_j; x) \approx 1 \quad \text{when } y_j = 1,$$

$$p(a_j; x) \approx 0 \quad \text{when } y_j = -1.$$

The optimal value of $x$ can be found by maximizing a log-likelihood function:

$$L(x) = \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right).$$

We can perform feature selection using this model by introducing a regularizer:

$$\max_x \ \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right) - \lambda \|x\|_1,$$

where $\lambda > 0$ is a regularization parameter.

The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.

Multiclass (or multinomial) logistic regression: the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
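A minimal sketch (not from the lecture; it assumes scikit-learn) of $\ell_1$-regularized logistic regression for feature selection. Note that scikit-learn uses the convention $p(y = 1 \mid a) = 1/(1 + \exp(-(a^T x + \beta)))$, the opposite sign to the odds function above, and its C is an inverse regularization strength.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
m, n = 500, 30
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:4] = [3.0, -2.0, 1.5, 1.0]                     # only 4 informative features
y = np.where(A @ x_true + 0.5 * rng.standard_normal(m) > 0, 1, -1)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(A, y)
print("selected features:", np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6))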

These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:

$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^{M} \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$

where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.

Note that for all $a$ and for all $k$ we have

$$p_k(a; X) \in (0, 1), \quad \sum_{k=1}^{M} p_k(a; X) = 1.$$

If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^{M}$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then

$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \ \text{ for } \ell \ne k.$$

In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:

$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$

We seek to define the vectors $x_{[k]}$ so that

$$p_k(a_j; X) \approx 1 \quad \text{when } y_{jk} = 1,$$

$$p_k(a_j; X) \approx 0 \quad \text{when } y_{jk} = 0.$$

The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

$$L(X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} (x_{[\ell]}^T a_j) - \log\left( \sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j) \right) \right].$$

Group-sparse regularization terms can be added to this formulation.
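A compact sketch (my own, not from the lecture) of the multiclass log-likelihood $L(X)$ and its gradient, optimized by gradient ascent. The rows of the parameter array play the role of the vectors $x_{[k]}$.

import numpy as np

def log_likelihood_and_grad(Xp, A, Y):
    # Xp: (M, n) parameters, A: (m, n) data (rows a_j), Y: (m, M) one-hot labels
    m = A.shape[0]
    scores = A @ Xp.T                                   # entries x_[l]^T a_j
    smax = scores.max(axis=1, keepdims=True)            # log-sum-exp trick for stability
    logZ = smax[:, 0] + np.log(np.exp(scores - smax).sum(axis=1))
    L = ((Y * scores).sum() - logZ.sum()) / m
    P = np.exp(scores - logZ[:, None])                  # softmax probabilities p_l(a_j; X)
    grad = (Y - P).T @ A / m                            # dL/dx_[k] = (1/m) sum_j (y_jk - p_k) a_j
    return L, grad

rng = np.random.default_rng(9)
m, n, M = 400, 5, 3
A = rng.standard_normal((m, n))
Y = np.eye(M)[np.argmax(A @ rng.standard_normal((M, n)).T, axis=1)]   # one-hot labels

Xp = np.zeros((M, n))
for it in range(500):
    L, g = log_likelihood_and_grad(Xp, A, Y)
    Xp += 1.0 * g                                       # gradient ascent step
print("final log-likelihood:", L)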

11 Deep learning

Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.

The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.

A wonderful reference: "Deep Learning: An Introduction for Applied Mathematicians," Catherine F. Higham and Desmond J. Higham, SIAM Review, 2019.

The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.

Figure: Deep neural network, showing connections between adjacent layers.

The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.

A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l - 1$ to the vector $a_j^l$ at layer $l$, is

$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$

where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g^l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation; and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.

Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:

(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$.

(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$.

(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.

Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.

Using the notation $w$ for the hidden layer transformations, that is,

$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$

and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:

$$L(w, X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} \big(x_{[\ell]}^T a_j^D(w)\big) - \log\left( \sum_{\ell=1}^{M} \exp\big(x_{[\ell]}^T a_j^D(w)\big) \right) \right].$$

We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.

The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in $(w, X)$ is usually very large.
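A toy sketch (my own, assuming ReLU activations and a small fully connected network) of the forward pass $a_j^l = \sigma(W^l a_j^{l-1} + g^l)$ and the evaluation of $L(w, X)$ above. No training loop is shown; in practice the gradient of $L$ would be computed by back-propagation.

import numpy as np

rng = np.random.default_rng(10)
n, M, D, width, m = 8, 3, 2, 16, 100

# hidden-layer parameters w = (W^1, g^1, ..., W^D, g^D)
Ws = [rng.standard_normal((width, n)) * 0.3] + \
     [rng.standard_normal((width, width)) * 0.3 for _ in range(D - 1)]
gs = [np.zeros(width) for _ in range(D)]
X = rng.standard_normal((M, width)) * 0.3          # rows are the top-layer vectors x_[k]

A = rng.standard_normal((m, n))                    # data vectors a_j (rows)
Y = np.eye(M)[rng.integers(0, M, size=m)]          # one-hot labels y_j

def loss(Ws, gs, X, A, Y):
    H = A.T                                        # columns are a_j^0
    for W, g in zip(Ws, gs):
        H = np.maximum(W @ H + g[:, None], 0.0)    # ReLU layer: a_j^l = sigma(W^l a_j^{l-1} + g^l)
    scores = (X @ H).T                             # (m, M) entries x_[l]^T a_j^D(w)
    smax = scores.max(axis=1, keepdims=True)
    logZ = smax[:, 0] + np.log(np.exp(scores - smax).sum(axis=1))
    return ((Y * scores).sum() - logZ.sum()) / A.shape[0]

print("L(w, X) =", loss(Ws, gs, X, A, Y))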

Page 2: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

1 Setup

Practical data sets are often extremely messy Data may be misla-beled noisy incomplete or otherwise corrupted Dasu andJohnson claim out that ldquo80 of data analysis is spent on theprocess of cleaning and preparing the datardquo

The data set D in a typical analysis problem consists of m objects

D = (aj yj) j = 1 2 m

where aj is a vector (or matrix) of features and yj is a label orobservation

The analysis task then consists of discovering a function φ suchthat

φ(aj) asymp yj

holds for most j = 1 m The process of discovering the mappingφ is often called ldquolearningrdquo or ldquotrainingrdquo

Optimization Formulations Lecture 5 March 18 - 25 2020 2 33

The problem of identifying φ usually is a data-fitting problem

Find the parameters x defining φ such that φ(aj) asymp yj j = 1 min some optimal sense Once we come up with a definition of theterm optimal we have an optimization problem

Many such optimization formulations have objective functions ofthe ldquosummationrdquo type

LD(x) =

m983131

j=1

ℓ(aj yj x)

where the jth term ℓ(aj yj x) is a measure of the mismatchbetween φ(aj) and yj and x is the vector of parameters thatdetermines φ The optimization problem is

minx

LD(x)

Uses of φ (1) prediction (2) feature selection (3) reveal datastructure (4) many others

Optimization Formulations Lecture 5 March 18 - 25 2020 3 33

Examples of labels yj include the following

(1) A real number leading to a regression problem

(2) A label say yj isin 1 2 M indicating that aj belongs to oneof M classes This is a classification problem

(3) Null (ie no labels) Unsupervised learning clustering(grouping) PCA

Other issues

(1) Noisy andor corrupted data robust φ required

(2) Parts of aj andor yj are not known how to treat

(3) Streaming data (NOT available all at once) online φ required

(4) Overfitting (bad) φ too sensitive to the particular sample D

Generalization or regularization come to rescue

Optimization Formulations Lecture 5 March 18 - 25 2020 4 33

2 Least squares

The data points (aj yj) lie in Rn times R and we solve

minx

983099983103

9831011

2m

m983131

j=1

(aTj xminus yj)2 =

1

2m983042Axminus y98304222

983100983104

983102

where A is the matrix whose rows are aTj j = 1 m and

y =983045y1 y2 middot middot middot ym

983046T

The function φ is defined by

φ(a) = aTx

We could also introduce a nonzero intercept by adding an extraparameter β isin R and defining φ(a) = aTx+ β

Optimization Formulations Lecture 5 March 18 - 25 2020 5 33

Statistically when the observations yj are contaminated with iidGaussian noise the least squares solution x is the maximumlikelihood estimate

Impose desirable structure on x

(1) Tikhonov regularization with a squared ℓ2-norm

minx

1

2m983042Axminus y98304222 + λ983042x98304222 λ gt 0

yields a solution x with less sensitivity to perturbations in thedata (aj yj)

(2) LASSO formulation

minx

1

2m983042Axminus y98304222 + λ983042x9830421 λ gt 0

tends to yield solutions x that are sparse that is containingrelatively few nonzero components

Optimization Formulations Lecture 5 March 18 - 25 2020 6 33

ℓ1 norm promotes sparsity compared with ℓ2 norm

LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features

Optimization Formulations Lecture 5 March 18 - 25 2020 7 33

3 Matrix completion

Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2

where〈AB〉F = tr(ATB)

A regularized version leading to solutions X that are low-rank is

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0

where 983042X983042lowast is the nuclear norm (the sum of singular values of X)

Optimization Formulations Lecture 5 March 18 - 25 2020 8 33

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 3: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

The problem of identifying φ usually is a data-fitting problem

Find the parameters x defining φ such that φ(aj) asymp yj j = 1 min some optimal sense Once we come up with a definition of theterm optimal we have an optimization problem

Many such optimization formulations have objective functions ofthe ldquosummationrdquo type

LD(x) =

m983131

j=1

ℓ(aj yj x)

where the jth term ℓ(aj yj x) is a measure of the mismatchbetween φ(aj) and yj and x is the vector of parameters thatdetermines φ The optimization problem is

minx

LD(x)

Uses of φ (1) prediction (2) feature selection (3) reveal datastructure (4) many others

Optimization Formulations Lecture 5 March 18 - 25 2020 3 33

Examples of labels yj include the following

(1) A real number leading to a regression problem

(2) A label say yj isin 1 2 M indicating that aj belongs to oneof M classes This is a classification problem

(3) Null (ie no labels) Unsupervised learning clustering(grouping) PCA

Other issues

(1) Noisy andor corrupted data robust φ required

(2) Parts of aj andor yj are not known how to treat

(3) Streaming data (NOT available all at once) online φ required

(4) Overfitting (bad) φ too sensitive to the particular sample D

Generalization or regularization come to rescue

Optimization Formulations Lecture 5 March 18 - 25 2020 4 33

2 Least squares

The data points (aj yj) lie in Rn times R and we solve

minx

983099983103

9831011

2m

m983131

j=1

(aTj xminus yj)2 =

1

2m983042Axminus y98304222

983100983104

983102

where A is the matrix whose rows are aTj j = 1 m and

y =983045y1 y2 middot middot middot ym

983046T

The function φ is defined by

φ(a) = aTx

We could also introduce a nonzero intercept by adding an extraparameter β isin R and defining φ(a) = aTx+ β

Optimization Formulations Lecture 5 March 18 - 25 2020 5 33

Statistically when the observations yj are contaminated with iidGaussian noise the least squares solution x is the maximumlikelihood estimate

Impose desirable structure on x

(1) Tikhonov regularization with a squared ℓ2-norm

minx

1

2m983042Axminus y98304222 + λ983042x98304222 λ gt 0

yields a solution x with less sensitivity to perturbations in thedata (aj yj)

(2) LASSO formulation

minx

1

2m983042Axminus y98304222 + λ983042x9830421 λ gt 0

tends to yield solutions x that are sparse that is containingrelatively few nonzero components

Optimization Formulations Lecture 5 March 18 - 25 2020 6 33

ℓ1 norm promotes sparsity compared with ℓ2 norm

LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features

Optimization Formulations Lecture 5 March 18 - 25 2020 7 33

3 Matrix completion

Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2

where〈AB〉F = tr(ATB)

A regularized version leading to solutions X that are low-rank is

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0

where 983042X983042lowast is the nuclear norm (the sum of singular values of X)

Optimization Formulations Lecture 5 March 18 - 25 2020 8 33

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression, the labels y_j are vectors in R^M whose elements are defined as follows:

y_{jk} = 1 when a_j belongs to class k, and y_{jk} = 0 otherwise.

We seek to define the vectors x_[k] so that

p_k(a_j; X) ≈ 1 when y_{jk} = 1;

p_k(a_j; X) ≈ 0 when y_{jk} = 0.

The problem of finding values of x_[k] that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

L(X) = (1/m) ∑_{j=1}^m [ ∑_{ℓ=1}^M y_{jℓ} (x_[ℓ]^T a_j) − log ( ∑_{ℓ=1}^M exp(x_[ℓ]^T a_j) ) ]
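The sketch below evaluates L(X) with numpy on synthetic one-hot labels; the stabilized log-sum-exp mirrors the log term in the formula. It is an illustration only.

```python
import numpy as np

def multiclass_log_likelihood(X, A, Y):
    # L(X) = (1/m) sum_j [ sum_l Y_jl (x_[l]^T a_j) - log sum_l exp(x_[l]^T a_j) ]
    # A is m x n (rows a_j^T), X is M x n (rows x_[k]), Y is m x M one-hot.
    scores = A @ X.T                                   # m x M, entries x_[l]^T a_j
    mx = scores.max(axis=1, keepdims=True)
    lse = np.log(np.exp(scores - mx).sum(axis=1)) + mx[:, 0]   # stable log-sum-exp
    return np.mean(np.sum(Y * scores, axis=1) - lse)

rng = np.random.default_rng(6)
m, n, M = 30, 5, 3
A = rng.normal(size=(m, n))
labels = rng.integers(0, M, size=m)
Y = np.eye(M)[labels]                                  # one-hot label vectors y_j
X = rng.normal(size=(M, n))
print(multiclass_log_likelihood(X, A, Y))
```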

Group-sparse regularization terms can be added to select the same small subset of features across all M classifiers.

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector a into one of M possible classes, where M ≥ 2 is large in some key applications.

The difference is that the data vector a undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector a_j enters at the bottom of the network, each node in the bottom layer corresponding to one component of a_j.

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.

A typical form of this transformation, which converts the vector a_j^{l−1} at layer l − 1 to the vector a_j^l at layer l, is

a_j^l = σ(W^l a_j^{l−1} + g^l), l = 1, 2, ..., D,

where W^l is a matrix of dimension |a_j^l| × |a_j^{l−1}| and g^l is a vector of length |a_j^l|; σ is a componentwise nonlinear transformation; and D is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix W^l. Define a_j^0 to be the "raw" input vector a_j, and let a_j^D be the vector formed by the nodes at the topmost hidden layer.
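A minimal numpy forward pass matching the recursion above, assuming ReLU for σ and arbitrary layer sizes; the weights are random stand-ins, not trained parameters.

```python
import numpy as np

def forward(a0, Ws, gs, sigma=lambda t: np.maximum(t, 0.0)):
    # Propagate a_j^0 upward: a^l = sigma(W^l a^{l-1} + g^l), l = 1, ..., D.
    a = a0
    for W, g in zip(Ws, gs):
        a = sigma(W @ a + g)
    return a                          # this is a_j^D

rng = np.random.default_rng(7)
layer_sizes = [8, 16, 16, 10]         # |a^0|, |a^1|, |a^2|, |a^3|  (D = 3)
Ws = [rng.normal(size=(layer_sizes[l + 1], layer_sizes[l])) * 0.1
      for l in range(len(layer_sizes) - 1)]
gs = [np.zeros(layer_sizes[l + 1]) for l in range(len(layer_sizes) - 1)]
aD = forward(rng.normal(size=layer_sizes[0]), Ws, gs)
print(aD.shape)                       # (10,)
```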

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following, acting identically on each component t ∈ R of its input vector:

(1) Logistic function: t ↦ 1/(1 + e^{−t});

(2) Rectified Linear Unit (ReLU): t ↦ max(t, 0);

(3) Bernoulli: a random function that outputs 1 with probability 1/(1 + e^{−t}) and 0 otherwise.

Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs (W^l, g^l), l = 1, 2, ..., D, that transform the input vector a_j into its form a_j^D at the topmost hidden layer, together with the parameters X of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data.

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden-layer transformations, that is,

w = (W^1, g^1, W^2, g^2, ..., W^D, g^D),

and defining X = {x_[k] | k = 1, 2, ..., M}, we can write the loss function for deep learning as follows:

L(w, X) = (1/m) ∑_{j=1}^m [ ∑_{ℓ=1}^M y_{jℓ} (x_[ℓ]^T a_j^D(w)) − log ( ∑_{ℓ=1}^M exp(x_[ℓ]^T a_j^D(w)) ) ]

We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, ..., m.
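A compact sketch tying the pieces together: a forward pass through the hidden layers followed by the multiclass log-likelihood, evaluated here with no hidden layers (D = 0) so that it reduces to the multiclass logistic regression loss. All data and parameters are synthetic, and ReLU is an arbitrary choice for σ.

```python
import numpy as np

def deep_loss(Ws, gs, X, A, Y, sigma=lambda t: np.maximum(t, 0.0)):
    # L(w, X): forward pass to a_j^D(w), then the multiclass log-likelihood.
    aD = A
    for W, g in zip(Ws, gs):                      # w = (W^1, g^1, ..., W^D, g^D)
        aD = sigma(aD @ W.T + g)
    scores = aD @ X.T                             # entries x_[l]^T a_j^D(w)
    mx = scores.max(axis=1, keepdims=True)
    lse = np.log(np.exp(scores - mx).sum(axis=1)) + mx[:, 0]
    return np.mean((Y * scores).sum(axis=1) - lse)

rng = np.random.default_rng(8)
m, n, M = 20, 6, 3
A = rng.normal(size=(m, n))
Y = np.eye(M)[rng.integers(0, M, size=m)]
X = rng.normal(size=(M, n))
# D = 0 (w null): deep_loss reduces to the multiclass logistic regression loss.
print(deep_loss([], [], X, A, Y))
```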

The "landscape" of L is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in (w, X) is usually very large.

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 4: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

Examples of labels yj include the following

(1) A real number leading to a regression problem

(2) A label say yj isin 1 2 M indicating that aj belongs to oneof M classes This is a classification problem

(3) Null (ie no labels) Unsupervised learning clustering(grouping) PCA

Other issues

(1) Noisy andor corrupted data robust φ required

(2) Parts of aj andor yj are not known how to treat

(3) Streaming data (NOT available all at once) online φ required

(4) Overfitting (bad) φ too sensitive to the particular sample D

Generalization or regularization come to rescue

Optimization Formulations Lecture 5 March 18 - 25 2020 4 33

2 Least squares

The data points (aj yj) lie in Rn times R and we solve

minx

983099983103

9831011

2m

m983131

j=1

(aTj xminus yj)2 =

1

2m983042Axminus y98304222

983100983104

983102

where A is the matrix whose rows are aTj j = 1 m and

y =983045y1 y2 middot middot middot ym

983046T

The function φ is defined by

φ(a) = aTx

We could also introduce a nonzero intercept by adding an extraparameter β isin R and defining φ(a) = aTx+ β

Optimization Formulations Lecture 5 March 18 - 25 2020 5 33

Statistically when the observations yj are contaminated with iidGaussian noise the least squares solution x is the maximumlikelihood estimate

Impose desirable structure on x

(1) Tikhonov regularization with a squared ℓ2-norm

minx

1

2m983042Axminus y98304222 + λ983042x98304222 λ gt 0

yields a solution x with less sensitivity to perturbations in thedata (aj yj)

(2) LASSO formulation

minx

1

2m983042Axminus y98304222 + λ983042x9830421 λ gt 0

tends to yield solutions x that are sparse that is containingrelatively few nonzero components

Optimization Formulations Lecture 5 March 18 - 25 2020 6 33

ℓ1 norm promotes sparsity compared with ℓ2 norm

LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features

Optimization Formulations Lecture 5 March 18 - 25 2020 7 33

3 Matrix completion

Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2

where〈AB〉F = tr(ATB)

A regularized version leading to solutions X that are low-rank is

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0

where 983042X983042lowast is the nuclear norm (the sum of singular values of X)

Optimization Formulations Lecture 5 March 18 - 25 2020 8 33

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 5: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

2 Least squares

The data points (aj yj) lie in Rn times R and we solve

minx

983099983103

9831011

2m

m983131

j=1

(aTj xminus yj)2 =

1

2m983042Axminus y98304222

983100983104

983102

where A is the matrix whose rows are aTj j = 1 m and

y =983045y1 y2 middot middot middot ym

983046T

The function φ is defined by

φ(a) = aTx

We could also introduce a nonzero intercept by adding an extraparameter β isin R and defining φ(a) = aTx+ β

Optimization Formulations Lecture 5 March 18 - 25 2020 5 33

Statistically when the observations yj are contaminated with iidGaussian noise the least squares solution x is the maximumlikelihood estimate

Impose desirable structure on x

(1) Tikhonov regularization with a squared ℓ2-norm

minx

1

2m983042Axminus y98304222 + λ983042x98304222 λ gt 0

yields a solution x with less sensitivity to perturbations in thedata (aj yj)

(2) LASSO formulation

minx

1

2m983042Axminus y98304222 + λ983042x9830421 λ gt 0

tends to yield solutions x that are sparse that is containingrelatively few nonzero components

Optimization Formulations Lecture 5 March 18 - 25 2020 6 33

ℓ1 norm promotes sparsity compared with ℓ2 norm

LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features

Optimization Formulations Lecture 5 March 18 - 25 2020 7 33

3 Matrix completion

Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2

where〈AB〉F = tr(ATB)

A regularized version leading to solutions X that are low-rank is

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0

where 983042X983042lowast is the nuclear norm (the sum of singular values of X)

Optimization Formulations Lecture 5 March 18 - 25 2020 8 33

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 6: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

Statistically when the observations yj are contaminated with iidGaussian noise the least squares solution x is the maximumlikelihood estimate

Impose desirable structure on x

(1) Tikhonov regularization with a squared ℓ2-norm

minx

1

2m983042Axminus y98304222 + λ983042x98304222 λ gt 0

yields a solution x with less sensitivity to perturbations in thedata (aj yj)

(2) LASSO formulation

minx

1

2m983042Axminus y98304222 + λ983042x9830421 λ gt 0

tends to yield solutions x that are sparse that is containingrelatively few nonzero components

Optimization Formulations Lecture 5 March 18 - 25 2020 6 33

ℓ1 norm promotes sparsity compared with ℓ2 norm

LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features

Optimization Formulations Lecture 5 March 18 - 25 2020 7 33

3 Matrix completion

Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2

where〈AB〉F = tr(ATB)

A regularized version leading to solutions X that are low-rank is

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0

where 983042X983042lowast is the nuclear norm (the sum of singular values of X)

Optimization Formulations Lecture 5 March 18 - 25 2020 8 33

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.

A typical form of this transformation, which converts the vector a_j^{l-1} at layer l-1 to the input vector a_j^l at layer l, is

a_j^l = σ(W^l a_j^{l-1} + g^l), l = 1, 2, ..., D,

where W^l is a matrix of dimension |a_j^l| × |a_j^{l-1}| and g^l is a vector of length |a_j^l|; σ is a componentwise nonlinear transformation; and D is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix W^l. Define a_j^0 to be the "raw" input vector a_j, and let a_j^D be the vector formed by the nodes at the topmost hidden layer.

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following, acting identically on each component t ∈ R of its input vector:

(1) Logistic function: t ↦ 1/(1 + e^{-t});

(2) Rectified Linear Unit (ReLU): t ↦ max(t, 0);

(3) Bernoulli: a random function that outputs 1 with probability 1/(1 + e^{-t}) and 0 otherwise.

Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs (W^l, g^l), l = 1, 2, ..., D, that transform the input vector a_j into its form a_j^D at the topmost hidden layer, together with the parameters X of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
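A minimal sketch of the forward pass just described (assuming numpy and ReLU activations; the layer sizes and the random parameters are purely illustrative), computing a_j^D from a_j^0 = a_j:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(a, weights, sigma=relu):
    """Propagate an input vector a = a^0_j through D hidden layers:
    a^l = sigma(W^l a^{l-1} + g^l), l = 1, ..., D, returning a^D."""
    for W, g in weights:
        a = sigma(W @ a + g)
    return a

# layer sizes 8 -> 5 -> 5 -> 3 (D = 3 hidden layers), with random parameters
rng = np.random.default_rng(5)
sizes = [8, 5, 5, 3]
weights = [(rng.standard_normal((sizes[l + 1], sizes[l])),
            rng.standard_normal(sizes[l + 1]))
           for l in range(len(sizes) - 1)]
a0 = rng.standard_normal(8)
aD = forward(a0, weights)
print(aD.shape)   # (3,) -- the vector a^D_j fed to the multiclass logistic regression stage
```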

Using the notation w for the hidden layer transformations, that is,

w = (W^1, g^1, W^2, g^2, ..., W^D, g^D),

and defining X = {x[k] | k = 1, 2, ..., M}, we can write the objective for deep learning (a log-likelihood, to be maximized) as follows:

L(w, X) = (1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x[ℓ]^T a_j^D(w)) - log ( Σ_{ℓ=1}^M exp(x[ℓ]^T a_j^D(w)) ) ].

We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, ..., m.
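To make this special case concrete, the sketch below (assuming numpy; self-contained, with ReLU hidden layers) evaluates L(w, X) and shows that with an empty list of hidden-layer parameters (D = 0) it is exactly the multiclass logistic regression objective of the previous section:

```python
import numpy as np

def forward(a, w):
    """a^D_j(w): propagate a through the hidden layers in w = [(W^1, g^1), ..., (W^D, g^D)]."""
    for W, g in w:
        a = np.maximum(W @ a + g, 0.0)            # ReLU layers
    return a

def deep_objective(w, X, A, Y):
    """L(w, X) as defined above; columns of X are the x[k], rows of A the a_j, Y is one-hot."""
    total = 0.0
    for a, yvec in zip(A, Y):
        s = X.T @ forward(a, w)                   # scores x[k]^T a^D_j(w)
        total += yvec @ s - (s.max() + np.log(np.exp(s - s.max()).sum()))
    return total / len(A)

rng = np.random.default_rng(6)
m, n, M = 20, 5, 3
A = rng.standard_normal((m, n))
Y = np.eye(M)[rng.integers(0, M, size=m)]
X = rng.standard_normal((n, M))

# with no hidden layers (D = 0, w empty), a^D_j = a_j and L(w, X) coincides with the
# multiclass logistic regression log-likelihood
print(deep_objective([], X, A, Y))
```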

The "landscape" of L is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in (w, X) is usually very large.

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33


10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 8: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

3 Matrix completion

Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2

where〈AB〉F = tr(ATB)

A regularized version leading to solutions X that are low-rank is

minX

1

2m

m983131

j=1

(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0

where 983042X983042lowast is the nuclear norm (the sum of singular values of X)

Optimization Formulations Lecture 5 March 18 - 25 2020 8 33

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 9: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve

minLR

1

2m

m983131

j=1

(〈Aj LRT〉F minus yj)

2

In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R

The objective function is nonconvex

Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np

Minimum rank version

minX

rank(X) st 〈Aj X〉F = yj j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 9 33

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 10: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

4 Nonnegative matrix factorization

Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative

If the full matrix Y isin Rntimesp is observed this problem has the form

minLR

983042LRT minusY9830422F subject to L ge 0 R ge 0

5 Sparse inverse covariance estimation

In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean

The sample covariance matrix constructed from these observationsis

S =1

mminus 1

m983131

j=1

ajaTj

Optimization Formulations Lecture 5 March 18 - 25 2020 10 33

The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a

Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse

The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)

One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following

minXisinSntimesnX≻0

〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0

where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =

983123nil=1 |Xil|

Optimization Formulations Lecture 5 March 18 - 25 2020 11 33

6 Sparse principal components

We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector

The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues

It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros

An explicit optimization formulation of this problem is

maxvisinRn

vTSv st 983042v9830422 = 1 983042v9830420 le k

where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v

Optimization Formulations Lecture 5 March 18 - 25 2020 12 33

A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn

maxMisinSntimesn

〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ

for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem

More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros

Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach

Optimization Formulations Lecture 5 March 18 - 25 2020 13 33

The optimization formulation is

maxVisinRntimesr

〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r

We can write a convex relaxation of this problem once again asemidefinite program as

maxMisinSntimesn

〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ

A more compact (but nonconvex) formulation is

maxFisinRntimesr

〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R

where 983042F98304221 =983123n

i=1 983042Fi9830422

Optimization Formulations Lecture 5 March 18 - 25 2020 14 33

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed $n\times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully-observed problem is
$$\min_{M, S} \ \|M\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad Y = M + S,$$
where $\|S\|_1 = \sum_{i,j} |S_{ij}|$.

Compact nonconvex formulations that allow noise in the observations include the following ($L \in \mathbb{R}^{n\times r}$, $R \in \mathbb{R}^{p\times r}$, $S \in \mathcal{S}$). Fully observed:
$$\min_{L, R, S} \ \frac{1}{2}\|LR^T + S - Y\|_F^2.$$


Partially observed:
$$\min_{L, R, S} \ \frac{1}{2}\|P_\Phi(LR^T + S - Y)\|_F^2,$$
where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.

One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents “outlier” observations.

Another application is to foreground-background separation in video processing. Here each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.

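The sketch below is a simple alternating heuristic for a noise-tolerant variant of this decomposition, minimizing $\frac{1}{2}\|Y - M - S\|_F^2 + \lambda\|S\|_1$ subject to $\mathrm{rank}(M) \le r$; each block update is exact (truncated SVD for $M$, entrywise soft-thresholding for $S$). The rank $r$ and weight $\lambda$ are user-chosen assumptions, not part of the formulations above.

```python
# Alternating minimization sketch for a noisy sparse-plus-low-rank variant:
# M-step = best rank-r approximation of Y - S (Eckart-Young),
# S-step = entrywise soft-thresholding of Y - M at level lam.
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_plus_lowrank(Y, r, lam, iters=50):
    M = np.zeros_like(Y)
    S = np.zeros_like(Y)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(Y - S, full_matrices=False)
        M = (U[:, :r] * sig[:r]) @ Vt[:r, :]   # exact minimizer over rank-r M
        S = soft_threshold(Y - M, lam)          # exact minimizer over S
    return M, S

# Example: a rank-2 matrix corrupted by a few large sparse errors
rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
Y[rng.random(Y.shape) < 0.05] += 10.0
M, S = sparse_plus_lowrank(Y, r=2, lam=0.5)
```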

8 Subspace identification

In this application the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.

The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n\times r}$.

If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n\times m$ matrix $A = [a_j]_{j=1}^{m}$ and take $X$ to be the leading $r$ left singular vectors.

In interesting variants of this problem, however, the vectors $a_j$ may be arriving in streaming fashion and may be only partly observed, for example in indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that
$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m.$$

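A minimal sketch of a streaming update in this spirit (loosely following methods such as GROUSE, with an illustrative step size and rank): for each partially observed vector, fit $s_j$ by least squares on the observed indices, then nudge $X$ along the residual and re-orthonormalize.

```python
# Streaming subspace identification sketch: one pass over partially observed
# vectors. masks[j] holds the observed indices Phi_j; eta is a step size.
import numpy as np

def streaming_subspace(vectors, masks, n, r, eta=0.1):
    rng = np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((n, r)))       # initial orthonormal basis
    for a, phi in zip(vectors, masks):
        s, *_ = np.linalg.lstsq(X[phi, :], a[phi], rcond=None)
        resid = np.zeros(n)
        resid[phi] = a[phi] - X[phi, :] @ s                 # residual on observed entries
        X = X + eta * np.outer(resid, s)                    # gradient-style update
        X, _ = np.linalg.qr(X)                              # keep columns orthonormal
    return X

# Example: 40 streaming vectors from a random 3-dimensional subspace of R^10,
# each with only 5 of its 10 entries observed
true_X = np.linalg.qr(np.random.default_rng(1).standard_normal((10, 3)))[0]
coeffs = np.random.default_rng(2).standard_normal((40, 3))
vecs = [true_X @ c for c in coeffs]
masks = [np.sort(np.random.default_rng(3 + i).choice(10, size=5, replace=False)) for i in range(40)]
X_est = streaming_subspace(vecs, masks, n=10, r=3)
```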

9 Support vector machines

Classification via support vector machines (SVM) is a classical paradigm in machine learning.

This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that
$$a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$a_j^T x - \beta \le -1 \quad \text{when } y_j = -1.$$

Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$ that separates the “positive” cases $\{a_j \mid y_j = 1\}$ from the “negative” cases $\{a_j \mid y_j = -1\}$.

Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the hyperplanes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)


We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form
$$H(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), 0\} \ \ge\ 0.$$

Note that the $j$th term in this summation is zero if the conditions on the previous page are satisfied, and positive otherwise; $\min_{x,\beta} H(x, \beta) = 0$ means that a separating hyperplane exists.

Regularized version:
$$H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), 0\} + \frac{\lambda}{2}\|x\|_2^2.$$

If $\lambda$ is sufficiently small (but positive) and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane $\left(\lim_{x \to x^0} \frac{H(x,\beta) - H(x^0,\beta^0)}{\|x^0\|_2^2 - \|x\|_2^2}\right)$.


The maximum-margin property is consistent with the goals of generalizability and robustness.

Linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).


The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then
$$\min_{x, \beta, s} \ \frac{1}{m}\mathbf{1}^T s + \frac{\lambda}{2}\|x\|_2^2$$
subject to
$$s_j \ge 1 - y_j(a_j^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$
where $\mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^m$.
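A minimal sketch (assuming the cvxpy modeling package, with synthetic data): one can also hand $H_\lambda$ directly to a modeling tool, which reformulates the hinge terms with slack variables much as the explicit quadratic program above does. All names and the value of lam are illustrative.

```python
# Minimize H_lambda(x, beta) = (1/m) * sum_j max{1 - y_j(a_j^T x - beta), 0}
#                              + (lambda/2) * ||x||_2^2  via cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, lam = 100, 5, 0.01
A = rng.standard_normal((m, n))
y = np.sign(A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m))

x = cp.Variable(n)
beta = cp.Variable()
hinge = cp.sum(cp.pos(1 - cp.multiply(y, A @ x - beta))) / m
prob = cp.Problem(cp.Minimize(hinge + (lam / 2) * cp.sum_squares(x)))
prob.solve()
print(x.value, beta.value)
```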

Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?


One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. Then the conditions
$$\zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$\zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1$$
lead to
$$H_{\zeta,\lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(\zeta(a_j)^T x - \beta), 0\} + \frac{\lambda}{2}\|x\|_2^2.$$

When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.


The problem of minimizing $H_{\zeta,\lambda}(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then
$$\min_{x, \beta, s} \ \frac{1}{m}\mathbf{1}^T s + \frac{\lambda}{2}\|x\|_2^2$$
subject to
$$s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$
where $\mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^m$.

The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (Convex Quadratic Functions).


The dual problem, in $m$ variables:
$$\min_{z \in \mathbb{R}^m} \ \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda}\mathbf{1}, \ \ y^T z = 0,$$
where
$$Q_{kl} = y_k y_l \, \zeta(a_k)^T \zeta(a_l).$$

Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called “kernel trick”.

A particularly popular choice of kernel is the Gaussian kernel
$$K(a_k, a_l) = \exp\left(-\|a_k - a_l\|^2/(2\sigma)\right),$$
where $\sigma$ is a positive parameter.

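A minimal sketch (assuming scikit-learn): the dual kernel formulation is what standard SVM solvers implement. sklearn's RBF kernel is $\exp(-\gamma\|a_k - a_l\|^2)$, so $\gamma$ corresponds to $1/(2\sigma)$ in the notation above, and its parameter C plays the role of $1/(m\lambda)$. The data below is a synthetic pattern that no hyperplane can separate.

```python
# Kernel SVM classification with the Gaussian (RBF) kernel via scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2))
y = np.where(A[:, 0] * A[:, 1] > 0, 1, -1)   # XOR-like labels, not linearly separable

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(A, y)
print(clf.score(A, y))                       # training accuracy of the kernel classifier
```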

10 Logistic regression

We seek an “odds function” $p \in (0, 1)$ parametrized by a vector $x \in \mathbb{R}^n$ as follows:
$$p(a; x) = (1 + \exp(a^T x))^{-1},$$
and aim to choose the parameter $x$ so that
$$p(a_j; x) \approx 1 \quad \text{when } y_j = 1,$$
$$p(a_j; x) \approx 0 \quad \text{when } y_j = -1.$$

The optimal value of $x$ can be found by maximizing a log-likelihood function:
$$L(x) = \frac{1}{m}\left( \sum_{j:\, y_j = -1} \log(1 - p(a_j; x)) + \sum_{j:\, y_j = 1} \log p(a_j; x) \right).$$


We can perform feature selection using this model by introducing a regularizer:
$$\max_{x} \ \frac{1}{m}\left( \sum_{j:\, y_j = -1} \log(1 - p(a_j; x)) + \sum_{j:\, y_j = 1} \log p(a_j; x) \right) - \lambda\|x\|_1,$$
where $\lambda > 0$ is a regularization parameter.

The regularization term $\lambda\|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
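A minimal sketch (assuming scikit-learn, with synthetic data generated from the odds model above): an $\ell_1$-penalized solver recovers a sparse coefficient vector, and its nonzero pattern identifies the selected features. sklearn's parameter C plays roughly the role of $1/(m\lambda)$.

```python
# l1-regularized logistic regression for feature selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -3.0, 1.5]     # only 3 informative features
y = np.where(rng.random(200) < 1 / (1 + np.exp(A @ x_true)), 1, -1)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(A, y)
print(np.nonzero(clf.coef_.ravel())[0])                  # indices of selected features
```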

Multiclass (or multinomial) logistic regression: the data vectors $a_j$ belong to more than two classes. Assume there are $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.


These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^{M} \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.

Note that for all $a$ and for all $k$ we have
$$p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^{M} p_k(a; X) = 1.$$

If one of these inner products $\{a^T x_{[\ell]}\}_{\ell=1}^{M}$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then
$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \ \text{ for } \ell \ne k.$$


In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$

We seek to define the vectors $x_{[k]}$ so that
$$p_k(a_j; X) \approx 1 \quad \text{when } y_{jk} = 1,$$
$$p_k(a_j; X) \approx 0 \quad \text{when } y_{jk} = 0.$$

The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
$$L(X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell}\left(x_{[\ell]}^T a_j\right) - \log\left( \sum_{\ell=1}^{M} \exp\left(x_{[\ell]}^T a_j\right) \right) \right].$$

Group-sparse regularization terms can also be included here.

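A small NumPy sketch of evaluating $L(X)$ with the usual log-sum-exp stabilization; the matrices A (rows $a_j^T$), Y (one-hot labels $y_j$), and Xmat (columns $x_{[k]}$) are illustrative placeholders.

```python
# Evaluate the multiclass log-likelihood L(X) for random data.
import numpy as np

def multiclass_loglik(Xmat, A, Y):
    Z = A @ Xmat                                   # Z[j, l] = x_[l]^T a_j
    zmax = Z.max(axis=1, keepdims=True)
    logsumexp = zmax.ravel() + np.log(np.exp(Z - zmax).sum(axis=1))
    return np.mean((Y * Z).sum(axis=1) - logsumexp)

rng = np.random.default_rng(0)
m, n, M = 50, 10, 4
A = rng.standard_normal((m, n))
Y = np.eye(M)[rng.integers(0, M, size=m)]          # one-hot label vectors y_j
Xmat = rng.standard_normal((n, M))
print(multiclass_loglik(Xmat, A, Y))
```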

11 Deep learning

Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.

The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.

A wonderful reference:
Catherine F. Higham and Desmond J. Higham, “Deep Learning: An Introduction for Applied Mathematicians,” SIAM Review, 2019.

The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.


Deep neural network showing connections between adjacent layers


The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.

A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the input vector $a_j^l$ at layer $l$, is
$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$
where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g^l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation; and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the “raw” input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.


Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:

(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;

(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;

(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.

Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.

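A minimal NumPy sketch of the forward pass: the layer recursion $a^l = \sigma(W^l a^{l-1} + g^l)$ with the ReLU activation, followed by the softmax stage parametrized by $X$ at the top. All dimensions and random parameters are illustrative placeholders.

```python
# Forward pass through D = 2 hidden layers, then the multiclass logistic stage.
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(a0, weights, biases, X):
    a = a0
    for W, g in zip(weights, biases):          # hidden layers l = 1, ..., D
        a = relu(W @ a + g)
    scores = X.T @ a                           # inner products x_[k]^T a^D
    p = np.exp(scores - scores.max())
    return p / p.sum()                         # class odds p_k(a^D; X)

rng = np.random.default_rng(0)
dims = [8, 16, 16]                             # input width and two hidden layer widths
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(2)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(2)]
X = rng.standard_normal((dims[-1], 4))         # softmax parameters for M = 4 classes
print(forward(rng.standard_normal(dims[0]), weights, biases, X))
```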

Using the notation $w$ for the hidden layer transformations, that is,
$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$
and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:
$$L(w, X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell}\left(x_{[\ell]}^T a_j^D(w)\right) - \log\left( \sum_{\ell=1}^{M} \exp\left(x_{[\ell]}^T a_j^D(w)\right) \right) \right].$$

We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.

The “landscape” of $L$ is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in $(w, X)$ is usually very large.


983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 15: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is

minMS

983042M983042lowast + λ983042S9830421 st Y = M+ S

where983042S9830421 =

983131

ij

|Sij |

Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed

minLRS

1

2983042LRT + SminusY9830422F

Optimization Formulations Lecture 5 March 18 - 25 2020 15 33

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 16: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

Partially observed

minLRS

1

2983042PΦ(LR

T + SminusY)9830422F

where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set

One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations

Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time

Optimization Formulations Lecture 5 March 18 - 25 2020 16 33

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 17: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

8 Subspace identification

In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace

The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr

If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =

983045aj983046mj=1

and take X to be the leading r right singular vectors

In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that

PΦj (aj minusXsj) asymp 0 j = 1 m

Optimization Formulations Lecture 5 March 18 - 25 2020 17 33

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 18: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

9 Support vector machines

Classification via support vector machines (SVM) is a classicalparadigm in machine learning

This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat

aTj xminus β ge 1 when yj = 1

aTj xminus β le minus1 when yj = minus1

Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)

Optimization Formulations Lecture 5 March 18 - 25 2020 18 33

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs (W_l, g_l), l = 1, 2, ..., D, that transform the input vector a_j into its form a_j^D at the topmost hidden layer, together with the parameters X of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.


Using the notation w for the hidden layer transformations, that is,

w = (W_1, g_1, W_2, g_2, ..., W_D, g_D),

and defining X = {x_{[k]} | k = 1, 2, ..., M}, we can write the loss function for deep learning as follows:

L(w, X) = (1/m) ∑_{j=1}^{m} [ ∑_{ℓ=1}^{M} y_{jℓ} (x_{[ℓ]}ᵀ a_j^D(w)) − log ( ∑_{ℓ=1}^{M} exp(x_{[ℓ]}ᵀ a_j^D(w)) ) ].
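Reusing the forward and multiclass_log_likelihood sketches given earlier in this section, the deep learning loss can be assembled as below; the data layout (rows a_jᵀ in A, one-hot rows in Y, columns x_{[k]} in X) is again an assumption, not something fixed by the lecture.

```python
import numpy as np

def deep_learning_loss(A, Y, weights, biases, X, sigma=relu):
    """L(w, X): map each raw input a_j to a_j^D(w) with the hidden layers,
    then evaluate the multiclass log-likelihood on the transformed vectors."""
    AD = np.stack([forward(a, weights, biases, sigma) for a in A])  # rows: a_j^D(w)^T
    return multiclass_log_likelihood(AD, Y, X)
```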

We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, ..., m.

The "landscape" of L is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in (w, X) is usually very large.

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 19: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form

H(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0 ge 0

Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min

xβH(xβ) = 0

means the existence of a separating hyperplane

Regularized version

Hλ(xβ) =1

m

m983131

j=1

max1minus yj(aTj xminus β) 0+ λ

2983042x98304222

If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the

maximum-margin separating hyperplane limxrarrx0

H(xβ)minusH(x0β0)983042x098304222minus983042x98304222

Optimization Formulations Lecture 5 March 18 - 25 2020 19 33

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 20: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

The maximum-margin property is consistent with the goals ofgeneralizability and robustness

Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)

Optimization Formulations Lecture 5 March 18 - 25 2020 20 33

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 21: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case

Optimization Formulations Lecture 5 March 18 - 25 2020 21 33

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 22: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions

ζ(aj)Txminus β ge 1 when yj = 1

ζ(aj)Txminus β le minus1 when yj = minus1

lead to

Hζλ(xβ) =1

m

m983131

j=1

max1minus yj(ζ(aj)Txminus β) 0+ λ

2983042x98304222

When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)

Optimization Formulations Lecture 5 March 18 - 25 2020 22 33

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 23: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then

minxβs

1

m1Ts+

λ

2983042x98304222

subject to

sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m

where 1 =9830451 1 middot middot middot 1

983046T isin Rm

The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)

Optimization Formulations Lecture 5 March 18 - 25 2020 23 33

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 24: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

The dual problem in m variables

minzisinRm

1

2zTQzminus 1Tz subject to 0 le z le 1

mλ yTz = 0

whereQkl = ykylζ(ak)

Tζ(al)

Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)

Tζ(al) This is the so-called ldquokernel trickrdquo

A particularly popular choice of kernel is the Gaussian kernel

K(akal) = exp(minus983042ak minus al9830422(2σ))

where σ is a positive parameter

Optimization Formulations Lecture 5 March 18 - 25 2020 24 33

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 25: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

10 Logistic regression

We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows

p(ax) = (1 + exp(aTx))minus1

and aim to choose the parameter x so that

p(aj x) asymp 1 when yj = 1

p(aj x) asymp 0 when yj = minus1

The optimal value of x can be found by maximizing alog-likelihood function

L(x) =1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108

Optimization Formulations Lecture 5 March 18 - 25 2020 25 33

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x

Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class

Optimization Formulations Lecture 5 March 18 - 25 2020 26 33

These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows

pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a

Tx[ℓ]) k = 1 M

whereX = x[k]|k = 1 M

Note that for all a and for all k we have

pk(aX) isin (0 1)

M983131

k=1

pk(aX) = 1

If one of these inner products aTx[ℓ]Mℓ=1 dominates the others

that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then

pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k

Optimization Formulations Lecture 5 March 18 - 25 2020 27 33

In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows

yjk =

9830831 when aj belongs to calss k

0 otherwise

We seek to define the vectors x[k] so that

pk(aj X) asymp 1 when yjk = 1

pk(aj X) asymp 0 when yjk = 0

The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood

L(X) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]aj)minus log

983075M983131

ℓ=1

exp(xT[ℓ]aj)

983076983078

Group-sparse regularization terms

Optimization Formulations Lecture 5 March 18 - 25 2020 28 33

11 Deep learning

Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications

The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section

A wonderful reference

Deep Learning An Introduction for Applied Mathematicians

Catherine F Higham and Desmond J Higham

SIAM Review 2019

The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj

Optimization Formulations Lecture 5 March 18 - 25 2020 29 33

Deep neural network showing connections between adjacent layers

Optimization Formulations Lecture 5 March 18 - 25 2020 30 33

The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next

A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is

alj = σ(Wlalminus1j + gl) l = 1 2 D

where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector

of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers

Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector

aj and let aDj be the vector formed by the nodes at the topmosthidden layer

Optimization Formulations Lecture 5 March 18 - 25 2020 31 33

Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector

(1) Logistic function t 983041rarr 1(1 + eminust)

(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)

(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise

Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class

The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage

We aim to choose all these parameters so that the network does agood job on classifying the training data correctly

Optimization Formulations Lecture 5 March 18 - 25 2020 32 33

Using the notation w for the hidden layer transformations that is

w = (W1g1W2g2 WDgD)

and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows

L(wX) =1

m

m983131

j=1

983077M983131

ℓ=1

yjℓ(xT[ℓ]a

Dj (w))minus log

983075M983131

ℓ=1

exp(xT[ℓ]a

Dj (w))

983076983078

We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m

The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find

The total number of parameters in (wX) is usually very large

Optimization Formulations Lecture 5 March 18 - 25 2020 33 33

Page 26: Lecture 5: Optimization Formulations of Data Analysis Problemsmath.xmu.edu.cn/group/nona/damc/Lecture05.pdf · Lecture 5: Optimization Formulations of Data Analysis Problems March

We can perform feature selection using this model by introducinga regularizer

maxx

1

m

983091

983107983131

jyj=minus1

log(1minus p(aj x)) +983131

jyj=1

log p(aj x)

983092

983108minus λ983042x9830421

where λ gt 0 is a regularization parameter

The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
