Lecture 5: Optimization Formulations of Data Analysis Problems
March 18-25, 2020
Optimization Formulations Lecture 5 March 18 - 25 2020 1 33
1 Setup
Practical data sets are often extremely messy: data may be mislabeled, noisy, incomplete, or otherwise corrupted. Dasu and Johnson point out that "80% of data analysis is spent on the process of cleaning and preparing the data."
The data set $\mathcal{D}$ in a typical analysis problem consists of $m$ objects:
$$\mathcal{D} = \{(a_j, y_j),\ j = 1, 2, \dots, m\},$$
where $a_j$ is a vector (or matrix) of features and $y_j$ is a label or observation.
The analysis task then consists of discovering a function $\phi$ such that
$$\phi(a_j) \approx y_j$$
holds for most $j = 1, \dots, m$. The process of discovering the mapping $\phi$ is often called "learning" or "training."
The problem of identifying $\phi$ is usually a data-fitting problem: find the parameters $x$ defining $\phi$ such that $\phi(a_j) \approx y_j$, $j = 1, \dots, m$, in some optimal sense. Once we settle on a definition of the term "optimal," we have an optimization problem.
Many such optimization formulations have objective functions of the "summation" type
$$L_{\mathcal{D}}(x) = \sum_{j=1}^{m} \ell(a_j, y_j, x),$$
where the $j$th term $\ell(a_j, y_j, x)$ is a measure of the mismatch between $\phi(a_j)$ and $y_j$, and $x$ is the vector of parameters that determines $\phi$. The optimization problem is
$$\min_x\ L_{\mathcal{D}}(x).$$
Uses of $\phi$: (1) prediction; (2) feature selection; (3) revealing data structure; (4) many others.
Examples of labels $y_j$ include the following:
(1) A real number, leading to a regression problem.
(2) A label, say $y_j \in \{1, 2, \dots, M\}$, indicating that $a_j$ belongs to one of $M$ classes. This is a classification problem.
(3) Null (i.e., no labels). Unsupervised learning: clustering (grouping), PCA.
Other issues:
(1) Noisy and/or corrupted data: a robust $\phi$ is required.
(2) Parts of $a_j$ and/or $y_j$ are not known: how should these be treated?
(3) Streaming data (NOT available all at once): an online $\phi$ is required.
(4) Overfitting (bad): $\phi$ is too sensitive to the particular sample $\mathcal{D}$. Generalization or regularization comes to the rescue.
2 Least squares
The data points $(a_j, y_j)$ lie in $\mathbb{R}^n \times \mathbb{R}$, and we solve
$$\min_x \left\{ \frac{1}{2m} \sum_{j=1}^{m} (a_j^T x - y_j)^2 = \frac{1}{2m} \|Ax - y\|_2^2 \right\},$$
where $A$ is the matrix whose rows are $a_j^T$, $j = 1, \dots, m$, and $y = [y_1\ y_2\ \cdots\ y_m]^T$.
The function $\phi$ is defined by $\phi(a) = a^T x$. We could also introduce a nonzero intercept by adding an extra parameter $\beta \in \mathbb{R}$ and defining $\phi(a) = a^T x + \beta$.
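As a quick sketch (synthetic data, not from the lecture), the unregularized problem can be solved directly with NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
A = rng.normal(size=(m, n))              # rows are the feature vectors a_j^T
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true                           # noiseless observations, for illustration

# Minimizing (1/2m)||Ax - y||_2^2 is equivalent to solving the normal
# equations A^T A x = A^T y, which lstsq handles in a numerically stable way.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x_ls, x_true)
```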
Statistically, when the observations $y_j$ are contaminated with i.i.d. Gaussian noise, the least-squares solution $x$ is the maximum likelihood estimate.
Impose desirable structure on $x$:
(1) Tikhonov regularization, with a squared $\ell_2$-norm:
$$\min_x\ \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_2^2, \quad \lambda > 0,$$
yields a solution $x$ with less sensitivity to perturbations in the data $(a_j, y_j)$.
(2) LASSO formulation:
$$\min_x\ \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_1, \quad \lambda > 0,$$
tends to yield solutions $x$ that are sparse, that is, containing relatively few nonzero components.
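A minimal sketch of both variants on synthetic data (parameter values are illustrative): the Tikhonov problem has the closed form $x = (A^T A + 2m\lambda I)^{-1} A^T y$, while the LASSO can be solved by proximal gradient (ISTA), whose prox step is componentwise soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 20
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[:3] = [3.0, -2.0, 1.5]            # sparse ground truth
y = A @ x_true
lam = 0.1

# (1) Tikhonov: setting the gradient (1/m)A^T(Ax - y) + 2*lam*x to zero.
x_ridge = np.linalg.solve(A.T @ A + 2 * m * lam * np.eye(n), A.T @ y)

# (2) LASSO via ISTA: gradient step on the smooth term, then soft-threshold.
step = 1.0 / (np.linalg.norm(A.T @ A, 2) / m)   # 1 / Lipschitz constant
x = np.zeros(n)
for _ in range(500):
    z = x - step * (A.T @ (A @ x - y) / m)
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

assert np.linalg.norm(x_ridge - x_true) < np.linalg.norm(x_true)  # dense, shrunk
assert np.min(np.abs(x[:3])) > 0.5       # lasso keeps the true support...
assert np.max(np.abs(x[3:])) < 0.5       # ...and (here) suppresses the rest
```

Note how the $\ell_1$ solution concentrates its weight on the three true features, while the ridge solution shrinks all components without zeroing any.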
The $\ell_1$ norm promotes sparsity, compared with the $\ell_2$ norm.
LASSO performs feature selection: to gather a new data vector $a$ for prediction, we need to find only the "selected" features.
3 Matrix completion
Suppose $A_j \in \mathbb{R}^{n \times p}$. We seek $X \in \mathbb{R}^{n \times p}$ that solves
$$\min_X\ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, X \rangle_F - y_j)^2,$$
where $\langle A, B \rangle_F = \mathrm{tr}(A^T B)$.
A regularized version, leading to solutions $X$ that are low-rank, is
$$\min_X\ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, X \rangle_F - y_j)^2 + \lambda \|X\|_*, \quad \lambda > 0,$$
where $\|X\|_*$ is the nuclear norm (the sum of singular values of $X$).
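Algorithms for the nuclear-norm formulation typically rely on its proximal operator, which soft-thresholds the singular values; a small sketch (the helper name `prox_nuclear` is ours, not standard):

```python
import numpy as np

def prox_nuclear(X, tau):
    """Prox of tau*||.||_*: shrink each singular value by tau, floor at zero."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4)) @ rng.normal(size=(4, 5))   # rank <= 4
tau = 5.0
Y = prox_nuclear(X, tau)

s_X = np.linalg.svd(X, compute_uv=False)
s_Y = np.linalg.svd(Y, compute_uv=False)
assert np.allclose(s_Y, np.maximum(s_X - tau, 0.0))     # singular values shrunk
assert np.linalg.matrix_rank(Y) <= np.linalg.matrix_rank(X)
```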
Rank-$r$ version. Let $X = LR^T$, with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and $r \ll \min(n, p)$. We solve
$$\min_{L,R}\ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, LR^T \rangle_F - y_j)^2.$$
In this formulation, the rank $r$ is "hard-wired" into the definition of $X$ via two tall, thin matrices $L$ and $R$. The objective function is nonconvex. Advantage: the total number of elements in $L$ and $R$ is $(n + p)r$, which is much less than $np$.
Minimum-rank version:
$$\min_X\ \mathrm{rank}(X) \quad \text{s.t.} \quad \langle A_j, X \rangle_F = y_j, \quad j = 1, \dots, m.$$
4 Nonnegative matrix factorization
Applications in computer vision, chemometrics, and document clustering require us to find factors $L \in \mathbb{R}^{n \times r}$ and $R \in \mathbb{R}^{p \times r}$ with all elements nonnegative.
If the full matrix $Y \in \mathbb{R}^{n \times p}$ is observed, this problem has the form
$$\min_{L,R}\ \|LR^T - Y\|_F^2 \quad \text{subject to} \quad L \ge 0,\ R \ge 0.$$
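One classical heuristic for this problem (not covered in the slides) is the Lee-Seung multiplicative update, which preserves nonnegativity automatically because it only multiplies by nonnegative ratios; a sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 20, 15, 3
Y = rng.uniform(size=(n, r)) @ rng.uniform(size=(r, p))  # nonnegative, rank <= r

L = rng.uniform(size=(n, r)) + 0.1
R = rng.uniform(size=(p, r)) + 0.1
eps = 1e-12                                # guard against division by zero
err0 = np.linalg.norm(L @ R.T - Y)
for _ in range(500):
    L *= (Y @ R) / (L @ (R.T @ R) + eps)   # multiplicative updates: the
    R *= (Y.T @ L) / (R @ (L.T @ L) + eps) # objective never increases
err = np.linalg.norm(L @ R.T - Y)

assert np.all(L >= 0) and np.all(R >= 0)   # nonnegativity preserved
assert err < err0                          # fit improved
```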
5 Sparse inverse covariance estimation
In this problem, the labels $y_j$ are null, and the vectors $a_j \in \mathbb{R}^n$ are viewed as independent observations of a random vector $a \in \mathbb{R}^n$, which has zero mean.
The sample covariance matrix constructed from these observations is
$$S = \frac{1}{m - 1} \sum_{j=1}^{m} a_j a_j^T.$$
The element $S_{il}$ is an estimate of the covariance between the $i$th and $l$th elements of the random vector $a$.
Our interest is in calculating an estimate $X$ of the inverse covariance matrix that is sparse.
The structure of $X$ yields important information about $a$. In particular, if $X_{il} = 0$, we can conclude that the $i$th and $l$th components of $a$ are conditionally independent. (That is, they are independent given knowledge of the values of the other $n - 2$ components of $a$.)
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix $X$ is the following:
$$\min_{X \in \mathcal{S}^{n \times n},\ X \succ 0}\ \langle S, X \rangle_F - \log \det(X) + \lambda \|X\|_1, \quad \lambda > 0,$$
where $\mathcal{S}^{n \times n}$ is the set of $n \times n$ symmetric matrices, $X \succ 0$ indicates that $X$ is positive definite, and $\|X\|_1 = \sum_{i,l=1}^{n} |X_{il}|$.
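To see why the first two terms make sense, note that with $\lambda = 0$ the objective $\langle S, X \rangle_F - \log\det(X)$ has gradient $S - X^{-1}$, so its unconstrained minimizer over positive definite matrices is exactly $X = S^{-1}$. A small numerical check on synthetic data:

```python
import numpy as np

def objective(X, S, lam):
    """<S, X>_F - log det(X) + lam * ||X||_1, for symmetric positive definite X."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0                        # X must be positive definite
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()

rng = np.random.default_rng(4)
m, n = 200, 4
A = rng.normal(size=(m, n))                # zero-mean observations a_j as rows
S = A.T @ A / (m - 1)                      # sample covariance

X_star = np.linalg.inv(S)                  # minimizer when lam = 0
X_pert = X_star + 0.1 * np.eye(n)          # any other PD point scores no better
assert objective(X_star, S, 0.0) <= objective(X_pert, S, 0.0)
```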
6 Sparse principal components
We have a sample covariance matrix $S$ that is estimated from a number of observations of some underlying random vector.
The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.
It is often of interest to find a sparse principal component: an approximation to the leading eigenvector that also contains few nonzeros.
An explicit optimization formulation of this problem is
$$\max_{v \in \mathbb{R}^n}\ v^T S v \quad \text{s.t.} \quad \|v\|_2 = 1,\ \|v\|_0 \le k,$$
where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$.
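The constraint $\|v\|_0 \le k$ makes this a hard combinatorial problem, but a simple heuristic (a truncated power iteration; our own illustrative implementation, not from the slides) often works well: run the power method, but zero out all except the $k$ largest-magnitude entries of each iterate.

```python
import numpy as np

def sparse_pc(S, k, iters=100):
    """Truncated power iteration: after each multiplication by S,
    keep only the k largest-magnitude entries, then renormalize."""
    n = S.shape[0]
    v = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        w = S @ v
        w[np.argsort(np.abs(w))[:-k]] = 0.0   # zero the n-k smallest entries
        v = w / np.linalg.norm(w)
    return v

# Covariance with a dominant sparse direction supported on coordinates {0, 1}.
u = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
S = 0.1 * np.eye(4) + 5.0 * np.outer(u, u)
v = sparse_pc(S, k=2)
assert np.count_nonzero(v) <= 2
assert abs(abs(v @ u) - 1.0) < 1e-6           # recovers the sparse direction
```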
A relaxation: replace $vv^T$ by a positive semidefinite proxy $M \in \mathcal{S}^{n \times n}$:
$$\max_{M \in \mathcal{S}^{n \times n}}\ \langle S, M \rangle_F \quad \text{s.t.} \quad M \succeq 0,\ \langle I, M \rangle_F = 1,\ \|M\|_1 \le \rho,$$
for some parameter $\rho > 0$ that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite programming problem.
More generally, one may need to find the leading $r > 1$ sparse principal components: approximations to the leading $r$ eigenvectors that also contain few nonzeros.
Ideally, we would obtain these from a matrix $V \in \mathbb{R}^{n \times r}$ whose columns are mutually orthogonal and have at most $k$ nonzeros each.
The optimization formulation is
$$\max_{V \in \mathbb{R}^{n \times r}}\ \langle S, VV^T \rangle_F \quad \text{s.t.} \quad V^T V = I,\ \|v_i\|_0 \le k,\ i = 1, \dots, r.$$
We can write a convex relaxation of this problem, once again a semidefinite program, as
$$\max_{M \in \mathcal{S}^{n \times n}}\ \langle S, M \rangle_F \quad \text{s.t.} \quad 0 \preceq M \preceq I,\ \langle I, M \rangle_F = r,\ \|M\|_1 \le \rho.$$
A more compact (but nonconvex) formulation is
$$\max_{F \in \mathbb{R}^{n \times r}}\ \langle S, FF^T \rangle_F \quad \text{s.t.} \quad \|F\|_2 \le 1,\ \|F\|_{2,1} \le R,$$
where $\|F\|_{2,1} = \sum_{i=1}^{n} \|F_{i\cdot}\|_2$, the sum of the $\ell_2$ norms of the rows of $F$.
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed $n \times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is
$$\min_{M, S}\ \|M\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad Y = M + S,$$
where $\|S\|_1 = \sum_{i,j} |S_{ij}|$.
Compact nonconvex formulations that allow noise in the observations include the following ($L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and $S$ a sparse matrix). Fully observed:
$$\min_{L, R, S}\ \frac{1}{2} \|LR^T + S - Y\|_F^2.$$
Partially observed:
$$\min_{L, R, S}\ \frac{1}{2} \|P_\Phi(LR^T + S - Y)\|_F^2,$$
where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.
Another application is to foreground-background separation in video processing. Here, each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.
8 Subspace identification
In this application, the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.
The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n \times r}$.
If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n \times m$ matrix $A = [a_1\ a_2\ \cdots\ a_m]$ and take $X$ to be the matrix of leading $r$ left singular vectors (which span the column space of $A$).
In interesting variants of this problem, however, the vectors $a_j$ may arrive in streaming fashion and may be only partly observed, for example in indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that
$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m.$$
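For the fully observed case, a small sketch of the SVD route on synthetic subspace data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, r = 10, 50, 2
X_true = np.linalg.qr(rng.normal(size=(n, r)))[0]   # orthonormal basis, n x r
A = X_true @ rng.normal(size=(r, m))                # columns a_j lie in the subspace

# The leading r left singular vectors of A span the column space of A.
U = np.linalg.svd(A, full_matrices=False)[0]
X = U[:, :r]

# Two subspaces agree exactly when their orthogonal projectors agree.
assert np.allclose(X @ X.T, X_true @ X_true.T)
```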
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.
This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that
$$a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$a_j^T x - \beta \le -1 \quad \text{when } y_j = -1.$$
Any pair $(x, \beta)$ that satisfies these conditions defines a separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$ that separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$. Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the planes $a^T x = \beta \pm 1$ is $2 / \|x\|_2$.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form
$$H(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), 0\} \ge 0.$$
Note that the $j$th term in this summation is zero if the conditions above are satisfied, and positive otherwise. Thus $\min_{x,\beta} H(x, \beta) = 0$ means that a separating hyperplane exists.
Regularized version:
$$H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), 0\} + \frac{\lambda}{2} \|x\|_2^2.$$
If $\lambda$ is sufficiently small (but positive), and if separating hyperplanes exist, then the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability and robustness.
[Figure] Linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then:
$$\min_{x, \beta, s}\ \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2$$
subject to
$$s_j \ge 1 - y_j(a_j^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$
where $\mathbf{1} = [1\ 1\ \cdots\ 1]^T \in \mathbb{R}^m$.
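Instead of solving the QP directly, $H_\lambda$ can also be minimized by a subgradient method; the sketch below (synthetic separable data, Pegasos-style step sizes $1/(\lambda t)$; all choices illustrative) typically finds a good separating hyperplane:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 200
A = np.vstack([rng.normal(loc=(2, 2), size=(m // 2, 2)),     # positive class
               rng.normal(loc=(-2, -2), size=(m // 2, 2))])  # negative class
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

lam = 0.01
x = np.zeros(2)
beta = 0.0
for t in range(1, 2001):
    active = y * (A @ x - beta) < 1            # terms with nonzero hinge loss
    gx = -(y[active, None] * A[active]).sum(0) / m + lam * x   # subgradient in x
    gb = y[active].sum() / m                                   # subgradient in beta
    x -= gx / (lam * t)
    beta -= gb / (lam * t)

accuracy = np.mean(np.sign(A @ x - beta) == y)
assert accuracy > 0.9
```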
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should this case be treated?
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. Then the conditions
$$\zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$\zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1,$$
lead to
$$H_{\zeta,\lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(\zeta(a_j)^T x - \beta), 0\} + \frac{\lambda}{2} \|x\|_2^2.$$
When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.
The problem of minimizing $H_{\zeta,\lambda}(x, \beta)$ can likewise be written as a convex quadratic program by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then:
$$\min_{x, \beta, s}\ \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2$$
subject to
$$s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$
where $\mathbf{1} = [1\ 1\ \cdots\ 1]^T \in \mathbb{R}^m$.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization, §5.2.4 (Lagrange dual of a QCQP), and the result in First-Order Methods in Optimization, §4.4.7 (Convex Quadratic Functions).
The dual problem, in $m$ variables, is
$$\min_{z \in \mathbb{R}^m}\ \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda},\ y^T z = 0,$$
where $Q_{kl} = y_k y_l \zeta(a_k)^T \zeta(a_l)$.
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick."
A particularly popular choice of kernel is the Gaussian kernel,
$$K(a_k, a_l) = \exp\left(-\frac{\|a_k - a_l\|^2}{2\sigma}\right),$$
where $\sigma$ is a positive parameter.
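Assembling the matrix $Q$ then only requires evaluating the kernel on all pairs of data points; a vectorized sketch of the Gaussian kernel (with the $2\sigma$ scaling used above):

```python
import numpy as np

def gaussian_kernel_matrix(A, sigma):
    """K[k, l] = exp(-||a_k - a_l||^2 / (2*sigma)) for all pairs of rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T     # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma))

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(A, sigma=1.0)
assert np.allclose(np.diag(K), 1.0)            # K(a, a) = exp(0) = 1
assert np.isclose(K[0, 1], np.exp(-0.5))       # ||a_0 - a_1||^2 = 1
# The dual matrix would then be Q[k, l] = y[k] * y[l] * K[k, l].
```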
10 Logistic regression
We seek an "odds function" $p$ taking values in $(0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:
$$p(a; x) = (1 + \exp(a^T x))^{-1},$$
and aim to choose the parameter $x$ so that
$$p(a_j; x) \approx 1 \quad \text{when } y_j = 1,$$
$$p(a_j; x) \approx 0 \quad \text{when } y_j = -1.$$
The optimal value of $x$ can be found by maximizing a log-likelihood function:
$$L(x) = \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right).$$
We can perform feature selection using this model by introducing a regularizer:
$$\max_x\ \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right) - \lambda \|x\|_1,$$
where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
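Since the unregularized log-likelihood is smooth and concave, plain gradient ascent already works; a sketch on synthetic data generated from the model, using the convention $p(a; x) = (1 + e^{a^T x})^{-1}$ above (step size and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 300, 3
A = rng.normal(size=(m, n))
x_true = np.array([2.0, -1.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(A @ x_true))       # p ~ 1 where a^T x_true << 0
y = np.where(rng.uniform(size=m) < p_true, 1, -1)

def loglik_and_grad(x):
    p = 1.0 / (1.0 + np.exp(A @ x))
    L = (np.log(1.0 - p[y == -1]).sum() + np.log(p[y == 1]).sum()) / m
    # d log p / dx = -(1 - p) a;  d log(1 - p) / dx = p a.
    coef = np.where(y == 1, -(1.0 - p), p)
    return L, (coef[:, None] * A).sum(axis=0) / m

x = np.zeros(n)
L0, g = loglik_and_grad(x)
for _ in range(2000):
    x += 1.0 * g                                 # ascent on the concave objective
    L1, g = loglik_and_grad(x)

assert L1 > L0                                   # likelihood increased
assert np.sign(x[0]) == np.sign(x_true[0])       # recovers the leading sign
```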
Multiclass (or multinomial) logistic regression applies when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k$, taking values in $(0, 1)$, for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^{M} \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and for all $k$, we have
$$p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^{M} p_k(a; X) = 1.$$
If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^{M}$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \neq k$, then
$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \ \text{ for } \ell \neq k.$$
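In practice these softmax probabilities are computed after subtracting $\max_\ell a^T x_{[\ell]}$, which cancels in the ratio but prevents overflow; a sketch:

```python
import numpy as np

def softmax(t):
    """p_k = exp(t_k) / sum_l exp(t_l), where t_l = a^T x_[l]; the max shift
    leaves the ratio unchanged but avoids overflow in exp."""
    e = np.exp(t - np.max(t))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)                # probabilities sum to 1
assert np.all((p > 0) & (p < 1))
p_dom = softmax(np.array([100.0, 0.0, 0.0]))   # one dominant inner product
assert p_dom[0] > 0.999                        # drives that class's odds to ~1
```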
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$
We seek to define the vectors $x_{[k]}$ so that
$$p_k(a_j; X) \approx 1 \quad \text{when } y_{jk} = 1,$$
$$p_k(a_j; X) \approx 0 \quad \text{when } y_{jk} = 0.$$
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
$$L(X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} (x_{[\ell]}^T a_j) - \log\left( \sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j) \right) \right].$$
Group-sparse regularization terms can also be included.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.
The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference:
"Deep Learning: An Introduction for Applied Mathematicians," Catherine F. Higham and Desmond J. Higham, SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
[Figure] Deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is
$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$
where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g^l$ is a vector of length $|a_j^l|$, $\sigma$ is a componentwise nonlinear transformation, and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:
(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$.
(2) Rectified linear unit (ReLU): $t \mapsto \max(t, 0)$.
(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
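The layer recursion above is a few lines of code; a sketch of the forward pass with ReLU activations (shapes and widths here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)

def forward(a, params):
    """Apply a^l = sigma(W^l a^{l-1} + g^l) through the hidden layers (ReLU sigma)."""
    for W, g in params:
        a = np.maximum(W @ a + g, 0.0)
    return a

widths = [4, 5, 3]                       # input in R^4, hidden widths 5 and 3 (D = 2)
params = [(rng.normal(size=(widths[l + 1], widths[l])), rng.normal(size=widths[l + 1]))
          for l in range(len(widths) - 1)]

aD = forward(rng.normal(size=4), params)
assert aD.shape == (3,)
assert np.all(aD >= 0.0)                 # ReLU outputs are nonnegative
```

The output $a_j^D$ would then be fed into the multiclass logistic regression layer described previously.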
Using the notation $w$ for the hidden-layer transformations, that is,
$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$
and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:
$$L(w, X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} (x_{[\ell]}^T a_j^D(w)) - \log\left( \sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j^D(w)) \right) \right].$$
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.
The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in $(w, X)$ is usually very large.
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
Generalization techniques, such as regularization, come to the rescue.
2 Least squares
The data points $(a_j, y_j)$ lie in $\mathbb{R}^n \times \mathbb{R}$, and we solve
$$\min_x \ \frac{1}{2m} \sum_{j=1}^m (a_j^T x - y_j)^2 = \frac{1}{2m}\|Ax - y\|_2^2,$$
where $A$ is the matrix whose rows are $a_j^T$, $j = 1, \dots, m$, and $y = [y_1, y_2, \cdots, y_m]^T$.
The function $\phi$ is defined by $\phi(a) = a^T x$.
We could also introduce a nonzero intercept by adding an extra parameter $\beta \in \mathbb{R}$ and defining $\phi(a) = a^T x + \beta$.
Statistically, when the observations $y_j$ are contaminated with i.i.d. Gaussian noise, the least-squares solution $x$ is the maximum-likelihood estimate.
We can impose desirable structure on $x$:
(1) Tikhonov regularization with a squared $\ell_2$-norm,
$$\min_x \ \frac{1}{2m}\|Ax - y\|_2^2 + \lambda\|x\|_2^2, \quad \lambda > 0,$$
yields a solution $x$ with less sensitivity to perturbations in the data $(a_j, y_j)$.
(2) The LASSO formulation,
$$\min_x \ \frac{1}{2m}\|Ax - y\|_2^2 + \lambda\|x\|_1, \quad \lambda > 0,$$
tends to yield solutions $x$ that are sparse, that is, containing relatively few nonzero components.
The $\ell_1$ norm promotes sparsity, compared with the $\ell_2$ norm.
LASSO performs feature selection: to gather a new data vector $a$ for prediction, we need to find only the "selected" features.
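As a concrete illustration of the two regularizers, here is a minimal NumPy sketch (the problem sizes and the value of $\lambda$ are made up for the demo). It solves the Tikhonov problem via its normal equations and the LASSO problem via proximal-gradient (ISTA) iterations, whose soft-thresholding step is what produces exact zeros:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 20
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:3] = [2.0, -1.5, 1.0]            # only 3 active features
y = A @ x_true + 0.01 * rng.standard_normal(m)

lam = 0.1

# Tikhonov (ridge): closed form from the normal equations of
#   min_x (1/2m)||Ax - y||_2^2 + lam*||x||_2^2
x_ridge = np.linalg.solve(A.T @ A / m + 2 * lam * np.eye(n), A.T @ y / m)

# LASSO: proximal-gradient (ISTA) iterations on
#   min_x (1/2m)||Ax - y||_2^2 + lam*||x||_1
step = 1.0 / (np.linalg.norm(A, 2) ** 2 / m)   # 1/L, L = Lipschitz const of the gradient
x_lasso = np.zeros(n)
for _ in range(500):
    g = A.T @ (A @ x_lasso - y) / m            # gradient of the smooth part
    z = x_lasso - step * g
    x_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

n_nonzero = np.count_nonzero(np.abs(x_lasso) > 1e-8)
```

Typically `x_ridge` has all components nonzero (just shrunk), while `x_lasso` concentrates on the few truly active features.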
3 Matrix completion
Suppose $A_j \in \mathbb{R}^{n \times p}$. We seek $X \in \mathbb{R}^{n \times p}$ that solves
$$\min_X \ \frac{1}{2m} \sum_{j=1}^m (\langle A_j, X\rangle_F - y_j)^2,$$
where $\langle A, B\rangle_F = \operatorname{tr}(A^T B)$.
A regularized version, leading to solutions $X$ that are low-rank, is
$$\min_X \ \frac{1}{2m} \sum_{j=1}^m (\langle A_j, X\rangle_F - y_j)^2 + \lambda\|X\|_*, \quad \lambda > 0,$$
where $\|X\|_*$ is the nuclear norm (the sum of the singular values of $X$).
Rank-$r$ version: let $X = LR^T$ with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and $r \ll \min(n, p)$. We solve
$$\min_{L,R} \ \frac{1}{2m} \sum_{j=1}^m (\langle A_j, LR^T\rangle_F - y_j)^2.$$
In this formulation the rank $r$ is "hard-wired" into the definition of $X$ via the two "thin" matrices $L$ and $R$.
The objective function is nonconvex.
Advantage: the total number of elements in $L$ and $R$ is $(n+p)r$, which is much less than $np$.
Minimum-rank version:
$$\min_X \ \operatorname{rank}(X) \quad \text{s.t.} \quad \langle A_j, X\rangle_F = y_j, \ j = 1, \dots, m.$$
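In the classical matrix-completion setting, each $A_j = e_i e_k^T$ observes a single entry, so $\langle A_j, X\rangle_F = X_{ik}$. The sketch below is a toy instance (sizes, step length, iteration count, and the spectral initialization from the zero-filled observation matrix are all illustrative choices, not a prescribed algorithm) of plain gradient descent on the rank-$r$ factored objective:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 30, 20, 2
X_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))  # rank r

# Observe ~50% of the entries: each A_j = e_i e_k^T, so <A_j, X>_F = X[i, k].
mask = rng.random((n, p)) < 0.5
Y_obs = np.where(mask, X_true, 0.0)
m = mask.sum()

# Spectral initialization: truncated SVD of the zero-filled observations.
U, s, Vt = np.linalg.svd(Y_obs, full_matrices=False)
L = U[:, :r] * np.sqrt(s[:r])
R = Vt[:r].T * np.sqrt(s[:r])

# Gradient descent on f(L,R) = (1/2m) * sum over observed (i,k) of ((LR^T)[i,k] - X_true[i,k])^2.
step = 2.0
for _ in range(2000):
    resid = mask * (L @ R.T - Y_obs)        # residuals on observed entries only
    L, R = L - step * (resid @ R) / m, R - step * (resid.T @ L) / m

fit = np.linalg.norm(mask * (L @ R.T - X_true)) / np.linalg.norm(mask * X_true)
```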
4 Nonnegative matrix factorization
Applications in computer vision, chemometrics, and document clustering require us to find factors $L \in \mathbb{R}^{n \times r}$ and $R \in \mathbb{R}^{p \times r}$ with all elements nonnegative.
If the full matrix $Y \in \mathbb{R}^{n \times p}$ is observed, this problem has the form
$$\min_{L,R} \ \|LR^T - Y\|_F^2 \quad \text{subject to} \quad L \ge 0, \ R \ge 0.$$
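One standard way to attack this problem (not discussed further in these notes) is the multiplicative-update scheme of Lee and Seung, which preserves nonnegativity automatically. The sketch below uses a toy, exactly factorizable $Y$ with made-up sizes and iteration count:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 25, 15, 3
Y = rng.random((n, r)) @ rng.random((r, p))   # nonnegative, rank r by construction

# Multiplicative updates keep L, R >= 0 as long as they start positive.
L = rng.random((n, r)) + 0.1
R = rng.random((p, r)) + 0.1
eps = 1e-12                                   # avoids division by zero
for _ in range(1000):
    L *= (Y @ R) / (L @ (R.T @ R) + eps)
    R *= (Y.T @ L) / (R @ (L.T @ L) + eps)

rel_err = np.linalg.norm(L @ R.T - Y) / np.linalg.norm(Y)
```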
5 Sparse inverse covariance estimation
In this problem the labels $y_j$ are null, and the vectors $a_j \in \mathbb{R}^n$ are viewed as independent observations of a random vector $a \in \mathbb{R}^n$, which has zero mean.
The sample covariance matrix constructed from these observations is
$$S = \frac{1}{m-1} \sum_{j=1}^m a_j a_j^T.$$
The element $S_{il}$ is an estimate of the covariance between the $i$th and $l$th elements of the random vector $a$.
Our interest is in calculating a sparse estimate $X$ of the inverse covariance matrix.
The structure of $X$ yields important information about $a$. In particular, if $X_{il} = 0$, we can conclude that the $i$ and $l$ components of $a$ are conditionally independent. (That is, they are independent given knowledge of the values of the other $n-2$ components of $a$.)
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix $X$ is the following:
$$\min_{X \in \mathcal{S}^{n \times n},\, X \succ 0} \ \langle S, X\rangle_F - \log\det(X) + \lambda\|X\|_1, \quad \lambda > 0,$$
where $\mathcal{S}^{n \times n}$ is the set of $n \times n$ symmetric matrices, $X \succ 0$ indicates that $X$ is positive definite, and $\|X\|_1 = \sum_{i,l=1}^n |X_{il}|$.
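To make the objective concrete, the snippet below evaluates $\langle S, X\rangle_F - \log\det(X) + \lambda\|X\|_1$ on a toy chain-structured example (sizes, $\lambda$, and the tridiagonal ground truth are arbitrary). It only evaluates the objective at two candidate points; actually minimizing it would require a solver such as graphical lasso, which is not attempted here:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 5, 2000, 0.01

# Ground-truth sparse inverse covariance (tridiagonal => chain dependence).
X_true = np.eye(n) + 0.3 * (np.eye(n, k=1) + np.eye(n, k=-1))
Sigma = np.linalg.inv(X_true)

# Sample zero-mean Gaussian data and form the sample covariance S.
A = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
S = A.T @ A / (m - 1)

def objective(X):
    """<S, X>_F - log det(X) + lam * ||X||_1 (elementwise l1 norm)."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()

f_true = objective(X_true)                  # sparse candidate
f_dense = objective(np.linalg.inv(S))       # unregularized MLE: generally dense
```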
6 Sparse principal components
We have a sample covariance matrix $S$, estimated from a number of observations of some underlying random vector.
The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.
It is often of interest to find a sparse approximation to the leading principal component, that is, an approximation to the leading eigenvector that also contains few nonzeros.
An explicit optimization formulation of this problem is
$$\max_{v \in \mathbb{R}^n} \ v^T S v \quad \text{s.t.} \quad \|v\|_2 = 1, \ \|v\|_0 \le k,$$
where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$.
A relaxation: replace $vv^T$ by a positive semidefinite proxy $M \in \mathcal{S}^{n \times n}$:
$$\max_{M \in \mathcal{S}^{n \times n}} \ \langle S, M\rangle_F \quad \text{s.t.} \quad M \succeq 0, \ \langle I, M\rangle_F = 1, \ \|M\|_1 \le \rho,$$
for some parameter $\rho > 0$ that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite program.
More generally, one may need to find the leading $r > 1$ sparse principal components, that is, approximations to the leading $r$ eigenvectors that also contain few nonzeros.
Ideally, we would obtain these from a matrix $V \in \mathbb{R}^{n \times r}$ whose columns are mutually orthogonal and have at most $k$ nonzeros each.
The optimization formulation is
$$\max_{V \in \mathbb{R}^{n \times r}} \ \langle S, VV^T\rangle_F \quad \text{s.t.} \quad V^T V = I, \ \|v_i\|_0 \le k, \ i = 1, \dots, r.$$
We can write a convex relaxation of this problem, once again a semidefinite program, as
$$\max_{M \in \mathcal{S}^{n \times n}} \ \langle S, M\rangle_F \quad \text{s.t.} \quad 0 \preceq M \preceq I, \ \langle I, M\rangle_F = r, \ \|M\|_1 \le \rho.$$
A more compact (but nonconvex) formulation is
$$\max_{F \in \mathbb{R}^{n \times r}} \ \langle S, FF^T\rangle_F \quad \text{s.t.} \quad \|F\|_2 \le 1, \ \|F\|_{2,1} \le R,$$
where $\|F\|_{2,1} = \sum_{i=1}^n \|F_i\|_2$, with $F_i$ the $i$th row of $F$.
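A simple heuristic for the cardinality-constrained problem (not part of these notes, but a natural companion) is a truncated power method: ordinary power iteration on $S$, zeroing all but the $k$ largest-magnitude entries at each step. The sketch below builds a synthetic $S$ whose leading eigenvector is supported on three coordinates; all sizes and the dense-eigenvector initialization are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 20, 3

# Covariance whose leading eigenvector is sparse (support {0, 1, 2}).
v_star = np.zeros(n)
v_star[:3] = 1.0 / np.sqrt(3.0)
B = rng.standard_normal((n, n))
S = 5.0 * np.outer(v_star, v_star) + np.eye(n) + 0.05 * (B + B.T)

# Truncated power method: power step, then keep only the k largest |entries|.
v = np.linalg.eigh(S)[1][:, -1]              # init: dense leading eigenvector
for _ in range(50):
    w = S @ v
    w[np.argsort(np.abs(w))[:-k]] = 0.0      # zero all but the k largest entries
    v = w / np.linalg.norm(w)

support = np.nonzero(v)[0]
explained = v @ S @ v                        # variance captured, v^T S v
```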
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed $n \times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is
$$\min_{M,S} \ \|M\|_* + \lambda\|S\|_1 \quad \text{s.t.} \quad Y = M + S,$$
where $\|S\|_1 = \sum_{i,j} |S_{ij}|$.
Compact nonconvex formulations that allow noise in the observations include the following (with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and $S \in \mathcal{S}$, a set of sparse matrices). Fully observed:
$$\min_{L,R,S} \ \frac{1}{2}\|LR^T + S - Y\|_F^2.$$
Partially observed:
$$\min_{L,R,S} \ \frac{1}{2}\|P_\Phi(LR^T + S - Y)\|_F^2,$$
where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.
Another application is to foreground-background separation in video processing. Here each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.
8 Subspace identification
In this application the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.
The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n \times r}$.
If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n \times m$ matrix $A = [a_j]_{j=1}^m$ and take $X$ to be the leading $r$ left singular vectors.
In interesting variants of this problem, however, the vectors $a_j$ may be arriving in streaming fashion and may be only partly observed, for example in indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would then need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that
$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m.$$
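For the fully observed case, the snippet below (synthetic data with made-up sizes and noise level) builds the $n \times m$ data matrix, takes its leading $r$ left singular vectors, and checks that the recovered subspace matches the true one via the distance between orthogonal projectors:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, r = 40, 200, 3

X_true = np.linalg.qr(rng.standard_normal((n, r)))[0]   # orthonormal basis
A = X_true @ rng.standard_normal((r, m)) + 0.01 * rng.standard_normal((n, m))

# Fully observed case: the leading r LEFT singular vectors of the n x m
# matrix A span (approximately) the column subspace of the data.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
X_hat = U[:, :r]

# Compare subspaces via the projection distance ||P_true - P_hat||_2.
P_true = X_true @ X_true.T
P_hat = X_hat @ X_hat.T
dist = np.linalg.norm(P_true - P_hat, 2)
```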
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.
This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that
$$a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$a_j^T x - \beta \le -1 \quad \text{when } y_j = -1.$$
Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$, which separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$.
Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the hyperplanes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form
$$H(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta), 0\} \ge 0.$$
Note that the $j$th term in this summation is zero if the conditions above are satisfied and positive otherwise. Thus $\min_{x,\beta} H(x, \beta) = 0$ implies the existence of a separating hyperplane.
Regularized version:
$$H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta), 0\} + \frac{\lambda}{2}\|x\|_2^2.$$
If $\lambda$ is sufficiently small (but positive), and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane.
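Because the hinge terms are nonsmooth, $H_\lambda$ is typically minimized with (sub)gradient methods. Below is a minimal NumPy sketch on two synthetic separable point clouds; the sizes, the value of $\lambda$, the step length, and the iteration count are all arbitrary demo choices:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 200, 2
# Two well-separated point clouds.
A = np.vstack([rng.standard_normal((m // 2, n)) + 3.0,
               rng.standard_normal((m // 2, n)) - 3.0])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

lam = 0.01

def H_reg(x, beta):
    """(1/m) * sum_j max(1 - y_j(a_j^T x - beta), 0) + (lam/2)||x||_2^2."""
    margins = 1.0 - y * (A @ x - beta)
    return np.maximum(margins, 0.0).mean() + 0.5 * lam * x @ x

# Subgradient descent on the regularized hinge loss H_lambda.
x, beta = np.zeros(n), 0.0
step = 0.1
for _ in range(2000):
    active = (1.0 - y * (A @ x - beta)) > 0        # terms with a nonzero hinge
    gx = -(y[active, None] * A[active]).sum(axis=0) / m + lam * x
    gb = y[active].sum() / m
    x -= step * gx
    beta -= step * gb

train_err = np.mean(np.sign(A @ x - beta) != y)
```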
The maximum-margin property is consistent with the goals of generalizability and robustness.
Figure: linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms:
$$\min_{x, \beta, s} \ \frac{1}{m}\mathbf{1}^T s + \frac{\lambda}{2}\|x\|_2^2$$
subject to
$$s_j \ge 1 - y_j(a_j^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$
where $\mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^m$.
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. Then the conditions
$$\zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$\zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1,$$
lead to
$$H_{\zeta,\lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(\zeta(a_j)^T x - \beta), 0\} + \frac{\lambda}{2}\|x\|_2^2.$$
When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.
The problem of minimizing $H_{\zeta,\lambda}(x, \beta)$ can likewise be written as a convex quadratic program by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms:
$$\min_{x, \beta, s} \ \frac{1}{m}\mathbf{1}^T s + \frac{\lambda}{2}\|x\|_2^2$$
subject to
$$s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$
where $\mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^m$.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of a QCQP) and the result in First-Order Methods in Optimization §4.4.7 (convex quadratic functions).
The dual problem, in $m$ variables, is
$$\min_{z \in \mathbb{R}^m} \ \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda}, \ y^T z = 0,$$
where $Q_{kl} = y_k y_l \zeta(a_k)^T \zeta(a_l)$.
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel,
$$K(a_k, a_l) = \exp(-\|a_k - a_l\|^2 / (2\sigma)),$$
where $\sigma$ is a positive parameter.
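The kernel trick in code: the matrix $Q$ needed by the dual can be assembled directly from $K$ without ever forming $\zeta$. A small NumPy sketch (random data, arbitrary $\sigma$) also checks the positive semidefiniteness that makes the dual problem convex:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, sigma = 6, 3, 2.0
A = rng.standard_normal((m, n))          # rows are the a_j^T
y = rng.choice([-1.0, 1.0], size=m)

# Gaussian kernel K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2*sigma)),
# and the dual Hessian Q_kl = y_k y_l K(a_k, a_l): no explicit zeta needed.
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2.0 * sigma))
Q = np.outer(y, y) * K

# A valid kernel matrix is symmetric positive semidefinite, and so is Q.
eigs = np.linalg.eigvalsh(Q)
```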
10 Logistic regression
We seek an "odds function" $p \in (0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:
$$p(a; x) = (1 + \exp(a^T x))^{-1},$$
and aim to choose the parameter $x$ so that
$$p(a_j; x) \approx 1 \quad \text{when } y_j = 1,$$
$$p(a_j; x) \approx 0 \quad \text{when } y_j = -1.$$
The optimal value of $x$ can be found by maximizing a log-likelihood function:
$$L(x) = \frac{1}{m}\left[\sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x)\right].$$
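Note that with this parametrization, $p(a; x) \to 1$ as $a^T x \to -\infty$, so the model pushes $a^T x$ negative on the $y_j = 1$ examples. The sketch below (synthetic separable data with arbitrary sizes; plain gradient ascent with a fixed step, only NumPy assumed) maximizes $L(x)$ directly, using the gradient $-\frac{1}{m} A^T(\mathbf{1}\{y=1\} - p)$, which follows from differentiating the two log terms:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 100, 4
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
# With p(a; x) = 1/(1 + exp(a^T x)), the labels y=+1 go with negative a^T x.
y = np.where(A @ x_true < 0, 1.0, -1.0)

def p(A, x):
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(x):
    """(1/m) [ sum over y_j=-1 of log(1 - p_j) + sum over y_j=1 of log p_j ]."""
    probs = p(A, x)
    return (np.log(1 - probs[y == -1]).sum()
            + np.log(probs[y == 1]).sum()) / m

# Gradient ascent on L(x); grad L = -(1/m) A^T (1{y=1} - p).
x = np.zeros(n)
for _ in range(500):
    probs = p(A, x)
    grad = -A.T @ ((y == 1).astype(float) - probs) / m
    x += 1.0 * grad
```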
We can perform feature selection using this model by introducing a regularizer:
$$\max_x \ \frac{1}{m}\left[\sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x)\right] - \lambda\|x\|_1,$$
where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda\|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
Multiclass (or multinomial) logistic regression arises when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^M \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and for all $k$ we have
$$p_k(a; X) \in (0, 1), \quad \sum_{k=1}^M p_k(a; X) = 1.$$
If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^M$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then
$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \ \text{for } \ell \ne k.$$
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$
We seek to define the vectors $x_{[k]}$ so that
$$p_k(a_j; X) \approx 1 \quad \text{when } y_{jk} = 1,$$
$$p_k(a_j; X) \approx 0 \quad \text{when } y_{jk} = 0.$$
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
$$L(X) = \frac{1}{m} \sum_{j=1}^m \left[\sum_{\ell=1}^M y_{j\ell}(x_{[\ell]}^T a_j) - \log\left(\sum_{\ell=1}^M \exp(x_{[\ell]}^T a_j)\right)\right].$$
Group-sparse regularization terms can also be added to this objective.
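The softmax probabilities and the per-example log-likelihood term are straightforward to compute; the sketch below (random parameters, a hypothetical one-hot label for class 1, arbitrary dimensions) also uses the standard max-subtraction trick, which is valid because the softmax is invariant to shifting all scores by the same constant:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, M = 6, 4, 3
X = rng.standard_normal((M, n))          # rows are the vectors x_[k]
a = rng.standard_normal(n)

# p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l])  (softmax)
scores = X @ a
scores -= scores.max()                   # stabilize: softmax is shift-invariant
probs = np.exp(scores) / np.exp(scores).sum()

# Per-example log-likelihood term:
#   sum_l y_l (x_[l]^T a) - log sum_l exp(x_[l]^T a)
y = np.zeros(M)
y[1] = 1.0                               # hypothetical one-hot label: class 1
term = y @ (X @ a) - np.log(np.exp(X @ a).sum())
```

Since the label is one-hot, `term` equals the log of the predicted probability of the true class.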
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.
The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: "Deep Learning: An Introduction for Applied Mathematicians", Catherine F. Higham and Desmond J. Higham, SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Figure: a deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is
$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$
where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g^l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation; and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
Typical forms of the function $\sigma$, acting identically on each component $t \in \mathbb{R}$ of its input vector, include the following:
(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$.
(2) Rectified linear unit (ReLU): $t \mapsto \max(t, 0)$.
(3) Bernoulli: a random function that outputs $1$ with probability $1/(1 + e^{-t})$ and $0$ otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
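A forward pass through such a network is only a few lines; the sketch below (random weights, made-up layer widths, ReLU activations, and a softmax top layer standing in for the multiclass logistic regression stage) shows the map $a \mapsto a^D \mapsto$ class probabilities:

```python
import numpy as np

rng = np.random.default_rng(10)
n, D, width, M = 8, 2, 16, 3      # input dim, hidden layers, layer width, classes

relu = lambda t: np.maximum(t, 0.0)

# Hidden-layer parameters w = (W^1, g^1, ..., W^D, g^D).
sizes = [n] + [width] * D
Ws = [rng.standard_normal((sizes[l + 1], sizes[l])) * 0.5 for l in range(D)]
gs = [np.zeros(sizes[l + 1]) for l in range(D)]
# Top-layer multiclass-logistic parameters X = {x_[k]}.
X = rng.standard_normal((M, width)) * 0.5

def forward(a):
    """a^l = sigma(W^l a^{l-1} + g^l) for l = 1..D, then softmax probabilities."""
    for W, g in zip(Ws, gs):
        a = relu(W @ a + g)
    scores = X @ a
    scores -= scores.max()        # numerical stabilization
    e = np.exp(scores)
    return e / e.sum()

probs = forward(rng.standard_normal(n))
```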
Using the notation $w$ for the hidden-layer transformations, that is,
$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$
and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:
$$L(w, X) = \frac{1}{m} \sum_{j=1}^m \left[\sum_{\ell=1}^M y_{j\ell}(x_{[\ell]}^T a_j^D(w)) - \log\left(\sum_{\ell=1}^M \exp(x_{[\ell]}^T a_j^D(w))\right)\right].$$
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.
The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in $(w, X)$ is usually very large.
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
2 Least squares
The data points (a_j, y_j) lie in ℝⁿ × ℝ, and we solve

  min_x (1/(2m)) Σ_{j=1}^m (a_j^T x − y_j)² = (1/(2m)) ‖Ax − y‖_2²,

where A is the matrix whose rows are a_j^T, j = 1, ..., m, and y = [y_1 y_2 ⋯ y_m]^T.
The function φ is defined by φ(a) = a^T x.
We could also introduce a nonzero intercept by adding an extra parameter β ∈ ℝ and defining φ(a) = a^T x + β.
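As a small illustration (not part of the slides; the data, sizes, and noise level below are invented), the least-squares problem can be solved with a standard linear-algebra routine, since the 1/(2m) scaling does not change the minimizer:

```python
import numpy as np

# Synthetic data: y_j = a_j^T x_true + small Gaussian noise.
rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.01 * rng.standard_normal(m)

def ls_objective(x):
    """(1/(2m)) * ||A x - y||_2^2."""
    r = A @ x - y
    return (r @ r) / (2 * m)

# The scaling does not affect the argmin, so a standard solver recovers x.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With noise of size 0.01 and m ≫ n, the recovered x_ls is close to x_true.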
Statistically, when the observations y_j are contaminated with i.i.d. Gaussian noise, the least-squares solution x is the maximum-likelihood estimate.
We can impose desirable structure on x via regularization:
(1) Tikhonov regularization with a squared ℓ2-norm,

  min_x (1/(2m)) ‖Ax − y‖_2² + λ‖x‖_2²,  λ > 0,

yields a solution x with less sensitivity to perturbations in the data (a_j, y_j).
(2) The LASSO formulation,

  min_x (1/(2m)) ‖Ax − y‖_2² + λ‖x‖_1,  λ > 0,

tends to yield solutions x that are sparse, that is, containing relatively few nonzero components.
The ℓ1 norm promotes sparsity, compared with the ℓ2 norm.
LASSO performs feature selection: to gather a new data vector a for prediction, we need to find only the "selected" features.
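One standard algorithm for the LASSO (not covered in the slides) is the proximal-gradient method ISTA, whose inner step is the soft-thresholding operator; the sketch below, with invented data in the usage, is illustrative only:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: componentwise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(A, y, lam, steps=500):
    """Proximal-gradient (ISTA) iterations for min (1/(2m))||Ax - y||_2^2 + lam*||x||_1."""
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2 / m   # Lipschitz constant of the smooth part's gradient
    x = np.zeros(n)
    for _ in range(steps):
        grad = A.T @ (A @ x - y) / m
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

Each iteration costs one multiplication by A and one by A^T; the shrinkage step sets small components exactly to zero, which is the source of sparsity.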
3 Matrix completion
Suppose A_j ∈ ℝ^{n×p}. We seek X ∈ ℝ^{n×p} that solves

  min_X (1/(2m)) Σ_{j=1}^m (⟨A_j, X⟩_F − y_j)²,

where ⟨A, B⟩_F = tr(A^T B).
A regularized version, leading to solutions X that are low-rank, is

  min_X (1/(2m)) Σ_{j=1}^m (⟨A_j, X⟩_F − y_j)² + λ‖X‖_*,  λ > 0,

where ‖X‖_* is the nuclear norm (the sum of singular values of X).
Rank-r version: let X = LR^T with L ∈ ℝ^{n×r}, R ∈ ℝ^{p×r}, and r ≪ min(n, p). We solve

  min_{L,R} (1/(2m)) Σ_{j=1}^m (⟨A_j, LR^T⟩_F − y_j)².
In this formulation the rank r is "hard-wired" into the definition of X via two "thin-tall" matrices L and R.
The objective function is nonconvex.
Advantage: the total number of elements in L and R is (n + p)r, which is much less than np.
Minimum-rank version:

  min_X rank(X)  s.t.  ⟨A_j, X⟩_F = y_j, j = 1, ..., m.
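To make the objectives concrete, the snippet below (synthetic data, invented sizes) evaluates the factored objective, using the identity ⟨A, X⟩_F = Σ_{i,l} A_il X_il:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r, m = 6, 5, 2, 40
L_true = rng.standard_normal((n, r))
R_true = rng.standard_normal((p, r))
X_true = L_true @ R_true.T

# Observations y_j = <A_j, X>_F with random measurement matrices A_j.
A_list = [rng.standard_normal((n, p)) for _ in range(m)]
y = np.array([np.trace(A.T @ X_true) for A in A_list])

def completion_loss(L, R):
    """(1/(2m)) * sum_j (<A_j, L R^T>_F - y_j)^2."""
    X = L @ R.T
    resid = np.array([np.sum(A * X) for A in A_list]) - y  # <A, X>_F = sum(A * X)
    return (resid @ resid) / (2 * m)
```

The loss is zero at the true factors and positive elsewhere (generically).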
4 Nonnegative matrix factorization
Applications in computer vision, chemometrics, and document clustering require us to find factors L ∈ ℝ^{n×r} and R ∈ ℝ^{p×r} with all elements nonnegative.
If the full matrix Y ∈ ℝ^{n×p} is observed, this problem has the form

  min_{L,R} ‖LR^T − Y‖_F²  subject to  L ≥ 0, R ≥ 0.
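The slides do not prescribe an algorithm; one classical choice is the Lee-Seung multiplicative update, sketched here under the assumption that Y is nonnegative. Each update preserves nonnegativity and does not increase the objective:

```python
import numpy as np

def nmf_multiplicative(Y, r, steps=200, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||L R^T - Y||_F^2 with L, R >= 0."""
    rng = np.random.default_rng(3)
    n, p = Y.shape
    L = rng.random((n, r)) + 0.1   # strictly positive initialization
    R = rng.random((p, r)) + 0.1
    for _ in range(steps):
        L *= (Y @ R) / (L @ (R.T @ R) + eps)
        R *= (Y.T @ L) / (R @ (L.T @ L) + eps)
    return L, R
```

Because the updates are multiplicative, entries that start positive stay nonnegative automatically, so no projection step is needed.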
5 Sparse inverse covariance estimation
In this problem the labels y_j are null, and the vectors a_j ∈ ℝⁿ are viewed as independent observations of a random vector a ∈ ℝⁿ which has zero mean.
The sample covariance matrix constructed from these observations is

  S = (1/(m − 1)) Σ_{j=1}^m a_j a_j^T.
The element S_il is an estimate of the covariance between the ith and lth elements of the random vector a.
Our interest is in calculating an estimate X of the inverse covariance matrix that is sparse.
The structure of X yields important information about a. In particular, if X_il = 0, we can conclude that the i and l components of a are conditionally independent. (That is, they are independent given knowledge of the values of the other n − 2 components of a.)
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix X is the following:

  min_{X ∈ S^{n×n}, X ≻ 0} ⟨S, X⟩_F − log det(X) + λ‖X‖_1,  λ > 0,

where S^{n×n} is the set of n × n symmetric matrices, X ≻ 0 indicates that X is positive definite, and ‖X‖_1 = Σ_{i,l=1}^n |X_il|.
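Evaluating this objective is a one-liner once the log-determinant is computed stably (the helper name and test matrices are mine; with λ = 0 the minimizer is X = S⁻¹):

```python
import numpy as np

def sparse_invcov_objective(X, S, lam):
    """<S, X>_F - log det(X) + lam * ||X||_1, for symmetric positive definite X."""
    sign, logdet = np.linalg.slogdet(X)   # slogdet avoids overflow in det(X)
    assert sign > 0, "X must be positive definite"
    return np.sum(S * X) - logdet + lam * np.abs(X).sum()
```
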
6 Sparse principal components
We have a sample covariance matrix S that is estimated from a number of observations of some underlying random vector.
The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.
It is often of interest to find a sparse principal component: an approximation to the leading eigenvector that contains few nonzeros.
An explicit optimization formulation of this problem is

  max_{v ∈ ℝⁿ} v^T S v  s.t.  ‖v‖_2 = 1, ‖v‖_0 ≤ k,

where ‖·‖_0 indicates the cardinality of v (that is, the number of nonzeros in v) and k is a user-defined parameter indicating a bound on the cardinality of v.
A relaxation: replace vv^T by a positive semidefinite proxy M ∈ S^{n×n}:

  max_{M ∈ S^{n×n}} ⟨S, M⟩_F  s.t.  M ≽ 0, ⟨I, M⟩_F = 1, ‖M‖_1 ≤ ρ,

for some parameter ρ > 0 that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite programming problem.
More generally, one needs to find the leading r > 1 sparse principal components: approximations to the leading r eigenvectors that also contain few nonzeros.
Ideally, we would obtain these from a matrix V ∈ ℝ^{n×r} whose columns are mutually orthogonal and have at most k nonzeros each.
The optimization formulation is

  max_{V ∈ ℝ^{n×r}} ⟨S, VV^T⟩_F  s.t.  V^T V = I, ‖v_i‖_0 ≤ k, i = 1, ..., r.
We can write a convex relaxation of this problem, once again a semidefinite program, as

  max_{M ∈ S^{n×n}} ⟨S, M⟩_F  s.t.  0 ≼ M ≼ I, ⟨I, M⟩_F = r, ‖M‖_1 ≤ ρ.
A more compact (but nonconvex) formulation is

  max_{F ∈ ℝ^{n×r}} ⟨S, FF^T⟩_F  s.t.  ‖F‖_2 ≤ 1, ‖F‖_{2,1} ≤ R,

where ‖F‖_{2,1} = Σ_{i=1}^n ‖F_i‖_2, the sum of the ℓ2 norms of the rows F_i of F.
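One simple heuristic for the cardinality-constrained problem (not from the slides) is a truncated power method: ordinary power iteration, with all but the k largest-magnitude entries zeroed at every step, so the iterate always satisfies both constraints:

```python
import numpy as np

def truncated_power_method(S, k, steps=100):
    """Heuristic for max v^T S v s.t. ||v||_2 = 1, ||v||_0 <= k."""
    n = S.shape[0]
    v = np.ones(n) / np.sqrt(n)
    for _ in range(steps):
        w = S @ v
        idx = np.argsort(np.abs(w))[:-k]   # indices of all but the k largest entries
        w[idx] = 0.0                       # hard truncation enforces ||v||_0 <= k
        v = w / np.linalg.norm(w)          # renormalize to the unit sphere
    return v
```
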
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed n × p matrix Y into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is

  min_{M,S} ‖M‖_* + λ‖S‖_1  s.t.  Y = M + S,

where ‖S‖_1 = Σ_{i,j} |S_ij|.
Compact nonconvex formulations that allow noise in the observations include the following (with L ∈ ℝ^{n×r}, R ∈ ℝ^{p×r}, and S belonging to a set 𝒮 of sparse n × p matrices). Fully observed:

  min_{L,R,S} (1/2) ‖LR^T + S − Y‖_F².
Partially observed:

  min_{L,R,S} (1/2) ‖P_Φ(LR^T + S − Y)‖_F²,

where Φ represents the locations of the observed entries of Y and P_Φ is projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.
Another application is to foreground-background separation in video processing. Here each column of Y represents the pixels in one frame of video, whereas each row of Y shows the evolution of one pixel over time.
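The projection P_Φ is just a componentwise mask, as this sketch shows (the helper name `masked_loss` and the synthetic data are mine):

```python
import numpy as np

def masked_loss(L, R, S, Y, mask):
    """(1/2) || P_Phi(L R^T + S - Y) ||_F^2; P_Phi zeroes the unobserved entries."""
    resid = (L @ R.T + S - Y) * mask   # mask: 1.0 where observed, 0.0 elsewhere
    return 0.5 * np.sum(resid ** 2)

# Synthetic check: Y is exactly low-rank plus one "outlier" entry.
rng = np.random.default_rng(6)
n, p, r = 5, 4, 2
L0 = rng.standard_normal((n, r)); R0 = rng.standard_normal((p, r))
S0 = np.zeros((n, p)); S0[0, 0] = 10.0
Y = L0 @ R0.T + S0
mask = (rng.random((n, p)) < 0.7).astype(float)
```

At the true decomposition the masked loss vanishes; omitting the sparse term leaves the outlier's full contribution in the residual.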
8 Subspace identification
In this application the a_j ∈ ℝⁿ, j = 1, ..., m, are vectors that lie (approximately) in a low-dimensional subspace.
The aim is to identify this subspace, expressed as the column subspace of a matrix X ∈ ℝ^{n×r}.
If the a_j are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the n × m matrix A = [a_1 ⋯ a_m] and take X to be the leading r left singular vectors.
In interesting variants of this problem, however, the vectors a_j may be arriving in streaming fashion and may be only partly observed, for example in indices Φ_j ⊂ {1, 2, ..., n}. We would thus need to identify a matrix X and vectors s_j ∈ ℝ^r such that

  P_{Φ_j}(a_j − X s_j) ≈ 0,  j = 1, ..., m.
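In the fully observed case, the SVD recipe is a few lines of numpy (synthetic data; since the recovered X spans the true subspace, the two orthogonal projectors agree):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 8, 50, 2
basis = np.linalg.qr(rng.standard_normal((n, r)))[0]  # orthonormal basis of true subspace
A = basis @ rng.standard_normal((r, m))               # columns a_j lie in that subspace

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = U[:, :r]                                          # leading r left singular vectors

P_hat = X @ X.T          # projector onto the estimated subspace
P_true = basis @ basis.T # projector onto the true subspace
```
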
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.
This problem takes as input data (a_j, y_j) with a_j ∈ ℝⁿ and y_j ∈ {−1, 1}, and seeks a vector x ∈ ℝⁿ and a scalar β ∈ ℝ such that

  a_j^T x − β ≥ 1 when y_j = 1;
  a_j^T x − β ≤ −1 when y_j = −1.
Any pair (x, β) that satisfies these conditions defines the separating hyperplane a^T x = β in ℝⁿ that separates the "positive" cases {a_j | y_j = 1} from the "negative" cases {a_j | y_j = −1}.
Among all separating hyperplanes, the one that minimizes ‖x‖_2 is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point a_j of either class is greatest. (The distance between the hyperplanes a^T x = β ± 1 is 2/‖x‖_2.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form

  H(x, β) = (1/m) Σ_{j=1}^m max{1 − y_j(a_j^T x − β), 0} ≥ 0.
Note that the jth term in this summation is zero if the conditions above are satisfied and positive otherwise; min_{x,β} H(x, β) = 0 means that a separating hyperplane exists.
Regularized version:

  H_λ(x, β) = (1/m) Σ_{j=1}^m max{1 − y_j(a_j^T x − β), 0} + (λ/2) ‖x‖_2².
If λ is sufficiently small (but positive), and if separating hyperplanes exist, the pair (x, β) that minimizes H_λ(x, β) is the maximum-margin separating hyperplane. (Consider lim_{x→x₀} [H(x, β) − H(x₀, β₀)] / (‖x₀‖_2² − ‖x‖_2²).)
The maximum-margin property is consistent with the goals of generalizability and robustness.
Linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing H_λ(x, β) can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables s_j, j = 1, ..., m, to represent the residual terms. Then

  min_{x,β,s} (1/m) 1^T s + (λ/2) ‖x‖_2²
  subject to s_j ≥ 1 − y_j(a_j^T x − β), s_j ≥ 0, j = 1, ..., m,

where 1 = [1 1 ⋯ 1]^T ∈ ℝ^m.
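A direct evaluation of H_λ (function name and the tiny one-dimensional data set are mine) makes the hinge structure explicit:

```python
import numpy as np

def svm_objective(x, beta, A, y, lam):
    """H_lambda(x, beta) = (1/m) sum_j max{1 - y_j(a_j^T x - beta), 0} + (lam/2)||x||_2^2."""
    margins = y * (A @ x - beta)           # signed margins y_j (a_j^T x - beta)
    hinge = np.maximum(1.0 - margins, 0.0)  # zero exactly when the margin is >= 1
    return hinge.mean() + 0.5 * lam * (x @ x)
```

On separable 1-D data with x = 1, β = 0, every margin is at least 1 and the unregularized objective is exactly zero.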
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
One solution is to transform all of the raw data vectors a_j by a mapping ζ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors ζ(a_j), j = 1, ..., m. Then the conditions

  ζ(a_j)^T x − β ≥ 1 when y_j = 1;
  ζ(a_j)^T x − β ≤ −1 when y_j = −1

lead to

  H_{ζ,λ}(x, β) = (1/m) Σ_{j=1}^m max{1 − y_j(ζ(a_j)^T x − β), 0} + (λ/2) ‖x‖_2².
When transformed back to ℝⁿ, the surface {a | ζ(a)^T x − β = 0} is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing H_λ(x, β).
The problem of minimizing H_{ζ,λ}(x, β) can likewise be written as a convex quadratic program by introducing variables s_j, j = 1, ..., m, to represent the residual terms. Then

  min_{x,β,s} (1/m) 1^T s + (λ/2) ‖x‖_2²
  subject to s_j ≥ 1 − y_j(ζ(a_j)^T x − β), s_j ≥ 0, j = 1, ..., m,

where 1 = [1 1 ⋯ 1]^T ∈ ℝ^m.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (Convex Quadratic Functions).
The dual problem, in m variables, is

  min_{z ∈ ℝ^m} (1/2) z^T Q z − 1^T z  subject to  0 ≤ z ≤ 1/(mλ), y^T z = 0,

where Q_kl = y_k y_l ζ(a_k)^T ζ(a_l).
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping ζ. We need only a technique to define the elements of Q. This can be done with the use of a kernel function K : ℝⁿ × ℝⁿ → ℝ, where K(a_k, a_l) replaces ζ(a_k)^T ζ(a_l). This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel,

  K(a_k, a_l) = exp(−‖a_k − a_l‖² / (2σ)),

where σ is a positive parameter.
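Building the Gaussian Gram matrix and the dual Hessian Q is straightforward (helper names are mine; the kernel uses the 2σ scaling written above):

```python
import numpy as np

def gaussian_kernel_matrix(A, sigma):
    """K[k, l] = exp(-||a_k - a_l||^2 / (2*sigma)) for the rows a_k of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma))

def dual_Q(A, y, sigma):
    """Q_kl = y_k y_l K(a_k, a_l): the Hessian of the SVM dual objective."""
    K = gaussian_kernel_matrix(A, sigma)
    return (y[:, None] * y[None, :]) * K
```
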
10 Logistic regression
We seek an "odds function" p ∈ (0, 1), parametrized by a vector x ∈ ℝⁿ as follows:

  p(a; x) = (1 + exp(a^T x))⁻¹,

and aim to choose the parameter x so that

  p(a_j; x) ≈ 1 when y_j = 1;
  p(a_j; x) ≈ 0 when y_j = −1.
The optimal value of x can be found by maximizing a log-likelihood function:

  L(x) = (1/m) [ Σ_{j: y_j=−1} log(1 − p(a_j; x)) + Σ_{j: y_j=1} log p(a_j; x) ].
We can perform feature selection using this model by introducing a regularizer:

  max_x (1/m) [ Σ_{j: y_j=−1} log(1 − p(a_j; x)) + Σ_{j: y_j=1} log p(a_j; x) ] − λ‖x‖_1,

where λ > 0 is a regularization parameter.
The regularization term λ‖x‖_1 has the effect of producing a solution in which few components of x are nonzero, making it possible to evaluate p(a; x) by knowing only those components of a that correspond to the nonzeros in x.
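The odds function and the (unregularized) log-likelihood translate directly into code (function names and data are mine; note the sign convention above, under which p ≈ 1 when a^T x is very negative):

```python
import numpy as np

def p(a, x):
    """Odds function p(a; x) = (1 + exp(a^T x))^(-1)."""
    return 1.0 / (1.0 + np.exp(a @ x))

def loglik(A, y, x):
    """(1/m) [ sum_{y_j=-1} log(1 - p(a_j; x)) + sum_{y_j=1} log p(a_j; x) ]."""
    probs = p(A, x)   # rows of A are the a_j^T
    terms = np.where(y == 1, np.log(probs), np.log(1.0 - probs))
    return terms.mean()
```

At x = 0, every p(a_j; x) = 1/2, so the log-likelihood equals −log 2 regardless of the labels.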
Multiclass (or multinomial) logistic regression arises when the data vectors a_j belong to more than two classes. Assume M classes in total. We need a distinct odds function p_k ∈ (0, 1) for each class.
These M functions are parametrized by vectors x_[k] ∈ ℝⁿ, k = 1, ..., M, defined as follows:

  p_k(a; X) = exp(a^T x_[k]) / Σ_{ℓ=1}^M exp(a^T x_[ℓ]),  k = 1, ..., M,

where X = {x_[k] | k = 1, ..., M}.
Note that for all a and for all k we have

  p_k(a; X) ∈ (0, 1),  Σ_{k=1}^M p_k(a; X) = 1.
If one of the inner products {a^T x_[ℓ]}_{ℓ=1}^M dominates the others, that is, a^T x_[k] ≫ a^T x_[ℓ] for all ℓ ≠ k, then

  p_k(a; X) ≈ 1 and p_ℓ(a; X) ≈ 0 for ℓ ≠ k.
In the setting of multiclass logistic regression, the labels y_j are vectors in ℝ^M whose elements are defined as follows:

  y_jk = 1 when a_j belongs to class k, and y_jk = 0 otherwise.
We seek to define the vectors x_[k] so that

  p_k(a_j; X) ≈ 1 when y_jk = 1;  p_k(a_j; X) ≈ 0 when y_jk = 0.
The problem of finding values of x_[k] that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

  L(X) = (1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x_[ℓ]^T a_j) − log Σ_{ℓ=1}^M exp(x_[ℓ]^T a_j) ].
Group-sparse regularization terms can again be added to perform feature selection across the classes.
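The log-likelihood above can be evaluated stably with the usual log-sum-exp shift (the array layout and names are my choice):

```python
import numpy as np

def multiclass_loglik(X_params, A, Y):
    """L(X) = (1/m) sum_j [ sum_l Y[j,l]*(x_l^T a_j) - log sum_l exp(x_l^T a_j) ].
    X_params: (M, n) stacked class vectors; A: (m, n) data; Y: (m, M) one-hot labels."""
    Z = A @ X_params.T                      # Z[j, l] = x_l^T a_j
    zmax = Z.max(axis=1, keepdims=True)     # shift for a stable log-sum-exp
    lse = zmax[:, 0] + np.log(np.exp(Z - zmax).sum(axis=1))
    return np.mean((Y * Z).sum(axis=1) - lse)
```

With all parameters zero and M = 2, every term is −log 2; well-separated parameters push the log-likelihood toward 0 from below.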
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector a into one of M possible classes, where M ≥ 2 is large in some key applications.
The difference is that the data vector a undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference:
Catherine F. Higham and Desmond J. Higham, "Deep Learning: An Introduction for Applied Mathematicians", SIAM Review, 2019.
The data vector a_j enters at the bottom of the network, each node in the bottom layer corresponding to one component of a_j.
Deep neural network showing connections between adjacent layers
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector a_j^{l−1} at layer l − 1 to the vector a_j^l at layer l, is

  a_j^l = σ(W^l a_j^{l−1} + g^l),  l = 1, 2, ..., D,

where W^l is a matrix of dimension |a_j^l| × |a_j^{l−1}|, g^l is a vector of length |a_j^l|, σ is a componentwise nonlinear transformation, and D is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix W^l. Define a_j^0 to be the "raw" input vector a_j, and let a_j^D be the vector formed by the nodes at the topmost hidden layer.
Typical forms of the function σ include the following, acting identically on each component t ∈ ℝ of its input vector:
(1) Logistic function: t ↦ 1/(1 + e⁻ᵗ).
(2) Rectified Linear Unit (ReLU): t ↦ max(t, 0).
(3) Bernoulli: a random function that outputs 1 with probability 1/(1 + e⁻ᵗ) and 0 otherwise.
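The two deterministic activations above are one-liners in numpy:

```python
import numpy as np

def logistic(t):
    """Componentwise logistic activation: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def relu(t):
    """Componentwise Rectified Linear Unit: max(t, 0)."""
    return np.maximum(t, 0.0)
```
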
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs (W^l, g^l), l = 1, 2, ..., D, that transform the input vector a_j into its form a_j^D at the topmost hidden layer, together with the parameters X of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation w for the hidden-layer transformations, that is,

  w = (W^1, g^1, W^2, g^2, ..., W^D, g^D),

and defining X = {x_[k] | k = 1, 2, ..., M}, we can write the loss function for deep learning as follows:
  L(w, X) = (1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x_[ℓ]^T a_j^D(w)) − log Σ_{ℓ=1}^M exp(x_[ℓ]^T a_j^D(w)) ].
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, ..., m.
The "landscape" of L is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in (w, X) is usually very large.
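A minimal sketch of the forward pass and of L(w, X) (my own helper names; σ is taken to be the ReLU, and with w empty the loss reduces to the multiclass logistic log-likelihood of the previous section):

```python
import numpy as np

def forward(w, a):
    """Apply a^l = sigma(W^l a^(l-1) + g^l) for each hidden layer; sigma = ReLU."""
    for W, g in w:
        a = np.maximum(W @ a + g, 0.0)
    return a

def deep_loss(w, X_params, A_cols, Y):
    """L(w, X): the multiclass log-likelihood applied to a_j^D = forward(w, a_j)."""
    total = 0.0
    for a, y in zip(A_cols, Y):
        aD = forward(w, a)
        z = X_params @ aD                  # z[l] = x_l^T a^D
        zmax = z.max()                     # stable log-sum-exp
        lse = zmax + np.log(np.exp(z - zmax).sum())
        total += y @ z - lse
    return total / len(A_cols)
```

Setting w = [] (D = 0 hidden layers) recovers multiclass logistic regression exactly, matching the special case noted above.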
Statistically when the observations yj are contaminated with iidGaussian noise the least squares solution x is the maximumlikelihood estimate
Impose desirable structure on x
(1) Tikhonov regularization with a squared ℓ2-norm
minx
1
2m983042Axminus y98304222 + λ983042x98304222 λ gt 0
yields a solution x with less sensitivity to perturbations in thedata (aj yj)
(2) LASSO formulation
minx
1
2m983042Axminus y98304222 + λ983042x9830421 λ gt 0
tends to yield solutions x that are sparse that is containingrelatively few nonzero components
Optimization Formulations Lecture 5 March 18 - 25 2020 6 33
ℓ1 norm promotes sparsity compared with ℓ2 norm
LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features
Optimization Formulations Lecture 5 March 18 - 25 2020 7 33
3 Matrix completion
Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves
minX
1
2m
m983131
j=1
(〈Aj X〉F minus yj)2
where〈AB〉F = tr(ATB)
A regularized version leading to solutions X that are low-rank is
minX
1
2m
m983131
j=1
(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0
where 983042X983042lowast is the nuclear norm (the sum of singular values of X)
Optimization Formulations Lecture 5 March 18 - 25 2020 8 33
Rank-r version: let X = LRᵀ with L ∈ ℝ^{n×r}, R ∈ ℝ^{p×r}, and r ≪ min(n, p). We solve

min_{L,R} (1/(2m)) Σ_{j=1}^m (⟨A_j, LRᵀ⟩_F − y_j)².

In this formulation the rank r is "hard-wired" into the definition of X via the two "thin-tall" matrices L and R.

The objective function is nonconvex.

Advantage: the total number of elements in L and R is (n + p)r, which is much less than np.
Minimum-rank version:

min_X rank(X)  s.t.  ⟨A_j, X⟩_F = y_j, j = 1, …, m.
4 Nonnegative matrix factorization
Applications in computer vision, chemometrics, and document clustering require us to find factors L ∈ ℝ^{n×r} and R ∈ ℝ^{p×r} with all elements nonnegative.

If the full matrix Y ∈ ℝ^{n×p} is observed, this problem has the form

min_{L,R} ‖LRᵀ − Y‖²_F  subject to  L ≥ 0, R ≥ 0.
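One classical heuristic for this problem is the multiplicative update scheme of Lee and Seung, which preserves nonnegativity at every step. A sketch under that choice (NumPy assumed; this is one of several algorithms for NMF, not the only one):

```python
import numpy as np

def nmf(Y, r, iters=500, eps=1e-9):
    """Approximately minimize ||L R^T - Y||_F^2 subject to L, R >= 0 using
    multiplicative updates; each update keeps the factors nonnegative."""
    n, p = Y.shape
    rng = np.random.default_rng(0)
    L = rng.random((n, r)) + 0.1               # strictly positive init
    R = rng.random((p, r)) + 0.1
    for _ in range(iters):
        L *= (Y @ R) / (L @ (R.T @ R) + eps)   # update for Y ~ L R^T
        R *= (Y.T @ L) / (R @ (L.T @ L) + eps)
    return L, R

# Exactly factorable nonnegative data: the residual should become small.
rng = np.random.default_rng(2)
Y = rng.random((12, 3)) @ rng.random((3, 10))
L, R = nmf(Y, r=3)
rel_err = np.linalg.norm(L @ R.T - Y) / np.linalg.norm(Y)
```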
5 Sparse inverse covariance estimation
In this problem the labels y_j are null, and the vectors a_j ∈ ℝⁿ are viewed as independent observations of a random vector a ∈ ℝⁿ which has zero mean.

The sample covariance matrix constructed from these observations is

S = (1/(m − 1)) Σ_{j=1}^m a_j a_jᵀ.
The element S_il is an estimate of the covariance between the ith and lth elements of the random vector a.

Our interest is in calculating an estimate X of the inverse covariance matrix that is sparse.

The structure of X yields important information about a. In particular, if X_il = 0, we can conclude that the i and l components of a are conditionally independent. (That is, they are independent given knowledge of the values of the other n − 2 components of a.)
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix X is the following:

min_{X ∈ S^{n×n}, X ≻ 0} ⟨S, X⟩_F − log det(X) + λ‖X‖₁,  λ > 0,

where S^{n×n} is the set of n×n symmetric matrices, X ≻ 0 indicates that X is positive definite, and ‖X‖₁ = Σ_{i,l=1}^n |X_il|.
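A small numerical sanity check of this objective (NumPy assumed; `objective` is an illustrative name): with λ = 0, setting the gradient S − X⁻¹ to zero shows that the unregularized minimizer is X = S⁻¹, at which the objective equals n + log det S.

```python
import numpy as np

def objective(S, X, lam):
    """<S, X>_F - log det(X) + lam*||X||_1 for symmetric positive definite X.
    slogdet is used for numerical stability; sign > 0 certifies det(X) > 0."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()

# Sample covariance from m zero-mean observations of a random vector in R^n.
rng = np.random.default_rng(3)
n, m = 4, 200
A = rng.standard_normal((m, n))            # rows are the observations a_j
S = A.T @ A / (m - 1)
# With lam = 0 the unregularized minimizer is X = S^{-1}.
X_star = np.linalg.inv(S)
f_star = objective(S, X_star, lam=0.0)
f_eye = objective(S, np.eye(n), lam=0.0)   # any other PD matrix does worse
```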
6 Sparse principal components
We have a sample covariance matrix S that is estimated from a number of observations of some underlying random vector.

The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.

It is often of interest to find a sparse approximation to the leading eigenvector, that is, an approximation that contains few nonzeros.
An explicit optimization formulation of this problem is

max_{v ∈ ℝⁿ} vᵀSv  s.t.  ‖v‖₂ = 1, ‖v‖₀ ≤ k,

where ‖·‖₀ indicates the cardinality of v (that is, the number of nonzeros in v) and k is a user-defined parameter indicating a bound on the cardinality of v.
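Because of the ℓ0 constraint this problem is intractable in general; a common heuristic is truncated power iteration, which interleaves an ordinary power step with hard-thresholding to the k largest-magnitude entries. A sketch, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def truncated_power_iteration(S, k, iters=100):
    """Heuristic for max v^T S v s.t. ||v||_2 = 1, ||v||_0 <= k:
    power iteration, keeping only the k largest-magnitude entries
    after each step and renormalizing."""
    n = S.shape[0]
    v = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        v = S @ v
        keep = np.argsort(np.abs(v))[-k:]      # indices of k largest entries
        mask = np.zeros(n)
        mask[keep] = 1.0
        v = v * mask
        v /= np.linalg.norm(v)
    return v

# Covariance whose leading eigenvector is supported on coordinates 0 and 1
# (the 2x2 block [[5,4],[4,5]] has leading eigenpair 9, (1,1)/sqrt(2)).
S = np.diag([1.0] * 6)
S[0, 0] = S[1, 1] = 5.0
S[0, 1] = S[1, 0] = 4.0
v = truncated_power_iteration(S, k=2)
```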
A relaxation: replace vvᵀ by a positive semidefinite proxy M ∈ S^{n×n}:

max_{M ∈ S^{n×n}} ⟨S, M⟩_F  s.t.  M ⪰ 0, ⟨I, M⟩_F = 1, ‖M‖₁ ≤ ρ,

for some parameter ρ > 0 that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite programming problem.
More generally, one may need to find the leading r > 1 sparse principal components, approximations to the leading r eigenvectors that also contain few nonzeros.

Ideally we would obtain these from a matrix V ∈ ℝ^{n×r} whose columns are mutually orthogonal and have at most k nonzeros each.
The optimization formulation is

max_{V ∈ ℝ^{n×r}} ⟨S, VVᵀ⟩_F  s.t.  VᵀV = I, ‖v_i‖₀ ≤ k, i = 1, …, r.
We can write a convex relaxation of this problem, once again a semidefinite program, as

max_{M ∈ S^{n×n}} ⟨S, M⟩_F  s.t.  0 ⪯ M ⪯ I, ⟨I, M⟩_F = r, ‖M‖₁ ≤ ρ.
A more compact (but nonconvex) formulation is

max_{F ∈ ℝ^{n×r}} ⟨S, FFᵀ⟩_F  s.t.  ‖F‖₂ ≤ 1, ‖F‖_{2,1} ≤ R,

where ‖F‖_{2,1} = Σ_{i=1}^n ‖F_{i·}‖₂ is the sum of the ℓ2 norms of the rows of F.
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed n×p matrix Y into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is

min_{M,S} ‖M‖_* + λ‖S‖₁  s.t.  Y = M + S,

where ‖S‖₁ = Σ_{i,j} |S_ij|.
Compact nonconvex formulations that allow noise in the observations include the following, with L ∈ ℝ^{n×r}, R ∈ ℝ^{p×r}, and S ∈ ℝ^{n×p} the sparse factor. Fully observed:

min_{L,R,S} (1/2) ‖LRᵀ + S − Y‖²_F.
Partially observed:

min_{L,R,S} (1/2) ‖P_Φ(LRᵀ + S − Y)‖²_F,

where Φ represents the locations of the observed entries of Y and P_Φ is projection onto this set.
One application of these formulations is robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.

Another application is foreground-background separation in video processing. Here each column of Y represents the pixels in one frame of video, whereas each row of Y shows the evolution of one pixel over time.
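A rough alternating heuristic for the fully observed nonconvex formulation (a sketch only; the convex program above comes with stronger guarantees) fits the low-rank part by a rank-r truncated SVD and the sparse part by entrywise soft-thresholding. NumPy assumed; function names are illustrative:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_plus_lowrank(Y, r, lam, iters=50):
    """Alternate between the two blocks of Y ~ M + S: given S, set M to the
    rank-r truncated SVD of Y - S; given M, shrink Y - M entrywise."""
    S = np.zeros_like(Y)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y - S, full_matrices=False)
        M = (U[:, :r] * s[:r]) @ Vt[:r]        # best rank-r fit to Y - S
        S = soft(Y - M, lam)                   # prox step on the sparse part
    return M, S

rng = np.random.default_rng(4)
M_true = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
S_true = np.zeros((20, 15))
S_true[rng.random((20, 15)) < 0.05] = 10.0     # sparse large "outliers"
Y = M_true + S_true
M, S = sparse_plus_lowrank(Y, r=2, lam=1.0)
# By construction the final residual Y - M - S is clipped to [-lam, lam].
```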
8 Subspace identification
In this application the a_j ∈ ℝⁿ, j = 1, …, m, are vectors that lie (approximately) in a low-dimensional subspace.

The aim is to identify this subspace, expressed as the column subspace of a matrix X ∈ ℝ^{n×r}.
If the a_j are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the n×m matrix A = [a_1, a_2, …, a_m] and take the columns of X to be the leading r left singular vectors.
In interesting variants of this problem, however, the vectors a_j may arrive in streaming fashion and may be only partly observed, for example in indices Φ_j ⊂ {1, 2, …, n}. We would thus need to identify a matrix X and vectors s_j ∈ ℝʳ such that

P_{Φ_j}(a_j − X s_j) ≈ 0,  j = 1, …, m.
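For a single partially observed vector and a fixed candidate basis X, the coefficient vector s_j is a small least-squares problem over the observed rows of X. A sketch, assuming NumPy and that enough entries are observed (at least r, generically):

```python
import numpy as np

def coefficients_from_partial(a, X, idx):
    """Given a basis X in R^{n x r} and values of a only at indices idx,
    find s minimizing ||P_idx(a - X s)||_2 by least squares restricted
    to the observed rows."""
    s, *_ = np.linalg.lstsq(X[idx, :], a[idx], rcond=None)
    return s

rng = np.random.default_rng(5)
n, r = 10, 2
X = np.linalg.qr(rng.standard_normal((n, r)))[0]   # orthonormal basis
s_true = np.array([1.5, -2.0])
a = X @ s_true                                     # a lies in the subspace
idx = np.array([0, 2, 3, 7, 9])                    # only these entries seen
s = coefficients_from_partial(a, X, idx)           # recovers s_true exactly
```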
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.

This problem takes as input data (a_j, y_j) with a_j ∈ ℝⁿ and y_j ∈ {−1, 1}, and seeks a vector x ∈ ℝⁿ and a scalar β ∈ ℝ such that

a_jᵀx − β ≥ 1 when y_j = 1;
a_jᵀx − β ≤ −1 when y_j = −1.

Any pair (x, β) that satisfies these conditions defines the separating hyperplane aᵀx = β in ℝⁿ that separates the "positive" cases {a_j | y_j = 1} from the "negative" cases {a_j | y_j = −1}.

Among all separating hyperplanes, the one that minimizes ‖x‖₂ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point a_j of either class is greatest. (The distance between the planes aᵀx = β ± 1 is 2/‖x‖₂.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form

H(x, β) = (1/m) Σ_{j=1}^m max{1 − y_j(a_jᵀx − β), 0} ≥ 0.

Note that the jth term in this summation is zero if the conditions above are satisfied and positive otherwise; hence min_{x,β} H(x, β) = 0 indicates the existence of a separating hyperplane.
Regularized version:

H_λ(x, β) = (1/m) Σ_{j=1}^m max{1 − y_j(a_jᵀx − β), 0} + (λ/2)‖x‖₂².

If λ is sufficiently small (but positive) and if separating hyperplanes exist, the pair (x, β) that minimizes H_λ(x, β) is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability and robustness.

Figure: Linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing H_λ(x, β) can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables s_j, j = 1, …, m, to represent the residual terms:

min_{x,β,s} (1/m) 1ᵀs + (λ/2)‖x‖₂²
subject to  s_j ≥ 1 − y_j(a_jᵀx − β), s_j ≥ 0, j = 1, …, m,

where 1 = (1, 1, …, 1)ᵀ ∈ ℝᵐ.
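Instead of forming the quadratic program, one can also minimize H_λ directly by subgradient descent on the hinge terms. A sketch, assuming NumPy (the step length and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

def svm_subgradient(A, y, lam, iters=2000, step=0.1):
    """Minimize H_lam(x, beta) = (1/m) sum_j max(1 - y_j(a_j^T x - beta), 0)
    + (lam/2)||x||_2^2 by subgradient descent. For each active hinge term
    the subgradient contributes -y_j a_j in x and +y_j in beta."""
    m, n = A.shape
    x, beta = np.zeros(n), 0.0
    for _ in range(iters):
        margins = y * (A @ x - beta)
        active = margins < 1.0                 # terms with nonzero hinge
        gx = -(A[active].T @ y[active]) / m + lam * x
        gb = y[active].sum() / m
        x -= step * gx
        beta -= step * gb
    return x, beta

# Two well-separated clouds in R^2.
rng = np.random.default_rng(6)
pos = rng.standard_normal((30, 2)) + np.array([3.0, 3.0])
neg = rng.standard_normal((30, 2)) - np.array([3.0, 3.0])
A = np.vstack([pos, neg])
y = np.concatenate([np.ones(30), -np.ones(30)])
x, beta = svm_subgradient(A, y, lam=0.01)
acc = np.mean(np.sign(A @ x - beta) == y)      # training accuracy
```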
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should such cases be treated?
One solution is to transform all of the raw data vectors a_j by a mapping ζ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors ζ(a_j), j = 1, …, m. The conditions

ζ(a_j)ᵀx − β ≥ 1 when y_j = 1;
ζ(a_j)ᵀx − β ≤ −1 when y_j = −1

lead to

H_{ζ,λ}(x, β) = (1/m) Σ_{j=1}^m max{1 − y_j(ζ(a_j)ᵀx − β), 0} + (λ/2)‖x‖₂².
When transformed back to ℝⁿ, the surface {a | ζ(a)ᵀx − β = 0} is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing H_λ(x, β).
The problem of minimizing H_{ζ,λ}(x, β) can likewise be written as a convex quadratic program by introducing variables s_j, j = 1, …, m, to represent the residual terms:

min_{x,β,s} (1/m) 1ᵀs + (λ/2)‖x‖₂²
subject to  s_j ≥ 1 − y_j(ζ(a_j)ᵀx − β), s_j ≥ 0, j = 1, …, m,

where 1 = (1, 1, …, 1)ᵀ ∈ ℝᵐ.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (convex quadratic functions).
The dual problem, in m variables, is

min_{z ∈ ℝᵐ} (1/2) zᵀQz − 1ᵀz  subject to  0 ≤ z ≤ (1/(mλ)) 1, yᵀz = 0,

where Q_kl = y_k y_l ζ(a_k)ᵀζ(a_l).
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping ζ. We need only a technique to define the elements of Q. This can be done with a kernel function K : ℝⁿ × ℝⁿ → ℝ, where K(a_k, a_l) replaces ζ(a_k)ᵀζ(a_l). This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel,

K(a_k, a_l) = exp(−‖a_k − a_l‖² / (2σ)),

where σ is a positive parameter.
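The dual matrix Q can then be assembled directly from the kernel, with no feature map ζ ever formed. A sketch, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def gaussian_kernel_Q(A, y, sigma):
    """Build Q_kl = y_k y_l K(a_k, a_l) for the dual SVM, with the Gaussian
    kernel K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2*sigma))."""
    sq = np.sum(A**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T   # pairwise squared distances
    K = np.exp(-d2 / (2 * sigma))
    return (y[:, None] * y[None, :]) * K

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
Q = gaussian_kernel_Q(A, y, sigma=1.0)
# Q is symmetric with unit diagonal (K(a,a) = 1, y_k^2 = 1) and positive
# semidefinite, so the dual objective (1/2) z^T Q z - 1^T z is convex.
```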
10 Logistic regression
We seek an "odds function" p, taking values in (0, 1) and parametrized by a vector x ∈ ℝⁿ as follows:

p(a; x) = (1 + exp(aᵀx))⁻¹,

and aim to choose the parameter x so that

p(a_j; x) ≈ 1 when y_j = 1;
p(a_j; x) ≈ 0 when y_j = −1.
The optimal value of x can be found by maximizing a log-likelihood function:

L(x) = (1/m) [ Σ_{j: y_j = −1} log(1 − p(a_j; x)) + Σ_{j: y_j = 1} log p(a_j; x) ].
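A sketch of maximizing L(x) by plain gradient ascent under this sign convention (NumPy assumed; the gradient formula in the comment follows from differentiating the two sums, and the step length and iteration count are illustrative):

```python
import numpy as np

def p(A, x):
    # Odds function of the text: p(a; x) = 1 / (1 + exp(a^T x)).
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(A, y, x):
    pj = p(A, x)
    return (np.log(1 - pj[y == -1]).sum() + np.log(pj[y == 1]).sum()) / len(y)

def fit(A, y, iters=500, step=0.5):
    """Gradient ascent on L(x). With this convention,
    grad L = (1/m) sum_j c_j a_j, where c_j = p_j for y_j = -1
    and c_j = -(1 - p_j) for y_j = 1."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        pj = p(A, x)
        c = np.where(y == -1, pj, -(1.0 - pj))
        x += step * (A.T @ c) / m
    return x

# With this convention, p ~ 1 requires a^T x << 0, so label the cloud with
# negative coordinates as the y = 1 class.
rng = np.random.default_rng(8)
A = np.vstack([rng.standard_normal((40, 2)) - 2.0,
               rng.standard_normal((40, 2)) + 2.0])
y = np.concatenate([np.ones(40), -np.ones(40)])
x = fit(A, y)
acc = np.mean((p(A, x) > 0.5) == (y == 1))
```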
We can perform feature selection using this model by introducing a regularizer:

max_x (1/m) [ Σ_{j: y_j = −1} log(1 − p(a_j; x)) + Σ_{j: y_j = 1} log p(a_j; x) ] − λ‖x‖₁,

where λ > 0 is a regularization parameter.
The regularization term λ‖x‖₁ has the effect of producing a solution in which few components of x are nonzero, making it possible to evaluate p(a; x) by knowing only those components of a that correspond to the nonzeros in x.

Multiclass (or multinomial) logistic regression applies when the data vectors a_j belong to more than two classes. Assume M classes in total; we need a distinct odds function p_k ∈ (0, 1) for each class.
These M functions are parametrized by vectors x_[k] ∈ ℝⁿ, k = 1, …, M, defined as follows:

p_k(a; X) = exp(aᵀx_[k]) / Σ_{ℓ=1}^M exp(aᵀx_[ℓ]),  k = 1, …, M,

where X = {x_[k] | k = 1, …, M}.
Note that for all a and for all k we have

p_k(a; X) ∈ (0, 1)  and  Σ_{k=1}^M p_k(a; X) = 1.
If one of the inner products {aᵀx_[ℓ]}_{ℓ=1}^M dominates the others, that is, aᵀx_[k] ≫ aᵀx_[ℓ] for all ℓ ≠ k, then

p_k(a; X) ≈ 1  and  p_ℓ(a; X) ≈ 0 for ℓ ≠ k.
In the setting of multiclass logistic regression, the labels y_j are vectors in ℝᴹ whose elements are defined as follows:

y_jk = 1 when a_j belongs to class k, and y_jk = 0 otherwise.
We seek to define the vectors x_[k] so that

p_k(a_j; X) ≈ 1 when y_jk = 1;
p_k(a_j; X) ≈ 0 when y_jk = 0.
The problem of finding values of x_[k] that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

L(X) = (1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x_[ℓ]ᵀa_j) − log ( Σ_{ℓ=1}^M exp(x_[ℓ]ᵀa_j) ) ].
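The log-sum-exp term should be evaluated in a numerically stable way. A sketch of computing L(X) with the standard max-subtraction trick (NumPy assumed; the shapes and names are illustrative):

```python
import numpy as np

def multiclass_log_likelihood(A, Y, X):
    """L(X) = (1/m) sum_j [ sum_l Y_jl (x_l^T a_j) - log sum_l exp(x_l^T a_j) ]
    with A (m x n), one-hot labels Y (m x M), and X (n x M) whose columns are
    the vectors x_[k]. The row maximum is subtracted before exponentiating
    so that the log-sum-exp cannot overflow."""
    Z = A @ X                                  # Z_jl = x_[l]^T a_j
    zmax = Z.max(axis=1, keepdims=True)
    lse = zmax[:, 0] + np.log(np.exp(Z - zmax).sum(axis=1))
    return np.mean((Y * Z).sum(axis=1) - lse)

# Sanity check: with X = 0 every class has probability 1/M, so L = -log M.
m, n, M = 6, 4, 3
rng = np.random.default_rng(9)
A = rng.standard_normal((m, n))
Y = np.eye(M)[rng.integers(0, M, size=m)]      # random one-hot labels
L0 = multiclass_log_likelihood(A, Y, np.zeros((n, M)))
```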
Group-sparse regularization terms may also be added here, to select features jointly across all M classes.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector a into one of M possible classes, where M ≥ 2 is large in some key applications.

The difference is that the data vector a undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference:

Catherine F. Higham and Desmond J. Higham, "Deep Learning: An Introduction for Applied Mathematicians", SIAM Review, 2019.
The data vector a_j enters at the bottom of the network, each node in the bottom layer corresponding to one component of a_j.
Figure: Deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector a_j^{l−1} at layer l−1 to the vector a_j^l at layer l, is

a_j^l = σ(W^l a_j^{l−1} + g^l),  l = 1, 2, …, D,

where W^l is a matrix of dimension |a_j^l| × |a_j^{l−1}|, g^l is a vector of length |a_j^l|, σ is a componentwise nonlinear transformation, and D is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix W^l. Define a_j^0 to be the "raw" input vector a_j, and let a_j^D be the vector formed by the nodes at the topmost hidden layer.
Typical forms of the function σ include the following, acting identically on each component t ∈ ℝ of its input vector:

(1) Logistic function: t ↦ 1/(1 + e⁻ᵗ);

(2) Rectified Linear Unit (ReLU): t ↦ max(t, 0);

(3) Bernoulli: a random function that outputs 1 with probability 1/(1 + e⁻ᵗ) and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs (W^l, g^l), l = 1, 2, …, D, that transform the input vector a_j into its form a_j^D at the topmost hidden layer, together with the parameters X of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation w for the hidden-layer transformations, that is,

w = (W¹, g¹, W², g², …, W^D, g^D),

and defining X = {x_[k] | k = 1, 2, …, M}, we can write the loss function for deep learning (here a log-likelihood, to be maximized) as follows:

L(w, X) = (1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_jℓ (x_[ℓ]ᵀ a_j^D(w)) − log ( Σ_{ℓ=1}^M exp(x_[ℓ]ᵀ a_j^D(w)) ) ].
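Putting the pieces together, a sketch of evaluating this objective for a small network (NumPy assumed; ReLU is chosen for σ, and all sizes and names are illustrative):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(a, Ws, gs):
    """Compute a_j^D(w): pass a through the D hidden layers
    a^l = relu(W^l a^{l-1} + g^l)."""
    for W, g in zip(Ws, gs):
        a = relu(W @ a + g)
    return a

def deep_objective(A, Y, Ws, gs, X):
    """The deep-learning objective above: the multiclass logistic
    log-likelihood evaluated on the transformed vectors a_j^D(w).
    Each term is log of the softmax probability of the true class."""
    total = 0.0
    for a, yrow in zip(A, Y):
        h = forward(a, Ws, gs)                 # a_j^D(w)
        z = X.T @ h                            # inner products x_[l]^T a_j^D
        total += yrow @ z - np.log(np.exp(z - z.max()).sum()) - z.max()
    return total / len(A)

rng = np.random.default_rng(10)
n, M, D, width, m = 5, 3, 2, 4, 8
Ws = [rng.standard_normal((width, n))] + \
     [rng.standard_normal((width, width)) for _ in range(D - 1)]
gs = [rng.standard_normal(width) for _ in range(D)]
X = rng.standard_normal((width, M))            # top-layer logistic parameters
A = rng.standard_normal((m, n))
Y = np.eye(M)[rng.integers(0, M, size=m)]      # one-hot labels
val = deep_objective(A, Y, Ws, gs, X)          # always negative: log prob < 0
```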
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, …, m.

The "landscape" of L is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in (w, X) is usually very large.
ℓ1 norm promotes sparsity compared with ℓ2 norm
LASSO performs feature selection To gather a new data vector afor prediction we need to find only the ldquoselectedrdquo features
Optimization Formulations Lecture 5 March 18 - 25 2020 7 33
3 Matrix completion
Suppose Aj isin Rntimesp We seek X isin Rntimesp that solves
minX
1
2m
m983131
j=1
(〈Aj X〉F minus yj)2
where〈AB〉F = tr(ATB)
A regularized version leading to solutions X that are low-rank is
minX
1
2m
m983131
j=1
(〈Aj X〉F minus yj)2 + λ983042X983042lowast λ gt 0
where 983042X983042lowast is the nuclear norm (the sum of singular values of X)
Optimization Formulations Lecture 5 March 18 - 25 2020 8 33
Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve
minLR
1
2m
m983131
j=1
(〈Aj LRT〉F minus yj)
2
In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R
The objective function is nonconvex
Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np
Minimum rank version
minX
rank(X) st 〈Aj X〉F = yj j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 9 33
4 Nonnegative matrix factorization
Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative
If the full matrix Y isin Rntimesp is observed this problem has the form
minLR
983042LRT minusY9830422F subject to L ge 0 R ge 0
5 Sparse inverse covariance estimation
In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean
The sample covariance matrix constructed from these observationsis
S =1
mminus 1
m983131
j=1
ajaTj
Optimization Formulations Lecture 5 March 18 - 25 2020 10 33
The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a
Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse
The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)
One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following
minXisinSntimesnX≻0
〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0
where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =
983123nil=1 |Xil|
Optimization Formulations Lecture 5 March 18 - 25 2020 11 33
6 Sparse principal components
We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector
The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues
It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros
An explicit optimization formulation of this problem is
maxvisinRn
vTSv st 983042v9830422 = 1 983042v9830420 le k
where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v
Optimization Formulations Lecture 5 March 18 - 25 2020 12 33
A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn
maxMisinSntimesn
〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ
for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem
More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros
Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach
Optimization Formulations Lecture 5 March 18 - 25 2020 13 33
The optimization formulation is
maxVisinRntimesr
〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r
We can write a convex relaxation of this problem once again asemidefinite program as
maxMisinSntimesn
〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ
A more compact (but nonconvex) formulation is
maxFisinRntimesr
〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R
where 983042F98304221 =983123n
i=1 983042Fi9830422
Optimization Formulations Lecture 5 March 18 - 25 2020 14 33
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is
minMS
983042M983042lowast + λ983042S9830421 st Y = M+ S
where983042S9830421 =
983131
ij
|Sij |
Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed
minLRS
1
2983042LRT + SminusY9830422F
Optimization Formulations Lecture 5 March 18 - 25 2020 15 33
Partially observed
minLRS
1
2983042PΦ(LR
T + SminusY)9830422F
where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set
One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations
Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time
Optimization Formulations Lecture 5 March 18 - 25 2020 16 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
3 Matrix completion
Suppose $A_j \in \mathbb{R}^{n\times p}$. We seek $X \in \mathbb{R}^{n\times p}$ that solves
\[ \min_X \; \frac{1}{2m} \sum_{j=1}^m \left( \langle A_j, X \rangle_F - y_j \right)^2, \]
where $\langle A, B \rangle_F = \operatorname{tr}(A^T B)$.
A regularized version, leading to solutions $X$ that are low-rank, is
\[ \min_X \; \frac{1}{2m} \sum_{j=1}^m \left( \langle A_j, X \rangle_F - y_j \right)^2 + \lambda \|X\|_*, \quad \lambda > 0, \]
where $\|X\|_*$ is the nuclear norm (the sum of singular values of $X$).
Rank-$r$ version: Let $X = LR^T$ with $L \in \mathbb{R}^{n\times r}$, $R \in \mathbb{R}^{p\times r}$, and $r \ll \min(n,p)$. We solve
\[ \min_{L,R} \; \frac{1}{2m} \sum_{j=1}^m \left( \langle A_j, LR^T \rangle_F - y_j \right)^2. \]
In this formulation the rank $r$ is "hard-wired" into the definition of $X$ via two "thin-tall" matrices $L$ and $R$.
The objective function is nonconvex
Advantage: The total number of elements in $L$ and $R$ is $(n+p)r$, which is much less than $np$.
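As an illustration, the factored objective can be attacked with plain gradient descent. The sketch below (a made-up toy instance, not from the lecture) uses $r = 1$, and each $A_j$ selects a single entry $(i_j, k_j)$ of $X$, as in matrix completion:

```python
# Sketch: gradient descent on (1/2m) * sum_j (<A_j, L R^T>_F - y_j)^2
# with r = 1, where each A_j picks out one entry (i, k) of X = L R^T.
# All data below are made-up toy values.

def loss(L, R, obs):
    m = len(obs)
    return sum((L[i] * R[k] - y) ** 2 for i, k, y in obs) / (2 * m)

def grad_step(L, R, obs, lr):
    m = len(obs)
    gL = [0.0] * len(L)
    gR = [0.0] * len(R)
    for i, k, y in obs:
        r_j = L[i] * R[k] - y        # residual <A_j, L R^T>_F - y_j
        gL[i] += r_j * R[k] / m      # partial derivative w.r.t. L[i]
        gR[k] += r_j * L[i] / m      # partial derivative w.r.t. R[k]
    return ([Li - lr * g for Li, g in zip(L, gL)],
            [Rk - lr * g for Rk, g in zip(R, gR)])

# Observed entries (i, k, y) of a hidden rank-1 matrix [[3, 1], [6, 2]].
obs = [(0, 0, 3.0), (0, 1, 1.0), (1, 0, 6.0)]
L, R = [1.0, 1.0], [1.0, 1.0]        # r = 1: L is an n-vector, R a p-vector
initial = loss(L, R, obs)
for _ in range(2000):
    L, R = grad_step(L, R, obs, lr=0.05)
print(loss(L, R, obs) < initial)     # the fit improves
print(L[1] * R[1])                   # predicted unobserved entry, near 2
```

The unobserved entry is determined by the rank-1 structure ($X_{22} = X_{21}X_{12}/X_{11} = 2$), so the descent iterates recover it without ever seeing it.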
Minimum-rank version:
\[ \min_X \; \operatorname{rank}(X) \quad \text{s.t.} \quad \langle A_j, X \rangle_F = y_j, \; j = 1, \dots, m. \]
4 Nonnegative matrix factorization
Applications in computer vision, chemometrics, and document clustering require us to find factors $L \in \mathbb{R}^{n\times r}$ and $R \in \mathbb{R}^{p\times r}$ with all elements nonnegative.
If the full matrix $Y \in \mathbb{R}^{n\times p}$ is observed, this problem has the form
\[ \min_{L,R} \; \|LR^T - Y\|_F^2 \quad \text{subject to} \quad L \ge 0, \; R \ge 0. \]
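One standard approach (not covered in the slides) is the Lee-Seung multiplicative update, which preserves nonnegativity automatically when started from positive factors. A minimal pure-Python sketch on made-up data:

```python
# Sketch: Lee-Seung multiplicative updates for min ||L R^T - Y||_F^2
# with L, R >= 0.  Elementwise multiply/divide keeps entries nonnegative.
# The matrix Y and the initial factors are made-up toy values.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_err(L, R, Y):
    P = matmul(L, transpose(R))
    return sum((P[i][j] - Y[i][j]) ** 2
               for i in range(len(Y)) for j in range(len(Y[0])))

def nmf_step(L, R, Y, eps=1e-9):
    # L <- L * (Y R) / (L R^T R), elementwise
    num, den = matmul(Y, R), matmul(L, matmul(transpose(R), R))
    L = [[L[i][k] * num[i][k] / (den[i][k] + eps) for k in range(len(L[0]))]
         for i in range(len(L))]
    # R <- R * (Y^T L) / (R L^T L), elementwise
    num, den = matmul(transpose(Y), L), matmul(R, matmul(transpose(L), L))
    R = [[R[j][k] * num[j][k] / (den[j][k] + eps) for k in range(len(R[0]))]
         for j in range(len(R))]
    return L, R

Y = [[2.0, 1.0, 0.0], [4.0, 2.0, 0.0], [0.0, 0.0, 3.0]]  # nonnegative data
L = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]                  # positive init
R = [[0.5, 0.3], [0.4, 0.6], [0.3, 0.5]]
before = frob_err(L, R, Y)
for _ in range(200):
    L, R = nmf_step(L, R, Y)
print(frob_err(L, R, Y) < before)    # error decreases, factors stay >= 0
```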
5 Sparse inverse covariance estimation
In this problem the labels $y_j$ are null, and the vectors $a_j \in \mathbb{R}^n$ are viewed as independent observations of a random vector $a \in \mathbb{R}^n$, which has zero mean.
The sample covariance matrix constructed from these observations is
\[ S = \frac{1}{m-1} \sum_{j=1}^m a_j a_j^T. \]
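As a quick concrete check of the formula, with $n = 2$ and $m = 3$ zero-mean toy observations:

```python
# Direct check of S = (1/(m-1)) * sum_j a_j a_j^T for zero-mean samples
# a_j in R^n (made-up numbers, n = 2, m = 3).

a = [[1.0, 2.0], [-2.0, 0.0], [1.0, -2.0]]   # components sum to zero
n, m = 2, len(a)

S = [[sum(aj[i] * aj[l] for aj in a) / (m - 1) for l in range(n)]
     for i in range(n)]

print(S)   # -> [[3.0, 0.0], [0.0, 4.0]]  (symmetric, as expected)
```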
The element $S_{il}$ is an estimate of the covariance between the $i$th and $l$th elements of the random vector $a$.
Our interest is in calculating an estimate $X$ of the inverse covariance matrix that is sparse.
The structure of $X$ yields important information about $a$. In particular, if $X_{il} = 0$, we can conclude that the $i$th and $l$th components of $a$ are conditionally independent. (That is, they are independent given knowledge of the values of the other $n-2$ components of $a$.)
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix $X$ is the following:
\[ \min_{X \in S^{n\times n},\, X \succ 0} \; \langle S, X \rangle_F - \log\det(X) + \lambda \|X\|_1, \quad \lambda > 0, \]
where $S^{n\times n}$ is the set of $n\times n$ symmetric matrices, $X \succ 0$ indicates that $X$ is positive definite, and $\|X\|_1 = \sum_{i,l=1}^n |X_{il}|$.
6 Sparse principal components
We have a sample covariance matrix $S$ that is estimated from a number of observations of some underlying random vector.
The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.
It is often of interest to find a sparse principal component: an approximation to the leading eigenvector that also contains few nonzeros.
An explicit optimization formulation of this problem is
\[ \max_{v \in \mathbb{R}^n} \; v^T S v \quad \text{s.t.} \quad \|v\|_2 = 1, \; \|v\|_0 \le k, \]
where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$.
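For small $n$ the cardinality constraint can be handled by brute force over supports. The sketch below (toy data, not from the lecture) scans all 2-element supports and takes the largest eigenvalue of each $2\times 2$ principal submatrix of $S$, which equals $\max v^T S v$ over unit vectors supported there:

```python
import math
from itertools import combinations

# Sketch: brute force for  max v^T S v  s.t. ||v||_2 = 1, ||v||_0 <= k,
# with n = 3 and k = 2.  For each 2-subset of indices, the optimal value
# is the largest eigenvalue of the 2 x 2 principal submatrix (closed form).
# S below is a made-up symmetric matrix.

def lam_max_2x2(a, b, d):
    # largest eigenvalue of [[a, b], [b, d]]
    return 0.5 * ((a + d) + math.sqrt((a - d) ** 2 + 4 * b * b))

S = [[4.0, 1.0, 0.2],
     [1.0, 3.0, 0.1],
     [0.2, 0.1, 1.0]]

best_val, best_support = max(
    (lam_max_2x2(S[i][i], S[i][j], S[j][j]), (i, j))
    for i, j in combinations(range(3), 2))

print(best_support)   # indices of the best 2-sparse principal component
```

For this $S$ the winning support is $\{1, 2\}$ (indices 0 and 1), with value strictly larger than any single diagonal entry.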
A relaxation: replace $vv^T$ by a positive semidefinite proxy $M \in S^{n\times n}$:
\[ \max_{M \in S^{n\times n}} \; \langle S, M \rangle_F \quad \text{s.t.} \quad M \succeq 0, \; \langle I, M \rangle_F = 1, \; \|M\|_1 \le \rho, \]
for some parameter $\rho > 0$ that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite program.
More generally, one may need to find the leading $r > 1$ sparse principal components: approximations to the leading $r$ eigenvectors that also contain few nonzeros.
Ideally, we would obtain these from a matrix $V \in \mathbb{R}^{n\times r}$ whose columns are mutually orthogonal and have at most $k$ nonzeros each.
The optimization formulation is
\[ \max_{V \in \mathbb{R}^{n\times r}} \; \langle S, VV^T \rangle_F \quad \text{s.t.} \quad V^T V = I, \; \|v_i\|_0 \le k, \; i = 1, \dots, r. \]
We can write a convex relaxation of this problem, once again a semidefinite program, as
\[ \max_{M \in S^{n\times n}} \; \langle S, M \rangle_F \quad \text{s.t.} \quad 0 \preceq M \preceq I, \; \langle I, M \rangle_F = r, \; \|M\|_1 \le \rho. \]
A more compact (but nonconvex) formulation is
\[ \max_{F \in \mathbb{R}^{n\times r}} \; \langle S, FF^T \rangle_F \quad \text{s.t.} \quad \|F\|_2 \le 1, \; \|F\|_{2,1} \le R, \]
where $\|F\|_{2,1} = \sum_{i=1}^n \|F_{i\cdot}\|_2$ (the sum of the $\ell_2$ norms of the rows of $F$).
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed $n\times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is
\[ \min_{M,S} \; \|M\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad Y = M + S, \]
where $\|S\|_1 = \sum_{i,j} |S_{ij}|$.
Compact nonconvex formulations that allow noise in the observations include the following ($L \in \mathbb{R}^{n\times r}$, $R \in \mathbb{R}^{p\times r}$, and $S \in \mathcal{S}$, a set of sparse matrices). Fully observed:
\[ \min_{L,R,S} \; \frac{1}{2} \|LR^T + S - Y\|_F^2. \]
Partially observed:
\[ \min_{L,R,S} \; \frac{1}{2} \|P_\Phi(LR^T + S - Y)\|_F^2, \]
where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.
Another application is to foreground-background separation in video processing. Here, each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.
8 Subspace identification
In this application, the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.
The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n\times r}$.
If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n\times m$ matrix $A = [a_1 \; a_2 \; \cdots \; a_m]$ and take the columns of $X$ to be the leading $r$ left singular vectors (which span the column space of $A$).
In interesting variants of this problem, however, the vectors $a_j$ may arrive in streaming fashion and may be only partly observed, for example in indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that
\[ P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m. \]
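For a known subspace and $r = 1$, finding each coefficient $s_j$ from partial observations is a tiny least-squares problem over the observed indices; a sketch with made-up data:

```python
# Sketch: with the subspace X known and r = 1, the coefficient s_j for a
# partially observed vector a_j solves
#   min_s  sum_{i in Phi_j} (a_j[i] - X[i] * s)^2,
# which has a closed form.  All vectors below are made-up toy data.

def fit_coeff(X, a_obs, phi):
    # closed-form least squares restricted to the observed entries
    num = sum(X[i] * a_obs[i] for i in phi)
    den = sum(X[i] ** 2 for i in phi)
    return num / den

X = [1.0, 2.0, -1.0]        # basis of a 1-dim subspace of R^3
a = {0: 2.0, 2: -2.0}       # a_j observed only on Phi_j = {0, 2}
phi = [0, 2]

s = fit_coeff(X, a, phi)
residual = max(abs(a[i] - X[i] * s) for i in phi)
print(s, residual < 1e-9)   # the observed entries match X * s exactly
```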
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.
This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that
\[ a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1, \qquad a_j^T x - \beta \le -1 \quad \text{when } y_j = -1. \]
Any pair $(x, \beta)$ that satisfies these conditions defines a separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$ that separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$. Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the planes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form
\[ H(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta), \, 0\} \ge 0. \]
Note that the $j$th term in this summation is zero if the conditions above are satisfied and positive otherwise; $\min_{x,\beta} H(x, \beta) = 0$ implies the existence of a separating hyperplane.
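A direct evaluation of $H(x,\beta)$ on a toy separable data set confirms that a separating pair drives the objective to zero:

```python
# Direct evaluation of the hinge objective
#   H(x, beta) = (1/m) * sum_j max(1 - y_j * (a_j^T x - beta), 0)
# on made-up 1-d data, separable by the hyperplane x = 1, beta = 0.

def H(x, beta, data):
    total = 0.0
    for a, y in data:
        margin = y * (sum(ai * xi for ai, xi in zip(a, x)) - beta)
        total += max(1.0 - margin, 0.0)
    return total / len(data)

data = [([2.0], 1), ([3.0], 1), ([-2.0], -1)]   # toy separable points

print(H([1.0], 0.0, data))    # -> 0.0 : every margin is at least 1
print(H([0.0], 0.0, data))    # -> 1.0 : every point violates the margin
```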
Regularized version:
\[ H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta), \, 0\} + \frac{\lambda}{2} \|x\|_2^2. \]
If $\lambda$ is sufficiently small (but positive), and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability and robustness.
Figure: linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms:
\[ \min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2 \quad \text{subject to} \quad s_j \ge 1 - y_j(a_j^T x - \beta), \; s_j \ge 0, \; j = 1, \dots, m, \]
where $\mathbf{1} = (1, 1, \dots, 1)^T \in \mathbb{R}^m$.
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. The conditions
\[ \zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1, \qquad \zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1 \]
lead to
\[ H_{\zeta,\lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(\zeta(a_j)^T x - \beta), \, 0\} + \frac{\lambda}{2} \|x\|_2^2. \]
When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.
As before, the problem of minimizing $H_{\zeta,\lambda}(x, \beta)$ can be written as a convex quadratic program by introducing variables $s_j$, $j = 1, \dots, m$, for the residual terms:
\[ \min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2 \quad \text{subject to} \quad s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \; s_j \ge 0, \; j = 1, \dots, m. \]
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (Convex Quadratic Functions).
The dual problem, in $m$ variables:
\[ \min_{z \in \mathbb{R}^m} \; \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda} \mathbf{1}, \; y^T z = 0, \]
where $Q_{kl} = y_k y_l \, \zeta(a_k)^T \zeta(a_l)$.
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick."
A particularly popular choice of kernel is the Gaussian kernel:
\[ K(a_k, a_l) = \exp\!\left( -\frac{\|a_k - a_l\|^2}{2\sigma} \right), \]
where $\sigma$ is a positive parameter.
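A sketch of assembling the dual matrix $Q$ from the Gaussian kernel alone, without ever forming $\zeta$ (toy data; all values are illustrative):

```python
import math

# Sketch: building Q_kl = y_k y_l K(a_k, a_l) with the Gaussian kernel
#   K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2*sigma)),
# directly from the raw data -- the "kernel trick."  Toy data, sigma = 1.

def gauss_kernel(u, v, sigma=1.0):
    sq = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq / (2.0 * sigma))

A = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # made-up data points
y = [1, 1, -1]

Q = [[y[k] * y[l] * gauss_kernel(A[k], A[l]) for l in range(3)]
     for k in range(3)]

print(all(Q[k][k] == 1.0 for k in range(3)))   # K(a, a) = 1 and y_k^2 = 1
print(Q[0][1] == Q[1][0])                      # Q is symmetric
```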
10 Logistic regression
We seek an "odds function" $p \in (0,1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:
\[ p(a; x) = (1 + \exp(a^T x))^{-1}, \]
and aim to choose the parameter $x$ so that
\[ p(a_j; x) \approx 1 \ \text{when } y_j = 1, \qquad p(a_j; x) \approx 0 \ \text{when } y_j = -1. \]
The optimal value of $x$ can be found by maximizing a log-likelihood function:
\[ L(x) = \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right). \]
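A direct evaluation of $p(a;x)$ and $L(x)$ on made-up one-dimensional data. Note the sign convention here: $p \to 1$ as $a^T x \to -\infty$, matching $y = 1$:

```python
import math

# Direct evaluation of the odds function p(a; x) = (1 + exp(a^T x))^{-1}
# and the average log-likelihood L(x), on made-up 1-d data.

def p(a, x):
    return 1.0 / (1.0 + math.exp(sum(ai * xi for ai, xi in zip(a, x))))

def loglik(x, data):
    total = 0.0
    for a, y in data:
        total += math.log(p(a, x)) if y == 1 else math.log(1.0 - p(a, x))
    return total / len(data)

data = [([-3.0], 1), ([-2.0], 1), ([2.5], -1)]   # toy labeled points

x = [1.0]
print(p([-3.0], x) > 0.9)                     # near 1 on a positive example
print(loglik(x, data) > loglik([0.0], data))  # better likelihood than x = 0
```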
We can perform feature selection using this model by introducing a regularizer:
\[ \max_x \; \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right) - \lambda \|x\|_1, \]
where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda\|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
Multiclass (or multinomial) logistic regression handles data vectors $a_j$ that belong to more than two classes, say $M$ classes in total. We need a distinct odds function $p_k \in (0,1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
\[ p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^M \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M, \]
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and for all $k$ we have
\[ p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^M p_k(a; X) = 1. \]
If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^M$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then
\[ p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \ \text{for } \ell \ne k. \]
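These properties are easy to verify numerically; a sketch with made-up parameters ($M = 3$):

```python
import math

# Check of the multiclass odds functions
#   p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l]),
# which lie in (0, 1) and sum to 1 over the classes.

def softmax_probs(a, X):
    scores = [math.exp(sum(ai * xi for ai, xi in zip(a, x_k))) for x_k in X]
    total = sum(scores)
    return [s / total for s in scores]

a = [1.0, -1.0]
X = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]   # made-up x_[k], k = 1..3

probs = softmax_probs(a, X)
print(all(0.0 < pk < 1.0 for pk in probs))  # each p_k lies in (0, 1)
print(abs(sum(probs) - 1.0) < 1e-9)         # the p_k sum to 1
print(probs.index(max(probs)))              # dominant inner product wins: 0
```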
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
\[ y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases} \]
We seek to define the vectors $x_{[k]}$ so that
\[ p_k(a_j; X) \approx 1 \ \text{when } y_{jk} = 1, \qquad p_k(a_j; X) \approx 0 \ \text{when } y_{jk} = 0. \]
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
\[ L(X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} \left( x_{[\ell]}^T a_j \right) - \log \left( \sum_{\ell=1}^M \exp\left( x_{[\ell]}^T a_j \right) \right) \right]. \]
Group-sparse regularization terms can be added to this model to select a common subset of features across all $M$ classes.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.
The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: Catherine F. Higham and Desmond J. Higham, "Deep Learning: An Introduction for Applied Mathematicians," SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Figure: a deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is
\[ a_j^l = \sigma(W_l \, a_j^{l-1} + g_l), \quad l = 1, 2, \dots, D, \]
where $W_l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g_l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation; and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix $W_l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
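A minimal forward pass through a made-up two-hidden-layer network with the logistic activation (all weights below are illustrative, not from the lecture):

```python
import math

# Sketch: one forward pass a^l = sigma(W_l a^{l-1} + g_l), l = 1..D,
# through a toy network with D = 2 hidden layers and logistic sigma.
# Shapes follow the slide: W_l is |a^l| x |a^{l-1}|, g_l has length |a^l|.

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))   # logistic activation

def layer(W, g, a_prev):
    return [sigma(sum(w * ap for w, ap in zip(row, a_prev)) + gi)
            for row, gi in zip(W, g)]

def forward(weights, a0):
    a = a0
    for W, g in weights:
        a = layer(W, g, a)
    return a                             # a^D, fed to the top classifier

a0 = [1.0, -1.0, 0.5]                    # raw input a_j, n = 3
weights = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),   # W_1: 2 x 3
    ([[1.0, -1.0]], [0.2]),                               # W_2: 1 x 2
]

aD = forward(weights, a0)
print(len(aD), all(0.0 < v < 1.0 for v in aD))   # shape 1, values in (0, 1)
```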
Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:
(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;
(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;
(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W_l, g_l)$, $l = 1, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network classifies the training data well.
Using the notation $w$ for the hidden-layer transformations, that is,
\[ w = (W_1, g_1, W_2, g_2, \dots, W_D, g_D), \]
and defining $X = \{x_{[k]} \mid k = 1, \dots, M\}$, we can write the loss function for deep learning as follows:
\[ L(w, X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} \left( x_{[\ell]}^T a_j^D(w) \right) - \log \left( \sum_{\ell=1}^M \exp\left( x_{[\ell]}^T a_j^D(w) \right) \right) \right]. \]
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, \dots, m$.
The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in (wX) is usually very large
Rank-r version Let X = LRT with L isin Rntimesr R isin Rptimesr andr ≪ min(n p) We solve
minLR
1
2m
m983131
j=1
(〈Aj LRT〉F minus yj)
2
In this formulation the rank r is ldquohard-wiredrdquo into the definitionof X via two ldquothin-tallrdquo matrices L and R
The objective function is nonconvex
Advantage The total number of elements in L and R is (n+ p)rwhich is much less than np
Minimum rank version
minX
rank(X) st 〈Aj X〉F = yj j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 9 33
4 Nonnegative matrix factorization
Applications in computer vision chemometrics and documentclustering require us to find factors L isin Rntimesr and R isin Rptimesr withall elements nonnegative
If the full matrix Y isin Rntimesp is observed this problem has the form
minLR
983042LRT minusY9830422F subject to L ge 0 R ge 0
5 Sparse inverse covariance estimation
In this problem the labels yj are null and the vectors aj isin Rn areviewed as independent observations of a random vector a isin Rnwhich has zero mean
The sample covariance matrix constructed from these observationsis
S =1
mminus 1
m983131
j=1
ajaTj
Optimization Formulations Lecture 5 March 18 - 25 2020 10 33
The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a
Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse
The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)
One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following
minXisinSntimesnX≻0
〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0
where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =
983123nil=1 |Xil|
Optimization Formulations Lecture 5 March 18 - 25 2020 11 33
6 Sparse principal components
We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector
The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues
It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros
An explicit optimization formulation of this problem is
maxvisinRn
vTSv st 983042v9830422 = 1 983042v9830420 le k
where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v
Optimization Formulations Lecture 5 March 18 - 25 2020 12 33
A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn
maxMisinSntimesn
〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ
for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem
More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros
Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach
Optimization Formulations Lecture 5 March 18 - 25 2020 13 33
The optimization formulation is
maxVisinRntimesr
〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r
We can write a convex relaxation of this problem once again asemidefinite program as
maxMisinSntimesn
〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ
A more compact (but nonconvex) formulation is
maxFisinRntimesr
〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R
where 983042F98304221 =983123n
i=1 983042Fi9830422
Optimization Formulations Lecture 5 March 18 - 25 2020 14 33
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is
minMS
983042M983042lowast + λ983042S9830421 st Y = M+ S
where983042S9830421 =
983131
ij
|Sij |
Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed
minLRS
1
2983042LRT + SminusY9830422F
Optimization Formulations Lecture 5 March 18 - 25 2020 15 33
Partially observed
minLRS
1
2983042PΦ(LR
T + SminusY)9830422F
where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set
One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations
Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time
Optimization Formulations Lecture 5 March 18 - 25 2020 16 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
4 Nonnegative matrix factorization
Applications in computer vision, chemometrics, and document clustering require us to find factors $L \in \mathbb{R}^{n \times r}$ and $R \in \mathbb{R}^{p \times r}$ with all elements nonnegative.
If the full matrix $Y \in \mathbb{R}^{n \times p}$ is observed, this problem has the form

$$\min_{L, R} \; \|LR^T - Y\|_F^2 \quad \text{subject to } L \ge 0, \; R \ge 0.$$
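As a concrete illustration, the fully observed problem can be attacked with the classical multiplicative updates of Lee and Seung, which preserve nonnegativity automatically. A minimal numpy sketch (the function name `nmf` and all parameter choices are illustrative, not from the lecture):

```python
import numpy as np

def nmf(Y, r, iters=500, eps=1e-9, seed=0):
    """Multiplicative updates for min ||L R^T - Y||_F^2 s.t. L >= 0, R >= 0."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    L = rng.random((n, r)) + 0.1          # strictly positive initialization
    R = rng.random((p, r)) + 0.1
    for _ in range(iters):
        L *= (Y @ R) / (L @ (R.T @ R) + eps)    # elementwise; keeps L >= 0
        R *= (Y.T @ L) / (R @ (L.T @ L) + eps)  # elementwise; keeps R >= 0
    return L, R

# Exact nonnegative rank-2 matrix, so a near-perfect fit is achievable.
rng = np.random.default_rng(1)
Ltrue, Rtrue = rng.random((6, 2)), rng.random((5, 2))
Y = Ltrue @ Rtrue.T
L, R = nmf(Y, r=2)
rel_err = np.linalg.norm(L @ R.T - Y) / np.linalg.norm(Y)
```

Each update rescales the current factor entrywise and never changes its sign, which is why the nonnegativity constraints hold without any projection.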
5 Sparse inverse covariance estimation
In this problem, the labels $y_j$ are null, and the vectors $a_j \in \mathbb{R}^n$ are viewed as independent observations of a random vector $a \in \mathbb{R}^n$ that has zero mean.
The sample covariance matrix constructed from these observations is

$$S = \frac{1}{m-1} \sum_{j=1}^{m} a_j a_j^T.$$
The element $S_{il}$ is an estimate of the covariance between the $i$th and $l$th elements of the random vector $a$.
Our interest is in calculating a sparse estimate $X$ of the inverse covariance matrix.
The structure of $X$ yields important information about $a$. In particular, if $X_{il} = 0$, we can conclude that the $i$ and $l$ components of $a$ are conditionally independent. (That is, they are independent given knowledge of the values of the other $n-2$ components of $a$.)
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix $X$ is the following:

$$\min_{X \in S^{n \times n}, \, X \succ 0} \; \langle S, X \rangle_F - \log\det(X) + \lambda \|X\|_1, \quad \lambda > 0,$$

where $S^{n \times n}$ is the set of $n \times n$ symmetric matrices, $X \succ 0$ indicates that $X$ is positive definite, and $\|X\|_1 = \sum_{i,l=1}^{n} |X_{il}|$.
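The objective above is easy to evaluate with numpy, which makes the formulation concrete. A small sketch, in which the helper name `sic_objective` and the test data are purely illustrative:

```python
import numpy as np

def sic_objective(S, X, lam):
    """<S, X>_F - log det(X) + lam * ||X||_1, for symmetric positive definite X."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return float(np.sum(S * X) - logdet + lam * np.abs(X).sum())

# Sample covariance S = 1/(m-1) sum_j a_j a_j^T from zero-mean observations.
rng = np.random.default_rng(0)
m, n = 200, 4
A = rng.standard_normal((m, n))           # rows are the observations a_j^T
S = (A.T @ A) / (m - 1)
val = sic_objective(S, np.eye(n), lam=0.1)
```

Minimizing this function over positive definite $X$ (e.g. by proximal methods) yields the sparse precision-matrix estimate; the sketch only evaluates the objective at a candidate point.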
6 Sparse principal components
We have a sample covariance matrix $S$ that is estimated from a number of observations of some underlying random vector.
The principal components of this matrix are the eigenvectors corresponding to the leading eigenvalues.
It is often of interest to find a sparse principal component: an approximation to the leading eigenvector that also contains few nonzeros.
An explicit optimization formulation of this problem is

$$\max_{v \in \mathbb{R}^n} \; v^T S v \quad \text{s.t. } \|v\|_2 = 1, \; \|v\|_0 \le k,$$

where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$.
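One simple heuristic for this $\ell_0$-constrained problem is a truncated power iteration: multiply by $S$, keep only the $k$ largest entries in magnitude, and renormalize. A hedged numpy sketch (this particular method is my own choice of illustration, not the lecture's):

```python
import numpy as np

def truncated_power(S, k, iters=100, seed=0):
    """Heuristic for max v^T S v s.t. ||v||_2 = 1, ||v||_0 <= k."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = S @ v
        keep = np.argsort(np.abs(v))[-k:]   # indices of the k largest magnitudes
        mask = np.zeros(n)
        mask[keep] = 1.0
        v *= mask                           # enforce ||v||_0 <= k
        v /= np.linalg.norm(v)
    return v

# Covariance whose leading eigenvector is supported on the first two coordinates.
S = np.diag([5.0, 4.0, 0.1, 0.1, 0.1])
S[0, 1] = S[1, 0] = 1.0
v = truncated_power(S, k=2)
```

On this example the iterate settles onto the two dominant coordinates and recovers the leading eigenvector of that $2 \times 2$ block.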
A relaxation: replace $vv^T$ by a positive semidefinite proxy $M \in S^{n \times n}$:

$$\max_{M \in S^{n \times n}} \; \langle S, M \rangle_F \quad \text{s.t. } M \succeq 0, \; \langle I, M \rangle_F = 1, \; \|M\|_1 \le \rho,$$
for some parameter $\rho > 0$ that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite programming problem.
More generally, one may need to find the leading $r > 1$ sparse principal components: approximations to the leading $r$ eigenvectors that also contain few nonzeros.
Ideally, we would obtain these from a matrix $V \in \mathbb{R}^{n \times r}$ whose columns are mutually orthogonal and have at most $k$ nonzeros each.
The optimization formulation is

$$\max_{V \in \mathbb{R}^{n \times r}} \; \langle S, VV^T \rangle_F \quad \text{s.t. } V^T V = I, \; \|v_i\|_0 \le k, \; i = 1, \dots, r.$$
We can write a convex relaxation of this problem, once again a semidefinite program, as

$$\max_{M \in S^{n \times n}} \; \langle S, M \rangle_F \quad \text{s.t. } 0 \preceq M \preceq I, \; \langle I, M \rangle_F = r, \; \|M\|_1 \le \rho.$$
A more compact (but nonconvex) formulation is

$$\max_{F \in \mathbb{R}^{n \times r}} \; \langle S, FF^T \rangle_F \quad \text{s.t. } \|F\|_2 \le 1, \; \|F\|_{2,1} \le R,$$

where $\|F\|_{2,1} = \sum_{i=1}^{n} \|F_i\|_2$ and $F_i$ denotes the $i$th row of $F$.
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed $n \times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is

$$\min_{M, S} \; \|M\|_* + \lambda \|S\|_1 \quad \text{s.t. } Y = M + S,$$

where $\|M\|_*$ is the nuclear norm of $M$ (the sum of its singular values) and $\|S\|_1 = \sum_{i,j} |S_{ij}|$.
Compact nonconvex formulations that allow noise in the observations include the following, with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and $S \in \mathcal{S}$ for some set $\mathcal{S}$ of sparse matrices. Fully observed:

$$\min_{L, R, S} \; \frac{1}{2} \|LR^T + S - Y\|_F^2.$$
Partially observed:

$$\min_{L, R, S} \; \frac{1}{2} \|P_\Phi(LR^T + S - Y)\|_F^2,$$

where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.
Another application is to foreground-background separation in video processing. Here each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.
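A common way to attack the fully observed nonconvex formulation (with an added $\lambda \|S\|_1$ term) is to alternate a truncated SVD for the low-rank part with entrywise soft-thresholding for the sparse part. This is a hedged sketch of that heuristic, not the lecture's prescribed algorithm:

```python
import numpy as np

def soft(X, t):
    """Entrywise soft-thresholding: the prox operator of t * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def sparse_plus_lowrank(Y, r, lam, iters=50):
    """Alternating heuristic for Y ~ M + S, rank(M) <= r, S sparse."""
    S = np.zeros_like(Y)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(Y - S, full_matrices=False)
        M = (U[:, :r] * sig[:r]) @ Vt[:r]   # best rank-r approximation of Y - S
        S = soft(Y - M, lam)                # shrink the residual toward sparsity
    return M, S

# Rank-1 matrix corrupted by two large "outlier" entries.
rng = np.random.default_rng(0)
Y = np.outer(rng.standard_normal(8), rng.standard_normal(6))
Y[2, 3] += 10.0
Y[5, 1] -= 10.0
M, S = sparse_plus_lowrank(Y, r=1, lam=0.5)
```

In the robust-PCA reading, $M$ plays the role of the principal components and the large entries of $S$ flag the outliers.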
8 Subspace identification
In this application, the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.
The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n \times r}$.
If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n \times m$ matrix $A = [a_j]_{j=1}^{m}$ and take $X$ to be the leading $r$ left singular vectors.
In interesting variants of this problem, however, the vectors $a_j$ may arrive in streaming fashion and may be only partly observed, for example in indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that

$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m.$$
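For the fully observed case, the SVD recipe above is a few lines of numpy (the synthetic data and tolerances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 5, 2, 100
basis = np.linalg.qr(rng.standard_normal((n, r)))[0]   # true r-dim subspace of R^n
A = basis @ rng.standard_normal((r, m))                # columns a_j in the subspace...
A += 0.01 * rng.standard_normal((n, m))                # ...plus small noise

U, sig, Vt = np.linalg.svd(A, full_matrices=False)
X = U[:, :r]                                           # leading r left singular vectors

# Compare the two subspaces through their orthogonal projectors.
gap = np.linalg.norm(basis @ basis.T - X @ X.T)
```

A small projector gap means the columns of `X` span nearly the same subspace as the true basis, even though the individual singular vectors need not match the basis columns.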
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.
This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that

$$a_j^T x - \beta \ge 1 \text{ when } y_j = 1, \qquad a_j^T x - \beta \le -1 \text{ when } y_j = -1.$$
Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$, separating the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$.
Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the hyperplanes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form

$$H(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), \, 0\} \ge 0.$$

Note that the $j$th term in this summation is zero if the conditions above are satisfied and positive otherwise, so $\min_{x, \beta} H(x, \beta) = 0$ implies the existence of a separating hyperplane.
Regularized version:

$$H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(a_j^T x - \beta), \, 0\} + \frac{\lambda}{2} \|x\|_2^2.$$

If $\lambda$ is sufficiently small (but positive) and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane.
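Because $H_\lambda$ is convex but nonsmooth, a natural first method is subgradient descent. A minimal numpy sketch on synthetic separable data (all names, step sizes, and data choices here are my own, not the lecture's):

```python
import numpy as np

def hinge_obj_subgrad(x, beta, A, y, lam):
    """H_lambda(x, beta) and one subgradient with respect to (x, beta)."""
    m = len(y)
    margins = 1.0 - y * (A @ x - beta)
    active = margins > 0                      # terms where the hinge is positive
    obj = margins[active].sum() / m + 0.5 * lam * (x @ x)
    gx = -(y[active, None] * A[active]).sum(axis=0) / m + lam * x
    gbeta = y[active].sum() / m
    return obj, gx, gbeta

rng = np.random.default_rng(0)
m, n = 200, 2
A = rng.standard_normal((m, n))
y = np.where(A[:, 0] + 0.3 * A[:, 1] > 0, 1.0, -1.0)   # linearly separable labels
x, beta = np.zeros(n), 0.0
for _ in range(500):
    obj, gx, gbeta = hinge_obj_subgrad(x, beta, A, y, lam=1e-3)
    x -= 0.1 * gx
    beta -= 0.1 * gbeta
accuracy = np.mean(np.sign(A @ x - beta) == y)
```

At $(x, \beta) = (0, 0)$ every hinge term equals 1, so $H_\lambda = 1$; the iterations drive the objective well below this starting value.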
The maximum-margin property is consistent with the goals of generalizability and robustness.
Linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms:

$$\min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2$$

subject to

$$s_j \ge 1 - y_j(a_j^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m,$$

where $\mathbf{1} = [1, 1, \dots, 1]^T \in \mathbb{R}^m$.
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. The conditions

$$\zeta(a_j)^T x - \beta \ge 1 \text{ when } y_j = 1, \qquad \zeta(a_j)^T x - \beta \le -1 \text{ when } y_j = -1$$

lead to

$$H_{\zeta, \lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max\{1 - y_j(\zeta(a_j)^T x - \beta), \, 0\} + \frac{\lambda}{2} \|x\|_2^2.$$

When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.
The problem of minimizing $H_{\zeta, \lambda}(x, \beta)$ can likewise be written as a convex quadratic program by introducing residual variables $s_j$, $j = 1, \dots, m$:

$$\min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2$$

subject to

$$s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \quad s_j \ge 0, \quad j = 1, \dots, m.$$

The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual via the result in Convex Optimization §5.2.4 (Lagrange dual of a QCQP) and the result in First-Order Methods in Optimization §4.4.7 (convex quadratic functions).
The dual problem, in $m$ variables, is

$$\min_{z \in \mathbb{R}^m} \; \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to } 0 \le z \le \frac{1}{m\lambda} \mathbf{1}, \quad y^T z = 0,$$

where $Q_{kl} = y_k y_l \zeta(a_k)^T \zeta(a_l)$.
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with a kernel function $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel

$$K(a_k, a_l) = \exp(-\|a_k - a_l\|^2 / (2\sigma)),$$

where $\sigma$ is a positive parameter.
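The kernel trick is concrete in code: the dual needs only the matrix $Q$, never $\zeta$ itself. A sketch that builds $Q$ for the Gaussian kernel (the helper name `gaussian_Q` is hypothetical):

```python
import numpy as np

def gaussian_Q(A, y, sigma):
    """Q_kl = y_k y_l K(a_k, a_l), with K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2 sigma))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)   # pairwise squared distances
    K = np.exp(-d2 / (2.0 * sigma))
    return (y[:, None] * y[None, :]) * K

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                        # rows are the a_k
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
Q = gaussian_Q(A, y, sigma=1.0)
```

$Q$ is symmetric positive semidefinite, so the dual objective $\frac{1}{2} z^T Q z - \mathbf{1}^T z$ is convex, as claimed.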
10 Logistic regression
We seek an "odds function" $p \in (0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:

$$p(a; x) = (1 + \exp(a^T x))^{-1},$$

and aim to choose the parameter $x$ so that

$$p(a_j; x) \approx 1 \text{ when } y_j = 1, \qquad p(a_j; x) \approx 0 \text{ when } y_j = -1.$$

The optimal value of $x$ can be found by maximizing a log-likelihood function:

$$L(x) = \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right).$$
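The log-likelihood $L(x)$ is smooth and concave, so plain gradient ascent works. A numpy sketch that follows the text's convention $p(a; x) = (1 + \exp(a^T x))^{-1}$, with synthetic data of my own:

```python
import numpy as np

def loglik(x, A, y):
    """L(x) under the convention p(a; x) = 1/(1 + exp(a^T x))."""
    z = A @ x
    lse = np.logaddexp(0.0, z)                 # stable log(1 + e^z)
    # log p = -lse ;  log(1 - p) = z - lse
    return np.mean(np.where(y == 1, -lse, z - lse))

def grad(x, A, y):
    p = 1.0 / (1.0 + np.exp(A @ x))            # p(a_j; x)
    t = np.where(y == 1, -(1.0 - p), p)        # per-term coefficient of a_j
    return (t[:, None] * A).mean(axis=0)

rng = np.random.default_rng(0)
m, n = 100, 3
A = rng.standard_normal((m, n))
y = np.where(A @ np.array([-1.0, 2.0, 0.0]) < 0, 1.0, -1.0)
x = np.zeros(n)
L0 = loglik(x, A, y)                           # = log(1/2) at x = 0
for _ in range(200):
    x += 0.5 * grad(x, A, y)                   # gradient *ascent* on L
L1 = loglik(x, A, y)
```

Note the convention means $p \approx 1$ requires $a^T x \ll 0$, so the synthetic labels are generated accordingly.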
We can perform feature selection using this model by introducing a regularizer:

$$\max_x \; \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right) - \lambda \|x\|_1,$$

where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
Multiclass (or multinomial) logistic regression arises when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:

$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^{M} \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$

where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and all $k$ we have

$$p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^{M} p_k(a; X) = 1.$$

If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^{M}$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then

$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \text{ for } \ell \ne k.$$
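These softmax probabilities are a few lines of numpy once the logits are shifted for numerical stability. A sketch with $X$ stored row-wise (each row is one $x_{[k]}$; all names are illustrative):

```python
import numpy as np

def softmax_probs(a, X):
    """p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l]); rows of X are the x_[k]."""
    z = X @ a
    z = z - z.max()            # shift logits; leaves the probabilities unchanged
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
M, n = 4, 3
X = rng.standard_normal((M, n))
a = rng.standard_normal(n)
p = softmax_probs(a, X)
```

If one row of $X$ is replaced by a large multiple of $a$, its logit $a^T x_{[k]}$ dominates and the corresponding probability approaches 1, matching the dominance remark above.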
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:

$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$
We seek to define the vectors $x_{[k]}$ so that

$$p_k(a_j; X) \approx 1 \text{ when } y_{jk} = 1, \qquad p_k(a_j; X) \approx 0 \text{ when } y_{jk} = 0.$$

The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

$$L(X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} \left( x_{[\ell]}^T a_j \right) - \log \left( \sum_{\ell=1}^{M} \exp\left( x_{[\ell]}^T a_j \right) \right) \right].$$
Feature selection can again be encouraged by adding group-sparse regularization terms, which zero out a given component of $a$ across all $M$ vectors $x_{[k]}$ simultaneously.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.
The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: Deep Learning: An Introduction for Applied Mathematicians, Catherine F. Higham and Desmond J. Higham, SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Deep neural network showing connections between adjacent layers
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is

$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$

where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$, $g^l$ is a vector of length $|a_j^l|$, $\sigma$ is a componentwise nonlinear transformation, and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
Typical forms of the function $\sigma$, acting identically on each component $t \in \mathbb{R}$ of its input vector, include the following:
(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;
(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;
(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation $w$ for the hidden-layer transformations, that is,

$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$

and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:

$$L(w, X) = \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell} \left( x_{[\ell]}^T a_j^D(w) \right) - \log \left( \sum_{\ell=1}^{M} \exp\left( x_{[\ell]}^T a_j^D(w) \right) \right) \right].$$
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.
The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in $(w, X)$ is usually very large.
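Putting the pieces together: a forward pass through $D$ hidden ReLU layers followed by the multiclass logistic loss. A minimal numpy sketch with random (untrained) parameters; all shapes and names here are my own choices:

```python
import numpy as np

def forward(a, layers):
    """a^l = sigma(W^l a^{l-1} + g^l), l = 1..D, with sigma = ReLU; returns a^D."""
    for W, g in layers:
        a = np.maximum(W @ a + g, 0.0)
    return a

def deep_loss(w, X, A, Y):
    """L(w, X): y_j-weighted logits minus log-sum-exp, averaged over the m examples."""
    total = 0.0
    for a, yvec in zip(A, Y):
        z = X @ forward(a, w)                  # z_l = x_[l]^T a_j^D(w)
        zmax = z.max()                         # stabilize the log-sum-exp
        total += yvec @ z - (zmax + np.log(np.sum(np.exp(z - zmax))))
    return total / len(A)

rng = np.random.default_rng(0)
n, h, M, m = 4, 5, 3, 10
w = [(rng.standard_normal((h, n)), rng.standard_normal(h)),    # D = 2 hidden layers
     (rng.standard_normal((h, h)), rng.standard_normal(h))]
X = rng.standard_normal((M, h))                                # rows are the x_[k]
A = rng.standard_normal((m, n))                                # rows are the a_j
Y = np.eye(M)[rng.integers(0, M, size=m)]                      # one-hot labels y_j
val = deep_loss(w, X, A, Y)
```

With `w = []` the network has no hidden layers and `deep_loss` reduces exactly to the multiclass logistic regression log-likelihood, matching the $D = 0$ special case noted above.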
The element Sil is an estimate of the covariance between the ithand lth elements of the random variable vector a
Our interest is in calculating an estimate X of the inversecovariance matrix that is sparse
The structure of X yields important information about a Inparticular if Xil = 0 we can conclude that the i and l componentsof a are conditionally independent (That is they are independentgiven knowledge of the values of the other nminus 2 components of a)
One optimization formulation that has been proposed forestimating the inverse sparse covariance matrix X is the following
minXisinSntimesnX≻0
〈SX〉F minus log det(X) + λ983042X9830421 λ gt 0
where Sntimesn is the set of ntimes n symmetric matrices X ≻ 0indicates that X is positive definite and 983042X9830421 =
983123nil=1 |Xil|
Optimization Formulations Lecture 5 March 18 - 25 2020 11 33
6 Sparse principal components
We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector
The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues
It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros
An explicit optimization formulation of this problem is
maxvisinRn
vTSv st 983042v9830422 = 1 983042v9830420 le k
where 983042 middot 9830420 indicates the cardinality of v (that is the number ofnonzeros in v) and k is a user-defined parameter indicating abound on the cardinality of v
Optimization Formulations Lecture 5 March 18 - 25 2020 12 33
A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn
maxMisinSntimesn
〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ
for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem
More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros
Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach
Optimization Formulations Lecture 5 March 18 - 25 2020 13 33
The optimization formulation is
maxVisinRntimesr
〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r
We can write a convex relaxation of this problem once again asemidefinite program as
maxMisinSntimesn
〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ
A more compact (but nonconvex) formulation is
maxFisinRntimesr
〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R
where 983042F98304221 =983123n
i=1 983042Fi9830422
Optimization Formulations Lecture 5 March 18 - 25 2020 14 33
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is
minMS
983042M983042lowast + λ983042S9830421 st Y = M+ S
where983042S9830421 =
983131
ij
|Sij |
Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed
minLRS
1
2983042LRT + SminusY9830422F
Optimization Formulations Lecture 5 March 18 - 25 2020 15 33
Partially observed
minLRS
1
2983042PΦ(LR
T + SminusY)9830422F
where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set
One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations
Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time
Optimization Formulations Lecture 5 March 18 - 25 2020 16 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
6 Sparse principal components
We have a sample covariance matrix S that is estimated from anumber of observations of some underlying random vector
The principal components of this matrix are the eigenvectorscorresponding to the leading eigenvalues
It is often of interest to find a sparse principal componentapproximation to the leading eigenvector that also contain fewnonzeros
An explicit optimization formulation of this problem is
$$\max_{v \in \mathbb{R}^n} \; v^T S v \quad \text{s.t.} \; \|v\|_2 = 1, \; \|v\|_0 \le k,$$
where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$.
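The $\ell_0$-constrained problem is combinatorial, but simple heuristics often work well. Below is a minimal sketch (not from the lecture; the function name and initialization are my own choices) of a truncated power iteration: ordinary power iteration on $S$, with the iterate re-truncated to its $k$ largest-magnitude entries at every step.

```python
import numpy as np

def sparse_pca_truncated_power(S, k, iters=100):
    """Heuristic for  max v^T S v  s.t.  ||v||_2 = 1, ||v||_0 <= k:
    power iteration on S, keeping only the k largest-magnitude
    entries of the iterate at every step."""
    n = S.shape[0]
    # Initialize from the column of S with the largest norm.
    j = int(np.argmax(np.linalg.norm(S, axis=0)))
    v = S[:, j] / np.linalg.norm(S[:, j])
    for _ in range(iters):
        v = S @ v
        if k < n:
            v[np.argsort(np.abs(v))[: n - k]] = 0.0  # truncate to k entries
        v /= np.linalg.norm(v)
    return v
```

On a diagonal $S$ the iterate concentrates on the coordinate with the largest eigenvalue, recovering the (here trivially sparse) leading eigenvector.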
Optimization Formulations Lecture 5 March 18 - 25 2020 12 33
A relaxation: replace $vv^T$ by a positive semidefinite proxy $M \in \mathcal{S}^{n \times n}$:
$$\max_{M \in \mathcal{S}^{n \times n}} \; \langle S, M \rangle \quad \text{s.t.} \; M \succeq 0, \; \langle I, M \rangle = 1, \; \|M\|_1 \le \rho,$$
for some parameter $\rho > 0$ that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact a semidefinite program.
More generally, one may need to find the leading $r > 1$ sparse principal components: approximations to the leading $r$ eigenvectors that also contain few nonzeros.
Ideally we would obtain these from a matrix $V \in \mathbb{R}^{n \times r}$ whose columns are mutually orthogonal and have at most $k$ nonzeros each.
The optimization formulation is
$$\max_{V \in \mathbb{R}^{n \times r}} \; \langle S, VV^T \rangle \quad \text{s.t.} \; V^T V = I, \; \|v_i\|_0 \le k, \; i = 1, \dots, r.$$
We can write a convex relaxation of this problem, once again a semidefinite program, as
$$\max_{M \in \mathcal{S}^{n \times n}} \; \langle S, M \rangle \quad \text{s.t.} \; 0 \preceq M \preceq I, \; \langle I, M \rangle = r, \; \|M\|_1 \le \rho.$$
A more compact (but nonconvex) formulation is
$$\max_{F \in \mathbb{R}^{n \times r}} \; \langle S, FF^T \rangle \quad \text{s.t.} \; \|F\|_2 \le 1, \; \|F\|_{2,1} \le R,$$
where $\|F\|_{2,1} = \sum_{i=1}^n \|F_i\|_2$, the sum of the $\ell_2$ norms of the rows $F_i$ of $F$.
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed $n \times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is
$$\min_{M, S} \; \|M\|_* + \lambda \|S\|_1 \quad \text{s.t.} \; Y = M + S,$$
where $\|M\|_*$ is the nuclear norm (the sum of singular values of $M$) and $\|S\|_1 = \sum_{i,j} |S_{ij}|$.
Compact nonconvex formulations that allow noise in the observations include the following ($L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, $S \in \mathcal{S}$, a set of sparse matrices). Fully observed:
$$\min_{L, R, S} \; \frac{1}{2} \|LR^T + S - Y\|_F^2.$$
Partially observed:
$$\min_{L, R, S} \; \frac{1}{2} \|P_\Phi(LR^T + S - Y)\|_F^2,$$
where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is the projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.
Another application is to foreground-background separation in video processing. Here each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.
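As a concrete illustration of the fully observed decomposition, here is a minimal sketch (my own, not an algorithm from the lecture) that alternates between the low-rank part, fit by a rank-$r$ truncated SVD of $Y - S$, and the sparse part, fit by entrywise soft-thresholding of $Y - M$ (the prox operator of the $\ell_1$ penalty). The rank $r$ and threshold `tau` are user choices.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise shrinkage: the prox operator of tau * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def sparse_plus_low_rank(Y, r, tau, iters=50):
    """Alternately fit a rank-r matrix M (truncated SVD of Y - S)
    and a sparse matrix S (soft-thresholding of Y - M)."""
    S = np.zeros_like(Y)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(Y - S, full_matrices=False)
        M = (U[:, :r] * sig[:r]) @ Vt[:r]   # best rank-r fit to Y - S
        S = soft_threshold(Y - M, tau)
    return M, S
```

By construction, after each sparse update every entry of the residual $Y - M - S$ has magnitude at most `tau`, and $M$ has rank at most $r$.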
8 Subspace identification
In this application, the $a_j \in \mathbb{R}^n$, $j = 1, \dots, m$, are vectors that lie (approximately) in a low-dimensional subspace.
The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n \times r}$.
If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n \times m$ matrix $A = [a_1 \; a_2 \; \cdots \; a_m]$ and take $X$ to be the leading $r$ left singular vectors.
In interesting variants of this problem, however, the vectors $a_j$ may arrive in streaming fashion and may be only partly observed, for example in the indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that
$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1, \dots, m.$$
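For the fully observed case, the SVD recipe is essentially a one-liner. A sketch (the function name is my own):

```python
import numpy as np

def subspace_basis(A, r):
    """Leading r left singular vectors of A (whose columns are the a_j):
    an orthonormal basis X whose column space best fits the a_j."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :r]
```

When the columns of $A$ truly lie in an $r$-dimensional subspace, projecting them onto the range of $X$ reproduces them exactly.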
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.
This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that
$$a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$a_j^T x - \beta \le -1 \quad \text{when } y_j = -1.$$
Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$, which separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$. Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the planes $a^T x = \beta \pm 1$ is $2 / \|x\|_2$.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form
$$H(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta), \, 0\} \; \ge 0.$$
Note that the $j$th term in this summation is zero if the conditions above are satisfied and positive otherwise, so $\min_{x, \beta} H(x, \beta) = 0$ means that a separating hyperplane exists.
Regularized version:
$$H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta), \, 0\} + \frac{\lambda}{2} \|x\|_2^2.$$
If $\lambda$ is sufficiently small (but positive) and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability and robustness.
Figure: linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then
$$\min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2 \quad \text{subject to} \quad s_j \ge 1 - y_j(a_j^T x - \beta), \; s_j \ge 0, \; j = 1, \dots, m,$$
where $\mathbf{1} = (1, 1, \dots, 1)^T \in \mathbb{R}^m$.
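The QP can be handed to any quadratic programming solver, but the unconstrained form $H_\lambda$ can also be minimized directly by subgradient descent. A minimal sketch (my own; the step size and iteration count are arbitrary choices, not tuned recommendations):

```python
import numpy as np

def hinge_objective(x, beta, A, y, lam):
    """H_lam(x, beta) = (1/m) sum_j max(1 - y_j (a_j^T x - beta), 0)
    + (lam/2) ||x||_2^2, with the a_j as the rows of A."""
    margins = 1.0 - y * (A @ x - beta)
    return np.mean(np.maximum(margins, 0.0)) + 0.5 * lam * (x @ x)

def svm_subgradient(A, y, lam=1e-3, steps=2000, lr=0.1):
    """Minimize H_lam by subgradient descent (a sketch, not the QP route)."""
    m, n = A.shape
    x, beta = np.zeros(n), 0.0
    for _ in range(steps):
        active = 1.0 - y * (A @ x - beta) > 0.0   # margin-violating points
        # Subgradient of the active hinge terms: -y_j a_j in x, +y_j in beta.
        gx = -(y[active][:, None] * A[active]).sum(axis=0) / m + lam * x
        gbeta = y[active].sum() / m
        x -= lr * gx
        beta -= lr * gbeta
    return x, beta
```

On linearly separable data this drives the hinge term to (near) zero, so the learned $(x, \beta)$ classifies the training points correctly.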
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. Then the conditions
$$\zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1,$$
$$\zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1$$
lead to
$$H_{\zeta, \lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(\zeta(a_j)^T x - \beta), \, 0\} + \frac{\lambda}{2} \|x\|_2^2.$$
When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.
The problem of minimizing $H_{\zeta, \lambda}(x, \beta)$ can likewise be written as a convex quadratic program by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then
$$\min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2 \quad \text{subject to} \quad s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \; s_j \ge 0, \; j = 1, \dots, m,$$
where $\mathbf{1} = (1, 1, \dots, 1)^T \in \mathbb{R}^m$.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (convex quadratic functions).
The dual problem, in $m$ variables, is
$$\min_{z \in \mathbb{R}^m} \; \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m \lambda}, \; y^T z = 0,$$
where $Q_{kl} = y_k y_l \, \zeta(a_k)^T \zeta(a_l)$.
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick."
A particularly popular choice of kernel is the Gaussian kernel,
$$K(a_k, a_l) = \exp(-\|a_k - a_l\|^2 / (2\sigma)),$$
where $\sigma$ is a positive parameter.
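A sketch of how the dual Hessian $Q$ can be assembled from the kernel alone, using the lecture's parameterization $\exp(-\|a_k - a_l\|^2/(2\sigma))$ (note that many libraries put $2\sigma^2$ in the denominator instead; the function names here are my own):

```python
import numpy as np

def gaussian_kernel_matrix(A, sigma):
    """K[k, l] = exp(-||a_k - a_l||^2 / (2 sigma)), rows a_k of A."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma))

def dual_hessian(A, y, sigma):
    """Q[k, l] = y_k y_l K(a_k, a_l), built without ever forming zeta."""
    return np.outer(y, y) * gaussian_kernel_matrix(A, sigma)
```

Since the Gaussian kernel matrix $K$ is positive semidefinite, so is $Q = \mathrm{diag}(y)\, K\, \mathrm{diag}(y)$, which is what makes the dual a convex QP.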
10 Logistic regression
We seek an "odds function" $p \in (0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:
$$p(a; x) = (1 + \exp(a^T x))^{-1},$$
and aim to choose the parameter $x$ so that
$$p(a_j; x) \approx 1 \quad \text{when } y_j = 1,$$
$$p(a_j; x) \approx 0 \quad \text{when } y_j = -1.$$
The optimal value of $x$ can be found by maximizing a log-likelihood function:
$$L(x) = \frac{1}{m} \Biggl( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \Biggr).$$
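The odds function and the log-likelihood transcribe directly into code. A sketch (my own naming; the rows of `A` are the $a_j$):

```python
import numpy as np

def odds(A, x):
    """p(a_j; x) = 1 / (1 + exp(a_j^T x)) for every row a_j of A."""
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(A, y, x):
    """L(x) = (1/m) [ sum_{j: y_j=-1} log(1 - p(a_j; x))
                      + sum_{j: y_j=1} log p(a_j; x) ]."""
    p = odds(A, x)
    return (np.log(1.0 - p[y == -1]).sum() + np.log(p[y == 1]).sum()) / len(y)
```

At $x = 0$ every $p(a_j; x) = 1/2$, so $L(0) = \log(1/2)$; a parameter aligned with the labels raises the likelihood.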
We can perform feature selection using this model by introducing a regularizer:
$$\max_x \; \frac{1}{m} \Biggl( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \Biggr) - \lambda \|x\|_1,$$
where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
Multiclass (or multinomial) logistic regression: here the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^M \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and for all $k$ we have
$$p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^M p_k(a; X) = 1.$$
If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^M$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then
$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \text{ for } \ell \ne k.$$
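These softmax probabilities are computed stably by subtracting the maximum score before exponentiating, which leaves the ratios unchanged. A sketch (my own convention: the vectors $x_{[k]}$ are stored as the rows of a matrix `X`):

```python
import numpy as np

def multiclass_odds(a, X):
    """p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l]),
    with the vectors x_[k] as the rows of X."""
    scores = X @ a
    scores = scores - scores.max()   # stabilization; ratios are unchanged
    e = np.exp(scores)
    return e / e.sum()
```

The output is a probability vector, and a dominant inner product drives its class probability toward 1, matching the limiting behavior described above.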
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$
We seek to define the vectors $x_{[k]}$ so that
$$p_k(a_j; X) \approx 1 \quad \text{when } y_{jk} = 1,$$
$$p_k(a_j; X) \approx 0 \quad \text{when } y_{jk} = 0.$$
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
$$L(X) = \frac{1}{m} \sum_{j=1}^m \Biggl[ \sum_{\ell=1}^M y_{j\ell} \, (x_{[\ell]}^T a_j) - \log \Biggl( \sum_{\ell=1}^M \exp(x_{[\ell]}^T a_j) \Biggr) \Biggr].$$
Group-sparse regularization terms can again be added to perform feature selection across the classes.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.
The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: Catherine F. Higham and Desmond J. Higham, "Deep Learning: An Introduction for Applied Mathematicians," SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Figure: deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is
$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$
where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g^l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation; and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
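The layer recursion transcribes directly. A sketch (my own, with ReLU as $\sigma$; `layers` holds the pairs $(W^l, g^l)$):

```python
import numpy as np

def relu(t):
    """Componentwise sigma(t) = max(t, 0)."""
    return np.maximum(t, 0.0)

def forward(a, layers, sigma=relu):
    """Map a_j^0 = a to a_j^D via a^l = sigma(W^l a^{l-1} + g^l)."""
    for W, g in layers:
        a = sigma(W @ a + g)
    return a
```

Feeding the output $a_j^D$ into the softmax of the previous section completes the classifier.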
Typical forms of the function $\sigma$ include the following, each acting identically on every component $t \in \mathbb{R}$ of its input vector:
(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$.
(2) Rectified linear unit (ReLU): $t \mapsto \max(t, 0)$.
(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation $w$ for the hidden-layer transformations, that is,
$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$
and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:
$$L(w, X) = \frac{1}{m} \sum_{j=1}^m \Biggl[ \sum_{\ell=1}^M y_{j\ell} \bigl( x_{[\ell]}^T a_j^D(w) \bigr) - \log \Biggl( \sum_{\ell=1}^M \exp\bigl( x_{[\ell]}^T a_j^D(w) \bigr) \Biggr) \Biggr].$$
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.
The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in $(w, X)$ is usually very large.
A relaxation replace vvT by a positive semidefinite proxyM isin Sntimesn
maxMisinSntimesn
〈SM〉F st M ≽ 0 〈IM〉F = 1 983042M9830421 le ρ
for some parameter ρ gt 0 that can be adjusted to attain thedesired sparsity This formulation is a convex optimizationproblem in fact a semidefinite programming problem
More generally one need to find the leading r gt 1 sparse principalcomponents approximations to the leading r eigenvectors thatalso contain few nonzeros
Ideally we would obtain these from a matrix V isin Rntimesr whosecolumns are mutually orthogonal and have at most k nonzeroseach
Optimization Formulations Lecture 5 March 18 - 25 2020 13 33
The optimization formulation is
maxVisinRntimesr
〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r
We can write a convex relaxation of this problem once again asemidefinite program as
maxMisinSntimesn
〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ
A more compact (but nonconvex) formulation is
maxFisinRntimesr
〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R
where 983042F98304221 =983123n
i=1 983042Fi9830422
Optimization Formulations Lecture 5 March 18 - 25 2020 14 33
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is
minMS
983042M983042lowast + λ983042S9830421 st Y = M+ S
where983042S9830421 =
983131
ij
|Sij |
Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed
minLRS
1
2983042LRT + SminusY9830422F
Optimization Formulations Lecture 5 March 18 - 25 2020 15 33
Partially observed
minLRS
1
2983042PΦ(LR
T + SminusY)9830422F
where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set
One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations
Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time
Optimization Formulations Lecture 5 March 18 - 25 2020 16 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
The optimization formulation is
maxVisinRntimesr
〈SVVT〉F st VTV = I 983042vi9830420 le k i = 1 r
We can write a convex relaxation of this problem once again asemidefinite program as
maxMisinSntimesn
〈SM〉F st 0 ≼ M ≼ I 〈IM〉F = r 983042M9830421 le ρ
A more compact (but nonconvex) formulation is
maxFisinRntimesr
〈SFFT〉F st 983042F9830422 le 1 983042F98304221 le R
where 983042F98304221 =983123n
i=1 983042Fi9830422
Optimization Formulations Lecture 5 March 18 - 25 2020 14 33
7 Sparse plus low-rank matrix decomposition
Decompose a partly or fully observed ntimes p matrix Y into the sumof a sparse matrix and a low-rank matrix A convex formulation ofthe fully-observed problem is
minMS
983042M983042lowast + λ983042S9830421 st Y = M+ S
where983042S9830421 =
983131
ij
|Sij |
Compact nonconvex formulations that allow noise in theobservations include the following (L isin Rntimesr R isin Rptimesr S isin S)Fully observed
minLRS
1
2983042LRT + SminusY9830422F
Optimization Formulations Lecture 5 March 18 - 25 2020 15 33
Partially observed
minLRS
1
2983042PΦ(LR
T + SminusY)9830422F
where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set
One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations
Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time
Optimization Formulations Lecture 5 March 18 - 25 2020 16 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1,\dots,m$. The conditions

$$\zeta(a_j)^T x - \beta \ge 1 \text{ when } y_j = 1, \qquad \zeta(a_j)^T x - \beta \le -1 \text{ when } y_j = -1$$

lead to the objective

$$H_{\zeta,\lambda}(x,\beta) = \frac{1}{m}\sum_{j=1}^{m} \max\{1 - y_j(\zeta(a_j)^T x - \beta),\, 0\} + \frac{\lambda}{2}\|x\|_2^2.$$

When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x,\beta)$.
The problem of minimizing $H_{\zeta,\lambda}(x,\beta)$ can likewise be written as a convex quadratic program by introducing variables $s_j$, $j = 1,\dots,m$, to represent the residual terms:

$$\min_{x,\beta,s}\; \frac{1}{m}\mathbf{1}^T s + \frac{\lambda}{2}\|x\|_2^2$$

subject to

$$s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \quad s_j \ge 0, \quad j = 1,\dots,m,$$

where $\mathbf{1} = (1, 1, \dots, 1)^T \in \mathbb{R}^m$.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization §5.2.4 (Lagrange dual of QCQP) and the result in First-Order Methods in Optimization §4.4.7 (Convex Quadratic Functions).
The dual problem, in $m$ variables:

$$\min_{z \in \mathbb{R}^m}\; \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda}\mathbf{1}, \quad y^T z = 0,$$

where $Q_{kl} = y_k y_l\, \zeta(a_k)^T \zeta(a_l)$.

Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel

$$K(a_k, a_l) = \exp\!\left(-\frac{\|a_k - a_l\|^2}{2\sigma}\right),$$

where $\sigma$ is a positive parameter.
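The matrix $Q$ can be assembled from the Gaussian kernel without ever forming $\zeta$. A small sketch (illustrative only; data and names invented):

```python
import numpy as np

def gaussian_gram(A, y, sigma=1.0):
    """Build Q_kl = y_k y_l K(a_k, a_l), with the Gaussian kernel
    K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2 sigma))."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)  # ||a_k - a_l||^2
    K = np.exp(-sq / (2.0 * sigma))
    return np.outer(y, y) * K

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
Q = gaussian_gram(A, y)
# Q is symmetric, and since K(a_k, a_k) = 1, its diagonal is y_k^2 = 1.
```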
10 Logistic regression
We seek an "odds function" $p \in (0,1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:

$$p(a; x) = (1 + \exp(a^T x))^{-1},$$

and aim to choose the parameter $x$ so that

$$p(a_j; x) \approx 1 \text{ when } y_j = 1, \qquad p(a_j; x) \approx 0 \text{ when } y_j = -1.$$
The optimal value of $x$ can be found by maximizing a log-likelihood function:

$$L(x) = \frac{1}{m}\left[\sum_{j:\, y_j = -1} \log(1 - p(a_j; x)) + \sum_{j:\, y_j = 1} \log p(a_j; x)\right].$$
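A small sketch of these two definitions (illustrative only; the toy data is invented). Note the sign convention: with $p(a;x) = (1+\exp(a^Tx))^{-1}$, examples with $y_j = 1$ should have $a_j^T x$ strongly negative:

```python
import numpy as np

def odds(a, x):
    """p(a; x) = (1 + exp(a^T x))^{-1}, as defined above."""
    return 1.0 / (1.0 + np.exp(a @ x))

def log_likelihood(A, y, x):
    """L(x) = (1/m)[ sum_{y_j=-1} log(1 - p(a_j; x))
                   + sum_{y_j=+1} log p(a_j; x) ]."""
    p = odds(A, x)
    terms = np.where(y == 1, np.log(p), np.log(1.0 - p))
    return terms.mean()

A = np.array([[-3.0], [3.0]])   # one feature per example
y = np.array([1.0, -1.0])
x = np.array([2.0])
print(log_likelihood(A, y, x))  # close to 0: both examples are fit well
```

Flipping the sign of $x$ misclassifies both examples and drives the log-likelihood sharply negative.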
We can perform feature selection using this model by introducing a regularizer:

$$\max_x\; \frac{1}{m}\left[\sum_{j:\, y_j = -1} \log(1 - p(a_j; x)) + \sum_{j:\, y_j = 1} \log p(a_j; x)\right] - \lambda\|x\|_1,$$

where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda\|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.

Multiclass (or multinomial) logistic regression applies when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0,1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1,\dots,M$, defined as follows:

$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^{M} \exp(a^T x_{[\ell]})}, \quad k = 1,\dots,M,$$

where $X = \{x_{[k]} \mid k = 1,\dots,M\}$.

Note that for all $a$ and for all $k$ we have

$$p_k(a; X) \in (0,1), \qquad \sum_{k=1}^{M} p_k(a; X) = 1.$$

If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^{M}$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then

$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \text{ for } \ell \ne k.$$
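Both properties are easy to check numerically. This sketch (illustrative only; data invented) also subtracts the maximum score before exponentiating, which leaves every $p_k$ unchanged but avoids overflow:

```python
import numpy as np

def class_odds(a, X):
    """p_k(a; X) = exp(a^T x[k]) / sum_l exp(a^T x[l]),
    with the class vectors x[k] stored as the columns of X."""
    scores = a @ X                        # inner products a^T x[k]
    e = np.exp(scores - scores.max())     # shift for numerical stability
    return e / e.sum()

X = np.array([[4.0, 0.0, 0.0],            # a^T x[1] dominates for a = (1, 0)
              [0.0, 1.0, 1.0]])
a = np.array([1.0, 0.0])
p = class_odds(a, X)
# p lies in (0, 1)^3, sums to 1, and the dominant class gets p_1 near 1.
```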
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:

$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$

We seek to define the vectors $x_{[k]}$ so that

$$p_k(a_j; X) \approx 1 \text{ when } y_{jk} = 1, \qquad p_k(a_j; X) \approx 0 \text{ when } y_{jk} = 0.$$
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

$$L(X) = \frac{1}{m}\sum_{j=1}^{m}\left[\sum_{\ell=1}^{M} y_{j\ell}\,(x_{[\ell]}^T a_j) - \log\left(\sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j)\right)\right].$$
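A sketch of evaluating this $L(X)$ with a stabilized log-sum-exp (illustrative only; the one-hot matrix and toy data are invented):

```python
import numpy as np

def multiclass_log_likelihood(A, Y, X):
    """L(X) = (1/m) sum_j [ sum_l Y_jl x[l]^T a_j - log sum_l exp(x[l]^T a_j) ],
    with rows a_j of A, one-hot rows of Y, class vectors in the columns of X."""
    S = A @ X                                   # S_jl = x[l]^T a_j
    shift = S.max(axis=1, keepdims=True)        # stabilized log-sum-exp
    lse = shift[:, 0] + np.log(np.exp(S - shift).sum(axis=1))
    return ((Y * S).sum(axis=1) - lse).mean()

A = np.eye(2)                # two examples, two features
Y = np.eye(2)                # example j belongs to class j
X = 5.0 * np.eye(2)          # x[k] strongly aligned with class k
print(multiclass_log_likelihood(A, Y, X))   # near 0: both examples fit well
```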
Group-sparse regularization terms
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.

The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Figure: Deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.

A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^{l}$ at layer $l$, is

$$a_j^{l} = \sigma(W_l\, a_j^{l-1} + g_l), \quad l = 1, 2, \dots, D,$$

where $W_l$ is a matrix of dimension $|a_j^{l}| \times |a_j^{l-1}|$ and $g_l$ is a vector of length $|a_j^{l}|$; $\sigma$ is a componentwise nonlinear transformation, and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix $W_l$. Define $a_j^{0}$ to be the "raw" input vector $a_j$, and let $a_j^{D}$ be the vector formed by the nodes at the topmost hidden layer.
Typical forms of the function $\sigma$, acting identically on each component $t \in \mathbb{R}$ of its input vector, include the following:

(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;

(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;

(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs $(W_l, g_l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^{D}$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
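The layer recursion can be sketched in a few lines (illustrative only; the network sizes are made up), using ReLU for $\sigma$:

```python
import numpy as np

def forward(a, weights, biases, sigma=lambda t: np.maximum(t, 0.0)):
    """Compute a^l = sigma(W_l a^{l-1} + g_l) for l = 1..D (ReLU by default),
    returning the topmost hidden-layer vector a^D."""
    for W, g in zip(weights, biases):
        a = sigma(W @ a + g)
    return a

# A hypothetical network with two hidden layers: 3 -> 4 -> 2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
aD = forward(np.ones(3), weights, biases)
# The ReLU output is entrywise nonnegative, with shape set by the last W_l.
```

With empty `weights` and `biases` (i.e. $D = 0$) the input passes through unchanged, which is the special case used for multiclass logistic regression below.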
Using the notation $w$ for the hidden-layer transformations, that is,

$$w = (W_1, g_1, W_2, g_2, \dots, W_D, g_D),$$

and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:

$$L(w, X) = \frac{1}{m}\sum_{j=1}^{m}\left[\sum_{\ell=1}^{M} y_{j\ell}\,(x_{[\ell]}^T a_j^{D}(w)) - \log\left(\sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j^{D}(w))\right)\right].$$
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^{D} = a_j$, $j = 1, 2, \dots, m$.

The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in $(w, X)$ is usually very large.
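Composing the layer recursion with the multiclass log-likelihood gives $L(w, X)$. The sketch below (illustrative only; toy data invented) also confirms the remark above: with no hidden layers, the loss is exactly the multiclass logistic regression objective.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def deep_loss(A, Y, layers, X):
    """L(w, X): push each a_j through a^l = relu(W_l a^{l-1} + g_l), then apply
    the multiclass logistic log-likelihood at the top. layers = [(W_1, g_1), ...]."""
    H = A
    for W, g in layers:
        H = relu(H @ W.T + g)                # rows of H are the vectors a_j^l
    S = H @ X                                # S_jl = x[l]^T a_j^D(w)
    shift = S.max(axis=1, keepdims=True)     # stabilized log-sum-exp
    lse = shift[:, 0] + np.log(np.exp(S - shift).sum(axis=1))
    return ((Y * S).sum(axis=1) - lse).mean()

A = np.eye(2)
Y = np.eye(2)
X = 5.0 * np.eye(2)
# With no hidden layers (D = 0), this is multiclass logistic regression.
print(deep_loss(A, Y, [], X))
```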
7 Sparse plus low-rank matrix decomposition

Decompose a partly or fully observed $n \times p$ matrix $Y$ into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully observed problem is

$$\min_{M, S}\; \|M\|_* + \lambda\|S\|_1 \quad \text{s.t.} \quad Y = M + S,$$

where $\|S\|_1 = \sum_{i,j} |S_{ij}|$.

Compact nonconvex formulations that allow noise in the observations include the following (with $L \in \mathbb{R}^{n \times r}$, $R \in \mathbb{R}^{p \times r}$, and a sparse matrix $S \in \mathbb{R}^{n \times p}$). Fully observed:

$$\min_{L, R, S}\; \frac{1}{2}\|L R^T + S - Y\|_F^2.$$
Partially observed:

$$\min_{L, R, S}\; \frac{1}{2}\|P_\Phi(L R^T + S - Y)\|_F^2,$$

where $\Phi$ represents the locations of the observed entries of $Y$ and $P_\Phi$ is projection onto this set.
One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations.

Another application is to foreground-background separation in video processing. Here each column of $Y$ represents the pixels in one frame of video, whereas each row of $Y$ shows the evolution of one pixel over time.
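A sketch of evaluating the partially observed objective above (illustrative only; dimensions and data are invented), with $P_\Phi$ implemented as an entrywise 0/1 mask:

```python
import numpy as np

def masked_residual(L, R, S, Y, mask):
    """(1/2) || P_Phi(L R^T + S - Y) ||_F^2; mask is 1 on observed entries."""
    E = mask * (L @ R.T + S - Y)
    return 0.5 * (E ** 2).sum()

rng = np.random.default_rng(1)
n, p, r = 6, 5, 2
L = rng.standard_normal((n, r))
R = rng.standard_normal((p, r))
S = np.zeros((n, p)); S[0, 0] = 3.0           # one sparse "outlier"
Y = L @ R.T + S                               # exact low-rank + sparse matrix
mask = (rng.random((n, p)) < 0.8).astype(float)
print(masked_residual(L, R, S, Y, mask))      # 0.0: the factors explain Y exactly
```

Dropping the sparse part leaves exactly the outlier entry unexplained, so the fully observed residual becomes $\frac{1}{2}\cdot 3^2 = 4.5$.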
8 Subspace identification

In this application, the $a_j \in \mathbb{R}^n$, $j = 1,\dots,m$, are vectors that lie (approximately) in a low-dimensional subspace.

The aim is to identify this subspace, expressed as the column subspace of a matrix $X \in \mathbb{R}^{n \times r}$.

If the $a_j$ are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the $n \times m$ matrix $A = [a_j]_{j=1}^{m}$ and take $X$ to be the leading $r$ left singular vectors. (The $a_j$ live in $\mathbb{R}^n$, so the column space of $A$ is spanned by its left singular vectors.)

In interesting variants of this problem, however, the vectors $a_j$ may arrive in streaming fashion and may be only partly observed, for example in the indices $\Phi_j \subset \{1, 2, \dots, n\}$. We would thus need to identify a matrix $X$ and vectors $s_j \in \mathbb{R}^r$ such that

$$P_{\Phi_j}(a_j - X s_j) \approx 0, \quad j = 1,\dots,m.$$
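The fully observed case above takes a few lines (an illustrative sketch with invented synthetic data): compute the SVD, keep the leading $r$ left singular vectors, and check that their span reproduces every $a_j$.

```python
import numpy as np

# Generate vectors a_j lying exactly in a 2-dimensional subspace of R^5.
rng = np.random.default_rng(2)
n, m, r = 5, 40, 2
basis = np.linalg.qr(rng.standard_normal((n, r)))[0]   # orthonormal n x r
A = basis @ rng.standard_normal((r, m))                # columns a_j in the subspace

# The leading r left singular vectors of A span its column subspace.
U, s, Vt = np.linalg.svd(A)
X = U[:, :r]

# Every a_j is reproduced by its orthogonal projection onto range(X).
residual = A - X @ (X.T @ A)
print(np.abs(residual).max())   # ~ 0 up to roundoff
```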
9 Support vector machines

Classification via support vector machines (SVM) is a classical paradigm in machine learning.

This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that

$$a_j^T x - \beta \ge 1 \text{ when } y_j = 1, \qquad a_j^T x - \beta \le -1 \text{ when } y_j = -1.$$

Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$, which separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$.

Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the planes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
Partially observed
minLRS
1
2983042PΦ(LR
T + SminusY)9830422F
where Φ represents the locations of the observed entries of Y andPΦ is projection onto this set
One application of these formulations is to robust PCA where thelow-rank part represents principal components and the sparse partrepresents ldquooutlierrdquoobservations
Another application is to foreground-background separation invideo processing Here each column of Y represents the pixels inone frame of video whereas each row of Y shows the evolution ofone pixel over time
Optimization Formulations Lecture 5 March 18 - 25 2020 16 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
8 Subspace identification
In this application the aj isin Rn j = 1 m are vectors that lie(approximately) in a low-dimensional subspace
The aim is to identify this subspace expressed as the columnsubspace of a matrix X isin Rntimesr
If the aj are fully observed an obvious way to solve this problemis to perform a singular value decomposition of the ntimesm matrixA =
983045aj983046mj=1
and take X to be the leading r right singular vectors
In interesting variants of this problem however the vectors ajmay be arriving in streaming fashion and may be only partlyobserved for example in indices Φj sub 1 2 n We would thusneed to identify a matrix X and vectors sj isin Rr such that
PΦj (aj minusXsj) asymp 0 j = 1 m
Optimization Formulations Lecture 5 March 18 - 25 2020 17 33
9 Support vector machines
Classification via support vector machines (SVM) is a classical paradigm in machine learning.

This problem takes as input data $(a_j, y_j)$ with $a_j \in \mathbb{R}^n$ and $y_j \in \{-1, 1\}$, and seeks a vector $x \in \mathbb{R}^n$ and a scalar $\beta \in \mathbb{R}$ such that
\[
a_j^T x - \beta \ge 1 \quad \text{when } y_j = 1,
\]
\[
a_j^T x - \beta \le -1 \quad \text{when } y_j = -1.
\]

Any pair $(x, \beta)$ that satisfies these conditions defines the separating hyperplane $a^T x = \beta$ in $\mathbb{R}^n$, which separates the "positive" cases $\{a_j \mid y_j = 1\}$ from the "negative" cases $\{a_j \mid y_j = -1\}$.

Among all separating hyperplanes, the one that minimizes $\|x\|_2$ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point $a_j$ of either class is greatest. (The distance between the hyperplanes $a^T x = \beta \pm 1$ is $2/\|x\|_2$.)
We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form
\[
H(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta),\, 0\} \ge 0.
\]
Note that the $j$th term in this summation is zero if the conditions on the previous page are satisfied, and positive otherwise; thus $\min_{x,\beta} H(x, \beta) = 0$ exactly when a separating hyperplane exists.
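A small sketch of evaluating this hinge objective; the function name and the toy data points are illustrative, not from the text.

```python
import numpy as np

def hinge_objective(x, beta, A, y):
    """H(x, beta) = (1/m) * sum_j max{1 - y_j (a_j^T x - beta), 0},
    with the vectors a_j stacked as the rows of A."""
    margins = y * (A @ x - beta)
    return np.mean(np.maximum(1.0 - margins, 0.0))

# Two separable points and a hyperplane satisfying both margin conditions:
A = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
print(hinge_objective(np.array([1.0, 0.0]), 0.0, A, y))   # 0.0: separation achieved
```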
Regularized version:
\[
H_\lambda(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(a_j^T x - \beta),\, 0\} + \frac{\lambda}{2} \|x\|_2^2.
\]
If $\lambda$ is sufficiently small (but positive) and if separating hyperplanes exist, the pair $(x, \beta)$ that minimizes $H_\lambda(x, \beta)$ is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability and robustness.

Figure: linear support vector machine classification, with one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
The problem of minimizing $H_\lambda(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then
\[
\min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2
\quad \text{subject to} \quad
s_j \ge 1 - y_j(a_j^T x - \beta), \; s_j \ge 0, \; j = 1, \dots, m,
\]
where $\mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^m$.
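As an alternative to the quadratic-programming formulation above, $H_\lambda$ can also be minimized directly by a simple subgradient method. The sketch below is illustrative only (step size, iteration count, and data are assumptions); the QP is the exact formulation in the text.

```python
import numpy as np

def subgradient_svm(A, y, lam=0.01, steps=2000, lr=0.1):
    """Minimize H_lambda(x, beta) by subgradient descent.
    For each active hinge term (margin < 1), the subgradient contributes
    -y_j a_j / m in x and +y_j / m in beta; the l2 term adds lam * x."""
    m, n = A.shape
    x, beta = np.zeros(n), 0.0
    for _ in range(steps):
        margins = y * (A @ x - beta)
        active = margins < 1.0                      # terms with a nonzero hinge
        gx = -(A[active] * y[active, None]).sum(axis=0) / m + lam * x
        gb = y[active].sum() / m
        x -= lr * gx
        beta -= lr * gb
    return x, beta

A = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
x, beta = subgradient_svm(A, y)
print(np.all(np.sign(A @ x - beta) == y))           # True: training data separated
```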
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. How should we treat this case?
One solution is to transform all of the raw data vectors $a_j$ by a mapping $\zeta$ into a higher-dimensional Euclidean space, then perform the support-vector-machine classification on the vectors $\zeta(a_j)$, $j = 1, \dots, m$. Then the conditions
\[
\zeta(a_j)^T x - \beta \ge 1 \quad \text{when } y_j = 1,
\]
\[
\zeta(a_j)^T x - \beta \le -1 \quad \text{when } y_j = -1
\]
lead to
\[
H_{\zeta,\lambda}(x, \beta) = \frac{1}{m} \sum_{j=1}^m \max\{1 - y_j(\zeta(a_j)^T x - \beta),\, 0\} + \frac{\lambda}{2} \|x\|_2^2.
\]
When transformed back to $\mathbb{R}^n$, the surface $\{a \mid \zeta(a)^T x - \beta = 0\}$ is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from minimizing $H_\lambda(x, \beta)$.
The problem of minimizing $H_{\zeta,\lambda}(x, \beta)$ can be written as a convex quadratic program (having a convex quadratic objective and linear constraints) by introducing variables $s_j$, $j = 1, \dots, m$, to represent the residual terms. Then
\[
\min_{x, \beta, s} \; \frac{1}{m} \mathbf{1}^T s + \frac{\lambda}{2} \|x\|_2^2
\quad \text{subject to} \quad
s_j \ge 1 - y_j(\zeta(a_j)^T x - \beta), \; s_j \ge 0, \; j = 1, \dots, m,
\]
where $\mathbf{1} = [1, 1, \cdots, 1]^T \in \mathbb{R}^m$.
The dual of this convex quadratic program is another convex quadratic program. One can obtain the dual problem via the result in Convex Optimization, §5.2.4 (Lagrange dual of QCQP), and the result in First-Order Methods in Optimization, §4.4.7 (Convex Quadratic Functions).
The dual problem, in $m$ variables:
\[
\min_{z \in \mathbb{R}^m} \; \frac{1}{2} z^T Q z - \mathbf{1}^T z
\quad \text{subject to} \quad
0 \le z \le \frac{1}{m\lambda}, \; y^T z = 0,
\]
where $Q_{kl} = y_k y_l \zeta(a_k)^T \zeta(a_l)$.
Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel
\[
K(a_k, a_l) = \exp\left(-\|a_k - a_l\|^2 / (2\sigma)\right),
\]
where $\sigma$ is a positive parameter.
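With the Gaussian kernel, the dual matrix $Q$ can be assembled without ever forming $\zeta$ explicitly. A sketch (the data points and $\sigma$ are illustrative):

```python
import numpy as np

def gaussian_kernel_Q(A, y, sigma=1.0):
    """Q_kl = y_k y_l K(a_k, a_l), with the Gaussian kernel
    K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2 * sigma))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T   # pairwise squared distances
    return (y[:, None] * y[None, :]) * np.exp(-d2 / (2.0 * sigma))

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
Q = gaussian_kernel_Q(A, y)
print(Q.shape, np.allclose(Q, Q.T))                  # (3, 3) True
```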
10 Logistic regression
We seek an "odds function" $p \in (0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:
\[
p(a; x) = (1 + \exp(a^T x))^{-1},
\]
and aim to choose the parameter $x$ so that
\[
p(a_j; x) \approx 1 \quad \text{when } y_j = 1,
\]
\[
p(a_j; x) \approx 0 \quad \text{when } y_j = -1.
\]
The optimal value of $x$ can be found by maximizing a log-likelihood function:
\[
L(x) = \frac{1}{m} \left( \sum_{j : y_j = -1} \log(1 - p(a_j; x)) + \sum_{j : y_j = 1} \log p(a_j; x) \right).
\]
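A sketch of evaluating $L(x)$ under this convention; the data are illustrative. Note that with the definition above, $p(a; x) \approx 1$ requires $a^T x$ to be large and negative.

```python
import numpy as np

def odds(A, x):
    """p(a; x) = (1 + exp(a^T x))^{-1}; with this convention p is close
    to 1 when a^T x is large and NEGATIVE."""
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(x, A, y):
    """L(x) from the text: average log-likelihood over the m samples."""
    p = odds(A, x)
    return (np.sum(np.log(1.0 - p[y == -1])) + np.sum(np.log(p[y == 1]))) / len(y)

A = np.array([[-3.0], [-2.5], [3.0], [2.0]])   # one feature per sample
y = np.array([1.0, 1.0, -1.0, -1.0])
x = np.array([1.0])
print(log_likelihood(x, A, y))                 # close to 0: a near-perfect fit
```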
We can perform feature selection using this model by introducing a regularizer:
\[
\max_x \; \frac{1}{m} \left( \sum_{j : y_j = -1} \log(1 - p(a_j; x)) + \sum_{j : y_j = 1} \log p(a_j; x) \right) - \lambda \|x\|_1,
\]
where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
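The sparsity-inducing effect of the $\ell_1$ term can be seen in its proximal operator, the soft-thresholding map. This operator is not discussed in the text; it is shown here only to illustrate why small coefficients are driven exactly to zero.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1: shrink each component toward zero
    and clamp components with |x_i| <= t exactly to zero. This shrinkage is
    the mechanism by which the l1 penalty produces sparse solutions."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([3.0, -0.2, 0.05, -1.5])
print(soft_threshold(x, 0.5))    # components with magnitude <= 0.5 become zero
```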
Multiclass (or multinomial) logistic regression applies when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
\[
p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^M \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,
\]
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.

Note that for all $a$ and for all $k$ we have
\[
p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^M p_k(a; X) = 1.
\]
If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^M$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then
\[
p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \text{ for } \ell \ne k.
\]
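A numerically stable sketch of evaluating $p_k(a; X)$; the max-subtraction trick is a standard implementation detail, not part of the text, and the data are illustrative.

```python
import numpy as np

def softmax_odds(a, X):
    """p_k(a; X) for X whose columns are the vectors x_[k]; subtracting the
    maximum inner product avoids overflow without changing the ratios."""
    z = X.T @ a               # inner products a^T x_[k]
    z = z - z.max()           # numerical stability
    p = np.exp(z)
    return p / p.sum()

a = np.array([1.0, 2.0])
X = np.column_stack([[5.0, 5.0], [0.1, 0.2], [0.0, -1.0]])  # class 1 dominates
p = softmax_odds(a, X)
print(p)    # probabilities sum to 1; nearly all mass on class 1 (index 0)
```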
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
\[
y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}
\]
We seek to define the vectors $x_{[k]}$ so that
\[
p_k(a_j; X) \approx 1 \quad \text{when } y_{jk} = 1,
\]
\[
p_k(a_j; X) \approx 0 \quad \text{when } y_{jk} = 0.
\]
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
\[
L(X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} \left( x_{[\ell]}^T a_j \right) - \log \left( \sum_{\ell=1}^M \exp\left( x_{[\ell]}^T a_j \right) \right) \right].
\]
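A sketch of evaluating $L(X)$ on a toy problem; the data and parameter values are illustrative.

```python
import numpy as np

def multiclass_log_likelihood(X, A, Y):
    """L(X) = (1/m) sum_j [ sum_l Y_jl (x_[l]^T a_j) - log sum_l exp(x_[l]^T a_j) ],
    with samples a_j as rows of A, one-hot labels as rows of Y, and the
    vectors x_[l] as the columns of X."""
    Z = A @ X                                            # Z[j, l] = x_[l]^T a_j
    return np.mean(np.sum(Y * Z, axis=1) - np.log(np.sum(np.exp(Z), axis=1)))

A = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])                   # one-hot labels
print(multiclass_log_likelihood(5.0 * np.eye(2), A, Y))  # near 0: good parameters
print(multiclass_log_likelihood(np.zeros((2, 2)), A, Y)) # -log(2): uninformative
```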
Feature selection can again be encouraged by adding group-sparse regularization terms.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.

The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: Catherine F. Higham and Desmond J. Higham, "Deep Learning: An Introduction for Applied Mathematicians", SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Figure: deep neural network, showing connections between adjacent layers.
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.

A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is
\[
a_j^l = \sigma(W_l a_j^{l-1} + g_l), \quad l = 1, 2, \dots, D,
\]
where $W_l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g_l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation; and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix $W_l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
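The layer recursion can be sketched as follows; the layer widths and random parameters are illustrative, and ReLU (defined on the next page) is used as $\sigma$.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(a, layers, sigma=relu):
    """Apply a^l = sigma(W_l a^{l-1} + g_l) for l = 1, ..., D; return a^D."""
    for W, g in layers:
        a = sigma(W @ a + g)
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]    # input width, two hidden widths, topmost hidden width
layers = [(rng.standard_normal((sizes[l + 1], sizes[l])),
           rng.standard_normal(sizes[l + 1])) for l in range(len(sizes) - 1)]

aD = forward(rng.standard_normal(4), layers)
print(aD.shape)          # (3,): ready to be fed to the multiclass classifier
```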
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:

(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;

(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;

(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W_l, g_l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation $w$ for the hidden layer transformations, that is,
\[
w = (W_1, g_1, W_2, g_2, \dots, W_D, g_D),
\]
and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:
\[
L(w, X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} \left( x_{[\ell]}^T a_j^D(w) \right) - \log \left( \sum_{\ell=1}^M \exp\left( x_{[\ell]}^T a_j^D(w) \right) \right) \right].
\]
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.

The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in $(w, X)$ is usually very large.
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
9 Support vector machines
Classification via support vector machines (SVM) is a classicalparadigm in machine learning
This problem takes as input data (aj yj) with aj isin Rn andyj isin minus1 1 and seeks a vector x isin Rn and a scalar β isin R suchthat
aTj xminus β ge 1 when yj = 1
aTj xminus β le minus1 when yj = minus1
Any pair (xβ) that satisfies these conditions defines the separatinghyperplane aTx = β in Rn that separates the ldquopositiverdquo casesaj |yj = 1 from the ldquonegativerdquo cases aj |yj = minus1Among all separating hyperplanes the one that minimizes 983042x9830422 isthe one that maximizes the margin between the two classes thatis the hyperplane whose distance to the nearest point aj of eitherclass is greatest (Distance between aTx = β plusmn 1 is 2983042x9830422)
Optimization Formulations Lecture 5 March 18 - 25 2020 18 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
We can formulate the problem of finding a separating hyperplaneas an optimization problem by defining an objective with thesummation form
H(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0 ge 0
Note that the jth term in this summation is zero if the conditionsin last page are satisfied and positive otherwise min
xβH(xβ) = 0
means the existence of a separating hyperplane
Regularized version
Hλ(xβ) =1
m
m983131
j=1
max1minus yj(aTj xminus β) 0+ λ
2983042x98304222
If λ is sufficiently small (but positive) and if separatinghyperplanes exist the pair (xβ) that minimizes Hλ(xβ) is the
maximum-margin separating hyperplane limxrarrx0
H(xβ)minusH(x0β0)983042x098304222minus983042x98304222
Optimization Formulations Lecture 5 March 18 - 25 2020 19 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
Optimization Formulations Lecture 5 March 18 - 25 2020 23 33
The dual problem in m variables
minzisinRm
1
2zTQzminus 1Tz subject to 0 le z le 1
mλ yTz = 0
whereQkl = ykylζ(ak)
Tζ(al)
Interestingly this problem can be formulated and solved withoutany explicit knowledge or definition of the mapping ζ We needonly a technique to define the elements of Q This can be donewith the use of a kernel function K Rn times Rn 983041rarr R whereK(akal) replaces ζ(ak)
Tζ(al) This is the so-called ldquokernel trickrdquo
A particularly popular choice of kernel is the Gaussian kernel
K(akal) = exp(minus983042ak minus al9830422(2σ))
where σ is a positive parameter
Optimization Formulations Lecture 5 March 18 - 25 2020 24 33
10 Logistic regression
We seek an ldquoodds functionrdquo p isin (0 1) parametrized by a vectorx isin Rn as follows
p(ax) = (1 + exp(aTx))minus1
and aim to choose the parameter x so that
p(aj x) asymp 1 when yj = 1
p(aj x) asymp 0 when yj = minus1
The optimal value of x can be found by maximizing alog-likelihood function
L(x) =1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108
Optimization Formulations Lecture 5 March 18 - 25 2020 25 33
We can perform feature selection using this model by introducinga regularizer
maxx
1
m
983091
983107983131
jyj=minus1
log(1minus p(aj x)) +983131
jyj=1
log p(aj x)
983092
983108minus λ983042x9830421
where λ gt 0 is a regularization parameter
The regularization term λ983042x9830421 has the effect of producing asolution in which few components of x are nonzero making itpossible to evaluate p(ax) by knowing only those components ofa that correspond to the nonzeros in x
Multiclass (or multinomial) logistic regression in which the datavectors aj belong to more than two classes Assume in total Mclasses We need a distinct odds functions pk isin (0 1) for each class
Optimization Formulations Lecture 5 March 18 - 25 2020 26 33
These M functions are parametrized by vectors x[k] isin Rnk = 1 M defined as follows
pk(aX) =exp(aTx[k])983123Mℓ=1 exp(a
Tx[ℓ]) k = 1 M
whereX = x[k]|k = 1 M
Note that for all a and for all k we have
pk(aX) isin (0 1)
M983131
k=1
pk(aX) = 1
If one of these inner products aTx[ℓ]Mℓ=1 dominates the others
that is aTx[k] ≫ aTx[ℓ] for all ℓ ∕= k then
pk(aX) asymp 1 and pℓ(aX) asymp 0 for ℓ ∕= k
Optimization Formulations Lecture 5 March 18 - 25 2020 27 33
In the setting of multiclass logistic regression the labels yj arevectors in RM whose elements are defined as follows
yjk =
9830831 when aj belongs to calss k
0 otherwise
We seek to define the vectors x[k] so that
pk(aj X) asymp 1 when yjk = 1
pk(aj X) asymp 0 when yjk = 0
The problem of finding values of x[k] that satisfy these conditionscan again be formulated as one of maximizing a log-likelihood
L(X) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]aj)minus log
983075M983131
ℓ=1
exp(xT[ℓ]aj)
983076983078
Group-sparse regularization terms
Optimization Formulations Lecture 5 March 18 - 25 2020 28 33
11 Deep learning
Deep neural networks are often designed to perform the samefunction as multiclass logistic regression that is to classify a datavector a into one of M possible classes where M ge 2 is large insome key applications
The difference is that the data vector a undergoes a series ofstructured transformations before being passed through amulticlass logistic regression classifier of the type described in theprevious section
A wonderful reference
Deep Learning An Introduction for Applied Mathematicians
Catherine F Higham and Desmond J Higham
SIAM Review 2019
The data vector aj enters at the bottom of the network each nodein the bottom layer corresponding to one component of aj
Optimization Formulations Lecture 5 March 18 - 25 2020 29 33
Deep neural network showing connections between adjacent layers
Optimization Formulations Lecture 5 March 18 - 25 2020 30 33
The vector then moves upward through the network undergoing astructured nonlinear transformation as it moves from one layer tothe next
A typical form of this transformation which converts the vectoralminus1j at layer l minus 1 to input vector alj at layer l is
alj = σ(Wlalminus1j + gl) l = 1 2 D
where Wl is a matrix of dimension |alj |times |alminus1j | and gl is a vector
of length |alj | σ is a componentwise nonlinear transformation andD is the number of hidden layers defined as the layers situatedstrictly between the bottom and top layers
Each arc in the figure represents one of the elements of atransformation matrix Wl Define a0j to be the ldquorawrdquo input vector
aj and let aDj be the vector formed by the nodes at the topmosthidden layer
Optimization Formulations Lecture 5 March 18 - 25 2020 31 33
Typical forms of the function σ include the following actingidentically on each component t isin R of its input vector
(1) Logistic function t 983041rarr 1(1 + eminust)
(2) Rectified Linear Unit ReLU t 983041rarr max(t 0)
(3) Bernoulli a random function that outputs 1 with probability1(1 + eminust) and 0 otherwise
Each node in the top layer corresponds to a particular class andthe output of each node corresponds to the odds of the inputvector belonging to each class
The parameters in this neural network are the matrix-vector pairs(Wlgl) l = 1 2 D that transform the input vector aj into itsform aDj at the topmost hidden layer together with theparameters X of the multiclass logistic regression operation thattakes place at the very top stage
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
The maximum-margin property is consistent with the goals ofgeneralizability and robustness
Linear support vector machine classification with one classrepresented by circles and the other by squares One possiblechoice of separating hyperplane is shown at left If the observeddata is an empirical sample drawn from a cloud of underlying datapoints this plane does not do well in separating the two clouds(middle) The maximum-margin separating hyperplane doesbetter (right)
Optimization Formulations Lecture 5 March 18 - 25 2020 20 33
The problem of minimizing Hλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(aTj xminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
Often it is not possible to find a hyperplane that separates thepositive and negative cases well enough to be useful as a classifierThen how to treat this case
Optimization Formulations Lecture 5 March 18 - 25 2020 21 33
One solution is to transform all of the raw data vectors aj by amapping ζ into a higher-dimensional Euclidean space thenperform the support-vector-machine classification on the vectorsζ(aj) j = 1 m Then the conditions
ζ(aj)Txminus β ge 1 when yj = 1
ζ(aj)Txminus β le minus1 when yj = minus1
lead to
Hζλ(xβ) =1
m
m983131
j=1
max1minus yj(ζ(aj)Txminus β) 0+ λ
2983042x98304222
When transformed back to Rn the surface a|ζ(a)Txminus β = 0 isnonlinear and possibly disconnected and is often a much morepowerful classifier than the hyperplanes resulting from minimizingHλ(xβ)
Optimization Formulations Lecture 5 March 18 - 25 2020 22 33
The problem of minimizing Hζλ(xβ) can be written as a convexquadratic program (having a convex quadratic objective and linearconstraints) by introducing variables sj j = 1 m to represent theresidual terms Then
minxβs
1
m1Ts+
λ
2983042x98304222
subject to
sj ge 1minus yj(ζ(aj)Txminus β) sj ge 0 j = 1 m
where 1 =9830451 1 middot middot middot 1
983046T isin Rm
The dual of this convex quadratic program is another convexquadratic program One can obtain the dual problem via theresult in Convex Optimization sect524 (Lagrange dual of QCQP)and the result in First-Order Methods in Optimization sect447(Convex Quadratic Functions)
The dual problem, in $m$ variables:

$$\min_{z \in \mathbb{R}^m} \; \frac{1}{2} z^T Q z - \mathbf{1}^T z \quad \text{subject to} \quad 0 \le z \le \frac{1}{m\lambda} \mathbf{1}, \;\; y^T z = 0,$$

where $Q_{kl} = y_k y_l \zeta(a_k)^T \zeta(a_l)$.

Interestingly, this problem can be formulated and solved without any explicit knowledge or definition of the mapping $\zeta$. We need only a technique to define the elements of $Q$. This can be done with the use of a kernel function $K : \mathbb{R}^n \times \mathbb{R}^n \mapsto \mathbb{R}$, where $K(a_k, a_l)$ replaces $\zeta(a_k)^T \zeta(a_l)$. This is the so-called "kernel trick".
A particularly popular choice of kernel is the Gaussian kernel

$$K(a_k, a_l) = \exp\left(-\frac{\|a_k - a_l\|^2}{2\sigma}\right),$$

where $\sigma$ is a positive parameter.
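A short NumPy sketch, on synthetic data (assumed, not from the lecture), of the kernel trick in action: the dual matrix $Q_{kl} = y_k y_l K(a_k, a_l)$ is built directly from the Gaussian kernel, without ever forming the feature map $\zeta$.

```python
import numpy as np

# Synthetic data (assumed, for illustration only).
rng = np.random.default_rng(1)
m, n, sigma = 6, 4, 1.0
A = rng.standard_normal((m, n))          # rows are feature vectors a_k
y = rng.choice([-1.0, 1.0], size=m)

# Pairwise squared distances ||a_k - a_l||^2 via broadcasting.
sq_dists = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * sigma))    # Gaussian kernel matrix
Q = np.outer(y, y) * K                   # dual matrix Q_{kl} = y_k y_l K(a_k, a_l)

assert np.allclose(Q, Q.T)               # Q is symmetric
assert np.allclose(np.diag(K), 1.0)      # K(a, a) = 1 for the Gaussian kernel
```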
10 Logistic regression
We seek an "odds function" $p \in (0, 1)$, parametrized by a vector $x \in \mathbb{R}^n$ as follows:

$$p(a; x) = (1 + \exp(a^T x))^{-1},$$

and aim to choose the parameter $x$ so that

$$p(a_j; x) \approx 1 \text{ when } y_j = 1, \qquad p(a_j; x) \approx 0 \text{ when } y_j = -1.$$
The optimal value of $x$ can be found by maximizing a log-likelihood function:

$$L(x) = \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right).$$
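A minimal NumPy sketch, on synthetic data (assumed, not from the lecture), evaluating $L(x)$ with the odds function defined above; note this lecture's convention $p(a; x) = (1 + \exp(a^T x))^{-1}$.

```python
import numpy as np

# Synthetic data (assumed, for illustration only).
rng = np.random.default_rng(2)
m, n = 10, 3
A = rng.standard_normal((m, n))          # rows are feature vectors a_j
y = rng.choice([-1.0, 1.0], size=m)      # labels y_j in {-1, +1}
x = rng.standard_normal(n)

p = 1.0 / (1.0 + np.exp(A @ x))          # odds function p(a_j; x)

# L(x): log p on the y_j = 1 terms, log(1 - p) on the y_j = -1 terms.
L = np.mean(np.where(y == 1, np.log(p), np.log(1.0 - p)))

assert L < 0.0   # each term is the log of a probability strictly inside (0, 1)
```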
We can perform feature selection using this model by introducing a regularizer:

$$\max_x \; \frac{1}{m} \left( \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right) - \lambda \|x\|_1,$$

where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
Multiclass (or multinomial) logistic regression arises when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:

$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^M \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$

where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and for all $k$ we have

$$p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^M p_k(a; X) = 1.$$

If one of the inner products $\{a^T x_{[\ell]}\}_{\ell=1}^M$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then

$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \text{ for } \ell \ne k.$$
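A NumPy sketch, on synthetic data (assumed, not from the lecture), verifying both properties: the softmax odds lie in $(0,1)$ and sum to 1, and they concentrate on class $k$ once $a^T x_{[k]}$ dominates the other inner products.

```python
import numpy as np

# Synthetic data (assumed, for illustration only).
rng = np.random.default_rng(3)
n, M = 4, 3
X = rng.standard_normal((M, n))          # row k holds x_[k]
a = rng.standard_normal(n)

z = X @ a                                # inner products a^T x_[k]
p = np.exp(z) / np.sum(np.exp(z))        # softmax odds p_k(a; X)
assert np.all((p > 0) & (p < 1)) and np.isclose(p.sum(), 1.0)

# Make a^T x_[0] dominate; subtract max(z) for numerical stability.
z[0] += 50.0
p_dom = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
assert p_dom[0] > 0.999                  # p_0 ≈ 1, the rest ≈ 0
```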
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:

$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$
We seek to define the vectors $x_{[k]}$ so that

$$p_k(a_j; X) \approx 1 \text{ when } y_{jk} = 1, \qquad p_k(a_j; X) \approx 0 \text{ when } y_{jk} = 0.$$
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

$$L(X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} (x_{[\ell]}^T a_j) - \log\left( \sum_{\ell=1}^M \exp(x_{[\ell]}^T a_j) \right) \right].$$
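A NumPy sketch, on synthetic data (assumed, not from the lecture), evaluating $L(X)$ in the log-sum-exp form above and checking it against the equivalent form: the average log-softmax probability of each true class.

```python
import numpy as np

# Synthetic data (assumed, for illustration only).
rng = np.random.default_rng(4)
m, n, M = 12, 4, 3
A = rng.standard_normal((m, n))                  # rows are feature vectors a_j
X = rng.standard_normal((M, n))                  # row k holds x_[k]
labels = rng.integers(0, M, size=m)
Y = np.eye(M)[labels]                            # one-hot label matrix y_{jk}

Z = A @ X.T                                      # Z[j, k] = x_[k]^T a_j
L = np.mean(np.sum(Y * Z, axis=1) - np.log(np.sum(np.exp(Z), axis=1)))

# Equivalent form: average log of the softmax probability of the true class.
P = np.exp(Z) / np.sum(np.exp(Z), axis=1, keepdims=True)
assert np.isclose(L, np.mean(np.log(P[np.arange(m), labels])))
```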
Group-sparse regularization terms can be added to this objective to select a common subset of features across all $M$ classes.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.

The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: "Deep Learning: An Introduction for Applied Mathematicians", Catherine F. Higham and Desmond J. Higham, SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
Deep neural network showing connections between adjacent layers
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l - 1$ to the input vector $a_j^l$ at layer $l$, is

$$a_j^l = \sigma(W_l a_j^{l-1} + g_l), \quad l = 1, 2, \dots, D,$$

where $W_l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$ and $g_l$ is a vector of length $|a_j^l|$; $\sigma$ is a componentwise nonlinear transformation, and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.

Each arc in the figure represents one of the elements of a transformation matrix $W_l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
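A minimal NumPy sketch, with assumed layer widths, of the forward pass $a_j^l = \sigma(W_l a_j^{l-1} + g_l)$ through $D = 2$ hidden layers, using ReLU as the componentwise nonlinearity.

```python
import numpy as np

# Assumed layer widths (for illustration only): |a^0|, |a^1|, |a^2|.
rng = np.random.default_rng(5)
layer_sizes = [4, 5, 3]
Ws = [rng.standard_normal((layer_sizes[l + 1], layer_sizes[l]))
      for l in range(len(layer_sizes) - 1)]     # W_l: |a^l| x |a^{l-1}|
gs = [rng.standard_normal(layer_sizes[l + 1])
      for l in range(len(layer_sizes) - 1)]     # g_l: length |a^l|

a = rng.standard_normal(layer_sizes[0])          # raw input a_j^0
for W, g in zip(Ws, gs):
    a = np.maximum(W @ a + g, 0.0)               # ReLU applied componentwise

# a now holds a_j^D, the vector at the topmost hidden layer.
assert a.shape == (layer_sizes[-1],) and np.all(a >= 0.0)
```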
Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:

(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;

(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;

(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
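A tiny NumPy sketch of the first two (deterministic) activations acting componentwise; the Bernoulli unit is a random variant whose success probability is the logistic value.

```python
import numpy as np

t = np.array([-2.0, 0.0, 2.0])               # sample input components

logistic = 1.0 / (1.0 + np.exp(-t))          # (1) logistic function
relu = np.maximum(t, 0.0)                    # (2) ReLU

assert np.isclose(logistic[1], 0.5)          # logistic(0) = 1/2
assert np.allclose(relu, [0.0, 0.0, 2.0])    # ReLU zeroes negative inputs
```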
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.

The parameters in this neural network are the matrix-vector pairs $(W_l, g_l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.

We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation $w$ for the hidden layer transformations, that is,

$$w = (W_1, g_1, W_2, g_2, \dots, W_D, g_D),$$

and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:

$$L(w, X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} (x_{[\ell]}^T a_j^D(w)) - \log\left( \sum_{\ell=1}^M \exp(x_{[\ell]}^T a_j^D(w)) \right) \right].$$
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.

The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.

The total number of parameters in $(w, X)$ is usually very large.
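A quick sketch, with hypothetical layer widths (not from the lecture), of counting the parameters in $(w, X)$: each pair $(W_l, g_l)$ contributes $|a^l| \cdot (|a^{l-1}| + 1)$ parameters, and the top-stage logistic regression contributes $M \cdot |a^D|$.

```python
# Hypothetical widths for a small network: |a^0|, |a^1|, |a^2|.
layer_sizes = [784, 128, 64]
M = 10                          # number of classes

# Hidden-layer parameters: each W_l plus its shift vector g_l.
n_hidden = sum(layer_sizes[l + 1] * (layer_sizes[l] + 1)
               for l in range(len(layer_sizes) - 1))
# Top-stage parameters: x_[1], ..., x_[M], each of length |a^D|.
n_top = M * layer_sizes[-1]

assert n_hidden + n_top == 109_376   # already large for this tiny network
```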
We aim to choose all these parameters so that the network does agood job on classifying the training data correctly
Optimization Formulations Lecture 5 March 18 - 25 2020 32 33
Using the notation w for the hidden layer transformations that is
w = (W1g1W2g2 WDgD)
and defining X = x[k]|k = 1 2 M we can write the lossfunction for deep learning as follows
L(wX) =1
m
m983131
j=1
983077M983131
ℓ=1
yjℓ(xT[ℓ]a
Dj (w))minus log
983075M983131
ℓ=1
exp(xT[ℓ]a
Dj (w))
983076983078
We can view multiclass logistic regression as a special case of deeplearning in which there are no hidden layers so that D = 0 w isnull and aDj = aj j = 1 2 m
The ldquolandscaperdquo of L is complex with the global maximizer beingexceedingly difficult to find
The total number of parameters in (wX) is usually very large
Optimization Formulations Lecture 5 March 18 - 25 2020 33 33
10 Logistic regression
We seek an "odds function" $p \in (0,1)$, parametrized by a vector $x \in \mathbb{R}^n$, as follows:
$$p(a; x) = (1 + \exp(a^T x))^{-1},$$
and aim to choose the parameter $x$ so that
$$p(a_j; x) \approx 1 \text{ when } y_j = 1, \qquad p(a_j; x) \approx 0 \text{ when } y_j = -1.$$
The optimal value of $x$ can be found by maximizing a log-likelihood function:
$$L(x) = \frac{1}{m} \left[ \sum_{j:\, y_j = -1} \log(1 - p(a_j; x)) + \sum_{j:\, y_j = 1} \log p(a_j; x) \right]$$
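As a minimal numerical sketch (not from the lecture; the toy data and function names are invented), the odds function and log-likelihood above can be evaluated directly, using the convention $p(a; x) = (1 + \exp(a^T x))^{-1}$ with labels $y_j \in \{-1, 1\}$:

```python
import numpy as np

def p(a, x):
    # Odds function p(a; x) = 1 / (1 + exp(a^T x)), as defined above.
    # Works row-wise when `a` is an m x n matrix of feature vectors.
    return 1.0 / (1.0 + np.exp(a @ x))

def log_likelihood(A, y, x):
    # L(x) = (1/m) [ sum_{j: y_j=-1} log(1 - p(a_j; x))
    #              + sum_{j: y_j=+1} log p(a_j; x) ]
    probs = p(A, x)
    return np.mean(np.where(y == 1, np.log(probs), np.log(1.0 - probs)))

# Toy data: m = 4 samples, n = 2 features (invented for illustration).
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, -1, 1, -1])
x = np.zeros(2)
# With x = 0, p(a_j; x) = 1/2 for every sample, so L(x) = log(1/2).
assert np.isclose(log_likelihood(A, y, x), np.log(0.5))
```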
We can perform feature selection using this model by introducing a regularizer:
$$\max_x \; \frac{1}{m} \left[ \sum_{j:\, y_j = -1} \log(1 - p(a_j; x)) + \sum_{j:\, y_j = 1} \log p(a_j; x) \right] - \lambda \|x\|_1,$$
where $\lambda > 0$ is a regularization parameter.
The regularization term $\lambda \|x\|_1$ has the effect of producing a solution in which few components of $x$ are nonzero, making it possible to evaluate $p(a; x)$ by knowing only those components of $a$ that correspond to the nonzeros in $x$.
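A hedged sketch of this regularized objective (toy data and the value of $\lambda$ are invented; in practice one would maximize it with a proximal-gradient or similar method rather than merely evaluate it):

```python
import numpy as np

def regularized_objective(A, y, x, lam):
    # Log-likelihood L(x) minus the sparsity penalty lam * ||x||_1.
    probs = 1.0 / (1.0 + np.exp(A @ x))          # p(a_j; x), row-wise
    loglik = np.mean(np.where(y == 1, np.log(probs), np.log(1.0 - probs)))
    return loglik - lam * np.sum(np.abs(x))

# Toy data (invented): 2 samples, 2 features.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
x = np.array([0.3, -0.2])
# The penalty can only lower the objective, and it grows with ||x||_1;
# this is what pushes the maximizer toward a sparse x.
assert regularized_objective(A, y, x, lam=0.1) <= regularized_objective(A, y, x, lam=0.0)
```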
Multiclass (or multinomial) logistic regression arises when the data vectors $a_j$ belong to more than two classes. Assume $M$ classes in total. We need a distinct odds function $p_k \in (0, 1)$ for each class.
These $M$ functions are parametrized by vectors $x_{[k]} \in \mathbb{R}^n$, $k = 1, \dots, M$, defined as follows:
$$p_k(a; X) = \frac{\exp(a^T x_{[k]})}{\sum_{\ell=1}^M \exp(a^T x_{[\ell]})}, \quad k = 1, \dots, M,$$
where $X = \{x_{[k]} \mid k = 1, \dots, M\}$.
Note that for all $a$ and for all $k$ we have
$$p_k(a; X) \in (0, 1), \qquad \sum_{k=1}^M p_k(a; X) = 1.$$
If one of these inner products $\{a^T x_{[\ell]}\}_{\ell=1}^M$ dominates the others, that is, $a^T x_{[k]} \gg a^T x_{[\ell]}$ for all $\ell \ne k$, then
$$p_k(a; X) \approx 1 \quad \text{and} \quad p_\ell(a; X) \approx 0 \text{ for } \ell \ne k.$$
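A small sketch of these two properties (the helper name and toy numbers are invented; the max-subtraction trick is a standard numerical-stability device, not part of the formula above):

```python
import numpy as np

def softmax_probs(a, X):
    # p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l]).
    # X stores one column x_[k] per class; subtracting the max inner
    # product before exponentiating avoids overflow without changing p.
    z = X.T @ a
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

a = np.array([1.0, 2.0])
X = np.array([[3.0, 0.0, -1.0],
              [3.0, 0.0, -1.0]])   # class 0's inner product dominates
probs = softmax_probs(a, X)
assert np.isclose(probs.sum(), 1.0)   # the p_k sum to 1
assert probs[0] > 0.99                # dominant class gets p_k close to 1
```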
In the setting of multiclass logistic regression, the labels $y_j$ are vectors in $\mathbb{R}^M$ whose elements are defined as follows:
$$y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases}$$
We seek to define the vectors $x_{[k]}$ so that
$$p_k(a_j; X) \approx 1 \text{ when } y_{jk} = 1, \qquad p_k(a_j; X) \approx 0 \text{ when } y_{jk} = 0.$$
The problem of finding values of $x_{[k]}$ that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:
$$L(X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} \left( x_{[\ell]}^T a_j \right) - \log \left( \sum_{\ell=1}^M \exp\left( x_{[\ell]}^T a_j \right) \right) \right]$$
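This log-likelihood can be evaluated in a few lines (a sketch with invented toy data; $X$ is stored as an $n \times M$ matrix with one column per class):

```python
import numpy as np

def multiclass_log_likelihood(A, Y, X):
    # L(X) = (1/m) sum_j [ sum_l y_jl (x_[l]^T a_j)
    #                      - log sum_l exp(x_[l]^T a_j) ]
    Z = A @ X                              # m x M inner products x_[l]^T a_j
    lse = np.log(np.exp(Z).sum(axis=1))    # log-sum-exp term per sample
    return np.mean((Y * Z).sum(axis=1) - lse)

# Toy problem: m = 2 samples, n = 2 features, M = 3 classes (invented).
A = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1, 0, 0], [0, 1, 0]])       # one-hot label vectors y_j
X = np.zeros((2, 3))
# With X = 0 every class has probability 1/3, so L(X) = -log 3.
assert np.isclose(multiclass_log_likelihood(A, Y, X), -np.log(3.0))
```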
Group-sparse regularization terms can be added to perform feature selection in this setting, selecting a common subset of features across all $M$ classes.
11 Deep learning
Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector $a$ into one of $M$ possible classes, where $M \ge 2$ is large in some key applications.
The difference is that the data vector $a$ undergoes a series of structured transformations before being passed through a multiclass logistic regression classifier of the type described in the previous section.
A wonderful reference: Deep Learning: An Introduction for Applied Mathematicians, Catherine F. Higham and Desmond J. Higham, SIAM Review, 2019.
The data vector $a_j$ enters at the bottom of the network, each node in the bottom layer corresponding to one component of $a_j$.
[Figure: Deep neural network, showing connections between adjacent layers.]
The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next.
A typical form of this transformation, which converts the vector $a_j^{l-1}$ at layer $l-1$ to the vector $a_j^l$ at layer $l$, is
$$a_j^l = \sigma(W^l a_j^{l-1} + g^l), \quad l = 1, 2, \dots, D,$$
where $W^l$ is a matrix of dimension $|a_j^l| \times |a_j^{l-1}|$, $g^l$ is a vector of length $|a_j^l|$, $\sigma$ is a componentwise nonlinear transformation, and $D$ is the number of hidden layers, defined as the layers situated strictly between the bottom and top layers.
Each arc in the figure represents one of the elements of a transformation matrix $W^l$. Define $a_j^0$ to be the "raw" input vector $a_j$, and let $a_j^D$ be the vector formed by the nodes at the topmost hidden layer.
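The layer recursion above can be sketched as a short loop (a toy example; the layer widths, seed, and function names are invented, and ReLU is used as the choice of $\sigma$):

```python
import numpy as np

def relu(t):
    # One typical componentwise nonlinearity sigma: t -> max(t, 0).
    return np.maximum(t, 0.0)

def forward(a, Ws, gs, sigma=relu):
    # Apply a^l = sigma(W^l a^{l-1} + g^l) for l = 1, ..., D,
    # starting from the raw input a^0 = a.
    for W, g in zip(Ws, gs):
        a = sigma(W @ a + g)
    return a

rng = np.random.default_rng(0)
# D = 2 hidden layers with widths 4 and 3, input dimension 5 (all invented).
Ws = [rng.standard_normal((4, 5)), rng.standard_normal((3, 4))]
gs = [np.zeros(4), np.zeros(3)]
a0 = rng.standard_normal(5)
aD = forward(a0, Ws, gs)
assert aD.shape == (3,)           # a^D lives in the topmost hidden layer
assert np.all(aD >= 0.0)          # ReLU outputs are nonnegative
```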
Typical forms of the function $\sigma$ include the following, acting identically on each component $t \in \mathbb{R}$ of its input vector:
(1) Logistic function: $t \mapsto 1/(1 + e^{-t})$;
(2) Rectified Linear Unit (ReLU): $t \mapsto \max(t, 0)$;
(3) Bernoulli: a random function that outputs 1 with probability $1/(1 + e^{-t})$ and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class.
The parameters in this neural network are the matrix-vector pairs $(W^l, g^l)$, $l = 1, 2, \dots, D$, that transform the input vector $a_j$ into its form $a_j^D$ at the topmost hidden layer, together with the parameters $X$ of the multiclass logistic regression operation that takes place at the very top stage.
We aim to choose all these parameters so that the network does a good job of classifying the training data correctly.
Using the notation $w$ for the hidden layer transformations, that is,
$$w = (W^1, g^1, W^2, g^2, \dots, W^D, g^D),$$
and defining $X = \{x_{[k]} \mid k = 1, 2, \dots, M\}$, we can write the loss function for deep learning as follows:
$$L(w, X) = \frac{1}{m} \sum_{j=1}^m \left[ \sum_{\ell=1}^M y_{j\ell} \left( x_{[\ell]}^T a_j^D(w) \right) - \log \left( \sum_{\ell=1}^M \exp\left( x_{[\ell]}^T a_j^D(w) \right) \right) \right]$$
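A self-contained sketch of $L(w, X)$: each $a_j$ is pushed through the hidden layers (ReLU chosen as $\sigma$; all data, shapes, and names are invented), and the multiclass log-likelihood is applied at the top. Passing empty layer lists gives the $D = 0$ case, which recovers plain multiclass logistic regression:

```python
import numpy as np

def deep_loss(A, Y, Ws, gs, X):
    # L(w, X): compute a_j^D(w) for each sample, then apply the
    # multiclass logistic regression log-likelihood at the top layer.
    top = []
    for a in A:
        for W, g in zip(Ws, gs):
            a = np.maximum(W @ a + g, 0.0)   # ReLU as the choice of sigma
        top.append(a)
    Z = np.array(top) @ X                    # inner products x_[l]^T a_j^D(w)
    lse = np.log(np.exp(Z).sum(axis=1))
    return np.mean((Y * Z).sum(axis=1) - lse)

# Toy data (invented): m = 2 samples, n = 2 features, M = 2 classes.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1, 0], [0, 1]])
X = np.zeros((2, 2))
# With no hidden layers (D = 0, w null) and X = 0, each class has
# probability 1/2, so L(w, X) = -log 2.
assert np.isclose(deep_loss(A, Y, [], [], X), -np.log(2.0))
```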
We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \dots, m$.
The "landscape" of $L$ is complex, with the global maximizer being exceedingly difficult to find.
The total number of parameters in $(w, X)$ is usually very large.