Introduction to Optimization
Constrained Optimization
Marc Toussaint, U Stuttgart
Constrained Optimization
• General constrained optimization problem: Let x ∈ R^n, f : R^n → R, g : R^n → R^m, h : R^n → R^l; find

    min_x f(x)   s.t.   g(x) ≤ 0,  h(x) = 0
In this lecture I’ll focus (mostly) on inequality constraints g!
• Applications
  – Find an optimal, non-colliding trajectory in robotics
  – Optimize the shape of a turbine blade, s.t. it must not break
  – Optimize the train schedule, s.t. consistency/feasibility
2/39
• Try to somehow transform the constrained problem into
  – a series of unconstrained problems
  – a single but larger unconstrained problem
  – another constrained problem, hopefully simpler (dual, convex)
3/39
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
4/39
Penalties & Barriers
• Convention:
  – A barrier is really ∞ for g(x) > 0
  – A penalty is zero for g(x) ≤ 0 and increases with g(x) > 0
5/39
Log barrier method or Interior Point method
6/39
Log barrier method
• Instead of

    min_x f(x)   s.t.   g(x) ≤ 0

  we address

    min_x f(x) − µ ∑_i log(−g_i(x))
7/39
Log barrier
• For µ → 0, the barrier −µ log(−g) converges to ∞·[g > 0]
  (Notation: [boolean expression] ∈ {0, 1})

• The barrier’s gradient ∇[−log(−g)] = ∇g/(−g) pushes away from the constraint

• Eventually we want a very small µ – but choosing µ small makes the barrier very non-smooth, which is bad for gradient and 2nd-order methods
8/39
Central Path
• Every µ defines a different optimal x*(µ)

    x*(µ) = argmin_x f(x) − µ ∑_i log(−g_i(x))

• Each point on the path can be understood as the optimal compromise of minimizing f(x) and a repelling force of the constraints. (Which corresponds to dual variables λ*(µ).)
9/39
Log barrier method
Input: initial x ∈ R^n, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) − µ ∑_i log(−g_i(x)) with tolerance 10θ
4:   decrease µ ← µ/10
5: until |∆x| < θ and ∀i : g_i(x) < ε

Note: See Boyd & Vandenberghe for stopping criteria based on f precision (duality gap) and a better choice of initial µ (which is called t there). A Python sketch of this loop is given below.
10/39
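To make the pseudocode concrete, here is a minimal, illustrative Python sketch of the outer loop, assuming numpy/scipy are available and using scipy.optimize.minimize as a generic inner solver; the functions f, g, the start x0 and the toy problem are placeholders of my own, not part of the lecture.

import numpy as np
from scipy.optimize import minimize

def log_barrier(f, g, x0, theta=1e-4, eps=1e-4, iters=50):
    # f(x) -> float, g(x) -> array of constraint values (convention g_i(x) <= 0);
    # x0 must be strictly feasible so that the barrier is finite at the start.
    x, mu = np.asarray(x0, dtype=float), 1.0
    for _ in range(iters):
        def barrier_objective(z):
            gz = np.asarray(g(z))
            if np.any(gz >= 0):                      # outside the barrier's domain
                return np.inf
            return f(z) - mu * np.sum(np.log(-gz))
        res = minimize(barrier_objective, x, method='Nelder-Mead',
                       options={'xatol': 1e-8, 'fatol': 1e-12})
        dx, x = np.linalg.norm(res.x - x), res.x
        mu /= 10.0                                   # decrease mu as in the pseudocode
        if dx < theta and np.all(np.asarray(g(x)) < eps):
            break
    return x

# Toy problem: min (x1-2)^2 + (x2-2)^2  s.t.  x1 + x2 <= 1  (optimum approx. [0.5, 0.5])
f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2
g = lambda x: np.array([x[0] + x[1] - 1])
print(log_barrier(f, g, x0=[0.0, 0.0]))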
We will revisit the log barrier method later, once we have introduced the Lagrangian...
11/39
Squared Penalty Method
12/39
Squared Penalty Method
• This is perhaps the simplest approach
• Instead of

    min_x f(x)   s.t.   g(x) ≤ 0

  we address

    min_x f(x) + µ ∑_{i=1}^m [g_i(x) > 0] g_i(x)²
Input: initial x ∈ R^n, functions f(x), g(x), ∇f(x), ∇g(x), tol. θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) + µ ∑_i [g_i(x) > 0] g_i(x)² with tolerance 10θ
4:   µ ← 10µ
5: until |∆x| < θ and ∀i : g_i(x) < ε

(A Python sketch of this loop is given below.)
13/39
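A minimal, illustrative Python sketch of the squared penalty loop, under the same assumptions as before (numpy/scipy available; placeholder functions f, g and a toy problem of my own):

import numpy as np
from scipy.optimize import minimize

def squared_penalty(f, g, x0, theta=1e-4, eps=1e-4, iters=50):
    # f(x) -> float, g(x) -> array with convention g_i(x) <= 0; x0 need not be feasible.
    x, mu = np.asarray(x0, dtype=float), 1.0
    for _ in range(iters):
        def penalized(z):
            gz = np.asarray(g(z))
            return f(z) + mu * np.sum(np.where(gz > 0.0, gz, 0.0)**2)
        res = minimize(penalized, x, method='Nelder-Mead',
                       options={'xatol': 1e-8, 'fatol': 1e-12})
        dx, x = np.linalg.norm(res.x - x), res.x
        mu *= 10.0                                   # increase the penalty weight
        if dx < theta and np.all(np.asarray(g(x)) < eps):
            break
    return x

# Toy problem: min (x1-2)^2 + (x2-2)^2  s.t.  x1 + x2 <= 1; the iterates approach
# the constraint from the infeasible side, illustrating the residual violation.
f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2
g = lambda x: np.array([x[0] + x[1] - 1])
print(squared_penalty(f, g, x0=[0.0, 0.0]))          # approx. [0.5, 0.5]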
Squared Penalty Method
• The method is ok, but will always lead to some violation of constraints
• A better idea would be to add an out-pushing gradient/force −∇g_i(x) for every constraint g_i(x) > 0 that is violated.
  Ideally, the out-pushing gradient mixes with −∇f(x) exactly such that the result becomes tangential to the constraint!
  This idea leads to the augmented Lagrangian approach.
14/39
Augmented Lagrangian
(We can introduce this in a self-contained manner, without yet defining the “Lagrangian”)
15/39
Augmented Lagrangian (equality constraint)

• We first consider an equality constraint before addressing inequalities

• Instead of

    min_x f(x)   s.t.   h(x) = 0

  we address

    min_x f(x) + µ ∑_{i=1}^m h_i(x)² + ∑_{i=1}^m λ_i h_i(x)          (1)

• Note:
  – The gradient ∇h_i(x) is always orthogonal to the constraint
  – By tuning λ_i we can induce a “virtual gradient” λ_i ∇h_i(x)
  – The term µ ∑_{i=1}^m h_i(x)² penalizes as before

• Here is the trick:
  – First minimize (1) for some µ and λ_i
  – This will in general lead to a (slight) penalty µ ∑_{i=1}^m h_i(x)²
  – For the next iteration, choose λ_i to generate exactly the gradient that was previously generated by the penalty

16/39
• Optimality condition after an iteration:

    x′ = argmin_x f(x) + µ ∑_{i=1}^m h_i(x)² + ∑_{i=1}^m λ_i h_i(x)

    ⇒  0 = ∇f(x′) + µ ∑_{i=1}^m 2 h_i(x′) ∇h_i(x′) + ∑_{i=1}^m λ_i ∇h_i(x′)

• Update the λ’s for the next iteration:

    ∑_{i=1}^m λ_i^new ∇h_i(x′) = µ ∑_{i=1}^m 2 h_i(x′) ∇h_i(x′) + ∑_{i=1}^m λ_i^old ∇h_i(x′)

    λ_i^new = λ_i^old + 2µ h_i(x′)
Input: initial x ∈ R^n, functions f(x), h(x), ∇f(x), ∇h(x), tol. θ, ε
Output: x
1: initialize µ = 1, λ_i = 0
2: repeat
3:   find x ← argmin_x f(x) + µ ∑_i h_i(x)² + ∑_i λ_i h_i(x)
4:   ∀i : λ_i ← λ_i + 2µ h_i(x)
5: until |∆x| < θ and ∀i : |h_i(x)| < ε

(A Python sketch of this loop is given below.)
17/39
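A minimal, illustrative Python sketch of the augmented Lagrangian loop for equality constraints, assuming numpy/scipy; f, h, x0 and the toy problem are placeholders of my own:

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian_eq(f, h, x0, theta=1e-4, eps=1e-4, iters=50):
    # f(x) -> float, h(x) -> array of equality constraint values
    x = np.asarray(x0, dtype=float)
    mu, lam = 1.0, np.zeros(len(np.asarray(h(x))))
    for _ in range(iters):
        def auglag(z):
            hz = np.asarray(h(z))
            return f(z) + mu * np.sum(hz**2) + np.dot(lam, hz)
        res = minimize(auglag, x, method='Nelder-Mead',
                       options={'xatol': 1e-8, 'fatol': 1e-12})
        dx, x = np.linalg.norm(res.x - x), res.x
        lam = lam + 2.0 * mu * np.asarray(h(x))      # the multiplier update derived above
        if dx < theta and np.all(np.abs(np.asarray(h(x))) < eps):
            break
    return x, lam

# Toy problem (same as the Lagrangian example later): min x1^2 + x2^2  s.t.  x1 + x2 = 1
f = lambda x: x[0]**2 + x[1]**2
h = lambda x: np.array([x[0] + x[1] - 1])
print(augmented_lagrangian_eq(f, h, x0=[0.0, 0.0]))  # approx. x = [0.5, 0.5], lambda = [-1]

Note that µ stays fixed here; the constraint is driven to zero purely by adapting λ, which is the point of the method.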
This adaptation of λ_i is really elegant:

• We do not have to take the penalty limit µ → ∞ but can still have exact constraints

• If f and h were linear (∇f and ∇h_i constant), the updated λ_i is exactly right: in the next iteration we would exactly hit the constraint (by construction)

• The penalty term is like a measuring device for the necessary “virtual gradient”, which is generated by the augmentation term in the next iteration

• The λ_i are very meaningful: they give the force/gradient that a constraint exerts on the solution
18/39
Augmented Lagrangian (inequality constraint)

• Instead of

    min_x f(x)   s.t.   g(x) ≤ 0

  we address

    min_x f(x) + µ ∑_{i=1}^m [g_i(x) ≥ 0 ∨ λ_i > 0] g_i(x)² + ∑_{i=1}^m λ_i g_i(x)

• A constraint is either active or inactive:
  – When active (g_i(x) ≥ 0 ∨ λ_i > 0) we aim for equality g_i(x) = 0
  – When inactive (g_i(x) < 0 ∧ λ_i = 0) we don’t penalize/augment
  – λ_i are zero or positive, but never negative
Input: initial x ∈ R^n, functions f(x), g(x), ∇f(x), ∇g(x), tol. θ, ε
Output: x
1: initialize µ = 1, λ_i = 0
2: repeat
3:   find x ← argmin_x f(x) + µ ∑_i [g_i(x) ≥ 0 ∨ λ_i > 0] g_i(x)² + ∑_i λ_i g_i(x)
4:   ∀i : λ_i ← max(λ_i + 2µ g_i(x), 0)
5: until |∆x| < θ and ∀i : g_i(x) < ε

(A Python sketch of this loop is given below.)
19/39
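A minimal, illustrative Python sketch of the inequality version with the clamped multiplier update; assumptions as before (numpy/scipy; placeholder f, g, x0 and a toy problem of my own):

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian_ineq(f, g, x0, theta=1e-4, eps=1e-4, iters=50):
    # f(x) -> float, g(x) -> array with convention g_i(x) <= 0
    x = np.asarray(x0, dtype=float)
    mu, lam = 1.0, np.zeros(len(np.asarray(g(x))))
    for _ in range(iters):
        def auglag(z):
            gz = np.asarray(g(z))
            active = (gz >= 0) | (lam > 0)           # activity rule from the slide
            return f(z) + mu * np.sum(active * gz**2) + np.dot(lam, gz)
        res = minimize(auglag, x, method='Nelder-Mead',
                       options={'xatol': 1e-8, 'fatol': 1e-12})
        dx, x = np.linalg.norm(res.x - x), res.x
        lam = np.maximum(lam + 2.0 * mu * np.asarray(g(x)), 0.0)   # clamped update
        if dx < theta and np.all(np.asarray(g(x)) < eps):
            break
    return x, lam

# Toy problem: min (x1-2)^2 + (x2-2)^2  s.t.  x1 + x2 <= 1
f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2
g = lambda x: np.array([x[0] + x[1] - 1])
print(augmented_lagrangian_ineq(f, g, x0=[0.0, 0.0]))  # approx. x = [0.5, 0.5], lambda = [3]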
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
20/39
The Lagrangian
21/39
The Lagrangian
• Given a constrained problem

    min_x f(x)   s.t.   g(x) ≤ 0

  we define the Lagrangian as

    L(x, λ) = f(x) + ∑_{i=1}^m λ_i g_i(x)

• The λ_i ≥ 0 are called dual variables or Lagrange multipliers
22/39
What’s the point of this definition?
• The Lagrangian is useful to compute optima analytically, on paper – that’s why physicists learn it early on

• The Lagrangian implies the KKT conditions of optimality

• Optima are necessarily at saddle points of the Lagrangian

• The Lagrangian implies a dual problem, which is sometimes easier to solve than the primal
23/39
Example: Some calculus using the Lagrangian
• For x ∈ R², what is

    min_x x²   s.t.   x_1 + x_2 = 1     (where x² denotes xᵀx)

• Solution:

    L(x, λ) = x² + λ(x_1 + x_2 − 1)

    0 = ∇_x L(x, λ) = 2x + λ (1, 1)ᵀ   ⇒   x_1 = x_2 = −λ/2

    0 = ∇_λ L(x, λ) = x_1 + x_2 − 1 = −λ/2 − λ/2 − 1   ⇒   λ = −1

    ⇒   x_1 = x_2 = 1/2
24/39
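A quick check of this example with sympy (assuming sympy is installed); setting the gradient of the Lagrangian to zero reproduces the solution:

import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
L = x1**2 + x2**2 + lam * (x1 + x2 - 1)          # Lagrangian of the example
stationarity = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(stationarity, [x1, x2, lam]))     # {x1: 1/2, x2: 1/2, lam: -1}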
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇g_i(x)
25/39
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇g_i(x)

• Formally: for optimal x: ∇f(x) ∈ span{∇g_i(x)}

• Or: for optimal x there must exist λ_i such that −∇f(x) = −[∑_i (−λ_i ∇g_i(x))]

• For optimal x it must hold (necessary condition): ∃λ s.t.

    ∇f(x) + ∑_{i=1}^m λ_i ∇g_i(x) = 0      (“force balance”)
    ∀i : g_i(x) ≤ 0                         (primal feasibility)
    ∀i : λ_i ≥ 0                            (dual feasibility)
    ∀i : λ_i g_i(x) = 0                     (complementarity)

  The last condition says that λ_i > 0 only for active constraints.
  These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality constraints)
26/39
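A small numeric sanity check of the four KKT conditions, on a toy problem of my own (assuming numpy): min (x_1−2)² + (x_2−2)² s.t. x_1 + x_2 − 1 ≤ 0, with candidate optimum x* = (1/2, 1/2) and multiplier λ* = 3:

import numpy as np

x, lam = np.array([0.5, 0.5]), np.array([3.0])
grad_f = 2 * (x - 2)                       # gradient of the cost
g = np.array([x[0] + x[1] - 1])            # constraint values (convention g <= 0)
grad_g = np.array([[1.0, 1.0]])            # one row per constraint

print(np.allclose(grad_f + lam @ grad_g, 0))   # force balance
print(np.all(g <= 0))                          # primal feasibility
print(np.all(lam >= 0))                        # dual feasibility
print(np.allclose(lam * g, 0))                 # complementarity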
The “force” & KKT view on the Lagrangian
• The first condition (“force balance”), ∃λ s.t.

    ∇f(x) + ∑_{i=1}^m λ_i ∇g_i(x) = 0

  can be equivalently expressed as: ∃λ s.t.

    ∇_x L(x, λ) = 0

• In that sense, the Lagrangian can be viewed as the “energy function” that generates (for good choice of λ) the right balance between cost and constraint gradients

• This is exactly as in the augmented Lagrangian approach, where however we have an additional (“augmented”) squared penalty that is used to tune the λ_i
27/39
Saddle point view on the Lagrangian
• Let’s briefly consider the equality case again:
    min_x f(x)   s.t.   h(x) = 0

  with the Lagrangian

    L(x, λ) = f(x) + ∑_{i=1}^m λ_i h_i(x)

• Note:

    min_x L(x, λ)  ⇒  0 = ∇_x L(x, λ)            ↔  force balance
    max_λ L(x, λ)  ⇒  0 = ∇_λ L(x, λ) = h(x)     ↔  constraint

• Optima (x*, λ*) are saddle points, where ∇_x L = 0 ensures force balance and ∇_λ L = 0 ensures the constraint

28/39
Saddle point view on the Lagrangian
• In the inequality case:

    max_{λ≥0} L(x, λ) =  f(x)   if g(x) ≤ 0
                         ∞      otherwise

    max_{λ_i≥0} L(x, λ)  ⇒   λ_i = 0                            if g_i(x) < 0
                             0 = ∇_{λ_i} L(x, λ) = g_i(x)       otherwise

  This implies either (λ_i = 0 ∧ g_i(x) < 0) or g_i(x) = 0, which is exactly equivalent to the KKT conditions

• Again, optima (x*, λ*) are saddle points, where min_x L enforces force balance and max_λ L enforces the KKT conditions
29/39
The Lagrange dual problem

• We define the Lagrange dual function as

    l(λ) = min_x L(x, λ)

• This implies two problems:

    min_x f(x)  s.t.  g(x) ≤ 0         primal problem
    max_λ l(λ)  s.t.  λ ≥ 0            dual problem

  The dual problem is convex, even if the primal is non-convex!

• Written more symmetrically:

    min_x max_{λ≥0} L(x, λ)            primal problem
    max_{λ≥0} min_x L(x, λ)            dual problem

  because max_{λ≥0} L(x, λ) ensures the constraints (previous slide).

30/39
The Lagrange dual problem
• The dual function is always a lower bound (for any λ_i ≥ 0)

    l(λ) = min_x L(x, λ) ≤ [ min_x f(x)  s.t.  g(x) ≤ 0 ]

  And consequently

    max_{λ≥0} min_x L(x, λ) ≤ min_x max_{λ≥0} L(x, λ)

• We say strong duality holds iff

    max_{λ≥0} min_x L(x, λ) = min_x max_{λ≥0} L(x, λ)

• If the primal is convex, and there exists an interior point

    ∃x : ∀i : g_i(x) < 0

  (which is called the Slater condition), then we have strong duality

31/39
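A tiny worked example of these definitions on a toy problem of my own (assuming numpy): for the primal min_x x² s.t. 1 − x ≤ 0, the Lagrangian is L(x, λ) = x² + λ(1 − x); minimizing over x gives x = λ/2 and hence l(λ) = λ − λ²/4.

import numpy as np

lam = np.linspace(0, 4, 401)
dual = lam - lam**2 / 4                      # l(lam) = min_x L(x, lam), concave in lam
print(dual.max())                            # 1.0, attained at lam = 2  (dual optimum)
print(min(x**2 for x in np.linspace(1, 3, 201)))   # 1.0, attained at x = 1 (primal optimum)
# The two optima coincide: strong duality holds here, since the primal is convex
# and e.g. x = 2 is a strictly feasible (Slater) point.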
And what about algorithms?
• So far we’ve only introduced a whole lot of formalism, and seen that the Lagrangian sort of represents the constrained problem
  – min_x L or ∇_x L = 0 is related to the force balance
  – max_λ L or ∇_λ L = 0 is related to constraints or KKT conditions
  – This implies two dual problems, min_x max_λ L and max_λ min_x L; the second (dual) is a lower bound of the first (primal)
• But what are the algorithms we can get out of this?
32/39
Algorithmic implications of the Lagrangian view
• If min_x L(x, λ) can be solved analytically, we can alternatively solve the (convex) dual problem.

• But more generally:

    Optimization problem  →  Solve the KKT conditions

  → Apply standard algos for solving an equation system r(x, λ) = 0, e.g. the Newton method:

        ∇r · (∆x, ∆λ)ᵀ = −r

  This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, they use the curvature ∇²f to estimate the right λ to push out of the constraint. We will discuss this after we’ve learnt about 2nd-order methods.
33/39
Log barrier method revisited
34/39
Log barrier method revisited
• Log barrier method: Instead of

    min_x f(x)   s.t.   g(x) ≤ 0

  we address

    min_x f(x) − µ ∑_i log(−g_i(x))
• For given µ the optimality condition is

    ∇f(x) − ∑_i (µ / g_i(x)) ∇g_i(x) = 0

  or equivalently

    ∇f(x) + ∑_i λ_i ∇g_i(x) = 0 ,    λ_i g_i(x) = −µ

  These are called modified (= approximate) KKT conditions.

35/39
Log barrier method revisited
Centering (the unconstrained minimization) in the log barrier method is equivalent to solving the modified KKT conditions.

Note also: On the central path, the duality gap is mµ:

    l(λ*(µ)) = f(x*(µ)) + ∑_i λ_i* g_i(x*(µ)) = f(x*(µ)) − mµ
36/39
Phase I: Finding a feasible initialization
37/39
Phase I: Finding a feasible initialization
• An elegant method for finding a feasible point x:

    min_{(x,s) ∈ R^{n+1}}  s    s.t.  ∀i : g_i(x) ≤ s,   s ≥ 0

  or

    min_{(x,s) ∈ R^{n+m}}  ∑_{i=1}^m s_i    s.t.  ∀i : g_i(x) ≤ s_i,   s_i ≥ 0
38/39
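A crude way to use the first formulation: for fixed x the optimal slack is s = max(0, max_i g_i(x)), so one can simply drive this quantity to zero over x. A minimal sketch of this reduction (my own reading, assuming numpy/scipy; not the lecture's recipe):

import numpy as np
from scipy.optimize import minimize

def phase_one(g, x0):
    # slack(x) = max(0, max_i g_i(x)) is zero exactly at feasible points
    slack = lambda x: max(0.0, float(np.max(np.asarray(g(x)))))
    res = minimize(slack, np.asarray(x0, dtype=float), method='Nelder-Mead')
    return res.x, slack(res.x)        # returned slack ~ 0 means x is (approximately) feasible

# Example: find a point with x_1 + x_2 - 1 <= 0 and -x_1 <= 0, starting infeasibly
g = lambda x: np.array([x[0] + x[1] - 1, -x[0]])
print(phase_one(g, x0=[5.0, 5.0]))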
General approaches
• Penalty & Barriers
  – Associate an (adaptive) penalty cost with violation of the constraint
  – Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
  – Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
  – For ‘active’ constraints, project the step direction to become tangential
  – When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods
  – Rewrite the constrained problem into an unconstrained one
  – Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)
  – Walk along the constraint boundaries
39/39