Thank you for the slides. They come mostly from the following sources.
Marc Pollefeys, U. of North Carolina
Ramani Duraiswami, U. of Maryland
Derivative of a matrix
Jacobian and Hessian
Least Squares, SVD, Pseudoinverse
• Ax = b, where A is m×n, x is n×1 and b is m×1.
• A = USVᵀ, where U is m×m, S is m×n and V is n×n.
• USVᵀx = b, so SVᵀx = Uᵀb.
• If A has rank r, then r singular values are significant.
Vᵀx = diag(σ₁⁻¹, …, σᵣ⁻¹, 0, …, 0) Uᵀb,  so  x = V diag(σ₁⁻¹, …, σᵣ⁻¹, 0, …, 0) Uᵀb

  x = Σᵢ₌₁ʳ (uᵢᵀb / σᵢ) vᵢ,   with σᵣ > ε and σᵣ₊₁ ≤ ε
• Pseudoinverse: A⁺ = V diag(σ₁⁻¹, …, σᵣ⁻¹, 0, …, 0) Uᵀ
  – A⁺ is an n×m matrix.
  – If rank(A) = n then A⁺ = (AᵀA)⁻¹Aᵀ.
  – If A is square, A⁺ = A⁻¹.
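As a concrete illustration, here is a minimal numpy sketch of the truncated-SVD pseudoinverse solve described above; the matrix, right-hand side and tolerance are illustrative, not from the slides.

```python
# Sketch: x = A+ b via the SVD, zeroing the reciprocals of tiny singular values.
import numpy as np

def svd_solve(A, b, eps=1e-10):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
    s_inv = np.where(s > eps, 1.0 / s, 0.0)            # keep only significant sigma_i
    return Vt.T @ (s_inv * (U.T @ b))                  # x = V diag(1/sigma_i) U^T b

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])     # 3x2, rank 2
b = np.array([1.0, 2.0, 3.0])
x = svd_solve(A, b)
print(np.allclose(x, np.linalg.pinv(A) @ b))           # matches numpy's pseudoinverse
```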
Well Posed Problems
• Hadamard postulated that for a problem to be "well posed":
  1. A solution must exist.
  2. It must be unique.
  3. Small changes to the input data should cause small changes to the solution.
• Many problems in science and computer vision result in "ill-posed" problems.
  – Numerically it is common to have condition 3 violated.
• Recall from the SVD:

  x = Σᵢ₌₁ⁿ (uᵢᵀb / σᵢ) vᵢ
• If the σᵢ are close to zero, small changes in the "data" vector b cause big changes in x.
• Converting an ill-posed problem to a well-posed one is called regularization.
Regularization
• The pseudoinverse provides one means of regularization.
• Another is to solve the damped system (AᵀA + εI)x = Aᵀb, whose solution is

  x = Σᵢ₌₁ⁿ (σᵢ / (ε + σᵢ²)) (uᵢᵀb) vᵢ
• Solution of the regular problem requires minimizing ||Ax − b||².
• The regularized version corresponds to minimizing ||Ax − b||² + ε||x||².
  – Philosophy: pay a "penalty" of O(ε) to ensure the solution does not blow up.
  – In practice we may know that the data has an uncertainty of a certain magnitude, so it makes sense to optimize with this constraint.
• Ill-posed problems are also called "ill-conditioned".
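A small numpy sketch of this regularization, assuming the filter-factor form above; it checks that the SVD expression agrees with solving the damped normal equations directly (the design matrix and data are illustrative).

```python
# Sketch: Tikhonov-regularized solution x = sum_i sigma_i/(sigma_i^2 + eps) (u_i^T b) v_i.
import numpy as np

def tikhonov_solve(A, b, eps):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = s / (s**2 + eps)                  # damped inverse singular values
    return Vt.T @ (filt * (U.T @ b))

A = np.vander(np.linspace(0, 1, 20), 8)      # mildly ill-conditioned design matrix
b = np.random.default_rng(0).normal(size=20)
x_svd = tikhonov_solve(A, b, eps=1e-3)
x_neq = np.linalg.solve(A.T @ A + 1e-3 * np.eye(8), A.T @ b)
print(np.allclose(x_svd, x_neq))             # same solution, obtained two ways
```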
Derivative
• In 1-D:

  df/dx = lim_{h→0} [f(x + h) − f(x)] / h

• Taylor series: for a continuous function

  f(x + h) = f(x) + h df/dx + (h²/2) d²f/dx² + … + (hⁿ/n!) dⁿf/dxⁿ + …
  f(x − h) = f(x) − h df/dx + (h²/2) d²f/dx² − … + (−1)ⁿ (hⁿ/n!) dⁿf/dxⁿ + …
• Geometric interpretation: approximate a smooth curve by the values of its tangent, curvature, etc.
Remarks
• Mean value theorem:
  – f(b) − f(a) = (b − a) df/dx|_c for some c with a < c < b.
  – There is at least one point between a and b on the curve where the slope matches that of the straight line joining the two points.
• df/dx = 0 represents a minimum, maximum or saddle point of the curve y = f(x).
  – d²f/dx² > 0: minimum;  d²f/dx² < 0: maximum
  – d²f/dx² = 0: the second-derivative test is inconclusive (e.g. a saddle/inflection point)
Finite differences
• Approximate derivatives at points by using values of a function known at certain neighboring points
• Truncate Taylor series and obtain an expression for the derivatives
• Forward differences: use the value at the point and at points ahead of it

  df/dx|ₓ = [f(x + h) − f(x)] / h − (h/2) d²f/dx²|ₓ + O(h²)

• Backward differences

  df/dx|ₓ = [f(x) − f(x − h)] / h + (h/2) d²f/dx²|ₓ + O(h²)
Finite Differences
• Central differences
  – Higher order approximation: averaging the forward and backward expansions cancels the ±(h/2) d²f/dx² error terms, giving

  df/dx|ₓ = [f(x + h) − f(x − h)] / (2h) + O(h²)
–However we need data on both sides
–Not possible for data on the edge of an image
–Not possible in time dependent problems (we have data at current time and previous one)
Approximation
• Order of the approximation: O(h), O(h²)
• Sidedness, one sided, central etc.
• Points around point where derivative is calculated that are involved are called the “stencil” of the approximation.
• Second derivative
  d²f/dx²|ₓ = [f(x + h) − 2f(x) + f(x − h)] / h² + O(h²)

• One-sided difference of O(h²):

  df/dx|ₓ = [−3f(x) + 4f(x + h) − f(x + 2h)] / (2h) + O(h²)
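A quick numerical check of these formulas on a known function (sin, with its exact derivative cos); the step size is an arbitrary illustrative choice.

```python
# Sketch: forward, central, one-sided O(h^2), and second-derivative differences.
import numpy as np

f, dfdx = np.sin, np.cos
x, h = 1.0, 1e-3

fwd      = (f(x + h) - f(x)) / h                            # O(h)
central  = (f(x + h) - f(x - h)) / (2 * h)                  # O(h^2)
onesided = (-3*f(x) + 4*f(x + h) - f(x + 2*h)) / (2 * h)    # O(h^2), forward points only
second   = (f(x + h) - 2*f(x) + f(x - h)) / h**2            # O(h^2) second derivative

print(abs(fwd - dfdx(x)), abs(central - dfdx(x)), abs(onesided - dfdx(x)))
print(abs(second + np.sin(x)))                               # d^2(sin)/dx^2 = -sin(x)
```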
Polynomial interpolation
• Instead of playing with Taylor series we can obtain fits using polynomial expansions.
  – 3 points fit a quadratic ax² + bx + c
• Can calculate the 1st and 2nd derivatives
– 4 points fit a cubic, etc.
• Given x₁, x₂, x₃, x₄ and values f₁, f₂, f₃, f₄, fit a₀ + a₁x + a₂x² + a₃x³:

  [ 1  x₁  x₁²  x₁³ ] [ a₀ ]   [ f₁ ]
  [ 1  x₂  x₂²  x₂³ ] [ a₁ ] = [ f₂ ]
  [ 1  x₃  x₃²  x₃³ ] [ a₂ ]   [ f₃ ]
  [ 1  x₄  x₄²  x₄³ ] [ a₃ ]   [ f₄ ]
• Vandermonde system: fast algorithms exist for its solution.
• If there are more data points than the degree requires, we can get a least-squares solution.
• Matlab functions: polyfit, polyval.
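As a sketch, the same fit can be done with numpy's equivalents of polyfit/polyval; the node locations and values below are made up for illustration.

```python
# Sketch: exact cubic through 4 points, and a least-squares fit when there is more data.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([1.0, 2.0, 0.0, 5.0])

coeffs = np.polyfit(x, f, deg=3)        # solves the Vandermonde system (4 points, cubic)
print(np.polyval(coeffs, x))            # reproduces f at the nodes

x_many = np.linspace(0, 3, 50)          # more data than the degree requires ...
ls_coeffs = np.polyfit(x_many, np.sin(x_many), deg=3)   # ... gives a least-squares fit
```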
Remarks
• Can use the fitted polynomial to calculate derivatives.
• If equation is solved analytically this provides expressions for the derivatives.
• Equation can become quite ill conditioned – especially if equations are not normalized.
ax² + bx + c can also be written as a*(x − x₀)² + b*(x − x₀) + c*
  – Find the polynomial through x₀ − h, x₀, x₀ + h:
  [ 1  −h  h² ] [ a₀ ]   [ f₋₁ ]
  [ 1   0   0 ] [ a₁ ] = [ f₀  ]
  [ 1   h  h² ] [ a₂ ]   [ f₁  ]
  – a₀ = f₀,  a₁ = (f₁ − f₋₁)/(2h),  a₂ = (f₋₁ − 2f₀ + f₁)/(2h²)
–Gives the expected values of the derivatives.
Polynomial interpolation
• Results from algebra:
  – A polynomial of degree n through n+1 points is unique.
  – Polynomials of degree less than n form an n-dimensional space.
  – 1, x, x², …, xⁿ⁻¹ form a basis.
• Any other polynomial can be represented as a combination of these basis elements.
– Other sets of independent polynomials can also form bases.
• To fit a polynomial through x₀, …, xₙ with values f₀, …, fₙ, use the Lagrangian basis lₖ:

  lₖ(x) = Πᵢ₌₀, ᵢ≠ₖⁿ (x − xᵢ) / (xₖ − xᵢ),   k = 0, …, n

  – p(x) = a₀l₀ + a₁l₁ + … + aₙlₙ; then aᵢ = fᵢ.
  – Many polynomial bases: Chebyshev, Legendre, Laguerre, …; Bernstein, Bookstein, …
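A minimal sketch of evaluating the interpolant in the Lagrangian basis, with illustrative nodes and values; each basis polynomial l_k is 1 at x_k and 0 at the other nodes, so the coefficient of l_k is just f_k.

```python
# Sketch: p(x) = sum_k f_k * l_k(x) with l_k(x) = prod_{i != k} (x - x_i)/(x_k - x_i).
def lagrange_eval(x_nodes, f_nodes, x):
    p = 0.0
    for k, (xk, fk) in enumerate(zip(x_nodes, f_nodes)):
        lk = 1.0
        for i, xi in enumerate(x_nodes):
            if i != k:
                lk *= (x - xi) / (xk - xi)   # basis polynomial l_k evaluated at x
        p += fk * lk
    return p

print(lagrange_eval([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], 1.5))
```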
Increasing n
• As n increases we can increase the polynomial degree.
• However the function in between is very poorly interpolated.
• Becomes ill-posed.
• For large n interpolant blows up.
•Idea: –Taylor series provides good local approximations
–Use local approximations
•Splines
Spline interpolation
• Piecewise polynomial approximation
  – E.g. interpolation in a table.
  – Given xₖ, xₖ₊₁, fₖ and fₖ₊₁, evaluate f at a point x such that xₖ < x < xₖ₊₁:
  f(x) = fₖ (xₖ₊₁ − x)/(xₖ₊₁ − xₖ) + fₖ₊₁ (x − xₖ)/(xₖ₊₁ − xₖ)   for xₖ ≤ x ≤ xₖ₊₁
  f(x) = 0 otherwise
• Construct approximations of this type on each subinterval; this method uses Lagrangian interpolants.
• The endpoints are called breakpoints.
• For a higher polynomial degree we need more conditions,
  – e.g. specify values at points inside the interval [xₖ, xₖ₊₁].
• Specifying function and derivative values at the endpoints xₖ, xₖ₊₁ leads to cubic Hermite interpolation.
Cubic Spline
• Splines: the name given to a flexible piece of wood used by draftsmen to draw curves through points.
  – Bend the wood piece so that it passes through the known points and draw a line through it.
  – The most commonly used interpolant is the cubic spline.
  – Provides continuity of the function and its 1st and 2nd derivatives at the breakpoints.
  – Given n+1 points we have n intervals.
  – Each polynomial has four unknown coefficients.
    • Specifying function values provides 2 equations.
    • Two derivative-continuity equations provide two more.
  {xᵢ, fᵢ},  i = 1, …, n+1
  Pᵢ(xᵢ) = fᵢ,  i = 1, …, n+1
  P′ᵢ(xᵢ) = P′ᵢ₋₁(xᵢ),  i = 2, …, n
  P″ᵢ(xᵢ) = P″ᵢ₋₁(xᵢ),  i = 2, …, n
•Left with two free conditions. Usually chosen so that second derivatives are zero at ends
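A sketch using scipy's cubic-spline interpolant with the "natural" end condition just described (second derivatives zero at the ends); the sample data are illustrative.

```python
# Sketch: natural cubic spline through sampled points, plus its 1st and 2nd derivatives.
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 2 * np.pi, 8)
f = np.sin(x)

spline = CubicSpline(x, f, bc_type='natural')   # free ends: f'' = 0 at both endpoints
xs = np.linspace(0, 2 * np.pi, 100)
print(np.max(np.abs(spline(xs) - np.sin(xs))))  # interpolation error between breakpoints
print(spline(1.0, 1), spline(1.0, 2))           # 1st and 2nd derivatives at x = 1
```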
Interpolating along a curve
• Curve can be given as x(s) and y(s)
• Given xi,yi,si
• Can fit splines for x and y
• Can compute tangents, curvature and normal based on this fit
• Things like intensity can vary along the curve. Can also fit I(s).
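A sketch of the idea, again assuming scipy's cubic splines: fit x(s) and y(s) separately against the parameter samples s_i, then read tangents off the derivatives (the circle used as the curve is illustrative).

```python
# Sketch: parametric curve interpolation; I(s) could be fitted the same way.
import numpy as np
from scipy.interpolate import CubicSpline

s = np.linspace(0, 1, 10)                   # parameter samples along the curve
x, y = np.cos(2*np.pi*s), np.sin(2*np.pi*s)

sx, sy = CubicSpline(s, x), CubicSpline(s, y)
t = np.array([sx(0.3, 1), sy(0.3, 1)])      # tangent (dx/ds, dy/ds) at s = 0.3
print(t / np.linalg.norm(t))                # unit tangent; the normal is perpendicular
```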
Typical Optimization Problems
• Model fitting
  – Fit a straight line or polynomial through data:  yᵢ = Σⱼ aⱼ xᵢʲ
  – Fit a sum of cosines, exponentials, etc.:  yᵢ = Σⱼ aⱼ φⱼ(xᵢ)
  – Model: the φⱼ; parameters: the aⱼ; data: (xᵢ, yᵢ)
• Determine a transformation
  – Determine a homography matrix:  x′ = Hx
  – Determine the fundamental matrix:  x′ᵀFx = 0
Algebraic Distance
• Algebraic system Ax = b
• Approximate solution x̂
• Residual ||Ax̂ − b||
• The residual is also called the algebraic distance.
• Algorithms that seek to reduce the residual are called "minimum residual" algorithms.
Scaling
• Try to avoid any one equation being overly represented.
• Scale each equation:
  – Scale by the largest coefficient so that it becomes 1, e.g. aᵢ₁/a₁₁.
  – Or scale so that the sum of squared coefficients is 1: a₁₁² + a₁₂² + … + a₁ₙ² = 1.
• Scaling also has the benefit of avoiding round-off.
Weighted Least Squares
• Multiplying an equation by a number will increase its weight or influence in the cost function.
• Not always a bad thing:
  – We may want to weight different equations differently.
• How to select weights?
  – Number of observations
  – Reliability of measurement (measured variances)
• How good is the least squares solution? How "probable" are the parameter estimates?
• Bring in notions of statistics.
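A small sketch of weighting by measurement reliability: scale each equation by w_i = 1/sigma_i before an ordinary least-squares solve. The line model, noise levels and weights are made up for illustration.

```python
# Sketch: weighted least squares for y = a*x + b with per-point standard deviations.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
sigma = np.where(x < 0.5, 0.01, 0.2)               # first half measured much more accurately
y = 2.0 * x + 1.0 + rng.normal(scale=sigma)

A = np.column_stack([x, np.ones_like(x)])          # design matrix for y = a*x + b
w = 1.0 / sigma                                    # weight = inverse standard deviation
params, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
print(params)                                      # close to (2, 1)
```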
Cost functions for image-based data
• Notation
  – Measured value of a point: x̃
  – True value of a point: x
  – Estimated value of a point: x̂
  – The transformation or model is denoted H; model y = H(x) and x = H⁻¹(y)
• Symmetric error functions
  – Case 1: error only in one image
    • Could arise if we are imaging a calibration pattern with known coordinates and trying to determine the camera calibration.
  – The appropriate error function is: find Ĥ that minimizes Σⱼ d(x̃′ⱼ, Ĥx̃ⱼ)²
Cost Functions
• Ideally there is a real cost being minimized
  – E.g. dollars or distance travelled
  – Then each equation makes sense
• Statistical measures
• Review concepts of metrics
Constrained optimization
• We have to optimize f(x) subject to g(x) = 0.
  – Makes sense if g(x) = 0 leaves a few degrees of freedom (N − M).
• Approach 1 (eliminate constraints)
  – Eliminate variables using the constraint equations and solve a reduced problem in the remaining variables x*.
  – Not practical, except for simple problems.
• Approach 2 (penalty function)
  – Construct a new minimization function f(x) + P g(x) where P >> 1.
  – If the constraint is violated the minimization function increases rapidly, forcing the optimization routine to solutions where it is not violated.
• Approach 3 (Lagrange multipliers)
  – The solution has to lie on the surface g(x) = 0.
  – Can't have ∇f = 0 anymore.
  – However we require ∇f parallel to ∇g.
Lagrangian
• Consider the Lagrangian function L(x, λ) = f + λg

  ∂L/∂x (x, λ) = ∇f + λ∇g,   ∂L/∂λ (x, λ) = g

• Extremize the Lagrangian:

  ∂L/∂x = ∇f + λ∇g = 0,   ∂L/∂λ = g(x) = 0
• So this gives us both the constraint equation and the way to optimize the function on the surface.
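A small worked sketch for the linear-quadratic case, where the conditions ∇f + λ∇g = 0 and g = 0 reduce to a single linear (KKT) system; the matrices and constraint are illustrative.

```python
# Sketch: minimize ||Ax - b||^2 subject to c^T x = d via the Lagrangian conditions.
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.0])
c = np.array([1.0, 1.0]); d = 1.0

# [ 2A^T A   c ] [x     ]   [ 2A^T b ]
# [ c^T      0 ] [lambda] = [ d      ]
K = np.block([[2 * A.T @ A, c[:, None]],
              [c[None, :],  np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([2 * A.T @ b, [d]]))
x, lam = sol[:2], sol[2]
print(x, c @ x)            # the constraint c^T x = d is satisfied exactly
```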
Optimization Techniques
• A different problem here:
  – Given a set of locations xᵢ where one has measured a fitness function f(x), find the vector of parameters x that minimizes it.
• For the case where the function was linear we already have methods such as the SVD to solve the linear system.
• Here we are concerned with systems where the equation is not so simple.
  – In particular f may be a nonlinear function of the parameters x.
• Differential calculus provides us with ways of estimating extrema:
  – The minimum (maximum) of f occurs where ∇f = 0, or
  – ∇f is in the direction of increasing f, or
  – Given an interval where ∇f has opposite signs at the boundary, there must be a point inside where ∇f is zero.
• However calculus is local:
  – So these methods can only guarantee a local extremum.
Bisection methods
• Given a function f at three points a, b, c with a < b < c, and a way to evaluate f at a new point:
  – Given 2 initial guesses f(a) and f(b), if f(a) > f(b) move in the direction a to b and choose a new parameter c.
  – Find a triplet [a, b, c] so that f(c) > f(b) and f(a) > f(b).
  – Choose a new point between a and b, or between b and c.
  – Repeat until the points a, b and c are sufficiently close.

Parabolic bracketing
Bracketing a minimum in multiple dimensions
• Smallest region bounded by a group of points in– 1D is bounded by two points (a line segment)
– 2D is bounded by three points (a triangle)
– 3D by four points (a tetrahedron)
– In ND by N+1 points (a simplex)
• Can find a direction of a decreasing function in – 1D by the line from point with higher value to lower
– 2D by joining point with highest value through point with average value on the opposite side of the triangle
– And so on for ND
• However cannot guarantee a bracket of a minimum in ND
Downhill Simplex Method (Nelder-Mead)
• Reflection: project along the direction of decrease with a step of size 1.
• Reflection and expansion: if the decrease is large, try a step of size 2.
• Contraction: if the result of reflection is bad, try a simple reduction within the simplex.
• Multiple contraction: if the result of contraction does not give a better result than the lowest point, contract the whole simplex towards that lowest point.
• Conclude when the volume of the simplex falls below a tolerance.
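A sketch of using the downhill simplex method through scipy's Nelder-Mead implementation; the Rosenbrock test function is a common example, not from the slides, and only function evaluations (no gradient) are needed.

```python
# Sketch: derivative-free minimization with the Nelder-Mead simplex method.
import numpy as np
from scipy.optimize import minimize

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

result = minimize(rosenbrock, x0=np.array([-1.0, 2.0]), method='Nelder-Mead',
                  options={'xatol': 1e-8, 'fatol': 1e-8})   # simplex-size/value tolerances
print(result.x)            # converges to the minimum at (1, 1)
```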
Newton iteration

Taylor approximation:  f(P₀ + Δ) ≈ f(P₀) + JΔ,   J = ∂X/∂P  (Jacobian)

Residual:  e₁ = X − f(P₀ + Δ) ≈ X − f(P₀) − JΔ = e₀ − JΔ

Minimizing ||e₀ − JΔ|| gives the normal equations:

  JᵀJΔ = Jᵀe₀  ⇒  Δ = (JᵀJ)⁻¹Jᵀe₀

Update:  Pᵢ₊₁ = Pᵢ + Δ

Weighted version (measurement covariance Σ):  Δ = (JᵀΣ⁻¹J)⁻¹JᵀΣ⁻¹e₀
Gradient Descent
• We have a function f and an estimate of its gradient ∇f.
• Decrease f by taking a step along the direction of −∇f:

  begin: initialize x, tol, k = 0
    do k ← k + 1
       x ← x − hₖ∇f
    until ||hₖ∇f|| < tol
    return x
  end

• Determining h is not easy.
  – Called the "learning rate" in AI.
  – If h is too small the algorithm will be too slow to converge; if it is too large the procedure will diverge.
• Can select it using a line search or using a Newton method.
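A minimal sketch of the loop above with a fixed step size h; the quadratic test function, its gradient and the stopping tolerance are illustrative.

```python
# Sketch: gradient descent x <- x - h * grad f(x) until the update is below tol.
import numpy as np

def f(x):      return (x[0] - 1)**2 + 10 * (x[1] + 2)**2
def grad_f(x): return np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])

x = np.zeros(2)
h, tol = 0.01, 1e-8
for k in range(10_000):
    step = h * grad_f(x)
    x = x - step
    if np.linalg.norm(step) < tol:   # stop when the update h*grad(f) is small
        break
print(x)                             # approaches the minimizer (1, -2)
```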
Appendix 6: Iterative Estimation Methods
Levenberg-Marquardt method is essentially a Gauss-Newton method that transitions smoothly to gradient descent when the Gauss-Newton updates fail.
To summarize, we have so far considered three methods of minimization of a cost function g(P) = ||e(P)||²/2:
(i) Newton. Update equation:
  g_PP Δ = −g_P
where g_PP = e_Pᵀe_P + e_PPᵀe and g_P = e_Pᵀe. Newton iteration is based on the assumption of an approximately quadratic cost function near the minimum, and will show rapid convergence if this condition is met. The disadvantage of this approach is that the computation of the Hessian may be difficult. In addition, far from the minimum the assumption of quadratic behaviour is probably invalid, so a lot of extra work is done with little benefit.
(ii) Gauss-Newton. Update equation:
  e_Pᵀe_P Δ = −e_Pᵀe
This is equivalent to Newton iteration in which the Hessian is approximated by e_Pᵀe_P. Generally this is a good approximation, particularly close to a minimum, or when e is nearly linear in P.
(iii) Gradient descent. Update equation:
  λΔ = −e_Pᵀe = −g_P
The Hessian in Newton iteration is replaced by a multiple of the identity matrix. Each update is in the direction of most rapid local decrease of the function value. The value of λ may be chosen adaptively, or by a line search in the downward gradient direction. Generally, gradient descent by itself is not recommended, but in conjunction with Gauss-Newton it yields the commonly used Levenberg-Marquardt method.
A6.2 Levenberg-Marquardt iteration
The Levenberg-Marquardt (abbreviated LM) iteration method is a slight variation on the Gauss-Newton iteration method. The normal equations JᵀJΔ = −Jᵀe are replaced by the augmented normal equations (JᵀJ + λI)Δ = −Jᵀe, for some value of λ that varies from iteration to iteration. Here I is the identity matrix. A typical initial value of λ is 10⁻³ times the average of the diagonal elements of N = JᵀJ.
If the value of Δ obtained by solving the augmented normal equations leads to a reduction in the error, then the increment is accepted and λ is divided by a factor (typically 10) before the next iteration. On the other hand, if the value of Δ leads to an increased error, then λ is multiplied by the same factor and the augmented normal equations are solved again, this process continuing until a value of Δ is found that gives rise to a decreased error. This process of repeatedly solving the augmented normal equations for different values of λ until an acceptable Δ is found constitutes one iteration of the LM algorithm. An implementation of the LM algorithm is given in [Press-88].
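A minimal sketch of an LM loop in the spirit of the description above (not the [Press-88] implementation); the exponential curve-fit problem, the function names and the iteration limits are all illustrative.

```python
# Sketch: Levenberg-Marquardt with augmented normal equations (J^T J + lambda*I) delta = -J^T e.
import numpy as np

def lm(residual, jacobian, P0, n_iter=30):
    P = np.asarray(P0, dtype=float)
    e, J = residual(P), jacobian(P)
    lam = 1e-3 * np.mean(np.diag(J.T @ J))       # typical initial value of lambda
    for _ in range(n_iter):
        N = J.T @ J
        delta = np.linalg.solve(N + lam * np.eye(len(P)), -J.T @ e)
        e_new = residual(P + delta)
        if np.sum(e_new**2) < np.sum(e**2):      # error decreased: accept, lower damping
            P, e, J, lam = P + delta, e_new, jacobian(P + delta), lam / 10.0
        else:                                    # error increased: reject, raise damping
            lam *= 10.0
    return P

t = np.linspace(0, 1, 30)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * np.random.default_rng(2).normal(size=30)
res = lambda P: P[0] * np.exp(P[1] * t) - y      # residual e(P) for the model a*exp(b*t)
jac = lambda P: np.column_stack([np.exp(P[1] * t), P[0] * t * np.exp(P[1] * t)])
print(lm(res, jac, np.array([1.0, 0.0])))        # close to (2, -1.5)
```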
Levenberg-Marquardt
Normal equations:            N Δ = JᵀJ Δ = Jᵀe₀
Augmented normal equations:  N′ Δ = Jᵀe₀,  with  N′ = JᵀJ + λ diag(JᵀJ)

Initial value λ₀ = 10⁻³
  success: λᵢ₊₁ = λᵢ/10, accept the update
  failure: λᵢ₊₁ = 10λᵢ, solve again

λ small ~ Newton (quadratic convergence)
λ large ~ descent (guaranteed decrease)
Levenberg-Marquardt
Requirements for minimization:
• Function to compute f
• Start value P₀
• Optionally, a function to compute J (but numerical approximation is OK, too)
Sparse Levenberg-Marquardt
• Complexity of solving N′Δ = Jᵀe₀ directly is O(N³): prohibitive for large problems
  (100 views, 10,000 points ~ 30,000 unknowns)
• Partition the parameters:
  – partition A
  – partition B (only dependent on A and itself)

Residual: ε = X − X̂; we seek to minimize ||ε||²_Σ_X, the Σ_X-weighted squared norm.
With A = [∂X̂/∂a] and B = [∂X̂/∂b], J = [A | B] and the normal equations are

  JᵀΣ_X⁻¹J Δ = JᵀΣ_X⁻¹ε

U, V and W (written with * in the book because their diagonals are multiplied by 1 + λ) are the blocks (AᵀΣ_X⁻¹A), etc.
Sparse bundle adjustment
• Residuals and normal equations take the sparse block form shown below.
• Note: tie points should be in partition A.
Sparse bundle adjustment
• Normal equations
• Modified normal equations
• Solve in two parts
Sparse bundle adjustment
• Covariance estimation
  Y = W V⁻¹
  Σ_a = (U − W V⁻¹ Wᵀ)⁻¹
  Σ_b = V⁻¹ + Yᵀ Σ_a Y
  Σ_ab = −Σ_a Y
Sparse bundle adjustment
• Cost function over m views and n points (needed for the non-linear minimization):

  D = Σₖ₌₁ᵐ Σᵢ₌₁ⁿ d(mₖᵢ, P̂ₖM̂ᵢ)²

• The Jacobian J of the residuals, and hence N = JᵀJ, has a sparse block structure:

  N = [ U   W ]   with U block-diagonal over the cameras (U₁, U₂, U₃, …),
      [ Wᵀ  V ]   V block-diagonal over the points, and W the camera-point coupling.

• Parameter counts: 12×m for the cameras and 3×n for the points (in general much larger).
Sparse bundle adjustment
• Eliminate the dependence of the camera/motion parameters on the structure parameters.
• Note: in general 3n >> 11m.
  [ I  −WV⁻¹ ]       [ U − WV⁻¹Wᵀ   0 ]
  [ 0    I   ] × N = [ Wᵀ           V ]

  (block sizes: 11×m camera parameters, 3×n structure parameters)
• Allows much more efficient computation:
  e.g. 100 views, 10,000 points: solve a ±1000×1000 system, not ±30,000×30,000.
• Often still band-diagonal: use sparse linear algebra algorithms.
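A small dense sketch of the two-part (Schur complement) solve described above; in a real bundle adjuster V is block-diagonal over the points, so forming Y = WV⁻¹ is cheap. The sizes and random data are illustrative only.

```python
# Sketch: solve N*delta = g by eliminating the structure block V first.
import numpy as np

rng = np.random.default_rng(3)
na, nb = 6, 30                                     # camera params << structure params
J = rng.normal(size=(100, na + nb))
eps = rng.normal(size=100)

N = J.T @ J                                        # normal-equation matrix [U W; W^T V]
U, W, V = N[:na, :na], N[:na, na:], N[na:, na:]
g_a, g_b = (J.T @ eps)[:na], (J.T @ eps)[na:]

Y = W @ np.linalg.inv(V)                           # cheap when V is block-diagonal
delta_a = np.linalg.solve(U - Y @ W.T, g_a - Y @ g_b)    # small reduced (camera) system
delta_b = np.linalg.solve(V, g_b - W.T @ delta_a)        # back-substitute for structure

print(np.allclose(N @ np.concatenate([delta_a, delta_b]), J.T @ eps))
```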