Robust Visual Motion Analysis: Piecewise-Smooth Optical Flow and
Motion-Based Detection and Tracking
Ming Ye
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Washington
2002
Program Authorized to Offer Degree: Electrical Engineering
University of Washington
Abstract
Robust Visual Motion Analysis: Piecewise-Smooth Optical Flow and Motion-Based
Detection and Tracking
by Ming Ye
Co-Chairs of Supervisory Committee:
Professor Robert M. Haralick, Electrical Engineering
Professor Linda G. Shapiro, Computer Science and Engineering
This thesis describes new approaches to optical flow estimation and motion-based detection
and tracking. Statistical methods, particularly outlier rejection, error analysis and Bayesian
inference, are extensively exploited in our study and are shown to be crucial to the robust
analysis of visual motion.
To recover optical flow, or 2D velocity fields, from image sequences, certain models of
brightness conservation and flow smoothness must be assumed. How to cope with
model violations, especially motion discontinuities, thus becomes a very challenging issue. We
first tackle this problem with a local approach, that is, finding the most representative
flow vector for each small image region. We recast the popular gradient-based method as
a two-stage regression problem and apply adaptive robust estimators to both stages. The
estimators are adaptive in the sense that their complexity increases with the amount of
outlier contamination. Due to the limited contextual information, the local approach has
spatially varying uncertainty. We evaluate the uncertainty systematically through covari-
ance propagation.
Pointing out the limitations of local and gradient-based methods, we further propose
a matching-based global optimization technique. The optimal estimate is formulated as
maximizing the a posteriori probability of the optical flow given three image frames. Using
a Markov random field flow model and robust statistics, the formulation reduces to mini-
mizing a regularization-type global energy function, which we carefully design so as to
accommodate outliers, occlusions and local adaptivity. Minimizing the resulting large-scale
nonconvex function is nontrivial and is often the performance bottleneck of previous global
techniques. To overcome this problem, we develop a three-step graduated solution method
which inherits the advantages of various popular approaches and avoids their drawbacks.
This technique is highly efficient and accurate. Its performance is demonstrated through
experiments on both synthetic and real data and comparison with competing techniques.
By making only weak assumptions of spatiotemporal continuity, the two proposed tech-
niques are applicable to general scenarios, for example, to both rigid and nonrigid motion.
They serve as a foundation for object-based motion analysis. Many of their conclusions are
also extendable to other visual surface reconstruction problems such as image restoration
and stereo matching.
The last part of the thesis describes a motion-based detection and tracking system
designed for an airborne visual surveillance application, in which challenges arise from the
small target size (1×2 to 3×3 pixels), low image quality, substantial camera wobble and
abundant background clutter. The system is composed of a detector and a tracker. The
former identifies suspicious objects by the statistical difference between their motion and
the background motion; the latter employs a Kalman filter to track the dynamic behavior of
objects in order to detect real targets and update their states. Both components operate in
a Bayesian mode, and each benefits from the other’s accuracy. The system exhibits excellent
performance in experiments. In an 1800-frame real video, it produces no false detections
and tracks the true target from the second frame on, with an average position error below 1 pixel.
This probabilistic approach reduces parameter tuning to a minimum. It also facilitates data
fusion from different information channels.
TABLE OF CONTENTS
List of Figures iv
List of Tables vi
Chapter 1: Introduction 1
1.1 Optical Flow Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 A Local Method with Error Analysis . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 A Global Optimization Method . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Motion-Based Target Detection and Tracking . . . . . . . . . . . . . . . . . . 12
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 2: Estimating Optical Flow: Approaches and Issues 15
2.1 Brightness Conservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Flow Field Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Typical Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Robust Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Hierarchical Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chapter 3: Local Flow Estimation and Error Analysis 34
3.1 A Two-Stage-Robust Adaptive Technique . . . . . . . . . . . . . . . . . . . . 34
3.1.1 Linear Regression and Robustness . . . . . . . . . . . . . . . . . . . . 35
3.1.2 Two-Stage Regression Model . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.3 Choosing Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.4 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Adaptive High-Breakdown Robust Methods For Visual Reconstruction . . . . 50
3.2.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Error Analysis on Robust Local Flow . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Covariance Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 4: Global Matching with Graduated Optimization 70
4.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 MAP Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.2 MRF Prior Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.3 Likelihood Model: Robust Three-Frame Matching . . . . . . . . . . . 74
4.1.4 Global Energy with Local Adaptivity . . . . . . . . . . . . . . . . . . 75
4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 Step I: Gradient-Based Local Regression . . . . . . . . . . . . . . . . . 77
4.2.2 Step II: Gradient-Based Global Optimization . . . . . . . . . . . . . . 77
4.2.3 Step III: Global Matching . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.4 Overall Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1 Quantitative Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.2 TS: An Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.3 Barron’s Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.4 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 5: Motion-Based Detection and Tracking 96
5.1 Bayesian State Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Motion-Based Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Bayesian Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 6: Conclusions 117
6.1 Summary and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Open Questions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 121
Bibliography 125
LIST OF FIGURES
1.1 Example Optical flow on flower garden sequence . . . . . . . . . . . . . . . . 2
1.2 Motion estimation by template matching . . . . . . . . . . . . . . . . . . . . . 5
1.3 Motion analysis for airborne video surveillance . . . . . . . . . . . . . . . . . 12
2.1 Aperture problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Hierarchical processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Comparison of Geman-McClure norm and L2 norm . . . . . . . . . . . . . . . 38
3.2 Block diagram of the two-stage-robust adaptive algorithm . . . . . . . . . . . 44
3.3 Central frame of the synthetic sequence (5 frames, 32×32) . . . . . . . . . . 45
3.4 Correct flow field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 OFC cluster plots at three typical pixels . . . . . . . . . . . . . . . . . . . . . 46
3.6 TS sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Pepsi sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8 Pepsi: estimated flow fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.9 Random sampling based algorithm for high-breakdown robust estimators . . 51
3.10 Adaptive algorithm for high-breakdown robust estimators . . . . . . . . . . . 52
3.11 TS: trial set size map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.12 TS: correct and estimated flow fields . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 TT, DT middle frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.14 YOS middle frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.15 OTTE sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.16 TAXI sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.17 TAXI: intensity images of x-component . . . . . . . . . . . . . . . . . . . . . 60
3.18 TS motion boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.19 TAXI motion boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.20 TAXI: motion boundary on images subsampled by 2 . . . . . . . . . . . . . . 69
4.1 Comparison of Geman-McClure norm and L2 norm . . . . . . . . . . . . . . . 73
4.2 System diagram (operations at each pyramid level) . . . . . . . . . . . . . . . 80
4.3 TS sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Error cdf curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5 DTTT sequence results (motion boundaries highlighted in (a)). . . . . . . . . 88
4.6 Taxi results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.7 Flower garden results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.8 Traffic results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.9 Pepsi can results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1 A typical detection-tracking system . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Proposed Bayesian system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Example data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4 f16502 target pixel candidates . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 f18300 and f19000 target pixel candidates . . . . . . . . . . . . . . . . . . . . 108
5.6 Target pixels for f16502, f18300 and f19000 . . . . . . . . . . . . . . . . . . . 110
5.7 Detection results w and w/o priors on f16503 . . . . . . . . . . . . . . . . . . 112
5.8 Two sample frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
LIST OF TABLES
3.1 Comparison of four popular regression criteria (estimators) . . . . . . . . . . 40
3.2 TS sequence: comparison of average error percentage . . . . . . . . . . . . . . 48
3.3 Quantitative comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 Quantitative measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Comparison of various techniques on Yosemite (cloud part excluded) with
Barron’s angular error measure . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1 Quantitative measures in 1800 frames . . . . . . . . . . . . . . . . . . . . . . 115
ACKNOWLEDGMENTS
It is a great pleasure to express my gratitude to all those who have made this disser-
tation possible. First, I thank my co-advisor Prof. Robert Haralick, a man of wisdom
and rigor, for his guidance and support during both my master’s and doctoral study.
I am deeply indebted to Prof. Linda Shapiro, who became my co-advisor in my final
year and helped me through the critical period of time with constant support and
encouragement.
I would also like to thank other members of my supervisory committee: Prof.
Jenq-Neng Hwang, Prof. Qiang Ji, Prof. Werner Stuetzle, Prof. Ming-Ting Sun
and Prof. David Thouless, who monitored my work and put in the effort to read
earlier versions of this dissertation.
My former colleagues in the Intelligence Systems Laboratory: Dr. Qiang Ji, Dr.
Gang Liu, Dr. Desikachari Nadadur, Dr. Selim Aksoy, Dr. Mingzhou Song, Dr.
Jisheng Liang, Dr. Lei Sui and Dr. Yalin Wang, deserve many thanks for their
friendship and help. I especially want to thank Dr. Qiang Ji and Dr. Gang Liu for
pleasant and fruitful discussions and brotherly advice that helped me stay encouraged
and on the right track.
I am grateful to the Electrical Engineering Department for providing me a good
work environment during my final year. Particularly, I must thank Helene Obradovich
for her efforts to support me with a teaching assistantship, Frankye Jones for keeping
an eye on my progress, and Sekar Thiagarajan and his team for their computing
support.
I wish to express sincere appreciation to Dr. Marshall Bern and Dr. David Gold-
berg for giving me the opportunity to work at the Xerox Palo Alto Research Center
(PARC). Their advice, encouragement and friendship made my summer internship
at PARC a very productive and enjoyable one.
Last but certainly not least, I am forever indebted to my family and friends for their
love and care. Special thanks go to my dear husband Chengyang Li, who
will receive his Ph.D. about the same time, and to my dear parents and sister for
supporting and encouraging me to pursue my academic aspirations.
Chapter 1
INTRODUCTION
Visual motion is the 2D velocity field corresponding to the movement of brightness
patterns in the image plane of a visual sensor. It usually arises from the relative motion
between 3D objects and the observer, and it provides rich information about the surface
structures of the objects and their dynamic behavior [58, 89]. Human beings rely on the
skills of perceiving and understanding visual motion in order to move around, meet with
people, watch movies and perform many other essential daily tasks. If we want computers to
assist us and interact with us, we must endow them with a similar capability for analyzing
visual motion, that is, accurately measuring and appropriately interpreting the 2D velocity
present in digital images. This has turned out to be a highly complicated and error-prone
process. The co-existence of profound significance and great challenge makes visual motion
analysis a very important and active research area in computer vision.
Optical flow is a flexible representation of visual motion that is particularly suitable
for computers analyzing digital images. It associates each image pixel (x, y) with a two-
component vector u = (u(x, y), v(x, y))^T, indicating its apparent instantaneous 2D velocity.
The optical flow representation is adopted throughout this thesis and henceforth we use the
terms “visual motion” and “optical flow” interchangeably. In order to illustrate the concept
of optical flow, Figure 1.1 shows three frames that are part of a video sequence taken by a
camera passing in front of a flower garden. The optical flow estimated for the second frame,
subsampled by a factor of 8 in each direction to avoid clutter, is shown in Figure 1.1(d).
Overall, it agrees with our perception of motion in the scene.
Once available, optical flow estimates can be used to infer many 3D structural and
dynamic properties of the scene [54, 36]. In a general scenario, 2D image motion can
(a) Frame 1 (b) Frame 2 (c) Frame 3
(d) Optical flow estimated on Frame 2
Figure 1.1: Three frames in a video sequence taken by a camera passing in front of a flower garden and the estimated optical flow field
be caused by camera motion (ego-motion), motion of independent moving objects in the
scene, or a composite of these two. If a video sequence is taken by a moving camera of
a rigid 3D scene, as in the case of the flower garden sequence, analysis of this sequence
can lead to recovery of the camera motion (pose) [48, 64] and the 3D surface structure of
the scene [31, 78, 108, 115, 67]. When there are independent moving objects in the scene,
motion analysis can help determine the number of objects, their individual 3D motions, their
distances to the observer and surface structures. The above study is vital to applications
in environment modelling [8, 132], target detection and tracking [80, 92, 110, 32, 81, 101],
auto-navigation [2, 45, 123], video event analysis [60, 107] and medical image registration
[1].
Analyzing optical flow in the 2D domain is important in its own right. Many dynamic
features such as the focus of expansion [122], motion boundaries [93, 16] and occlusion
relationships [73] can be extracted from optical flow fields (although the extraction is much
less straightforward than it might intuitively seem; we will return to this topic in the next
section). These dynamic features can assist in image segmentation [88, 113, 14, 86, 131] and
independent motion detection [80, 92], and usually serve as intermediate measures to object-
based representations [110]. Moreover, temporal continuity encoded in visual motion has
been exploited for redundancy reduction in video compression [114, 150, 109], image/video
super-resolution [112], and removal of image noise [120] and image distortion [142].
Visual motion, as a compelling cue to the perception of 3D structure and realism, can
also be used for graphics and animation [29, 100]. For example, a cartoon character can be
made to mimic a human character’s expression by first measuring the human character’s
facial motion and then warping the cartoon character accordingly. Such concepts have
already been utilized in film production [100], and they are expected to play an increasingly
important role in the future with advances in computational technology.
All the visual motion applications discussed above assume that accurate optical flow
estimates are already available or can be conveniently computed. Unfortunately, recovering
optical flow from images is very difficult for three reasons. First of all, the movements of
brightness patterns in the image plane might not impose sufficient constraints on the actual
2D motion—this is the intrinsic ambiguity of optical flow. Secondly, in formulating the
problem of optical flow estimation, certain assumptions about the motion and the image
observation must be made; these assumptions, as simplifications of real-world phenomena,
can easily be violated and result in erroneous estimates. Finally, the computation involved
can be intensive and even prohibitive, so that a more appropriate formulation might not lead
to higher practical accuracy. Even worse, these difficulties are usually entangled,
making it very hard to tell which factors contribute to a failure. For the above reasons,
despite decades of active research and steady progress, the performance of existing optical
flow estimation techniques remains unsatisfactory. It is thus the main theme of this thesis
to explore new approaches to optical flow estimation which handle these problems more
effectively.
The rest of this chapter serves as a high-level overview of my dissertation. The following
section briefly reviews optical flow research and motivates our study. Two novel techniques,
exploiting local and global motion coherence respectively, are described in Section 1.2 and
Section 1.3. Section 1.4 discusses a detection and tracking system which can be considered
as an application of visual motion, and is built on top of various results established in
our study of optical flow estimation. Finally Section 1.5 gives an outline of the thesis.
Conclusions and contributions of various pieces of our work will be pointed out in each
individual section.
1.1 Optical Flow Estimation
Basics
Given the two images in Figure 1.2, the task of estimating the optical flow in the first
frame is to determine where each pixel in this frame moves to in the next frame. The most
intuitive method of doing this is probably template matching. Consider the pixel at the
center of the box, which is near the center of the front tree trunk, in Frame 1. In order
to find its corresponding point in Frame 2, we may take the image block within the box
as a template, search Frame 2 for the block most similar to it, and compute the optical
flow vector from the displacement between the centers of the blocks. Two assumptions are
implied in this matching process: (i) the template maintains its brightness pattern and (ii)
Frame 1 Frame 2
Figure 1.2: Motion estimation by template matching. Consider the pixel at the center of the white box in Frame 1. In order to find its corresponding point in Frame 2, we may take the image block within the box as a template, search Frame 2 for the block most similar to it; the displacement between the centers of the blocks is then the optical flow vector. Templates 1, 2 and 4 show the aperture problem. Template 3 shows a case of assumption violation caused by motion boundaries.
all pixels within the template move at the same speed. These are simple embodiments of the
brightness conservation and flow field smoothness assumptions, which are the foundation of
all motion estimation methods.
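The block-matching procedure just described can be sketched in a few lines of code. This is an illustrative sketch rather than an algorithm prescribed by the text: the window half-size, the search range, and the use of the sum of squared differences (SSD) as the matching error are all assumed choices.

```python
import numpy as np

def match_template(frame1, frame2, x, y, half=7, search=10):
    """Estimate the flow vector at pixel (x, y) by exhaustive block
    matching: compare a (2*half+1)-square template around (x, y) in
    frame1 against every candidate block in frame2 within +/- search
    pixels, using the sum of squared differences (SSD) as the error."""
    template = frame1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_err, best_uv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            block = frame2[y + dy - half:y + dy + half + 1,
                           x + dx - half:x + dx + half + 1].astype(float)
            err = np.sum((template - block) ** 2)
            if err < best_err:
                best_err, best_uv = err, (dx, dy)
    return best_uv  # displacement (u, v) in pixels
```

On well-textured regions this recovers the correct integer displacement; on the problematic templates discussed next (edges, flat sky, motion boundaries) the minimum becomes ambiguous or wrong, which is exactly the point of the discussion that follows.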
The above template matching method, however, does not work well in all places. Some
problematic positions are marked in Figure 1.2. Template 1 (upper-left in Frame 1, on the
roof) contains an intensity edge and many blocks along the edge in the second frame seem
to match it almost equally well; as a result, only the motion perpendicular to the edge can
be reliably recovered. Template 2 belongs to the sky and is poorly textured. Where the
block matching process finds the best match in the next frame is largely determined by image
noise. Template 4 shows a similar problem. Its sky part is attached to the twigs and is
assigned the foreground motion (see Figure 1.1). These three cases illustrate the aperture
problem [58]: if we are examining motion only through a small aperture (region), the local
image information can be insufficient for uniquely determining the motion. The aperture
problem is the intrinsic difficulty of visual motion perception. Some of the ambiguity it
induces may be resolvable with appropriate contextual knowledge. For instance, human
viewers can recognize Templates 1 and 2 as part of the house and the sky, respectively, and
can associate their motions with the rest of the scene. There have been efforts to mimic
this ability, including adaptive template window selection [72] and flow propagation [58].
Nonetheless, the aperture problem is unavoidable in general; it always exists in the form of
spatially varying uncertainty. For such reasons, error analysis [52, 144] is an integral part
of optical flow estimation and will be addressed in this thesis.
Challenges in motion estimation also arise from assumption violations. One example
is given by Template 3. The correct motion of its center pixel is the motion of the front
tree trunk. But since the template also includes a part from the flower bed, which moves
differently, flow constancy no longer holds in this block and the outcome of the matching
process can be arbitrarily wrong. Motion discontinuities have received the most attention in
combating assumption violations, not only because they are abundant in real imagery, but
also because they often correspond to significant scene features, which could be of even greater
interest than the motion itself in some applications. The brightness conservation assumption
can also become invalid due to large image noise and illumination changes. To deal with
these problems, we may either adopt new models accommodating the abnormalities or
develop techniques that degrade gracefully even when violations are present. The latter is
indispensable because any assumption, being a simplification of a real-world phenomenon, will
potentially be violated. For this reason, devising methods robust to unmodelled events
has become a central issue in motion estimation as well as in the entire computer vision
community [50, 85, 51].
In more than two decades’ intensive research, optical flow estimation has been tackled
from different angles with variable success. Early studies [58, 82, 3] establish basic models
for brightness conservation and flow smoothness. Recent efforts [77, 15, 5, 97] emphasize
enhancing robustness against model violations and solving associated optimization prob-
lems. The following section is a glimpse of the broad area especially methods related to our
work. More literature review will be given in Chapter 2.
Overview of Related Work
Two main types of constraints are derived from the brightness conservation assumption:
matching-based constraints [3] and gradient-based constraints [58, 82]. Matching-based
constraints, as used in the template matching process, determine the motion vector by
trying a number of candidate positions and finding the one with the minimum matching
error. This method can handle large motion, but the search process can be computationally
expensive and yield poor sub-pixel accuracy [7, 14]. Gradient-based constraints are linear
approximations of matching-based constraints. By exploiting gradient information, they
can achieve much better efficiency and accuracy and hence have become the most popular
in practice. But relying on derivative computation makes their applicability more limited
[145].
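To make the linear approximation explicit (in notation that is standard in the literature, with $I_x$, $I_y$, $I_t$ denoting the spatial and temporal brightness derivatives), brightness conservation and its first-order Taylor expansion yield the gradient-based constraint:

```latex
% Brightness conservation: a point keeps its brightness as it moves,
%   I(x + u, y + v, t + 1) = I(x, y, t).
% Expanding the left-hand side to first order in the displacement,
\[
  I(x+u,\, y+v,\, t+1) \;\approx\; I(x,y,t) + I_x u + I_y v + I_t ,
\]
% so brightness conservation reduces to the linear optical flow constraint
\[
  I_x u + I_y v + I_t = 0 .
\]
```

Each pixel supplies one such linear equation in the two unknowns $(u, v)$, which is why constraints must be pooled over a neighborhood, and why the aperture problem appears whenever the pooled gradients are nearly parallel.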
Based on how flow smoothness is imposed, the approaches are further divided into
two types: local parametric and global optimization. Local parametric methods assume
that within a certain region the flow field is described by a parametric model [12]. The
simplest, yet one of the most popular models is the local constant model, as implied in
template matching. Local models usually involve simple computation and can achieve good
local accuracy [7, 39], but they degrade or fail when the model is inappropriate or the
local information becomes insufficient or unreliable. Global optimization methods cast
optical flow estimation in a regularization framework — every vector satisfies its brightness
constraint while maintaining coherence with its neighbors [58]. Because they propagate flow
between different parts of the flow field, such approaches are less sensitive to the aperture
problem, but for the same reason, they tend to oversmooth the flow field. Most popular
approaches are gradient-based. The best known classical techniques are perhaps the global
gradient-based method by Horn and Schunck [58] and the local gradient-based method by
Lucas and Kanade [82].
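As a concrete illustration, a minimal local gradient-based solver in the spirit of Lucas and Kanade can be sketched as follows. The central-difference derivatives, the plain frame difference for the temporal derivative, and the window size are illustrative simplifications, not choices taken from this thesis.

```python
import numpy as np

def lucas_kanade(frame1, frame2, x, y, half=7):
    """Local least-squares flow at (x, y): stack the gradient-based
    constraints Ix*u + Iy*v + It = 0 over a square window and solve
    the resulting overdetermined linear system."""
    f1 = frame1.astype(float)
    Iy, Ix = np.gradient(f1)               # spatial derivatives (rows = y)
    It = frame2.astype(float) - f1         # crude temporal derivative
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv  # least-squares estimate of (u, v)
```

The least-squares fit here is exactly the non-robust baseline: a single motion-boundary pixel in the window contributes constraints from the wrong motion, which is what robust estimators are designed to resist.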
Traditional techniques [7] usually require brightness conservation and flow smoothness
to be satisfied everywhere in the flow field. The restrictive assumptions result in smeared
motion discontinuities and high sensitivity to abrupt noise. As their limitations are widely
recognized, a large number of recent efforts have been devoted to increasing robustness
especially allowing motion discontinuities. For gradient-based local parametric methods,
various robust regression techniques such as robust clustering [113, 94], M-estimators [15,
109], high-breakdown robust estimators [5, 97, 145] are substituted for the traditional least-
squares estimator. They reduce the impact of model violations by fitting the structure
of the majority of the data. Global optimization approaches are reformulated in terms of
anisotropic diffusion [91, 19], Markov random fields with line process [88, 77, 57, 14], weak
continuity [20, 15], or robust statistics [116, 15, 86], among many others. These techniques
generally outperform their non-robust counterparts in terms of accuracy. But computational
complexity quickly becomes the new performance bottleneck. This is especially true for
global methods involving large-scale nonconvex optimization, which are considered the most
promising methods [111].
Why Low-level Approaches
This thesis is concerned with low-level approaches to optical flow estimation (in fact,
when people talk about optical flow methods, they normally refer to low-level methods).
“Low-level” means that only primitive image descriptors (intensity values) and weak as-
sumptions (piecewise spatiotemporal continuity) are exploited. Due to the small amount
of prior knowledge, the limitations of such approaches, for instance, in handling motion
discontinuities, are obvious and understandable.
The reader may then wonder why we do not use other channels of information or stronger
assumptions—it seems to make perfect sense to extract motion for each object separately.
Such ideas are compelling and have been exploited in a number of applications. Examples
include using color segmentation to assist motion boundary localization [129]; assuming the
motion field to be a mixture [68, 109], single/multiple rigid bodies [12] or layers [131]; and
explicitly modelling and tracking motion boundaries [86, 16]. Replacing the optical flow
representation of visual motion by an object-based representation has also been suggested
[118, 48, 101].
Nonetheless, low-level approaches continue to be extensively studied for good reasons
[83]. First of all, by making weaker assumptions, low-level methods are more general and
are applicable to different types of visual motion, for example, both rigid and nonrigid
motion. Secondly, low-level methods are indispensable building blocks, leading in a bottom-
up fashion to more complex motion analysis [118]; in fact, higher-level methods usually need
low-level methods in model selection [130], initialization and optimization procedures [68],
and advances in low-level research are applicable to them as well. Finally, there is still
plenty of room for improvement in low-level motion estimation, particularly in robustness
and error analysis. Due to compromises in formulations and solution methods, existing
techniques can fail even in ideal settings. As an example, many methods intended to preserve
motion discontinuities use gradient-based brightness constraints, which can break down at
discontinuities due to derivative evaluation failure. Error analysis of motion estimates is a
crucial task due to the inherent ambiguity in visual motion. Insufficient robustness
and error analysis in optical flow estimation are the major motivations of our research.
We have considered both local and global approaches to piecewise-smooth optical flow
estimation. The following two sections overview the main results and contributions of our
work.
1.2 A Local Method with Error Analysis
A Two-Stage-Robust Adaptive Scheme. Gradient-based optical flow estimation tech-
niques essentially consist of two stages: estimating derivatives, and organizing and solving
optical flow constraints (OFC). Both stages pool information over a certain neighborhood and
are regression procedures by nature. Least-squares (LS) solutions to the regression problems
break down in the presence of outliers such as motion boundaries. To cope with this
problem, a few robust regression tools [15, 86, 97, 5] have been introduced to the OFC
stage. However, as a very similar information pooling step, derivative calculation has sel-
dom received proper attention in optical flow estimation. Crude derivative estimators are
widely used; as a consequence, robust OFC (one-stage robust) methods still break down
near motion boundaries. Pointing out this limitation, we propose to calculate derivatives
from a robust facet model [146, 145]. To reduce the computational overhead, we carry out the
robust derivative stage adaptively according to a confidence measure of the flow estimate.
Preliminary experimental results show that the two-stage robust scheme permits correct
flow recovery even right at motion boundaries.
A Deterministic Algorithm for High-Breakdown Robust Regression. High-
breakdown criteria are employed in both of the above regression problems. They have no
closed-form solutions and past research has resorted to certain approximation schemes. So
far all applications of high-breakdown robust methods in visual reconstruction [121, 75, 5, 97,
117] have adopted a random-sampling algorithm given in [106]—the estimate with the best
criterion value is picked from a random pool of trial estimates. These methods uniformly
apply the algorithms to all pixels in an image regardless of the actual amount of outliers,
and suffer from heavy computation as well as unstable accuracy. By taking advantage
of the piecewise smoothness property of the visual field and the selection capability of
robust estimators, we propose a deterministic adaptive algorithm for high-breakdown local
parametric estimation. Starting from LS estimates, we iteratively choose neighbors’ values
as trial solutions and use robust criteria to adapt them to the local constraint. This method
provides an estimator whose complexity depends on the actual outlier contamination. It
inherits the merits of both LS and robust estimators and results in crisp boundaries as
well as smooth inner surfaces; it is also faster than algorithms based on random sampling.
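For reference, the random-sampling scheme of [106] that the cited applications adopt can be sketched on a toy line-fitting problem (an illustration of that baseline, not of our deterministic algorithm; the function name and trial count are our own choices):

```python
import numpy as np

def lmeds_line(x, y, n_trials=200, seed=0):
    """Least-median-of-squares line fit via random sampling:
    draw minimal 2-point subsets, fit a candidate line to each,
    and keep the candidate whose squared residuals have the
    smallest median (robust to up to ~50% outlier contamination)."""
    rng = np.random.default_rng(seed)
    best, best_med = (0.0, 0.0), np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue  # degenerate minimal sample, skip
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        med = np.median((y - (a * x + b)) ** 2)
        if med < best_med:
            best, best_med = (a, b), med
    return best
```

Because the pool of trial estimates is random and is drawn uniformly at every pixel, the cost and the quality of the result are independent of how many outliers are actually present, which is the inefficiency our deterministic scheme targets.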
Error Analysis Through Covariance Propagation. Due to the aperture problem
and outlying structures, an optical flow estimate generally has spatially varying reliability.
In order for subsequent applications to make judicious use of the results [34], error statistics
of the flow estimate have to be analyzed. In our earlier work [141], we have conducted error
analysis for the least-squares-based local estimation method using the covariance propaga-
tion theory for approximate linear systems and small errors. Here we generalize the results
to the newer robust method. Our analysis estimates image noise and derivative errors in an
adaptive fashion and takes into account the correlation of derivative errors at adjacent positions.
It is more complete, systematic and reliable than previous efforts.
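To illustrate the idea for the least-squares case, the flow covariance can be propagated from the constraint residuals roughly as follows (a simplified sketch that, unlike the full analysis, assumes i.i.d. temporal-derivative errors and ignores the correlated spatial-derivative errors):

```python
import numpy as np

def ls_flow_with_covariance(Ix, Iy, It):
    """Constant-model LS flow for one neighborhood plus a
    first-order covariance estimate: u = (A'A)^-1 A'b and
    Cov(u) ~ sigma^2 (A'A)^-1, with the noise variance sigma^2
    estimated from the fit residuals."""
    A = np.column_stack([Ix, Iy])
    b = -np.asarray(It)
    AtA = A.T @ A
    u = np.linalg.solve(AtA, A.T @ b)
    r = b - A @ u
    sigma2 = (r @ r) / max(len(b) - 2, 1)  # residual noise variance
    return u, sigma2 * np.linalg.inv(AtA)
```

The returned 2x2 covariance makes the aperture problem visible: an elongated gradient distribution yields a covariance that is large along the under-constrained direction.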
1.3 A Global Optimization Method
By drawing information from the entire visual field, the global optimization approach [58, 15]
to optical flow estimation is conceptually more effective in handling the aperture problem
and outliers than the local approach. But its actual performance has been somewhat
disappointing due to formulation defects and solution complexity. On one hand, approximate
formulations are frequently adopted for ease of computation, with the consequence that the
correct flow is unrecoverable even in ideal settings. On the other hand, more sophisticated
formulations typically involve large-scale nonconvex optimization problems, which are so
hard to solve that the practical accuracy might not be competitive with simpler methods.
The global optimization method we have developed is aimed at better solutions to both
problems.
From a Bayesian perspective, we assume the flow field prior distribution to be a Markov
random field (MRF) and formulate the optimal optical flow as the maximum a posteriori
(MAP) estimate, which is equivalent to the minimum of a robust global energy function.
The novelty in our formulation lies mainly in two aspects. 1) Three-frame matching is
proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2)
The strengths of brightness and smoothness errors in the global energy are automatically
balanced according to local data variation, and consequently parameter tuning is reduced.
These features enable our method to achieve a higher upper bound on accuracy than previous
algorithms.
In order to solve the resultant energy minimization problem, we develop a hierarchical
three-step graduated optimization strategy. Step I is a high-breakdown robust local gradient
method with a deterministic iterative implementation, which provides a high-quality initial
flow estimate. Step II is a global gradient-based formulation solved by Successive Over-
Relaxation (SOR), which efficiently improves the flow field coherence. Step III minimizes
the original energy by greedy propagation. It corrects gross errors introduced by derivative
evaluation and pyramid operations. In this process, merits are inherited and drawbacks are
largely avoided in all three steps. As a result, high accuracy is obtained both on and off
motion boundaries.
Performance of this technique is demonstrated on a number of standard test data sets.
On Barron’s synthetic data, which have become the benchmark since the publication of
[7], this method achieves the best accuracy among all low-level techniques. Close comparison
with the well-known dense regularization technique of Black and Anandan (BA) [14]
shows that our method yields uniformly higher accuracy in all experiments at a similar
computational cost.
Figure 1.3: Motion analysis for airborne video surveillance. (a) A typical frame. (b) Target marked. A tiny airplane is only observable by its distinct motion.
1.4 Motion-Based Target Detection and Tracking
In a visual surveillance project funded by the Boeing Company, we have investigated an
application of optical flow to airborne target detection and tracking. The greatest difficulty
in this problem lies in the extremely small target size, typically 2×1 to 3×3 pixels, which
makes results from most previous aerial visual surveillance studies inapplicable. Challenges
also arise from low image quality, substantial camera wobble and abundant background
clutter. A sample frame of the client data is given in Figure 1.3 together with a copy in
which the target is marked.
The proposed system consists of two components: a moving object detector identifies
objects by the statistical difference between their motions and the background motion, and
a Kalman filter tracks their dynamic behaviors in order to detect targets and update their
states. Both the detector and the tracker operate in a Bayesian mode and they each benefit
from the other’s accuracy. The system exhibits excellent performance in experiments. On
an 1800-frame real video clip with heavy clutter and a true target (1×2 to 3×3 pixels
in size), it produces no false targets and tracks the true target from the second frame with
average position error below 1 pixel. This probabilistic approach reduces parameter tuning
to a minimum. It also facilitates data fusion from different information channels.
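For illustration, the prediction-update cycle of a Kalman filter for a constant-velocity point target might be sketched as follows (the state layout and noise levels here are illustrative assumptions, not the system's actual design):

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal Kalman filter for a 2D target with state
    (x, y, vx, vy) under a constant-velocity motion model;
    only position is measured."""
    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt   # position += velocity * dt
        self.H = np.eye(2, 4)              # observe position only
        self.Q = q * np.eye(4)             # process noise
        self.R = r * np.eye(2)             # measurement noise
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with position measurement z = (x, y)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

In the full system the measurement comes from the motion-based detector, and the filter's predicted state in turn supplies the prior for the next detection.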
1.5 Thesis Outline
The first and major half of the dissertation is devoted to optical flow estimation, and the rest
describes the motion-based target detection and tracking system. To enhance visual motion
analysis robustness, which is the central issue in our study, statistical tools are extensively
explored at every stage. Given the diversity of the topics, previous work is summarized and
mathematical and statistical tools are introduced when the need arises.
Chapter 2 serves as a literature review on piecewise-smooth optical flow estimation.
Standard constraints derived from the brightness conservation and flow smoothness assump-
tions and common techniques such as hierarchical processing are described. Representative
methods, both classical and more robust ones, are discussed. The relative merits of different
approaches are important considerations in designing our methods.
Chapter 3 addresses the two-stage robust adaptive approach to local flow estimation
and its error analysis. Using the facet model, the popular local gradient-based approach is
reformulated as a two-stage regression problem. Appropriate robust estimators are identified
for both stages and the adaptive scheme is introduced. A deterministic algorithm for high-
breakdown robust regression in visual reconstruction is proposed, and its effectiveness is
demonstrated at the OFC solving stage. Error analysis carried out for the least-squares
version of the method is reviewed and then the results are generalized to the robust version.
Experimental results on both synthetic and real data are given in each of the above three
parts. Robust estimation is formally introduced in this chapter and it will be extensively
used in the rest of the thesis.
Chapter 4 discusses the global optimization approach to optical flow estimation. From
a Bayesian perspective, the maximum a posteriori (MAP) criterion is used with a Markov
random field (MRF) prior distribution to formulate optical flow estimation as minimiz-
ing a global energy function. The global energy is carefully designed to allow occlusions,
flow discontinuities and local adaptivity. Furthermore, a graduated deterministic solution
technique is developed for the minimization problem. It exploits the advantages of various
formulations and solution techniques for accuracy and efficiency. The theoretical and practi-
cal advantages of this method are illustrated by experimental results and comparisons with
other techniques on various synthetic and real image sequences. This chapter concludes by
pointing out contributions and future research directions along this line.
Chapter 5 presents the motion-based target detection and tracking system. It begins
by describing the Kalman-filter-based tracker. In doing so, the Bayesian state estimation
theory, which is also used in the detection phase, is explained. A hybrid motion estimator
is devised to locate independently moving objects. Its measurements are integrated with
priors from the previous tracking results, and then the detector can operate in a Bayesian
mode. Performance of this system is demonstrated on real airborne video.
Chapter 6 concludes this dissertation by summarizing the results, contributions and
future research avenues of each individual piece of our work.
Chapter 2
ESTIMATING OPTICAL FLOW: APPROACHES AND ISSUES
Optical flow estimation has long been an active research area in computer vision. Pio-
neering work on calculating image velocity for compressing TV signals [79, 28] dates back
to the mid 70’s. During the 80’s, the fundamental assumptions enabling optical flow es-
timation, namely, brightness conservation and flow field coherence, were examined from
different angles resulting in a large number of techniques, which are compared in the in-
fluential review articles by Barron, Fleet and Beauchemin [7, 10]. A drawback common to
many of these early techniques is that they usually require the assumptions to be satisfied in
a strict (least-squares) sense so that their performance degrades severely in the presence of
unmodelled events, especially motion discontinuities. As such limitations have been widely
recognized, the theme of optical flow research in recent years has shifted to enhancing the
robustness of classical approaches. Encouraging progress has been made along this line and
the estimation accuracy has been greatly improved. However, due to problems in formu-
lations and solution techniques, there still exists a considerable gap between the achieved
performance and what is desired in real-world applications. In addition, visual motion has
its intrinsic ambiguity, which cannot be resolved by any estimation method. This makes
reliable error analysis of optical flow estimates a crucial issue that needs to be addressed more
adequately. This unsatisfactory state of affairs continues to motivate investigation in the
area.
This chapter reviews piecewise-smooth optical flow estimation. We will describe typical
formulations, representative techniques and their relative merits. The purpose is not to give
a comprehensive literature review, which is beyond our scope, but to provide background
knowledge for understanding difficulties in this problem, major achievements of previous
work and motivations for our study. We organize this chapter as follows. The first two
sections discuss the modelling of brightness conservation and flow coherence respectively,
and Section 2.3 describes typical formulations resulting from combinations of these models.
Section 2.4 addresses challenges arising from modelling violations and efforts at ameliorating
these problems. Section 2.5 points out the inherent ambiguity of optical flow and introduces
previous work on error analysis. Finally, Section 2.6 explains the hierarchical process that
is widely employed to handle large motions.
2.1 Brightness Conservation
Let I(x, y, t) be the image intensity at a point (x, y) at time t. The brightness conservation
assumption can be expressed as
I(x, y, t) = I(x + δx, y + δy, t + δt)
= I(x + uδt, y + vδt, t + δt), (2.1)
where (δx, δy) is the spatial displacement during the time interval δt, and (u, v) is the optical
flow vector. This equation simply states that a point maintains its intensity value during
motion, or corresponding points in different frames have the same brightness.
Matching-based methods [3, 120] find the flow vector or displacement that yields the
best match between image regions in different frames. Best match can be defined in terms
of maximizing a similarity measure such as the normalized cross-correlation, or minimizing
a distance measure such as the sum-of-squares difference (SSD):
\[
E_B(u, v) = \sum_{(x,y)\in R} \left[ I(x, y, t) - I(x + u\delta t,\, y + v\delta t,\, t + \delta t) \right]^2, \tag{2.2}
\]
where EB designates the brightness conservation error, and R is the image region spanned
by the template.
Such matching criteria normally do not lead to closed-form solutions. In order to find
the best match, usually a set of displacements are hypothesized, and the one with the best
matching score is retained. This discrete exhaustive search process has poor efficiency and
often results in low subpixel accuracy. For this reason, gradient-based methods have gained
popularity in the optical flow estimation community.
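The exhaustive search over hypothesized displacements can be sketched as follows (a minimal illustration of Eq. 2.2; the template size and search range are arbitrary choices):

```python
import numpy as np

def ssd_match(tpl, frame2, x, y, search=4):
    """Exhaustive SSD block matching: slide the template taken
    around (x, y) in frame 1 over a search window in frame 2 and
    return the integer displacement (dx, dy) with the smallest
    sum-of-squared-differences."""
    h, w = tpl.shape
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            patch = frame2[y + dy:y + dy + h, x + dx:x + dx + w]
            if patch.shape != tpl.shape:
                continue  # window fell off the image
            err = np.sum((tpl - patch) ** 2)
            if err < best_err:
                best, best_err = (dx, dy), err
    return best
```

Note the double loop over the whole search window: the cost grows quadratically with the search radius and only integer displacements are tested, which is exactly the inefficiency and subpixel limitation discussed above.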
Gradient-based methods [58, 82, 53] make use of differential approximations of the
brightness constancy constraint Eq. 2.1. When the spatiotemporal image intensity I is
differentiable at the point (x, y, t), the right side of Eq. 2.1 can be expanded as Taylor
series, yielding
I(x, y, t) = I(x, y, t) + Ixuδt + Iyvδt + Itδt + ε,
where (Ix, Iy, It) is the image intensity gradient vector at the point (x, y, t), and ε represents
the higher-order terms. If the displacement (uδt, vδt) is infinitesimally small, ε becomes
negligible and the equation simplifies to the well known optical flow constraint equation
(OFCE) [58]
Ixu + Iyv + It = 0, (2.3)
which is a linear equation in the two unknowns u and v. Given n ≥ 2 pixels undergoing the
same 2D motion, their OFCEs can be grouped together and u, v can then be calculated through
linear regression.
Another way of obtaining additional constraints is to exploit second-order image deriva-
tives. Differentiating Eq. 2.3 with respect to x, y and t respectively gives three more equa-
tions:
Ixxu + Iyxv + Itx = 0
Ixyu + Iyyv + Ity = 0
Ixtu + Iytv + Itt = 0.
They can be used alone [7] or combined with the OFCE [53] to solve for (u, v).
The most distinct attraction of gradient-based constraints, compared with matching-
based constraints, is their ease of computation. The use of derivatives allows more efficient
exploration of the solution space and hence achieves lower complexity and higher floating-
point precision [7, 9]. However, it is important to point out that such advantages do come
with a price: the additional assumptions made in deriving the gradient-based constraints
dictate their more limited applicability. First of all, gradient-based constraints are valid
only for small displacements, which in practice means magnitudes below about 1-2 pixels/frame.
Secondly, in order for the higher-order terms to be negligible, the local image intensity function
should be close to a planar structure, which is also often violated. Finally, derivative es-
timation is a problematic process itself. Commonly used methods include neighborhood
differences [58], facet model fitting [145] and spatiotemporal filtering [119]. They all imply
constant optical flow in the neighborhood and therefore break down near motion bound-
aries. In fact, derivatives are low-level visual metrics just like optical flow, and thus their
computation also meets with difficulties produced by the aperture problem and assumption
violations [20, 13].
Maintaining high derivative quality, identifying unusable estimates and diagnosing fail-
ures are crucial to the robustness of (gradient-based) optical flow estimation. We address
these issues in developing both of our new techniques (Chapter 3 and Chapter 4).
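The first of the estimators listed above, plain neighborhood (central) differences, can be sketched as follows; the facet-model and filtering estimators replace these differences with local polynomial fits or filter responses (a minimal illustration):

```python
import numpy as np

def central_derivatives(I_prev, I, I_next):
    """Estimate Ix, Iy, It by central differences on a three-frame
    stack. Like all the estimators mentioned above, this implicitly
    assumes smooth intensity and constant flow over the
    neighborhood, so it degrades near motion boundaries."""
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[:, 1:-1] = (I[:, 2:] - I[:, :-2]) / 2.0   # horizontal difference
    Iy[1:-1, :] = (I[2:, :] - I[:-2, :]) / 2.0   # vertical difference
    It = (I_next - I_prev) / 2.0                 # temporal difference
    return Ix, Iy, It
```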
Frequency-based methods. Performing the Fourier transform on the brightness con-
stancy constraint Eq. 2.1 yields
I(ωx, ωy, ωt) = I(ωx, ωy, ωt)e−j(uδtωx+vδtωy+δtωt),
where I(ωx, ωy, ωt) is the Fourier transform of I(x, y, t) and ωx, ωy, ωt denote the spatiotemporal
frequencies. Clearly, for this equation to hold, the frequencies must satisfy
uωx + vωy + ωt = 0. (2.4)
This is the basic constraint for frequency-based approaches. It states that all nonzero
energy associated with a translating 2D pattern lies on a plane through the origin in the
frequency space, and the normal of the plane determines the optical flow vector. Frequency-based
approaches are often presented as biological models of human motion sensing. They
can handle cases that are difficult for matching approaches, e.g., the motion of random
dot patterns. But in most cases, they are close to the frequency-domain equivalents of
matching-based and gradient-based methods [10], and extracting the nonzero energy plane
usually involves heavy computation. As a consequence, they are not as popular as the other
two types of approaches.
2.2 Flow Field Coherence
For each pixel, the brightness conservation constraint (Eq. (2.1), (2.3) or (2.4)) provides one
equation in the two unknowns u and v. Additional constraints come from the flow field
coherence assumption, which means neighboring pixels experience consistent motion. Based
on how coherence is imposed, the approaches can be further divided into two major types,
local parametric and global optimization.
Local parametric methods assume that within a certain region the flow field is described
by a parametric model:
u(x) = u(x;p).
Here boldface letters denote column vectors: u = (u, v)T , x = (x, y)T , p is the vector of
model parameters. Common models include the constant model
\[
u(x; p) = \begin{pmatrix} u(x, y) \\ v(x, y) \end{pmatrix} = \begin{pmatrix} p_0 \\ p_1 \end{pmatrix},
\]
which holds at any location as the region size approaches zero; the affine model
\[
u(x; p) = \begin{pmatrix} p_0 + p_1 x + p_2 y \\ p_3 + p_4 x + p_5 y \end{pmatrix},
\]
which approximates the 2D motion of a remote 3D surface; and the quadratic model
\[
u(x; p) = \begin{pmatrix} p_0 + p_1 x + p_2 y + p_6 x^2 + p_7 xy \\ p_3 + p_4 x + p_5 y + p_6 xy + p_7 y^2 \end{pmatrix}, \tag{2.5}
\]
which describes the instantaneous 2D motion of a planar surface undergoing 3D rigid motion
(we will use this model in the airborne visual surveillance application in Chapter 5).
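The three models are straightforward to evaluate in code (a sketch; the parameter ordering follows the equations above):

```python
import numpy as np

def flow_from_params(model, p, x, y):
    """Evaluate the constant, affine or quadratic flow model of
    Eq. 2.5 at coordinates (x, y), with parameters ordered as in
    the text."""
    if model == "constant":
        u = np.full_like(x, p[0], dtype=float)
        v = np.full_like(y, p[1], dtype=float)
    elif model == "affine":
        u = p[0] + p[1] * x + p[2] * y
        v = p[3] + p[4] * x + p[5] * y
    elif model == "quadratic":
        u = p[0] + p[1] * x + p[2] * y + p[6] * x * x + p[7] * x * y
        v = p[3] + p[4] * x + p[5] * y + p[6] * x * y + p[7] * y * y
    else:
        raise ValueError(model)
    return u, v
```

Note that u and v share the parameters p6, p7 in the quadratic model, reflecting its derivation from rigid planar motion rather than an unconstrained second-order polynomial.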
Low-order polynomial flow models gain popularity from their clear physical meanings
and simple computation. But how to select a region appropriate for a given model and
how to choose models suitable for a given region are very complicated problems [130, 25].
The common practice of applying the same model uniformly to all image locations risks
under-fitting, over-fitting and compromises between different models, and usually results
in a flow field of highly uneven accuracy.
Global optimization methods can avoid the region selection problem to a certain extent.
Instead of assuming a rigid model for an entire region, they allow arbitrary local variations as
long as the flow field is smooth (almost) everywhere. Such a global smoothness assumption
usually leads to a regularization type of formulation. A classical technique of this kind is
due to Horn and Schunck [58]. They define the best optical flow field as the one minimizing
the overall OFCE error and local flow variation:
\[
\sum_s \left[ (I_{x_s} u_s + I_{y_s} v_s + I_{t_s})^2 + \lambda \big( (u_s - \bar{u}_s)^2 + (v_s - \bar{v}_s)^2 \big) \right]. \tag{2.6}
\]
Here s is a one-dimensional index of pixel locations (x, y), which traverses all pixel locations
in a progressive scan manner. The first quadratic term in the summation is the OFCE error
at location s; the second term requires minimal deviation between the flow vector (u_s, v_s)
and (\bar{u}_s, \bar{v}_s), the average over its neighbors i ∈ N_s. The constant λ is a tuning parameter which
controls the relative importance of data and flow variation.
Global optimization models deal with the aperture problem more effectively than local
parametric models by propagating flow estimates between different locations, but due to
the propagation, they tend to over-smooth the field. In addition, global models are sensitive
to the choice of the control parameter λ and their computation is more involved.
2.3 Typical Approaches
In principle, any of the above brightness conservation models and flow coherence models
can be paired up to derive a formulation for optical flow estimation. Among all possible
combinations, gradient-based local parametric, gradient-based global optimization and spa-
tiotemporal filtering approaches, especially the first two, have attracted the most attention
because of the good balance between their accuracy and complexity.
Gradient-based local parametrization
Combining gradient-based constraints and low-order polynomial flow models, one usually
arrives at a linear equation in the flow model parameter p:
Ap = b.
Particularly, using first-order constraints and the constant flow model, we have
\[
Au = b, \tag{2.7}
\]
\[
A = \begin{pmatrix} I_{x_1} & I_{y_1} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end{pmatrix}, \qquad
b = -\begin{pmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{pmatrix}.
\]
When A′A is nonsingular, the least-squares (LS) solution to the equation is
u = (A′A)−1A′b. (2.8)
Both sides of Eq. 2.7 can be multiplied by a window function W = diag[W1, . . . , Wn] to
assign heavier weights to certain constraints. The corresponding equation is WAu = Wb.
If the weights are absorbed by A and b: A ← WA,b ← Wb, the same LS solution Eq. 2.8
is obtained. Lucas and Kanade [82] employ an iterative version of the above algorithm
for stereo registration. Since they are probably the first to formalize this approach, the
(weighted) LS fit of local first-order constraints to a constant flow model is usually referred
to as the Lucas and Kanade technique, which we abbreviate as LK. This technique is
reported to be the most efficient and accurate, especially after confidence-based selection
[7].
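In code, the LK step for a single neighborhood reduces to a 2×2 linear solve (a minimal sketch; the conditioning check standing in for confidence-based selection is an illustrative choice):

```python
import numpy as np

def lucas_kanade_point(Ix, Iy, It, W=None):
    """Solve the stacked OFCEs Au = b (Eq. 2.7) in the
    least-squares sense, optionally after weighting both sides
    by a window function W as described above."""
    A = np.column_stack([np.ravel(Ix), np.ravel(Iy)])
    b = -np.ravel(It)
    if W is not None:
        w = np.ravel(W)
        A, b = A * w[:, None], b * w   # absorb W: A <- WA, b <- Wb
    AtA = A.T @ A
    if np.linalg.cond(AtA) > 1e8:      # aperture problem: unreliable
        return None
    return np.linalg.solve(AtA, A.T @ b)
```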
An early technique using second-order constraints is due to Haralick and Lee [53]. They
interpret the OFCE as the intersection line of the isocontour plane with a successive image
frame, calculate image derivatives from the facet model, and solve the first- and second-order
constraints at each pixel by singular value decomposition (SVD) [102].
Gradient-based global optimization
The seminal technique of this category by Horn and Schunck (HS) [58] was introduced in
the last section. They solve the constraint Eq. 2.6 for the flow field by iterative relaxation:
\[
u_s^n = \bar{u}_s^{\,n-1} - \frac{I_{x_s}\left( I_{x_s}\bar{u}_s^{\,n-1} + I_{y_s}\bar{v}_s^{\,n-1} + I_{t_s} \right)}{\lambda + I_{x_s}^2 + I_{y_s}^2}
\]
\[
v_s^n = \bar{v}_s^{\,n-1} - \frac{I_{y_s}\left( I_{x_s}\bar{u}_s^{\,n-1} + I_{y_s}\bar{v}_s^{\,n-1} + I_{t_s} \right)}{\lambda + I_{x_s}^2 + I_{y_s}^2}
\]
where n denotes the iteration number, (u0, v0) denote initial flow estimates (set to zero),
and λ is chosen empirically. Typically, flow fields obtained from this technique are visually
pleasing because of the smoothness, but their quantitative accuracy is not as good as that of
local gradient-based methods [7] due to over-smoothing and slow convergence of the relaxation
process.
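The relaxation above can be sketched as follows (a minimal illustration using a 4-neighbor average and Jacobi-style updates; the value of λ and the iteration count are illustrative):

```python
import numpy as np

def neighbor_average(f):
    """4-neighbor average with replicated borders."""
    g = np.pad(f, 1, mode="edge")
    return 0.25 * (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:])

def horn_schunck(Ix, Iy, It, lam=1.0, n_iter=100):
    """Iterate the pixelwise Horn-Schunck update: subtract the
    data-term correction from the neighbor average of the
    current flow field."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    den = lam + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        ubar, vbar = neighbor_average(u), neighbor_average(v)
        num = Ix * ubar + Iy * vbar + It
        u = ubar - Ix * num / den
        v = vbar - Iy * num / den
    return u, v
```

The slow convergence mentioned above is visible in the update itself: information spreads only one neighbor-average step per iteration, so propagating flow across a large untextured region requires many iterations.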
Spatiotemporal filtering
Movements in the spatiotemporal image volume, formed by stacking images in a se-
quence, induce structures with certain orientations. For example, the trace of a translating
point is a line whose direction in the volume directly corresponds to its velocity. Different
methods were proposed to extract the orientations including inertia tensor [66], hyperge-
ometric filters [140] and orientation tensors [35]. Since determining 2D velocity in the
frequency domain amounts to finding a nonzero energy plane (Section 2.1), the filtering
approach is also adopted by frequency-based methods [56].
A recent filtering method with good reported accuracy is due to Farneback [35]. He fits
data in an image neighborhood to a quadratic polynomial model I(x) = xT Ax + bTx + c,
derives an orientation tensor from the model parameters T = AAT + ηbbT , and finds the
flow vector by minimizing vT Tv. Here we temporarily adopt his notation x = (x, y, t)T ,
v = (u, v, 1)T /|(u, v, 1)T | for convenience of presentation.
It is not hard to see that this method closely resembles local gradient-based approaches:
tensor construction is equivalent to derivative (first- and second-order) calculation; solving
the homogeneous linear equation in the augmented flow vector vT Tv = 0 is equivalent
to solving the linear equation in the original vector (u, v)T Eq. 2.7. The efficiency of this
technique is mainly enabled by the intermediate step of tensor construction. Without the
intermediate step, or if a filter bank is used instead [56, 38], the computation can become
cumbersome and only discrete estimates can be obtained. This contrast is also similar to
that between gradient-based and matching-based approaches. The equivalence between
spatiotemporal filtering/frequency-domain approaches and certain matching-based/gradient-based
methods was pointed out previously [118, 7].
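For concreteness, minimizing vT Tv over v = (u, v, 1)T reduces to a 2×2 linear solve, as the following sketch shows (an illustration; the tensor here is built directly from gradient outer products rather than from Farneback's polynomial-fit construction):

```python
import numpy as np

def flow_from_tensor(T):
    """Given a 3x3 orientation tensor T, find (u, v) minimizing
    v'Tv with v = (u, v, 1)': setting the gradient with respect
    to (u, v) to zero gives (u, v)' = -T[:2,:2]^-1 T[:2,2]."""
    return -np.linalg.solve(T[:2, :2], T[:2, 2])
```

This is the step that makes the tensor method a close cousin of the local gradient approach: the upper-left 2×2 block of T plays the role of A'A in Eq. 2.7.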
Others
Block matching (SSD) methods can be used to find a pixel-accuracy displacement, and
a quadratic surface fitting of the neighboring matching errors can produce an estimate of
subpixel accuracy [126]. The techniques of Anandan [3] and Singh [120] initialize the flow
field using this method, and then employ some global smoothness constraints to propagate
flow from places of higher confidence to places of lower confidence. Matching-based ap-
proaches have better large motion handling capability than gradient-based approaches. But
the computational difficulties and poor subpixel accuracy make them less competitive in
optical flow estimation. For similar reasons, matching-based global optimization schemes
were attempted with very limited success [88, 14], and frequency-based global optimization
approaches have almost never been explored.
2.4 Robust Methods
Most early techniques, as described in the above three sections, require brightness
conservation and flow smoothness to be satisfied everywhere in the flow field. These restrictive
assumptions make them break down easily in reality, where model violations are abundant.
An obvious source of violation is motion discontinuity. Imposing flow smoothness in a region
containing multiple motion modes results in compromise between these modes and smeared
flow estimates. Such failure is not only detrimental to optical flow accuracy but also
obscures important geometric or physical properties of the scene. Violations of the brightness
constancy assumption also occur commonly in natural scenes. Conditions such as specular
reflections, shadows and illumination variations induce non-motion brightness changes. In
cases of transparency, due to the interaction of translucent reflective surfaces, the image
intensity of a single pixel can be a composite of multiple 3D points’ brightness values [14].
Examples include looking into a running creek and watching through a pane of glass. Ap-
plying simple brightness matching criteria in these situations does not produce meaningful
motion estimates. As the above limitations of traditional techniques [7] are widely recog-
nized, a large number of recent efforts have been devoted to increasing robustness against
assumption violations, especially to allowing motion discontinuities.
Explicit segmentation
Assuming motion boundaries coincide with intensity discontinuities and the former are
subsets of the latter, a number of researchers [17, 129] first segment the visual field using
image intensity, then compute parametric (e.g. affine) motion in each segment, and finally
group neighboring segments into regions of coherent motion. Such approaches experience
two problems. First, accurate image segmentation is itself very difficult. Second, the as-
sumed relationship between motion and intensity discontinuities is not necessarily correct.
Motion estimation and motion-based segmentation form a chicken-and-egg dilemma: the
motion estimator needs to know where motion boundaries are in order to avoid smooth-
ing across them, whereas the motion-based segmenter requires an accurate motion field in
order to divide the scene into regions of consistent motion. In an attempt to circumvent
this problem, motion estimation and segmentation have been carried out simultaneously.
The generic approach can be described as finding a segmentation of the flow field and the
motion (parameters) in each segment that minimizes the difference between the observed
and predicted image data [151]. Actual techniques differ by the employed flow models,
optimization criteria and solution methods.
Kanade and Okutomi [72] develop an adaptive window technique that adjusts the rect-
angular window size to minimize the uncertainty in the estimate. Schweitzer [114] devises
a recursive algorithm to split the motion field into square patches according to the minimal
encoding length criterion. These methods use rectangular division of the flow field and can-
not adapt to irregular motion boundaries. Wang and Adelson [134] assume that an image
region is modeled by a set of overlapping layers which can be irregularly shaped or even
transparent. They compute initial motion estimates using a least-squares approach within
image patches, then use K-means clustering to group motion estimates into regions of con-
sistent affine motion. Jepson and Black [68] use a probabilistic mixture model to explicitly
represent multiple motions within a patch, and use the EM algorithm to estimate parame-
ters for the fixed number of layers. Darrell and Pentland [33] and Sawhney and Ayer [109]
automatically determine the number of layers under a minimum description length (MDL)
encoding principle, which regards the most compact interpretation as the best among all
possibilities [114].
The explicit segmentation approach usually involves modeling the visual field as a col-
lection of (rigid) objects of certain parametric motion. Appropriately choosing the motion
models and the number of objects, especially in a dynamic situation, is very difficult [25, 130]
and can be impossible when nonrigid motions such as human movement and facial
expression are present. Furthermore, due to the extremely high dimension of the problem, how
to efficiently solve the associated numerical optimization problems remains a challenging
issue. In general, iterative methods are used, in which each updating step consists of
sequential estimation and segmentation of the motion field. The initial guess is also given by
a sequential method and its quality is crucial for convergence. For the above reasons, the
explicit segmentation approach is not suitable for general optical flow estimation and is not
pursued in this thesis.
Outlier-suppressed regression
A major cause of the failure of traditional gradient-based local parametric techniques is
the use of least-squares regression, which finds a compromise among all constraints and can
break down even in the presence of a single model outlier. To repair this problem, various
mechanisms have been attempted to reject outliers and fit the structure of the majority of
constraints.
It is sometimes possible to detect outliers by examining the residual of the least-squares
solution. After obtaining an initial least-squares estimate, Irani et al. [64] iteratively remove
outliers and recompute the least-squares solution. This process is still least-squares in
essence; it is sensitive to the initial quadratic estimate which may be arbitrarily bad. A
number of researchers (e.g. Fennema and Thompson [37], Schunck [113], Nesi et al. [94])
investigate robust clustering [70] based on the Hough transform. Such approaches have
better outlier resistance but are computationally very expensive. More success is achieved
by employing robust estimators, particularly, M-estimators [15, 109] and high-breakdown
robust estimators [5, 97, 145] in local optical flow constraint fitting. These estimators will
be formally introduced and compared in Section 3.1.1. Among these methods, the one
reporting the best accuracy is to first identify and reject outliers using high-breakdown
criteria and then estimate parameters from the remaining constraints [5, 97, 145].
The computational burden of high-breakdown robust estimators increases with the
amount of outlier contamination. Applying the same algorithm uniformly to the entire
flow field incurs excessive computation, since most places contain few outliers. We tackle
the efficiency problem with an adaptive algorithm (Section 3.2). Also, the limitations of
gradient-based and local regression approaches (Section 2.1, 2.2) remain regardless of the
regression technique. We will propose a matching-based global optimization formulation to
overcome such limitations (Section 4).
Discontinuity-preserving regularization
A significant amount of attention has been paid to reformulating the regularization prob-
lem to alleviate over-smoothing. Nagel and Enkelmann [91] suggest an oriented-smoothness
constraint in which smoothness is not imposed across steep intensity gradients (edges).
Their formulation differs from HS’s (Eq. 2.6) in that the terms (us, vs) are augmented by
functionals of local flow derivatives and first- and second-order image derivatives. Despite
the added complexity, this method yields similar experimental results to HS [7], which is not
surprising. On one hand, image discontinuities and flow discontinuities do not necessarily
overlap; reducing smoothing wherever the image gradient is large hurts flow propagation from area to area. On the other hand, in the vicinity of flow discontinuities where smoothing
needs to be stopped, image derivatives are of poor precision and do not serve as a reliable
indicator of occlusion.
Following Geman and Geman’s work on stochastic image restoration [42], Markov ran-
dom fields (MRF) formulations [88, 77, 57, 14] have become an important class of techniques
for coping with spatial discontinuities in optical flow estimation. An MRF is a distribution
of a random field in which the probability of a site having a particular value depends on its
neighbors’ values. The distribution of a piecewise-smooth field can be modeled by a dual
pair of MRFs, one representing the observed field values and the other representing the un-
observed discontinuities (line process), and then the best interpretation of the field can be
found as the one maximizing the a posteriori (MAP) probability. Utilizing the equivalence of
the MRF and the Gibbs distribution, the MAP formulation reduces to minimizing a regu-
larization energy, which is often solved by stochastic relaxation. Blake and Zisserman show
that similar formulations can be obtained by modeling piecewise smoothness using weak
continuity [20], and they tackle the optimization problem using a graduated non-convexity
(GNC) strategy. Their formulation is more compact with the elimination of the line process,
and their optimization strategy is more effective in practice than stochastic relaxation.
Shulman and Herve [116] first point out that spatial discontinuities can be treated as
outliers and they propose an approach based on Huber’s minimax estimator. This choice of
estimator leads to a convex optimization problem which is relatively easy to solve. Black
and Anandan propose a robust framework in which both brightness and flow smoothness
terms are modeled with robust estimators. They use redescending estimators which sup-
press outliers more effectively than convex estimators, and solve the optimization problem
by hierarchical continuation [15, 20]. Sim and Park adopt high-breakdown robust estimators
to achieve even more effective outlier rejection [117] than commonly adopted M-estimators
[116, 15, 86]. Black and Rangarajan [18] unify the line process and robust statistics per-
spectives and suggest the approach can benefit problem formulation and solution.
It is important to point out that, in refining optical flow formulations, computational
complexity increases rapidly with model sophistication. This is especially true for global
methods which usually involve large-scale nonconvex optimization problems. There are two
approaches to global optimization: stochastic and deterministic. Stochastic methods such
as simulated annealing [42] make updates probabilistically to avoid poor local minima and use a
temperature parameter to gradually dampen the randomness. They converge too slowly to
be practically useful [88, 77, 14, 15, 23]. Deterministic methods such as continuation [15]
and multigrid [86] assume a good initial flow estimate is available and make greedy updates
towards a local minimum. The procedure can be multi-stage, resembling the annealing
schedule. These methods have achieved more success in practice, but they have a limited
capability for avoiding local minima and their performance depends on the initialization
quality. Since global optimization is widely recognized as a powerful formulation technique
for inverse problems, and computing technology looks promising for solving the associated
numerical problems, developing global optimization algorithms has become a very hot topic
in computer vision [23, 137, 111].
Brightness conservation violations
Phenomena violating the brightness constancy assumption have only been studied to
a limited extent [10]. Transparency can be modeled by layered/mixture representations
[12, 134, 17, 33], which assign to each pixel a set of ownership weights indicating how
different surface layers contribute to the observed pixel brightness. Bergen et al. [12]
first consider the problem of extracting two motions, induced by either transparency or
motion discontinuity, from three image frames. They use an iterative algorithm to estimate
one motion, perform a nulling operation to remove the intensity pattern giving rise to the
motion, and then solve for the second motion. Variable illumination can be accommodated
by deriving more complex brightness conservation models such as the linear model (see
[116] for one example), or matching less illumination-sensitive image features such as phase
[38]. When violations comprise only a small fraction of the observations in an area, they can
be treated as outliers in a robust estimation framework [15]. Considering that most real
objects are opaque and global illumination variation is usually negligible during a small
interval, we adopt a robust estimation framework in our study.
2.5 Error Analysis
Despite steady progress on robust visual motion analysis, accurate optical flow estimates
are generally inaccessible. One reason is that, in making necessary assumptions to turn the
estimation problem into a well-posed problem, errors are inevitably introduced by assumption violations. Even under the (unrealistic) condition that no violations are encountered, the
estimate can have a large uncertainty due to the aperture problem [58]—brightness vari-
ation can be insufficient for uniquely determining the 2D velocity (Figure 2.1, also 1.2).
The aperture problem shows the intrinsic ambiguity in visual motion perception: optical
flow only approximates the projected image motion. In its most severe form, i.e., when the
image is completely textureless, recovering the projected motion is impossible; more gener-
ally, optical flow estimates in regions of more appropriate texture have higher confidence.
The sensitivity to assumption violations and the aperture problem varies from technique to
technique and from place to place in a visual field, and so does the uncertainty of the esti-
mated optical flow. If subsequent applications are to make judicious use of such a flow field
estimate, they must be equipped with certain error measurements indicating the uneven
reliability [52].
Figure 2.1: Aperture problem: local information in an aperture might be insufficient to determine the 2D motion vector. Each circle is an aperture. (1) corner: reliable estimate; (2) boundary: normal flow only; (3) homogeneous region: ambiguous; (4) highly textured region: multiple solutions (aliasing).
Extracting 2D velocity at a pixel requires exploiting a spatiotemporal image neighbor-
hood of that pixel. This fact introduces correlation between errors in nearby flow estimates.
Accounting for such error correlation, especially in a global formulation, is a daunting task
for both optical flow error analysis and subsequent applications; therefore it has seldom
been tackled. Most previous efforts seek to provide an error measure with each individual
estimate by analyzing error behaviors of local methods. Barron et al. compare and modify
a number of one-dimensional confidence measures and use them to select reliable optical
flow estimates [7, 10]. Since errors in optical flow estimates are in general directional and
anisotropic, a two-dimensional confidence measure, particularly the covariance matrix, is
more appropriate and informative.
Performance analysis in computer vision is often carried out with covariance propagation
[36, 52]. Haralick illustrates the derivation and application of covariance propagation theory
for a wide variety of vision algorithms [52]. A trivial case is to propagate additive random
perturbations through a linear system y = Tx with input x and output y, in which the
output covariance Σy can be expressed in terms of the input covariance Σx as Σy = TΣxT ′.
The solution to the constraint equation Eq. 2.7 under the least-squares criterion is optimal when the only error source is additive iid noise in b (the temporal derivatives) with zero mean and variance σ_b². Under this assumption, the above conclusion applies and the covariance of the optical flow estimate is simply

Σ_u = σ_b² (A′A)⁻¹     (2.9)

where σ_b² can be estimated from the residual errors r_i = I_{xi} u + I_{yi} v + I_{ti} as

σ_b² = (1/(n − 2)) Σ_{i=1}^{n} r_i².
The error analysis on a local matching-based method by Szeliski [126] and that on a local
spatiotemporal filtering method by Heeger [56] make similar assumptions and obtain similar
results.
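For concreteness, the covariance computation of Eq. 2.9 can be sketched in a few lines of NumPy (an illustrative aside; the function name and interface are ours, not part of the methods developed in this thesis):

```python
import numpy as np

def flow_covariance(A, b):
    """Least-squares flow estimate and its covariance under the Eq. 2.9
    model: b = A u + noise, with iid zero-mean noise (variance sigma_b^2)
    in b only, and A an n x 2 matrix of stacked constraints (n > 2)."""
    n = A.shape[0]
    AtA = A.T @ A
    u = np.linalg.solve(AtA, A.T @ b)          # LS velocity estimate
    r = A @ u - b                              # residual errors r_i
    sigma2_b = (r @ r) / (n - 2)               # noise variance estimate
    return u, sigma2_b * np.linalg.inv(AtA)    # Sigma_u = sigma_b^2 (A'A)^-1
```

The returned 2 × 2 matrix is exactly the directional, anisotropic confidence measure argued for above.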
The assumptions enabling the above derivation are apparently unrealistic because (i)
spatial derivatives in A also contain noise, and (ii) errors in derivatives are correlated due
to the overlapping data supports for their computation. Ignoring these factors makes the
velocity and covariance estimates biased. Efforts have been made to calculate unbiased
estimates using generalized least-squares [90, 96, 95]. However, these methods bring little
accuracy improvement at the cost of much heavier computation, because bias is a much
weaker error source than variance [27] and outliers [15] in optical flow estimation. More
details of the related work will be given in Section 3.3 to facilitate comparison with our
methods.
In our earlier work [141], we have conducted an error analysis for the least-squares
based local estimation method using the covariance propagation theory for approximate
linear systems and small errors. In this thesis, we generalize the results to the newer robust
method. Our analysis estimates image noise and derivative errors in an adaptive fashion,
taking into account correlation of derivative errors at adjacent positions. It is more complete,
systematic and reliable than previous efforts.
2.6 Hierarchical Processing
Recall that gradient-based constraints are valid only for small image motion; in practice
this typically means below 2 pixels/frame. While matching-based and frequency-based
formulations may cope with larger motion, the computational burden and chances of false
matches (aliasing) increase rapidly with the search range. A general way of circumventing
the large motion and aliasing problems is to adopt a hierarchical, coarse-to-fine strategy
[12, 14].
The basic idea of hierarchical processing is to construct a pyramid representation [26] of
an image sequence in which higher levels of the pyramid contain filtered and sub-sampled
versions of the original images. Going up the pyramid, the image resolution decreases and
the motion magnitude reduces proportionally. When a certain level of reduction is reached,
the motion becomes small enough for estimation. Computation then proceeds in a top-down fashion: at each level, the incremental flow is estimated and added to the initial value, and the total is projected down to the next lower level as its initial value. This process continues until the
flow in the original images is recovered. In what follows, we describe an implementation of
the hierarchical process that is used in our algorithms. Much of the recipe is adapted from
[14].
• Gaussian pyramid construction. We create a P-level image pyramid I^p, p = 0, . . . , P−1, with I^0 the original sequence. Each upper-level image sequence I^p is a smoothed and sub-sampled version of the sequence I^{p−1} one level below, expressed as

I^p(x/2, y/2) = (f ∗ I^{p−1})(x, y), ∀ x, y at level p−1

where f is a 3 × 3 Gaussian filter, “∗” represents convolution, and the resolution reduction rate is 2, which means each upper-level image is one-fourth the size of its ancestor.
• Flow projection with interpolation. Once the optical flow field V is available at level p, it is projected down to level p−1. The simplest projection scheme is “projection with duplication”: V^{p−1}(x, y) = 2V^p(⌊x/2⌋, ⌊y/2⌋), ∀ x, y at level p−1. To reduce the blocky effect, we use “projection with interpolation”:

u^{p−1}(2x, 2y) = 2u^p(x, y), ∀ x, y at level p,

u^{p−1}(x, y) = (1/4)[u^{p−1}(x−1, y−1) + u^{p−1}(x−1, y+1) + u^{p−1}(x+1, y−1) + u^{p−1}(x+1, y+1)], ∀ other x, y at level p−1.
construct image pyramid I^p, p = 0, . . . , P−1;
I_w^{P−1} ← I^{P−1};
for (p : P−1 → 0) {
    estimate residual flow Δu^p;
    current total flow: u^p ← u^p + Δu^p;
    stop if (p = 0);
    project flow u^p to level p−1, yielding u^{p−1};
    warp I^{p−1}, yielding I_w^{p−1};
}

Figure 2.2: Hierarchical processing
• Image warping. Given the flow field u^p that explains the motion from image I^p(t) to image I^p(t + dt), the image I^p(t + dt) can be warped to remove (compensate for) the motion such that the two images are almost aligned. Using “backward warping”, the stabilized version of I^p(t + dt) is defined as

I_w^p(x, y, t + dt) = I^p(x + u(x, y), y + v(x, y), t + dt).

Since (x + u(x, y), y + v(x, y)) usually does not fall on a regular grid, we use bilinear interpolation to estimate its intensity value. The warped images I_w^p(t), I_w^p(t + dt) exhibit only the residual motion Δu^p.

• Motion estimation. The residual motion is estimated from the warped sequence: Δu^p ← I_w^p(t), I_w^p(t + dt), and the overall motion at level p is the sum of the projected and residual motion: u^p ← u^p + Δu^p.
At the top level, the initial (projected) motion is assumed to be zero and the warped sequence is the same as the pyramid sequence: I_w^{P−1} ← I^{P−1}. Finally, the procedure of the hierarchical, coarse-to-fine framework is given in Figure 2.2.
Hierarchical schemes like the above have been used in a wide variety of motion estimation
algorithms but their limitations [9] are often overlooked: (i) the blind projection and warping
operations may extrapolate and interpolate across motion boundaries; (ii) in the top-down
fashion, errors produced in coarser levels are magnified and propagated to finer levels and
are generally irreversible [14]. Solving the first problem again brings up the estimation-
segmentation dilemma. To correct errors in coarser levels, certain multi-resolution schemes
are needed which can propagate results in a bottom-up fashion, too [102]. Since each
additional level of pyramid introduces new sources of errors, the number of levels should be
large enough to allow incremental flow estimation but no larger. Appropriately choosing
the number of levels is a difficult problem that has been addressed only to a very limited
extent [9]. Most current techniques including ours determine the number empirically.
Chapter 3
LOCAL FLOW ESTIMATION AND ERROR ANALYSIS
This chapter considers the problem of finding the most representative translation within
a small spatiotemporal image neighborhood and presents new algorithms to address the
involved accuracy, efficiency and uncertainty measuring issues. In particular, (i) the popular
local gradient-based approach is reformulated as a two-stage regression problem, appropriate
robust estimators are identified for both stages, and an adaptive scheme is introduced to
derivative evaluation to obtain sharp motion boundaries; (ii) a deterministic algorithm for
high-breakdown robust regression in visual reconstruction is proposed, and its effectiveness
is demonstrated at the optical flow constraint solving stage; and (iii) error analysis is carried out by covariance propagation; it accounts for spatially varying image noise and derivative errors and for correlation of derivative errors at adjacent positions, and provides a reliable
measure of the estimation uncertainty. This chapter is composed of three sections dedicated
to the above three topics respectively. Experimental results on both synthetic and real data
are given in each individual section.
3.1 A Two-Stage-Robust Adaptive Technique
The gradient-based local regression approach to optical flow estimation has become very
popular because of its good overall accuracy and efficiency. Despite various formulations,
methods of this type are generally composed of two stages: derivative estimation and optical flow constraint (OFC) solving. Both stages involve optimization by pooling information in a certain neighborhood and are, in nature, regression procedures. Classical techniques
solve both regression problems in a Least-Squares (LS) sense [7]. In places where the mo-
tion is multi-modal, their results can be arbitrarily bad. To cope with this problem, a few
robust regression tools such as M-Estimators [15, 86] and Least Median of Squares (LMedS)
estimators [5, 97] have been introduced to the OFC stage. By carefully analyzing the charac-
teristics of the optical flow constraints and comparing strengths and weaknesses of different
robust regression tools [138, 106, 105, 104], we identify the Least Trimmed Squares (LTS)
technique as more appropriate for the OFC stage.
Meanwhile, as a very similar information pooling step, derivative calculation has seldom
received proper attention in optical flow estimation. Crude (least-squares-based) estimators
are widely used with the hope that the derivative estimation error can be averaged out or
treated as outliers in the OFC regression stage. However, as illustrated in Figure 3.5, near
motion boundaries, derivative evaluation can completely fail and most of the constraints
become outliers; in such a situation, no matter what robust tool is employed, OFC regression
breaks down and motion boundaries cannot be preserved. Pointing out this limitation, we
use a 3D facet model to formulate derivative estimation as an explicit regression problem,
which can be robustified when the LS technique fails. We choose an LTS estimator for
robust facet model fitting. LTS is costly and it may yield less accurate estimates where
there are no outliers and LS suffices. Therefore, it should be applied only when necessary. We calculate a confidence measure for each estimate from the LTS OFC step,
and update the derivatives and the flow vector if the measure takes a small value. In this
way the one-stage and two-stage robust methods are carried out adaptively. Preliminary
experimental results show that this adaptive LTS scheme permits correct flow recovery even in the immediate vicinity of motion boundaries.
Below we provide details of the two-stage-robust adaptive scheme. We will start by introducing robust regression, which is the backbone of the proposed method and will be
extensively exploited in the rest of the thesis.
3.1.1 Linear Regression and Robustness
A linear regression model relates the output of a system y_i, i = 1, . . . , n to its m-dimensional input x_i = (x_{i1}, . . . , x_{im})^T by a linear transform with an additive noise term ξ_i, i.e.,

y_i = x_{i1}θ_1 + x_{i2}θ_2 + · · · + x_{im}θ_m + ξ_i,

or in a more compact form,

y_{n×1} = X_{n×m} θ_{m×1} + ξ_{n×1}.     (3.1)
With sufficient data points (X; y) collected (n ≫ m), the model parameters θ can be estimated by minimizing a scalar criterion function F(r):

θ̂ = argmin_θ F(r),

where r is the residual fitting error

r = y − ŷ = y − Xθ.
The criterion function F (r) differs among estimators depending on what error models are
assumed.
Least-squares estimator
The least-squares estimator uses a quadratic error function
F(r) = ‖r‖² = Σ_{i=1}^{n} r_i²

and has a closed-form solution

θ̂ = (X^T X)⁻¹ X^T y.
It is optimal only if X is error-free and ξi is iid Gaussian with zero mean and variance
σ2. When either condition has a significant violation, the least-squares estimate can be
completely disrupted.
There are two major types of significant model violations, or gross errors: those caused
by bad y values are called y-outliers and those caused by error in X are leverage points.
The performance of a regression estimator is usually characterized by its statistical efficiency
and breakdown point. Simply put, statistical efficiency indicates the accuracy (in terms of
estimate variance) when no gross error is present, and breakdown point is the smallest
fraction of contamination that can cause the estimator to take on values arbitrarily far from
the truth. These two factors are usually against each other. A good regression tool should
have both factors high. The reason for the poor accuracy of least-squares in many situations
is its 0% breakdown point, which means that a single outlier can lead to arbitrarily wrong
estimates. The goal of robust regression is thus to develop regression tools that are relatively
insensitive to gross errors while maintaining sufficiently high statistical efficiency [106, 138].
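A tiny numerical experiment (ours, on synthetic line-fitting data) makes the 0% breakdown point of least squares concrete: one corrupted observation moves the LS fit arbitrarily far, while a median-based fit is unaffected:

```python
import numpy as np

# Fit y = theta * x by least squares on clean data, then corrupt one point.
x = np.linspace(1.0, 10.0, 20)
y = 2.0 * x                                   # true slope: theta = 2
theta_clean = (x @ y) / (x @ x)               # LS slope on clean data: 2

y_bad = y.copy()
y_bad[0] += 1e6                               # a single gross y-outlier
theta_bad = (x @ y_bad) / (x @ x)             # LS slope is dragged far away

# A median-based fit (50% breakdown) is unaffected by the single outlier.
theta_med = np.median(y_bad / x)
```

Here `theta_bad` grows without bound as the corruption grows, which is precisely the sense in which the LS breakdown point is 0%.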
M-estimators
An M-estimator uses the criterion function
F(r) = Σ_{i=1}^{n} ρ(r_i, σ_i)
where the σi are scale parameters for the ρ-function. It includes the least-squares estimator
as a special case with ρ being an L2 (quadratic) error norm. The impact of each datum
on the overall solution is measured by the influence function: ψ(x, σ) = ∂ρ(x, σ)/∂x. The
least-squares estimator has ψLS(x, σ) = 2x/σ2, which allows an outlier to introduce infinite
bias to the estimate. One way to reduce outlier influence is to adopt a less drastic error
norm, e.g., the Geman-McClure error norm
ρ_GM(x, σ) = x²/(x² + σ²)

[43]. Its ρ and ψ curves are compared to those of the L2 norm in Figure 3.1. The Geman-
McClure error norm saturates at 1 as the error increases. Its ψ function is bounded and
redescending—the influence of small errors is almost linear while that of abnormally large
ones tends to zero. Finding an M-estimate is a nonlinear minimization problem. It is usually
solved by iterated reweighted least-squares
θ^{(k)} = argmin_θ Σ_{i=1}^{n} w(r_i^{(k−1)}) r_i²
where the superscript (k) designates the iteration number, the weight function is defined by
w(x) = ψ(x)/x, and ri is the residual evaluated with the current estimate.
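As an illustration (our sketch, not the implementation used in this thesis), IRLS with the Geman-McClure weight w(x) = 2σ²/(x² + σ²)² can be written as:

```python
import numpy as np

def gm_weight(r, sigma):
    """w(x) = psi(x)/x for the Geman-McClure norm rho(x) = x^2/(x^2 + sigma^2)."""
    return 2.0 * sigma**2 / (r**2 + sigma**2)**2

def irls(X, y, sigma=1.0, iters=50):
    """M-estimate of theta in y = X theta + xi by iterated reweighted
    least squares, initialized at the ordinary LS solution."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        w = gm_weight(y - X @ theta, sigma)    # weights from current residuals
        Xw = X.T * w                           # residual-weighted X^T
        theta = np.linalg.solve(Xw @ X, Xw @ y)
    return theta
```

Because the ρ-function is non-convex, the result depends on the initial guess; this is exactly the initial-guess sensitivity discussed next.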
M-estimators are resistant to y-outliers and have relatively high statistical efficiency,
but they meet with computational difficulties such as initial guess dependency and non-convexity (for redescending estimators), have a low breakdown point (about 1/(m + 1)), and are
vulnerable to leverage points [106, 138].
High-breakdown robust estimators
Two popular high-breakdown robust estimators are the least-median-of-squares (LMedS)
estimator and the least-trimmed-squares (LTS) estimator [106]. The LMedS estimator
θ̂ = argmin_θ med_{i=1}^{n} r_i²     (3.2)
Figure 3.1: Comparison of the Geman-McClure norm (solid line) and the L2 norm (dashed line): (a) error norms ρ(x, σ) with σ = 1, (b) influence functions ψ(x, σ) = ρ′(x, σ).
overcomes most limitations of M-estimators: it is resistant to both types of gross errors,
has a breakdown point as high as 50%, does not need an initial guess, and is guaranteed
to converge. However it has extremely low statistical efficiency, which means that it tends
to have very large estimation variances when no gross error is present. The LTS estimator
was introduced to repair the low efficiency of LMedS. It is defined as
θ̂ = argmin_θ Σ_{i=1}^{h} (r²)_{i:n}     (3.3)
where h < n and (r²)_{1:n} ≤ · · · ≤ (r²)_{n:n} are the ordered squared residuals. LTS allows the
fit to stay away from the gross errors by excluding the largest squared residuals from the
summation. Possessing almost all the merits of LMedS and better statistical efficiency, LTS is considered preferable to LMedS [104, 103].
High-breakdown estimators usually do not have closed-form solutions and are approximated by Monte Carlo-like algorithms [106]. A trial solution pool is constructed by p random draws from the C_n^m possible m-subsets, each draw yielding an exact solution and a corresponding criterion value; the one with the minimum value is picked as the solution. The value p is chosen so that the probability of having at least one good (outlier-free) subset,

1 − (1 − (1 − ε)^m)^p,     (3.4)
where ε is the fraction of outliers (up to 50%), is close to 1. The randomness in the solution
is obvious especially when p is chosen small. A subsequent weighted least-squares (WLS) is
recommended to enhance the statistical efficiency. In particular, a preliminary error scale is
defined as σ = C√F(r), where C makes σ roughly unbiased under a Gaussian error distribution [105]; then regression outliers with |r_i/σ| > 2.5 are removed. Finally a WLS estimate
is calculated from inliers as
θ̂ = argmin_θ Σ_{i=1}^{n} w_i r_i²     (3.5)
and a more efficient scale estimate is given by the sample variance of inliers.
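The random-sampling recipe, including the refinement step, might be sketched as follows (our illustration; the scale constant C is omitted, so the preliminary scale is only a crude stand-in, and the final refit uses 0/1 weights in Eq. 3.5):

```python
import numpy as np

def lts_fit(X, y, h=None, trials=200, seed=0):
    """Approximate least-trimmed-squares (Eq. 3.3) by random m-subset
    sampling, followed by outlier rejection and a least-squares refit.
    'trials' plays the role of p in Eq. 3.4."""
    n, m = X.shape
    h = h or n // 2 + 1
    rng = np.random.default_rng(seed)
    best, best_q = None, np.inf
    for _ in range(trials):
        idx = rng.choice(n, size=m, replace=False)
        try:
            theta = np.linalg.solve(X[idx], y[idx])   # exact fit to an m-subset
        except np.linalg.LinAlgError:
            continue                                  # degenerate draw
        q = np.sort((y - X @ theta) ** 2)[:h].sum()   # trimmed criterion
        if q < best_q:
            best, best_q = theta, q
    # refinement: drop constraints with |r/sigma| > 2.5, refit on inliers
    sigma = np.sqrt(best_q / h) + 1e-12               # crude scale (C omitted)
    inliers = np.abs(y - X @ best) <= 2.5 * sigma
    return np.linalg.lstsq(X[inliers], y[inliers], rcond=None)[0]
```

With 30% gross contamination the trimmed criterion still selects an m-subset of inliers with high probability, and the refit restores statistical efficiency.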
According to the above recipe, LTS takes slightly longer time to compute than LMedS
since finding the smallest n/2 numbers is more costly than finding the median of n numbers.
A new algorithm for approximating LTS, the so-called FAST-LTS, has been introduced recently,
which runs faster than all programs for LMedS and makes LTS the preferred choice of high-
breakdown robust estimator. What enables FAST-LTS is the concentration property of
LTS: starting from any approximate LTS estimate θold and its associated criterion value
Qold, it is possible to compute another approximation θnew yielding an even lower criterion
value Qnew [104]. In algorithmic terms, the C-step can be described as follows.
Given the h-subset Hold then:
• compute θold ← LS estimate from Hold
• compute the residuals rold(i) for i = 1, . . . , n
• sort the absolute values of these residuals, which yields a permutation π for which
|r_old(π(1))| ≤ |r_old(π(2))| ≤ . . . ≤ |r_old(π(n))|
• put Hnew ← {π(1), π(2), . . . , π(h)}
• compute θnew ← LS estimate from Hnew
The C-step can iterate until convergence. It speeds up LTS computation by providing
a more efficient way of selecting trial solutions than random sampling.
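A minimal rendering of the C-step and its use in a FAST-LTS-style search (our sketch; the published algorithm starts its C-steps from m-subsets and adds further refinements that we omit here):

```python
import numpy as np

def c_step(X, y, H):
    """One concentration step: LS fit on the current h-subset H, then keep
    the h observations with the smallest absolute residuals."""
    theta = np.linalg.lstsq(X[H], y[H], rcond=None)[0]
    H_new = np.argsort(np.abs(y - X @ theta))[:len(H)]
    return theta, H_new

def fast_lts(X, y, h, starts=20, seed=0):
    """Iterate C-steps from random h-subsets; the trimmed criterion never
    increases along C-steps, so each start converges in a few iterations."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    best, best_q = None, np.inf
    for _ in range(starts):
        H = rng.choice(n, size=h, replace=False)
        for _ in range(50):
            theta, H_new = c_step(X, y, H)
            if set(H_new) == set(H):                  # h-subset is stable
                break
            H = H_new
        q = np.sort((y - X @ theta) ** 2)[:h].sum()
        if q < best_q:
            best, best_q = theta, q
    return best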
Estimator | Criterion F(r)               | Statistical Efficiency | Breakdown Point | Y-Outliers | Leverage Points | Solution Technique
LS        | Σ_{i=1}^{n} r_i²             | High                   | 0%              | No         | No              | Closed-form
M         | Σ_{i=1}^{n} ρ(r_i)           | High                   | 100/(1+m)%      | Yes        | No              | Approximate
LMedS     | med_{i=1}^{n} r_i²           | Low                    | 50%             | Yes        | Yes             | Approximate
LTS       | Σ_{i=1}^{n/2} (r²)_{i:n} (a) | Low                    | 50%             | Yes        | Yes             | Approximate

(a) (r²)_{1:n} ≤ · · · ≤ (r²)_{n:n}: ordered squared residuals

Table 3.1: Comparison of four popular regression criteria (estimators)
To summarize the above discussion, properties of four popular estimators are given in
Table 3.1 for a regression problem of n equations and m unknowns.
3.1.2 Two-Stage Regression Model
In this section we show that both derivative estimation and optical flow constraint solving
stages in the gradient-based local approach can be formulated as linear regression problems.
Optical flow constraint
Following Haralick and Lee [53], we constrain the optical flow vector u = (u, v)T at
location (x, y, t)T by
Au + ξ = b (3.6)
where
A = ⎡ I_x    I_y  ⎤          ⎡ I_t  ⎤
    ⎢ I_xx   I_xy ⎥          ⎢ I_xt ⎥
    ⎢ I_yx   I_yy ⎥ ,  b = − ⎢ I_yt ⎥ .
    ⎣ I_tx   I_ty ⎦          ⎣ I_tt ⎦
We further assume that the flow vectors in each small neighborhood of N pixels are constant,
and hence each vector u conforms to N sets of constraints simultaneously. This constitutes
our optical flow constraint [144]: a linear regression model
A_s u + ξ = b_s     (3.7)
where A_s = (A′_1, A′_2, . . . , A′_N)′, b_s = (b′_1, b′_2, . . . , b′_N)′, and each pair A_i, b_i are the A, b defined by Eq. (3.6) at pixel i, i = 1, . . . , N. In our experiment, we choose the constant
flow neighborhood size to be 5× 5, so N = 25.
Compared to the first-order constraint Eq. 2.7, this mixed-order constraint has the
advantage that a large number of equations are provided on a small data support (100
equations in a 9 × 9 × 5 neighborhood). Such compactness is desirable because a smaller
neighborhood size means less chance of encountering multiple motions, and a larger sample
size brings higher statistical efficiency. Although second-order constraints alone are often
avoided due to derivative quality concerns [7], we argue that they are beneficial when used
together with first-order constraints under a robust criterion, because (i) they are automatically assigned smaller weights, due to the fact that second-order derivatives normally take much
smaller values than first-order derivatives in real imagery, and (ii) outliers among them can
be ignored under the robust criterion. In addition, experiments show that second-order
derivatives of reasonable accuracy can be obtained from the facet model.
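To make the construction concrete, the stacked mixed-order system can be assembled from precomputed derivative images as below (our sketch; the dictionary keys are our naming convention, and the plain least-squares solve shown in the comment is only a placeholder for the robust LTS.LS solver of Section 3.1.3):

```python
import numpy as np

def ofc_system(D, x, y, half=2):
    """Stack the mixed-order constraints Au + xi = b (Eq. 3.6) over the
    (2*half+1)^2 constant-flow neighborhood centered at (x, y).
    D maps derivative names ('x', 'y', 't', 'xx', 'xy', 'yy', 'xt', 'yt',
    'tt') to images; I_yx = I_xy and I_tx = I_xt by symmetry of mixed
    partials."""
    rows_A, rows_b = [], []
    for j in range(y - half, y + half + 1):
        for i in range(x - half, x + half + 1):
            rows_A += [[D['x'][j, i],  D['y'][j, i]],
                       [D['xx'][j, i], D['xy'][j, i]],
                       [D['xy'][j, i], D['yy'][j, i]],
                       [D['xt'][j, i], D['yt'][j, i]]]
            rows_b += [-D['t'][j, i], -D['xt'][j, i],
                       -D['yt'][j, i], -D['tt'][j, i]]
    return np.asarray(rows_A), np.asarray(rows_b)

# A plain least-squares solve stands in here for the robust LTS.LS solver:
#   u_hat = np.linalg.lstsq(*ofc_system(D, x, y), rcond=None)[0]
```

For a 5 × 5 neighborhood this yields the 100 × 2 system described above.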
Derivatives From the Facet Model
The facet model characterizes each small image data neighborhood by a signal model
and a noise model [54]. Low-order polynomials are the most commonly used signal form.
We use a 3D cubic polynomial for derivative estimation [143]. Here “3D” means that the
polynomial is in the spatiotemporal variables (x, y, t); “cubic” means that the highest
order of a term is 3. The facet model finds the polynomial coefficient vector a from the
linear regression model
Da + ξ = J (3.8)
where J is the observed image data vector (formed by traversing the neighborhood data
lexicographically), and D is the design matrix composed of 20 canonical polynomial bases
(1, x, y, t, x2, . . . , xyt). We use the facet model neighborhood size 5 × 5 × 5, so D has
dimension 125 × 20. Once a is found, the spatiotemporal derivatives are merely scaled
versions of its elements. More details about derivatives from the facet model can be found
in our earlier work [143, 141, 146].
Most popular derivative estimators in optical flow estimation are neighborhood masks.
They essentially come from facet models [54] of different dimension (1D, 2D or 3D), order
(1st, 2nd or 3rd) or neighborhood size (2, 3 or 5). For example, the four-point central
difference mask (−1, 8, 0,−8, 1)/12 that Barron et al. use [7] is actually a 1D cubic facet
model on a neighborhood of 5 pixels [146]. Our facet model outperforms it on most image
sequences.
3.1.3 Choosing Estimators
In this section, we analyze the characteristics of the two regression problems and identify
appropriate regression estimators for them.
Solving OFC by LTS.LS
We observe that (i) both y-outliers and leverage points can happen in Eq.(3.7) because
both As and bs are composed of derivative estimates; (ii) leverage points are roughly twice
as likely as y-outliers due to the size contrast of As and bs; (iii) a significant portion of
the constraints can be gross errors, when, for example, multiple motion models happen in a
neighborhood; and (iv) the number of constraints is relatively small. Therefore the desired
estimator for the OFC stage should be resistant to both types of gross errors and have a
high breakdown point and good statistical efficiency on a small sample size.
M-estimators [15, 86] and LMedS estimators [97, 5] were previously used at the OFC-
stage. M-estimators are resistant to y-outliers and have relatively high statistical efficiency,
but they have a low breakdown point of about 1/(1+m) and are vulnerable to leverage points. The
LMedS estimator [105] is resistant to both types of gross errors and has a high breakdown
point of 50%, but it has extremely low statistical efficiency, which means it tends to perform
poorly when there is no gross error (Table 3.1).
Possessing almost all the merits of LMedS along with better statistical efficiency, LTS is preferred to
LMedS [138, 103, 104]. We use least-trimmed-squares followed by (weighted) least-squares
to solve the optical flow constraint, and call the procedure “LTS.LS”.
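A minimal sketch of the LTS.LS procedure follows. LTS is approximated by random two-point elemental subsets (the trial count here is an arbitrary illustration, not the Eq. 3.4 value), and the inlier R² of the final fit is returned as the confidence measure used later in the text:

```python
import numpy as np

def lts_ls_flow(A, b, h=None, n_trials=200, seed=0):
    """LTS.LS sketch: least-trimmed-squares over random 2-point elemental
    subsets of the OFC system A @ (u, v) = b, then an LS refit on the h
    retained inliers; returns the estimate and its inlier R^2 confidence."""
    rng = np.random.default_rng(seed)
    n = len(b)
    h = h or (n // 2 + 1)            # trimming constant: keep just over half
    best_score, best_v = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=2, replace=False)
        try:
            v = np.linalg.solve(A[idx], b[idx])
        except np.linalg.LinAlgError:
            continue
        score = np.sort((A @ v - b) ** 2)[:h].sum()   # trimmed sum of squares
        if score < best_score:
            best_score, best_v = score, v
    # LS refit on the h constraints most consistent with the LTS winner
    inliers = np.argsort((A @ best_v - b) ** 2)[:h]
    v, *_ = np.linalg.lstsq(A[inliers], b[inliers], rcond=None)
    res = A[inliers] @ v - b[inliers]
    R2 = 1.0 - (res @ res) / (b[inliers] @ b[inliers])  # inlier R^2
    return v, R2
```

With 30% gross errors in b, the refit still recovers the clean-data solution, whereas plain LS would be pulled off by the contaminated rows.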
LS or LTS: Adaptive Derivative Estimation
By default we solve the 3D cubic facet model in an LS sense to find the derivatives. When
the estimation quality is poor, we update the derivatives from robust facet model fitting.
To reduce computation and prevent over-fitting, we use a 3D quadratic facet model for this
purpose. As the dimension of the parameter vector a is as large as 10, and the breakdown
point has to be high, the LTS estimator is again a better choice than M- and LMedS
estimators. Unlike the OFC stage, where we estimate two parameters out of 100 constraint
equations, in this stage, there are 10 parameters but only 125 constraint equations. With a
rather small sample size, WLS can hardly improve results of LTS. So we decide to use LTS
for robust facet fitting.
Note that it may not be best to apply LTS facet model fitting uniformly, because
LTS tends to have lower statistical efficiency than LS when there is no gross error, and
it involves much more computation. Therefore the LTS facet model should be used only
when estimation fails because of poor LS facet quality. We take the coefficient of
determination (R2) [105] from the LTS.LS OFC step as a confidence measure of the flow
estimate. R2 measures the proportion of observation variability explained by the regression
model. Here it is defined as
R² = 1 − (Σ_{i∈inliers} r_i²) / (Σ_{i∈inliers} y_i²).
We detect poor flow estimates as those having R2 < T , and try robust facet model fitting
to improve them.
It is worth mentioning that our OFC stage and that of Bab-Hadiashar and Suter use a
similar local optimization formulation, the difference being that they use LMedS while we
use LTS as the regression tool. Both of us detect bad estimates with low R2 values but we
treat them very differently. They remove them as unreliable, whereas we apply a two-stage
LTS to improve their accuracy.
Finally, the diagram of the proposed algorithm is given in Figure 3.2.
3.1.4 Experiments and Analysis
We demonstrate on both synthetic and real data how optical flow accuracy improves as the
method upgrades from purely LS-based (LS-LS) to one-stage robust (LS-LTS.LS) to two-
stage robust (LTS-LTS.LS). We also compare our results with those from Bab-Hadiashar
and Suter’s technique (BS) [5] which applies LMedS to the OFC stage. The results were
Figure 3.2: Block diagram of the two-stage-robust adaptive algorithm (image data → LS facet derivatives → OFC → optical flow and confidence; low-confidence estimates are recomputed via robust facet fitting)
computed using their own C program, all parameters set as default. For fair comparison,
the facet and OFC neighborhood sizes are fixed to 5 pixels for both techniques.
An illustrative example
We first use the synthetic data set in Figure 3.3 to demonstrate the necessity of robust
regression in both stages. The image size is 32 × 32. The motions of the left and right halves are
vertical and horizontal respectively, both at 1 pixel/frame. Since an optical flow constraint
equation forms a line au + bv + c = 0 in the (u, v) plane, with its distance from the true
velocity an indicator of the degree of modeling imperfection, we use OFC cluster plots to
visualize derivative quality and the results of different estimators. Three typical points, (5, 5),
(5, 20) and (14, 17) in Figures 3.3 and 3.4, are closely examined. Their true velocities are marked
by black dots in Figure 3.5.
(5, 5) is a point where most derivatives are of good quality, as we can tell from the
nice OFC cluster at the true velocity (Figure 3.5(a)). However, even in this favorable
case, LS-LS yields only (0.9734, 0.0015) while LS-LTS.LS yields (numerically) exactly (1,
0). The 9 × 9 × 5 data support of point (5, 20) has one ninth of its data conveying the left motion mode.
Accordingly we observe a clear cluster at the true velocity and a small vague cluster at the
left velocity (Figure 3.5(b)). LS is totally lost in this case, yielding a compromise of (0.5933,
0.518), as opposed to LS-LTS.LS, which gives (-0.0051, 0.9913). These two cases suggest that
LS-LTS.LS significantly outperforms LS-LS at the OFC stage.
In the above cases, the facet model fitting errors can be accommodated by robust OFC.
Figure 3.3: Central frame of the synthetic sequence (5 frames, 32 × 32)
Figure 3.4: Correct flow field
But this is not the case with (14, 17), a boundary point on the right side. Figure 3.5(c)
shows constraint lines scattering around, with two very vague clusters at (0, 1) and (1, 0).
Estimates from LS-LS (-0.3937, 0.2482) and LS-LTS.LS (0.0708, 0.1267) are both totally
wrong. Here applying robust regression at the OFC stage alone no longer helps.
The reason is that derivative estimation at most points fails and a large portion of the
constraints become gross errors, so that the major optical flow constraint model does not
exist. Figure 3.5(d) shows the OFC plot from the robust facet model fitting. The ma-
jor motion model becomes clear so that LTS.LS yields a reasonably accurate estimate of
(0.0109, 1.0000).
Translating Squares Sequence (TS)
Figures 3.6(a) and 3.6(b) show the central frame and the correct flow field of another synthetic
sequence Translating Squares (TS). It contains two squares translating at 1 pixel/frame. The
image size is 64× 64. LTS-LTS.LS is applied at places with R2 < 0.99.
We calculate the error percentage as the quantitative accuracy measure. It is the error
vector magnitude normalized by the true velocity magnitude and multiplied by 100. We
report the average error percentages on the entire flow field (AEP) as well as those measured
(a) (5, 5): LS facet (b) (5, 20): LS facet
(c) (14, 17): LS facet (d) (14, 17): LTS facet
Figure 3.5: OFC cluster plots at three typical pixels. Each line represents a constraint equation. (5, 5): good derivative quality; (5, 20): a small number of bad derivatives; (14, 17): on a motion boundary, most derivative estimates are bad and robust facet fitting becomes necessary.
(a) Central frame (b) Correct flow (c) BS
(d) LS-LS (e) LS-LTS.LS (f) LTS-LTS.LS (g) R2 map
Figure 3.6: TS sequence results. Flow field estimates are subsampled by 2. Estimates with error percentages larger than 0.1% are shaded.
Technique AEP(%) AEPB(%)
LS-LS 18.83 50.61
BS 8.03 26.24
LS-LTS 7.53 24.59
LTS-LTS 4.75 15.51
Table 3.2: TS sequence: comparison of average error percentage
(a) Central frame (b) BS (c) LS-LTS.LS (d) LTS-LTS.LS
Figure 3.7: Pepsi sequence central frame and horizontal flow (darker pixels indicate larger speeds to the left).
in the motion boundary area (AEPB). The motion boundary area is defined as a 9-pixel-wide
band. Since the spatiotemporal data support for each flow estimate is 9 × 9 × 5, outside this
band there are no outliers from motion boundaries at either the derivative or the OFC stage.
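The AEP/AEPB measures above can be sketched as follows (array shapes are assumptions, and zero-motion pixels would need masking to avoid division by zero):

```python
import numpy as np

def average_error_percentage(flow, gt, mask=None):
    """AEP: error-vector magnitude over true speed, x100, averaged over the
    field; pass a boolean `mask` (e.g. the 9-pixel-wide motion-boundary band)
    to obtain AEPB instead. `flow`, `gt` have shape (H, W, 2)."""
    err = 100.0 * np.linalg.norm(flow - gt, axis=-1) / np.linalg.norm(gt, axis=-1)
    return err[mask].mean() if mask is not None else err.mean()
```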
The AEP and AEPB values are summarized in Table 3.2. The flow fields estimated from the
four algorithms are given in Figure 3.6. To facilitate visual comparison, we shade estimates
with error percentages larger than 0.1%. To keep the flow field plots from being too crowded,
we subsample them by 2 in both x and y directions. We observe from the results that (i)
robust methods outperform LS methods, (ii) LTS seems to be slightly better than LMedS
in the OFC stage, and (iii) LTS derivative estimation significantly reduces boundary errors.
The Pepsi Sequence
This is a real image sequence in which a Pepsi can and the background move approximately
0.8 and 0.35 pixels to the left respectively (Figure 3.7(a)).
(a) BS (b) LS-LTS.LS (c) LTS-LTS.LS
Figure 3.8: Pepsi: estimated flow fields
We show subsampled flow fields of four techniques in Figure 3.8 and the (linearly scaled) horizontal flow values in Figure 3.7.
BS’s result (Figure 3.8(a),3.7(b) has significant vertical speed components in the upper-left
and the lower parts, and the flow is still over-smoothed. Figure 3.8(b),3.7(c) is the result
of LS-LTS (1st and 2nd order constraint). Motion contrast and discontinuities are much
clearer. LTS-LTS.LS (Figure 3.8(c),3.7(d)) updated LS-LTS estimates with R2 < 0.75 and
further improved the boundary accuracy.
Discussion
The primary contribution of the above work is that it formulates optical flow estimation
as two regression problems and adaptively solves them using one-stage or two-stage LTS
methods. Preliminary experimental results on both synthetic and real image sequences
verified its effectiveness. Since derivative estimation is a fundamental step of many computer
vision problems, and most optimization problems can be fit into the regression framework,
the conclusions of this work may extend to other fields.
A limitation of the proposed method lies in the high computational cost, induced by uni-
formly applying an expensive high-breakdown robust estimator to both regression stages.
In the next section, we exploit the piecewise-smooth property of visual fields to develop a
deterministic algorithm whose complexity adapts to the degree of local outlier contamination.
It converges faster and achieves more stable accuracy than the random-sampling-based
algorithm.
3.2 Adaptive High-Breakdown Robust Methods For Visual Reconstruction
Visual reconstruction is the process of recovering the underlying true visual field from a
noisy observation [20]. It includes many fundamental tasks in early vision such as image
restoration, 3D surface reconstruction, stereo matching and optical flow estimation. What
permits the reconstruction is the piecewise continuity property of a visual field, which is
often imposed by local parametric models [55, 53]. In recent years, many robust methods
have been employed to solve the associated regression problems [13, 121, 75]; among them,
those based on high-breakdown robust criteria [106, 85], e.g. least-median-of-squares and
least-trimmed-squares, have reported the best accuracy.
High-breakdown criteria usually have no closed-form solutions, so certain approximation
schemes must be used. Different approximation methods may lead to very different accuracy
and convergence rate; and research is still going on in the statistics community to find
more appropriate methods [104]. So far almost all high-breakdown robust methods in
visual reconstruction applications [121, 75, 5, 97, 117, 145] adopt the random-sampling-based
algorithm outlined by Rousseeuw and Leroy [106]—the estimate with the best criterion value
is picked from a random pool of trial estimates, and the algorithm is uniformly applied to
all pixels in an image. The generic scheme is summarized in Figure 3.9.
Using the same number of trial subsets p at all locations causes both efficiency and
accuracy concerns. According to Eq. 3.4, p must be chosen large enough to ensure a high
breakdown point. Since evaluating the criterion value F is an expensive operation, a large p
value incurs a heavy computational burden. Meanwhile, much of the burden is unnecessary
because most places in a normal visual surface have few outliers and the above complicated
process often ends up generating least-squares estimates. If a priority is placed on saving
computation and a smaller p value is chosen, the probability of locking on the correct
solution can be hurt, especially at locations with significant contamination.
What is needed, in order to circumvent the efficiency-accuracy predicament, is an adap-
choose number of subsets p according to Eq. 3.4
for all pixels {
    for p subsets {
        compute LS solution uLS;
        store criterion value F;
    }
    select solution with best criterion value;
    compute WLS solution from Eq. 3.5;
}
Figure 3.9: Random sampling based algorithm for high-breakdown robust estimators
tive scheme which performs least-squares estimation when no outlier is present and increases
the p value as the noise contamination becomes more severe. This does not seem possible for
an isolated regression problem in which no prior information about outlier contamination
is available. But in visual reconstruction problems, we can exploit the piecewise-smooth
property of visual surfaces to achieve the adaptiveness.
3.2.1 The Approach
Now we present an adaptive algorithm for high-breakdown robust visual reconstruction by
considering the example problem of estimating piecewise-constant optical flow from noisy
measurements of first-order derivatives, i.e., solving the Lucas-Kanade constraint equation
Eq. 2.7 for all pixel locations under a high-breakdown robust criterion.
Observing that in normal image sequences, (i) the majority of pixel locations do not have
outliers and least-squares estimates are reasonably good, and (ii) flow fields are smooth and
nearby estimates have similar values (true even at motion boundaries), we initialize the
flow field using least-squares estimates, and then iteratively generate trial solutions for each
pixel using its neighbors’ values. Given a trial solution, which may come from either the
least-squares initial or a neighbor’s value, we identify the part of local constraints consistent
with it and calculate a new solution following the weighted least-squares (WLS) procedure
(Eq. 3.5). In this way, we obtain an updated trial solution which represents the local
for all pixels {
    compute LS estimate VLS;
    V, F ← WLS on VLS;
}
while #{pixels updated} > 0 {
    for all pixels {
        for all its neighbors Vn {
            if (Vn updated and |Vn − V| > T) {
                Vtry, Ftry ← WLS on Vn;
                if (Ftry < F) { update V, F; }
            }
        }
    }
}
Figure 3.10: Adaptive algorithm for high-breakdown robust estimators
constraints more closely. We then compute its criterion value, and retain this trial solution
if it achieves the best criterion value so far. The algorithm is described in Figure 3.10.
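The scheme of Figure 3.10 can be sketched in NumPy as follows. The WLS "locking" step here is a simple iterated residual-threshold refit standing in for the thesis's Eq. 3.5 weighting, `constraints` maps each pixel to its local OFC system (A, b), and all names and thresholds are illustrative:

```python
import numpy as np

def wls_update(A, b, v0, iters=3):
    """Iterated refit on the constraints consistent with trial v0 (a simple
    stand-in for the Eq. 3.5 WLS weighting); returns the refined estimate
    and its LMedS criterion value (median squared residual)."""
    v = np.asarray(v0, dtype=float)
    for _ in range(iters):
        r = np.abs(A @ v - b)
        scale = 1.4826 * np.median(r) + 1e-12      # robust residual scale
        keep = r <= 2.5 * scale
        if keep.sum() < 2:                         # degenerate: best half
            keep = np.argsort(r)[: max(2, len(r) // 2)]
        v, *_ = np.linalg.lstsq(A[keep], b[keep], rcond=None)
    return v, np.median((A @ v - b) ** 2)

def adaptive_flow(constraints, shape, step, T=0.1):
    """Adaptive high-breakdown estimation: LS initialization, then
    iteratively borrow updated neighbor estimates as new trial solutions."""
    V = np.zeros(shape + (2,))
    F = np.full(shape, np.inf)
    for p, (A, b) in constraints.items():
        v0, *_ = np.linalg.lstsq(A, b, rcond=None)
        V[p], F[p] = wls_update(A, b, v0)
    updated = set(constraints)
    while updated:
        nxt = set()
        for (i, j), (A, b) in constraints.items():
            for n in [(i - step, j), (i + step, j), (i, j - step), (i, j + step)]:
                if n in updated and np.linalg.norm(V[n] - V[i, j]) > T:
                    v, f = wls_update(A, b, V[n])
                    if f < F[i, j]:                # keep the best trial so far
                        V[i, j], F[i, j] = v, f
                        nxt.add((i, j))
        updated = nxt
    return V
```

The `step` argument corresponds to the w/2 neighbor spacing discussed below.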
Because of the “locking” capability of the WLS update step, similar neighbor values
usually result in very close or identical trial solutions, and hence not all neighboring values
need to be borrowed. Also, to expedite convergence, it is better to use neighbors a few
pixels away rather than immediate ones. Therefore, we use four neighbors to the N,S,E,W
directions which are w/2 pixels away, where w is the window size of constant local flow.
In this approach, the piecewise smoothness property of the visual field and the selection
capability of robust estimators are exploited to produce trial solutions in a much more
educated way than random sampling methods. The complexity of the estimator varies with
the local structure: least-squares solutions are used where no outlier is present, and the
number of trials increases with the outlier percentage. The adaptive nature is revealed in
Figure 3.11, which shows the number of trials at each pixel as an intensity image for the
TS sequence (see description in Section 3.1.4). More trials, indicated by brighter colors,
are carried out closer to the boundary where the structure is more complex. The trial set
size ranges between 1 and 13 in this case, as opposed to the uniform p = 30 in the random
sampling based algorithm [5].
3.2.2 Experiments and Analysis
We calculate derivatives from a first-order spatiotemporal facet model on a support of size
3 × 3 × 3 [145], and solve the optical flow constraint under the LMedS criterion. Optical
flow is estimated on the middle frame of every three frames. A hierarchical scheme [12] is
adopted to handle large motions, with the number of pyramid levels empirically determined.
We carefully handled boundary cases such that the resulting flow field is of the same size as
the original image.
Comparison is made with modified versions of Lucas and Kanade’s LS based method
(LK) [82] and Bab-Hadiashar and Suter’s random LMS based method (BS) [5]. Their
original implementations do not include hierarchical processing, and derivatives are estimated
differently. To emphasize the contrast of different regression methods, we implemented LK
and BS by modifying the code of our algorithm. No pre-smoothing is done and the constant
flow window size is fixed at 9 × 9. p = 30 random subsets are drawn in BS as [5] suggested.
Experiments are carried out on a PIII 500MHz PC running Solaris. Vector plots below are
appropriately subsampled and scaled to facilitate visual inspection.
Five image sequences with flow groundtruth are used for quantitative comparison. Two
error measures are reported. One is the angular error e∠ used in [7]. It is defined on the
normalized augmented flow vector u = (u′, 1)′ as arccos(u · u0), where u0 is the correct flow vector. The
other one is the error vector magnitude measure e|·| = |u − u0|/|u0|. We also report the
consumed CPU time in seconds to give a rough idea on the speed contrast.
Translating Squares Sequence (TS). This data set was introduced in Section 3.1.4
(Figure 3.6). Calculation is done on the original resolution. The correct and estimated
vector plots are given in Figure 3.12. LK’s result is smeared around motion boundaries
and good elsewhere. This shows the necessity of robust estimation, and also justifies using
LS for initialization in our method. BS’s and our results look close for this data set.
The reason is that for the case m = 2 which we examine here, p = 30 samples
make the probability of having at least one good initial as high as 99.98%, even with 50%
Figure 3.11: TS trial set size map. Value range: 1 (darkest) → 13 (brightest). More trial solutions are generated in places of higher motion complexity.
outliers present. For these reasons and because of the simplicity of the TS sequence, the
random scheme succeeds most of the time. This similarity in turn justifies the viability of
implementing high-breakdown criteria without random sampling.
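The 99.98% figure quoted above can be checked directly: the chance that p random m-point subsets include at least one all-inlier subset at outlier fraction ε is 1 − (1 − (1 − ε)^m)^p.

```python
# probability that at least one of p random 2-point subsets is outlier-free
p, m, eps = 30, 2, 0.5
good = 1.0 - (1.0 - (1.0 - eps) ** m) ** p
print(round(100 * good, 2))  # -> 99.98
```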
TT, DT, YOS Sequences. Three popular synthetic data sets, Translating Tree
(TT), Diverging Tree (DT) and Yosemite (YOS), are obtained from Barron [7]. Their
middle frames, the 20th in the TT and DT data sets and the 9th in the YOS data set, are
given in Figs. 3.2.2, 3.2.2. TT and DT (150 × 150) simulate translational camera motion
with respect to a textured planar surface. TT’s motion is horizontal, DT’s is divergent and
their maximum speeds are about 2 pixels/frame. The motion in YOS (316× 252) is mostly
divergent with the maximum speed about 4-5 pixels/frame. The cloud part is excluded
from evaluation. We use two levels of pyramid for TT and DT, and three levels for YOS.
OTTE Sequence. This real image sequence is provided by Nagel [98]. The scene
is stationary except for a marble block in the center moving leftwards; and the camera is
translating. Groundtruth is available where the vector is nonzero (Figure 3.15). Three
levels of pyramid are used. Measures on all sequences are summarized in Table 3.3.
Robust methods are more accurate but much slower. Quite noticeably however, the
accuracy advantage fades with the use of image pyramids. This is caused by the limitations
(a) Groundtruth (b) LK
(c) BS (d) Ours
Figure 3.12: TS: correct and estimated flow fields
Data   Technique   e∠ (°)   e|·| (%)   time (sec)
TS     LK          6.14     15.12      0
       BS          1.10     2.64       2
       Ours        1.09     2.65       0
TT     LK          2.36     5.48       1
       BS          1.67     3.75       16
       Ours        1.39     3.22       6
DT     LK          6.12     18.33      1
       BS          5.73     18.53      16
       Ours        5.00     16.14      10
YOS    LK          3.69     12.68      4
       BS          3.81     11.87      61
       Ours        3.42     11.10      40
OTTE   LK          17.22    48.56      13
       BS          17.02    48.23      205
       Ours        16.84    47.90      121
Table 3.3: Quantitative comparison of the proposed adaptive LMedS algorithm to Lucas and Kanade (LK) [82] and Bab-Hadiashar and Suter (BS) [5]. The new algorithm is more accurate, and more efficient than BS.
Figure 3.13: TT, DT middle frame Figure 3.14: YOS middle frame
[14] of the simple hierarchical strategy [12]. The error it introduces can be greater than
that from LS estimation and becomes the quality bottleneck. This issue will be addressed
in the next chapter. Note in Table 3.3 that while our method significantly outperforms LK
in all cases, BS produces larger errors than LK for the DT and YOS sequences. This suggests
the unstable nature of random-sampling-based LMedS.
TAXI Sequence. TAXI is a real sequence with no groundtruth also from Barron
[7]. In the street scene there are four moving objects: a taxi turning the corner, a car in
the lower left driving leftwards, a van in the lower right driving towards the right, and a
pedestrian in the upper-left. Their image speeds are approximately 1.0, 3.0, 3.0 and 0.3
pixels/frame respectively. Two levels of pyramid are used. To enhance details, we display
the horizontal flow component as intensity images in Figure 3.17. Brighter pixels represent
larger speeds to the right. In LK’s estimate the flow fields of the vehicles have severely
invaded the background. BS and our method preserve motion boundaries better. However
it is quite obvious that BS has bumpier boundaries and produces more gross errors, for
instance on the taxi. In addition, BS took 36 seconds CPU time while ours only took 13
seconds. We show the trial solution set sizes and give the minimum and maximum numbers
of trials on two levels of pyramid and their sum in Figure 3.16. Apparently larger numbers
(a) Middle frame (b) True flow
Figure 3.15: OTTE sequence
of trials were used for places with more complex motion. This observation suggests that the
trial set size map might help motion structure analysis.
3.2.3 Discussion
In this section we have presented an adaptive high-breakdown robust method for visual re-
construction and applied it to optical flow estimation. By taking advantage of the piecewise
smoothness property of visual fields and the selection capability of robust estimators, this
algorithm can be faster and more accurate than algorithms based on random sampling.
Although we have chosen locally constant flow estimation to illustrate its effectiveness,
the strengths of this approach should be more apparent in problems of higher dimensions,
such as affine flow estimation and piecewise-cubic image restoration, for which random
sampling methods quickly become computationally formidable. One of our future work
directions is to extend this approach to these applications. Also worth further investigation
is a less expensive alternative to the WLS estimator (Eq. 3.5) for updating estimates during
(a) Middle frame (b) Level 1: (1,14)
(c) Level 0: (1,13) (d) Total: (2,25)
Figure 3.16: TAXI: snapshot and trial set size maps (in parentheses: min. and max. numbers of trials)
(a) BS (36sec)
(b) Ours (13sec)
Figure 3.17: TAXI: intensity images of the x-component. Note that BS has bumpier boundaries and produces more gross errors, e.g., near the center of the taxi.
each visit. One possibility is to use the concentration property of LTS [104]. Finally, it
would be interesting to see how the trial set size could be used as an early cue for analyzing
scene complexity.
3.3 Error Analysis on Robust Local Flow
In this section we provide an error analysis of the first-order differential local regression tech-
nique through covariance propagation [52]. By using a high-breakdown robust criterion, we
minimize the impact of outliers on both optical flow estimation and its error evaluation.
We calculate spatiotemporal derivatives from a facet model, which enables us to take corre-
lation of adjacent pixels into account, and estimate image noise and derivative errors in an
adaptive fashion. In the regression problem, we consider errors from both the observations
and the measurements. In addition, we adopt a hierarchical process to handle large motion.
Our error analysis is more complete, systematic and reliable than previous attempts. The
advantages are demonstrated in a motion boundary detection application.
3.3.1 Covariance Propagation
Covariance Propagation Theory
Consider a system relating the output y to the input x by the function
y = f(x).
Generally f(·) is nonlinear; but when the perturbation ∆x is small enough to fall within its linear
range, the output error is well approximated by
∆y = (df(x)/dx) ∆x.
Then the covariance of the output is
Σy = (df(x)/dx) Σx (df(x)/dx)′.   (3.9)
f(·) of most real systems cannot be expressed explicitly. Instead a relationship
g(x, y) = 0
usually exists. In such cases we have
df(x)/dx = −(∂g(x, y)/∂y)^{−1} ∂g(x, y)/∂x,
and finally
Σy = [(∂g(x, y)/∂y)^{−1} ∂g(x, y)/∂x] Σx [(∂g(x, y)/∂y)^{−1} ∂g(x, y)/∂x]′.   (3.10)
Evaluating the covariance requires the knowledge of the true x,y values, which are seldom
available and commonly approximated by their estimates in practice.
The assumptions of unimodal noise and accurate observations and estimates limit the
application of the theory to only near-perfect systems with no outlier present.
Below we introduce the application of the theory to our robust optical flow estimator.
The explanation proceeds in two steps: the OFC step and the facet model step. The results presented here are more
general than those reported in an earlier paper about a least-squares technique [144].
The OFC Step
After outliers are removed under the robust criterion, the actual OFC we use is composed
of the rows in Eq. 2.7 corresponding to the inliers. For simplicity, from now on we overload
Eq. 2.7 by the actual OFC and let n be the number of inliers.
Solving Eq. 2.7 using least-squares is optimized for the model that only b is contaminated
by iid additive zero-mean noise with variance σ2b . Under this assumption, the optical flow
covariance is simply
Σu = σb² (A′A)^{−1}   (3.11)
where σb² can be estimated from the residual errors as
σb² = (1/(n − 2)) Σ_{i=1}^{n} r_i².
The above error model is apparently unrealistic because spatial derivatives in A are noisy
as well. Under the Error-In-Variable (EIV) condition the LS estimate is biased. Accordingly,
efforts have been made to calculate the unbiased estimate using generalized least-squares
[90, 96, 95]. However, at the cost of much heavier computation, these methods bring little
accuracy improvement. It is because bias is a much weaker error source than outliers [15]
in optical flow estimation; methods which can suppress outliers [15, 147] achieve fairly good
accuracy, much better than what generalized least-squares can do. In addition, bias has
also turned out to be less significant than estimation variance in motion estimation [27].
Therefore, we solve the outlier-suppressed OFC using an LS estimator, and analyze the
estimation error by propagating covariance from the derivative estimates.
The input to the system Eq. 2.7 is the spatiotemporal derivative vector d and the output
is the optical flow vector u. We assume their errors are both zero-mean and have covariance
matrices Σd and Σu respectively. u and d do not have a linear relationship, but they are
related by
g(u, d) = ∂F(d, u)/∂u = A′(Au − b) = 0
where F(d, u) = |Au − b|² is the criterion function. Proceeding as the covariance propagation theory (Section 3.3.1) suggests, we obtain
∂g(d, u)/∂u = A′A
and
∂g(d, u)/∂d = (∂g(d, u)/∂d_1, . . . , ∂g(d, u)/∂d_n)
where
∂g(d, u)/∂d_i = [ r_i + I_xi u    I_xi v         I_xi ]
                [ I_yi u          r_i + I_yi v   I_yi ].
Applying Eq. 3.10 yields the optical flow estimate covariance
Σu = [(∂g(d, u)/∂u)^{−1} ∂g(d, u)/∂d] Σd [(∂g(d, u)/∂u)^{−1} ∂g(d, u)/∂d]′.   (3.12)
This expression reveals that the error in the optical flow estimate not only depends on
the residual errors and the derivative values (through system conditioning) as indicated
by Eq. 3.11, but also relates to the optical flow value and errors in the derivatives. Such
observations have been made in many previous studies [7, 15, 5].
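A sketch of Eq. 3.12 in NumPy follows (names are illustrative; Σd is assumed ordered as (Ix, Iy, It) per constraint). A useful sanity check: if only the temporal derivatives are noisy and iid, the result collapses to the σ²(A′A)^{−1} form of Eq. 3.11.

```python
import numpy as np

def flow_covariance(A, b, u, Sigma_d):
    """Propagate the derivative covariance Sigma_d (3n x 3n, blocks ordered
    (Ix_i, Iy_i, It_i)) to the flow estimate u via Eq. 3.12."""
    n = len(b)
    r = A @ u - b                          # constraint residuals
    dg_dd = np.zeros((2, 3 * n))
    for i in range(n):
        Ix, Iy = A[i]
        dg_dd[:, 3*i:3*i+3] = [[r[i] + Ix * u[0], Ix * u[1], Ix],
                               [Iy * u[0], r[i] + Iy * u[1], Iy]]
    B = np.linalg.solve(A.T @ A, dg_dd)    # (A'A)^{-1} dg/dd
    return B @ Sigma_d @ B.T               # 2 x 2 flow covariance
```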
Assuming the derivatives and optical flow estimates are sufficiently accurate, we use
them in place of the unknown true values for evaluation. Now the only missing piece in
the above expression is the derivative error covariance Σd. Its modeling has posed a great
difficulty to many previous studies [119, 90, 96]. Below we tackle the problem using the
facet model.
The Facet Model Step
We assume the image noise is an iid zero-mean variable with variance σ2. From Sec-
tion 3.3.1 we know that the gradient vector di at pixel i, i = 1, . . . , s is linearly related to
its neighborhood data Ji by di = M Ji. This permits direct application of Eq. 3.9 and leads to
Σdi = σi² M M′.
Similarly, for any pair of gradient vectors di, dj we have
Σdidj = σij² Mi Mj′,
where Mi, Mj are the weights on their overlapping support, and σij² is approximated by
σi σj. Finally the full derivative covariance matrix is assembled from Σdidj, i, j = 1, . . . , n.
Notice that Mi Mj′ depends on the positional relationship of pixels i and j, and only takes
a few forms once the supports for the OFC and the facet model are determined. Hence in
implementation we create a lookup table of all possible Mi Mj′ beforehand and refer to it
during pixel-by-pixel error estimation.
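The per-pixel covariance Σdi = σ²MM′ can be sketched directly from the facet design matrix; the function name and argument layout are illustrative:

```python
import numpy as np

def derivative_covariance(D, deriv_rows, sigma2):
    """Sigma_di = sigma^2 * M M': M maps neighborhood data J to the
    derivative estimates, i.e. the rows of the LS pseudoinverse of the
    facet design matrix D corresponding to the linear-term coefficients."""
    M = np.linalg.pinv(D)[deriv_rows]   # e.g. 3 x 125 for a 5x5x5 cubic facet
    return sigma2 * M @ M.T             # 3 x 3 derivative covariance
```

On a symmetric grid the monomial columns of D are orthogonal, so for a first-order facet the result is diagonal, which makes an easy hand-checkable case.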
The above procedure defines the structure of Σd. [90, 95] arrive at similar conclusions
using their derivative masks. However, they meet difficulty with image noise variance
estimation. [90] attempts to evaluate the variance empirically, but the method turns out to be
unsuccessful. The reason is that here the image noise is not caused simply by acquisition
errors; it also depends on the derivative masks and the local image texture [119, 96, 92].
We derive an estimate of σ² from the facet model fitting residual error
σ² = |Da − J|² / (nd − nb)
where nd, nb are respectively the number of pixels in J and the number of polynomial bases.
This measure reflects the deviation of the local image texture from the assumed polynomial
model, which arises from either image noise or complex textures. It is a by-product of
derivative estimation, and is adaptive across the image. The use of the facet model fully
automates the error propagation from the image data to the optical flow estimation.
Hierarchical Processing
We build our optical flow estimation and covariance propagation method in a hierarchical
scheme to cope with large motions [12]. [119] propagates covariance down the pyramid by
a Kalman-filter-like scheme. Currently we assume results on different pyramid levels are
independent from each other, and hence we combine covariance matrices of different levels
simply by multiplying the values at the higher level by 4 and adding them to the values at the
lower level. Due to the limitations of hierarchical schemes [147] and the crude combination
method, we observe performance degradation as the number of pyramid levels increases.
Handling large motions remains a very difficult problem and needs further investigation.
3.3.2 Experiments
Motion Boundary Detection. Inspired by [93, 90], we demonstrate the performance
of our error analysis method through a local statistical motion boundary detector. Given
two adjacent optical flow vectors and their covariance matrices (ui, Σui) and (uj, Σuj),
we examine the hypothesis H0 that they originate from normal distributions with the same
mean. Under H0, their difference vector u = ui − uj obeys a bivariate normal distribution
u ∼ N(0, Σui + Σuj). Thus the statistic

T = u′ (Σui + Σuj)⁻¹ u

should obey a χ² distribution with 2 degrees of freedom. We reject H0, i.e., declare a boundary
pixel pair, when T > Tα. Each Tα corresponds to a significance level α, which is the
theoretical false-alarm rate.
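In code, the test is only a few lines; for 2 degrees of freedom the χ² threshold has the closed form Tα = −2 ln α, so no statistics library is needed (`boundary_test` is our illustrative name):

```python
import math
import numpy as np

def boundary_test(u_i, cov_i, u_j, cov_j, alpha=0.05):
    """Chi-square test (2 dof) for a motion boundary between two flow
    vectors.  Under H0 the difference u obeys N(0, cov_i + cov_j), so
    T = u' (cov_i + cov_j)^{-1} u ~ chi^2 with 2 dof, whose upper-alpha
    threshold is T_alpha = -2 ln(alpha)."""
    u = np.asarray(u_i, float) - np.asarray(u_j, float)
    S = np.asarray(cov_i, float) + np.asarray(cov_j, float)
    T = float(u @ np.linalg.inv(S) @ u)
    return T > -2.0 * math.log(alpha), T
```

For example, with unit covariances a difference of (4, 0) gives T = 8, above the α = 0.05 threshold of about 5.99, so a boundary is declared.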
We estimate optical flow on the middle frame of an odd number of frames. The constant-flow
window size is fixed at 9 × 9. Estimates at image borders are handled specially, so that they
also have good accuracy and the resultant flow field is of the same size as the original image
[142].
We compare two optical flow estimators, LS and LMS, and covariance propagation under
two noise models: the i.i.d. error model (Eq. 3.11) and the correlated EIV model (Eq. 3.12).
This forms four combinations: (a) LS OFC and covariance from Eq. 3.11, (b) LS OFC and
covariance from Eq. 3.12, (c) LMS OFC and covariance from Eq. 3.11, and (d) LMS OFC and
covariance from Eq. 3.12. (d) is the proposed method; (a) is similar to [126, 56], and (b) is similar to [90]. In
Figure 3.18: TS motion boundary. α values: (a) 0.01, (b) 0.15, (c) 2e−11, (d) 0.05.
all experiments, we adjust the α value to produce the best visual results. The performance
of the different methods is compared by inspecting the false-alarm and misdetection rates. The
closeness of the α value to the observed false-alarm rate is an indicator of the statistical
validity of the results.
TS Sequence. We calculate derivatives from a first-order spatiotemporal facet model
on a support of size 3× 3× 3 [145]. 3 frames are used. The results are shown in Figure 3.18
with α values given in captions. Simple as it is, the χ2 test is effective in detecting motion
boundaries. LS methods break down around motion boundaries, and the covariance computed
under either error model is unreliable there. This verifies that performance analysis based on
covariance propagation only works for near-perfect systems and is unable to detect its own failure. (c,d)
are similar by visual inspection. However, with an inappropriate noise model, (c) severely
underestimates the error. Its associated α = 2e−11 makes little statistical sense, and thus in
practice there is no good method for choosing the threshold. With outliers rejected and the
correlated EIV model assumed, good results with solid statistical meaning are produced by
(d).
Hamburg Taxi Sequence (TAXI). Derivatives are calculated from a cubic spa-
tiotemporal facet model on a support of size 5× 5× 5 [145]. 5 frames are used. Figure 3.19
gives the results using a 2-level pyramid. Due to limitations of the hierarchical processing
scheme (Section 3.3.1), the results are noisier than those computed on the higher pyramid
level only (the original sequence spatially subsampled by 2), shown in Figure 3.20. But the
overall observation is that robust estimates have more faithful boundaries, and the correlated
EIV model yields far fewer false alarms and misdetections. Quite noticeably though, the α values
become more problematic on real data. This suggests that our error modeling still needs
refining to meet the demands of real-world complexity.
3.3.3 Discussion
In this section we have presented an error analysis of a robust optical flow estimation
technique. Our work extends previous research in several directions. First of all, we make explicit
the dependence of the popular covariance propagation theory on accurate estimates, and
perform our analysis with a highly robust technique. We employ a high-breakdown robust
criterion to reject outliers, which are most detrimental to both optical flow estimation and
error analysis. By using a 3D facet model we obtain good derivative estimation, and in
addition we systematically estimate the image noise strength and the correlated errors in
spatiotemporal derivatives. We also adopt a hierarchical scheme to handle the large motion
case.
We illustrate the effectiveness of our error analysis on an application of statistical mo-
tion boundary detection. Compared to least-squares based methods, our method has sig-
nificantly higher motion estimation accuracy and boundary fidelity, and produces less false
alarms and misdetections. These exhibit the potential of our results in a wide range of
applications such as Structure From Motion (SFM) and camera calibration [34].
Automatic performance analysis is a very important yet very difficult problem. Although
it goes one step further than previous attempts, our approach is still based on the covariance
propagation theory and breaks down when the estimate quality becomes too low. The open
issue is how to make the system aware of when the quality of the estimates becomes too
low to support further inference.
Figure 3.19: TAXI motion boundary. α values: (a) 0.05, (b) 0.4, (c) 0.001, (d) 0.5.
Figure 3.20: TAXI motion boundary on images subsampled by 2. α values: (a) 0.0005, (b) 0.2, (c) 3e−7, (d) 0.25.
Chapter 4
GLOBAL MATCHING WITH GRADUATED OPTIMIZATION
The local approaches we presented in the previous chapter analyze each optical flow
vector by exploring image data in a small spatiotemporal neighborhood surrounding that
pixel location. Due to the limited contextual information, drawbacks of such approaches
are obvious: if data in a neighborhood do not have enough brightness variation or they
happen to be very noisy, the analysis can completely fail; in other words, local approaches
are highly sensitive to the aperture problem and their reliability can vary greatly within a
single image. In order to overcome such limitations, appropriate global approaches must be
developed to incorporate contextual information more effectively.
Global optimization techniques for optical flow estimation have been extensively studied
throughout the years, but the state-of-the-art performance remains unsatisfactory due to
formulation defects and solution complexity. On one hand, approximate formulations are
frequently adopted for ease of computation, with the consequence that the correct flow is
unrecoverable even in ideal settings. As an example, many methods intended to preserve
motion discontinuities use gradient-based brightness constraints, which can break down at
discontinuities due to derivative evaluation failure and thus cannot reach the goal of precise
boundary localization [145]. On the other hand, more sophisticated formulations typically
involve large-scale nonconvex optimization problems, which are so hard to solve that the
practical accuracy may not be competitive with that of simpler methods. Motion estimation research
has arrived at a stage in which a good collection of ingredients is available; but in order
to significantly improve performance, both problem formulation and solution methods need
to be carefully considered and optimized.
In this chapter, we discuss the problem of optimal optical flow estimation assuming
brightness conservation and piecewise smoothness and propose a matching-based global
optimization method with a practical solution technique.
From a Bayesian perspective, we assume the flow field prior distribution to be a Markov
random field (MRF) and formulate the optimal optical flow as the maximum a posteriori
(MAP) estimate, which is equivalent to the minimum of a robust global energy function.
The novelty in this formulation lies mainly in two aspects. 1) Three-frame matching is
proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2)
The strengths of brightness and smoothness errors in the global energy are automatically
balanced according to local data variation, and consequently parameter tuning is reduced.
These features enable our method to achieve a higher accuracy upper-bound than previous
algorithms.
In order to solve the resultant energy minimization problem, we develop a hierarchical
three-step graduated optimization strategy. Step I is the robust local gradient method
that we have proposed in Section 3.2. It provides a high-quality initial flow estimate.
Step II is a global gradient-based formulation solved by Successive OverRelaxation (SOR),
which efficiently improves the flow field coherence. Step III minimizes the original energy
by greedy propagation. It corrects gross errors introduced by derivative evaluation and
pyramid operations. In this process, merits are inherited and drawbacks are largely avoided
in all three steps. As a result, high accuracy is obtained both on and off motion boundaries.
Performance of this technique is demonstrated on a number of standard test data sets.
On Barron’s benchmark synthetic data [7], this method achieves the best accuracy among
all low-level techniques. A close comparison with the well-known dense regularization
technique of Black and Anandan (BA) [14] shows that our method yields uniformly higher accuracy
in all experiments at a similar computational cost.
4.1 Formulation
Let I(x, y, t) be the image intensity at a point s = (x, y) at time t. The optical flow at
time t is a 2D vector field V with the vector at each site s denoted by Vs = (us, vs)T ,
where us, vs represent the horizontal and vertical velocity components, respectively. Where
no confusion arises, we may drop the s index and denote an image frame by I(t)
and a flow vector by V = (u, v)T. The task of estimating optical flow can be described
as finding V to best interpret the spatiotemporal intensity variation in the image frames
I = {I(t1), . . . , I(t), . . . , I(t2)}, t1 ≤ t ≤ t2. We consider it as a Bayesian inference problem
and define the optimal solution under the maximum a posteriori (MAP) criterion.
4.1.1 MAP Estimation
Let P (V |I) be the posterior probability density of the flow field V conditioned on the
intensity observation I. According to the maximum a posteriori (MAP) criterion, the best
optical flow V is at the mode of this density, i.e.,
V = argmaxV P (V |I).
Applying Bayes rule, the posterior pdf can be factored as

P(V | I) = P(I(t) | V, I − I(t)) P(V | I − I(t)) / P(I(t) | I − I(t))   (4.1)

where I − I(t) designates the image frames excluding the one on which we estimate optical
flow. Ignoring the denominator, which does not involve V, we have

V = argmax_V P(I(t) | V, I − I(t)) P(V | I − I(t))   (4.2)
where P (I(t)|V, I − I(t)) is the likelihood of observing the image I(t) given the optical flow
V and its neighboring frames I − I(t); P (V |I − I(t)) is the prior probability density of the
optical flow.
4.1.2 MRF Prior Model
We model the prior distribution of the optical flow using a Markov random field. The
MRF is a highly effective model for piecewise smoothness. It was first introduced for image
restoration by Geman and Geman [42] and has been widely employed in motion estimation
to preserve boundaries [88, 77, 57, 14]. The elegance of the MRF lies in the fact that once a
neighborhood system N is defined, the MRF/Gibbs distribution equivalence allows the prior
distribution with respect to N to be expressed in terms of a potential function ES(V) as

P(V | I − I(t)) = exp(−ES(V))/Z.
Figure 4.1: Comparison of the Geman-McClure norm (solid line) and the L2 norm (dashed line): (a) error norms ρ(x, σ) with σ = 1, (b) influence functions ψ(x, σ) = ρ′(x, σ), (c) corresponding probability density functions pdf ∝ exp(−ρ(x, σ)) (truncated).
The partition function Z is a normalizing constant. ES(V) is the flow smoothness energy
modeled as a sum of site potentials: ES(V) = Σs ES(Vs).
We use a second-order neighborhood system of only pairwise interactions in the flow
field prior. Correspondingly, the local flow smoothness potential ES(Vs) is specified by the
average deviation of Vs from its 8-connected neighbors Vi, i ∈ N8s:

ES(Vs) = (1/8) Σ_{i∈N8s} ρ(Vs − Vi, σSs).   (4.3)
Here σSs is the flow variation scale at the site s, and the error norm ρ(x, σ) reflects the flow
deviation distribution.
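Eq. (4.3) transcribes directly into code (a sketch with our own function names; we apply ρ to the magnitude of the vector difference and skip border sites for brevity):

```python
import numpy as np

def rho(x, sigma):
    """Geman-McClure error norm applied to the magnitude of x."""
    r2 = float(np.sum(np.square(x), axis=-1))
    return r2 / (r2 + sigma ** 2)

def smoothness_energy(V, sigma_s):
    """E_S summed over interior sites: (1/8) * sum over the 8-connected
    neighbours of rho(V_s - V_i, sigma_s)."""
    E = 0.0
    h, w, _ = V.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            E += sum(rho(V[y, x] - V[y + dy, x + dx], sigma_s)
                     for dy, dx in offs) / 8.0
    return E
```

A perfectly uniform flow field has zero smoothness energy; any deviation from the neighbours raises it, but only up to the bounded value of ρ.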
The choice of ρ is the decisive factor of the boundary preservation capability of an MRF
formulation. If ρ is an L2 norm and σSs is a fixed global parameter, the flow prior potential
reduces to the smoothness error in the Horn and Schunck formulation (Eq. (2.6)), which
does not preserve motion discontinuities at all. Geman and Geman [42] modeled continuous
surfaces as an MRF, and introduced the “line process”, a set of binary variables indicating
edges, as a dual MRF. This formulation has been widely adopted in motion estimation
[88, 77, 57, 14]. It was shown by Blake and Zisserman [20] to be equivalent to assuming ρ
as a truncated quadratic function. In a robust statistics context, Black [14, 19] generalized
the line process to an analog “outlier process”. We adopt this point of view in designing
the error norm. Further insight into the error norm and our design, from the distribution
and robust statistics perspectives, is given in the following two sections.
4.1.3 Likelihood Model: Robust Three-Frame Matching
If the likelihood P(I(t) | V, I − I(t)) is a site-independent exponential distribution
proportional to exp(−Σs EB(Vs)), the posterior distribution is also Gibbs, with the potential
resembling the regularization global energy (Eq. (2.6)). We take this approach so that
specifying the likelihood term reduces to modeling the brightness conservation error and its
potential function.
We use the matching constraint Eq. (2.1) to model brightness conservation. The tradi-
tional assumption that pixels are visible in all frames is a major source of gross errors in
occlusion areas. Taking such violations as outliers [14] may prevent error from propagating
to nearby regions, but does not provide constraints for occlusion pixels and thus does not
help their motion estimation. We observe that without temporal aliasing, all points in a
frame are visible in the previous or the next frame. Therefore we define the matching error
as the minimum of the backward and forward warping errors in three frames, i.e.,

eW(Vs) = min(|Ib(Vs) − Is|, |If(Vs) − Is|)

where Is is the intensity of pixel s in the middle frame; Ib(Vs), If(Vs) are warped intensities
in the previous and the next frames respectively. We are the first to explicitly model
correspondence at occlusions in optical flow estimation [147]. A similar idea, known as a
temporally shiftable window, has shown high effectiveness in handling occlusions in multi-
view stereo [73].
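The three-frame matching error can be sketched as follows (our own helper names; nearest-neighbour sampling stands in for the bilinear interpolation actually used, to keep the sketch short):

```python
import numpy as np

def warp_error(I_ref, I_other, V, sign):
    """|I_other(s + sign * V_s) - I_ref(s)| with nearest-neighbour
    sampling, coordinates clipped at the image border."""
    h, w = I_ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xw = np.clip(np.round(xs + sign * V[..., 0]).astype(int), 0, w - 1)
    yw = np.clip(np.round(ys + sign * V[..., 1]).astype(int), 0, h - 1)
    return np.abs(I_other[yw, xw] - I_ref)

def matching_error(I_prev, I_mid, I_next, V):
    """e_W = min(backward, forward) warping error: each pixel of the
    middle frame only needs to be visible in one neighbouring frame."""
    e_b = warp_error(I_mid, I_prev, V, sign=-1.0)
    e_f = warp_error(I_mid, I_next, V, sign=+1.0)
    return np.minimum(e_b, e_f)
```

For a pattern translating at one pixel per frame, the minimum is zero everywhere: pixels hidden in one direction are matched in the other, which is the occlusion argument made above.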
It is conventional to assume that matching error comes from iid Gaussian noise and
correspondingly use its L2 norm as the potential function [58]. However, image noise is not
always Gaussian due to abrupt perturbations, and the matching error can come from other
sources such as warping failures. It may often take large values and thus has a distribution
with fatter tails than Gaussian. To represent the distribution more realistically, we use a
robust error norm ρ(x, σ) to define the potential, yielding
EB(Vs) = ρ(eW (Vs), σBs), (4.4)
where σBs is the local brightness variation scale. Figure 4.1 gives the ρ, ψ curves for the L2
and Geman-McClure error norms [15] and their corresponding pdfs¹. The Geman-McClure
pdf (Figure 4.1(c)) has much fatter tails than the Gaussian.
The above prior and likelihood models are also justified from the robust statistics per-
spective. In optical flow estimation, small matching errors and smooth flows are dominant;
large errors and motion discontinuities can be considered as outliers to the modelled struc-
ture. Hence, applying robust constraints to both the EB and ES terms serves to reduce
the impact of local perturbations and prevent flow smoothing across boundaries. In addi-
tion, it gracefully handles motion estimation and segmentation, a difficult “chicken-and-egg”
problem, since motion boundaries can be easily located as flow smoothness outliers [15].
4.1.4 Global Energy with Local Adaptivity
A robust error norm is usually chosen to possess certain desirable properties to suit the
problem at hand. We use the Geman-McClure robust error function [15]
ρ(x, σ) = x²/(x² + σ²)

in both the EB and ES terms for its redescending [15, 106] and normalizing properties. The
first property ensures that the influence of outliers tends to zero. We take errors exceeding

τ = σ/√3,   (4.5)

where the influence function begins to decrease, as outliers [15]. This is equivalent to
identifying pixels with error norm ≥ 0.25 as outliers. The normalization property is desirable
because it makes the degrees of flow smoothness and brightness conservation comparable.
Together with the spatially varying scales, it allows the relative strength of these two terms
to be adaptive: where the observation is not trustworthy (σBs is large), stronger smoothness
is enforced, and vice versa. The scales are gradually learned from the image data, as we will
discuss in Sections 4.2.2 and 4.2.3.

¹ρ(x, σ) does not necessarily represent a proper distribution, like the Gaussian, which is defined on x ∈ (−∞, ∞). But we consider it appropriate in an application to define a reasonable range of expected errors and obtain a pdf by normalization in this range. Figure 4.1(c) shows the pdfs for x ∈ (−5, 5).
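Both properties are easy to verify numerically (a small check, not thesis code):

```python
import math

def rho(x, sigma):
    """Geman-McClure error norm: bounded by 1 (the normalizing property)."""
    return x * x / (x * x + sigma * sigma)

def psi(x, sigma):
    """Influence function rho'(x): rises, peaks at x = sigma/sqrt(3), then
    redescends toward zero, so gross outliers lose their influence."""
    s2 = sigma * sigma
    return 2.0 * x * s2 / (x * x + s2) ** 2

sigma = 1.0
tau = sigma / math.sqrt(3.0)   # outlier threshold of Eq. (4.5)
```

At x = τ the norm evaluates to (σ²/3)/(σ²/3 + σ²) = 0.25, which is exactly the error-norm threshold stated above, and ρ never exceeds 1 no matter how large the error.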
Finally, the complete global energy is expressed as

E(V) = Σs [ ρ(min(|Ib(Vs) − Is|, |If(Vs) − Is|), σBs) + (1/8) Σ_{i∈N8s} ρ(Vs − Vi, σSs) ].   (4.6)
This design extends current robust global formulations [15, 86] in two aspects. First of
all, the three-frame matching error models correspondences even at occlusions and enables
higher accuracy upper bounds, which gradient-based or two-frame methods cannot achieve.
Secondly, the locally adaptive scheme is more reasonable than those taking σB, σS , λ as
fixed global parameters and eases parameter tuning in experiments.
4.2 Optimization
As we have discussed in Section 2.4, the global energy Eq. (4.6) resides in a high-dimensional
space and is nonconvex. Even finding its local minima is not easy in the absence of an explicit
gradient expression. Because no general global optimization technique is known to provide a practical
solution, we take a graduated deterministic approach to the minimization problem. We start
from an initial estimate and progressively minimize a series of finer approximations to the
original energy. In this process, we exploit the advantages of various formulations and
solution techniques for accuracy and efficiency.
Our first approximation is to replace the matching error by the OFCE, which
enables simple gradient evaluation and more efficient exploration of the solution space. This
step needs a good-quality initial estimate to start with. We provide this initial estimate
from a yet cruder approximation, a gradient-based local regression method. This method
is cruder, because the global smoothness is not enforced and estimation is solely based
on local data. After these two steps, we usually have very high accuracy except at motion
boundaries, and then we can directly minimize the original energy to correct residual errors.
We build the process in a coarse-to-fine framework to handle large motions and expedite
convergence. Details of this algorithm are explained below.
4.2.1 Step I: Gradient-Based Local Regression
Suppose a crude flow estimate V0 is available and has been compensated for. Step I uses
the robust gradient-based local regression method that we have developed in Section 3.2
to compute the incremental flow ∆V . Both least-median-of-squares (LMedS) and least-
trimmed-squares (LTS) were tried and yielded similar results, so henceforth our discussion
is based on LMedS. This step generates high-quality initial flow estimates. Its effectiveness
as an independent optical flow estimation approach has been verified in various studies
[5, 97, 145].
4.2.2 Step II: Gradient-Based Global Optimization
∆V0, the incremental flow resulting from Step I, has good accuracy at most places, but its
quality degrades where local constraints become unreliable. We improve its coherence using
a gradient-based global optimization method, which is a better approximation to Eq. (4.6).
The energy to minimize is

E(∆V) = Σs [ ρ(eG(∆Vs), σBs) + (1/8) Σ_{i∈N8s} ρ(Vs + ∆Vs − Vi − ∆Vi, σSs) ]   (4.7)
where eG is the OFCE error (Eq. (2.3)), and Vs is the sth vector of the initial flow V0. The
local scales σBs, σSs are important parameters which control the shape of E and hence the
solution. Below we describe how to estimate them from Step I’s results.
Suppose normal errors are Gaussian variables with zero mean and standard deviation
σ̂; then those exceeding 2.5σ̂ can be considered as outliers. Contrasting this threshold with
Eq. (4.5), we can establish an equivalence between the Geman-McClure scale σ and the
Gaussian standard deviation σ̂ as

σ = 2.5√3 σ̂.

If we have a sample standard deviation σ̂, we may compute σ using the above formula.
At a site s, we calculate σSs as the sample standard deviation of the “inliers” among
Vi − Vs, i ∈ N8s. Inliers are selected using the RLS procedure described in the previous section.
Some σSs values might be very large due to bad flow estimates. We put a cap on them:
1.4826 median_s σSs. For stability, as well as for the estimate to be reasonable, we
also put a lower limit of 0.001 pixels/frame on these values. We calculate σBs as the OFCE
residual, and similarly bound the value to the range [0.01, 1.4826 median_s σBs].
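The scale bookkeeping amounts to a clip against a MAD-style robust cap plus the scale conversion above (a sketch; `bounded_scales` and `gm_scale` are our own names):

```python
import numpy as np

def bounded_scales(sigma_raw, lower):
    """Clip per-site scale estimates into [lower, 1.4826 * median], the
    upper bound being the MAD-style robust estimate over all sites."""
    cap = 1.4826 * np.median(sigma_raw)
    return np.clip(sigma_raw, lower, cap)

def gm_scale(gaussian_std):
    """Geman-McClure scale equivalent to a Gaussian standard deviation:
    sigma = 2.5 * sqrt(3) * sigma_hat."""
    return 2.5 * np.sqrt(3.0) * gaussian_std
```

The cap keeps a few wildly bad flow estimates from inflating the smoothness scale, while the floor prevents a degenerate zero scale.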
Now that the scales are all specified, we minimize the energy using Successive OverRelaxation
(SOR) [20, 14, 102]. Starting with the initial estimate ∆V0, on the nth iteration,
each u component (and v similarly) is updated as

u_s^n = u_s^{n−1} − ω (1/T(u_s)) ∂E/∂u_s^{n−1},

where

T(u_s) = I_x²/σBs² + 8/σSs².
SOR is well known to be good at removing high-frequency errors but very slow at removing
low-frequency errors [137, 23]. In our algorithm, the initial estimate has predominantly high-
frequency errors: it has good accuracy at most places but may lack coherence due to the
local constraints. In such a case, the SOR procedure is very effective and converges fast.
In addition, the update step size is adaptively adjusted by the local scales, which further
improves the efficiency in exploring the solution space.
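One sweep of this update can be sketched as follows. For clarity we use the quadratic (L2) version of the energy, so the derivative terms are linear; the thesis uses the robust norm, which only changes the `dEu`, `dEv` expressions. With ω = 1 this is plain Gauss-Seidel; ω ∈ (1, 2) gives SOR.

```python
import numpy as np

def sor_sweep(u, v, Ix, Iy, It, sigma_B, sigma_S, omega=1.0):
    """One in-place Gauss-Seidel/SOR sweep on a quadratic approximation of
    the global gradient energy; T(u_s) = Ix^2/sigma_B^2 + 8/sigma_S^2."""
    h, w = u.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nb_u = sum(u[y + dy, x + dx] for dy, dx in offs)
            nb_v = sum(v[y + dy, x + dx] for dy, dx in offs)
            sB2, sS2 = sigma_B[y, x] ** 2, sigma_S[y, x] ** 2
            r = Ix[y, x] * u[y, x] + Iy[y, x] * v[y, x] + It[y, x]  # OFCE residual
            dEu = Ix[y, x] * r / sB2 + (8.0 * u[y, x] - nb_u) / sS2
            dEv = Iy[y, x] * r / sB2 + (8.0 * v[y, x] - nb_v) / sS2
            u[y, x] -= omega * dEu / (Ix[y, x] ** 2 / sB2 + 8.0 / sS2)
            v[y, x] -= omega * dEv / (Iy[y, x] ** 2 / sB2 + 8.0 / sS2)
    return u, v
```

With zero image gradients the update drives each interior vector toward its neighbourhood average, which is exactly the high-frequency smoothing behaviour described above.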
4.2.3 Step III: Global Matching
The incremental flow from Step II, ∆V1, and the initial estimate V0 add up to V1, which
still exhibits gross errors at motion boundaries and other places where gradient evaluation
fails. But it is overall a sufficiently precise representation, based on which we are ready to
consider the original formulation Eq. (4.6).
The computation of local scales is similar to that in Step II with a few differences. We
adopt a globally constant matching-error standard deviation σB. It is a bounded robust
estimate from all matching errors σBs: max{0.08, 1.4826 median_s σBs}. The flow vector
standard deviation is kept spatially varying within [0.004, 0.02] pixels/frame.
We minimize the global energy function by greedy propagation. We first calculate the
energy EB(Vs)+ES(Vs) from V1 for all pixels. Then we iteratively visit each pixel, examining
whether a trial estimate from a candidate set results in a lower global energy E. The
candidate set consists of the 8-connected neighbors and their average, which were updated
in the last visit. Once a pixel energy decrease occurs, we accept the candidate and update the
related energy terms. This simple scheme works reasonably well because bad estimates are
confined to narrow areas in the initial flow V1, and it converged quickly in our experiments. Since
each flow estimate Vi affects E only through its own energy and the smoothness energies of
its 8-connected neighbors, the updating is entirely local and can be carried out in parallel [42].
It is worth mentioning that a similar greedy propagation scheme was successfully applied
to solving a global matching stereo formulation in an independent study [129].
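The greedy propagation loop can be sketched as below; `energy_at` is a user-supplied callback (our own name) returning the sum of the energy terms that depend on V_s, which by the locality argument above is all that E changes by when V_s is updated:

```python
import numpy as np

def greedy_propagation(V, energy_at, max_sweeps=10):
    """Visit each interior pixel, try the 8 neighbours and their mean as
    candidate flow vectors, and keep any candidate that lowers the local
    (hence the global) energy.  Stops when a sweep changes nothing."""
    h, w, _ = V.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for _ in range(max_sweeps):
        changed = False
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                nbrs = [V[y + dy, x + dx].copy() for dy, dx in offs]
                candidates = nbrs + [np.mean(nbrs, axis=0)]
                best = energy_at(V, y, x)
                for c in candidates:
                    old = V[y, x].copy()
                    V[y, x] = c
                    e = energy_at(V, y, x)
                    if e < best:
                        best, changed = e, True
                    else:
                        V[y, x] = old
        if not changed:
            break
    return V
```

Used with a smoothness-style local energy, an isolated gross error is quickly replaced by a nearby vector, which is how the narrow bands of boundary errors in V1 are corrected.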
4.2.4 Overall Algorithm
We employ a hierarchical process [12] to cope with large motion and to expedite convergence.
We create a P-level image pyramid I^p, p = 0, …, P − 1, and start the estimation from the
top (coarsest) level P − 1 with a zero initial flow field. At each level p, we warp the image
sequence I^p using the initial flow V0^p, obtaining image frames Iw^p. On Iw^p we initialize the
residual flow using the local gradient method, enhance it using the global gradient method,
and add it to V0^p, yielding V1^p. Then we refine V1^p by applying the global matching method
to I^p, resulting in the final flow estimate on level p, V2^p, which is projected down to level
p − 1 as its initial flow field V0^{p−1}. At last the flow estimate at the original resolution is V2^0.
Operations on each pyramid level are illustrated in Figure 4.2. There is an exception: when
more than one pyramid level is used, we skip Step III on the coarsest level. The rationale
is that gradient-based methods suffice on the coarsest level, since the data are substantially
smoothed and the flow is incremental; applying the matching constraint is usually harmful
due to the smoothing and possible aliasing.
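The per-level control flow can be sketched as a driver loop (our own callback names; warping with the current flow is assumed to happen inside the Step I/II callbacks, and nearest-neighbour upsampling stands in for the bilinear projection used in practice):

```python
import numpy as np

def estimate_flow_pyramid(pyramid, local_gradient, global_gradient,
                          global_matching):
    """Coarse-to-fine driver: at each level run Steps I+II, add the
    increment, refine with Step III (skipped on the coarsest level when
    P > 1), and project the flow down to the next level."""
    P = len(pyramid)                         # pyramid[0] is the finest level
    h, w = pyramid[-1][0].shape
    V = np.zeros((h, w, 2))                  # zero flow at the top level
    for p in range(P - 1, -1, -1):
        frames = pyramid[p]
        dV = local_gradient(frames, V)       # Step I: robust local regression
        dV = global_gradient(frames, V, dV)  # Step II: global gradient (SOR)
        V1 = V + dV
        if p < P - 1 or P == 1:
            V1 = global_matching(frames, V1) # Step III: global matching
        if p > 0:
            # project down one level: upsample by 2 and double the vectors
            V = 2.0 * np.kron(V1, np.ones((2, 2, 1)))
        else:
            V = V1
    return V
```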
Hierarchical schemes have become standard in motion estimation, but their limitations
are often overlooked. The projection and warping operations oversmooth the flow field;
errors in coarser levels are magnified and propagated to finer levels and are generally irre-
versible [14]. These problems are much alleviated by the global matching step—it works
on the original pyramid images and corrects gross errors caused by derivative computation,
projection and warping.
Figure 4.2: System diagram (operations at each pyramid level)
From a practical point of view, the graduated scheme benefits from the merits of all three
popular optical flow approaches and overcomes their limitations. Step I uses a gradient-based
local regression method for high-quality initialization, while leaving local ambiguities to be
resolved later in more global formulations. Step II improves the flow coherence using the
gradient-based global optimization method, which converges fast because of the good ini-
tialization. Step III adopts a matching-based global formulation to correct gross errors
introduced by derivative computation and the hierarchical process. Matching-based for-
mulations have been studied before, but their advantages over gradient-based counterparts
were not apparent due to computational difficulties [88, 77, 14]. We provide, for the first
time, a practical solution and achieve highly competitive accuracy and efficiency.
4.3 Experiments
This section demonstrates the performance of the proposed technique on various synthetic
and real data and makes comparison with previous techniques.
The settings in our algorithm are given below. Optical flow is estimated on the middle
frame of every three frames. No image pre-smoothing is done. Derivatives are calculated
from a first-order spatiotemporal facet model [54] on a support of size 3× 3× 3 [145]. The
constant flow window size in Step I is set to W = 9. Sites at image borders use the valid
part of the window so that the resulting flow field is of the same size as the original image.
In Step II, 20 iterations are used for SOR. The values of the local scale bounds have been
given in Sections 4.2.2 and 4.2.3. The image pyramid is constructed by sub-sampling 3 × 3
Gaussian-smoothed images. Projection expands the flow field by a factor of 2 in each dimension
with bilinear interpolation. Bilinear interpolation is also used for image warping. The above factors are
kept constant in all experiments. The only tuning parameter is the number of pyramid
levels. For each data set, we choose the number of levels to be just large enough for the
gradient-based constraints to hold on the finest level. Larger numbers introduce more errors
due to suppression of fine structures in reduction and smoothing in projection and warping
(Section 4.2.4). Adaptive hierarchical control [9, 14] is an important open problem, which is
not tackled in this work.
Close comparison is given with Black and Anandan’s dense regularization technique
(BA) [14], whose code is publicly available. We modified their code to output floating-point
data. BA calculates flow on the second of two frames. It uses the same number of pyramid
levels as ours, and other parameters are set as suggested in [14]. All experiments are carried
out on a PIII 900MHz PC running Linux. The computing time of our algorithm depends on
the motion complexity of the input data. It is typically close to that of BA. Some sample
CPU time values (in seconds) for our algorithm and BA are: 11.7 and 14.7 (Taxi), 29.5
and 27.4 (Flower Garden), 36.8 and 24.2 (Yosemite). Note that neither algorithm has been
optimized for speed.
4.3.1 Quantitative Measures
Quantitative evaluation can be conducted on data sets with flow groundtruth by reporting
statistics of certain error measures. The most frequently adopted error measures are the
angle and magnitude of the error vector; the first and second order statistics are commonly
reported.
We propose to use e, the absolute u or v error, as the error measure. It is a consistent
and fair measure, since the u, v components and positive and negative errors are treated
symmetrically in optical flow estimation. Also, this 1-D measure is much easier to work with
than 2-D or higher-dimensional measures. In considering what statistics to use, we find
the popular first- and second-order statistics not representative enough for such a highly
skewed e distribution. Therefore we give the empirical cumulative distribution function
(cdf) of e in addition to its mean ē. Better estimates should have cdfs closer to the ideal
unit step function. In order to facilitate comparison with other techniques, we also report
the popular average angular error e∠ [7]. The ē and e∠ values for five synthetic image sequences
are summarized in Table 4.1.
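The measure and its empirical cdf are straightforward to compute (a sketch with our own function name):

```python
import numpy as np

def error_cdf(u_est, v_est, u_true, v_true, grid=None):
    """Empirical cdf of e, the absolute u or v error: u and v errors are
    pooled into one sample so both components and both signs are treated
    symmetrically.  Returns (thresholds, cdf values, mean error)."""
    e = np.abs(np.concatenate([(u_est - u_true).ravel(),
                               (v_est - v_true).ravel()]))
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    cdf = np.array([(e <= t).mean() for t in grid])
    return grid, cdf, float(e.mean())
```

A perfect estimate yields the ideal unit step (the cdf is 1 at every threshold); worse estimates push mass toward larger e and flatten the curve.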
4.3.2 TS: An Illustrative Example
Figure 4.3: TS sequence results. (a) middle frame. The motion boundary is highlighted with a solid white line. In order to provide details near motion boundaries, (b,c,d,f) show flows in the window outlined by the dotted line. (b) groundtruth and our estimate look the same. (c) BA estimate. (d) LS estimate in Step I. (f) Step I result. Step II result looks identical and is hence not shown separately. (g) error cdf curves.

The Translating Squares (TS) sequence (64 × 64, Figure 4.3(a)) was created to examine the
theoretical merits of different approaches. It contains two squares translating at exactly 1
pixel/frame, with the foreground square outlined in solid white.
groundtruth is given for the part near the boundary (marked by white dots). The images
are well textured and noise-free. The motion is small and thus no hierarchical process is
needed. In such an ideal setting, an optimal formulation assuming brightness conservation
and piecewise smoothness should fully recover the flow.
Our method does achieve the performance upper bound. The result is almost perfect;
its vector plot looks the same as the groundtruth (Figure 4.3(b)), the average errors are
negligible (Table 4.1), and the error cdf curve (Figure 4.3(g), curve “S3”) closely resembles
the unit step function.
Figure 4.3(d) shows the flow estimate from the LS initialization in Step I, which can
be considered as an embodiment of the Lucas and Kanade technique [82]. Because of LS’s
zero tolerance to outliers, the flow is completely smoothed out near the motion boundary
(shadowed). Figure 4.3(e) shows the final result of Step I. Replacing LS by LMS dramatically
improves the boundary accuracy, as is also clear from comparing curves “LS” and “S1” in
Figure 4.3(g). This demonstrates the necessity of robustification.
Due to gradient evaluation failure, gross errors still remain at the motion boundary in
Figure 4.3(e). Moreover, the corners are rounded because the background motion becomes
dominant there. These problems are characteristic of robust local gradient techniques [5,
97, 145], and they become more severe as the number of pyramid levels and the constant-flow
window size W increase. Since the TS sequence is well textured and there is no serious
aperture problem, the improvement from the global OFC formulation (Step II) is minimal
(see Figure 4.3(g), curve "S2"). The remaining gross errors at the motion boundary are
unavoidable for gradient-based techniques. They are finally corrected in Step III.
BA yields poor accuracy on this data set (Figure 4.3(c), Table 4.1). The oversmoothing
bias introduced by the LS initialization is not effectively corrected in the continuation
process. The SOR procedure converges very slowly. The suggested 20 iterations [15] do
not seem to be sufficient (see Figure 4.3(g), curve "BA"). Even after 400 iterations (curve
"BA400") the bias persists and the accuracy remains low.
Data   Technique   e∠ (°)    e (pix)
TS     BA          8.04      0.12
       Ours        1.1e-2    2.2e-4
TT     BA          2.60      0.07
       Ours        0.05      9.8e-3
DT     BA          6.36      0.11
       Ours        2.60      0.05
YOS    BA          2.71      0.12
       Ours        1.92      0.08
DTTT   BA          10.9      0.20
       Ours        4.03      0.08

Table 4.1: Quantitative measures
4.3.3 Barron’s Synthetic Data
The Translating Tree (TT), Diverging Tree (DT) and Yosemite (YOS), data sets were
introduced in Section 3.2.2. We use two levels of pyramid for TT and DT, and three levels
for YOS. The cloud part in YOS is excluded from evaluation. As it is consistently observed
from the average error measures (Table 4.1) and the error cdf curves (Figure 4.4), our
method achieves very high accuracy and consistently out-performs BA 2.
Most optical flow papers published after [7] report the average angular error e∠ on YOS.
Some of the results are quoted in Table 4.2. The first group take a dense regularization
approach assuming piecewise constant flow. To our knowledge, our method gives the smallest
error among such techniques. The second group make stronger flow model assumptions, such
as local affine flow or constant flow over a considerable number of frames. These assumptions
are appropriate on the YOS data set and may lead to higher accuracy. The smallest error
on YOS was reported by Farneback [35]. The algorithm couples orientation-tensor-based

²The BA angular error obtained here is different from the one reported by Black and Anandan [15], most probably because their data are different from Barron's and they calculated flow on the 14th frame. We adopt Barron's experimental setup for wider comparability.
[Figure 4.4 panel residue: error cdf plots, e (pixels) vs. cdf(e), legend: Ours, BA; (a) TT, (b) DT, (c) YOS]
Figure 4.4: Error cdf curves.
spatiotemporal filtering with region-competition-based segmentation, and estimates locally
affine motion over 9 frames. Although our method uses only low-level models and 3 frames,
it still compares favorably with these techniques.
DTTT: Motion Discontinuities
The above three data sets, including YOS, contain smooth motions and cannot display
the discontinuity-preserving capability of our method. We synthesize the DTTT sequence
(150 × 150) for this purpose. DTTT was generated from TT, DT and “cookie cutters”:
image data inside the cookie cutters come from TT and those outside come from DT. Its
middle frame with motion boundaries highlighted is given in Figure 4.5(a). We use two
pyramid levels for this set. For images of realistic sizes, vector plots with enough details do
not fit the page. Following [15], we show the horizontal and vertical flow components u, v
as intensity images, with brighter pixels representing larger speeds to the right. We linearly
stretch the image contrast so as to use the full intensity range.
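The display convention can be sketched as follows; the function and values are illustrative, not part of the experimental code.

```python
import numpy as np

def stretch_to_gray(u):
    """Linearly map a flow component to [0, 255] so the display uses
    the full intensity range (brighter pixels = larger values)."""
    lo, hi = float(u.min()), float(u.max())
    if hi == lo:                       # constant image: map to mid-gray
        return np.full(u.shape, 128, dtype=np.uint8)
    return np.round(255.0 * (u - lo) / (hi - lo)).astype(np.uint8)

# A tiny horizontal-flow "image" with speeds from -2 to 2 pixels/frame.
u = np.array([[-2.0, 0.0], [1.0, 2.0]])
print(stretch_to_gray(u))
```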
Our flow estimate (Figure 4.5) has a clear layered look: it exhibits crisp motion
discontinuities and smoothness at other places. Figure 4.5(d) marks motion boundary pixels
in black. They are located as smoothness outliers, i.e., those whose final pixel smoothness
Technique                              e∠ (°)
Ye, Haralick and Shapiro (proposed)    1.92
Sim and Park [117]                     4.13
Black and Anandan [15]                 2.71
Szeliski and Coughlan [127]            2.45
Memin and Perez [86]                   2.34
Black and Jepson [35]                  2.29
Ju, Black and Jepson [71]              2.16
Bab-Hadiashar and Suter [5]            1.97
Farneback [35]                         1.14

Table 4.2: Comparison of various techniques on Yosemite (cloud part excluded) with Barron's angular error measure
energy exceeding 0.25 (see Eq. 4.5). BA's result is oversmoothed with local perturbations:
at many places foreground and background motions invade one another; meanwhile, a number
of false boundaries arise, corresponding to noise or intensity edges erroneously taken as
motion discontinuities. This is also reflected by its boundary map (output from their code),
which has many spurious detections. Note that their discontinuity is 1 pixel thick since
they only mark one of each pair of mutual outliers. Our result has much higher quantitative
accuracy than BA's, as shown in Figure 4.4 and Table 4.1.
However, in our estimate we do notice some gross errors near motion boundaries. For
example, the right corner of the triangle is smoothed into the background. A closer look
reveals that most of these errors happen in textureless regions, where even human viewers
are unable to tell what the actual motion is (aperture problem). In such situations, the
correctness of the "groundtruth" becomes questionable, and so does the authority of quantitative
evaluation based on it. For this reason, together with the simplicity of synthetic data and
error measures, "quantitative" results should be considered qualitative at best. The above
suggests that the inherent ambiguity of optical flow should be considered in quantitative
evaluation: a more convincing evaluation method should allow larger errors in regions of
[Figure 4.5 panel residue: (a) middle frame; (b) our horizontal flow; (c) our vertical flow; (d) our motion boundaries; (e) error cdf curves, e (pixels) vs. cdf(e), legend: Ours, BA; (f) BA horizontal flow; (g) BA vertical flow; (h) BA motion boundaries]
Figure 4.5: DTTT sequence results (motion boundaries highlighted in (a)).
[Figure 4.6 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our smoothness error (ES) map]
Figure 4.6: Taxi results.
less local information.
Also noticeable in our estimate is that our motion boundaries are not as smooth as one
would like. This is partly due to the weakness of the simple optimization method in Step
III. Developing more suitable optimization methods is an important direction in our future
work.
4.3.4 Real Data
In this section we show results on four well-known real image sequences: Taxi, Flower Garden,
Traffic and Pepsi. Taxi and Traffic contain independent motions; motions in the other two
data sets are caused by camera motion and scene depth. For each data set we give the
middle frame, the horizontal flow u from BA and our method, and the smoothness error ES
map from our method.
The Taxi sequence (256 × 190) is obtained from Barron [7]. It mainly contains three
moving cars (from left to right) at image speeds about 3.0, 1.0, 3.0 pixels/frame respectively.
The van on the left has low contrast and surface reflectance. The truck on the right is
fragmented by a tree in front. Difficulties also arise from the low image quality. Optical
flow is estimated on the 9th frame. Two pyramid levels are used. BA's result is almost
smoothed out. Better boundary performance might be obtained by tuning parameters.
But as we have discussed earlier, smoothing seems to be inevitable for BA especially on
data of such diverse motions. Our method yields a reasonable flow estimate and a motion
boundary map. Note that the car regions include shadows which move with the cars at the
same speeds. Motion boundaries inside the truck reflect the motion fragmentation.
Motion in the Flower Garden sequence (360 × 240, from Black) is caused by camera
translation and scene depth. The image speed of the front tree is as large as about 7
pixels/frame. Optical flow is estimated on the 2nd frame. Three pyramid levels are used.
In both BA’s and our results, the motion of the tree twigs smears into the background.
This is another example of inherent flow ambiguity (aperture problem). BA’s estimate has
considerable oversmoothing between layers. Our result shows clear-cut motion boundaries
and smooth flows within each layer. Its accuracy is highly competitive with those from
[Figure 4.7 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our smoothness error (ES) map]
Figure 4.7: Flower garden results.
model- or layer-based techniques [109, 131].
Consistent observations are made on the remaining two data sets. The Traffic sequence
(512× 512, from Nagel) contains eleven moving vehicles with the maximum image speed at
about 6 pixels/frame. Optical flow is estimated on the 8th frame with three pyramid levels.
The motorcycle in the building shadow (upper middle) is missed by BA but picked out by
our method.
The Pepsi sequence (201 × 201) was used by Black [15] to illustrate motion boundary
preservation capability. Like Flower Garden, its motion discontinuities are caused by camera
translation and scene depths. The maximum image speed is about 2 pixels/frame. Optical
flow is estimated on the 3rd frame with three pyramid levels. We exclude a 5-pixel wide
[Figure 4.8 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our smoothness error (ES) map]
Figure 4.8: Traffic results.
[Figure 4.9 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our ES map]
Figure 4.9: Pepsi can results.
border from BA's result to obtain better contrast. The erroneous flow and discontinuity
estimated at the lower-left corner are also caused by poor texture.
4.4 Conclusions and Discussion
This chapter has presented a novel approach to optical flow estimation assuming brightness
conservation and piecewise smoothness. From a Bayesian perspective, we propose a formu-
lation based on three-frame matching and global optimization allowing local variation, and
we solve it under a graduated minimization strategy. Extensive experiments verify that the
new method outperforms its competitors and yields good accuracy on a wide variety of
data.
The contributions of our work to visual motion estimation are summarized as follows.
• We introduced backward-forward matching for optical flow estimation. It avoids prob-
lematic derivative evaluation and models correspondences more faithfully than popular
gradient-based constraints and those ignoring the visibility problem at occlusions.
• We designed the global energy to automatically balance the strength of brightness and
smoothness errors according to local data variation. It is more complete and adaptive
than previous designs containing rigid tuning parameters.
• As a by-product of the robust formulation, motion discontinuities can be reliably
located as flow smoothness outliers.
• We developed a three-step graduated optimization strategy to minimize the resultant
energy. It is the first efficient algorithm yielding good accuracy for a global matching
formulation.
• The solution technique takes advantage of gradient-based local regression, gradient-
based global optimization and matching-based global optimization methods and over-
comes their limitations. The local gradient step provides a high-quality initial flow,
while leaving local ambiguities to be resolved later in more global formulations. The
global gradient step improves the flow coherence and it converges fast because of
the good initialization. The global matching step corrects gross errors introduced by
derivative computation and the hierarchical process.
• We proposed a deterministic algorithm to approximate the high-breakdown robust
estimator in the local gradient step. It can be faster and more accurate than algorithms
based on random sampling.
Many of the above conclusions are also applicable to other low-level visual problems
such as stereo matching, 3D surface reconstruction and image restoration.
As an accurate and efficient low-level approach to visual motion analysis, the new method
has great potential in a wide variety of applications. First of all, it provides a good
starting point for higher-level motion analysis. Our flow estimates already have a layered
look, and the motion boundaries of layers are closed curves. They can reliably initialize
motion segmentation [109], contour-based [86] and layered [131] representation. Model selection [130]
is a crucial problem in automatic scene analysis [16] which is difficult because comparing
a collection of models on the raw image data involves formidable computation. Our re-
sults can ease this task by supplying a higher ground for scene knowledge learning. The
backward-forward matching error, together with detected motion boundaries, can facilitate
occlusion reasoning [16]. It may also guide image warping to avoid smoothing across motion
discontinuities. Some success has been obtained in our preliminary experiments. This is
important for motion estimation as well as for novel view synthesis.
A noticeable problem in our results is that motion boundaries are not very smooth.
This is in part due to the simplicity of our minimization method: it has a limited ability
to generate new flow values, and propagating only among immediate neighbors might be slow
and can get stuck at trivial local minima. For the purpose of global optimization, methods
such as graph cuts [23], which yield very good results in stereo matching, full multigrid
methods [102], Bayesian belief propagation (BBP) [137], and local minimization methods
alternative to SOR [102] are worth studying.
Furthermore, the benefits of the Bayesian framework should be fully exploited. Among
all criteria from which the global energy may arise, we find the Bayesian approach most
appealing, of both theoretical and practical interest. Estimating optical flow from a few images
is inherently ambiguous: areas with more appropriate textures have higher estimation certainty.
This indicates that the nature of the problem is probabilistic rather than deterministic.
Furthermore, the Bayesian formulation may provide a graceful solution to two important
problems: global optimization and confidence estimation [126, 7, 144]. Interesting results
from a global optimization method, Bayesian belief propagation (BBP), have been shown on
a limited domain of vision problems [137]. BBP propagates estimates together with their
covariances. If it converges, it converges in a small number of iterations, with covariances as
a by-product. Confidence measures such as covariances are critical for subsequent applications
to make judicious use of the results. It will be interesting to see whether ideas like BBP are
applicable and beneficial to optical flow estimation.
Chapter 5
MOTION-BASED DETECTION AND TRACKING
This chapter considers an application of visual motion to detecting and tracking point
targets in an airborne video of intensity images. The research has been carried out with
Engineering 2000 Inc. and the Boeing Company, as part of the effort to develop a UAV
See And Avoid System. The greatest difficulty in this project lies in the extremely small
target size. For many purposes of airborne visual surveillance, such as collision avoidance,
targets need to be identified as far away, and hence as early, as possible. This requires
handling targets no more than a few pixels large. Meanwhile, it is common for airborne
video imagery to have low image quality, substantial camera wobble and plenty of background
clutter. How to reliably detect and track point targets irrespective of these distractions is
a very challenging issue that has seldom been dealt with.
The primary cue for detection in aerial surveillance is the motion difference between the
target and the background. This is especially true in our problem, because the tiny target
has almost no other features to separate it from background clutter. Detecting and associating
objects based on brightness patterns [40] easily leads to false matches and tracking failure. A
popular motion-based detection method fits the background motion to a gradient-based
parametric model and takes pixels with large fitting residuals as belonging to potential
targets [12, 63, 101]. Such approaches, unfortunately, only work for objects with
extended spatial support, and do not apply to fast-moving point targets. We develop
a hybrid motion estimation method and a hypothesis test to identify small independently
moving objects. Specifically, for each pair of adjacent frames, we compute the global motion
with a hierarchical model-based method, estimate individual pixel motions by template
matching, and detect object pixels as those for which the two values are statistically significantly
different. The detection threshold is chosen with clear statistical meaning and remains fixed
for all frames.
[Figure 5.1 block diagram: Data → Detector → Target Measurement → Tracker → Target State]
Figure 5.1: A typical detection-tracking system
Tracking has been intensively studied in a wide variety of areas and also as a general
information processing problem [6, 30, 139, 65]. It can become highly involved and error-
prone when dealing with multiple maneuvering targets and low-quality measurements. In
aerial surveillance applications, tracking can be considered relatively easy, since aerial
targets are normally well separated and have predictable dynamics. Single-target tracking
is usually formulated, either explicitly or implicitly, as a Bayesian state estimation problem.
The Kalman filter is a Bayesian tracker under linear/Gaussian assumptions and is the most
widely used tracking technique in practice. We assume the target position, after the global
motion is compensated, conforms to a second order kinematic model, and track it using a
Kalman filter. Gandhi et al. [40] take a similar approach, but they rely on a navigation
system for the global motion parameters. The temporal integration method proposed by
Pless et al. [101] is also similar to a Kalman filter but has more heuristics.
In a typical detection-tracking system, as Fig. 5.1 shows, the data flow between the two
components takes only one direction: from the detector to the tracker. The detector assumes
a uniform prior distribution over the measurement space and makes decisions in a Neyman-Pearson
(NP) mode. This is what the detector starts with when it has no idea of the target
presence. It is also how most existing detection-tracking systems operate [139, 101, 40].
Once an object is detected and a track is formed for it, priors become available to the
detector in the form of a predicted state and its associated covariance. Taking the tracker
feedback into account, the detector can operate in a Bayesian mode. This amounts to
boosting the prior surface near the expected value, or equivalently lowering the NP test
threshold for states consistent with the priors. At minimal computational cost, the Bayesian
detector achieves remarkably lower false-alarm and misdetection rates and higher position
accuracy than the Neyman-Pearson detector. The data flow in the Bayesian system is
bidirectional: the tracker tells the detector where to look for measurements, and the detector
returns what it finds.

[Figure 5.2 block diagram: Image Sequence → Motion-Based Bayesian Detector → Measurement & Covariance → Kalman Filtering Tracker → State & Covariance; the tracker's Prediction feeds a Prior back to the detector]
Figure 5.2: Proposed Bayesian system
The hybrid motion-based detector and the Bayesian detection method are crucial to
high tracking accuracy and efficiency and form the major contributions of our work. In an
experiment on a 1800-frame real video clip, no false targets are detected; the true target is
tracked from the second frame, with position error mean and standard deviation as low as
0.88 and 0.44 pixels respectively.
5.1 Bayesian State Estimation
Bayes’ theorem gives the rule for updating belief in a Hypothesis H (i.e. the probability of
H) given background information (context) I and additional evidence E:
p(H|E, I) = p(E|H, I) p(H|I) / p(E|I).
The posterior probability p(H|E, I) gives the probability of the hypothesis H after consid-
ering the effect of evidence E in context I. The p(H|I) term is the prior probability of H
given I alone; that is, the belief in H before the evidence E is considered. The likelihood
term p(E|H, I) gives the probability of the evidence assuming the hypothesis H and
background information I are true. The denominator, p(E|I), is independent of H, and can be
regarded as a normalizing or scaling constant. The information I is a conjunction of (at
least) all of the other statements relevant to determining p(H|I) and p(E|I). A Bayesian
estimate is optimal in the sense that it is derived from all available information, and no
other inference can do better.
Suppose we want to estimate the state variable xt of a dynamic system at time t given
all the observations yt up to time t. In the Bayesian framework, we need to propagate the
conditional probability p(xt|Yt), where Yt = {yi|i = 1, . . . , t} is the entire set of observations.
We define Xt = {xi|i = 0, 1, . . . , t} as the history of the system state; x0 reflects our prior
knowledge about the state before any evidence is collected.
Applying Bayes' rule, we have

p(xt|Yt) = p(yt|xt, Yt−1) p(xt|Yt−1) / p(yt|Yt−1).
Note that to make an inference at time t, we need to carry the entire history of the state
and observation along. This incurs great modeling and computational difficulties. To keep
the problem manageable, the following three assumptions are commonly made.
• the yt’s are mutually independent: p(yi, yj) = p(yi)p(yj).
• each yt is independent of the dynamic process: p(yt|Xt) = p(yt|xt).
• the system has a Markov property such that any new state is only conditioned on the
immediately preceding state: p(xt|Xt−1) = p(xt|xt−1).
These assumptions are reasonable in our application, and they dramatically simplify
p(xt|Yt) to

p(xt|Yt) = p(yt|xt) p(xt|Yt−1) / p(yt),

where

p(xt|Yt−1) = ∫ p(xt|xt−1) p(xt−1|Yt−1) dxt−1.

Since the denominator p(yt) is not related to xt, we may take it as a scaling factor and
rewrite the first equation as

p(xt|Yt) = Ct p(yt|xt) p(xt|Yt−1).
These equations suggest a recursive procedure to update p(xt|Yt). Once we have
• the measurement model p(yt|xt),
• the system model p(xt|xt−1) and
• the prior model p(x0) = p(x0|x−1),
we may propagate the probability in two phases:

• prediction: p(xt|Yt−1) = ∫ p(xt|xt−1) p(xt−1|Yt−1) dxt−1,
• correction: p(xt|Yt) = Ct p(yt|xt) p(xt|Yt−1).
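The two-phase recursion can be made concrete with a toy histogram (grid) filter over a discretized one-dimensional state; the transition and likelihood values below are hypothetical.

```python
import numpy as np

def predict(belief, transition):
    """Prediction: p(x_t|Y_{t-1}) = sum_j p(x_t|x_{t-1}=j) p(x_{t-1}=j|Y_{t-1})."""
    return transition @ belief

def correct(pred, likelihood):
    """Correction: p(x_t|Y_t) = C_t p(y_t|x_t) p(x_t|Y_{t-1})."""
    post = likelihood * pred
    return post / post.sum()            # C_t is the normalizing constant

# Toy 3-cell world; column j of `transition` is p(x_t = i | x_{t-1} = j).
transition = np.array([[0.1, 0.0, 0.8],
                       [0.8, 0.1, 0.1],
                       [0.1, 0.9, 0.1]])
belief = np.array([1.0, 0.0, 0.0])      # prior p(x_0): certainly in cell 0
likelihood = np.array([0.1, 0.7, 0.2])  # p(y_1 | x_1) for the observed y_1

pred = predict(belief, transition)      # [0.1, 0.8, 0.1]
post = correct(pred, likelihood)        # posterior concentrates on cell 1
```

Here the prediction phase diffuses the belief through the system model, and the correction phase reweights it by the likelihood of the new observation.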
While the equations appear deceptively simple, propagating the conditional probability
density is not easy. In most real systems the three models take complex forms and might
not be expressible analytically. Usually Monte Carlo methods have to be employed to
provide a sample-based representation of the density function. As is well known, such methods can be
very computationally demanding; and when efforts are made to reduce the computational
burden, the resulting density function might not represent the underlying truth faithfully.
Some work along this direction has been done [46, 65].
The general Bayesian approach may have to be pursued in cases of multiple maneuvering
targets with clutter. But for our problem a special case of the approach, the Kalman filter,
is more appropriate.
5.2 Kalman Filter
The Kalman filter is a recursive Bayesian state estimator optimized for linear systems with
Gaussian noise. It is derived from three probabilistic models.
• The prior model: x_0 ∼ N(x̄_0, P_0). Here x̄_0 and P_0 are the mean and covariance of the
state before any observation is made.
• The system model: x_{t+1} = F_t x_t + G_t u_t + w_t. Here F_t rules the linear evolution of
the state variable with time. G_t u_t reflects some control input to the system, which is
taken as a known constant. w_t ∼ N(0, Q_t) is the process noise.
• The measurement model: y_t = H_t x_t + v_t. Here H_t relates the observed measurement
y_t to the underlying true state. v_t ∼ N(0, R_t) is the measurement noise.
The updating process can be summarized as
• Prediction:

  x_t^- = F_{t-1} x_{t-1} + G_{t-1} u_{t-1}
  P_t^- = F_{t-1} P_{t-1} F'_{t-1} + Q_{t-1}.

• Correction:

  P_t = ((P_t^-)^{-1} + H'_t R_t^{-1} H_t)^{-1} = P_t^- - P_t^- H'_t (H_t P_t^- H'_t + R_t)^{-1} H_t P_t^-
  K_t = P_t H'_t R_t^{-1} = P_t^- H'_t (H_t P_t^- H'_t + R_t)^{-1}
  x_t = x_t^- + K_t (y_t - H_t x_t^-)
where P_t is the posterior covariance of x_t, and K_t is the gain matrix. Kalman filtering is
computationally very cheap. The crucial factors for a successful Kalman filter are: (i)
accurate modeling (defining the state variable, system model and measurement model),
including appropriate parameter settings, and (ii) supplying high-quality measurements. Below
we describe the Kalman filter we built for target tracking.
5.3 Tracking
The first step in designing the tracker is to model the dynamic behavior of the target. The
simplest model is the second-order kinematic model, or constant translation model,
p_t = p_{t-1} + v_{t-1},

where p_t = (x_t, y_t)^T is the target position, and v_t = (v_{xt}, v_{yt})^T is the velocity, which
should be constant. In our problem, v_t is not constant. It is the sum of two velocities,
the velocity of the target itself v_t^R and that of the background v_t^G due to the camera
airplane motion. v_t^G is quite random and so is v_t. But the component v_t^R = v_t - v_t^G
remains quite steady over time. It gives us a way of predicting v_t:

v_t = v_{t-1} + v_t^G - v_{t-1}^G.

Here the background motion can be accurately estimated between each pair of frames (Section 5.4),
and is considered as a known control input to the system.
In Kalman filter notation, the tracking problem can be formalized as follows:
State Variable.
θ_t = (p_t^T, v_t^T)^T

where p_t = (x_t, y_t)^T is the centroid position of the target, and v_t = (v_{xt}, v_{yt})^T is its
image velocity.
Prior Model. There are many ways of specifying the prior model. When no prior
knowledge about the target motion is available, a diffuse prior (infinite covariance) is used,
and the Kalman filtering process reduces to a recursive least-squares estimation.
System Model.

θ_t = F θ_{t-1} + u_t + w_t,    (5.1)

where the control input is

u_t = (0, 0, (v_t^G - v_{t-1}^G)')',

the stationary system matrix is

F = | 1 0 1 0 |
    | 0 1 0 1 |
    | 0 0 1 0 |
    | 0 0 0 1 |,

and w_t ∼ N(0, Q). We assume the process noise covariance in Eq. 5.4 to be

Q = ε F P_{t-1} F'.

Multiplying F P_{t-1} F' by (1 + ε) then yields the covariance of the predicted state P_t^-.
This is an ad hoc approach accounting for errors from unmodeled sources, known in
stochastic estimation as exponential aging [126].
Measurement Model. We assume the observed state y_t is a contaminated version of
the true value:

y_t = θ_t + ν_t,    (5.2)

where the noise ν_t ∼ N(0, R_t), and R_t is block-diagonal (position errors are correlated and
velocity errors are correlated). Accurate (y_t, R_t) estimation is crucial to the success of the
filter, and it is the topic of the next two sections.
The optimal solution procedure is given below.

Prediction.

θ_t^- = F θ_{t-1} + u_t    (5.3)
P_t^- = F P_{t-1} F' + Q.

Correction.

K_t = P_t^- (P_t^- + R_t)^{-1}    (5.4)
θ_t = θ_t^- + K_t (y_t - θ_t^-)    (5.5)
P_t = (I - K_t) P_t^-
The numerical instability of the Kalman filter is well known. It arises from the matrix
inversion in evaluating K_t (Eq. 5.6): if (P_t^- + R_t) is poorly conditioned, the computed
value of K_t is dominated by round-off errors. We deal with the problem using a simple method:
adding a tiny positive number ε to the diagonal entries of the posterior covariance P_t.
Currently we set ε to 1% of the smallest diagonal entry. The importance of parameter tuning
cannot be overstated in optimizing the performance of real Kalman filtering systems. But
here we emphasize the theoretical part of the problem, and leave the practical aspects
for future consideration.
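As a sketch (not the exact implementation used in our experiments), one predict/correct cycle with the constant-translation F, the background-motion control input, and the diagonal-ε conditioning fix might look like this; all numerical values are hypothetical.

```python
import numpy as np

# Constant-translation state transition for theta = (x, y, vx, vy)'.
F = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])

def kf_step(theta, P, y, R, vG_t, vG_prev, eps=0.05):
    """One predict/correct cycle of the tracking Kalman filter (Eqs. 5.3-5.5).
    The measurement matrix is the identity, since y_t = theta_t + nu_t."""
    u = np.concatenate([np.zeros(2), vG_t - vG_prev])  # background-motion control input
    Q = eps * F @ P @ F.T                              # exponential-aging process noise
    # Prediction.
    theta_pred = F @ theta + u
    P_pred = F @ P @ F.T + Q
    # Correction.
    K = P_pred @ np.linalg.inv(P_pred + R)
    theta_new = theta_pred + K @ (y - theta_pred)
    P_new = (np.eye(4) - K) @ P_pred
    # Conditioning fix: add 1% of the smallest diagonal entry to the diagonal.
    P_new += 0.01 * P_new.diagonal().min() * np.eye(4)
    return theta_new, P_new

# One cycle: target at (10, 10) moving (1, 0); a noisy measurement nearby.
theta = np.array([10., 10., 1., 0.])
P = np.eye(4)
y = np.array([11.2, 10.1, 1.1, 0.0])
R = 0.25 * np.eye(4)
theta, P = kf_step(theta, P, y, R, np.zeros(2), np.zeros(2))
```

With a diffuse prior one would instead start P with very large diagonal entries; the step then behaves like recursive least squares.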
5.4 Motion-Based Detection
This section addresses the problem of detecting independently moving pixels between two
frames, It and It+1, and measuring the object state. Examples are given on three 100× 100
sample frames (Fig. 5.3). f16502 and f18300 have a target near the center. f19000 has no
target but many ground objects resembling targets by appearance. These data sets are
cropped out of the full data set (Section 5.7).
Candidates as Background Motion Outliers.
The background motion vG is introduced by the relative ground-camera movement. The
ground is well approximated by a planar object. Its image motion conforms to the quadratic
model Eq. 2.5. Putting this model into the optical flow constraint equation Eq. 2.3, we have
a linear constraint Asa = bs at each pixel location s where
a linear constraint A_s a = b_s at each pixel location s, where

A_s = [ I_x  I_y  x I_x  x I_y  y I_x  y I_y  (x^2 I_x + x y I_y)  (x y I_x + y^2 I_y) ],    b_s = -I_t.
We solve this regression problem using least-squares. To reduce the impact of outliers, we
refine the LS estimate under the least-trimmed-squares criterion Eq. 3.3 using the C-step
in the FAST-LTS implementation (Section 3.1.1). The number of equations n is equal to
the number of pixels in each frame. Processing time increases proportionally with n while
estimation accuracy quickly saturates, since the number of unknowns is fixed at 8. Our
experiments show that using 2500 out of the n equations achieves the same accuracy at a
small constant cost. To handle large motion, up to 4 pixels in our client data, we adopt
a hierarchical scheme with two pyramid levels (Section 2.6 and Fig. 2.2). The projection
operation for the planar flow parameter a can be expressed as

a_p^{i-1}(0, 1) = 2 a_p^i(0, 1),
a_p^{i-1}(2, 3, 4, 5) = a_p^i(2, 3, 4, 5),
a_p^{i-1}(6, 7) = a_p^i(6, 7) / 2.
Once the estimate a is available, the velocity v^G(x, y) at any position (x, y) can be calculated
from Eq. 2.5.
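A minimal sketch of the per-pixel constraint row, the LS fit and the level-to-level parameter projection follows; it is illustrative only (the LTS refinement, image-gradient computation and pixel sampling are omitted, and the demo values are synthetic).

```python
import numpy as np

def planar_row(Ix, Iy, It, x, y):
    """One row of the linear system A_s a = b_s for the 8-parameter
    quadratic (planar-scene) flow model at pixel position (x, y)."""
    A = np.array([Ix, Iy, x * Ix, x * Iy, y * Ix, y * Iy,
                  x * x * Ix + x * y * Iy, x * y * Ix + y * y * Iy])
    return A, -It

def fit_planar_motion(Ix, Iy, It, xs, ys):
    """Least-squares estimate of the 8 planar-flow parameters from sampled
    pixels; an LTS refinement (C-step) would follow in the full method."""
    rows, rhs = zip(*(planar_row(ix, iy, it, x, y)
                      for ix, iy, it, x, y in zip(Ix, Iy, It, xs, ys)))
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return a

def project_params(a):
    """Project parameters from pyramid level i to the finer level i-1:
    translational terms double, linear terms stay, quadratic terms halve."""
    a = a.copy()
    a[0:2] *= 2.0
    a[6:8] /= 2.0
    return a

a = np.array([1.0, -0.5, 0.01, 0.0, 0.0, 0.01, 1e-4, -1e-4])
print(project_params(a)[:2])   # translational part doubles to [2.0, -1.0]
```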
Background motion can be very accurately recovered from the above process. Fig. 5.4(a)
shows the frame difference between f16502 and f16503 after the background motion is re-
moved by image warping. The difference is very small except near the target, which moves
independently, and the image border, where warping errors are significant. It is this high
accuracy that allows us to consider v^G as a known control input to the system in the Kalman
filter model (Eq. 5.1).
Figure 5.3: Example data sets. Column 1: first frame, Column 2: second frame, Column 3: frame difference (first frame minus second frame). Row 1: f16502 (target is the white dot near the center), Row 2: f18300 (target near the center), Row 3: f19000 (no target but many target-like ground objects).
Figure 5.4: f16502 target pixel candidates. (a) frame difference after background motion is removed. (b) pixels of large warping errors. Postprocessing: (c) isolated pixels removed. (d) dilated by 3 × 3.
For slow-moving objects of sufficiently large size, the planar model fitting error serves as
a good indicator of independent motion [63, 101]. Another method, which applies to small
objects with considerable image motion, is to warp I_{t+1} towards I_t according to v^G, yielding
a new image I_t^W, and consider pixels with large warping errors |I_t - I_t^W| as potentially
independently moving [40]. We use this method to find candidate pixels. We estimate the
image noise variance σ² by fitting a facet model to the image data (Section 3.1.2). Denoting
by σ_i the standard deviation estimate at the i-th pixel, we set σ = 1.4826 · median_i σ_i. Then we
take pixels with warping errors exceeding 2.5σ as candidates. Two processes further refine
the candidate set: isolated pixels are pruned and the remaining ones are dilated by one pixel
in each direction. Results for f16502 are given in Fig. 5.4. We observe that target pixels
are successfully selected and the number of false alarms is very small. Results for f18300 and
f19000 are given in Fig. 5.5.
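The candidate-selection steps can be sketched as follows; this is a simplified illustration in which the facet-model noise estimation is replaced by given per-pixel σ_i values and the example image is synthetic.

```python
import numpy as np

def box_sum_3x3(mask):
    """Sum of each pixel's 3x3 neighborhood (zero-padded borders)."""
    p = np.pad(mask.astype(int), 1)
    H, W = mask.shape
    return sum(p[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3))

def candidate_mask(warp_err, sigma_i, k=2.5):
    """Select candidate target pixels from the warping error |I_t - I_t^W|.
    sigma_i holds per-pixel noise std estimates (e.g. from a facet-model fit)."""
    sigma = 1.4826 * np.median(sigma_i)        # robust global noise scale
    mask = np.abs(warp_err) > k * sigma
    # Prune isolated pixels: keep only those with a detected 8-neighbor.
    mask &= (box_sum_3x3(mask) - mask) > 0
    # Dilate the survivors by one pixel in each direction.
    return box_sum_3x3(mask) > 0

# Tiny example: a 2-pixel "target" plus one isolated noise spike.
err = np.zeros((7, 7))
err[3, 3] = err[3, 4] = 5.0     # adjacent target pixels survive pruning
err[0, 0] = 5.0                 # isolated spike, pruned
sigma_i = np.full((7, 7), 1.0)
m = candidate_mask(err, sigma_i)
```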
As we see in the above results, this method does locate the target, but it also produces
a considerable number of false detections, especially for f19000 (Fig. 5.5). This is because
a large intensity change can result from strong intensity variation as well as from independent
motion. Feeding such measurements to the tracker imposes a penalty on both tracking accuracy
and response time [40]. Therefore we further exploit motion information to resolve the
ambiguity.
Candidate Pixel Motion and Covariance.
It is important to point out that hierarchical gradient-based methods cannot be extended
to calculating candidate pixel motion because they require information aggregation in a
neighborhood much larger than the spatial support of a point target. Therefore, we calculate
target candidate pixel motions using the matching-based (template matching) method [3,
126], which requires minimal spatial integration.
For each candidate pixel, we take its 3 × 3 neighborhood as the template and find its
best match in a window of size w × w in the next frame It+1. The displacement gives us
a pixel-accuracy solution v0. w should be chosen large enough to encompass the range of
the velocity, but as small as possible to avoid false matches. We set w = 4 according
to the observed maximum image motion. To achieve sub-pixel accuracy, we refine v0 by a
Figure 5.5: f18300 and f19000 target pixel candidates. Left column: f18300. Right column: f19000. Row 1: frame difference after background motion is removed. Row 2: pixels of large warping errors. Row 3: isolated pixels removed.
quadratic fit of the error surface surrounding it. Specifically, we take the matching errors in
the 3 × 3 neighborhood centered at v0 and fit them with the 2D quadratic surface

e = v′Av + b′v + c = a1 + a2 vx + a3 vy + a4 vx² + a5 vx vy + a6 vy²,
find the minimum of the surface and the displacement achieving it as, respectively,
emin = c− b′A−1b/4 = c + b′vmin/2,
vmin = −A−1b/2,
and obtain the final motion estimate as
v = v0 + vmin.
The covariance matrix of v is available as a by-product of the quadratic fitting. It can
be shown [126] that

Σv = (emin/9) A−1.
When either component of vmin exceeds 1 pixel, emin < 0, or a diagonal entry of Σv is
negative, the motion estimator has clearly failed; we then discard the pixel and treat it as
belonging to the background. We also discard the estimate when the larger eigenvalue of
Σv exceeds 0.25, which corresponds to position errors above 0.5 pixels.
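The sub-pixel refinement and its sanity checks can be sketched in a few lines. This is a hypothetical NumPy illustration under the formulae above; the 3 × 3 error patch E centered on the pixel-accuracy match v0 is assumed given.

```python
import numpy as np

def subpixel_refine(E):
    """Sub-pixel refinement from a 3x3 matching-error patch E.

    Returns (v_min, e_min, Sigma_v), or None if the fit is rejected by
    the sanity checks described in the text.
    """
    vy, vx = np.mgrid[-1:2, -1:2]
    vx = vx.ravel().astype(float)
    vy = vy.ravel().astype(float)
    # Design matrix for e = a1 + a2 vx + a3 vy + a4 vx^2 + a5 vx vy + a6 vy^2
    X = np.stack([np.ones(9), vx, vy, vx**2, vx * vy, vy**2], axis=1)
    a = np.linalg.lstsq(X, E.ravel().astype(float), rcond=None)[0]
    A = np.array([[a[3], a[4] / 2], [a[4] / 2, a[5]]])
    b = a[1:3]
    c = a[0]
    v_min = -np.linalg.solve(A, b) / 2          # v_min = -A^{-1} b / 2
    e_min = c + b @ v_min / 2                   # e_min = c + b' v_min / 2
    Sigma_v = (e_min / 9) * np.linalg.inv(A)    # Sigma_v = (e_min / 9) A^{-1}
    # Reject clearly wrong estimates: displacement above 1 pixel, negative
    # minimum error or variances, or positional uncertainty above 0.25
    if (np.any(np.abs(v_min) > 1) or e_min < 0
            or np.any(np.diag(Sigma_v) < 0)
            or np.linalg.eigvalsh(Sigma_v).max() > 0.25):
        return None
    return v_min, e_min, Sigma_v
```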
Independent Motion from χ2 Test.
If a candidate pixel actually belongs to the background, its image motion v should be no
different from the background motion vG. That is, under the null hypothesis H0 : v = vG,
we should have v − vG ∼ N(0, Σ) and the test statistic
T = (v − vG)′Σ−1(v − vG)
conforms to a χ2 distribution with 2 degrees of freedom. An independent motion is detected
when H0 is rejected. We reject H0 at the significance level α = 0.05 which amounts to finding
T > Tα = 5.9915. This test is simple yet very effective. As Fig. 5.6 shows, almost all the
clutter is now gone and the target stands out as the only significant connected pixel set.
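A minimal sketch of this test, assuming the estimates v, vG and the covariance Σ are available for each candidate pixel (Tα = 5.9915 as above):

```python
import numpy as np

def independently_moving(v, v_G, Sigma, T_alpha=5.9915):
    """Chi-square test with 2 degrees of freedom for independent motion
    at one candidate pixel (T_alpha is the alpha = 0.05 cutoff)."""
    d = np.asarray(v, float) - np.asarray(v_G, float)
    Sigma = np.asarray(Sigma, float)
    # T = (v - vG)' Sigma^{-1} (v - vG)
    T = d @ np.linalg.solve(Sigma, d)
    return T > T_alpha
```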
Data Association, Object Detection and State Measurement.
Figure 5.6: Target pixels for f16502, f18300 and f19000. Left column: target pixel candidates. Right column: target pixels detected by the statistical test (small connected pixel sets are considered noise and removed). Row 1: f16502. Row 2: f18300. Row 3: f19000.
Moving pixels are first assigned to existing tracks according to the nearest neighbor rule.
New tracks are formed for residual large connected sets (≥ 5 pixels). Given a cluster of
pixels, we calculate the object velocity and covariance by averaging all the (v, Σ) values,
and the position and its covariance from the sample mean and variance of the pixel
coordinates. This provides accurate (yt, Rt) estimates to the tracker.
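The object-level measurement step can be sketched as follows; a hypothetical NumPy illustration of the averaging described above, with illustrative names.

```python
import numpy as np

def measure_object(positions, velocities, covariances):
    """Object-level measurement (y_t, R_t) from a cluster of detected pixels.

    positions: (n, 2) pixel coordinates; velocities: (n, 2) per-pixel motion
    estimates; covariances: (n, 2, 2) per-pixel motion covariances.
    """
    positions = np.asarray(positions, float)
    # Velocity and its covariance: averages over the cluster
    v = np.mean(velocities, axis=0)
    Sigma_v = np.mean(covariances, axis=0)
    # Position and its covariance: sample mean and scatter of the pixels
    pos = positions.mean(axis=0)
    Sigma_p = np.cov(positions.T)   # 2x2 sample covariance
    return pos, Sigma_p, v, Sigma_v
```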
5.5 Bayesian Detection
The object detector described above assumes every pixel is equally likely to be an object
pixel, and tries to make the distinction solely based on evidence it collects from two adjacent
frames. It is the best we can do when initiating a track with no prior knowledge at that
time. Due to the extremely small object size, detection is still difficult, especially for small
independent motions. Meanwhile, once a track is formed, priors on the object state become
immediately available from the tracker in the form of predicted state distribution. From a
Bayesian point of view, to pursue the optimal detection results we are obliged to exploit
the priors. This section introduces a Bayesian object detector. It is an important feature
of our system; most detectors in previous visual surveillance applications [101, 40]
operate exclusively in the Neyman-Pearson mode.
The prior distribution of the state at time t is exactly the predicted distribution from
time t − 1 by the tracker. Our system is in all respects linear/Gaussian, and hence the
distribution is defined by the mean θt− and covariance Pt− as in Eq. 5.4. The Bayesian
detector utilizes the priors in two phases: (i) augmenting the candidate set by adding pixels
falling into predicted 3σ regions, and (ii) validating/updating the velocity estimates.
Each predicted candidate pixel has two sets of velocity and covariance available: one
from the matching-based motion estimation step and the other from prediction, which we
denote as (v0, Σ0) and (v−, Σ−), respectively. To combine these two pieces of evidence, we
first conduct a consistency test: candidates with pixel motion significantly different from
the prediction are rejected. As summarized below, the χ2 test works in the same way as
the detection of independent motions. Pixels failing the test are considered to have poor
motion estimates and taken as background pixels.
• Null hypothesis: H0 : v0 ∼ N(v−, Σ−).
• Test statistic: T = (v0 − v−)′(Σ−)−1(v0 − v−) ∼ χ22 under H0.
• Reject H0 when T > Tα; (α, Tα) is fixed at (0.05, 5.9915).
For each remaining candidate, we next calculate its posterior motion estimate (v, Σ) using
formulae similar to the correction phase of the Kalman filter [148]:

K = Σ−(Σ− + Σ0)−1
v = v− + K(v0 − v−)
Σ = (I − K)Σ−.
Then the independent motion test is done for both the prior and posterior estimates. Object
pixels are much easier to identify in the Bayesian mode because of the boosted density
distribution, or equivalently the dampened threshold in the χ2 test, around the predicted
values [139].
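The consistency test and posterior fusion above can be sketched together; a hypothetical NumPy illustration in which a None return corresponds to rejecting the candidate as a background pixel.

```python
import numpy as np

def fuse_velocity(v0, Sigma0, v_pred, Sigma_pred, T_alpha=5.9915):
    """Consistency-test and fuse measured and predicted pixel motion.

    (v0, Sigma0): matching-based estimate; (v_pred, Sigma_pred): prediction.
    """
    v0 = np.asarray(v0, float)
    v_pred = np.asarray(v_pred, float)
    d = v0 - v_pred
    # Chi-square (2 dof) consistency test against the prediction
    if d @ np.linalg.solve(np.asarray(Sigma_pred, float), d) > T_alpha:
        return None   # treated as a background pixel
    # Kalman-style correction: K = Sigma- (Sigma- + Sigma0)^{-1}
    K = Sigma_pred @ np.linalg.inv(Sigma_pred + Sigma0)
    v = v_pred + K @ d
    Sigma = (np.eye(2) - K) @ Sigma_pred
    return v, Sigma
```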
Fig. 5.7 illustrates the prior impact on f16503. By comparing the original candidate set
(a) and the augmented one (b, predicted pixels in gray), we see that the position priors help
to locate candidates missed by the motion-based detector. Object pixels detected with no
priors, with position priors only and with full priors are given in (c), (d) and (e) respectively.
Lower misdetection rates are achieved as more priors are incorporated.
(a) (b) (c) (d) (e)
Figure 5.7: Detection results with and without priors on f16503.
5.6 The Algorithm
Note that so far we have been carefully using the word “object” instead of “target” when
talking about detection and tracking. This is because not all objects we track are targets,
only those with consistent dynamic behavior. In our application, we watch any object for
Ni = 25 frames (about 0.8 seconds) and declare it a target only if, during that period, it
never misses a measurement and its position covariance always stays within an allowed range.
Once confirmed, a target is permitted to miss measurements for up to Nt = 5 successive
frames. When a measurement is missing, we use the prediction as the new state, and we
require the position covariance to remain within the allowed range. The range is defined
by the larger eigenvalue of the covariance matrix, whose maximum is set to 2 square
pixels. Tracks corresponding to false objects
and dead targets are terminated.
We record four properties for the track associated with each object: a unique ID id,
the track length (since the first detection) hist, the number of successive frames in which
the object is not measured miss, and the object state Xt. Here we briefly synopsize the
execution of the algorithm on each frame.
1. Prediction. Extrapolate the new state¹ and update its covariance (Eq. 5.4).
2. Detection.
(a) Calculate global motion parameters. Find candidates from both large warping
errors and position priors.
(b) Estimate candidate motion. Further find posterior estimates for predicted can-
didates.
(c) Detect independently moving pixels using χ2 test.
(d) Assign detected pixels to existing tracks; initiate new tracks from unassigned
large connected sets.
¹We do prediction before detection in order to supply priors to the Bayesian detector. However, since predicting target positions needs the new global motion parameters (Eq. 5.1), the prediction process is actually completed in the detection phase.
Figure 5.8: Two sample frames in the video clip: (a) frame 16540, (b) frame 18000 (targets at the center of the 41 × 41 windows).
(e) Measure track states.
3. Correction.
(a) Update object states: if no new measurement is available, use the prediction
and increase miss by 1; otherwise apply the Kalman filter correction (Eq. 5.6) and set miss ← 0.
(b) Remove a target when miss = Nt. Terminate an object track if miss > 0 or
its position covariance is too large.
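The prediction and correction steps follow the standard linear/Gaussian Kalman recursions. A generic sketch is given below; F, H, Q and R are placeholders for the actual system and measurement models of Eqs. 5.1 to 5.6, which are not reproduced in this excerpt.

```python
import numpy as np

def kalman_predict(theta, P, F, Q):
    """Prediction phase: extrapolate the state and its covariance
    (the role of Eq. 5.4; F is the state transition, Q the process noise)."""
    return F @ theta, F @ P @ F.T + Q

def kalman_correct(theta_pred, P_pred, y, R, H):
    """Correction phase with measurement y and measurement noise R
    (the role of Eq. 5.6; H maps state to measurement)."""
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    theta = theta_pred + K @ (y - H @ theta_pred)
    P = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred
    return theta, P
```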
5.7 Experiments
We demonstrate the system performance on a 1800-frame video clip (sample frames in
Figure 5.8) obtained from real flight data. The frame rate is 30 frames/second and the
image size is 256 × 192 pixels. There is one target in the clip. It is about 2 × 1 to 3 × 3
pixels in size, and its maximum image motion is about 5 pixels/frame. Many ground objects
resemble the target in brightness and/or shape. The camera is constantly wobbling and the
image quality is low.
Exp   MD   FA   ITR     min    max    med    mean   sd
I      0    0   0.999   0.05   3.88   0.82   0.88   0.44
II     0    2   0.988   0.12   3.82   0.99   1.06   0.52
III    0    0   0.618   0.14   3.78   0.83   0.83   0.37
Table 5.1: Quantitative measures in 1800 frames. MD: number of missed targets. FA: number of false targets. ITR: in-track rate (target track length divided by 1800). The remaining measures refer to position error vector magnitude statistics (unit: pixel).
All experiments are carried out on a PIII 500MHz PC running Solaris. The current
implementation is not optimized for speed, but it is reasonably fast, spending about 2.3
seconds on each frame including 2 seconds on background motion estimation. The algorithm
has the potential to execute in real-time.
In the 1800-frame clip, a total of 252 objects are detected, and only one of them is
identified as a target. The target is in track from the second frame onward. We mark the target
by placing a 7× 7 white box centered at its estimated position. As can be observed in the
output video [148], the marker encloses the target throughout the sequence.
We managed to locate target centroid positions in 564 frames and used them as the
groundtruth in quantitative evaluation. Table 5.1 gives results from three experiments.
Experiment I shows the proposed method. To illustrate the effectiveness of the Bayesian
detector, we also performed Experiment II in which only position priors are used and Ex-
periment III in which no priors are used.
In both Experiments II and III, the true target is still detected and thus there is no
mis-detection. II has two false alarms due to the absence of the consistency test between
the estimated and predicted motion vectors (Section 5.5). Its in-track rate is slightly lower,
while the localization accuracy degrades severely. Experiment III's error measures are comparable to
Experiment I's. However, its in-track rate suffers a drastic decrease. The target is not confirmed until
23 seconds later and its track is broken twice. This could be unacceptable in time-critical
applications such as collision avoidance.
5.8 Discussion
We have introduced a novel approach to point target detection and tracking in a low-quality
airborne video. We identify objects by the statistical difference between their motions and
the background motion, and track their dynamic behavior in order to detect targets and
update their states. Compared to most previous visual surveillance studies, our method has
four main advantages.
• The hybrid motion-based detector is highly efficient in suppressing background clutter,
locating moving objects and modeling their dynamics. It enables us to employ a simple
Kalman filter for object tracking.
• With priors exploited in detection, false alarm, misdetection and in-track rates receive
significant improvement.
• The extensive use of statistical tests rather than heuristics reduces parameter tuning
to a minimum.
• The Bayesian detection-tracking approach is readily applicable to other data sources
such as UV and RGB images. It allows results from different channels to be easily
integrated to yield more reliable output.
Performance of the proposed technique has been very encouraging in preliminary ex-
periments. More real and synthetic data are needed for further evaluation. Currently the
approach is being integrated into a UAV See And Avoid System jointly developed by
Engineering 2000 Inc. and the Boeing Company.
Chapter 6
CONCLUSIONS
Visual motion is a compelling cue to the structures and dynamics of the world around
us. Its analysis is crucial to many key problems in today’s vision research such as ob-
ject/environment/human modeling, video compression, event analysis and image-based ren-
dering. This dissertation has addressed two fundamental problems in visual motion analysis:
optical flow estimation and motion-based detection and tracking. Two new approaches, ex-
ploiting local and global motion coherence, respectively, have been proposed for estimating
piecewise-smooth optical flow. A video surveillance system has been designed based on mo-
tion cues and Bayesian estimation theory to achieve reliable target detection and tracking.
In the process of developing these techniques, statistical methods have been extensively
used to measure estimation uncertainty, facilitate information fusion and achieve high ro-
bustness. This chapter will summarize the main contributions of the dissertation and point
out some open questions and future work directions.
6.1 Summary and Contributions
A two-stage-robust adaptive scheme for gradient-based local flow estimation.
Gradient-based optical flow estimation techniques consist of two stages: estimating
derivatives and organizing and solving optical flow constraints (OFC). Both stages pool
information in a certain neighborhood and are regression procedures in nature. Least-
squares solutions to the regression problems break down in the presence of outliers such as
motion boundaries. To cope with this problem, a few robust regression tools have been in-
troduced to the OFC solving stage. By carefully analyzing the characteristics of the optical
flow constraints and comparing the strengths and weaknesses of different robust regression
tools, we identified the least-trimmed-squares (LTS) technique as more appropriate for the
OFC stage. As a very similar information pooling step, derivative calculation has seldom
received proper attention in optical flow estimation. Crude derivative estimators are widely
used; as a consequence, robust OFC (one-stage robust) methods still break down near mo-
tion boundaries. Pointing out this limitation, we proposed to calculate derivatives from a
robust facet model. To reduce the computation overhead, we carried out the robust deriva-
tive stage adaptively according to a confidence measure of the flow estimate. Preliminary
experimental results show that the two-stage robust scheme permits correct flow recovery
even at immediate motion boundaries.
A deterministic high-breakdown robust method for visual reconstruction.
High-breakdown criteria are employed in both of the above regression problems. They
have no closed-form solutions and past research has resorted to certain approximation
schemes. So far all applications of high-breakdown robust methods in visual reconstruction
have adopted a random-sampling algorithm—the estimate with the best criterion value is
picked from a random pool of trial estimates. These methods uniformly apply the algorithms
to all pixels in an image disregarding the actual number of outliers, and suffer from heavy
computation as well as unstable accuracy. Taking advantage of the piecewise smoothness
property of the visual field and the selection capability of robust estimators, we proposed a
deterministic adaptive algorithm for high-breakdown local parametric estimation. Starting
from least-squares estimates, we iteratively choose neighbors’ values as trial solutions and
use robust criteria to adapt them to the local constraint. This method provides an estima-
tor whose complexity depends on the actual outlier contamination. It inherits the merits of
both least-squares and robust estimators and results in crisp boundaries as well as smooth
inner surfaces; it is also faster than algorithms based on random sampling.
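As a simplified illustration of this trial-propagation idea, the following hypothetical NumPy sketch smooths a 1D signal under a local constant model with a least-trimmed-squares criterion. The actual algorithm operates on 2D parametric models and starts from least-squares estimates; the window size, trial set, and iteration count below are illustrative choices.

```python
import numpy as np

def lts_cost(value, window):
    """Least-trimmed-squares cost of a constant model: sum of the
    smallest half (plus one) of the squared residuals over the window."""
    r2 = np.sort((window - value) ** 2)
    return r2[: len(r2) // 2 + 1].sum()

def deterministic_lts_smooth(y, half=2, n_iter=5):
    """Deterministic trial propagation: at each pixel, try the current
    estimate and the neighbours' values, keep the one with lowest LTS cost."""
    y = np.asarray(y, float)
    x = y.copy()
    n = len(y)
    for _ in range(n_iter):
        for i in range(n):
            lo, hi = max(0, i - half), min(n, i + half + 1)
            window = y[lo:hi]
            # Trial solutions: current estimate plus the neighbours' estimates
            trials = [x[i]] + [x[j] for j in (i - 1, i + 1) if 0 <= j < n]
            x[i] = min(trials, key=lambda t: lts_cost(t, window))
    return x
```

On a step signal with an outlier, the outlier is rejected by the trimmed criterion while the step edge stays crisp, mirroring the "crisp boundaries, smooth inner surfaces" behavior described above.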
Error analysis on gradient-based local flow.
Due to the intrinsic ambiguity of visual motion and modeling imperfections, an optical
flow estimate generally has spatially varying reliability. In order for subsequent applications
to make judicious use of the results, error statistics of the flow estimate have to be analyzed.
In our earlier work, we conducted error analysis for the least-squares-based local estimation
method using the covariance propagation theory for approximate linear systems and small
errors. In this thesis, we have generalized the results to the newer robust method. Our
analysis estimates image noise and derivative errors in an adaptive fashion, taking into
account correlation of derivative errors at adjacent positions. It is more complete, systematic
and reliable than previous efforts.
Piecewise-smooth optical flow from global matching and graduated optimiza-
tion.
By drawing information from the entire visual field, the global optimization approach
to optical flow estimation is conceptually more effective in handling the aperture problem
and outliers than the local approach. But its actual performance has been somehow dis-
appointing due to formulation defects and solution complexity. On one hand, approximate
formulations are frequently adopted for ease of computation, with the consequence that the
correct flow is unrecoverable even in ideal settings. On the other hand, more sophisticated
formulations typically involve large-scale nonconvex optimization problems, which are so
hard to solve that the practical accuracy might not be competitive with simpler methods.
The global optimization method we developed provides better solutions to both problems.
From a Bayesian perspective, we assume the flow field prior distribution to be a Markov
random field (MRF) and formulate the optimal optical flow as the maximum a posteriori
(MAP) estimate, which is equivalent to the minimum of a robust global energy function.
The novelty in our formulation lies mainly in two aspects. 1) Three-frame matching is
proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2)
The strengths of brightness and smoothness errors in the global energy are automatically
balanced according to local data variation, and consequently parameter tuning is reduced.
These features enable our method to achieve a higher accuracy upper-bound than previous
algorithms.
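The structure of such an objective can be written generically; the precise data and smoothness terms of our formulation are not reproduced in this summary, so ρD, ρS, λ and the neighborhood system N below are placeholders:

```latex
E(\mathbf{v}) \;=\; \sum_{p}\Big[\rho_D\big(I_{t-1}(p-\mathbf{v}_p)-I_t(p)\big)
      + \rho_D\big(I_{t+1}(p+\mathbf{v}_p)-I_t(p)\big)\Big]
   \;+\; \lambda \sum_{(p,q)\in\mathcal{N}} \rho_S\big(\mathbf{v}_p-\mathbf{v}_q\big)
```

The two data terms reflect the three-frame (backward-forward) matching, and the robust functions ρD, ρS limit the influence of occlusions and motion discontinuities.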
In order to solve the resultant energy minimization problem, we developed a hierarchical
three-step graduated optimization strategy. Step I is a high-breakdown robust local gradient
method with a deterministic iterative implementation, which provides a high-quality initial
flow estimate. Step II is a global gradient-based formulation solved by Successive Over-
Relaxation (SOR), which efficiently improves the flow field coherence. Step III minimizes
the original energy by greedy propagation. It corrects gross errors introduced by derivative
evaluation and pyramid operations. In this process, merits are inherited and drawbacks are
largely avoided in all three steps. As a result, high accuracy is obtained both on and off
motion boundaries.
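For reference, the relaxation in Step II has the same structure as generic SOR on a linear system. The toy sketch below illustrates the sweep; the actual solver operates on the much larger coupled flow equations.

```python
import numpy as np

def sor(A, b, omega=1.5, n_iter=200):
    """Successive Over-Relaxation for Ax = b (illustrative toy version)."""
    x = np.zeros_like(b, dtype=float)
    n = len(b)
    for _ in range(n_iter):
        for i in range(n):
            # Gauss-Seidel update, over-relaxed by omega in (0, 2)
            s = A[i] @ x - A[i, i] * x[i]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
    return x
```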
Performance of this technique was demonstrated on a number of homebrew and standard
test data sets. On Barron’s synthetic data, which have become the benchmark since the
publication of [7], this method achieved the best accuracy among all low-level techniques.
Close comparison with the well known Black and Anandan’s dense regularization technique
(BA) [14] showed that in all of our experiments the new method yields uniformly higher
accuracy at a similar computational cost.
A motion-based Bayesian approach to aerial point target detection and tracking.
In a visual surveillance project funded by the Boeing Company, we have investigated an
application of optical flow to airborne target detection and tracking. The greatest difficulty
in this problem lies in the extremely small target size, typically 2 × 1 to 3 × 3 pixels, which
makes results from most previous aerial visual surveillance studies inapplicable. Challenges
also arise from low image quality, substantial camera wobbling and plenty of background
clutter.
The proposed system consists of two components: a moving object detector identifies
objects by the statistical difference between their motions and the background motion, and
a Kalman filter tracks their dynamic behaviors in order to detect targets and update their
states. Both the detector and the tracker operate in a Bayesian mode, and they each benefit
from the other’s accuracy. The system exhibited excellent performance in experiments. On
an 1800-frame real video clip with heavy clutter and one true target (2 × 1 to 3 × 3 pixels
in size), it produced no false targets and tracked the true target from the second frame with
average position error below 1 pixel. This probabilistic approach reduces parameter tuning
to a minimum. It also facilitates data fusion from different information channels.
6.2 Open Questions and Future Work
Recovering optical flow from image sequences is a very challenging problem due to the
intrinsic ambiguity of visual motion, inevitable modeling imperfections, computational dif-
ficulties, and the interweaving of these issues. Although more effective methods have been
presented in this thesis to tackle these difficulties, our exploration is only a beginning; there
are a host of issues worth further investigation and attention.
Optical flow formulation.
The formulation of an optical flow estimation technique determines the accuracy upper
bound of that technique. For example, robust formulations model the noise effect more
realistically than least-squares formulations, and hence their accuracy uniformly surpasses
that of the latter in practice. Years of research effort have been devoted to developing more precise
formulations. A question that naturally arises is: “Is there an optimal formulation?”
Answering the question requires defining the best optical flow, which is largely application-
specific: if the application is 3D structure/dynamics analysis, the best flow should be iden-
tical to the projected velocity field, whereas if the application is motion-compensated video
coding, the best flow does not necessarily coincide with the projected motion as long as it
minimizes the coding cost. Low-level approaches do not aim at any specific application—it
is exactly their goal and strength to measure visual motion in a general setting—and there-
fore the best flow cannot be defined for them. It is usually implied in developing low-level
techniques that the best flow estimate is the projected 2D velocity field. But since the
projected motion is normally unknown, it cannot be used to derive the objective function.
For the above reasons, there is no optimal formulation for low-level approaches.
Progress in refining low-level formulations has been made by identifying more appropri-
ate models for each individual component. In more than two decades’ intensive research,
a large number of formulations have been studied, among them those considered the most
promising are formulations employing global optimization and robust criteria. Our work ad-
vances the state of the art in this direction by introducing three-frame matching to overcome the
visibility problem at occlusions and allowing local variation in the global scheme to reduce
parameter tuning and improve local adaptivity. For further improvement, problems such
as the modeling of the three-frame matching error, the choice of robust estimator and the
learning of parameters will be investigated in our future work. Developing more appropriate
formulations continues to be a significant topic in the field of optical flow estimation.
Energy minimization.
In refining optical flow formulations, computational complexity increases rapidly with
model sophistication. This is especially true for global approaches involving large-scale
nonconvex optimization problems. No practical numerical methods exist for finding the
global optimum; only a local optimum can be found. This fact causes great difficulties in
problem diagnosis: when a technique yields a poor estimate, it can be very hard to tell
whether it is due to formulation weaknesses or to the local-minimum nature of the solution.
Investigating more globally optimal solutions to the large-scale nonconvex problems is
one of our immediate future work directions. Towards this end, methods such as graph cuts
[23], which have yielded good results in stereo matching, full multigrid methods [102], Bayesian
belief propagation (BBP) [137], and local minimization methods alternative to SOR [102] are
worth studying. As many areas in computer vision are converging to energy minimization
formulations, progress in this research is expected to have impacts in a wide context.
Uncertainty analysis.
Systematic uncertainty analysis is a very crucial, yet very difficult problem. We made an
attempt to examine the uncertainty in the local gradient-based flow estimate and demon-
strated its effectiveness in a motion boundary detection application. Although our approach
is one step farther than previous efforts, it is based on propagating small perturbations
through approximately linear systems and breaks down when the estimate quality becomes
too low. How to make a system aware of its own failure remains an open issue. In addi-
tion, almost all previous error analyses were performed for local techniques; how to measure
the uncertainty of a global approach is another open issue.
Performance evaluation and comparison.
The significance of comparative evaluation cannot be overstated: it is necessary to
assess the performance of both established and new algorithms and to gauge the progress in
the field [7, 111]. So far the most popular evaluation method is to measure the difference be-
tween the flow estimate and the projected 2D velocity field. To maintain comparability with
previously published results, we followed the methodology of Barron et al. in this thesis and
reported certain statistics of the difference between our flow estimates and the synthesized
groundtruth. However, as we pointed out in Section 4.3.3, such evaluation methods are
flawed due to the aperture problem; they become problematic in textureless regions, where
the correctness of “groundtruth” becomes questionable and so does the authority of quan-
titative evaluation based on it. For this reason, together with the simplicity of synthetic
data and error measures, “quantitative” results should be considered as qualitative at best.
The above suggests that the inherent ambiguity of the optical flow should be taken into
account in quantitative evaluation—larger errors should be allowed in regions of less local
information. Developing more convincing evaluation methods deserves serious attention.
Bayesian framework.
The benefits of the Bayesian framework should be fully exploited. Among all criteria
from which the global energy may arise, we find the Bayesian approach most appealing in
both theoretic and practical interest. Estimating optical flow from a few images is inherently
ambiguous: areas with more appropriate textures have higher estimation certainty. This
indicates that the nature of the problem is probabilistic instead of deterministic. Further-
more, the Bayesian formulation may provide a graceful solution to two important problems:
global optimization and uncertainty analysis [126, 7, 144]. Interesting results from a global
optimization method, Bayesian belief propagation (BBP), have been shown on a limited do-
main of vision problems [137]. BBP propagates estimates together with their covariances. If
it converges, it converges in a small number of iterations with covariances as a by-product.
It will be interesting to see if ideas like BBP are applicable and beneficial to optical flow
estimation.
Applications.
As an accurate and efficient low-level approach to visual motion analysis, the new method
has great potential in a wide variety of applications. First of all, it provides a good starting
point for higher-level motion analysis. Our flow estimates already exhibit a layered appearance, and
the motion boundaries of the layers are closed curves. They can reliably initialize motion segmen-
tation [109], contour-based [86] and layered [131] representation. Model selection [130] is
a crucial problem in automatic scene analysis [16]; it is difficult because comparing a col-
lection of models on the raw image data involves formidable computation. Our results can
ease this task by supplying a stronger starting point for scene knowledge learning. The backward-
forward matching error, together with detected motion boundaries, can facilitate occlusion
reasoning [16]. It may also guide image warping to avoid smoothing across motion disconti-
nuities. Some success has been obtained in our preliminary experiments. This is important
to motion estimation as well as for novel view synthesis.
Visual reconstruction.
Motion estimation is one of many low-level visual reconstruction problems. Many con-
clusions from our work are also extendable to other low-level visual problems such as stereo
matching, 3D surface reconstruction and image restoration.
BIBLIOGRAPHY
[1] M.D. Abramoff, W. J. Niessen, and M. A. Viergever. Objective quantification of the
motion of soft tissues in the orbit. IEEE Trans. on Medical Imaging, 19(10):986–995,
2000.
[2] G. Adiv. Determining three-dimensional motion and structure from optical flow gen-
erated by several moving objects. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 7(4):384–401, 1985.
[3] P. Anandan. A computational framework and an algorithm for the measurement of
visual motion. International Journal of Computer Vision, 2:283–310, 1989.
[4] S. Ayer, P. Schroeter, and J. Bigun. Segmentation of moving objects by robust motion
parameter estimation over multiple frames. In Proc. European Conf. on Computer
Vision, volume 2, pages 316–327, 1994.
[5] A. Bab-Hadiashar and D. Suter. Robust optical flow estimation. International Journal
of Computer Vision, 29(1):59–77, 1998.
[6] Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press,
1988.
[7] J. L. Barron, S. S. Beauchemin, and D. J. Fleet. Performance of optical flow tech-
niques. International Journal of Computer Vision, 12(1):43–77, 1994.
[8] J.L. Barron. A survey of approaches for determining optic flow, environmental layout
and egomotion. In RBCV-TR, 1984.
[9] R. Battiti, E. Amaldi, and C. Koch. Computing optical flow across multiple scales: An
adaptive coarse-to-fine strategy. International Journal of Computer Vision, 6(2):133–
145, 1991.
[10] S. S. Beauchemin and J. L. Barron. The computation of optical flow. ACM Computing
Surveys, 27(3):433–467, 1995.
[11] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based
motion estimation. In Proc. European Conf. on Computer Vision, pages 237–252,
1992.
[12] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. A three-frame algorithm for esti-
mating two-component image motion. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 14(9):886–896, 1992.
[13] P. J. Besl, J. B. Birch, and L. T. Watson. Robust window operators. In Proc. European
Conf. on Computer Vision, pages 591–600, 1988.
[14] M. J. Black. Robust Incremental Optical Flow. Doctoral dissertation (research report),
Yale Univ., 1992.
[15] M. J. Black and P. Anandan. The robust estimation of multiple motions: paramet-
ric and piecewise-smooth flow fields. Computer Vision and Image Understanding,
63(1):75–104, 1996.
[16] M. J. Black and D. J. Fleet. Probabilistic detection and tracking of motion disconti-
nuities. International Journal of Computer Vision, 38:229–243, 2000.
[17] M. J. Black and A. Jepson. Estimating optical flow in segmented images using variable-
order parametric models with local deformations. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 18(10):972–986, 1996.
[18] M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejec-
tion, and robust statistics with applications in early vision. International Journal of
Computer Vision, 19:57–91, 1996.
[19] M. J. Black, G. Sapiro, D. Marimont, and D. Heeger. Robust anisotropic diffusion.
IEEE Trans. Image Processing: Special issue on Partial Differential Equations and
Geometry Driven Diffusion in Image Processing and Analysis, 7(3):421–432, 1998.
[20] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, Cambridge, MA, 1987.
[21] P. Bouthemy and E. Francois. Motion segmentation and qualitative dynamic
scene analysis from an image sequence. International Journal of Computer Vision,
10(2):157–182, 1993.
[22] P. Bouthemy and J. S. Rivero. A hierarchical likelihood approach for region segmen-
tation according to motion-based criteria. In Proc. International Conf. on Computer
Vision, pages 463–467, 1987.
[23] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via
graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11):1–18,
2001.
[24] M. Brooks, W. Chojnacki, D. Gawley, and A. van den Hengel. What value covari-
ance information in estimating vision parameters? In Proc. International Conf. on
Computer Vision, pages 302–308, 2001.
[25] K. Bubna and C. V. Stewart. Model selection and surface merging in reconstruction
algorithms. In Proc. International Conf. on Computer Vision, pages 895–902, 1998.
[26] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE
Trans. on Communication, 31:532–540, 1983.
[27] C. Fermuller, R. Pless, and Y. Aloimonos. Statistical biases in optical flow. In Proc.
Computer Vision and Pattern Recognition, volume 1, pages 561–566, 1999.
[28] C. Cafforio and F. Rocca. Methods for measuring small displacements of television
images. IEEE Trans. on Information Theory, (5):573–579, 1976.
[29] S. E. Chen and L. Williams. View interpolation for image synthesis. Computer
Graphics, 27(Annual Conference Series):279–288, 1993.
[30] C-Y. Chong, D. Garren, and T.P. Grayson. Ground target tracking: a historical
perspective. In IEEE Proc. Aerospace Conference, volume 3, pages 433–448, 2000.
[31] R. Cipolla, K. Okamoto, and Y. Kuno. Robust structure from motion using motion
parallax. In Proc. International Conf. on Computer Vision, pages 374–382, 1993.
[32] C. Colombo and A. del Bimbo. Generalized bounds for time to collision from first-
order image motion. In Proc. International Conf. on Computer Vision, pages 220–226,
1999.
[33] T. Darrell and A. Pentland. Cooperative robust estimation using layers of support.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5):474–487, 1995.
[34] A.M. Earnshaw and S.D. Blostein. The performance of camera translation direction
estimators from optical flow: Analysis, comparison, and theoretical limits. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 18(9):927–932, 1996.
[35] G. Farneback. Very high accuracy velocity estimation using orientation tensors, para-
metric motion, and simultaneous segmentation of the motion field. In Proc. Interna-
tional Conf. on Computer Vision, volume 1, pages 171–177, 2001.
[36] O. Faugeras. Three-dimensional computer vision: a geometric viewpoint. MIT Press,
1993.
[37] C.L. Fennema and W.B. Thompson. Velocity determination in scenes containing
several moving objects. Computer Graphics and Image Processing, 9:301–315, 1979.
[38] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local
phase information. International Journal of Computer Vision, 5(1):77–104, 1990.
[39] B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills. Recovering motion fields:
An evaluation of eight optical flow algorithms. In Proc. British Machine Vision Conf.,
volume 1, pages 195–204, 1998.
[40] T. Gandhi, M. Yang, R. Kasturi, O. Camps, and L. Coraor. Detection of obstacles
in the flight path of an aircraft. In Proc. Computer Vision and Pattern Recognition,
volume 2, pages 304–311, 2000.
[41] D. Geman and G. Reynolds. Constrained restoration and the recovery of disconti-
nuities. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(3):367–384,
1992.
[42] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence,
6(6):721–741, 1984.
[43] S. Geman and D. E. McClure. Statistical methods for tomographic image reconstruc-
tion. Bull. Int. Statist. Inst., 2(4):5–21, 1987.
[44] S. Ghosal and P. Vanek. A fast scalable algorithm for discontinuous optical flow
estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(2):181–
194, 1996.
[45] A. Giachetti, M. Campani, and V. Torre. The use of optical flow for the autonomous
navigation. In Proc. European Conf. on Computer Vision, pages A:146–151, 1994.
[46] N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to nonlinear/non-
Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2):107–113, 1993.
[47] N. Gupta and L. Kanal. Gradient based motion estimation without computing gra-
dients. International Journal of Computer Vision, 22:81–101, 1997.
[48] K.J. Hanna. Direct multi-resolution estimation of ego-motion and structure from
motion. In Proc. Workshop on Visual Motion, pages 156–162, 1991.
[49] R. M. Haralick. Computer vision theory: the lack thereof. CVGIP, 36:372–386, 1986.
[50] R. M. Haralick, editor. Proc. 1st International Workshop on Robust Computer Vision,
Seattle, WA, Oct. 1990.
[51] R. M. Haralick, editor. Workshop Proc. Performance vs. Methodology in Computer
Vision, Seattle, WA, June 1994.
[52] R. M. Haralick. Propagating covariance in computer vision. International Journal of
Pattern Recognition and Artificial Intelligence, 10(5):561–572, 1996.
[53] R. M. Haralick and J. S. Lee. The facet approach to optic flow. In Proc. Image
Understanding Workshop, pages 84–93, 1983.
[54] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision. Addison-Wesley
publishing company, 1992.
[55] R. M. Haralick and L. Watson. A facet model for image data. CVGIP, 15:113–129,
1981.
[56] D. Heeger. Optical flow using spatio-temporal filters. International Journal of Com-
puter Vision, 1(4):279–302, 1988.
[57] F. Heitz and P. Bouthemy. Multimodal estimation of discontinuous optical flow using
Markov random fields. IEEE Trans. on Pattern Analysis and Machine Intelligence,
15(12):1217–1232, 1993.
[58] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence,
17:185–203, 1981.
[59] Y. Huang, K. Palaniappan, X. Zhuang, and J. E. Cavanaugh. Optical flow field seg-
mentation and motion estimation using a robust genetic partitioning algorithm. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 17(12):1177–1190, 1995.
[60] M. Ioka and M. Kurokawa. Estimation of motion vectors and their application to
scene retrieval. MVA, 7(3):199–208, 1994.
[61] M. Irani. Multi-frame optical flow estimation using subspace constraints. In Proc.
International Conf. on Computer Vision, pages 626–633, 1999.
[62] M. Irani and P. Anandan. A unified approach to moving object detection in 2d and
3d scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(6):577–589,
1998.
[63] M. Irani, B. Rousso, and S. Peleg. Computing occluding and transparent motions.
International Journal of Computer Vision, 12:5–16, 1994.
[64] M. Irani, B. Rousso, and S. Peleg. Recovery of ego-motion using region alignment.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(3):268–272, 1997.
[65] M. Isard and A. Blake. Condensation — conditional density propagation for visual
tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[66] B. Jahne. Motion determination in space-time images. In Proc. European Conf. on
Computer Vision, pages 161–173, 1990.
[67] T. Jebara, A. Azarbayejani, and A. Pentland. 3d structure from 2d motion. IEEE
Signal Processing Magazine, pages 66–84, May 1999.
[68] A. Jepson and M. J. Black. Mixture models for optical flow computation. Tech.
Report, Res. in Biol. and Comp. Vision RBCV-TR-93-44, Univ. of Toronto, 1993.
[69] A. D. Jepson, D. J. Fleet, and T. El-Maraghi. Robust, on-line appearance models for
vision tracking. In Proc. Computer Vision and Pattern Recognition, volume 1, pages
415–422, 2001.
[70] J.-M. Jolion, P. Meer, and S. Bataouche. Robust clustering with applications in com-
puter vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(8):791–
802, 1991.
[71] S. X. Ju, M. J. Black, and A. D. Jepson. Skin and bones: multi-layer, locally affine,
optical flow and regularization with transparency. In Proc. Computer Vision and
Pattern Recognition, pages 307–314, 1996.
[72] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window:
theory and experiment. IEEE Trans. on Pattern Analysis and Machine Intelligence,
16(9):920–932, 1994.
[73] S. B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo.
In Proc. Computer Vision and Pattern Recognition, volume 1, pages 103–110, 2001.
[74] J. K. Kearney, W. B. Thompson, and D. L. Boley. Optical flow estimation: an error
analysis of gradient-based methods with local optimization. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 9(2):229–244, 1987.
[75] V. Koivunen. A robust nonlinear filter for image restoration. IEEE Trans. Image
Processing, 4(5):569–578, 1995.
[76] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion
reasoning. In Proc. European Conf. on Computer Vision, volume 1, pages 189–196,
1994.
[77] J. Konrad and E. Dubois. Bayesian estimation of motion vector fields. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 14(9):910–927, 1992.
[78] R. Kumar, P. Anandan, and K. Hanna. Direct recovery of shape from multiple views:
a parallax based approach. In Proc. International Conf. on Pattern Recognition, pages
685–688, 1994.
[79] J. O. Limb and J. A. Murphy. Estimating the velocity of moving images in television
signals. Computer Graphics and Image Processing, 4:311–327, 1975.
[80] M. I. A. Lourakis, A. A. Argyros, and S. C. Orphanoudakis. Independent 3d motion
detection using residual parallax normal flow fields. In Proc. International Conf. on
Computer Vision, pages 1012–1017, 1998.
[81] M. I. A. Lourakis and S. C. Orphanoudakis. Using planar parallax to estimate the
time-to-contact. In Proc. Computer Vision and Pattern Recognition, pages 640–645,
1999.
[82] B. D. Lucas and T. Kanade. An iterative image-registration technique with an ap-
plication to stereo vision. In Proc. Image Understanding Workshop, pages 121–130,
1981.
[83] D. Marr. On the purpose of low-level vision. In MIT AI Memo, 1974.
[84] L. H. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for es-
timating depth from image sequences. International Journal of Computer Vision,
3:209–236, 1989.
[85] P. Meer, D. Mintz, D. Y. Kim, and A. Rosenfeld. Robust regression methods for
computer vision: a review. International Journal of Computer Vision, 6(1):59–70,
1991.
[86] E. Memin and P. Perez. Dense estimation and object-based segmentation of the optical
flow with robust techniques. IEEE Trans. Image Processing, 7(5):703–719, 1998.
[87] M. Middendorf and H. H. Nagel. Estimation and interpretation of discontinuities in
optical flow fields. In Proc. International Conf. on Computer Vision, pages 178–183,
2001.
[88] D. W. Murray and B. F. Buxton. Scene segmentation from visual motion using global
optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):220–
228, 1987.
[89] H. H. Nagel. On the estimation of optical flow: Relations between different approaches
and some new results. Artificial Intelligence, 33(3):299–324, 1987.
[90] H. H. Nagel. Optical flow estimation and the interaction between measurement errors
at adjacent pixel positions. International Journal of Computer Vision, 15:271–288,
1995.
[91] H. H. Nagel and W. Enkelmann. An investigation of smoothness constraints for
the estimation of displacement vector fields from image sequences. IEEE Trans. on
Pattern Analysis and Machine Intelligence, 8(5):565–593, 1986.
[92] H.H. Nagel and M. Haag. Bias-corrected optical flow estimation for road vehicle
tracking. In Proc. International Conf. on Computer Vision, pages 1006–1011, 1998.
[93] H.H. Nagel, G. Socher, H. Kollnig, and M. Otte. Motion boundary detection in image
sequences by local stochastic tests. In Proc. European Conf. on Computer Vision,
volume 2, pages 305–315, 1994.
[94] P. Nesi, A. D. Bimbo, and D. Ben-Tzvi. A robust algorithm for optical flow estimation.
Computer Vision and Image Understanding, 62(1):59–68, 1995.
[95] L. Ng and V. Solo. Errors-in-variable modelling in optical flow problems. In Proc. Int.
Conf. on Acoustics Speech and Signal Processing, volume 5, pages 2773–2776, 1998.
[96] N. Ohta. Uncertainty models of the gradient constraint for optical flow computation.
IEICE Trans. Info. & Sys., E79-D(7):958–962, 1996.
[97] E. P. Ong and M. Spann. Robust optical flow computation based on least-median-of-
squares. International Journal of Computer Vision, 31(1):51–82, 1999.
[98] M. Otte and H.H. Nagel. Optical flow estimation: Advances and comparisons. In
Proc. European Conf. on Computer Vision, pages 51–60, 1994.
[99] N. Peterfreund. Robust tracking of position and velocity with kalman snakes. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 21(6):564–569, 1999.
[100] D. Piponi. Virtual cinematography in “The Matrix”.
http://www2.parc.com/ops/projects/forum/2000/forum-07-13.html, 2000.
[101] R. Pless, T. Brodsky, and Y. Aloimonos. Detecting independent motion: The statistics
of temporal continuity. IEEE Trans. on Pattern Analysis and Machine Intelligence,
22(8):768–773, 2000.
[102] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes
in C. Cambridge Univ. Press, 2nd edition, 1997.
[103] P. J. Rousseeuw and S. Van Aelst. Positive-breakdown robust methods in computer
vision. Computing Science and Statistics, 31:451–460, 1999.
[104] P. J. Rousseeuw and K. Van Driessen. Computing LTS regression for large data sets.
Tech. report (submitted), Univ. of Antwerp.
[105] P. J. Rousseeuw and M. Hubert. Recent developments in PROGRESS. In Y. Dodge, editor,
L1-Statistical Procedures and Related Topics, volume 31, pages 201–214. Institute of
Mathematical Statistics Lecture Notes-Monograph Series, 1997.
[106] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John
Wiley and Sons, 1987.
[107] Y. Rui and P. Anandan. Segmenting visual actions based on spatio-temporal motion
patterns. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 111–
118, 2000.
[108] H. S. Sawhney. 3d geometry from planar parallax. In Proc. Computer Vision and
Pattern Recognition, pages 929–934, 1994.
[109] H. S. Sawhney and S. Ayer. Compact representations of videos through dominant
and multiple motion estimation. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 18(8):814–830, 1996.
[110] H. S. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3d scenes.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(10):1191–1199, 2000.
[111] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. International Journal of Computer Vision, 47(1):7–42,
2002.
[112] R.R. Schultz, L. Meng, and R.L. Stevenson. Subpixel motion estimation for super-
resolution image sequence enhancement. Journal of Visual Communication and Image
Representation, 9(1):38–50, 1998.
[113] B. G. Schunck. Image flow segmentation and estimation by constraint line clustering.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(10):1010–1027, 1989.
[114] H. Schweitzer. Occam algorithms for computing visual motion. IEEE Trans. on
Pattern Analysis and Machine Intelligence, 17(11):1033–1042, 1995.
[115] A. Shashua and N. Navab. Relative affine structure: theory and application to 3d
reconstruction from perspective views. In Proc. Computer Vision and Pattern Recog-
nition, pages 483–489, 1994.
[116] D. Shulman and J. Herve. Regularization of discontinuous flow fields. In Proc. Work-
shop on Visual Motion, pages 81–85, 1989.
[117] D-G. Sim and R-H. Park. Robust reweighted MAP motion estimation. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 20(4):353–365, 1998.
[118] E. P. Simoncelli. Distributed Analysis and Representation of Visual Motion. Doctoral
dissertation, MIT, 1993.
[119] E.P. Simoncelli, E.H. Adelson, and D. Heeger. Probability distributions of optical
flow. In Proc. Computer Vision and Pattern Recognition, pages 310–315, 1991.
[120] A. Singh. Optic Flow Computation: A Unified Perspective. IEEE Press, 1990.
[121] S. S. Sinha and B. G. Schunck. A two-stage algorithm for discontinuity-preserving
surface reconstruction. IEEE Trans. on Pattern Analysis and Machine Intelligence,
14(1):36–55, 1992.
[122] S. Srinivasan. In Proc. International Conf. on Computer Vision, volume 1.
[123] G. P. Stein and A. Shashua. Model-based brightness constraints: on direct estimation
of structure and motion. IEEE Trans. on Pattern Analysis and Machine Intelligence,
22(9):992–1015, 2000.
[124] C. V. Stewart. Expected performance of robust estimators near discontinuities. In
Proc. International Conf. on Computer Vision, pages 969–974, 1995.
[125] S. Sun, D. Haynor, and Y. Kim. Motion estimation based on optical flow with adaptive
gradients. In Proc. International Conf. on Image Processing, pages 852–855, 2000.
[126] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. Kluwer Academic
Pub., 1989.
[127] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In Proc.
International Conf. on Image Processing, pages 194–201, 1994.
[128] R. Szeliski and H.-Y. Shum. Motion estimation with quadtree splines. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 18(12):1199–1210, 1996.
[129] H. Tao, H.S. Sawhney, and R. Kumar. A global matching framework for stereo com-
putation. In Proc. International Conf. on Computer Vision, pages 532–539, 2001.
[130] P. H. S. Torr. Geometric motion segmentation and model selection. In J. Lasenby
et al., editor, Phil. Trans. Royal Society of London A, pages 1321–1340, 1998.
[131] P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer
extraction from image sequences. In Proc. International Conf. on Computer Vision,
pages 983–990, 1998.
[132] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment –
A modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision
Algorithms: Theory and Practice, LNCS, pages 298–375. Springer Verlag, 2000.
[133] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus, 2nd Edition.
Springer, 1997.
[134] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Trans.
Image Processing, 3(5):625–638, 1994.
[135] J. Weber and J. Malik. Robust computation of optical flow in a multi-scale differential
framework. International Journal of Computer Vision, 14(1):67–81, 1995.
[136] Y. Weiss. Bayesian belief propagation for image understanding. In Workshop on Sta-
tistical and Computational Theories of Vision 1999: Modeling, Learning, Computing,
and Sampling (submitted for publication), 1999.
[137] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical
models of arbitrary topology. Neural Comp., 13(10):2173–2200, 2001.
[138] R. R. Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Academic
Press, 1997.
[139] P. Willett, R. Niu, and Y. Bar-Shalom. Integration of Bayes detection with target
tracking. IEEE Trans. on Signal Processing, 49(1):17–30, 2000.
[140] Y. Xiong and S. A. Shafer. Moment and hypergeometric filters for high precision
computation of focus, stereo and optical flow. International Journal of Computer
Vision, 22(1):25–59, 1997.
[141] M. Ye. Image flow estimation using facet model and covariance propagation. M.S.
Thesis, Univ. of Washington, Seattle, WA, USA, 1999.
[142] M. Ye, M. Bern, and D. Goldberg. Document image matching and annotation lifting.
In Proc. International Conference on Document Analysis and Recognition, pages 753–
760, 2001.
[143] M. Ye and R. M. Haralick. Image flow estimation using facet model and covariance
propagation. In Vision Interface, pages 51–58, 1998.
[144] M. Ye and R. M. Haralick. Image flow estimation using facet model and covariance
propagation. In M. Cheriet and Y.H. Yang, editors, Vision Interface: Real World
Applications of Computer Vision, pages 209–241. World Sci., 2000.
[145] M. Ye and R. M. Haralick. Optical flow from a least-trimmed squares based adaptive
approach. In Proc. International Conf. on Pattern Recognition, pages 1052–1055,
2000.
[146] M. Ye and R. M. Haralick. Two-stage robust optical flow estimation. In Proc. Com-
puter Vision and Pattern Recognition, pages 623–628, 2000.
[147] M. Ye and R. M. Haralick. Local gradient global matching piecewise smooth optical
flow. In Proc. Computer Vision and Pattern Recognition, pages 712–717, 2001.
[148] M. Ye and R. M. Haralick. Point aerial target detection and tracking — a motion-
based bayesian approach. ISL Tech Report, Univ. of Washington, 2001.
[149] M. Ye, R. M. Haralick, and L. G. Shapiro. Estimating optical flow using a global
matching formulation and graduated optimization. In Proc. International Conf. on
Image Processing, Rochester, NY, September 2002. To appear.
[150] K. Zhang, M. Bober, and J. Kittler. Motion based image segmentation for video
coding. In Proc. International Conf. on Image Processing, pages 476–479, 1995.
[151] S. C. Zhu and A. Yuille. Region competition: unifying snakes, region growing, and
Bayes/MDL for multiband image segmentation. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 18(9):884–900, 1996.
VITA
Ming Ye was born in Chengdu, P.R. China, in January 1975. She received her B.S.
degree in Electrical Engineering from the University of Electronic Science and Technology
of China in June 1997. She then joined the Intelligent Systems Laboratory at the University
of Washington as a research assistant, where she obtained her M.S. degree in March 1999
and will receive her Ph.D. degree by December 2002, both in Electrical Engineering. She was
a research intern at the Xerox Palo Alto Research Center during the summer of 2000. Her
research is in the area of computer vision and image processing, with a focus on statistical
and robust approaches to visual motion analysis and applications.