Robust Visual Motion Analysis: Piecewise-Smooth Optical Flow and
Motion-Based Detection and Tracking
Ming Ye
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Washington
2002
Program Authorized to Offer Degree: Electrical Engineering
University of Washington
Abstract
Robust Visual Motion Analysis: Piecewise-Smooth Optical Flow and Motion-Based
Detection and Tracking
by Ming Ye
Co-Chairs of Supervisory Committee:
Professor Robert M. Haralick, Electrical Engineering
Professor Linda G. Shapiro, Computer Science and Engineering
This thesis describes new approaches to optical flow estimation and motion-based detection
and tracking. Statistical methods, particularly outlier rejection, error analysis and Bayesian
inference, are extensively exploited in our study and are shown to be crucial to the robust
analysis of visual motion.
To recover optical flow, or 2D velocity fields, from image sequences, certain models of
brightness conservation and flow smoothness must be assumed. How to cope with
model violations, especially motion discontinuities, thus becomes a very challenging issue. We
first tackle this problem with a local approach, that is, finding the most representative
flow vector for each small image region. We recast the popular gradient-based method as
a two-stage regression problem and apply adaptive robust estimators to both stages. The
estimators are adaptive in the sense that their complexity increases with the amount of
outlier contamination. Due to the limited contextual information, the local approach has
spatially varying uncertainty. We evaluate the uncertainty systematically through covari-
ance propagation.
Pointing out the limitations of local and gradient-based methods, we further propose
a matching-based global optimization technique. The optimal estimate is formulated as
maximizing the a posteriori probability of the optical flow given three image frames. Using
a Markov random field flow model and robust statistics, the formulation reduces to mini-
mizing a regularization-type global energy function, which we carefully design so as to
accommodate outliers, occlusions and local adaptivity. Minimizing the resulting large-scale
nonconvex function is nontrivial and is often the performance bottleneck of previous global
techniques. To overcome this problem, we develop a three-step graduated solution method
which inherits the advantages of various popular approaches and avoids their drawbacks.
This technique is highly efficient and accurate. Its performance is demonstrated through
experiments on both synthetic and real data and comparison with competing techniques.
By making only weak assumptions of spatiotemporal continuity, the two proposed tech-
niques are applicable to general scenarios, for example, to both rigid and nonrigid motion.
They serve as a foundation for object-based motion analysis. Many of their conclusions are
also extendable to other visual surface reconstruction problems such as image restoration
and stereo matching.
The last part of the thesis describes a motion-based detection and tracking system
designed for an airborne visual surveillance application, in which challenges arise from the
small target size (1×2 to 3×3 pixels), low image quality, substantial camera wobble and
abundant background clutter. The system is composed of a detector and a tracker. The
former identifies suspicious objects by the statistical difference between their motion and
the background motion; the latter employs a Kalman filter to track the dynamic behavior of
objects in order to detect real targets and update their states. Both components operate in
a Bayesian mode, and each benefits from the other’s accuracy. The system exhibits excellent
performance in experiments. In an 1800-frame real video, it produces no false detections
and tracks the true target from the second frame on, with an average position error below 1 pixel.
This probabilistic approach reduces parameter tuning to a minimum. It also facilitates data
fusion from different information channels.
TABLE OF CONTENTS
List of Figures iv
List of Tables vi
Chapter 1: Introduction 1
1.1 Optical Flow Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 A Local Method with Error Analysis . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 A Global Optimization Method . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Motion-Based Target Detection and Tracking . . . . . . . . . . . . . . . . . . 12
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 2: Estimating Optical Flow: Approaches and Issues 15
2.1 Brightness Conservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Flow Field Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Typical Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Robust Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Hierarchical Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chapter 3: Local Flow Estimation and Error Analysis 34
3.1 A Two-Stage-Robust Adaptive Technique . . . . . . . . . . . . . . . . . . . . 34
3.1.1 Linear Regression and Robustness . . . . . . . . . . . . . . . . . . . . 35
3.1.2 Two-Stage Regression Model . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.3 Choosing Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.4 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Adaptive High-Breakdown Robust Methods For Visual Reconstruction . . . . 50
3.2.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Error Analysis on Robust Local Flow . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Covariance Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 4: Global Matching with Graduated Optimization 70
4.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 MAP Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.2 MRF Prior Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.3 Likelihood Model: Robust Three-Frame Matching . . . . . . . . . . . 74
4.1.4 Global Energy with Local Adaptivity . . . . . . . . . . . . . . . . . . 75
4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 Step I: Gradient-Based Local Regression . . . . . . . . . . . . . . . . . 77
4.2.2 Step II: Gradient-Based Global Optimization . . . . . . . . . . . . . . 77
4.2.3 Step III: Global Matching . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.4 Overall Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1 Quantitative Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.2 TS: An Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.3 Barron’s Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.4 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 5: Motion-Based Detection and Tracking 96
5.1 Bayesian State Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Motion-Based Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Bayesian Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 6: Conclusions 117
6.1 Summary and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Open Questions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 121
Bibliography 125
LIST OF FIGURES
1.1 Example Optical flow on flower garden sequence . . . . . . . . . . . . . . . . 2
1.2 Motion estimation by template matching . . . . . . . . . . . . . . . . . . . . . 5
1.3 Motion analysis for airborne video surveillance . . . . . . . . . . . . . . . . . 12
2.1 Aperture problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Hierarchical processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Comparison of Geman-McClure norm and L2 norm . . . . . . . . . . . . . . . 38
3.2 Block diagram of the two-stage-robust adaptive algorithm . . . . . . . . . . . 44
3.3 Central frame of the synthetic sequence (5 frames, 32×32) . . . . . . . . . . 45
3.4 Correct flow field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 OFC cluster plots at three typical pixels . . . . . . . . . . . . . . . . . . . . . 46
3.6 TS sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Pepsi sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8 Pepsi: estimated flow fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.9 Random sampling based algorithm for high-breakdown robust estimators . . 51
3.10 Adaptive algorithm for high-breakdown robust estimators . . . . . . . . . . . 52
3.11 TS: trial set size map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.12 TS: correct and estimated flow fields . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 TT, DT middle frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.14 YOS middle frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.15 OTTE sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.16 TAXI sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.17 TAXI: intensity images of x-component . . . . . . . . . . . . . . . . . . . . . 60
3.18 TS motion boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.19 TAXI motion boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.20 TAXI: motion boundary on images subsampled by 2 . . . . . . . . . . . . . . 69
4.1 Comparison of Geman-McClure norm and L2 norm . . . . . . . . . . . . . . . 73
4.2 System diagram (operations at each pyramid level) . . . . . . . . . . . . . . . 80
4.3 TS sequence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Error cdf curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5 DTTT sequence results (motion boundaries highlighted in (a)). . . . . . . . . 88
4.6 Taxi results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.7 Flower garden results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.8 Traffic results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.9 Pepsi can results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1 A typical detection-tracking system . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Proposed Bayesian system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Example data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4 f16502 target pixel candidates . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 f18300 and f19000 target pixel candidates . . . . . . . . . . . . . . . . . . . . 108
5.6 Target pixels for f16502, f18300 and f19000 . . . . . . . . . . . . . . . . . . . 110
5.7 Detection results w and w/o priors on f16503 . . . . . . . . . . . . . . . . . . 112
5.8 Two sample frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
LIST OF TABLES
3.1 Comparison of four popular regression criteria (estimators) . . . . . . . . . . 40
3.2 TS sequence: comparison of average error percentage . . . . . . . . . . . . . . 48
3.3 Quantitative comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 Quantitative measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Comparison of various techniques on Yosemite (cloud part excluded) with
Barron’s angular error measure . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1 Quantitative measures in 1800 frames . . . . . . . . . . . . . . . . . . . . . . 115
ACKNOWLEDGMENTS
It is a great pleasure to express my gratitude to all those who have made this disser-
tation possible. First, I thank my co-advisor Prof. Robert Haralick, a man of wisdom
and rigor, for his guidance and support during both my master’s and doctoral study.
I am deeply indebted to Prof. Linda Shapiro, who became my co-advisor in my final
year and helped me through the critical period of time with constant support and
encouragement.
I would also like to thank other members of my supervisory committee: Prof.
Jenq-Neng Hwang, Prof. Qiang Ji, Prof. Werner Stuetzle, Prof. Ming-Ting Sun
and Prof. David Thouless, who monitored my work and put in the effort to read
earlier versions of this dissertation.
My former colleagues in the Intelligence Systems Laboratory: Dr. Qiang Ji, Dr.
Gang Liu, Dr. Desikachari Nadadur, Dr. Selim Aksoy, Dr. Mingzhou Song, Dr.
Jisheng Liang, Dr. Lei Sui and Dr. Yalin Wang, deserve many thanks for their
friendship and help. I especially want to thank Dr. Qiang Ji and Dr. Gang Liu for
pleasant and fruitful discussions and brotherly advice that helped me stay encouraged
and on the right track.
I am grateful to the Electrical Engineering Department for providing me a good
work environment during my final year. Particularly, I must thank Helene Obradovich
for her efforts to support me with a teaching assistantship, Frankye Jones for keeping
an eye on my progress, and Sekar Thiagarajan and his team for their computing
support.
I wish to express sincere appreciation to Dr. Marshall Bern and Dr. David Gold-
berg for giving me the opportunity to work at the Xerox Palo Alto Research Center
(PARC). Their advice, encouragement and friendship made my summer internship
at PARC a very productive and enjoyable one.
Last but certainly not least, I am forever indebted to my family and friends for their
love and care. Special thanks go to my dear husband Chengyang Li, who
will receive his Ph.D. about the same time, and to my dear parents and sister for
supporting and encouraging me to pursue my academic aspirations.
Chapter 1
INTRODUCTION
Visual motion is the 2D velocity field corresponding to the movement of brightness
patterns in the image plane of a visual sensor. It usually arises from the relative motion
between 3D objects and the observer, and it provides rich information about the surface
structures of the objects and their dynamic behavior [58, 89]. Human beings rely on the
skills of perceiving and understanding visual motion in order to move around, meet with
people, watch movies and perform many other essential daily tasks. If we want computers to
assist us and interact with us, we must endow them with a similar capability for analyzing
visual motion, that is, accurately measuring and appropriately interpreting the 2D velocity
present in digital images. This has turned out to be a highly complicated and error-prone
process. The co-existence of profound significance and great challenge makes visual motion
analysis a very important and active research area in computer vision.
Optical flow is a flexible representation of visual motion that is particularly suitable
for computers analyzing digital images. It associates each image pixel (x, y) with a two-
component vector u = (u(x, y), v(x, y))^T, indicating its apparent instantaneous 2D velocity.
The optical flow representation is adopted throughout this thesis and henceforth we use the
terms “visual motion” and “optical flow” interchangeably. In order to illustrate the concept
of optical flow, Figure 1.1 shows three frames that are part of a video sequence taken by a
camera passing in front of a flower garden. The optical flow estimated for the second frame,
subsampled by a factor of 8 in each direction to avoid clutter, is shown in Figure 1.1(d).
Overall, it agrees with our perception of motion in the scene.
Once available, optical flow estimates can be used to infer many 3D structural and
dynamic properties of the scene [54, 36]. In a general scenario, 2D image motion can
(a) Frame 1 (b) Frame 2 (c) Frame 3
(d) Optical flow estimated on Frame 2
Figure 1.1: Three frames in a video sequence taken by a camera passing in front of a flower garden and the estimated optical flow field
be caused by camera motion (ego-motion), motion of independent moving objects in the
scene, or a composite of these two. If a video sequence is taken by a moving camera of
a rigid 3D scene, as in the case of the flower garden sequence, analysis of this sequence
can lead to recovery of the camera motion (pose) [48, 64] and the 3D surface structure of
the scene [31, 78, 108, 115, 67]. When there are independent moving objects in the scene,
motion analysis can help determine the number of objects, their individual 3D motions, their
distances to the observer and surface structures. The above study is vital to applications
in environment modelling [8, 132], target detection and tracking [80, 92, 110, 32, 81, 101],
auto-navigation [2, 45, 123], video event analysis [60, 107] and medical image registration
[1].
Analyzing optical flow in the 2D domain is important in its own right. Many dynamic
features such as the focus of expansion [122], motion boundaries [93, 16] and occlusion
relationships [73] can be extracted from optical flow fields (although the extraction is much
less straightforward than it might intuitively seem; we will return to this topic in the next
section). These dynamic features can assist in image segmentation [88, 113, 14, 86, 131] and
independent motion detection [80, 92], and usually serve as intermediate measures to object-
based representations [110]. Moreover, temporal continuity encoded in visual motion has
been exploited for redundancy reduction in video compression [114, 150, 109], image/video
super-resolution [112], and removal of image noise [120] and image distortion [142].
Visual motion, as a compelling cue to the perception of 3D structure and realism, can
also be used for graphics and animation [29, 100]. For example, a cartoon character can be
made to mimic a human character’s expression by first measuring the human character’s
facial motion and then warping the cartoon character accordingly. Such concepts have
already been utilized in film production [100], and they are expected to play an increasingly
important role in the future with advances in computational technology.
All the visual motion applications discussed above assume that accurate optical flow
estimates are already available or can be conveniently computed. Unfortunately, recovering
optical flow from images is very difficult for three reasons. First of all, the movements of
brightness patterns in the image plane might not impose sufficient constraints on the actual
2D motion—this is the intrinsic ambiguity of optical flow. Secondly, in formulating the
problem of optical flow estimation, certain assumptions about the motion and the image
observation must be made; these assumptions, as simplifications of real-world phenomena,
can easily be violated and result in erroneous estimates. Finally, the computation involved
can be intensive and even prohibitive, so that a more appropriate formulation might not lead
to higher practical accuracy. Even worse, these difficulties are usually entangled,
making it very hard to tell which factors contribute to a failure. For the above reasons,
despite decades of active research and steady progress, the performance of existing optical
flow estimation techniques remains unsatisfactory. It is thus the main theme of this thesis
to explore new approaches to optical flow estimation which handle these problems more
effectively.
The rest of this chapter serves as a high-level overview of my dissertation. The following
section briefly reviews optical flow research and motivates our study. Two novel techniques,
exploiting local and global motion coherence respectively, are described in Section 1.2 and
Section 1.3. Section 1.4 discusses a detection and tracking system which can be considered
as an application of visual motion, and is built on top of various results established in
our study of optical flow estimation. Finally Section 1.5 gives an outline of the thesis.
Conclusions and contributions of various pieces of our work will be pointed out in each
individual section.
1.1 Optical Flow Estimation
Basics
Given the two images in Figure 1.2, the task of estimating the optical flow in the first
frame is to determine where each pixel in this frame moves to in the next frame. The most
intuitive method of doing this is probably template matching. Consider the pixel at the
center of the box, which is near the center of the front tree trunk, in Frame 1. In order
to find its corresponding point in Frame 2, we may take the image block within the box
as a template, search Frame 2 for the block most similar to it, and compute the optical
flow vector from the displacement between the centers of the blocks. Two assumptions are
implied in this matching process: (i) the template maintains its brightness pattern and (ii)
Frame 1 Frame 2
Figure 1.2: Motion estimation by template matching. Consider the pixel at the center of the white box in Frame 1. In order to find its corresponding point in Frame 2, we may take the image block within the box as a template, search Frame 2 for the block most similar to it; the displacement between the centers of the blocks is then the optical flow vector. Templates 1, 2 and 4 show the aperture problem. Template 3 shows a case of assumption violation caused by motion boundaries.
all pixels within the template move at the same speed. These are simple embodiments of the
brightness conservation and flow field smoothness assumptions, which are the foundation of
all motion estimation methods.
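The block-matching procedure just described can be sketched in a few lines of code. This is an illustrative sketch rather than an algorithm prescribed by the text: the window half-size, the search range, and the use of the sum of squared differences (SSD) as the matching error are all assumed choices.

```python
import numpy as np

def match_template(frame1, frame2, x, y, half=7, search=10):
    """Estimate the flow vector at pixel (x, y) by exhaustive block
    matching: compare a (2*half+1)-square template around (x, y) in
    frame1 against every candidate block in frame2 within +/- search
    pixels, using the sum of squared differences (SSD) as the error."""
    template = frame1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_err, best_uv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            block = frame2[y + dy - half:y + dy + half + 1,
                           x + dx - half:x + dx + half + 1].astype(float)
            err = np.sum((template - block) ** 2)
            if err < best_err:
                best_err, best_uv = err, (dx, dy)
    return best_uv  # displacement (u, v) in pixels
```

On well-textured regions this recovers the correct integer displacement; on the problematic templates discussed next (edges, flat sky, motion boundaries) the minimum becomes ambiguous or wrong, which is exactly the point of the discussion that follows.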
The above template matching method, however, does not work well in all places. Some
problematic positions are marked in Figure 1.2. Template 1 (upper-left in Frame 1, on the
roof) contains an intensity edge and many blocks along the edge in the second frame seem
to match it almost equally well; as a result, only the motion perpendicular to the edge can
be reliably recovered. Template 2 belongs to the sky and is poorly textured. Where the
block matching process finds the best match in the next frame is largely determined by image
noise. Template 4 shows a similar problem. Its sky part is attached to the twigs and is
assigned the foreground motion (see Figure 1.1). These three cases illustrate the aperture
problem [58]: if we are examining motion only through a small aperture (region), the local
image information can be insufficient for uniquely determining the motion. The aperture
problem is the intrinsic difficulty of visual motion perception. Some of the ambiguity it
induces may be resolvable with appropriate contextual knowledge. For instance, human
viewers can recognize Templates 1 and 2 as part of the house and the sky, respectively, and
can associate their motions with the rest of the scene. There have been efforts to mimic
this ability, including adaptive template window selection [72] and flow propagation [58].
Nonetheless, the aperture problem is unavoidable in general; it always exists in the form of
spatially varying uncertainty. For such reasons, error analysis [52, 144] is an integral part
of optical flow estimation and will be addressed in this thesis.
Challenges in motion estimation also arise from assumption violations. One example
is given by Template 3. The correct motion of its center pixel is the motion of the front
tree trunk. But since the template also includes a part from the flower bed, which moves
differently, flow constancy no longer holds in this block and the outcome of the matching
process can be arbitrarily wrong. Motion discontinuities have received the most attention in
combating assumption violations, not only because they are abundant in real imagery, but
also because they often correspond to significant scene features, which could be of even greater
interest than the motion itself in some applications. The brightness conservation assumption
can also become invalid due to large image noise and illumination changes. To deal with
these problems, we may either adopt new models accommodating the abnormalities or
develop techniques that degrade gracefully even when violations are present. The latter is
indispensable because any assumption, being a simplification of a real-world phenomenon, will
potentially be violated. For this reason, devising methods robust to unmodelled events
has become a central issue in motion estimation as well as in the entire computer vision
community [50, 85, 51].
In more than two decades’ intensive research, optical flow estimation has been tackled
from different angles with variable success. Early studies [58, 82, 3] establish basic models
for brightness conservation and flow smoothness. Recent efforts [77, 15, 5, 97] emphasize
enhancing robustness against model violations and solving associated optimization prob-
lems. The following section is a glimpse of the broad area especially methods related to our
work. More literature review will be given in Chapter 2.
Overview of Related Work
Two main types of constraints are derived from the brightness conservation assumption:
matching-based constraints [3] and gradient-based constraints [58, 82]. Matching-based
constraints, as used in the template matching process, determine the motion vector by
trying a number of candidate positions and finding the one with the minimum matching
error. This method can handle large motion, but the search process can be computationally
expensive and yield poor sub-pixel accuracy [7, 14]. Gradient-based constraints are linear
approximations of matching-based constraints. By exploiting gradient information, they
can achieve much better efficiency and accuracy and hence have become the most popular
in practice. But relying on derivative computation makes their applicability more limited
[145].
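To make the linear approximation explicit (in notation that is standard in the literature, with $I_x$, $I_y$, $I_t$ denoting the spatial and temporal brightness derivatives), brightness conservation and its first-order Taylor expansion yield the gradient-based constraint:

```latex
% Brightness conservation: a point keeps its brightness as it moves,
%   I(x + u, y + v, t + 1) = I(x, y, t).
% Expanding the left-hand side to first order in the displacement,
\[
  I(x+u,\, y+v,\, t+1) \;\approx\; I(x,y,t) + I_x u + I_y v + I_t ,
\]
% so brightness conservation reduces to the linear optical flow constraint
\[
  I_x u + I_y v + I_t = 0 .
\]
```

Each pixel supplies one such linear equation in the two unknowns $(u, v)$, which is why constraints must be pooled over a neighborhood, and why the aperture problem appears whenever the pooled gradients are nearly parallel.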
Based on how flow smoothness is imposed, the approaches are further divided into
two types: local parametric and global optimization. Local parametric methods assume
that within a certain region the flow field is described by a parametric model [12]. The
simplest, yet one of the most popular models is the local constant model, as implied in
template matching. Local models usually involve simple computation and can achieve good
local accuracy [7, 39], but they degrade or fail when the model is inappropriate or the
local information becomes insufficient or unreliable. Global optimization methods cast
optical flow estimation in a regularization framework — every vector satisfies its brightness
constraint while maintaining coherence with its neighbors [58]. Because they propagate flow
between different parts of the flow field, such approaches are less sensitive to the aperture
problem, but for the same reason, they tend to oversmooth the flow field. Most popular
approaches are gradient-based. The best known classical techniques are perhaps the global
gradient-based method by Horn and Schunck [58] and the local gradient-based method by
Lucas and Kanade [82].
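As a concrete illustration, a minimal local gradient-based solver in the spirit of Lucas and Kanade can be sketched as follows. The central-difference derivatives, the plain frame difference for the temporal derivative, and the window size are illustrative simplifications, not choices taken from this thesis.

```python
import numpy as np

def lucas_kanade(frame1, frame2, x, y, half=7):
    """Local least-squares flow at (x, y): stack the gradient-based
    constraints Ix*u + Iy*v + It = 0 over a square window and solve
    the resulting overdetermined linear system."""
    f1 = frame1.astype(float)
    Iy, Ix = np.gradient(f1)               # spatial derivatives (rows = y)
    It = frame2.astype(float) - f1         # crude temporal derivative
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv  # least-squares estimate of (u, v)
```

The least-squares fit here is exactly the non-robust baseline: a single motion-boundary pixel in the window contributes constraints from the wrong motion, which is what robust estimators are designed to resist.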
Traditional techniques [7] usually require brightness conservation and flow smoothness
to be satisfied everywhere in the flow field. The restrictive assumptions result in smeared
motion discontinuities and high sensitivity to abrupt noise. As their limitations are widely
recognized, a large number of recent efforts have been devoted to increasing robustness
especially allowing motion discontinuities. For gradient-based local parametric methods,
various robust regression techniques such as robust clustering [113, 94], M-estimators [15,
109], high-breakdown robust estimators [5, 97, 145] are substituted for the traditional least-
squares estimator. They reduce the impact of model violations by fitting the structure
of the majority of the data. Global optimization approaches are reformulated in terms of
anisotropic diffusion [91, 19], Markov random fields with line process [88, 77, 57, 14], weak
continuity [20, 15], or robust statistics [116, 15, 86], among many others. These techniques
generally outperform their non-robust counterparts in terms of accuracy. But computational
complexity quickly becomes the new performance bottleneck. This is especially true for
global methods involving large-scale nonconvex optimization, which are considered the most
promising methods [111].
Why Low-level Approaches
This thesis is concerned with low-level approaches to optical flow estimation (in fact,
when people talk about optical flow methods, they normally refer to low-level methods).
“Low-level” means that only primitive image descriptors (intensity values) and weak as-
sumptions (piecewise spatiotemporal continuity) are exploited. Due to the small amount
of prior knowledge, the limitations of such approaches, for instance, in handling motion
discontinuities, are obvious and understandable.
The reader may then wonder why we do not use other channels of information or stronger
assumptions—it seems to make perfect sense to extract motion for each object separately.
Such ideas are compelling and have been exploited in a number of applications. Examples
include using color segmentation to assist motion boundary localization [129]; assuming the
motion field to be a mixture [68, 109], single/multiple rigid bodies [12] or layers [131]; and
explicitly modelling and tracking motion boundaries [86, 16]. Replacing the optical flow
representation of visual motion by an object-based representation has also been suggested
[118, 48, 101].
Nonetheless, low-level approaches continue to be extensively studied for good reasons
[83]. First of all, by making weaker assumptions, low-level methods are more general and
are applicable to different types of visual motion, for example, both rigid and nonrigid
motion. Secondly, low-level methods are indispensable building blocks, leading in a bottom-
up fashion to more complex motion analysis [118]; in fact, higher-level methods usually need
low-level methods in model selection [130], initialization and optimization procedures [68],
and advances in low-level research are applicable to them as well. Finally, there is still
plenty of room for improvement in low-level motion estimation, particularly in robustness
and error analysis. Due to compromises in formulations and solution methods, existing
techniques can fail even in ideal settings. As an example, many methods intended to preserve
motion discontinuities use gradient-based brightness constraints, which can break down at
discontinuities due to derivative evaluation failure. Error analysis of motion estimates is a
crucial task due to the inherent ambiguity in visual motion. Insufficient robustness
and error analysis in optical flow estimation are the major motivations of our research.
We have considered both local and global approaches to piecewise-smooth optical flow
estimation. The following two sections overview the main results and contributions of our
work.
1.2 A Local Method with Error Analysis
A Two-Stage-Robust Adaptive Scheme. Gradient-based optical flow estimation tech-
niques essentially consist of two stages: estimating derivatives, and organizing and solving
optical flow constraints (OFC). Both stages pool information over a certain neighborhood and
are regression procedures by nature. Least-squares (LS) solutions to the regression problems
break down in the presence of outliers such as motion boundaries. To cope with this
problem, a few robust regression tools [15, 86, 97, 5] have been introduced to the OFC
stage. However, as a very similar information pooling step, derivative calculation has sel-
dom received proper attention in optical flow estimation. Crude derivative estimators are
widely used; as a consequence, robust OFC (one-stage robust) methods still break down
near motion boundaries. Pointing out this limitation, we propose to calculate derivatives
from a robust facet model [146, 145]. To reduce the computational overhead, we carry out the
robust derivative stage adaptively according to a confidence measure of the flow estimate.
Preliminary experimental results show that the two-stage robust scheme permits correct
flow recovery even right at motion boundaries.
A Deterministic Algorithm for High-Breakdown Robust Regression. High-
breakdown criteria are employed in both of the above regression problems. They have no
closed-form solutions and past research has resorted to certain approximation schemes. So
far all applications of high-breakdown robust methods in visual reconstruction [121, 75, 5, 97,
117] have adopted a random-sampling algorithm given in [106]—the estimate with the best
criterion value is picked from a random pool of trial estimates. These methods uniformly
apply the algorithms to all pixels in an image regardless of the actual amount of outliers,
and suffer from heavy computation as well as unstable accuracy. By taking advantage
of the piecewise smoothness property of the visual field and the selection capability of
robust estimators, we propose a deterministic adaptive algorithm for high-breakdown local
parametric estimation. Starting from LS estimates, we iteratively choose neighbors’ values
as trial solutions and use robust criteria to adapt them to the local constraint. This method
provides an estimator whose complexity depends on the actual outlier contamination. It
inherits the merits of both LS and robust estimators and results in crisp boundaries as
well as smooth inner surfaces; it is also faster than algorithms based on random sampling.
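For reference, the random-sampling scheme of [106] that the cited applications adopt can be sketched on a toy line-fitting problem (an illustration of that baseline, not of our deterministic algorithm; the function name and trial count are our own choices):

```python
import numpy as np

def lmeds_line(x, y, n_trials=200, seed=0):
    """Least-median-of-squares line fit via random sampling:
    draw minimal 2-point subsets, fit a candidate line to each,
    and keep the candidate whose squared residuals have the
    smallest median (robust to up to ~50% outlier contamination)."""
    rng = np.random.default_rng(seed)
    best, best_med = (0.0, 0.0), np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue  # degenerate minimal sample, skip
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        med = np.median((y - (a * x + b)) ** 2)
        if med < best_med:
            best, best_med = (a, b), med
    return best
```

Because the pool of trial estimates is random and is drawn uniformly at every pixel, the cost and the quality of the result are independent of how many outliers are actually present, which is the inefficiency our deterministic scheme targets.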
Error Analysis Through Covariance Propagation. Due to the aperture problem
and outlying structures, an optical flow estimate generally has spatially varying reliability.
In order for subsequent applications to make judicious use of the results [34], error statistics
of the flow estimate have to be analyzed. In our earlier work [141], we have conducted error
analysis for the least-squares-based local estimation method using the covariance propaga-
tion theory for approximate linear systems and small errors. Here we generalize the results
to the newer robust method. Our analysis estimates image noise and derivative errors in an
adaptive fashion and takes into account the correlation of derivative errors at adjacent positions.
It is more complete, systematic and reliable than previous efforts.
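To illustrate the idea for the least-squares case, the flow covariance can be propagated from the constraint residuals roughly as follows (a simplified sketch that, unlike the full analysis, assumes i.i.d. temporal-derivative errors and ignores the correlated spatial-derivative errors):

```python
import numpy as np

def ls_flow_with_covariance(Ix, Iy, It):
    """Constant-model LS flow for one neighborhood plus a
    first-order covariance estimate: u = (A'A)^-1 A'b and
    Cov(u) ~ sigma^2 (A'A)^-1, with the noise variance sigma^2
    estimated from the fit residuals."""
    A = np.column_stack([Ix, Iy])
    b = -np.asarray(It)
    AtA = A.T @ A
    u = np.linalg.solve(AtA, A.T @ b)
    r = b - A @ u
    sigma2 = (r @ r) / max(len(b) - 2, 1)  # residual noise variance
    return u, sigma2 * np.linalg.inv(AtA)
```

The returned 2x2 covariance makes the aperture problem visible: an elongated gradient distribution yields a covariance that is large along the under-constrained direction.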
1.3 A Global Optimization Method
By drawing information from the entire visual field, the global optimization approach [58, 15]
to optical flow estimation is conceptually more effective in handling the aperture problem
and outliers than the local approach. But its actual performance has been somewhat
disappointing due to formulation defects and solution complexity. On one hand, approximate
formulations are frequently adopted for ease of computation, with the consequence that the
correct flow is unrecoverable even in ideal settings. On the other hand, more sophisticated
formulations typically involve large-scale nonconvex optimization problems, which are so
hard to solve that the practical accuracy might not be competitive with simpler methods.
The global optimization method we have developed is aimed at better solutions to both
problems.
From a Bayesian perspective, we assume the flow field prior distribution to be a Markov
random field (MRF) and formulate the optimal optical flow as the maximum a posteriori
(MAP) estimate, which is equivalent to the minimum of a robust global energy function.
The novelty in our formulation lies mainly in two aspects. 1) Three-frame matching is
proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2)
The strengths of brightness and smoothness errors in the global energy are automatically
balanced according to local data variation, and consequently parameter tuning is reduced.
These features enable our method to achieve a higher upper bound on accuracy than previous
algorithms.
In order to solve the resultant energy minimization problem, we develop a hierarchical
three-step graduated optimization strategy. Step I is a high-breakdown robust local gradient
method with a deterministic iterative implementation, which provides a high-quality initial
flow estimate. Step II is a global gradient-based formulation solved by Successive Over-
Relaxation (SOR), which efficiently improves the flow field coherence. Step III minimizes
the original energy by greedy propagation. It corrects gross errors introduced by derivative
evaluation and pyramid operations. In this process, merits are inherited and drawbacks are
largely avoided in all three steps. As a result, high accuracy is obtained both on and off
motion boundaries.
Performance of this technique is demonstrated on a number of standard test data sets.
On Barron’s synthetic data, which have become the benchmark since the publication of
[7], this method achieves the best accuracy among all low-level techniques. Close comparison
with the well-known dense regularization technique of Black and Anandan (BA) [14]
shows that our method yields uniformly higher accuracy in all experiments at a similar
computational cost.
Figure 1.3: Motion analysis for airborne video surveillance. (a) A typical frame. (b) Target marked. A tiny airplane is only observable by its distinct motion.
1.4 Motion-Based Target Detection and Tracking
In a visual surveillance project funded by the Boeing Company, we have investigated an
application of optical flow to airborne target detection and tracking. The greatest difficulty
in this problem lies in the extremely small target size, typically 2×1 to 3×3 pixels, which
makes results from most previous aerial visual surveillance studies inapplicable. Challenges
also arise from low image quality, substantial camera wobble and abundant background
clutter. A sample frame of the client data is given in Figure 1.3 together with a copy in
which the target is marked.
The proposed system consists of two components: a moving object detector identifies
objects by the statistical difference between their motions and the background motion, and
a Kalman filter tracks their dynamic behaviors in order to detect targets and update their
states. Both the detector and the tracker operate in a Bayesian mode and they each benefit
from the other’s accuracy. The system exhibits excellent performance in experiments. On
an 1800-frame real video clip with heavy clutter and a true target (1×2 to 3×3 pixels
in size), it produces no false targets and tracks the true target from the second frame with
average position error below 1 pixel. This probabilistic approach reduces parameter tuning
to a minimum. It also facilitates data fusion from different information channels.
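For illustration, the prediction-update cycle of a Kalman filter for a constant-velocity point target might be sketched as follows (the state layout and noise levels here are illustrative assumptions, not the system's actual design):

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal Kalman filter for a 2D target with state
    (x, y, vx, vy) under a constant-velocity motion model;
    only position is measured."""
    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt   # position += velocity * dt
        self.H = np.eye(2, 4)              # observe position only
        self.Q = q * np.eye(4)             # process noise
        self.R = r * np.eye(2)             # measurement noise
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with position measurement z = (x, y)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

In the full system the measurement comes from the motion-based detector, and the filter's predicted state in turn supplies the prior for the next detection.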
1.5 Thesis Outline
The first and major half of the dissertation is devoted to optical flow estimation, and the rest
describes the motion-based target detection and tracking system. To enhance visual motion
analysis robustness, which is the central issue in our study, statistical tools are extensively
explored at every stage. Given the diversity of the topics, previous work is summarized and
mathematical and statistical tools are introduced when the need arises.
Chapter 2 serves as a literature review on piecewise-smooth optical flow estimation.
Standard constraints derived from the brightness conservation and flow smoothness assump-
tions and common techniques such as hierarchical processing are described. Representative
methods, both classical and more robust ones, are discussed. The relative merits of different
approaches are important considerations in designing our methods.
Chapter 3 addresses the two-stage robust adaptive approach to local flow estimation
and its error analysis. Using the facet model, the popular local gradient-based approach is
reformulated as a two-stage regression problem. Appropriate robust estimators are identified
for both stages and the adaptive scheme is introduced. A deterministic algorithm for high-
breakdown robust regression in visual reconstruction is proposed, and its effectiveness is
demonstrated at the OFC solving stage. Error analysis carried out for the least-squares
version of the method is reviewed and then the results are generalized to the robust version.
Experimental results on both synthetic and real data are given in each of the above three
parts. Robust estimation is formally introduced in this chapter and it will be extensively
used in the rest of the thesis.
Chapter 4 discusses the global optimization approach to optical flow estimation. From
a Bayesian perspective, the maximum a posteriori (MAP) criterion is used with a Markov
random field (MRF) prior distribution to formulate optical flow estimation as minimiz-
ing a global energy function. The global energy is carefully designed to allow occlusions,
flow discontinuities and local adaptivity. Furthermore, a graduated deterministic solution
technique is developed for the minimization problem. It exploits the advantages of various
formulations and solution techniques for accuracy and efficiency. The theoretical and practi-
cal advantages of this method are illustrated by experimental results and comparisons with
other techniques on various synthetic and real image sequences. This chapter concludes by
pointing out contributions and future research directions along this line.
Chapter 5 presents the motion-based target detection and tracking system. It begins
by describing the Kalman-filter-based tracker. In doing so, the Bayesian state estimation
theory, which is also used in the detection phase, is explained. A hybrid motion estimator
is devised to locate independently moving objects. Its measurements are integrated with
priors from the previous tracking results, and then the detector can operate in a Bayesian
mode. Performance of this system is demonstrated on real airborne video.
Chapter 6 concludes this dissertation by summarizing the results, contributions and
future research avenues of each individual piece of our work.
Chapter 2
ESTIMATING OPTICAL FLOW: APPROACHES AND ISSUES
Optical flow estimation has long been an active research area in computer vision. Pio-
neering work on calculating image velocity for compressing TV signals [79, 28] dates back
to the mid 70’s. During the 80’s, the fundamental assumptions enabling optical flow es-
timation, namely, brightness conservation and flow field coherence, were examined from
different angles resulting in a large number of techniques, which are compared in the in-
fluential review articles by Barron, Fleet and Beauchemin [7, 10]. A drawback common to
many of these early techniques is that they usually require the assumptions to be satisfied in
a strict (least-squares) sense so that their performance degrades severely in the presence of
unmodelled events, especially motion discontinuities. As such limitations have been widely
recognized, the theme of optical flow research in recent years has shifted to enhancing the
robustness of classical approaches. Encouraging progress has been made along this line and
the estimation accuracy has been greatly improved. However, due to problems in formu-
lations and solution techniques, there still exists a considerable gap between the achieved
performance and what is desired in real-world applications. In addition, visual motion has
its intrinsic ambiguity, which cannot be resolved by any estimation method. This makes
reliable error analysis of optical flow estimates a crucial issue that needs to be addressed more
adequately. This unsatisfactory state of affairs continues to motivate investigation in the
area.
This chapter reviews piecewise-smooth optical flow estimation. We will describe typical
formulations, representative techniques and their relative merits. The purpose is not to give
a comprehensive literature review, which is beyond our scope, but to provide background
knowledge for understanding difficulties in this problem, major achievements of previous
work and motivations for our study. We organize this chapter as follows. The first two
sections discuss the modelling of brightness conservation and flow coherence respectively,
and Section 2.3 describes typical formulations resulting from combinations of these models.
Section 2.4 addresses challenges arising from modelling violations and efforts at ameliorating
these problems. Section 2.5 points out the inherent ambiguity of optical flow and introduces
previous work on error analysis. Finally, Section 2.6 explains the hierarchical process that
is widely employed to handle large motions.
2.1 Brightness Conservation
Let I(x, y, t) be the image intensity at a point (x, y) at time t. The brightness conservation
assumption can be expressed as
I(x, y, t) = I(x + δx, y + δy, t + δt)
= I(x + uδt, y + vδt, t + δt), (2.1)
where (δx, δy) is the spatial displacement during the time interval δt, and (u, v) is the optical
flow vector. This equation simply states that a point maintains its intensity value during
motion, or corresponding points in different frames have the same brightness.
Matching-based methods [3, 120] find the flow vector or displacement that yields the
best match between image regions in different frames. Best match can be defined in terms
of maximizing a similarity measure such as the normalized cross-correlation, or minimizing
a distance measure such as the sum-of-squares difference (SSD):
\[
E_B(u, v) = \sum_{(x,y)\in R} \left[ I(x, y, t) - I(x + u\delta t,\, y + v\delta t,\, t + \delta t) \right]^2, \tag{2.2}
\]
where EB designates the brightness conservation error, and R is the image region spanned
by the template.
Such matching criteria normally do not lead to closed-form solutions. In order to find
the best match, usually a set of displacements are hypothesized, and the one with the best
matching score is retained. This discrete exhaustive search process has poor efficiency and
often results in low subpixel accuracy. For this reason, gradient-based methods have gained
popularity in the optical flow estimation community.
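The exhaustive search over hypothesized displacements can be sketched as follows (a minimal illustration of Eq. 2.2; the template size and search range are arbitrary choices):

```python
import numpy as np

def ssd_match(tpl, frame2, x, y, search=4):
    """Exhaustive SSD block matching: slide the template taken
    around (x, y) in frame 1 over a search window in frame 2 and
    return the integer displacement (dx, dy) with the smallest
    sum-of-squared-differences."""
    h, w = tpl.shape
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            patch = frame2[y + dy:y + dy + h, x + dx:x + dx + w]
            if patch.shape != tpl.shape:
                continue  # window fell off the image
            err = np.sum((tpl - patch) ** 2)
            if err < best_err:
                best, best_err = (dx, dy), err
    return best
```

Note the double loop over the whole search window: the cost grows quadratically with the search radius and only integer displacements are tested, which is exactly the inefficiency and subpixel limitation discussed above.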
Gradient-based methods [58, 82, 53] make use of differential approximations of the
brightness constancy constraint Eq. 2.1. When the spatiotemporal image intensity I is
differentiable at the point (x, y, t), the right side of Eq. 2.1 can be expanded as Taylor
series, yielding
I(x, y, t) = I(x, y, t) + Ixuδt + Iyvδt + Itδt + ε,
where (Ix, Iy, It) is the image intensity gradient vector at the point (x, y, t), and ε represents
the higher-order terms. If the displacement (uδt, vδt) is infinitesimally small, ε becomes
negligible and the equation simplifies to the well known optical flow constraint equation
(OFCE) [58]
Ixu + Iyv + It = 0, (2.3)
which is a linear equation in the two unknowns u and v. Given n ≥ 2 pixels undergoing the
same 2D motion, their OFCEs can be grouped together and u, v can then be calculated through
linear regression.
Another way of obtaining additional constraints is to exploit second-order image deriva-
tives. Differentiating Eq. 2.3 with respect to x, y and t respectively gives three more equa-
tions:
Ixxu + Iyxv + Itx = 0
Ixyu + Iyyv + Ity = 0
Ixtu + Iytv + Itt = 0.
They can be used alone [7] or combined with the OFCE [53] to solve for (u, v).
The most distinct attraction of gradient-based constraints, compared with matching-
based constraints, is their ease of computation. The use of derivatives allows more efficient
exploration of the solution space and hence achieves lower complexity and higher floating-
point precision [7, 9]. However, it is important to point out that such advantages do come
with a price: the additional assumptions made in deriving the gradient-based constraints
dictate their more limited applicability. First of all, gradient-based constraints are valid
only for small displacements, which in practice means magnitudes below about 1-2 pixels/frame.
Secondly, in order for the higher-order terms to be negligible, the local image intensity function
should be close to a planar structure, which is also often violated. Finally, derivative es-
timation is a problematic process itself. Commonly used methods include neighborhood
differences [58], facet model fitting [145] and spatiotemporal filtering [119]. They all imply
constant optical flow in the neighborhood and therefore break down near motion bound-
aries. In fact, derivatives are low-level visual metrics just like optical flow, and thus their
computation also meets with difficulties produced by the aperture problem and assumption
violations [20, 13].
Maintaining high derivative quality, identifying unusable estimates and diagnosing fail-
ures are crucial to the robustness of (gradient-based) optical flow estimation. We address
these issues in developing both of our new techniques (Chapter 3 and Chapter 4).
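The first of the estimators listed above, plain neighborhood (central) differences, can be sketched as follows; the facet-model and filtering estimators replace these differences with local polynomial fits or filter responses (a minimal illustration):

```python
import numpy as np

def central_derivatives(I_prev, I, I_next):
    """Estimate Ix, Iy, It by central differences on a three-frame
    stack. Like all the estimators mentioned above, this implicitly
    assumes smooth intensity and constant flow over the
    neighborhood, so it degrades near motion boundaries."""
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[:, 1:-1] = (I[:, 2:] - I[:, :-2]) / 2.0   # horizontal difference
    Iy[1:-1, :] = (I[2:, :] - I[:-2, :]) / 2.0   # vertical difference
    It = (I_next - I_prev) / 2.0                 # temporal difference
    return Ix, Iy, It
```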
Frequency-based methods. Performing the Fourier transform on the brightness con-
stancy constraint Eq. 2.1 yields
I(ωx, ωy, ωt) = I(ωx, ωy, ωt)e−j(uδtωx+vδtωy+δtωt),
where I(ωx, ωy, ωt) is the Fourier transform of I(x, y, t) and ωx, ωy, ωt denote the spatiotemporal
frequencies. Clearly, for this equation to hold, the frequencies must satisfy
uωx + vωy + ωt = 0. (2.4)
This is the basic constraint for frequency-based approaches. It states that all nonzero
energy associated with a translating 2D pattern lies on a plane through the origin in the
frequency space, and the normal of the plane determines the optical flow vector. Frequency-based
approaches are often presented as biological models of human motion sensing. They
can handle cases that are difficult for matching approaches, e.g., the motion of random
dot patterns. But in most cases, they are close to the frequency-domain equivalents of
matching-based and gradient-based methods [10], and extracting the nonzero energy plane
usually involves heavy computation. As a consequence, they are not as popular as the other
two types of approaches.
2.2 Flow Field Coherence
For each pixel, the brightness conservation constraint (Eq. (2.1), (2.3) or (2.4)) provides one
equation in the two unknowns u and v. Additional constraints come from the flow field
coherence assumption, which means neighboring pixels experience consistent motion. Based
on how coherence is imposed, the approaches can be further divided into two major types,
local parametric and global optimization.
Local parametric methods assume that within a certain region the flow field is described
by a parametric model:
u(x) = u(x;p).
Here boldface letters denote column vectors: u = (u, v)T , x = (x, y)T , p is the vector of
model parameters. Common models include the constant model
\[
u(x; p) = \begin{pmatrix} u(x, y) \\ v(x, y) \end{pmatrix} = \begin{pmatrix} p_0 \\ p_1 \end{pmatrix},
\]
which holds at any location as the region size approaches zero; the affine model
\[
u(x; p) = \begin{pmatrix} p_0 + p_1 x + p_2 y \\ p_3 + p_4 x + p_5 y \end{pmatrix},
\]
which approximates the 2D motion of a remote 3D surface; and the quadratic model
\[
u(x; p) = \begin{pmatrix} p_0 + p_1 x + p_2 y + p_6 x^2 + p_7 xy \\ p_3 + p_4 x + p_5 y + p_6 xy + p_7 y^2 \end{pmatrix}, \tag{2.5}
\]
which describes the instantaneous 2D motion of a planar surface undergoing 3D rigid motion
(we will use this model in the airborne visual surveillance application in Chapter 5).
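The three models are straightforward to evaluate in code (a sketch; the parameter ordering follows the equations above):

```python
import numpy as np

def flow_from_params(model, p, x, y):
    """Evaluate the constant, affine or quadratic flow model of
    Eq. 2.5 at coordinates (x, y), with parameters ordered as in
    the text."""
    if model == "constant":
        u = np.full_like(x, p[0], dtype=float)
        v = np.full_like(y, p[1], dtype=float)
    elif model == "affine":
        u = p[0] + p[1] * x + p[2] * y
        v = p[3] + p[4] * x + p[5] * y
    elif model == "quadratic":
        u = p[0] + p[1] * x + p[2] * y + p[6] * x * x + p[7] * x * y
        v = p[3] + p[4] * x + p[5] * y + p[6] * x * y + p[7] * y * y
    else:
        raise ValueError(model)
    return u, v
```

Note that u and v share the parameters p6, p7 in the quadratic model, reflecting its derivation from rigid planar motion rather than an unconstrained second-order polynomial.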
Low-order polynomial flow models gain popularity from their clear physical meanings
and simple computation. But how to select a region appropriate for a given model and
how to choose models suitable for a given region are very complicated problems [130, 25].
The common practice of applying the same model uniformly to all image locations risks
under-fitting, over-fitting and compromises between different models, and usually results
in a flow field of highly uneven accuracy.
Global optimization methods can avoid the region selection problem to a certain extent.
Instead of assuming a rigid model for an entire region, they allow arbitrary local variations as
long as the flow field is smooth (almost) everywhere. Such a global smoothness assumption
usually leads to a regularization type of formulation. A classical technique of this kind is
due to Horn and Schunck [58]. They define the best optical flow field as the one minimizing
the overall OFCE error and local flow variation:
\[
\sum_s \left[ (I_{x_s} u_s + I_{y_s} v_s + I_{t_s})^2 + \lambda \big( (u_s - \bar{u}_s)^2 + (v_s - \bar{v}_s)^2 \big) \right]. \tag{2.6}
\]
Here s is a one-dimensional index of pixel locations (x, y), which traverses all pixel locations
in a progressive scan manner. The first quadratic term in the summation is the OFCE error
at location s; the second term requires minimal deviation between the flow vector (u_s, v_s)
and (\bar{u}_s, \bar{v}_s), the average over its neighbors i ∈ N_s. The constant λ is a tuning parameter which
controls the relative importance of data and flow variation.
Global optimization models deal with the aperture problem more effectively than local
parametric models by propagating flow estimates between different locations, but due to
the propagation, they tend to over-smooth the field. In addition, global models are sensitive
to the choice of the control parameter λ and their computation is more involved.
2.3 Typical Approaches
In principle, any of the above brightness conservation models and flow coherence models
can be paired up to derive a formulation for optical flow estimation. Among all possible
combinations, gradient-based local parametric, gradient-based global optimization and spa-
tiotemporal filtering approaches, especially the first two, have attracted the most attention
because of the good balance between their accuracy and complexity.
Gradient-based local parametrization
Combining gradient-based constraints and low-order polynomial flow models, one usually
arrives at a linear equation in the flow model parameter p:
Ap = b.
Particularly, using first-order constraints and the constant flow model, we have
\[
Au = b, \tag{2.7}
\]
\[
A = \begin{pmatrix} I_{x_1} & I_{y_1} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end{pmatrix}, \qquad
b = -\begin{pmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{pmatrix}.
\]
When A′A is nonsingular, the least-squares (LS) solution to the equation is
u = (A′A)−1A′b. (2.8)
Both sides of Eq. 2.7 can be multiplied by a window function W = diag[W1, . . . , Wn] to
assign heavier weights to certain constraints. The corresponding equation is WAu = Wb.
If the weights are absorbed by A and b: A ← WA,b ← Wb, the same LS solution Eq. 2.8
is obtained. Lucas and Kanade [82] employ an iterative version of the above algorithm
for stereo registration. Since they are probably the first to formalize this approach, the
(weighted) LS fit of local first-order constraints to a constant flow model is usually referred
to as the Lucas and Kanade technique, which we abbreviate as LK. This technique is
reported to be the most efficient and accurate, especially after confidence-based selection
[7].
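In code, the LK step for a single neighborhood reduces to a 2×2 linear solve (a minimal sketch; the conditioning check standing in for confidence-based selection is an illustrative choice):

```python
import numpy as np

def lucas_kanade_point(Ix, Iy, It, W=None):
    """Solve the stacked OFCEs Au = b (Eq. 2.7) in the
    least-squares sense, optionally after weighting both sides
    by a window function W as described above."""
    A = np.column_stack([np.ravel(Ix), np.ravel(Iy)])
    b = -np.ravel(It)
    if W is not None:
        w = np.ravel(W)
        A, b = A * w[:, None], b * w   # absorb W: A <- WA, b <- Wb
    AtA = A.T @ A
    if np.linalg.cond(AtA) > 1e8:      # aperture problem: unreliable
        return None
    return np.linalg.solve(AtA, A.T @ b)
```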
An early technique using second-order constraints is due to Haralick and Lee [53]. They
interpret the OFCE as the intersection line of the isocontour plane with a successive image
frame, calculate image derivatives from the facet model, and solve the first- and second-order
constraints at each pixel by singular value decomposition (SVD) [102].
Gradient-based global optimization
The seminal technique of this category by Horn and Schunck (HS) [58] was introduced in
the last section. They solve the constraint Eq. 2.6 for the flow field by iterative relaxation:
\[
u_s^n = \bar{u}_s^{\,n-1} - \frac{I_{x_s}\left( I_{x_s}\bar{u}_s^{\,n-1} + I_{y_s}\bar{v}_s^{\,n-1} + I_{t_s} \right)}{\lambda + I_{x_s}^2 + I_{y_s}^2}
\]
\[
v_s^n = \bar{v}_s^{\,n-1} - \frac{I_{y_s}\left( I_{x_s}\bar{u}_s^{\,n-1} + I_{y_s}\bar{v}_s^{\,n-1} + I_{t_s} \right)}{\lambda + I_{x_s}^2 + I_{y_s}^2}
\]
where n denotes the iteration number, (u0, v0) denote initial flow estimates (set to zero),
and λ is chosen empirically. Typically, flow fields obtained from this technique are visually
pleasing because of the smoothness, but their quantitative accuracy is not as good as that of
local gradient-based methods [7] due to over-smoothing and slow convergence of the relaxation
process.
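The relaxation above can be sketched as follows (a minimal illustration using a 4-neighbor average and Jacobi-style updates; the value of λ and the iteration count are illustrative):

```python
import numpy as np

def neighbor_average(f):
    """4-neighbor average with replicated borders."""
    g = np.pad(f, 1, mode="edge")
    return 0.25 * (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:])

def horn_schunck(Ix, Iy, It, lam=1.0, n_iter=100):
    """Iterate the pixelwise Horn-Schunck update: subtract the
    data-term correction from the neighbor average of the
    current flow field."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    den = lam + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        ubar, vbar = neighbor_average(u), neighbor_average(v)
        num = Ix * ubar + Iy * vbar + It
        u = ubar - Ix * num / den
        v = vbar - Iy * num / den
    return u, v
```

The slow convergence mentioned above is visible in the update itself: information spreads only one neighbor-average step per iteration, so propagating flow across a large untextured region requires many iterations.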
Spatiotemporal filtering
Movements in the spatiotemporal image volume, formed by stacking images in a se-
quence, induce structures with certain orientations. For example, the trace of a translating
point is a line whose direction in the volume directly corresponds to its velocity. Different
methods were proposed to extract the orientations including inertia tensor [66], hyperge-
ometric filters [140] and orientation tensors [35]. Since determining 2D velocity in the
frequency domain amounts to finding a nonzero energy plane (Section 2.1), the filtering
approach is also adopted by frequency-based methods [56].
A recent filtering method with good reported accuracy is due to Farneback [35]. He fits
data in an image neighborhood to a quadratic polynomial model I(x) = xT Ax + bTx + c,
derives an orientation tensor from the model parameters T = AAT + ηbbT , and finds the
flow vector by minimizing vT Tv. Here we temporarily adopt his notation x = (x, y, t)T ,
v = (u, v, 1)T /|(u, v, 1)T | for convenience of presentation.
It is not hard to see that this method closely resembles local gradient-based approaches:
tensor construction is equivalent to derivative (first- and second-order) calculation; solving
the homogeneous linear equation in the augmented flow vector vT Tv = 0 is equivalent
to solving the linear equation in the original vector (u, v)T Eq. 2.7. The efficiency of this
technique is mainly enabled by the intermediate step of tensor construction. Without the
intermediate step, or if a filter bank is used instead [56, 38], the computation can become
cumbersome and only discrete estimates can be obtained. This contrast is also similar to
that between gradient-based and matching-based approaches. The equivalence between
spatiotemporal filtering/frequency-domain approaches and certain matching-based/gradient-based
methods was pointed out previously [118, 7].
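For concreteness, minimizing vT Tv over v = (u, v, 1)T reduces to a 2×2 linear solve, as the following sketch shows (an illustration; the tensor here is built directly from gradient outer products rather than from Farneback's polynomial-fit construction):

```python
import numpy as np

def flow_from_tensor(T):
    """Given a 3x3 orientation tensor T, find (u, v) minimizing
    v'Tv with v = (u, v, 1)': setting the gradient with respect
    to (u, v) to zero gives (u, v)' = -T[:2,:2]^-1 T[:2,2]."""
    return -np.linalg.solve(T[:2, :2], T[:2, 2])
```

This is the step that makes the tensor method a close cousin of the local gradient approach: the upper-left 2×2 block of T plays the role of A'A in Eq. 2.7.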
Others
Block matching (SSD) methods can be used to find a pixel-accuracy displacement, and
a quadratic surface fitting of the neighboring matching errors can produce an estimate of
subpixel accuracy [126]. The techniques of Anandan [3] and Singh [120] initialize the flow
field using this method, and then employ some global smoothness constraints to propagate
flow from places of higher confidence to places of lower confidence. Matching-based ap-
proaches have better large motion handling capability than gradient-based approaches. But
the computational difficulties and poor subpixel accuracy make them less competitive in
optical flow estimation. For similar reasons, matching-based global optimization schemes
were attempted with very limited success [88, 14], and frequency-based global optimization
approaches have almost never been explored.
2.4 Robust Methods
Most early techniques, as described in the above three sections, require brightness
conservation and flow smoothness to be satisfied everywhere in the flow field. These restrictive
assumptions make them break down easily in reality, where model violations are abundant.
An obvious source of violation is motion discontinuity. Imposing flow smoothness in a region
containing multiple motion modes results in compromise between these modes and smeared
flow estimates. Such failure is not only detrimental to optical flow accuracy but also
obscures important geometric or physical properties of the scene. Violations of the brightness
constancy assumption also occur commonly in natural scenes. Conditions such as specular
reflections, shadows and illumination variations induce non-motion brightness changes. In
cases of transparency, due to the interaction of translucent reflective surfaces, the image
intensity of a single pixel can be a composite of multiple 3D points’ brightness values [14].
Examples include looking into a running creek and watching through a pane of glass. Ap-
plying simple brightness matching criteria in these situations does not produce meaningful
motion estimates. As the above limitations of traditional techniques [7] are widely recog-
nized, a large number of recent efforts have been devoted to increasing robustness against
assumption violations, especially to allowing motion discontinuities.
Explicit segmentation
Assuming motion boundaries coincide with intensity discontinuities and the former are
subsets of the latter, a number of researchers [17, 129] first segment the visual field using
image intensity, then compute parametric (e.g. affine) motion in each segment, and finally
group neighboring segments into regions of coherent motion. Such approaches experience
two problems. First, accurate image segmentation is itself very difficult. Second, the as-
sumed relationship between motion and intensity discontinuities is not necessarily correct.
Motion estimation and motion-based segmentation form a chicken-and-egg dilemma: the
motion estimator needs to know where motion boundaries are in order to avoid smooth-
ing across them, whereas the motion-based segmenter requires an accurate motion field in
order to divide the scene into regions of consistent motion. In an attempt to circumvent
this problem, motion estimation and segmentation have been carried out simultaneously.
The generic approach can be described as finding a segmentation of the flow field and the
motion (parameters) in each segment that minimizes the difference between the observed
and predicted image data [151]. Actual techniques differ by the employed flow models,
optimization criteria and solution methods.
Kanade and Okutomi [72] develop an adaptive window technique that adjusts the rect-
angular window size to minimize the uncertainty in the estimate. Schweitzer [114] devises
a recursive algorithm to split the motion field into square patches according to the minimal
encoding length criterion. These methods use rectangular division of the flow field and can-
not adapt to irregular motion boundaries. Wang and Adelson [134] assume that an image
region is modeled by a set of overlapping layers which can be irregularly shaped or even
transparent. They compute initial motion estimates using a least-squares approach within
image patches, then use K-means clustering to group motion estimates into regions of con-
sistent affine motion. Jepson and Black [68] use a probabilistic mixture model to explicitly
represent multiple motions within a patch, and use the EM algorithm to estimate parame-
ters for the fixed number of layers. Darrell and Pentland [33] and Sawhney and Ayer [109]
automatically determine the number of layers under a minimum description length (MDL)
encoding principle, which regards the most compact interpretation as the best among all
possibilities [114].
The explicit segmentation approach usually involves modeling the visual field as a col-
lection of (rigid) objects of certain parametric motion. Appropriately choosing the motion
models and the number of objects, especially in a dynamic situation, is very difficult [25, 130]
and can be impossible when nonrigid motions such as human movement and facial
expression are present. Furthermore, due to the extremely high dimension of the problem, how
to efficiently solve the associated numerical optimization problems remains a challenging
issue. In general, iterative methods are used, in which each updating step consists of
sequential estimation and segmentation of the motion field. The initial guess is also given by
a sequential method and its quality is crucial for convergence. For the above reasons, the
explicit segmentation approach is not suitable for general optical flow estimation and is not
pursued in this thesis.
Outlier-suppressed regression
A major cause of the failure of traditional gradient-based local parametric techniques is
the use of least-squares regression, which finds a compromise among all constraints and can
break down even in the presence of a single model outlier. To repair this problem, various
mechanisms have been attempted to reject outliers and fit the structure of the majority of
constraints.
It is sometimes possible to detect outliers by examining the residual of the least-squares
solution. After obtaining an initial least-squares estimate, Irani et al. [64] iteratively remove
outliers and recompute the least-squares solution. This process is still least-squares in
essence; it is sensitive to the initial quadratic estimate which may be arbitrarily bad. A
number of researchers (e.g. Fennema and Thompson [37], Schunck [113], Nesi et al. [94])
investigate robust clustering [70] based on the Hough transform. Such approaches have
better outlier resistance but are computationally very expensive. More success is achieved
by employing robust estimators, particularly, M-estimators [15, 109] and high-breakdown
robust estimators [5, 97, 145] in local optical flow constraint fitting. These estimators will
be formally introduced and compared in Section 3.1.1. Among these methods, the one
reporting the best accuracy is to first identify and reject outliers using high-breakdown
criteria and then estimate parameters from the remaining constraints [5, 97, 145].
The computational burden of high-breakdown robust estimators increases with the
amount of outlier contamination. Applying the same algorithm uniformly to the entire
flow field incurs excessive computation, since most places contain few outliers. We tackle
the efficiency problem with an adaptive algorithm (Section 3.2). Also, the limitations of
gradient-based and local regression approaches (Section 2.1, 2.2) remain regardless of the
regression technique. We will propose a matching-based global optimization formulation to
overcome such limitations (Section 4).
Discontinuity-preserving regularization
A significant amount of attention has been paid to reformulating the regularization prob-
lem to alleviate over-smoothing. Nagel and Enkelmann [91] suggest an oriented-smoothness
constraint in which smoothness is not imposed across steep intensity gradients (edges).
Their formulation differs from HS’s (Eq. 2.6) in that the terms (us, vs) are augmented by
functionals of local flow derivatives and first- and second-order image derivatives. Despite
the added complexity, this method yields similar experimental results to HS [7], which is not
surprising. On one hand, image discontinuities and flow discontinuities do not necessarily
overlap; reducing smoothing wherever the image gradient is large hurts flow propagation from area to area. On the other hand, in the vicinity of flow discontinuities where smoothing
needs to be stopped, image derivatives are of poor precision and do not serve as a reliable
indicator of occlusion.
Following Geman and Geman’s work on stochastic image restoration [42], Markov ran-
dom fields (MRF) formulations [88, 77, 57, 14] have become an important class of techniques
for coping with spatial discontinuities in optical flow estimation. An MRF is a distribution
of a random field in which the probability of a site having a particular value depends on its
neighbors’ values. The distribution of a piecewise-smooth field can be modeled by a dual
pair of MRFs, one representing the observed field values and the other representing the un-
observed discontinuities (line process), and then the best interpretation of the field can be
found as the one maximizing the a posteriori (MAP) probability. Utilizing the equivalence of
the MRF and the Gibbs distribution, the MAP formulation reduces to minimizing a regu-
larization energy, which is often solved by stochastic relaxation. Blake and Zisserman show
that similar formulations can be obtained by modeling piecewise smoothness using weak
continuity [20], and they tackle the optimization problem using a graduated non-convexity
(GNC) strategy. Their formulation is more compact with the elimination of the line process,
and their optimization strategy is more effective in practice than stochastic relaxation.
Shulman and Herve [116] first point out that spatial discontinuities can be treated as
outliers and they propose an approach based on Huber’s minimax estimator. This choice of
estimator leads to a convex optimization problem which is relatively easy to solve. Black
and Anandan propose a robust framework in which both brightness and flow smoothness
terms are modeled with robust estimators. They use redescending estimators which sup-
press outliers more effectively than convex estimators, and solve the optimization problem
by hierarchical continuation [15, 20]. Sim and Park adopt high-breakdown robust estimators
to achieve even more effective outlier rejection [117] than commonly adopted M-estimators
[116, 15, 86]. Black and Rangarajan [18] unify the line process and robust statistics per-
spectives and suggest the approach can benefit problem formulation and solution.
It is important to point out that, in refining optical flow formulations, computational
complexity increases rapidly with model sophistication. This is especially true for global
methods which usually involve large-scale nonconvex optimization problems. There are two
approaches to global optimization: stochastic and deterministic. Stochastic methods such
as simulated annealing [42] make updates probabilistically to avoid poor local minima and use a
temperature parameter to gradually dampen the randomness. They converge too slowly to
be practically useful [88, 77, 14, 15, 23]. Deterministic methods such as continuation [15]
and multigrid [86] assume a good initial flow estimate is available and make greedy updates
towards a local minimum. The procedure can be multi-stage, resembling the annealing
schedule. These methods have achieved more success in practice, but they have a limited
capability for avoiding local minima and their performance depends on the initialization
quality. Since global optimization is widely recognized as a powerful formulation technique
for inverse problems, and computing technology looks promising for solving the associated
numerical problems, developing global optimization algorithms has become a very hot topic
in computer vision [23, 137, 111].
Brightness conservation violations
Phenomena violating the brightness constancy assumption have only been studied to
a limited extent [10]. Transparency can be modeled by layered/mixture representations
[12, 134, 17, 33], which assign to each pixel a set of ownership weights indicating how
different surface layers contribute to the observed pixel brightness. Bergen et al. [12]
first consider the problem of extracting two motions, induced by either transparency or
motion discontinuity, from three image frames. They use an iterative algorithm to estimate
one motion, perform a nulling operation to remove the intensity pattern giving rise to the
motion, and then solve for the second motion. Variable illumination can be accommodated
by deriving more complex brightness conservation models such as the linear model (see
[116] for one example), or matching less illumination-sensitive image features such as phase
[38]. When violations comprise only a small fraction of the observations in an area, they can
be treated as outliers in a robust estimation framework [15]. Considering that most real
objects are opaque and global illumination variation is usually negligible during a small
interval, we adopt a robust estimation framework in our study.
2.5 Error Analysis
Despite steady progress on robust visual motion analysis, accurate optical flow estimates
are generally inaccessible. One reason is that, in making necessary assumptions to turn the
estimation problem into a well-posed problem, errors are inevitably introduced by assumption violations. Even under the (unrealistic) condition that no violations are encountered, the
estimate can have a large uncertainty due to the aperture problem [58]—brightness vari-
ation can be insufficient for uniquely determining the 2D velocity (Figure 2.1, also 1.2).
The aperture problem shows the intrinsic ambiguity in visual motion perception: optical
flow only approximates the projected image motion. In its most severe form, i.e., when the
image is completely textureless, recovering the projected motion is impossible; more gener-
ally, optical flow estimates in regions of more appropriate texture have higher confidence.
The sensitivity to assumption violations and the aperture problem varies from technique to
technique and from place to place in a visual field, and so does the uncertainty of the esti-
mated optical flow. If subsequent applications are to make judicious use of such a flow field
estimate, they must be equipped with certain error measurements indicating the uneven
reliability [52].
Figure 2.1: Aperture problem: local information in an aperture might be insufficient to determine the 2D motion vector. Each circle is an aperture. (1) corner: reliable estimate; (2) boundary: normal flow only; (3) homogeneous region: ambiguous; (4) highly textured region: multiple solutions (aliasing).
Extracting 2D velocity at a pixel requires exploiting a spatiotemporal image neighbor-
hood of that pixel. This fact introduces correlation between errors in nearby flow estimates.
Accounting for such error correlation, especially in a global formulation, is a daunting task
for both optical flow error analysis and subsequent applications; therefore it has seldom
been tackled. Most previous efforts seek to provide an error measure with each individual
estimate by analyzing error behaviors of local methods. Barron et al. compare and modify
a number of one-dimensional confidence measures and use them to select reliable optical
flow estimates [7, 10]. Since errors in optical flow estimates are in general directional and
anisotropic, a two-dimensional confidence measure, particularly the covariance matrix, is
more appropriate and informative.
Performance analysis in computer vision is often carried out with covariance propagation
[36, 52]. Haralick illustrates the derivation and application of covariance propagation theory
for a wide variety of vision algorithms [52]. A trivial case is to propagate additive random
perturbations through a linear system y = Tx with input x and output y, in which the
output covariance Σy can be expressed in terms of the input covariance Σx as Σy = TΣxT ′.
The solution to the constraint equation Eq. 2.7 under the least-squares criterion is optimal when the only error source is additive iid noise in b (the temporal derivatives) with zero mean and variance σ_b². Under this assumption, the above conclusion applies and the covariance of the optical flow estimate is simply

Σ_u = σ_b² (A′A)⁻¹     (2.9)

where σ_b² can be estimated from the residual errors r_i = I_{xi} u + I_{yi} v + I_{ti} as

σ_b² = (1/(n − 2)) Σ_{i=1}^{n} r_i².
The error analysis on a local matching-based method by Szeliski [126] and that on a local
spatiotemporal filtering method by Heeger [56] make similar assumptions and obtain similar
results.
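For concreteness, the covariance computation of Eq. 2.9 can be sketched in a few lines of NumPy (an illustrative aside; the function name and interface are ours, not part of the methods developed in this thesis):

```python
import numpy as np

def flow_covariance(A, b):
    """Least-squares flow estimate and its covariance under the Eq. 2.9
    model: b = A u + noise, with iid zero-mean noise (variance sigma_b^2)
    in b only, and A an n x 2 matrix of stacked constraints (n > 2)."""
    n = A.shape[0]
    AtA = A.T @ A
    u = np.linalg.solve(AtA, A.T @ b)          # LS velocity estimate
    r = A @ u - b                              # residual errors r_i
    sigma2_b = (r @ r) / (n - 2)               # noise variance estimate
    return u, sigma2_b * np.linalg.inv(AtA)    # Sigma_u = sigma_b^2 (A'A)^-1
```

The returned 2 × 2 matrix is exactly the directional, anisotropic confidence measure argued for above.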
The assumptions enabling the above derivation are apparently unrealistic because (i)
spatial derivatives in A also contain noise, and (ii) errors in derivatives are correlated due
to the overlapping data supports for their computation. Ignoring these factors makes the
velocity and covariance estimates biased. Efforts have been made to calculate unbiased
estimates using generalized least-squares [90, 96, 95]. However, these methods bring little
accuracy improvement at the cost of much heavier computation, because bias is a much
weaker error source than variance [27] and outliers [15] in optical flow estimation. More
details of the related work will be given in Section 3.3 to facilitate comparison with our
methods.
In our earlier work [141], we have conducted an error analysis for the least-squares
based local estimation method using the covariance propagation theory for approximate
linear systems and small errors. In this thesis, we generalize the results to the newer robust
method. Our analysis estimates image noise and derivative errors in an adaptive fashion,
taking into account correlation of derivative errors at adjacent positions. It is more complete,
systematic and reliable than previous efforts.
2.6 Hierarchical Processing
Recall that gradient-based constraints are valid only for small image motion; in practice
this typically means below 2 pixels/frame. While matching-based and frequency-based
formulations may cope with larger motion, the computational burden and chances of false
matches (aliasing) increase rapidly with the search range. A general way of circumventing
the large motion and aliasing problems is to adopt a hierarchical, coarse-to-fine strategy
[12, 14].
The basic idea of hierarchical processing is to construct a pyramid representation [26] of
an image sequence in which higher levels of the pyramid contain filtered and sub-sampled
versions of the original images. Going up the pyramid, the image resolution decreases and
the motion magnitude reduces proportionally. When a certain level of reduction is reached,
the motion becomes small enough for estimation. Computation then proceeds in a top-down fashion: at each level, the incremental flow is estimated and added to the initial value, and the total is projected down to the next lower level as its initial value. This process continues until the
flow in the original images is recovered. In what follows, we describe an implementation of
the hierarchical process that is used in our algorithms. Much of the recipe is adapted from
[14].
• Gaussian pyramid construction. We create a P-level image pyramid I^p, p = 0, . . . , P−1, with I^0 the original sequence. Each upper-level image sequence I^p is a smoothed and sub-sampled version of the sequence I^{p−1} one level below, expressed as

I^p(x/2, y/2) = (f ∗ I^{p−1})(x, y), ∀ x, y at level p−1

where f is a 3 × 3 Gaussian filter, “∗” represents convolution, and the resolution reduction rate is 2, which means each upper-level image is one-fourth the size of its ancestor.
• Flow projection with interpolation. Once the optical flow field V is available at level p, it is projected down to level p−1. The simplest projection scheme is “projection with duplication”: V^{p−1}(x, y) = 2V^p(⌊x/2⌋, ⌊y/2⌋), ∀ x, y at level p−1. To reduce the blocky effect, we use “projection with interpolation”:

u^{p−1}(2x, 2y) = 2u^p(x, y), ∀ x, y at level p,

u^{p−1}(x, y) = (1/4)[u^{p−1}(x−1, y−1) + u^{p−1}(x−1, y+1) + u^{p−1}(x+1, y−1) + u^{p−1}(x+1, y+1)], ∀ other x, y at level p−1.
construct image pyramid I^p, p = 0, . . . , P−1;
I_w^{P−1} ← I^{P−1};
for (p : P−1 → 0) {
    estimate residual flow Δu^p;
    current total flow: u^p ← u^p + Δu^p;
    stop if (p = 0);
    project flow u^p to level p−1, yielding u^{p−1};
    warp I^{p−1}, yielding I_w^{p−1};
}

Figure 2.2: Hierarchical processing
• Image warping. Given the flow field u^p that explains the motion from image I^p(t) to image I^p(t + dt), the image I^p(t + dt) can be warped to remove (compensate for) the motion such that the two images are almost aligned. Using “backward warping”, the stabilized version of I^p(t + dt) is defined as

I_w^p(x, y, t + dt) = I^p(x + u(x, y), y + v(x, y), t + dt).

Since (x + u(x, y), y + v(x, y)) usually does not fall on a regular grid, we use bilinear interpolation to estimate its intensity value. The warped images I_w^p(t), I_w^p(t + dt) exhibit only the residual motion Δu^p.

• Motion estimation. The residual motion is estimated from the warped sequence: Δu^p ← I_w^p(t), I_w^p(t + dt), and the overall motion at level p is the sum of the projected and residual motion: u^p ← u^p + Δu^p.
At the top level, the initial (projected) motion is assumed to be zero and the warped sequence is the same as the pyramid sequence: I_w^{P−1} ← I^{P−1}. Finally, the procedure of the hierarchical, coarse-to-fine framework is given in Figure 2.2.
Hierarchical schemes like the above have been used in a wide variety of motion estimation
algorithms but their limitations [9] are often overlooked: (i) the blind projection and warping
operations may extrapolate and interpolate across motion boundaries; (ii) in the top-down
fashion, errors produced in coarser levels are magnified and propagated to finer levels and
are generally irreversible [14]. Solving the first problem again brings up the estimation-
segmentation dilemma. To correct errors in coarser levels, certain multi-resolution schemes
are needed which can propagate results in a bottom-up fashion, too [102]. Since each
additional level of pyramid introduces new sources of errors, the number of levels should be
large enough to allow incremental flow estimation but no larger. Appropriately choosing
the number of levels is a difficult problem that has been addressed only to a very limited
extent [9]. Most current techniques including ours determine the number empirically.
Chapter 3
LOCAL FLOW ESTIMATION AND ERROR ANALYSIS
This chapter considers the problem of finding the most representative translation within
a small spatiotemporal image neighborhood and presents new algorithms to address the
involved accuracy, efficiency and uncertainty measuring issues. In particular, (i) the popular
local gradient-based approach is reformulated as a two-stage regression problem, appropriate
robust estimators are identified for both stages, and an adaptive scheme is introduced to
derivative evaluation to obtain sharp motion boundaries; (ii) a deterministic algorithm for
high-breakdown robust regression in visual reconstruction is proposed, and its effectiveness
is demonstrated at the optical flow constraint solving stage; and (iii) error analysis is carried out by covariance propagation; it accounts for spatially varying image noise and derivative errors and for correlation of derivative errors at adjacent positions, and provides a reliable
measure of the estimation uncertainty. This chapter is composed of three sections dedicated
to the above three topics respectively. Experimental results on both synthetic and real data
are given in each individual section.
3.1 A Two-Stage-Robust Adaptive Technique
The gradient-based local regression approach to optical flow estimation has become very
popular because of its good overall accuracy and efficiency. Despite various formulations,
methods of this type are generally composed of two stages: derivative estimation and optical flow constraint (OFC) solving. Both stages involve optimization by pooling information in a certain neighborhood and are, in nature, regression procedures. Classical techniques
solve both regression problems in a Least-Squares (LS) sense [7]. In places where the mo-
tion is multi-modal, their results can be arbitrarily bad. To cope with this problem, a few
robust regression tools such as M-Estimators [15, 86] and Least Median of Squares (LMedS)
estimators [5, 97] have been introduced to the OFC stage. By carefully analyzing the charac-
teristics of the optical flow constraints and comparing strengths and weaknesses of different
robust regression tools [138, 106, 105, 104], we identify the Least Trimmed Squares (LTS)
technique as more appropriate for the OFC stage.
Meanwhile, as a very similar information pooling step, derivative calculation has seldom
received proper attention in optical flow estimation. Crude (least-squares-based) estimators
are widely used with the hope that the derivative estimation error can be averaged out or
treated as outliers in the OFC regression stage. However, as illustrated in Figure 3.5, near
motion boundaries, derivative evaluation can completely fail and most of the constraints
become outliers; in such a situation, no matter what robust tool is employed, OFC regression
breaks down and motion boundaries cannot be preserved. Pointing out this limitation, we
use a 3D facet model to formulate derivative estimation as an explicit regression problem,
which can be robustified when the LS technique fails. We choose an LTS estimator for
robust facet model fitting. LTS is costly and it may yield less accurate estimates where
there are no outliers and LS suffices. Therefore, it should be applied only when necessary. We calculate a confidence measure for each estimate from the LTS OFC step,
and update the derivatives and the flow vector if the measure takes a small value. In this
way the one-stage and two-stage robust methods are carried out adaptively. Preliminary
experimental results show that this adaptive LTS scheme permits correct flow recovery even in the immediate vicinity of motion boundaries.
Below we provide details of the two-stage-robust adaptive scheme. We will start by introducing robust regression, which is the backbone of the proposed method and will be
extensively exploited in the rest of the thesis.
3.1.1 Linear Regression and Robustness
A linear regression model relates the output of a system y_i, i = 1, . . . , n to its m-dimensional input x_i = (x_{i1}, . . . , x_{im})^T by a linear transform with an additive noise term ξ_i, i.e.,

y_i = x_{i1}θ_1 + x_{i2}θ_2 + · · · + x_{im}θ_m + ξ_i,

or in a more compact form,

y_{n×1} = X_{n×m} θ_{m×1} + ξ_{n×1}.     (3.1)
With sufficient data points (X; y) collected (n ≫ m), the model parameters θ can be estimated by minimizing a scalar criterion function F(r):

θ̂ = argmin_θ F(r),

where r is the residual fitting error

r = y − ŷ = y − Xθ.
The criterion function F (r) differs among estimators depending on what error models are
assumed.
Least-squares estimator
The least-squares estimator uses a quadratic error function
F(r) = ‖r‖² = Σ_{i=1}^{n} r_i²

and has a closed-form solution

θ̂ = (X^T X)⁻¹ X^T y.
It is optimal only if X is error-free and ξi is iid Gaussian with zero mean and variance
σ2. When either condition has a significant violation, the least-squares estimate can be
completely disrupted.
There are two major types of significant model violations, or gross errors: those caused
by bad y values are called y-outliers and those caused by error in X are leverage points.
The performance of a regression estimator is usually characterized by its statistical efficiency
and breakdown point. Simply put, statistical efficiency indicates the accuracy (in terms of
estimate variance) when no gross error is present, and breakdown point is the smallest
fraction of contamination that can cause the estimator to take on values arbitrarily far from
the truth. These two factors are usually against each other. A good regression tool should
have both factors high. The reason for the poor accuracy of least-squares in many situations
is its 0% breakdown point, which means that a single outlier can lead to arbitrarily wrong
estimates. The goal of robust regression is thus to develop regression tools that are relatively
insensitive to gross errors while maintaining sufficiently high statistical efficiency [106, 138].
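A tiny numerical experiment (ours, on synthetic line-fitting data) makes the 0% breakdown point of least squares concrete: one corrupted observation moves the LS fit arbitrarily far, while a median-based fit is unaffected:

```python
import numpy as np

# Fit y = theta * x by least squares on clean data, then corrupt one point.
x = np.linspace(1.0, 10.0, 20)
y = 2.0 * x                                   # true slope: theta = 2
theta_clean = (x @ y) / (x @ x)               # LS slope on clean data: 2

y_bad = y.copy()
y_bad[0] += 1e6                               # a single gross y-outlier
theta_bad = (x @ y_bad) / (x @ x)             # LS slope is dragged far away

# A median-based fit (50% breakdown) is unaffected by the single outlier.
theta_med = np.median(y_bad / x)
```

Here `theta_bad` grows without bound as the corruption grows, which is precisely the sense in which the LS breakdown point is 0%.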
M-estimators
An M-estimator uses the criterion function
F(r) = Σ_{i=1}^{n} ρ(r_i, σ_i)
where the σi are scale parameters for the ρ-function. It includes the least-squares estimator
as a special case with ρ being an L2 (quadratic) error norm. The impact of each datum
on the overall solution is measured by the influence function: ψ(x, σ) = ∂ρ(x, σ)/∂x. The
least-squares estimator has ψLS(x, σ) = 2x/σ2, which allows an outlier to introduce infinite
bias to the estimate. One way to reduce outlier influence is to adopt a less drastic error
norm, e.g., the Geman-McClure error norm
ρ_GM(x, σ) = x²/(x² + σ²)

[43]. Its ρ and ψ curves are compared to those of the L2 norm in Figure 3.1. The Geman-
McClure error norm saturates at 1 as the error increases. Its ψ function is bounded and
redescending—the influence of small errors is almost linear while that of abnormally large
ones tends to zero. Finding an M-estimate is a nonlinear minimization problem. It is usually
solved by iterated reweighted least-squares
θ^{(k)} = argmin_θ Σ_{i=1}^{n} w(r_i^{(k−1)}) r_i²
where the superscript (k) designates the iteration number, the weight function is defined by
w(x) = ψ(x)/x, and ri is the residual evaluated with the current estimate.
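As an illustration (our sketch, not the implementation used in this thesis), IRLS with the Geman-McClure weight w(x) = 2σ²/(x² + σ²)² can be written as:

```python
import numpy as np

def gm_weight(r, sigma):
    """w(x) = psi(x)/x for the Geman-McClure norm rho(x) = x^2/(x^2 + sigma^2)."""
    return 2.0 * sigma**2 / (r**2 + sigma**2)**2

def irls(X, y, sigma=1.0, iters=50):
    """M-estimate of theta in y = X theta + xi by iterated reweighted
    least squares, initialized at the ordinary LS solution."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        w = gm_weight(y - X @ theta, sigma)    # weights from current residuals
        Xw = X.T * w                           # residual-weighted X^T
        theta = np.linalg.solve(Xw @ X, Xw @ y)
    return theta
```

Because the ρ-function is non-convex, the result depends on the initial guess; this is exactly the initial-guess sensitivity discussed next.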
M-estimators are resistant to y-outliers and have relatively high statistical efficiency,
but they meet with computational difficulties such as initial guess dependency and non-convexity (for redescending estimators), have a low breakdown point (about 1/(m + 1)), and are
vulnerable to leverage points [106, 138].
High-breakdown robust estimators
Two popular high-breakdown robust estimators are the least-median-of-squares (LMedS)
estimator and the least-trimmed-squares (LTS) estimator [106]. The LMedS estimator
θ̂ = argmin_θ med_{i=1}^{n} r_i²     (3.2)
Figure 3.1: Comparison of the Geman-McClure norm (solid line) and the L2 norm (dashed line): (a) error norms ρ(x, σ) with σ = 1, (b) influence functions ψ(x, σ) = ρ′(x, σ).
overcomes most limitations of M-estimators: it is resistant to both types of gross errors,
has a breakdown point as high as 50%, does not need an initial guess, and is guaranteed
to converge. However it has extremely low statistical efficiency, which means that it tends
to have very large estimation variances when no gross error is present. The LTS estimator
was introduced to repair the low efficiency of LMedS. It is defined as
θ̂ = argmin_θ Σ_{i=1}^{h} (r²)_{i:n}     (3.3)
where h < n and (r²)_{1:n} ≤ · · · ≤ (r²)_{n:n} are the ordered squared residuals. LTS allows the
fit to stay away from the gross errors by excluding the largest squared residuals from the
summation. Possessing almost all the merits of LMedS and better statistical efficiency, LTS is considered preferable to LMedS [104, 103].
High-breakdown estimators usually do not have closed-form solutions and are approximated by Monte Carlo-like algorithms [106]. A trial solution pool is constructed by p random draws from the C_n^m possible m-subsets, each draw yielding an exact solution and a corresponding criterion value; the one with the minimum value is picked as the solution. The value p is chosen so that the probability of having at least one good (outlier-free) subset,

1 − (1 − (1 − ε)^m)^p,     (3.4)
where ε is the fraction of outliers (up to 50%), is close to 1. The randomness in the solution
is obvious especially when p is chosen small. A subsequent weighted least-squares (WLS) is
recommended to enhance the statistical efficiency. In particular, a preliminary error scale is
defined as σ = C√F(r), where C makes σ roughly unbiased under a Gaussian error distribution [105]; then regression outliers with |r_i/σ| > 2.5 are removed. Finally a WLS estimate
is calculated from inliers as
θ̂ = argmin_θ Σ_{i=1}^{n} w_i r_i²     (3.5)
and a more efficient scale estimate is given by the sample variance of inliers.
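The random-sampling recipe, including the refinement step, might be sketched as follows (our illustration; the scale constant C is omitted, so the preliminary scale is only a crude stand-in, and the final refit uses 0/1 weights in Eq. 3.5):

```python
import numpy as np

def lts_fit(X, y, h=None, trials=200, seed=0):
    """Approximate least-trimmed-squares (Eq. 3.3) by random m-subset
    sampling, followed by outlier rejection and a least-squares refit.
    'trials' plays the role of p in Eq. 3.4."""
    n, m = X.shape
    h = h or n // 2 + 1
    rng = np.random.default_rng(seed)
    best, best_q = None, np.inf
    for _ in range(trials):
        idx = rng.choice(n, size=m, replace=False)
        try:
            theta = np.linalg.solve(X[idx], y[idx])   # exact fit to an m-subset
        except np.linalg.LinAlgError:
            continue                                  # degenerate draw
        q = np.sort((y - X @ theta) ** 2)[:h].sum()   # trimmed criterion
        if q < best_q:
            best, best_q = theta, q
    # refinement: drop constraints with |r/sigma| > 2.5, refit on inliers
    sigma = np.sqrt(best_q / h) + 1e-12               # crude scale (C omitted)
    inliers = np.abs(y - X @ best) <= 2.5 * sigma
    return np.linalg.lstsq(X[inliers], y[inliers], rcond=None)[0]
```

With 30% gross contamination the trimmed criterion still selects an m-subset of inliers with high probability, and the refit restores statistical efficiency.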
According to the above recipe, LTS takes slightly longer time to compute than LMedS
since finding the smallest n/2 numbers is more costly than finding the median of n numbers.
A new algorithm for approximating LTS, the so-called FAST-LTS, has been introduced recently,
which runs faster than all programs for LMedS and makes LTS the preferred choice of high-
breakdown robust estimator. What enables FAST-LTS is the concentration property of
LTS: starting from any approximate LTS estimate θold and its associated criterion value
Qold, it is possible to compute another approximation θnew yielding an even lower criterion
value Qnew [104]. In algorithmic terms, the C-step can be described as follows.
Given the h-subset Hold then:
• compute θold ← LS estimate from Hold
• compute the residuals rold(i) for i = 1, . . . , n
• sort the absolute values of these residuals, which yields a permutation π for which
|r_old(π(1))| ≤ |r_old(π(2))| ≤ . . . ≤ |r_old(π(n))|
• put Hnew ← {π(1), π(2), . . . , π(h)}
• compute θnew ← LS estimate from Hnew
The C-step can iterate until convergence. It speeds up LTS computation by providing
a more efficient way of selecting trial solutions than random sampling.
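A minimal rendering of the C-step and its use in a FAST-LTS-style search (our sketch; the published algorithm starts its C-steps from m-subsets and adds further refinements that we omit here):

```python
import numpy as np

def c_step(X, y, H):
    """One concentration step: LS fit on the current h-subset H, then keep
    the h observations with the smallest absolute residuals."""
    theta = np.linalg.lstsq(X[H], y[H], rcond=None)[0]
    H_new = np.argsort(np.abs(y - X @ theta))[:len(H)]
    return theta, H_new

def fast_lts(X, y, h, starts=20, seed=0):
    """Iterate C-steps from random h-subsets; the trimmed criterion never
    increases along C-steps, so each start converges in a few iterations."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    best, best_q = None, np.inf
    for _ in range(starts):
        H = rng.choice(n, size=h, replace=False)
        for _ in range(50):
            theta, H_new = c_step(X, y, H)
            if set(H_new) == set(H):                  # h-subset is stable
                break
            H = H_new
        q = np.sort((y - X @ theta) ** 2)[:h].sum()
        if q < best_q:
            best, best_q = theta, q
    return best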
Estimator | Criterion F(r)               | Statistical Efficiency | Breakdown Point | Y-Outliers | Leverage Points | Solution Technique
LS        | Σ_{i=1}^{n} r_i²             | High                   | 0%              | No         | No              | Closed-form
M         | Σ_{i=1}^{n} ρ(r_i)           | High                   | 100/(1+m)%      | Yes        | No              | Approximate
LMedS     | med_{i=1}^{n} r_i²           | Low                    | 50%             | Yes        | Yes             | Approximate
LTS       | Σ_{i=1}^{n/2} (r²)_{i:n} (a) | Low                    | 50%             | Yes        | Yes             | Approximate

(a) (r²)_{1:n} ≤ · · · ≤ (r²)_{n:n}: ordered squared residuals

Table 3.1: Comparison of four popular regression criteria (estimators)
To summarize the above discussion, properties of four popular estimators are given in
Table 3.1 for a regression problem of n equations and m unknowns.
3.1.2 Two-Stage Regression Model
In this section we show that both derivative estimation and optical flow constraint solving
stages in the gradient-based local approach can be formulated as linear regression problems.
Optical flow constraint
Following Haralick and Lee [53], we constrain the optical flow vector u = (u, v)T at
location (x, y, t)T by
Au + ξ = b (3.6)
where
A = ⎡ I_x    I_y  ⎤          ⎡ I_t  ⎤
    ⎢ I_xx   I_xy ⎥          ⎢ I_xt ⎥
    ⎢ I_yx   I_yy ⎥ ,  b = − ⎢ I_yt ⎥ .
    ⎣ I_tx   I_ty ⎦          ⎣ I_tt ⎦
We further assume that the flow vectors in each small neighborhood of N pixels are constant,
and hence each vector u conforms to N sets of constraints simultaneously. This constitutes
our optical flow constraint [144]: a linear regression model
A_s u + ξ = b_s     (3.7)
where A_s = (A′_1, A′_2, . . . , A′_N)′, b_s = (b′_1, b′_2, . . . , b′_N)′, and each pair A_i, b_i are the A, b defined by Eq. (3.6) at pixel i, i = 1, . . . , N. In our experiment, we choose the constant
flow neighborhood size to be 5× 5, so N = 25.
Compared to the first-order constraint Eq. 2.7, this mixed-order constraint has the
advantage that a large number of equations are provided on a small data support (100
equations in a 9 × 9 × 5 neighborhood). Such compactness is desirable because a smaller
neighborhood size means less chance of encountering multiple motions, and a larger sample
size brings higher statistical efficiency. Although second-order constraints alone are often
avoided due to derivative quality concerns [7], we argue that they are beneficial when used
together with first-order constraints under a robust criterion, because (i) they are automatically assigned smaller weights, due to the fact that second-order derivatives normally take much
smaller values than first-order derivatives in real imagery, and (ii) outliers among them can
be ignored under the robust criterion. In addition, experiments show that second-order
derivatives of reasonable accuracy can be obtained from the facet model.
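To make the construction concrete, the stacked mixed-order system can be assembled from precomputed derivative images as below (our sketch; the dictionary keys are our naming convention, and the plain least-squares solve shown in the comment is only a placeholder for the robust LTS.LS solver of Section 3.1.3):

```python
import numpy as np

def ofc_system(D, x, y, half=2):
    """Stack the mixed-order constraints Au + xi = b (Eq. 3.6) over the
    (2*half+1)^2 constant-flow neighborhood centered at (x, y).
    D maps derivative names ('x', 'y', 't', 'xx', 'xy', 'yy', 'xt', 'yt',
    'tt') to images; I_yx = I_xy and I_tx = I_xt by symmetry of mixed
    partials."""
    rows_A, rows_b = [], []
    for j in range(y - half, y + half + 1):
        for i in range(x - half, x + half + 1):
            rows_A += [[D['x'][j, i],  D['y'][j, i]],
                       [D['xx'][j, i], D['xy'][j, i]],
                       [D['xy'][j, i], D['yy'][j, i]],
                       [D['xt'][j, i], D['yt'][j, i]]]
            rows_b += [-D['t'][j, i], -D['xt'][j, i],
                       -D['yt'][j, i], -D['tt'][j, i]]
    return np.asarray(rows_A), np.asarray(rows_b)

# A plain least-squares solve stands in here for the robust LTS.LS solver:
#   u_hat = np.linalg.lstsq(*ofc_system(D, x, y), rcond=None)[0]
```

For a 5 × 5 neighborhood this yields the 100 × 2 system described above.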
Derivatives From the Facet Model
The facet model characterizes each small image data neighborhood by a signal model
and a noise model [54]. Low-order polynomials are the most commonly used signal form.
We use a 3D cubic polynomial for derivative estimation [143]. Here “3D” means that the
polynomial is in the spatiotemporal variables (x, y, t); “cubic” means that the highest
order of a term is 3. The facet model finds the polynomial coefficient vector a from the
linear regression model
Da + ξ = J (3.8)
where J is the observed image data vector (formed by traversing the neighborhood data
lexicographically), and D is the design matrix composed of 20 canonical polynomial bases
(1, x, y, t, x2, . . . , xyt). We use the facet model neighborhood size 5 × 5 × 5, so D has
dimension 125 × 20. Once a is found, the spatiotemporal derivatives are merely scaled
versions of its elements. More details about derivatives from the facet model can be found
in our earlier work [143, 141, 146].
Most popular derivative estimators in optical flow estimation are neighborhood masks.
They essentially come from facet models [54] of different dimension (1D, 2D or 3D), order
(1st, 2nd or 3rd) or neighborhood size (2, 3 or 5). For example, the four-point central
difference mask (−1, 8, 0,−8, 1)/12 that Barron et al. use [7] is actually a 1D cubic facet
model on a neighborhood of 5 pixels [146]. Our facet model outperforms it on most image
sequences.
3.1.3 Choosing Estimators
In this section, we analyze the characteristics of the two regression problems and identify
appropriate regression estimators for them.
Solving OFC by LTS.LS
We observe that (i) both y-outliers and leverage points can happen in Eq.(3.7) because
both As and bs are composed of derivative estimates; (ii) leverage points are roughly twice
as likely as y-outliers due to the size contrast of As and bs; (iii) a significant portion of
the constraints can be gross errors, when, for example, multiple motion models happen in a
neighborhood; and (iv) the number of constraints is relatively small. Therefore the desired
estimator for the OFC stage should be resistant to both types of gross errors and have a
high breakdown point and good statistical efficiency on a small sample size.
M-estimators [15, 86] and LMedS estimators [97, 5] were previously used at the OFC-
stage. M-estimators are resistant to y-outliers and have relatively high statistical efficiency,
but they have a low breakdown point of about 1/(1+m) and are vulnerable to leverage points. The
LMedS estimator [105] is resistant to both types of gross errors and has a high breakdown
point of 50%, but it has extremely low statistical efficiency, which means it tends to perform
poorly when there is no gross error (Table 3.1).
Possessing almost all the merits of LMedS along with better statistical efficiency, LTS is preferred to
LMedS [138, 103, 104]. We use least-trimmed-squares followed by (weighted) least-squares
to solve the optical flow constraint, and call the procedure “LTS.LS”.
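A minimal sketch of the LTS.LS procedure follows. LTS is approximated by random two-point elemental subsets (the trial count here is an arbitrary illustration, not the Eq. 3.4 value), and the inlier R² of the final fit is returned as the confidence measure used later in the text:

```python
import numpy as np

def lts_ls_flow(A, b, h=None, n_trials=200, seed=0):
    """LTS.LS sketch: least-trimmed-squares over random 2-point elemental
    subsets of the OFC system A @ (u, v) = b, then an LS refit on the h
    retained inliers; returns the estimate and its inlier R^2 confidence."""
    rng = np.random.default_rng(seed)
    n = len(b)
    h = h or (n // 2 + 1)            # trimming constant: keep just over half
    best_score, best_v = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=2, replace=False)
        try:
            v = np.linalg.solve(A[idx], b[idx])
        except np.linalg.LinAlgError:
            continue
        score = np.sort((A @ v - b) ** 2)[:h].sum()   # trimmed sum of squares
        if score < best_score:
            best_score, best_v = score, v
    # LS refit on the h constraints most consistent with the LTS winner
    inliers = np.argsort((A @ best_v - b) ** 2)[:h]
    v, *_ = np.linalg.lstsq(A[inliers], b[inliers], rcond=None)
    res = A[inliers] @ v - b[inliers]
    R2 = 1.0 - (res @ res) / (b[inliers] @ b[inliers])  # inlier R^2
    return v, R2
```

With 30% gross errors in b, the refit still recovers the clean-data solution, whereas plain LS would be pulled off by the contaminated rows.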
LS or LTS: Adaptive Derivative Estimation
By default we solve the 3D cubic facet model in an LS sense to find the derivatives. When
the estimation quality is poor, we update the derivatives from robust facet model fitting.
To reduce computation and prevent over-fitting, we use a 3D quadratic facet model for this
purpose. As the dimension of the parameter vector a is as large as 10, and the breakdown
point has to be high, the LTS estimator is again a better choice than M- and LMedS
estimators. Unlike the OFC stage, where we estimate two parameters out of 100 constraint
equations, in this stage, there are 10 parameters but only 125 constraint equations. With a
rather small sample size, WLS can hardly improve results of LTS. So we decide to use LTS
for robust facet fitting.
Note that it may not be best to apply LTS facet model fitting uniformly, because
LTS tends to have lower statistical efficiency than LS when there is no gross error, and
it involves much more computation. Therefore the LTS facet model should be used only
when estimation fails because of poor LS facet quality. We take the coefficient of
determination (R2) [105] from the LTS.LS OFC step as a confidence measure of the flow
estimate. R2 measures the proportion of observation variability explained by the regression
model. Here it is defined as
R² = 1 − (Σ_{i∈inliers} r_i²) / (Σ_{i∈inliers} y_i²).
We detect poor flow estimates as those having R2 < T , and try robust facet model fitting
to improve them.
It is worth mentioning that our OFC stage and that of Bab-Hadiashar and Suter use a
similar local optimization formulation, the difference being that they use LMedS while we
use LTS as the regression tool. Both of us detect bad estimates with low R2 values but we
treat them very differently. They remove them as unreliable, whereas we apply a two-stage
LTS to improve their accuracy.
Finally, the diagram of the proposed algorithm is given in Figure 3.2.
3.1.4 Experiments and Analysis
We demonstrate on both synthetic and real data how optical flow accuracy improves as the
method upgrades from purely LS-based (LS-LS) to one-stage robust (LS-LTS.LS) to two-
stage robust (LTS-LTS.LS). We also compare our results with those from Bab-Hadiashar
and Suter’s technique (BS) [5] which applies LMedS to the OFC stage. The results were
Figure 3.2: Block diagram of the two-stage-robust adaptive algorithm (image data → LS facet derivatives → OFC → optical flow and confidence; low-confidence estimates are recomputed via robust facet fitting)
computed using their own C program, all parameters set as default. For fair comparison,
the facet and OFC neighborhood sizes are fixed to 5 pixels for both techniques.
An illustrative example
We first use the synthetic data set in Figure 3.3 to demonstrate the necessity of robust
regression in both stages. The image size is 32 × 32. The motions of the left and right halves are
vertical and horizontal respectively, both at 1 pixel/frame. Since an optical flow constraint
equation forms a line au + bv + c = 0 in the (u, v) plane, with its distance from the true
velocity an indicator of the degree of modeling imperfection, we use OFC cluster plots to
visualize derivative quality and the results of different estimators. Three typical points, (5, 5),
(5, 20) and (14, 17) in Figures 3.3 and 3.4, are closely examined. Their true velocities are marked
by black dots in Figure 3.5.
(5, 5) is a point where most derivatives are of good quality, as we can tell from the
nice OFC cluster at the true velocity (Figure 3.5(a)). However, even in this favorable
case, LS-LS yields only (0.9734, 0.0015) while LS-LTS.LS yields (numerically) exactly (1,
0). The 9 × 9 × 5 data support of point (5, 20) has one ninth of its data conveying the left motion mode.
Accordingly we observe a clear cluster at the true velocity and a small vague cluster at the
left velocity (Figure 3.5(b)). LS is totally lost in this case, yielding a compromise of (0.5933,
0.518), as opposed to LS-LTS.LS, which gives (-0.0051, 0.9913). These two cases suggest that
LS-LTS.LS significantly outperforms LS-LS at the OFC stage.
In the above cases, the facet model fitting errors can be accommodated by robust OFC.
Figure 3.3: Central frame of the synthetic sequence (5 frames, 32 × 32)
Figure 3.4: Correct flow field
But this is not the case with (14, 17), a boundary point on the right side. Figure 3.5(c)
shows constraint lines scattering around, with two very vague clusters at (0, 1) and (1, 0).
Estimates from LS-LS (-0.3937, 0.2482) and LS-LTS.LS (0.0708, 0.1267) are both totally
wrong. Here applying robust regression at the OFC stage alone no longer helps.
The reason is that derivative estimation at most points fails and a large portion of the
constraints become gross errors, so that the major optical flow constraint model does not
exist. Figure 3.5(d) shows the OFC plot from the robust facet model fitting. The ma-
jor motion model becomes clear so that LTS.LS yields a reasonably accurate estimate of
(0.0109, 1.0000).
Translating Squares Sequence (TS)
Figures 3.6(a) and 3.6(b) show the central frame and the correct flow field of another synthetic
sequence Translating Squares (TS). It contains two squares translating at 1 pixel/frame. The
image size is 64× 64. LTS-LTS.LS is applied at places with R2 < 0.99.
We calculate the error percentage as the quantitative accuracy measure. It is the error
vector magnitude normalized by the true velocity magnitude and multiplied by 100. We
report the average error percentages on the entire flow field (AEP) as well as those measured
(a) (5, 5): LS facet (b) (5, 20): LS facet
(c) (14, 17): LS facet (d) (14, 17): LTS facet
Figure 3.5: OFC cluster plots at three typical pixels. Each line represents a constraint equation. (5, 5): good derivative quality; (5, 20): a small number of bad derivatives; (14, 17): on a motion boundary, most derivative estimates are bad and robust facet fitting becomes necessary.
(a) Central frame (b) Correct flow (c) BS
(d) LS-LS (e) LS-LTS.LS (f) LTS-LTS.LS (g) R2 map
Figure 3.6: TS sequence results. Flow field estimates are subsampled by 2. Estimates with error percentages larger than 0.1% are shaded.
Technique AEP(%) AEPB(%)
LS-LS 18.83 50.61
BS 8.03 26.24
LS-LTS 7.53 24.59
LTS-LTS 4.75 15.51
Table 3.2: TS sequence: comparison of average error percentage
(a) Central frame (b) BS (c) LS-LTS.LS (d) LTS-LTS.LS
Figure 3.7: Pepsi sequence central frame and horizontal flow (darker pixels indicate larger speeds to the left).
in the motion boundary area (AEPB). The motion boundary area is defined as a 9-pixel-wide
band. Since the spatiotemporal data support for each flow estimate is 9 × 9 × 5, outside this
band there are no outliers from motion boundaries at either the derivative or the OFC stage.
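The AEP/AEPB measures above can be sketched as follows (array shapes are assumptions, and zero-motion pixels would need masking to avoid division by zero):

```python
import numpy as np

def average_error_percentage(flow, gt, mask=None):
    """AEP: error-vector magnitude over true speed, x100, averaged over the
    field; pass a boolean `mask` (e.g. the 9-pixel-wide motion-boundary band)
    to obtain AEPB instead. `flow`, `gt` have shape (H, W, 2)."""
    err = 100.0 * np.linalg.norm(flow - gt, axis=-1) / np.linalg.norm(gt, axis=-1)
    return err[mask].mean() if mask is not None else err.mean()
```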
The AEP and AEPB values are summarized in Table 3.2. The flow fields estimated from the
four algorithms are given in Figure 3.6. To facilitate visual comparison, we shade estimates
with error percentages larger than 0.1%. To keep the flow field plots from being too crowded,
we subsample them by 2 in both x and y directions. We observe from the results that (i)
robust methods outperform LS methods, (ii) LTS seems to be slightly better than LMedS
in the OFC stage, and (iii) LTS derivative estimation significantly reduces boundary errors.
The Pepsi Sequence
This is a real image sequence in which a Pepsi can and the background move approximately
0.8 and 0.35 pixels to the left respectively (Figure 3.7(a)).
(a) BS (b) LS-LTS.LS (c) LTS-LTS.LS
Figure 3.8: Pepsi: estimated flow fields
We show subsampled flow fields of four techniques in Figure 3.8 and the (linearly scaled) horizontal flow values in Figure 3.7.
BS’s result (Figure 3.8(a),3.7(b) has significant vertical speed components in the upper-left
and the lower parts, and the flow is still over-smoothed. Figure 3.8(b),3.7(c) is the result
of LS-LTS (1st and 2nd order constraint). Motion contrast and discontinuities are much
clearer. LTS-LTS.LS (Figure 3.8(c),3.7(d)) updated LS-LTS estimates with R2 < 0.75 and
further improved the boundary accuracy.
Discussion
The primary contribution of the above work is that it formulates optical flow estimation
as two regression problems and adaptively solves them using one-stage or two-stage LTS
methods. Preliminary experimental results on both synthetic and real image sequences
verified its effectiveness. Since derivative estimation is a fundamental step of many computer
vision problems, and most optimization problems can be fit into the regression framework,
the conclusions of this work may extend to other fields.
A limitation of the proposed method lies in the high computational cost, induced by uni-
formly applying an expensive high-breakdown robust estimator to both regression stages.
In the next section, we exploit the piecewise-smooth property of visual fields to develop a
deterministic algorithm whose complexity adapts to the degree of local outlier contamination.
It converges faster and achieves more stable accuracy than the random-sampling-based
algorithm.
3.2 Adaptive High-Breakdown Robust Methods For Visual Reconstruction
Visual reconstruction is the process of recovering the underlying true visual field from a
noisy observation [20]. It includes many fundamental tasks in early vision such as image
restoration, 3D surface reconstruction, stereo matching and optical flow estimation. What
permits the reconstruction is the piecewise continuity property of a visual field, which is
often imposed by local parametric models [55, 53]. In recent years, many robust methods
have been employed to solve the associated regression problems [13, 121, 75]; among them,
those based on high-breakdown robust criteria [106, 85], e.g. least-median-of-squares and
least-trimmed-squares, have reported the best accuracy.
High-breakdown criteria usually have no closed-form solutions, so certain approximation
schemes must be used. Different approximation methods may lead to very different accuracy
and convergence rate; and research is still going on in the statistics community to find
more appropriate methods [104]. So far almost all high-breakdown robust methods in
visual reconstruction applications [121, 75, 5, 97, 117, 145] adopt the random-sampling-based
algorithm outlined by Rousseeuw and Leroy [106]—the estimate with the best criterion value
is picked from a random pool of trial estimates, and the algorithm is uniformly applied to
all pixels in an image. The generic scheme is summarized in Figure 3.9.
Using the same number of trial subsets p at all locations causes both efficiency and
accuracy concerns. According to Eq. 3.4, p must be chosen large enough to ensure a high
breakdown point. Since evaluating the criterion value F is an expensive operation, a large p
value incurs a heavy computational burden. Meanwhile, much of the burden is unnecessary
because most places in a normal visual surface have few outliers and the above complicated
process often ends up generating least-squares estimates. If a priority is placed on saving
computation and a smaller p value is chosen, the probability of locking on the correct
solution can be hurt, especially at locations with significant contamination.
What is needed, in order to circumvent the efficiency-accuracy predicament, is an adap-
choose number of subsets p according to Eq. 3.4
for all pixels {
    for p subsets {
        compute LS solution uLS;
        store criterion value F;
    }
    select solution with best criterion value;
    compute WLS solution from Eq. 3.5;
}
Figure 3.9: Random sampling based algorithm for high-breakdown robust estimators
tive scheme which performs least-squares estimation when no outlier is present and increases
the p value as the noise contamination becomes more severe. This does not seem possible for
an isolated regression problem in which no prior information about outlier contamination
is available. But in visual reconstruction problems, we can exploit the piecewise-smooth
property of visual surfaces to achieve the adaptiveness.
3.2.1 The Approach
Now we present an adaptive algorithm for high-breakdown robust visual reconstruction by
considering the example problem of estimating piecewise-constant optical flow from noisy
measurements of first-order derivatives, i.e., solving the Lucas-Kanade constraint equation
Eq. 2.7 for all pixel locations under a high-breakdown robust criterion.
Observing that in normal image sequences, (i) the majority of pixel locations do not have
outliers and least-squares estimates are reasonably good, and (ii) flow fields are smooth and
nearby estimates have similar values (true even at motion boundaries), we initialize the
flow field using least-squares estimates, and then iteratively generate trial solutions for each
pixel using its neighbors’ values. Given a trial solution, which may come from either the
least-squares initial or a neighbor’s value, we identify the part of local constraints consistent
with it and calculate a new solution following the weighted least-squares (WLS) procedure
(Eq. 3.5). In this way, we obtain an updated trial solution which represents the local
for all pixels {
    compute LS estimate VLS;
    V, F ← WLS on VLS;
}
while #{pixels updated} > 0 {
    for all pixels {
        for all its neighbors Vn {
            if (Vn updated and |Vn − V| > T) {
                Vtry, Ftry ← WLS on Vn;
                if (Ftry < F) { update V, F; }
            }
        }
    }
}
Figure 3.10: Adaptive algorithm for high-breakdown robust estimators
constraints more closely. We then compute its criterion value, and retain this trial solution
if it achieves the best criterion value so far. The algorithm is described in Figure 3.10.
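The scheme of Figure 3.10 can be sketched in NumPy as follows. The WLS "locking" step here is a simple iterated residual-threshold refit standing in for the thesis's Eq. 3.5 weighting, `constraints` maps each pixel to its local OFC system (A, b), and all names and thresholds are illustrative:

```python
import numpy as np

def wls_update(A, b, v0, iters=3):
    """Iterated refit on the constraints consistent with trial v0 (a simple
    stand-in for the Eq. 3.5 WLS weighting); returns the refined estimate
    and its LMedS criterion value (median squared residual)."""
    v = np.asarray(v0, dtype=float)
    for _ in range(iters):
        r = np.abs(A @ v - b)
        scale = 1.4826 * np.median(r) + 1e-12      # robust residual scale
        keep = r <= 2.5 * scale
        if keep.sum() < 2:                         # degenerate: best half
            keep = np.argsort(r)[: max(2, len(r) // 2)]
        v, *_ = np.linalg.lstsq(A[keep], b[keep], rcond=None)
    return v, np.median((A @ v - b) ** 2)

def adaptive_flow(constraints, shape, step, T=0.1):
    """Adaptive high-breakdown estimation: LS initialization, then
    iteratively borrow updated neighbor estimates as new trial solutions."""
    V = np.zeros(shape + (2,))
    F = np.full(shape, np.inf)
    for p, (A, b) in constraints.items():
        v0, *_ = np.linalg.lstsq(A, b, rcond=None)
        V[p], F[p] = wls_update(A, b, v0)
    updated = set(constraints)
    while updated:
        nxt = set()
        for (i, j), (A, b) in constraints.items():
            for n in [(i - step, j), (i + step, j), (i, j - step), (i, j + step)]:
                if n in updated and np.linalg.norm(V[n] - V[i, j]) > T:
                    v, f = wls_update(A, b, V[n])
                    if f < F[i, j]:                # keep the best trial so far
                        V[i, j], F[i, j] = v, f
                        nxt.add((i, j))
        updated = nxt
    return V
```

The `step` argument corresponds to the w/2 neighbor spacing discussed below.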
Because of the “locking” capability of the WLS update step, similar neighbor values
usually result in very close or identical trial solutions, and hence not all neighboring values
need to be borrowed. Also, to expedite convergence, it is better to use neighbors a few
pixels away rather than immediate ones. Therefore, we use four neighbors to the N,S,E,W
directions which are w/2 pixels away, where w is the window size of constant local flow.
In this approach, the piecewise smoothness property of the visual field and the selection
capability of robust estimators are exploited to produce trial solutions in a much more
educated way than random sampling methods. The complexity of the estimator varies with
the local structure: least-squares solutions are used where no outlier is present, and the
number of trials increases with the outlier percentage. The adaptive nature is revealed in
Figure 3.11, which shows the number of trials at each pixel as an intensity image for the
TS sequence (see description in Section 3.1.4). More trials, indicated by brighter colors,
are carried out closer to the boundary where the structure is more complex. The trial set
size ranges between 1 and 13 in this case, as opposed to the uniform p = 30 in the random
sampling based algorithm [5].
3.2.2 Experiments and Analysis
We calculate derivatives from a first-order spatiotemporal facet model on a support of size
3 × 3 × 3 [145], and solve the optical flow constraint under the LMedS criterion. Optical
flow is estimated on the middle frame of every three frames. A hierarchical scheme [12] is
adopted to handle large motions, with the number of pyramid levels empirically determined.
We carefully handled boundary cases such that the resulting flow field is of the same size as
the original image.
Comparison is made with modified versions of Lucas and Kanade’s LS based method
(LK) [82] and Bab-Hadiashar and Suter’s random LMS based method (BS) [5]. Their
original implementations do not include hierarchical processing, and derivatives are estimated
differently. To emphasize the contrast of different regression methods, we implemented LK
and BS by modifying the code of our algorithm. No pre-smoothing is done and the constant
flow window size is fixed at 9 × 9. p = 30 random subsets are drawn in BS as [5] suggested.
Experiments are carried out on a PIII 500MHz PC running Solaris. Vector plots below are
appropriately subsampled and scaled to facilitate visual inspection.
Five image sequences with flow groundtruth are used for quantitative comparison. Two
error measures are reported. One is the angular error e∠ used in [7]. It is defined on the
normalized augmented flow vector u = (u′, 1)′ as arccos(u · u0), where u0 is the correct flow vector. The
other one is the error vector magnitude measure e|·| = |u − u0|/|u0|. We also report the
consumed CPU time in seconds to give a rough idea on the speed contrast.
Translating Squares Sequence (TS). This data set was introduced in Section 3.1.4
(Figure 3.6). Calculation is done on the original resolution. The correct and estimated
vector plots are given in Figure 3.12. LK’s result is smeared around motion boundaries
and good elsewhere. This shows the necessity of robust estimation, and also justifies using
LS for initialization in our method. BS’s and our results look close for this data set.
The reason is that for the case m = 2 which we examine here, p = 30 samples
make the probability of having at least one good initial as high as 99.98%, even with 50%
Figure 3.11: TS trial set size map. Value range: 1 (darkest) → 13 (brightest). More trial solutions are generated in places of higher motion complexity.
outliers present. For these reasons and because of the simplicity of the TS sequence, the
random scheme succeeds most of the time. This similarity in turn justifies the viability of
implementing high-breakdown criteria without random sampling.
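The 99.98% figure quoted above can be checked directly: the chance that p random m-point subsets include at least one all-inlier subset at outlier fraction ε is 1 − (1 − (1 − ε)^m)^p.

```python
# probability that at least one of p random 2-point subsets is outlier-free
p, m, eps = 30, 2, 0.5
good = 1.0 - (1.0 - (1.0 - eps) ** m) ** p
print(round(100 * good, 2))  # -> 99.98
```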
TT, DT, YOS Sequences. Three popular synthetic data sets, Translating Tree
(TT), Diverging Tree (DT) and Yosemite (YOS), are obtained from Barron [7]. Their
middle frames, the 20th in the TT and DT data sets and the 9th in the YOS data set, are
given in Figs. 3.2.2, 3.2.2. TT and DT (150 × 150) simulate translational camera motion
with respect to a textured planar surface. TT’s motion is horizontal, DT’s is divergent and
their maximum speeds are about 2 pixels/frame. The motion in YOS (316× 252) is mostly
divergent with the maximum speed about 4-5 pixels/frame. The cloud part is excluded
from evaluation. We use two levels of pyramid for TT and DT, and three levels for YOS.
OTTE Sequence. This real image sequence is provided by Nagel [98]. The scene
is stationary except for a marble block in the center moving leftwards; and the camera is
translating. Groundtruth is available where the vector is nonzero (Figure 3.15). Three
levels of pyramid are used. Measures on all sequences are summarized in Table 3.3.
Robust methods are more accurate but much slower. Quite noticeably however, the
accuracy advantage fades with the use of image pyramids. This is caused by the limitations
(a) Groundtruth (b) LK
(c) BS (d) Ours
Figure 3.12: TS: correct and estimated flow fields
Data   Technique   e∠ (°)   e|·| (%)   time (sec)
TS     LK          6.14     15.12      0
       BS          1.10     2.64       2
       Ours        1.09     2.65       0
TT     LK          2.36     5.48       1
       BS          1.67     3.75       16
       Ours        1.39     3.22       6
DT     LK          6.12     18.33      1
       BS          5.73     18.53      16
       Ours        5.00     16.14      10
YOS    LK          3.69     12.68      4
       BS          3.81     11.87      61
       Ours        3.42     11.10      40
OTTE   LK          17.22    48.56      13
       BS          17.02    48.23      205
       Ours        16.84    47.90      121
Table 3.3: Quantitative comparison of the proposed adaptive LMedS algorithm to Lucas and Kanade (LK) [82] and Bab-Hadiashar and Suter (BS) [5]. The new algorithm is more accurate, and more efficient than BS.
Figure 3.13: TT, DT middle frame Figure 3.14: YOS middle frame
[14] of the simple hierarchical strategy [12]. The error it introduces can be greater than
that from LS estimation and becomes the quality bottleneck. This issue will be addressed
in the next chapter. Note in Table 3.3 that while our method significantly outperforms LK
in all cases, BS produces larger errors than LK for the DT and YOS sequences. This suggests
the unstable nature of random-sampling-based LMedS.
TAXI Sequence. TAXI is a real sequence with no groundtruth also from Barron
[7]. In the street scene there are four moving objects: a taxi turning the corner, a car in
the lower left driving leftwards, a van in the lower right driving towards the right, and a
pedestrian in the upper-left. Their image speeds are approximately 1.0, 3.0, 3.0 and 0.3
pixels/frame respectively. Two levels of pyramid are used. To enhance details, we display
the horizontal flow component as intensity images in Figure 3.17. Brighter pixels represent
larger speeds to the right. In LK’s estimate the flow fields of the vehicles have severely
invaded the background. BS and our method preserve motion boundaries better. However
it is quite obvious that BS has bumpier boundaries and produces more gross errors, for
instance on the taxi. In addition, BS took 36 seconds CPU time while ours only took 13
seconds. We show the trial solution set sizes and give the minimum and maximum numbers
of trials on two levels of pyramid and their sum in Figure 3.16. Apparently larger numbers
(a) Middle frame (b) True flow
Figure 3.15: OTTE sequence
of trials were used for places with more complex motion. This observation suggests that the
trial set size map might help motion structure analysis.
3.2.3 Discussion
In this section we have presented an adaptive high-breakdown robust method for visual re-
construction and applied it to optical flow estimation. By taking advantage of the piecewise
smoothness property of visual fields and the selection capability of robust estimators, this
algorithm can be faster and more accurate than algorithms based on random sampling.
Although we have chosen locally constant flow estimation to illustrate its effectiveness,
the strengths of this approach should be more apparent in problems of higher dimensions,
such as affine flow estimation and piecewise-cubic image restoration, for which random
sampling methods quickly become computationally formidable. One of our future work
directions is to extend this approach to these applications. Also worth further investigation
is a less expensive alternative to the WLS estimator (Eq. 3.5) for updating estimates during
(a) Middle frame (b) Level 1: (1,14)
(c) Level 0: (1,13) (d) Total: (2,25)
Figure 3.16: TAXI: snapshot and trial set size maps (in parentheses: min. and max. numbers of trials)
(a) BS (36sec)
(b) Ours (13sec)
Figure 3.17: TAXI: intensity images of the x-component. Note that BS has bumpier boundaries and produces more gross errors, e.g., near the center of the taxi.
each visit. One possibility is to use the concentration property of LTS [104]. Finally, it
would be interesting to see how the trial set size could be used as an early cue for analyzing
scene complexity.
3.3 Error Analysis on Robust Local Flow
In this section we provide an error analysis of the first-order differential local regression tech-
nique through covariance propagation [52]. By using a high-breakdown robust criterion, we
minimize the impact of outliers on both optical flow estimation and its error evaluation.
We calculate spatiotemporal derivatives from a facet model, which enables us to take corre-
lation of adjacent pixels into account, and estimate image noise and derivative errors in an
adaptive fashion. In the regression problem, we consider errors from both the observations
and the measurements. In addition, we adopt a hierarchical process to handle large motion.
Our error analysis is more complete, systematic and reliable than previous attempts. The
advantages are demonstrated in a motion boundary detection application.
3.3.1 Covariance Propagation
Covariance Propagation Theory
Consider a system relating the output y to the input x by the function
y = f(x).
Generally f(·) is nonlinear; but when the perturbation ∆x is small enough to fall within its linear
range, the output error is well approximated by
∆y = (df(x)/dx) ∆x.
Then the covariance of the output is
Σy = (df(x)/dx) Σx (df(x)/dx)′.   (3.9)
f(·) of most real systems cannot be expressed explicitly. Instead a relationship
g(x, y) = 0
usually exists. In such cases we have
df(x)/dx = −(∂g(x, y)/∂y)^{−1} ∂g(x, y)/∂x,
and finally
Σy = [(∂g(x, y)/∂y)^{−1} ∂g(x, y)/∂x] Σx [(∂g(x, y)/∂y)^{−1} ∂g(x, y)/∂x]′.   (3.10)
Evaluating the covariance requires the knowledge of the true x,y values, which are seldom
available and commonly approximated by their estimates in practice.
The assumptions of unimodal noise and accurate observations and estimates limit the
application of the theory to only near-perfect systems with no outlier present.
Below we introduce the application of the theory to our robust optical flow estimator.
The explanation proceeds in two steps: the OFC step and the facet model step. The results presented here are more
general than those reported in an earlier paper about a least-squares technique [144].
The OFC Step
After outliers are removed under the robust criterion, the actual OFC we use is composed
of the rows in Eq. 2.7 corresponding to the inliers. For simplicity, from now on we overload
Eq. 2.7 by the actual OFC and let n be the number of inliers.
Solving Eq. 2.7 using least-squares is optimized for the model that only b is contaminated
by iid additive zero-mean noise with variance σ2b . Under this assumption, the optical flow
covariance is simply
Σu = σb² (A′A)^{−1}   (3.11)
where σb² can be estimated from the residual errors as
σb² = (1/(n − 2)) Σ_{i=1}^{n} r_i².
The above error model is apparently unrealistic because spatial derivatives in A are noisy
as well. Under the Error-In-Variable (EIV) condition the LS estimate is biased. Accordingly,
efforts have been made to calculate the unbiased estimate using generalized least-squares
[90, 96, 95]. However, at the cost of much heavier computation, these methods bring little
accuracy improvement. It is because bias is a much weaker error source than outliers [15]
in optical flow estimation; methods which can suppress outliers [15, 147] achieve fairly good
accuracy, much better than what generalized least-squares can do. In addition, bias has
also turned out to be less significant than estimation variance in motion estimation [27].
Therefore, we solve the outlier-suppressed OFC using an LS estimator, and analyze the
estimation error by propagating covariance from the derivative estimates.
The input to the system Eq. 2.7 is the spatiotemporal derivative vector d and the output
is the optical flow vector u. We assume their errors are both zero-mean and have covariance
matrices Σd and Σu respectively. u and d do not have a linear relationship, but they are
related by
g(u, d) = ∂F(d, u)/∂u = A′(Au − b) = 0
where F(d, u) = |Au − b|² is the criterion function. Proceeding as the covariance propagation theory (Section 3.3.1) suggests, we obtain
∂g(d, u)/∂u = A′A
and
∂g(d, u)/∂d = (∂g(d, u)/∂d_1, . . . , ∂g(d, u)/∂d_n)
where
∂g(d, u)/∂d_i = [ r_i + I_xi u    I_xi v         I_xi ]
                [ I_yi u          r_i + I_yi v   I_yi ].
Applying Eq. 3.10 yields the optical flow estimate covariance
Σu = [(∂g(d, u)/∂u)^{−1} ∂g(d, u)/∂d] Σd [(∂g(d, u)/∂u)^{−1} ∂g(d, u)/∂d]′.   (3.12)
This expression reveals that the error in the optical flow estimate not only depends on
the residual errors and the derivative values (through system conditioning) as indicated
by Eq. 3.11, but also relates to the optical flow value and errors in the derivatives. Such
observations have been made in many previous studies [7, 15, 5].
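A sketch of Eq. 3.12 in NumPy follows (names are illustrative; Σd is assumed ordered as (Ix, Iy, It) per constraint). A useful sanity check: if only the temporal derivatives are noisy and iid, the result collapses to the σ²(A′A)^{−1} form of Eq. 3.11.

```python
import numpy as np

def flow_covariance(A, b, u, Sigma_d):
    """Propagate the derivative covariance Sigma_d (3n x 3n, blocks ordered
    (Ix_i, Iy_i, It_i)) to the flow estimate u via Eq. 3.12."""
    n = len(b)
    r = A @ u - b                          # constraint residuals
    dg_dd = np.zeros((2, 3 * n))
    for i in range(n):
        Ix, Iy = A[i]
        dg_dd[:, 3*i:3*i+3] = [[r[i] + Ix * u[0], Ix * u[1], Ix],
                               [Iy * u[0], r[i] + Iy * u[1], Iy]]
    B = np.linalg.solve(A.T @ A, dg_dd)    # (A'A)^{-1} dg/dd
    return B @ Sigma_d @ B.T               # 2 x 2 flow covariance
```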
Assuming the derivatives and optical flow estimates are sufficiently accurate, we use
them in place of the unknown true values for evaluation. Now the only missing piece in
the above expression is the derivative error covariance Σd. Its modeling has posed a great
difficulty to many previous studies [119, 90, 96]. Below we tackle the problem using the
facet model.
The Facet Model Step
We assume the image noise is an iid zero-mean variable with variance σ2. From Sec-
tion 3.3.1 we know that the gradient vector di at pixel i, i = 1, . . . , s is linearly related to
its neighborhood data Ji by di = M Ji. This permits direct application of Eq. 3.9 and leads to
Σdi = σi² M M′.
Similarly, for any pair of gradient vectors di, dj we have
Σdidj = σij² Mi Mj′,
where Mi, Mj are the weights on their overlapping support, and σij² is approximated by
σi σj. Finally the full derivative covariance matrix is assembled from Σdidj, i, j = 1, . . . , n.
Notice that Mi Mj′ depends on the positional relationship of pixels i and j, and only takes
a few forms once the supports for the OFC and the facet model are determined. Hence in
implementation we create a lookup table of all possible Mi Mj′ beforehand and refer to it
during pixel-by-pixel error estimation.
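The per-pixel covariance Σdi = σ²MM′ can be sketched directly from the facet design matrix; the function name and argument layout are illustrative:

```python
import numpy as np

def derivative_covariance(D, deriv_rows, sigma2):
    """Sigma_di = sigma^2 * M M': M maps neighborhood data J to the
    derivative estimates, i.e. the rows of the LS pseudoinverse of the
    facet design matrix D corresponding to the linear-term coefficients."""
    M = np.linalg.pinv(D)[deriv_rows]   # e.g. 3 x 125 for a 5x5x5 cubic facet
    return sigma2 * M @ M.T             # 3 x 3 derivative covariance
```

On a symmetric grid the monomial columns of D are orthogonal, so for a first-order facet the result is diagonal, which makes an easy hand-checkable case.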
The above procedure defines the structure of Σd. [90, 95] arrive at similar conclusions
using their derivative masks. However, they meet difficulty with image noise variance
estimation. [90] attempts to evaluate the variance empirically, but the method turns out to be
unsuccessful. The reason is that here the image noise is not caused simply by acquisition
errors; it also depends on the derivative masks and the local image texture [119, 96, 92].
We derive an estimate of σ² from the facet model fitting residual error
σ² = |Da − J|² / (nd − nb)
where nd, nb are respectively the number of pixels in J and the number of polynomial bases.
This measure reflects the deviation of the local image texture from the assumed polynomial
model, which arises from either image noise or complex textures. It is a by-product of
derivative estimation, and is adaptive across the image. The use of the facet model fully
automates the error propagation from the image data to the optical flow estimation.
Hierarchical Processing
We build our optical flow estimation and covariance propagation method in a hierarchical
scheme to cope with large motions [12]. [119] propagates covariance down the pyramid by
a Kalman-filter-like scheme. Currently we assume results on different pyramid levels are
independent from each other, and hence we combine covariance matrices of different levels
simply by multiplying the values at the higher level by 4 and adding them to the values at the
lower level. Due to the limitations of hierarchical schemes [147] and the crude combination
method, we observe performance degradation as the number of pyramid levels increases.
Handling large motions remains a very difficult problem and needs further investigation.
3.3.2 Experiments
Motion Boundary Detection. Inspired by [93, 90], we demonstrate the performance
of our error analysis method through a local statistical motion boundary detector. Given
two adjacent optical flow vectors and their covariance matrices (ui, Σui) and (uj, Σuj),
we examine the hypothesis H0 that they originate from normal distributions with the same
mean. Under H0, their difference vector u = ui − uj obeys a bivariate normal distribution
u ∼ N(0, Σui + Σuj). Thus the statistic

T = u′ (Σui + Σuj)⁻¹ u

should obey a χ² distribution with 2 degrees of freedom. We reject H0, i.e., declare a boundary
pixel pair, when T > Tα. Each Tα corresponds to a significance level α, which is the
theoretical false-alarm rate.
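In code, the test is only a few lines; for 2 degrees of freedom the χ² threshold has the closed form Tα = −2 ln α, so no statistics library is needed (`boundary_test` is our illustrative name):

```python
import math
import numpy as np

def boundary_test(u_i, cov_i, u_j, cov_j, alpha=0.05):
    """Chi-square test (2 dof) for a motion boundary between two flow
    vectors.  Under H0 the difference u obeys N(0, cov_i + cov_j), so
    T = u' (cov_i + cov_j)^{-1} u ~ chi^2 with 2 dof, whose upper-alpha
    threshold is T_alpha = -2 ln(alpha)."""
    u = np.asarray(u_i, float) - np.asarray(u_j, float)
    S = np.asarray(cov_i, float) + np.asarray(cov_j, float)
    T = float(u @ np.linalg.inv(S) @ u)
    return T > -2.0 * math.log(alpha), T
```

For example, with unit covariances a difference of (4, 0) gives T = 8, above the α = 0.05 threshold of about 5.99, so a boundary is declared.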
We estimate optical flow on the middle frame of an odd number of frames. The constant-flow
window size is fixed at 9 × 9. Estimates at image borders are handled specially, so that they
also have good accuracy and the resultant flow field is of the same size as the original image
[142].
We compare two optical flow estimators, LS and LMS, and covariance propagation under
two noise models: the i.i.d. error model (Eq. 3.11) and the correlated EIV model (Eq. 3.12).
This forms four combinations: (a) LS OFC and covariance from Eq. 3.11, (b) LS OFC and
covariance from Eq. 3.12, (c) LMS OFC and covariance from Eq. 3.11, and (d) LMS OFC and
covariance from Eq. 3.12. (d) is the proposed method; (a) is similar to [126, 56], and (b) is similar to [90]. In
Figure 3.18: TS motion boundary. α values: (a) 0.01, (b) 0.15, (c) 2e−11, (d) 0.05.
all experiments, we adjust the α value to produce the best visual results. The performance
of the different methods is compared by inspecting the false-alarm and misdetection rates. The
closeness of the α value to the observed false-alarm rate is an indicator of the statistical
validity of the results.
TS Sequence. We calculate derivatives from a first-order spatiotemporal facet model
on a support of size 3× 3× 3 [145]. 3 frames are used. The results are shown in Figure 3.18
with α values given in captions. Simple as it is, the χ2 test is effective in detecting motion
boundaries. LS methods break down around motion boundaries, and the covariance computed
under either error model is unreliable there. This verifies that performance analysis based on
covariance propagation only works for near-perfect systems and is unable to detect its own failure. (c,d)
are similar by visual inspection. However, with an inappropriate noise model, (c) severely
underestimates the error. Its associated α = 2e−11 makes little statistical sense, and thus in
practice there is no good method for choosing the threshold. With outliers rejected and the
correlated EIV model assumed, good results with solid statistical meaning are produced by
(d).
Hamburg Taxi Sequence (TAXI). Derivatives are calculated from a cubic spa-
tiotemporal facet model on a support of size 5× 5× 5 [145]. 5 frames are used. Figure 3.19
gives the results using a 2-level pyramid. Due to limitations of the hierarchical processing
scheme (Section 3.3.1), the results are noisier than those computed on the higher pyramid
level only (the original sequence spatially subsampled by 2), shown in Figure 3.20. But the
overall observation is that robust estimates have more faithful boundaries, and the correlated
EIV model yields far fewer false alarms and misdetections. Quite noticeably though, the α values
become more problematic on real data. This suggests that our error modeling still needs
refining to meet the demands of real-world complexity.
3.3.3 Discussion
In this section we have presented an error analysis of a robust optical flow estimation
technique. Our work extends previous research in several directions. First of all, we make explicit
the dependence of the popular covariance propagation theory on accurate estimates, and
perform our analysis with a highly robust technique. We employ a high-breakdown robust
criterion to reject outliers, which are most detrimental to both optical flow estimation and
error analysis. By using a 3D facet model we obtain good derivative estimation, and in
addition we systematically estimate the image noise strength and the correlated errors in
spatiotemporal derivatives. We also adopt a hierarchical scheme to handle the large motion
case.
We illustrate the effectiveness of our error analysis on an application of statistical mo-
tion boundary detection. Compared to least-squares based methods, our method has sig-
nificantly higher motion estimation accuracy and boundary fidelity, and produces less false
alarms and misdetections. These exhibit the potential of our results in a wide range of
applications such as Structure From Motion (SFM) and camera calibration [34].
Automatic performance analysis is a very important yet very difficult problem. Although
it goes one step further than previous attempts, our approach is still based on the covariance
propagation theory and breaks down when the estimate quality becomes too low. The open
issue is how to make the system aware of when the quality of the estimates becomes too
low to support further inference.
Figure 3.19: TAXI motion boundary. α values: (a) 0.05, (b) 0.4, (c) 0.001, (d) 0.5.
Figure 3.20: TAXI motion boundary on images subsampled by 2. α values: (a) 0.0005, (b) 0.2, (c) 3e−7, (d) 0.25.
Chapter 4
GLOBAL MATCHING WITH GRADUATED OPTIMIZATION
The local approaches we presented in the previous chapter analyze each optical flow
vector by exploring image data in a small spatiotemporal neighborhood surrounding that
pixel location. Due to the limited contextual information, drawbacks of such approaches
are obvious: if data in a neighborhood do not have enough brightness variation or they
happen to be very noisy, the analysis can completely fail; in other words, local approaches
are highly sensitive to the aperture problem and their reliability can vary greatly within a
single image. In order to overcome such limitations, appropriate global approaches must be
developed to incorporate contextual information more effectively.
Global optimization techniques for optical flow estimation have been extensively studied
throughout the years, but the state-of-the-art performance remains unsatisfactory due to
formulation defects and solution complexity. On one hand, approximate formulations are
frequently adopted for ease of computation, with the consequence that the correct flow is
unrecoverable even in ideal settings. As an example, many methods intended to preserve
motion discontinuities use gradient-based brightness constraints, which can break down at
discontinuities due to derivative evaluation failure and thus cannot reach the goal of precise
boundary localization [145]. On the other hand, more sophisticated formulations typically
involve large-scale nonconvex optimization problems, which are so hard to solve that the
practical accuracy may not be competitive with that of simpler methods. Motion estimation research
has arrived at a stage in which a good collection of ingredients is available; but in order
to significantly improve performance, both problem formulation and solution methods need
to be carefully considered and optimized.
In this chapter, we discuss the problem of optimal optical flow estimation assuming
brightness conservation and piecewise smoothness and propose a matching-based global
optimization method with a practical solution technique.
From a Bayesian perspective, we assume the flow field prior distribution to be a Markov
random field (MRF) and formulate the optimal optical flow as the maximum a posteriori
(MAP) estimate, which is equivalent to the minimum of a robust global energy function.
The novelty in this formulation lies mainly in two aspects. 1) Three-frame matching is
proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2)
The strengths of brightness and smoothness errors in the global energy are automatically
balanced according to local data variation, and consequently parameter tuning is reduced.
These features enable our method to achieve a higher accuracy upper-bound than previous
algorithms.
In order to solve the resultant energy minimization problem, we develop a hierarchical
three-step graduated optimization strategy. Step I is the robust local gradient method
that we have proposed in Section 3.2. It provides a high-quality initial flow estimate.
Step II is a global gradient-based formulation solved by Successive OverRelaxation (SOR),
which efficiently improves the flow field coherence. Step III minimizes the original energy
by greedy propagation. It corrects gross errors introduced by derivative evaluation and
pyramid operations. In this process, merits are inherited and drawbacks are largely avoided
in all three steps. As a result, high accuracy is obtained both on and off motion boundaries.
Performance of this technique is demonstrated on a number of standard test data sets.
On Barron’s benchmark synthetic data [7], this method achieves the best accuracy among
all low-level techniques. A close comparison with the well-known dense regularization
technique of Black and Anandan (BA) [14] shows that our method yields uniformly higher accuracy
in all experiments at a similar computational cost.
4.1 Formulation
Let I(x, y, t) be the image intensity at a point s = (x, y) at time t. The optical flow at
time t is a 2D vector field V with the vector at each site s denoted by Vs = (us, vs)T ,
where us, vs represent the horizontal and vertical velocity components, respectively. Where
no confusion arises, we may drop the s index and denote an image frame by I(t)
and a flow vector by V = (u, v)T. The task of estimating optical flow can be described
as finding V to best interpret the spatiotemporal intensity variation in the image frames
I = {I(t1), . . . , I(t), . . . , I(t2)}, t1 ≤ t ≤ t2. We consider it as a Bayesian inference problem
and define the optimal solution under the maximum a posteriori (MAP) criterion.
4.1.1 MAP Estimation
Let P (V |I) be the posterior probability density of the flow field V conditioned on the
intensity observation I. According to the maximum a posteriori (MAP) criterion, the best
optical flow V is at the mode of this density, i.e.,
V = argmaxV P (V |I).
Applying Bayes rule, the posterior pdf can be factored as

P(V | I) = P(I(t) | V, I − I(t)) P(V | I − I(t)) / P(I(t) | I − I(t))   (4.1)

where I − I(t) designates the image frames excluding the one on which we estimate optical
flow. Ignoring the denominator, which does not involve V, we have

V = argmax_V P(I(t) | V, I − I(t)) P(V | I − I(t))   (4.2)
where P (I(t)|V, I − I(t)) is the likelihood of observing the image I(t) given the optical flow
V and its neighboring frames I − I(t); P (V |I − I(t)) is the prior probability density of the
optical flow.
4.1.2 MRF Prior Model
We model the prior distribution of the optical flow using a Markov random field. The
MRF is a highly effective model for piecewise smoothness. It was first introduced for image
restoration by Geman and Geman [42] and has been widely employed in motion estimation
to preserve boundaries [88, 77, 57, 14]. The elegance of the MRF lies in the fact that once a
neighborhood system N is defined, the MRF/Gibbs distribution equivalence allows the prior
distribution with respect to N to be expressed in terms of a potential function ES(V) as

P(V | I − I(t)) = exp(−ES(V))/Z.
Figure 4.1: Comparison of the Geman-McClure norm (solid line) and the L2 norm (dashed line): (a) error norms ρ(x, σ) with σ = 1, (b) influence functions ψ(x, σ) = ρ′(x, σ), (c) corresponding probability density functions pdf ∝ exp(−ρ(x, σ)) (truncated).
The partition function Z is a normalizing constant. ES(V) is the flow smoothness energy
modeled as a sum of site potentials: ES(V) = Σs ES(Vs).
We use a second-order neighborhood system of only pairwise interactions in the flow
field prior. Correspondingly, the local flow smoothness potential ES(Vs) is specified by the
average deviation of Vs from its 8-connected neighbors Vi, i ∈ N8s:

ES(Vs) = (1/8) Σ_{i∈N8s} ρ(Vs − Vi, σSs).   (4.3)
Here σSs is the flow variation scale at the site s, and the error norm ρ(x, σ) reflects the flow
deviation distribution.
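Eq. (4.3) transcribes directly into code (a sketch with our own function names; we apply ρ to the magnitude of the vector difference and skip border sites for brevity):

```python
import numpy as np

def rho(x, sigma):
    """Geman-McClure error norm applied to the magnitude of x."""
    r2 = float(np.sum(np.square(x), axis=-1))
    return r2 / (r2 + sigma ** 2)

def smoothness_energy(V, sigma_s):
    """E_S summed over interior sites: (1/8) * sum over the 8-connected
    neighbours of rho(V_s - V_i, sigma_s)."""
    E = 0.0
    h, w, _ = V.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            E += sum(rho(V[y, x] - V[y + dy, x + dx], sigma_s)
                     for dy, dx in offs) / 8.0
    return E
```

A perfectly uniform flow field has zero smoothness energy; any deviation from the neighbours raises it, but only up to the bounded value of ρ.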
The choice of ρ is the decisive factor of the boundary preservation capability of an MRF
formulation. If ρ is an L2 norm and σSs is a fixed global parameter, the flow prior potential
reduces to the smoothness error in the Horn and Schunck formulation (Eq. (2.6)), which
does not preserve motion discontinuities at all. Geman and Geman [42] modeled continuous
surfaces as an MRF, and introduced the “line process”, a set of binary variables indicating
edges, as a dual MRF. This formulation has been widely adopted in motion estimation
[88, 77, 57, 14]. It was shown by Blake and Zisserman [20] to be equivalent to assuming ρ
as a truncated quadratic function. In a robust statistics context, Black [14, 19] generalized
the line process to an analog “outlier process”. We adopt this point of view in designing
the error norm. Further insight into the error norm and our design, from the distribution
and robust statistics perspectives, is given in the following two sections.
4.1.3 Likelihood Model: Robust Three-Frame Matching
If the likelihood P(I(t) | V, I − I(t)) is a site-independent exponential distribution
proportional to exp(−Σs EB(Vs)), the posterior distribution is also Gibbs, with the potential
resembling the regularization global energy (Eq. (2.6)). We take this approach so that
specifying the likelihood term reduces to modeling the brightness conservation error and its
potential function.
We use the matching constraint Eq. (2.1) to model brightness conservation. The tradi-
tional assumption that pixels are visible in all frames is a major source of gross errors in
occlusion areas. Taking such violations as outliers [14] may prevent error from propagating
to nearby regions, but does not provide constraints for occlusion pixels and thus does not
help their motion estimation. We observe that without temporal aliasing, all points in a
frame are visible in the previous or the next frame. Therefore we define the matching error
as the minimum of the backward and forward warping errors in three frames, i.e.,

eW(Vs) = min(|Ib(Vs) − Is|, |If(Vs) − Is|)

where Is is the intensity of pixel s in the middle frame; Ib(Vs), If(Vs) are warped intensities
in the previous and the next frames respectively. We are the first to explicitly model
correspondence at occlusions in optical flow estimation [147]. A similar idea, known as a
temporally shiftable window, has shown high effectiveness in handling occlusions in multi-
view stereo [73].
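The three-frame matching error can be sketched as follows (our own helper names; nearest-neighbour sampling stands in for the bilinear interpolation actually used, to keep the sketch short):

```python
import numpy as np

def warp_error(I_ref, I_other, V, sign):
    """|I_other(s + sign * V_s) - I_ref(s)| with nearest-neighbour
    sampling, coordinates clipped at the image border."""
    h, w = I_ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xw = np.clip(np.round(xs + sign * V[..., 0]).astype(int), 0, w - 1)
    yw = np.clip(np.round(ys + sign * V[..., 1]).astype(int), 0, h - 1)
    return np.abs(I_other[yw, xw] - I_ref)

def matching_error(I_prev, I_mid, I_next, V):
    """e_W = min(backward, forward) warping error: each pixel of the
    middle frame only needs to be visible in one neighbouring frame."""
    e_b = warp_error(I_mid, I_prev, V, sign=-1.0)
    e_f = warp_error(I_mid, I_next, V, sign=+1.0)
    return np.minimum(e_b, e_f)
```

For a pattern translating at one pixel per frame, the minimum is zero everywhere: pixels hidden in one direction are matched in the other, which is the occlusion argument made above.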
It is conventional to assume that matching error comes from iid Gaussian noise and
correspondingly use its L2 norm as the potential function [58]. However, image noise is not
always Gaussian due to abrupt perturbations, and the matching error can come from other
sources such as warping failures. It may often take large values and thus has a distribution
with fatter tails than Gaussian. To represent the distribution more realistically, we use a
robust error norm ρ(x, σ) to define the potential, yielding
EB(Vs) = ρ(eW (Vs), σBs), (4.4)
where σBs is the local brightness variation scale. Figure 4.1 gives the ρ, ψ curves for the L2
and Geman-McClure error norms [15] and their corresponding pdfs¹. The Geman-McClure
pdf (Figure 4.1(c)) has much fatter tails than the Gaussian.
The above prior and likelihood models are also justified from the robust statistics per-
spective. In optical flow estimation, small matching errors and smooth flows are dominant;
large errors and motion discontinuities can be considered as outliers to the modelled struc-
ture. Hence, applying robust constraints to both the EB and ES terms serves to reduce
the impact of local perturbations and prevent flow smoothing across boundaries. In addi-
tion, it gracefully handles motion estimation and segmentation, a difficult “chicken-and-egg”
problem, since motion boundaries can be easily located as flow smoothness outliers [15].
4.1.4 Global Energy with Local Adaptivity
A robust error norm is usually chosen to possess certain desirable properties to suit the
problem at hand. We use the Geman-McClure robust error function [15]
ρ(x, σ) = x²/(x² + σ²)

in both the EB and ES terms for its redescending [15, 106] and normalizing properties. The
first property ensures that the influence of outliers tends to zero. We take errors exceeding

τ = σ/√3,   (4.5)

where the influence function begins to decrease, as outliers [15]. This is equivalent to
identifying pixels with error norm ≥ 0.25 as outliers. The normalization property is desirable
because it makes the degrees of flow smoothness and brightness conservation comparable.
Together with the spatially varying scales, it allows the relative strength of these two terms
to be adaptive: where the observation is not trustworthy (σBs is large), stronger smoothness
is enforced, and vice versa. The scales are gradually learned from the image data, as we will
discuss in Sections 4.2.2 and 4.2.3.

¹ρ(x, σ) does not necessarily represent a proper distribution, like the Gaussian, which is defined on x ∈ (−∞, ∞). But we consider it appropriate in an application to define a reasonable range of expected errors and obtain a pdf by normalization in this range. Figure 4.1(c) shows the pdfs for x ∈ (−5, 5).
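Both properties are easy to verify numerically (a small check, not thesis code):

```python
import math

def rho(x, sigma):
    """Geman-McClure error norm: bounded by 1 (the normalizing property)."""
    return x * x / (x * x + sigma * sigma)

def psi(x, sigma):
    """Influence function rho'(x): rises, peaks at x = sigma/sqrt(3), then
    redescends toward zero, so gross outliers lose their influence."""
    s2 = sigma * sigma
    return 2.0 * x * s2 / (x * x + s2) ** 2

sigma = 1.0
tau = sigma / math.sqrt(3.0)   # outlier threshold of Eq. (4.5)
```

At x = τ the norm evaluates to (σ²/3)/(σ²/3 + σ²) = 0.25, which is exactly the error-norm threshold stated above, and ρ never exceeds 1 no matter how large the error.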
Finally, the complete global energy is expressed as

E(V) = Σs [ ρ(min(|Ib(Vs) − Is|, |If(Vs) − Is|), σBs) + (1/8) Σ_{i∈N8s} ρ(Vs − Vi, σSs) ].   (4.6)
This design extends current robust global formulations [15, 86] in two aspects. First of
all, the three-frame matching error models correspondences even at occlusions and enables
higher accuracy upper bounds, which gradient-based or two-frame methods cannot achieve.
Secondly, the locally adaptive scheme is more reasonable than those taking σB, σS , λ as
fixed global parameters and eases parameter tuning in experiments.
4.2 Optimization
As we have discussed in Section 2.4, the global energy Eq. (4.6) resides in a high-dimensional
space and is nonconvex. Even finding its local minima is not easy in the absence of an explicit
gradient expression. Because no general global optimization technique is known to provide a practical
solution, we take a graduated deterministic approach to the minimization problem. We start
from an initial estimate and progressively minimize a series of finer approximations to the
original energy. In this process, we exploit the advantages of various formulations and
solution techniques for accuracy and efficiency.
Our first approximation is to replace the matching error by the OFCE, which
enables simple gradient evaluation and more efficient exploration of the solution space. This
step needs a good-quality initial estimate to start with. We provide this initial estimate
from a yet cruder approximation, a gradient-based local regression method. This method
is cruder, because the global smoothness is not enforced and estimation is solely based
on local data. After these two steps, we usually have very high accuracy except at motion
boundaries, and then we can directly minimize the original energy to correct residual errors.
We build the process in a coarse-to-fine framework to handle large motions and expedite
convergence. Details of this algorithm are explained below.
4.2.1 Step I: Gradient-Based Local Regression
Suppose a crude flow estimate V0 is available and has been compensated for. Step I uses
the robust gradient-based local regression method that we have developed in Section 3.2
to compute the incremental flow ∆V . Both least-median-of-squares (LMedS) and least-
trimmed-squares (LTS) were tried and yielded similar results, so henceforth our discussion
is based on LMedS. This step generates high-quality initial flow estimates. Its effectiveness
as an independent optical flow estimation approach has been verified in various studies
[5, 97, 145].
4.2.2 Step II: Gradient-Based Global Optimization
∆V0, the incremental flow resulting from Step I, has good accuracy at most places, but its
quality degrades where local constraints become unreliable. We improve its coherence using
a gradient-based global optimization method, which is a better approximation to Eq. (4.6).
The energy to minimize is

E(∆V) = Σs [ ρ(eG(∆Vs), σBs) + (1/8) Σ_{i∈N8s} ρ(Vs + ∆Vs − Vi − ∆Vi, σSs) ]   (4.7)
where eG is the OFCE error (Eq. (2.3)), and Vs is the sth vector of the initial flow V0. The
local scales σBs, σSs are important parameters which control the shape of E and hence the
solution. Below we describe how to estimate them from Step I’s results.
Suppose normal errors are Gaussian variables with zero mean and standard deviation
σ̂; then those exceeding 2.5σ̂ can be considered as outliers. Contrasting this threshold with
Eq. (4.5), we can establish an equivalence between the Geman-McClure scale σ and the
Gaussian standard deviation σ̂ as

σ = 2.5√3 σ̂.

If we have a sample standard deviation σ̂, we may compute σ using the above formula.
At a site s, we calculate σSs as the sample standard deviation of the “inliers” among
Vi − Vs, i ∈ N8s. Inliers are selected using the RLS procedure described in the previous section.
Some σSs values might be very large due to bad flow estimates. We put a cap on them:
1.4826 median_s σSs. For stability, as well as for the estimate to be reasonable, we
also put a lower limit of 0.001 pixels/frame on these values. We calculate σBs as the OFCE
residual, and similarly bound the value to the range [0.01, 1.4826 median_s σBs].
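The scale bookkeeping amounts to a clip against a MAD-style robust cap plus the scale conversion above (a sketch; `bounded_scales` and `gm_scale` are our own names):

```python
import numpy as np

def bounded_scales(sigma_raw, lower):
    """Clip per-site scale estimates into [lower, 1.4826 * median], the
    upper bound being the MAD-style robust estimate over all sites."""
    cap = 1.4826 * np.median(sigma_raw)
    return np.clip(sigma_raw, lower, cap)

def gm_scale(gaussian_std):
    """Geman-McClure scale equivalent to a Gaussian standard deviation:
    sigma = 2.5 * sqrt(3) * sigma_hat."""
    return 2.5 * np.sqrt(3.0) * gaussian_std
```

The cap keeps a few wildly bad flow estimates from inflating the smoothness scale, while the floor prevents a degenerate zero scale.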
Now that the scales are all specified, we minimize the energy using Successive OverRelaxation
(SOR) [20, 14, 102]. Starting with the initial estimate ∆V0, on the nth iteration,
each u component (and v similarly) is updated as

u_s^n = u_s^{n−1} − ω (1/T(u_s)) ∂E/∂u_s^{n−1},

where

T(u_s) = I_x²/σBs² + 8/σSs².
SOR is well known to be good at removing high-frequency errors but very slow at removing
low-frequency errors [137, 23]. In our algorithm, the initial estimate has predominantly high-
frequency errors: it has good accuracy at most places but may lack coherence due to the
local constraints. In such a case, the SOR procedure is very effective and converges fast.
In addition, the update step size is adaptively adjusted by the local scales, which further
improves the efficiency in exploring the solution space.
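One sweep of this update can be sketched as follows. For clarity we use the quadratic (L2) version of the energy, so the derivative terms are linear; the thesis uses the robust norm, which only changes the `dEu`, `dEv` expressions. With ω = 1 this is plain Gauss-Seidel; ω ∈ (1, 2) gives SOR.

```python
import numpy as np

def sor_sweep(u, v, Ix, Iy, It, sigma_B, sigma_S, omega=1.0):
    """One in-place Gauss-Seidel/SOR sweep on a quadratic approximation of
    the global gradient energy; T(u_s) = Ix^2/sigma_B^2 + 8/sigma_S^2."""
    h, w = u.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nb_u = sum(u[y + dy, x + dx] for dy, dx in offs)
            nb_v = sum(v[y + dy, x + dx] for dy, dx in offs)
            sB2, sS2 = sigma_B[y, x] ** 2, sigma_S[y, x] ** 2
            r = Ix[y, x] * u[y, x] + Iy[y, x] * v[y, x] + It[y, x]  # OFCE residual
            dEu = Ix[y, x] * r / sB2 + (8.0 * u[y, x] - nb_u) / sS2
            dEv = Iy[y, x] * r / sB2 + (8.0 * v[y, x] - nb_v) / sS2
            u[y, x] -= omega * dEu / (Ix[y, x] ** 2 / sB2 + 8.0 / sS2)
            v[y, x] -= omega * dEv / (Iy[y, x] ** 2 / sB2 + 8.0 / sS2)
    return u, v
```

With zero image gradients the update drives each interior vector toward its neighbourhood average, which is exactly the high-frequency smoothing behaviour described above.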
4.2.3 Step III: Global Matching
The incremental flow from Step II, ∆V1, and the initial estimate V0 add up to V1, which
still exhibits gross errors at motion boundaries and other places where gradient evaluation
fails. But it is overall a sufficiently precise representation, based on which we are ready to
consider the original formulation Eq. (4.6).
The computation of local scales is similar to that in Step II with a few differences. We
adopt a globally constant matching-error standard deviation σB. It is a bounded robust
estimate from all matching errors σBs: max{0.08, 1.4826 median_s σBs}. The flow vector
standard deviation is kept spatially varying within [0.004, 0.02] pixels/frame.
We minimize the global energy function by greedy propagation. We first calculate the
energy EB(Vs)+ES(Vs) from V1 for all pixels. Then we iteratively visit each pixel, examining
whether a trial estimate from a candidate set results in a lower global energy E. The
candidate set consists of the 8-connected neighbors and their average, which were updated
in the last visit. Once a pixel energy decrease occurs, we accept the candidate and update the
related energy terms. This simple scheme works reasonably well because bad estimates are
confined to narrow areas in the initial flow V1, and it converged quickly in our experiments. Since
each flow estimate Vi affects E only through its own energy and the smoothness energies of
its 8-connected neighbors, the updating is entirely local and can be carried out in parallel [42].
It is worth mentioning that a similar greedy propagation scheme was successfully applied
to solving a global matching stereo formulation in an independent study [129].
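The greedy propagation loop can be sketched as below; `energy_at` is a user-supplied callback (our own name) returning the sum of the energy terms that depend on V_s, which by the locality argument above is all that E changes by when V_s is updated:

```python
import numpy as np

def greedy_propagation(V, energy_at, max_sweeps=10):
    """Visit each interior pixel, try the 8 neighbours and their mean as
    candidate flow vectors, and keep any candidate that lowers the local
    (hence the global) energy.  Stops when a sweep changes nothing."""
    h, w, _ = V.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for _ in range(max_sweeps):
        changed = False
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                nbrs = [V[y + dy, x + dx].copy() for dy, dx in offs]
                candidates = nbrs + [np.mean(nbrs, axis=0)]
                best = energy_at(V, y, x)
                for c in candidates:
                    old = V[y, x].copy()
                    V[y, x] = c
                    e = energy_at(V, y, x)
                    if e < best:
                        best, changed = e, True
                    else:
                        V[y, x] = old
        if not changed:
            break
    return V
```

Used with a smoothness-style local energy, an isolated gross error is quickly replaced by a nearby vector, which is how the narrow bands of boundary errors in V1 are corrected.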
4.2.4 Overall Algorithm
We employ a hierarchical process [12] to cope with large motion and to expedite convergence.
We create a P-level image pyramid I^p, p = 0, …, P − 1, and start the estimation from the
top (coarsest) level P − 1 with a zero initial flow field. At each level p, we warp the image
sequence I^p using the initial flow V0^p, obtaining image frames Iw^p. On Iw^p we initialize the
residual flow using the local gradient method, enhance it using the global gradient method,
and add it to V0^p, yielding V1^p. Then we refine V1^p by applying the global matching method
to I^p, resulting in the final flow estimate on level p, V2^p, which is projected down to level
p − 1 as its initial flow field V0^{p−1}. At last the flow estimate at the original resolution is V2^0.
Operations on each pyramid level are illustrated in Figure 4.2. There is an exception: when
more than one pyramid level is used, we skip Step III on the coarsest level. The rationale
is that gradient-based methods suffice on the coarsest level, since the data are substantially
smoothed and the flow is incremental; applying the matching constraint is usually harmful
due to the smoothing and possible aliasing.
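The per-level control flow can be sketched as a driver loop (our own callback names; warping with the current flow is assumed to happen inside the Step I/II callbacks, and nearest-neighbour upsampling stands in for the bilinear projection used in practice):

```python
import numpy as np

def estimate_flow_pyramid(pyramid, local_gradient, global_gradient,
                          global_matching):
    """Coarse-to-fine driver: at each level run Steps I+II, add the
    increment, refine with Step III (skipped on the coarsest level when
    P > 1), and project the flow down to the next level."""
    P = len(pyramid)                         # pyramid[0] is the finest level
    h, w = pyramid[-1][0].shape
    V = np.zeros((h, w, 2))                  # zero flow at the top level
    for p in range(P - 1, -1, -1):
        frames = pyramid[p]
        dV = local_gradient(frames, V)       # Step I: robust local regression
        dV = global_gradient(frames, V, dV)  # Step II: global gradient (SOR)
        V1 = V + dV
        if p < P - 1 or P == 1:
            V1 = global_matching(frames, V1) # Step III: global matching
        if p > 0:
            # project down one level: upsample by 2 and double the vectors
            V = 2.0 * np.kron(V1, np.ones((2, 2, 1)))
        else:
            V = V1
    return V
```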
Hierarchical schemes have become standard in motion estimation, but their limitations
are often overlooked. The projection and warping operations oversmooth the flow field;
errors in coarser levels are magnified and propagated to finer levels and are generally irre-
versible [14]. These problems are much alleviated by the global matching step—it works
on the original pyramid images and corrects gross errors caused by derivative computation,
projection and warping.
Figure 4.2: System diagram (operations at each pyramid level)
From a practical point of view, the graduated scheme benefits from the merits of all three
popular optical flow approaches and overcomes their limitations. Step I uses a gradient-based
local regression method for high-quality initialization, while leaving local ambiguities to be
resolved later in more global formulations. Step II improves the flow coherence using the
gradient-based global optimization method, which converges fast because of the good ini-
tialization. Step III adopts a matching-based global formulation to correct gross errors
introduced by derivative computation and the hierarchical process. Matching-based for-
mulations have been studied before, but their advantages over gradient-based counterparts
were not apparent due to computational difficulties [88, 77, 14]. We provide, for the first
time, a practical solution and achieve highly competitive accuracy and efficiency.
4.3 Experiments
This section demonstrates the performance of the proposed technique on various synthetic
and real data and makes comparison with previous techniques.
The settings in our algorithm are given below. Optical flow is estimated on the middle
frame of every three frames. No image pre-smoothing is done. Derivatives are calculated
from a first-order spatiotemporal facet model [54] on a support of size 3× 3× 3 [145]. The
constant flow window size in Step I is set to W = 9. Sites at image borders use the valid
part of the window so that the resulting flow field is of the same size as the original image.
In Step II, 20 iterations are used for SOR. The values of the local scale bounds have been
given in Sections 4.2.2 and 4.2.3. The image pyramid is constructed by sub-sampling 3 × 3
Gaussian-smoothed images. Projection expands the flow field by a factor of 2 in each dimension
with bilinear interpolation. Bilinear interpolation is also used for image warping. The above factors are
kept constant in all experiments. The only tuning parameter is the number of pyramid
levels. For each data set, we choose the number of levels to be just large enough for the
gradient-based constraints to hold on the finest level. Larger numbers introduce more errors
due to suppression of fine structures in reduction and smoothing in projection and warping
(Section 4.2.4). Adaptive hierarchical control [9, 14] is an important open problem, which is
not tackled in this work.
Close comparison is given with Black and Anandan’s dense regularization technique
(BA) [14], whose code is publicly available. We modified their code to output floating-point
data. BA calculates flow on the second of two frames. It uses the same number of pyramid
levels as ours, and other parameters are set as suggested in [14]. All experiments are carried
out on a PIII 900MHz PC running Linux. The computing time of our algorithm depends on
the motion complexity of the input data. It is typically close to that of BA. Some sample
CPU time values (in seconds) for our algorithm and BA are: 11.7 and 14.7 (Taxi), 29.5
and 27.4 (Flower Garden), 36.8 and 24.2 (Yosemite). Note that neither algorithm has been
optimized for speed.
4.3.1 Quantitative Measures
Quantitative evaluation can be conducted on data sets with flow groundtruth by reporting
statistics of certain error measures. The most frequently adopted error measures are the
angle and magnitude of the error vector; the first and second order statistics are commonly
reported.
We propose to use e, the absolute u or v error, as the error measure. It is a consistent
and fair measure, since the u, v components and positive and negative errors are treated
symmetrically in optical flow estimation. Also, this 1-D measure is much easier to work with
than 2-D or higher-dimensional measures. In considering what statistics to use, we find
the popular first- and second-order statistics not representative enough for such a highly
skewed e distribution. Therefore we give the empirical cumulative distribution function
(cdf) of e in addition to its mean ē. Better estimates should have cdfs closer to the ideal
unit step function. In order to facilitate comparison with other techniques, we also report
the popular average angular error e∠ [7]. The ē and e∠ values for five synthetic image sequences
are summarized in Table 4.1.
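The measure and its empirical cdf are straightforward to compute (a sketch with our own function name):

```python
import numpy as np

def error_cdf(u_est, v_est, u_true, v_true, grid=None):
    """Empirical cdf of e, the absolute u or v error: u and v errors are
    pooled into one sample so both components and both signs are treated
    symmetrically.  Returns (thresholds, cdf values, mean error)."""
    e = np.abs(np.concatenate([(u_est - u_true).ravel(),
                               (v_est - v_true).ravel()]))
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    cdf = np.array([(e <= t).mean() for t in grid])
    return grid, cdf, float(e.mean())
```

A perfect estimate yields the ideal unit step (the cdf is 1 at every threshold); worse estimates push mass toward larger e and flatten the curve.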
4.3.2 TS: An Illustrative Example
Figure 4.3: TS sequence results. (a) middle frame. The motion boundary is highlighted with a solid white line. In order to provide details near motion boundaries, (b,c,d,f) show flows in the window outlined by the dotted line. (b) groundtruth and our estimate look the same. (c) BA estimate. (d) LS estimate in Step I. (f) Step I result. Step II result looks identical and is hence not shown separately. (g) error cdf curves.

The Translating Squares (TS) sequence (64 × 64, Figure 4.3(a)) was created to examine the
theoretical merits of different approaches. It contains two squares translating at exactly 1
pixel/frame, with the foreground square outlined in solid white.
groundtruth is given for the part near the boundary (marked by white dots). The images
are well textured and noise-free. The motion is small and thus no hierarchical process is
needed. In such an ideal setting, an optimal formulation assuming brightness conservation
and piecewise smoothness should fully recover the flow.
Our method does achieve the performance upper bound. The result is almost perfect;
its vector plot looks the same as the groundtruth (Figure 4.3(b)), the average errors are
negligible (Table 4.1), and the error cdf curve (Figure 4.3(g), curve “S3”) closely resembles
the unit step function.
Figure 4.3(d) shows the flow estimate from the LS initialization in Step I, which can
be considered as an embodiment of the Lucas and Kanade technique [82]. Because of LS’s
zero tolerance to outliers, the flow is completely smoothed out near the motion boundary
(shadowed). Figure 4.3(e) shows the final result of Step I. Replacing LS by LMS dramatically
improves the boundary accuracy, as is also clear from comparing curves “LS” and “S1” in
Figure 4.3(g). This demonstrates the necessity of robustification.
Due to gradient evaluation failure, gross errors still remain at the motion boundary in
Figure 4.3(e). Moreover, the corners are rounded because the background motion becomes
dominant there. These problems are characteristic of robust local gradient techniques [5,
97, 145], and they become more severe as the number of pyramid levels and the constant-flow
window size W increase. Since the TS sequence is well textured and there is no serious
aperture problem, the improvement from the global OFC formulation (Step II) is minimal
(see Figure 4.3(g), curve "S2"). The remaining gross errors at the motion boundary are
unavoidable for gradient-based techniques. They are finally corrected in Step III.
BA yields poor accuracy on this data set (Figure 4.3(c), Table 4.1). The oversmoothing
bias introduced by the LS initialization is not effectively corrected in the continuation
process. The SOR procedure converges very slowly. The suggested 20 iterations [15] do
not seem to be sufficient (see Figure 4.3(g), curve "BA"). Even after 400 iterations (curve
"BA400") the bias persists and the accuracy remains low.
Data   Technique   e∠ (°)    e (pix)
TS     BA          8.04      0.12
       Ours        1.1e-2    2.2e-4
TT     BA          2.60      0.07
       Ours        0.05      9.8e-3
DT     BA          6.36      0.11
       Ours        2.60      0.05
YOS    BA          2.71      0.12
       Ours        1.92      0.08
DTTT   BA          10.9      0.20
       Ours        4.03      0.08

Table 4.1: Quantitative measures
4.3.3 Barron’s Synthetic Data
The Translating Tree (TT), Diverging Tree (DT) and Yosemite (YOS), data sets were
introduced in Section 3.2.2. We use two levels of pyramid for TT and DT, and three levels
for YOS. The cloud part in YOS is excluded from evaluation. As it is consistently observed
from the average error measures (Table 4.1) and the error cdf curves (Figure 4.4), our
method achieves very high accuracy and consistently out-performs BA 2.
Most optical flow papers published after [7] report the average angular error e∠ on YOS.
Some of the results are quoted in Table 4.2. The first group take a dense regularization
approach assuming piecewise constant flow. To our knowledge, our method gives the smallest
error among such techniques. The second group make stronger flow model assumptions, such
as local affine flow or constant flow over a considerable number of frames. These assumptions
are appropriate on the YOS data set and may lead to higher accuracy. The smallest error
on YOS was reported by Farneback [35]. The algorithm couples orientation-tensor-based

²The BA angular error obtained here is different from the one reported by Black and Anandan [15], most probably because their data are different from Barron's and they calculated flow on the 14th frame. We adopt Barron's experimental setup for wider comparability.
[Figure 4.4 panel residue: error cdf plots, e (pixels) vs. cdf(e), legend: Ours, BA; (a) TT, (b) DT, (c) YOS]
Figure 4.4: Error cdf curves.
spatiotemporal filtering with region-competition-based segmentation, and estimates locally
affine motion over 9 frames. Although our method uses only low-level models and 3 frames,
it still compares favorably with these techniques.
DTTT: Motion Discontinuities
The above three data sets, including YOS, contain smooth motions and cannot display
the discontinuity-preserving capability of our method. We synthesize the DTTT sequence
(150 × 150) for this purpose. DTTT was generated from TT, DT and “cookie cutters”:
image data inside the cookie cutters come from TT and those outside come from DT. Its
middle frame with motion boundaries highlighted is given in Figure 4.5(a). We use two
pyramid levels for this set. For images of realistic sizes, vector plots with enough details do
not fit the page. Following [15], we show the horizontal and vertical flow components u, v
as intensity images, with brighter pixels representing larger speeds to the right. We linearly
stretch the image contrast so as to use the full intensity range.
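The display convention can be sketched as follows; the function and values are illustrative, not part of the experimental code.

```python
import numpy as np

def stretch_to_gray(u):
    """Linearly map a flow component to [0, 255] so the display uses
    the full intensity range (brighter pixels = larger values)."""
    lo, hi = float(u.min()), float(u.max())
    if hi == lo:                       # constant image: map to mid-gray
        return np.full(u.shape, 128, dtype=np.uint8)
    return np.round(255.0 * (u - lo) / (hi - lo)).astype(np.uint8)

# A tiny horizontal-flow "image" with speeds from -2 to 2 pixels/frame.
u = np.array([[-2.0, 0.0], [1.0, 2.0]])
print(stretch_to_gray(u))
```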
Our flow estimate (Figure 4.5) has a clear layered look: it exhibits crisp motion
discontinuities and smoothness at other places. Figure 4.5(d) marks motion boundary pixels
in black. They are located as smoothness outliers, i.e., those whose final pixel smoothness
Technique                              e∠ (°)
Ye, Haralick and Shapiro (proposed)    1.92
Sim and Park [117]                     4.13
Black and Anandan [15]                 2.71
Szeliski and Coughlan [127]            2.45
Memin and Perez [86]                   2.34
Black and Jepson [35]                  2.29
Ju, Black and Jepson [71]              2.16
Bab-Hadiashar and Suter [5]            1.97
Farneback [35]                         1.14

Table 4.2: Comparison of various techniques on Yosemite (cloud part excluded) with Barron's angular error measure
energy exceeding 0.25 (see Eq. 4.5). BA's result is oversmoothed with local perturbations:
at many places foreground and background motions invade one another; meanwhile, a number
of false boundaries arise, corresponding to noise or intensity edges erroneously taken as
motion discontinuities. This is also reflected by its boundary map (output from their code),
which has many spurious detections. Note that their discontinuity is 1 pixel thick since
they only mark one of each pair of mutual outliers. Our result has much higher quantitative
accuracy than BA's, as shown in Figure 4.4 and Table 4.1.
However, in our estimate we do notice some gross errors near motion boundaries. For
example, the right corner of the triangle is smoothed into the background. A closer look
reveals that most of these errors happen in textureless regions, where even human viewers
are unable to tell what the actual motion is (aperture problem). In such situations, the
correctness of the "groundtruth" becomes questionable, and so does the authority of quantitative
evaluation based on it. For this reason, together with the simplicity of synthetic data and
error measures, "quantitative" results should be considered qualitative at best. The above
suggests that the inherent ambiguity of optical flow should be considered in quantitative
evaluation: a more convincing evaluation method should allow larger errors in regions of
[Figure 4.5 panel residue: (a) middle frame; (b) our horizontal flow; (c) our vertical flow; (d) our motion boundaries; (e) error cdf curves, e (pixels) vs. cdf(e), legend: Ours, BA; (f) BA horizontal flow; (g) BA vertical flow; (h) BA motion boundaries]
Figure 4.5: DTTT sequence results (motion boundaries highlighted in (a)).
[Figure 4.6 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our smoothness error (ES) map]
Figure 4.6: Taxi results.
less local information.
Also noticeable in our estimate is that our motion boundaries are not as smooth as one
would like. This is partly due to the weakness of the simple optimization method in Step
III. Developing more suitable optimization methods is an important direction in our future
work.
4.3.4 Real Data
In this section we show results on four well-known real image sequences: Taxi, Flower Garden,
Traffic and Pepsi. Taxi and Traffic contain independent motions; motions in the other two
data sets are caused by camera motion and scene depth. For each data set we give the
middle frame, the horizontal flow u from BA and our method, and the smoothness error ES
map from our method.
The Taxi sequence (256 × 190) is obtained from Barron [7]. It mainly contains three
moving cars (from left to right) at image speeds about 3.0, 1.0, 3.0 pixels/frame respectively.
The van on the left has low contrast and surface reflectance. The truck on the right is
fragmented by a tree in front. Difficulties also arise from the low image quality. Optical
flow is estimated on the 9th frame. Two pyramid levels are used. BA's result is almost
smoothed out. Better boundary performance might be obtained by tuning parameters.
But as we have discussed earlier, smoothing seems to be inevitable for BA especially on
data of such diverse motions. Our method yields a reasonable flow estimate and a motion
boundary map. Note that the car regions include shadows which move with the cars at the
same speeds. Motion boundaries inside the truck reflect the motion fragmentation.
Motion in the Flower Garden sequence (360 × 240, from Black) is caused by camera
translation and scene depth. The image speed of the front tree is as large as about 7
pixels/frame. Optical flow is estimated on the 2nd frame. Three pyramid levels are used.
In both BA’s and our results, the motion of the tree twigs smears into the background.
This is another example of inherent flow ambiguity (aperture problem). BA’s estimate has
considerable oversmoothing between layers. Our result shows clear-cut motion boundaries
and smooth flows within each layer. Its accuracy is highly competitive with those from
[Figure 4.7 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our smoothness error (ES) map]
Figure 4.7: Flower garden results.
model- or layer-based techniques [109, 131].
Consistent observations are made on the remaining two data sets. The Traffic sequence
(512× 512, from Nagel) contains eleven moving vehicles with the maximum image speed at
about 6 pixels/frame. Optical flow is estimated on the 8th frame with three pyramid levels.
The motorcycle in the building shadow (upper middle) is missed by BA but picked out by
our method.
The Pepsi sequence (201 × 201) was used by Black [15] to illustrate motion boundary
preservation capability. Like Flower Garden, its motion discontinuities are caused by camera
translation and scene depths. The maximum image speed is about 2 pixels/frame. Optical
flow is estimated on the 3rd frame with three pyramid levels. We exclude a 5-pixel wide
[Figure 4.8 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our smoothness error (ES) map]
Figure 4.8: Traffic results.
[Figure 4.9 panels: (a) middle frame; (b) BA horizontal flow; (c) our horizontal flow; (d) our ES map]
Figure 4.9: Pepsi can results.
border from BA's result to obtain better contrast. The erroneous flow and discontinuity
estimated at the lower-left corner are also caused by poor texture.
4.4 Conclusions and Discussion
This chapter has presented a novel approach to optical flow estimation assuming brightness
conservation and piecewise smoothness. From a Bayesian perspective, we propose a formu-
lation based on three-frame matching and global optimization allowing local variation, and
we solve it under a graduated minimization strategy. Extensive experiments verify that the
new method outperforms its competitors and yields good accuracy on a wide variety of
data.
The contributions of our work to visual motion estimation are summarized as follows.
• We introduced backward-forward matching for optical flow estimation. It avoids prob-
lematic derivative evaluation and models correspondences more faithfully than popular
gradient-based constraints and those ignoring the visibility problem at occlusions.
• We designed the global energy to automatically balance the strength of brightness and
smoothness errors according to local data variation. It is more complete and adaptive
than previous designs containing rigid tuning parameters.
• As a by-product of the robust formulation, motion discontinuities can be reliably
located as flow smoothness outliers.
• We developed a three-step graduated optimization strategy to minimize the resultant
energy. It is the first efficient algorithm yielding good accuracy for a global matching
formulation.
• The solution technique takes advantage of gradient-based local regression, gradient-
based global optimization and matching-based global optimization methods and over-
comes their limitations. The local gradient step provides a high-quality initial flow,
while leaving local ambiguities to be resolved later in more global formulations. The
global gradient step improves the flow coherence and it converges fast because of
the good initialization. The global matching step corrects gross errors introduced by
derivative computation and the hierarchical process.
• We proposed a deterministic algorithm to approximate the high-breakdown robust
estimator in the local gradient step. It can be faster and more accurate than algorithms
based on random sampling.
Many of the above conclusions are also applicable to other low-level visual problems
such as stereo matching, 3D surface reconstruction and image restoration.
As an accurate and efficient low-level approach to visual motion analysis, the new method
has great potential in a wide variety of applications. First of all, it provides a good
starting point for higher-level motion analysis. Our flow estimates already have a layered
look, and the motion boundaries of layers are closed curves. They can reliably initialize
motion segmentation [109], contour-based [86] and layered [131] representation. Model selection [130]
is a crucial problem in automatic scene analysis [16] which is difficult because comparing
a collection of models on the raw image data involves formidable computation. Our re-
sults can ease this task by supplying a higher ground for scene knowledge learning. The
backward-forward matching error, together with detected motion boundaries, can facilitate
occlusion reasoning [16]. It may also guide image warping to avoid smoothing across motion
discontinuities. Some success has been obtained in our preliminary experiments. This is
important for motion estimation as well as for novel view synthesis.
A noticeable problem in our results is that motion boundaries are not very smooth.
This is in part due to the simplicity of our minimization method: it has a limited ability
to generate new flow values, and propagating only among immediate neighbors might be slow
and can get stuck at trivial local minima. For the purpose of global optimization, methods
such as graph cuts [23], which yield very good results in stereo matching, full multigrid
methods [102], Bayesian belief propagation (BBP) [137], and local minimization methods
alternative to SOR [102] are worth studying.
Furthermore, the benefits of the Bayesian framework should be fully exploited. Among
all criteria from which the global energy may arise, we find the Bayesian approach most
appealing, of both theoretical and practical interest. Estimating optical flow from a few images
is inherently ambiguous: areas with more appropriate textures have higher estimation certainty.
This indicates that the nature of the problem is probabilistic rather than deterministic.
Furthermore, the Bayesian formulation may provide a graceful solution to two important
problems: global optimization and confidence estimation [126, 7, 144]. Interesting results
from a global optimization method, Bayesian belief propagation (BBP), have been shown on
a limited domain of vision problems [137]. BBP propagates estimates together with their
covariances. If it converges, it converges in a small number of iterations, with covariances as
a by-product. Confidence measures such as covariances are critical for subsequent applications
to make judicious use of the results. It will be interesting to see whether ideas like BBP are
applicable and beneficial to optical flow estimation.
Chapter 5
MOTION-BASED DETECTION AND TRACKING
This chapter considers an application of visual motion to detecting and tracking point
targets in an airborne video of intensity images. The research has been carried out with
Engineering 2000 Inc. and the Boeing Company, as part of the effort to develop a UAV
See And Avoid System. The greatest difficulty in this project lies in the extremely small
target size. For many purposes of airborne visual surveillance, such as collision avoidance,
targets need to be identified as far away, and hence as early, as possible. This requires
handling targets no more than a few pixels large. Meanwhile, it is common for airborne
video imagery to have low image quality, substantial camera wobble and plenty of background
clutter. How to reliably detect and track point targets irrespective of these distractions is
a very challenging issue that has seldom been dealt with.
The primary cue for detection in aerial surveillance is the motion difference between the
target and the background. This is especially true in our problem, because the tiny target
has almost no other features to separate it from background clutter. Detecting and associating
objects based on brightness patterns [40] easily leads to false matches and tracking failure. A
popular motion-based detection method fits the background motion to a gradient-based
parametric model and takes pixels with large fitting residuals as belonging to potential
targets [12, 63, 101]. Such approaches, unfortunately, only work for objects with
extended spatial support, and do not apply to fast-moving point targets. We develop
a hybrid motion estimation method and a hypothesis test to identify small independently
moving objects. Specifically, for each pair of adjacent frames, we compute the global motion
with a hierarchical model-based method, estimate individual pixel motions by template
matching, and detect object pixels as those for which the two values are statistically significantly
different. The detection threshold is chosen with clear statistical meaning and remains fixed
for all frames.
[Figure 5.1 block diagram: Data → Detector → Target Measurement → Tracker → Target State]
Figure 5.1: A typical detection-tracking system
Tracking has been intensively studied in a wide variety of areas and also as a general
information processing problem [6, 30, 139, 65]. It can become highly involved and error-
prone when dealing with multiple maneuvering targets and low-quality measurements. In
aerial surveillance applications, tracking can be considered relatively easy, since aerial
targets are normally well separated and have predictable dynamics. Single-target tracking
is usually formulated, either explicitly or implicitly, as a Bayesian state estimation problem.
The Kalman filter is a Bayesian tracker under linear/Gaussian assumptions and is the most
widely used tracking technique in practice. We assume the target position, after the global
motion is compensated, conforms to a second order kinematic model, and track it using a
Kalman filter. Gandhi et al. [40] take a similar approach, but they rely on a navigation
system for the global motion parameters. The temporal integration method proposed by
Pless et al. [101] is also similar to a Kalman filter but has more heuristics.
In a typical detection-tracking system, as Fig. 5.1 shows, the data flow between the two
components takes only one direction: from the detector to the tracker. The detector assumes
a uniform prior distribution over the measurement space and makes decisions in a Neyman-Pearson
(NP) mode. This is what the detector starts with when it has no idea of the target
presence. It is also how most existing detection-tracking systems operate [139, 101, 40].
Once an object is detected and a track is formed for it, priors become available to the
detector in the form of a predicted state and its associated covariance. Taking the tracker
feedback into account, the detector can operate in a Bayesian mode. This amounts to
boosting the prior surface near the expected value, or equivalently lowering the NP test
threshold for states consistent with the priors. At minimal computational cost, the Bayesian
detector achieves remarkably lower false-alarm and misdetection rates and higher position
accuracy than the Neyman-Pearson detector. The data flow in the Bayesian system is
bidirectional: the tracker tells the detector where to look for measurements, and the detector
returns what it finds.

[Figure 5.2 block diagram: Image Sequence → Motion-Based Bayesian Detector → Measurement & Covariance → Kalman Filtering Tracker → State & Covariance; the tracker's Prediction feeds a Prior back to the detector]
Figure 5.2: Proposed Bayesian system
The hybrid motion-based detector and the Bayesian detection method are crucial to
high tracking accuracy and efficiency and form the major contributions of our work. In an
experiment on a 1800-frame real video clip, no false targets are detected; the true target is
tracked from the second frame, with position error mean and standard deviation as low as
0.88 and 0.44 pixels respectively.
5.1 Bayesian State Estimation
Bayes’ theorem gives the rule for updating belief in a Hypothesis H (i.e. the probability of
H) given background information (context) I and additional evidence E:
p(H|E, I) = p(E|H, I) p(H|I) / p(E|I).
The posterior probability p(H|E, I) gives the probability of the hypothesis H after consid-
ering the effect of evidence E in context I. The p(H|I) term is the prior probability of H
given I alone; that is, the belief in H before the evidence E is considered. The likelihood
term p(E|H, I) gives the probability of the evidence assuming the hypothesis H and
background information I are true. The denominator, p(E|I), is independent of H, and can be
regarded as a normalizing or scaling constant. The information I is a conjunction of (at
least) all of the other statements relevant to determining p(H|I) and p(E|I). A Bayesian
estimate is optimal in the sense that it is derived from all available information, and no
other inference can do better.
Suppose we want to estimate the state variable xt of a dynamic system at time t given
all the observations yt up to time t. In the Bayesian framework, we need to propagate the
conditional probability p(xt|Yt), where Yt = {yi|i = 1, . . . , t} is the entire set of observations.
We define Xt = {xi|i = 0, 1, . . . , t} as the history of the system state; x0 reflects our prior
knowledge about the state before any evidence is collected.
Applying Bayes' rule, we have

p(xt|Yt) = p(yt|xt, Yt−1) p(xt|Yt−1) / p(yt|Yt−1).
Note that to make an inference at time t, we need to carry the entire history of the state
and observation along. This incurs great modeling and computational difficulties. To keep
the problem manageable, the following three assumptions are commonly made.
• the yt’s are mutually independent: p(yi, yj) = p(yi)p(yj).
• each yt is independent of the dynamic process: p(yt|Xt) = p(yt|xt).
• the system has a Markov property such that any new state is only conditioned on the
immediately preceding state: p(xt|Xt−1) = p(xt|xt−1).
These assumptions are reasonable in our application, and they dramatically simplify
p(xt|Yt) to

p(xt|Yt) = p(yt|xt) p(xt|Yt−1) / p(yt),

where

p(xt|Yt−1) = ∫ p(xt|xt−1) p(xt−1|Yt−1) dxt−1.

Since the denominator p(yt) is not related to xt, we may take it as a scaling factor and
rewrite the first equation as

p(xt|Yt) = Ct p(yt|xt) p(xt|Yt−1).
These equations suggest a recursive procedure to update p(xt|Yt). Once we have
• the measurement model p(yt|xt),
• the system model p(xt|xt−1) and
• the prior model p(x0) = p(x0|x−1),
we may propagate the probability in two phases:

• prediction: p(xt|Yt−1) = ∫ p(xt|xt−1) p(xt−1|Yt−1) dxt−1,
• correction: p(xt|Yt) = Ct p(yt|xt) p(xt|Yt−1).
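The two-phase recursion can be made concrete with a toy histogram (grid) filter over a discretized one-dimensional state; the transition and likelihood values below are hypothetical.

```python
import numpy as np

def predict(belief, transition):
    """Prediction: p(x_t|Y_{t-1}) = sum_j p(x_t|x_{t-1}=j) p(x_{t-1}=j|Y_{t-1})."""
    return transition @ belief

def correct(pred, likelihood):
    """Correction: p(x_t|Y_t) = C_t p(y_t|x_t) p(x_t|Y_{t-1})."""
    post = likelihood * pred
    return post / post.sum()            # C_t is the normalizing constant

# Toy 3-cell world; column j of `transition` is p(x_t = i | x_{t-1} = j).
transition = np.array([[0.1, 0.0, 0.8],
                       [0.8, 0.1, 0.1],
                       [0.1, 0.9, 0.1]])
belief = np.array([1.0, 0.0, 0.0])      # prior p(x_0): certainly in cell 0
likelihood = np.array([0.1, 0.7, 0.2])  # p(y_1 | x_1) for the observed y_1

pred = predict(belief, transition)      # [0.1, 0.8, 0.1]
post = correct(pred, likelihood)        # posterior concentrates on cell 1
```

Here the prediction phase diffuses the belief through the system model, and the correction phase reweights it by the likelihood of the new observation.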
While the equations appear deceptively simple, propagating the conditional probability
density is not easy. In most real systems the three models take complex forms and might
not be expressible analytically. Usually Monte Carlo methods have to be employed to
provide a sample-based representation of the density function. As is well known, such methods can be
very computationally demanding; and when efforts are made to reduce the computational
burden, the resulting density function might not represent the underlying truth faithfully.
Some work along this direction has been done [46, 65].
The general Bayesian approach may have to be pursued in cases of multiple maneuvering
targets with clutter. But for our problem a special case of the approach, the Kalman filter,
is more appropriate.
5.2 Kalman Filter
The Kalman filter is a recursive Bayesian state estimator optimized for linear systems with
Gaussian noise. It is derived from three probabilistic models.
• The prior model: x_0 ∼ N(x̄_0, P_0). Here x̄_0 and P_0 are the mean and covariance of the
state before any observation is made.
• The system model: x_{t+1} = F_t x_t + G_t u_t + w_t. Here F_t rules the linear evolution of
the state variable with time. G_t u_t reflects some control input to the system, which is
taken as a known constant. w_t ∼ N(0, Q_t) is the process noise.
• The measurement model: y_t = H_t x_t + v_t. Here H_t relates the observed measurement
y_t to the underlying true state. v_t ∼ N(0, R_t) is the measurement noise.
The updating process can be summarized as
• Prediction:

  x_t^- = F_{t-1} x_{t-1} + G_{t-1} u_{t-1}
  P_t^- = F_{t-1} P_{t-1} F'_{t-1} + Q_{t-1}.

• Correction:

  P_t = ((P_t^-)^{-1} + H'_t R_t^{-1} H_t)^{-1} = P_t^- - P_t^- H'_t (H_t P_t^- H'_t + R_t)^{-1} H_t P_t^-
  K_t = P_t H'_t R_t^{-1} = P_t^- H'_t (H_t P_t^- H'_t + R_t)^{-1}
  x_t = x_t^- + K_t (y_t - H_t x_t^-)
where P_t is the posterior covariance of x_t, and K_t is the gain matrix. Kalman filtering is
computationally very cheap. The crucial factors for a successful Kalman filter are: (i)
accurate modeling (defining the state variable, system model and measurement model),
including appropriate parameter settings, and (ii) supplying high-quality measurements. Below
we describe the Kalman filter we built for target tracking.
5.3 Tracking
The first step in designing the tracker is to model the dynamic behavior of the target. The
simplest model is the second-order kinematic model, or constant translation model,
p_t = p_{t-1} + v_{t-1},

where p_t = (x_t, y_t)^T is the target position, and v_t = (v_{xt}, v_{yt})^T is the velocity, which
should be constant. In our problem, v_t is not constant. It is the sum of two velocities,
the velocity of the target itself v_t^R and that of the background v_t^G due to the camera
airplane motion. v_t^G is quite random and so is v_t. But the component v_t^R = v_t - v_t^G
remains quite steady over time. It gives us a way of predicting v_t:

v_t = v_{t-1} + v_t^G - v_{t-1}^G.

Here the background motion can be accurately estimated between each pair of frames (Section 5.4),
and is considered as a known control input to the system.
In Kalman filter notation, the tracking problem can be formalized as follows:
State Variable.
θ_t = (p_t^T, v_t^T)^T

where p_t = (x_t, y_t)^T is the centroid position of the target, and v_t = (v_{xt}, v_{yt})^T is its
image velocity.
Prior Model. There are many ways of specifying the prior model. When no prior
knowledge about the target motion is available, a diffuse prior (infinite covariance) is used,
and the Kalman filtering process reduces to a recursive least-squares estimation.
System Model.

θ_t = F θ_{t-1} + u_t + w_t,    (5.1)

where the control input is

u_t = (0, 0, (v_t^G - v_{t-1}^G)')',

the stationary system matrix is

F = | 1 0 1 0 |
    | 0 1 0 1 |
    | 0 0 1 0 |
    | 0 0 0 1 |,

and w_t ∼ N(0, Q). We assume the process noise covariance in Eq. 5.4 to be

Q = ε F P_{t-1} F'.

Multiplying F P_{t-1} F' by (1 + ε) then yields the covariance of the predicted state P_t^-.
This is an ad hoc approach accounting for errors from unmodeled sources, known in
stochastic estimation as exponential aging [126].
Measurement Model. We assume the observed state y_t is a contaminated version of
the true value:

y_t = θ_t + ν_t,    (5.2)

where the noise ν_t ∼ N(0, R_t), and R_t is block-diagonal (position errors are correlated and
velocity errors are correlated). Accurate (y_t, R_t) estimation is crucial to the success of the
filter, and it is the topic of the next two sections.
The optimal solution procedure is given below.

Prediction.

θ_t^- = F θ_{t-1} + u_t    (5.3)
P_t^- = F P_{t-1} F' + Q.

Correction.

K_t = P_t^- (P_t^- + R_t)^{-1}    (5.4)
θ_t = θ_t^- + K_t (y_t - θ_t^-)    (5.5)
P_t = (I - K_t) P_t^-
The numerical instability of the Kalman filter is well known. It arises from the matrix
inversion in evaluating K_t (Eq. 5.6): if (P_t^- + R_t) is poorly conditioned, the computed
value of K_t is dominated by round-off errors. We deal with the problem using a simple method:
adding a tiny positive number ε to the diagonal entries of the posterior covariance P_t.
Currently we set ε to 1% of the smallest diagonal entry. The importance of parameter tuning
cannot be overstated in optimizing the performance of real Kalman filtering systems. But
here we emphasize the theoretical part of the problem, and leave the practical aspects
for future consideration.
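As a sketch (not the exact implementation used in our experiments), one predict/correct cycle with the constant-translation F, the background-motion control input, and the diagonal-ε conditioning fix might look like this; all numerical values are hypothetical.

```python
import numpy as np

# Constant-translation state transition for theta = (x, y, vx, vy)'.
F = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])

def kf_step(theta, P, y, R, vG_t, vG_prev, eps=0.05):
    """One predict/correct cycle of the tracking Kalman filter (Eqs. 5.3-5.5).
    The measurement matrix is the identity, since y_t = theta_t + nu_t."""
    u = np.concatenate([np.zeros(2), vG_t - vG_prev])  # background-motion control input
    Q = eps * F @ P @ F.T                              # exponential-aging process noise
    # Prediction.
    theta_pred = F @ theta + u
    P_pred = F @ P @ F.T + Q
    # Correction.
    K = P_pred @ np.linalg.inv(P_pred + R)
    theta_new = theta_pred + K @ (y - theta_pred)
    P_new = (np.eye(4) - K) @ P_pred
    # Conditioning fix: add 1% of the smallest diagonal entry to the diagonal.
    P_new += 0.01 * P_new.diagonal().min() * np.eye(4)
    return theta_new, P_new

# One cycle: target at (10, 10) moving (1, 0); a noisy measurement nearby.
theta = np.array([10., 10., 1., 0.])
P = np.eye(4)
y = np.array([11.2, 10.1, 1.1, 0.0])
R = 0.25 * np.eye(4)
theta, P = kf_step(theta, P, y, R, np.zeros(2), np.zeros(2))
```

With a diffuse prior one would instead start P with very large diagonal entries; the step then behaves like recursive least squares.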
5.4 Motion-Based Detection
This section addresses the problem of detecting independently moving pixels between two
frames, It and It+1, and measuring the object state. Examples are given on three 100× 100
sample frames (Fig. 5.3). f16502 and f18300 have a target near the center. f19000 has no
target but many ground objects resembling targets by appearance. These data sets are
cropped out of the full data set (Section 5.7).
Candidates as Background Motion Outliers.
The background motion vG is introduced by the relative ground-camera movement. The
ground is well approximated by a planar object. Its image motion conforms to the quadratic
model Eq. 2.5. Putting this model into the optical flow constraint equation Eq. 2.3, we have
a linear constraint Asa = bs at each pixel location s where
a linear constraint A_s a = b_s at each pixel location s, where

A_s = [ I_x  I_y  x I_x  x I_y  y I_x  y I_y  (x^2 I_x + x y I_y)  (x y I_x + y^2 I_y) ],    b_s = -I_t.
We solve this regression problem using least-squares. To reduce the impact of outliers, we
refine the LS estimate under the least-trimmed-squares criterion Eq. 3.3 using the C-step
in the FAST-LTS implementation (Section 3.1.1). The number of equations n is equal to
the number of pixels in each frame. Processing time increases proportionally with n while
estimation accuracy quickly saturates, since the number of unknowns is fixed at 8. Our
experiments show that using 2500 out of the n equations achieves the same accuracy at a
small constant cost. To handle large motion, up to 4 pixels in our client data, we adopt
a hierarchical scheme with two pyramid levels (Section 2.6 and Fig. 2.2). The projection
operation for the planar flow parameter a can be expressed as

a_p^{i-1}(0, 1) = 2 a_p^i(0, 1),
a_p^{i-1}(2, 3, 4, 5) = a_p^i(2, 3, 4, 5),
a_p^{i-1}(6, 7) = a_p^i(6, 7) / 2.
Once the estimate a is available, the velocity v^G(x, y) at any position (x, y) can be calculated
from Eq. 2.5.
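A minimal sketch of the per-pixel constraint row, the LS fit and the level-to-level parameter projection follows; it is illustrative only (the LTS refinement, image-gradient computation and pixel sampling are omitted, and the demo values are synthetic).

```python
import numpy as np

def planar_row(Ix, Iy, It, x, y):
    """One row of the linear system A_s a = b_s for the 8-parameter
    quadratic (planar-scene) flow model at pixel position (x, y)."""
    A = np.array([Ix, Iy, x * Ix, x * Iy, y * Ix, y * Iy,
                  x * x * Ix + x * y * Iy, x * y * Ix + y * y * Iy])
    return A, -It

def fit_planar_motion(Ix, Iy, It, xs, ys):
    """Least-squares estimate of the 8 planar-flow parameters from sampled
    pixels; an LTS refinement (C-step) would follow in the full method."""
    rows, rhs = zip(*(planar_row(ix, iy, it, x, y)
                      for ix, iy, it, x, y in zip(Ix, Iy, It, xs, ys)))
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return a

def project_params(a):
    """Project parameters from pyramid level i to the finer level i-1:
    translational terms double, linear terms stay, quadratic terms halve."""
    a = a.copy()
    a[0:2] *= 2.0
    a[6:8] /= 2.0
    return a

a = np.array([1.0, -0.5, 0.01, 0.0, 0.0, 0.01, 1e-4, -1e-4])
print(project_params(a)[:2])   # translational part doubles to [2.0, -1.0]
```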
Background motion can be very accurately recovered from the above process. Fig. 5.4(a)
shows the frame difference between f16502 and f16503 after the background motion is re-
moved by image warping. The difference is very small except near the target, which moves
independently, and the image border, where warping errors are significant. It is this high
accuracy that allows us to consider v^G as a known control input to the system in the Kalman
filter model (Eq. 5.1).
Figure 5.3: Example data sets. Column 1: first frame, Column 2: second frame, Column 3: frame difference (first frame minus second frame). Row 1: f16502 (target is the white dot near the center), Row 2: f18300 (target near the center), Row 3: f19000 (no target but many target-like ground objects).
Figure 5.4: f16502 target pixel candidates. (a) frame difference after background motion is removed. (b) pixels of large warping errors. Postprocessing: (c) isolated pixels removed. (d) dilated by 3 × 3.
For slow-moving objects of sufficiently large size, the planar model fitting error serves as
a good indicator of independent motion [63, 101]. Another method, which applies to small
objects with considerable image motion, is to warp I_{t+1} towards I_t according to v^G, yielding
a new image I_t^W, and consider pixels with large warping errors |I_t - I_t^W| as potentially
independently moving [40]. We use this method to find candidate pixels. We estimate the
image noise variance σ² by fitting a facet model to the image data (Section 3.1.2). Denoting
by σ_i the standard deviation estimate at the i-th pixel, we set σ = 1.4826 · median_i σ_i. Then we
take pixels with warping errors exceeding 2.5σ as candidates. Two processes further refine
the candidate set: isolated pixels are pruned and the remaining ones are dilated by one pixel
in each direction. Results for f16502 are given in Fig. 5.4. We observe that target pixels
are successfully selected and the number of false alarms is very small. Results for f18300 and
f19000 are given in Fig. 5.5.
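The candidate-selection steps can be sketched as follows; this is a simplified illustration in which the facet-model noise estimation is replaced by given per-pixel σ_i values and the example image is synthetic.

```python
import numpy as np

def box_sum_3x3(mask):
    """Sum of each pixel's 3x3 neighborhood (zero-padded borders)."""
    p = np.pad(mask.astype(int), 1)
    H, W = mask.shape
    return sum(p[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3))

def candidate_mask(warp_err, sigma_i, k=2.5):
    """Select candidate target pixels from the warping error |I_t - I_t^W|.
    sigma_i holds per-pixel noise std estimates (e.g. from a facet-model fit)."""
    sigma = 1.4826 * np.median(sigma_i)        # robust global noise scale
    mask = np.abs(warp_err) > k * sigma
    # Prune isolated pixels: keep only those with a detected 8-neighbor.
    mask &= (box_sum_3x3(mask) - mask) > 0
    # Dilate the survivors by one pixel in each direction.
    return box_sum_3x3(mask) > 0

# Tiny example: a 2-pixel "target" plus one isolated noise spike.
err = np.zeros((7, 7))
err[3, 3] = err[3, 4] = 5.0     # adjacent target pixels survive pruning
err[0, 0] = 5.0                 # isolated spike, pruned
sigma_i = np.full((7, 7), 1.0)
m = candidate_mask(err, sigma_i)
```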
As we see in the above results, this method does locate the target, but it also produces
a considerable number of false detections, especially for f19000 (Fig. 5.5). This is because
a large intensity change can result from strong intensity variation as well as from independent
motion. Feeding such measurements to the tracker imposes a penalty on both tracking accuracy
and response time [40]. Therefore we further exploit motion information to resolve the
ambiguity.
Candidate Pixel Motion and Covariance.
It is important to point out that hierarchical gradient-based methods cannot be extended
to calculating candidate pixel motion because they require information aggregation in a
neighborhood much larger than the spatial support of a point target. Therefore, we calculate
target candidate pixel motions using the matching-based (template matching) method [3,
126], which requires minimal spatial integration.
For each candidate pixel, we take its 3 × 3 neighborhood as the template and find its
best match in a window of size w × w in the next frame It+1. The displacement gives us
a pixel-accuracy solution v0. w should be chosen large enough to encompass the range of
the velocity, but as small as possible to avoid false matches. We set w = 4 according
to the observed maximum image motion. To achieve sub-pixel accuracy, we refine v0 by a
Figure 5.5: f18300 and f19000 target pixel candidates. Left column: f18300. Right column: f19000. Row 1: frame difference after background motion is removed. Row 2: pixels of large warping errors. Row 3: isolated pixels removed.
quadratic fit of the error surface surrounding it. Specifically, we take the matching errors in
the 3 × 3 neighborhood centered at v0 and fit them with the 2D quadratic surface

e = v′Av + b′v + c = a1 + a2 vx + a3 vy + a4 vx² + a5 vx vy + a6 vy²,
find the minimum of the surface and the displacement achieving it as, respectively,
emin = c− b′A−1b/4 = c + b′vmin/2,
vmin = −A−1b/2,
and obtain the final motion estimate as
v = v0 + vmin.
The covariance matrix of v is available as a by-product of the quadratic fitting. It can
be shown [126] that

Σv = (emin/9) A−1.
When either component of vmin exceeds 1 pixel, emin < 0, or a diagonal entry of Σv is
negative, the motion estimator has clearly failed; we then discard the pixel and treat it as
belonging to the background. We also discard the estimate when the larger eigenvalue of
Σv exceeds 0.25, which corresponds to position errors above 0.5 pixels.
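The sub-pixel refinement and its sanity checks can be sketched in a few lines. This is a hypothetical NumPy illustration under the formulae above; the 3 × 3 error patch E centered on the pixel-accuracy match v0 is assumed given.

```python
import numpy as np

def subpixel_refine(E):
    """Sub-pixel refinement from a 3x3 matching-error patch E.

    Returns (v_min, e_min, Sigma_v), or None if the fit is rejected by
    the sanity checks described in the text.
    """
    vy, vx = np.mgrid[-1:2, -1:2]
    vx = vx.ravel().astype(float)
    vy = vy.ravel().astype(float)
    # Design matrix for e = a1 + a2 vx + a3 vy + a4 vx^2 + a5 vx vy + a6 vy^2
    X = np.stack([np.ones(9), vx, vy, vx**2, vx * vy, vy**2], axis=1)
    a = np.linalg.lstsq(X, E.ravel().astype(float), rcond=None)[0]
    A = np.array([[a[3], a[4] / 2], [a[4] / 2, a[5]]])
    b = a[1:3]
    c = a[0]
    v_min = -np.linalg.solve(A, b) / 2          # v_min = -A^{-1} b / 2
    e_min = c + b @ v_min / 2                   # e_min = c + b' v_min / 2
    Sigma_v = (e_min / 9) * np.linalg.inv(A)    # Sigma_v = (e_min / 9) A^{-1}
    # Reject clearly wrong estimates: displacement above 1 pixel, negative
    # minimum error or variances, or positional uncertainty above 0.25
    if (np.any(np.abs(v_min) > 1) or e_min < 0
            or np.any(np.diag(Sigma_v) < 0)
            or np.linalg.eigvalsh(Sigma_v).max() > 0.25):
        return None
    return v_min, e_min, Sigma_v
```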
Independent Motion from χ2 Test.
If a candidate pixel actually belongs to the background, its image motion v should be no
different from the background motion vG. That is, under the null hypothesis H0 : v = vG,
we should have v − vG ∼ N(0, Σ) and the test statistic
T = (v − vG)′Σ−1(v − vG)
conforms to a χ2 distribution with 2 degrees of freedom. An independent motion is detected
when H0 is rejected. We reject H0 at the significance level α = 0.05 which amounts to finding
T > Tα = 5.9915. This test is simple yet very effective. As Fig. 5.6 shows, almost all the
clutter is now gone and the target stands out as the only significant connected pixel set.
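A minimal sketch of this test, assuming the estimates v, vG and the covariance Σ are available for each candidate pixel (Tα = 5.9915 as above):

```python
import numpy as np

def independently_moving(v, v_G, Sigma, T_alpha=5.9915):
    """Chi-square test with 2 degrees of freedom for independent motion
    at one candidate pixel (T_alpha is the alpha = 0.05 cutoff)."""
    d = np.asarray(v, float) - np.asarray(v_G, float)
    Sigma = np.asarray(Sigma, float)
    # T = (v - vG)' Sigma^{-1} (v - vG)
    T = d @ np.linalg.solve(Sigma, d)
    return T > T_alpha
```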
Data Association, Object Detection and State Measurement.
Figure 5.6: Target pixels for f16502, f18300 and f19000. Left column: target pixel candidates. Right column: target pixels detected by the statistical test (small connected pixel sets are considered noise and removed). Row 1: f16502. Row 2: f18300. Row 3: f19000.
Moving pixels are first assigned to existing tracks according to the nearest neighbor rule.
New tracks are formed for residual large connected sets (≥ 5 pixels). Given a cluster of
pixels, we calculate the object velocity and covariance by averaging all the (v, Σ) values,
and the position and its covariance from the sample mean and variance of the pixel
coordinates. This provides accurate (yt, Rt) estimates to the tracker.
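The object-level measurement step can be sketched as follows; a hypothetical NumPy illustration of the averaging described above, with illustrative names.

```python
import numpy as np

def measure_object(positions, velocities, covariances):
    """Object-level measurement (y_t, R_t) from a cluster of detected pixels.

    positions: (n, 2) pixel coordinates; velocities: (n, 2) per-pixel motion
    estimates; covariances: (n, 2, 2) per-pixel motion covariances.
    """
    positions = np.asarray(positions, float)
    # Velocity and its covariance: averages over the cluster
    v = np.mean(velocities, axis=0)
    Sigma_v = np.mean(covariances, axis=0)
    # Position and its covariance: sample mean and scatter of the pixels
    pos = positions.mean(axis=0)
    Sigma_p = np.cov(positions.T)   # 2x2 sample covariance
    return pos, Sigma_p, v, Sigma_v
```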
5.5 Bayesian Detection
The object detector described above assumes every pixel is equally likely to be an object
pixel, and tries to make the distinction solely based on evidence it collects from two adjacent
frames. It is the best we can do when initiating a track with no prior knowledge at that
time. Due to the extremely small object size, detection is still difficult, especially for small
independent motions. Meanwhile, once a track is formed, priors on the object state become
immediately available from the tracker in the form of predicted state distribution. From a
Bayesian point of view, to pursue the optimal detection results we are obliged to exploit
the priors. This section introduces a Bayesian object detector. It is an important feature
of our system; most detectors in previous visual surveillance applications [101, 40]
operate exclusively in the Neyman-Pearson mode.
The prior distribution of the state at time t is exactly the predicted distribution from
time t − 1 by the tracker. Our system is in all respects linear/Gaussian, and hence the
distribution is defined by the mean θt− and covariance Pt− as in Eq. 5.4. The Bayesian
detector utilizes the priors in two phases: (i) augmenting the candidate set by adding pixels
falling into predicted 3σ regions, and (ii) validating/updating the velocity estimates.
Each predicted candidate pixel has two sets of velocity and covariance available: one
from the matching-based motion estimation step and the other from prediction, which we
denote as (v0, Σ0) and (v−, Σ−), respectively. To combine these two pieces of evidence, we
first conduct a consistency test: candidates with pixel motion significantly different from
the prediction are rejected. As summarized below, the χ2 test works in the same way as
the detection of independent motions. Pixels failing the test are considered to have poor
motion estimates and taken as background pixels.
• Null hypothesis: H0 : v0 ∼ N(v−, Σ−).
• Test statistic: T = (v0 − v−)′(Σ−)−1(v0 − v−) ∼ χ22 under H0.
• Reject H0 when T > Tα; (α, Tα) is fixed at (0.05, 5.9915).
For each remaining candidate, we next calculate its posterior motion estimate (v, Σ) using
formulae similar to the correction phase of the Kalman filter [148]:

K = Σ−(Σ− + Σ0)−1
v = v− + K(v0 − v−)
Σ = (I − K)Σ−.
Then the independent motion test is done for both the prior and posterior estimates. Object
pixels are much easier to identify in the Bayesian mode because of the boosted density
distribution, or equivalently the dampened threshold in the χ2 test, around the predicted
values [139].
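The consistency test and posterior fusion above can be sketched together; a hypothetical NumPy illustration in which a None return corresponds to rejecting the candidate as a background pixel.

```python
import numpy as np

def fuse_velocity(v0, Sigma0, v_pred, Sigma_pred, T_alpha=5.9915):
    """Consistency-test and fuse measured and predicted pixel motion.

    (v0, Sigma0): matching-based estimate; (v_pred, Sigma_pred): prediction.
    """
    v0 = np.asarray(v0, float)
    v_pred = np.asarray(v_pred, float)
    d = v0 - v_pred
    # Chi-square (2 dof) consistency test against the prediction
    if d @ np.linalg.solve(np.asarray(Sigma_pred, float), d) > T_alpha:
        return None   # treated as a background pixel
    # Kalman-style correction: K = Sigma- (Sigma- + Sigma0)^{-1}
    K = Sigma_pred @ np.linalg.inv(Sigma_pred + Sigma0)
    v = v_pred + K @ d
    Sigma = (np.eye(2) - K) @ Sigma_pred
    return v, Sigma
```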
Fig. 5.7 illustrates the prior impact on f16503. By comparing the original candidate set
(a) and the augmented one (b, predicted pixels in gray), we see that the position priors help
to locate candidates missed by the motion-based detector. Object pixels detected with no
priors, with position priors only and with full priors are given in (c), (d) and (e) respectively.
Lower misdetection rates are achieved as more priors are incorporated.
(a) (b) (c) (d) (e)
Figure 5.7: Detection results with and without priors on f16503.
5.6 The Algorithm
Note that so far we have been carefully using the word “object” instead of “target” when
talking about detection and tracking. This is because not all objects we track are targets,
only those with consistent dynamic behavior. In our application, we watch any object for
Ni = 25 frames (about 0.8 seconds) and declare it a target only if, during that period, it
never misses a measurement and its position covariance always stays within an allowed range.
Once confirmed, a target is permitted to miss measurements for up to Nt = 5 successive
frames. When a measurement is missing, we use the prediction as the new state, and we
require the position covariance to remain within the allowed range. The range is defined
by the larger eigenvalue of the covariance matrix, whose maximum is set to 2 square
pixels. Tracks corresponding to false objects
and dead targets are terminated.
We record four properties for the track associated with each object: a unique ID id,
the track length (since the first detection) hist, the number of successive frames in which
the object is not measured miss, and the object state Xt. Here we briefly synopsize the
execution of the algorithm on each frame.
1. Prediction. Extrapolate the new state¹ and update its covariance (Eq. 5.4).
2. Detection.
(a) Calculate global motion parameters. Find candidates from both large warping
errors and position priors.
(b) Estimate candidate motion. Further find posterior estimates for predicted can-
didates.
(c) Detect independently moving pixels using χ2 test.
(d) Assign detected pixels to existing tracks; initiate new tracks from unassigned
large connected sets.
¹We do prediction before detection in order to supply priors to the Bayesian detector. However, since predicting target positions needs the new global motion parameters (Eq. 5.1), the prediction process is actually completed in the detection phase.
Figure 5.8: Two sample frames in the video clip: (a) frame 16540, (b) frame 18000 (targets at the center of the 41 × 41 windows).
(e) Measure track states.
3. Correction.
(a) Update object states: if no new measurement is available, use the prediction
and increase miss by 1; otherwise apply the Kalman filter correction (Eq. 5.6) and set miss ← 0.
(b) Remove a target when miss = Nt. Terminate an object track if miss > 0 or
its position covariance is too large.
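The prediction and correction steps follow the standard linear/Gaussian Kalman recursions. A generic sketch is given below; F, H, Q and R are placeholders for the actual system and measurement models of Eqs. 5.1 to 5.6, which are not reproduced in this excerpt.

```python
import numpy as np

def kalman_predict(theta, P, F, Q):
    """Prediction phase: extrapolate the state and its covariance
    (the role of Eq. 5.4; F is the state transition, Q the process noise)."""
    return F @ theta, F @ P @ F.T + Q

def kalman_correct(theta_pred, P_pred, y, R, H):
    """Correction phase with measurement y and measurement noise R
    (the role of Eq. 5.6; H maps state to measurement)."""
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    theta = theta_pred + K @ (y - H @ theta_pred)
    P = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred
    return theta, P
```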
5.7 Experiments
We demonstrate the system performance on a 1800-frame video clip (sample frames in
Figure 5.8) obtained from real flight data. The frame rate is 30 frames/second and the
image size is 256 × 192 pixels. There is one target in the clip. It is about 2 × 1 to 3 × 3
pixels in size, and its maximum image motion is about 5 pixels/frame. Many ground objects
resemble the target in brightness and/or shape. The camera is constantly wobbling and the
image quality is low.
Exp   MD   FA   ITR     min    max    med    mean   sd
I      0    0   0.999   0.05   3.88   0.82   0.88   0.44
II     0    2   0.988   0.12   3.82   0.99   1.06   0.52
III    0    0   0.618   0.14   3.78   0.83   0.83   0.37
Table 5.1: Quantitative measures in 1800 frames. MD: number of missed targets. FA: number of false targets. ITR: in-track rate (target track length divided by 1800). The remaining measures refer to position error vector magnitude statistics (unit: pixel).
All experiments are carried out on a PIII 500MHz PC running Solaris. The current
implementation is not optimized for speed, but it is reasonably fast, spending about 2.3
seconds on each frame including 2 seconds on background motion estimation. The algorithm
has the potential to execute in real-time.
In the 1800-frame clip, a total of 252 objects are detected, and only one of them is
identified as a target. The target is in track from the second frame onward. We mark the target
by placing a 7× 7 white box centered at its estimated position. As can be observed in the
output video [148], the marker encloses the target throughout the sequence.
We managed to locate target centroid positions in 564 frames and used them as the
groundtruth in quantitative evaluation. Table 5.1 gives results from three experiments.
Experiment I shows the proposed method. To illustrate the effectiveness of the Bayesian
detector, we also performed Experiment II in which only position priors are used and Ex-
periment III in which no priors are used.
In both Experiments II and III, the true target is still detected and thus there is no
mis-detection. II has two false alarms due to the absence of the consistency test between
the estimated and predicted motion vectors (Section 5.5). Its in-track rate is slightly lower,
while the localization accuracy degrades severely. Experiment III's error measures are comparable to
Experiment I's. However, its in-track rate suffers a drastic decrease. The target is not confirmed until
23 seconds later and its track is broken twice. This could be unacceptable in time-critical
applications such as collision avoidance.
5.8 Discussion
We have introduced a novel approach to point target detection and tracking in a low-quality
airborne video. We identify objects by the statistical difference between their motions and
the background motion, and track their dynamic behavior in order to detect targets and
update their states. Compared to most previous visual surveillance studies, our method has
four main advantages.
• The hybrid motion-based detector is highly efficient in suppressing background clutter,
locating moving objects and modeling their dynamics. It enables us to employ a simple
Kalman filter for object tracking.
• With priors exploited in detection, false alarm, misdetection and in-track rates receive
significant improvement.
• The extensive use of statistical tests rather than heuristics reduces parameter tuning
to a minimum.
• The Bayesian detection-tracking approach is readily applicable to other data sources
such as UV and RGB images. It allows results from different channels to be easily
integrated to yield more reliable output.
Performance of the proposed technique has been very encouraging in preliminary ex-
periments. More real and synthetic data are needed for further evaluation. Currently the
approach is being integrated into a UAV See And Avoid System jointly developed by
Engineering 2000 Inc. and the Boeing Company.
Chapter 6
CONCLUSIONS
Visual motion is a compelling cue to the structures and dynamics of the world around
us. Its analysis is crucial to many key problems in today’s vision research such as ob-
ject/environment/human modeling, video compression, event analysis and image-based ren-
dering. This dissertation has addressed two fundamental problems in visual motion analysis:
optical flow estimation and motion-based detection and tracking. Two new approaches, ex-
ploiting local and global motion coherence, respectively, have been proposed for estimating
piecewise-smooth optical flow. A video surveillance system has been designed based on mo-
tion cues and Bayesian estimation theory to achieve reliable target detection and tracking.
In the process of developing these techniques, statistical methods have been extensively
used to measure estimation uncertainty, facilitate information fusion and achieve high ro-
bustness. This chapter will summarize the main contributions of the dissertation and point
out some open questions and future work directions.
6.1 Summary and Contributions
A two-stage-robust adaptive scheme for gradient-based local flow estimation.
Gradient-based optical flow estimation techniques consist of two stages: estimating
derivatives and organizing and solving optical flow constraints (OFC). Both stages pool
information in a certain neighborhood and are regression procedures in nature. Least-
squares solutions to the regression problems break down in the presence of outliers such as
motion boundaries. To cope with this problem, a few robust regression tools have been in-
troduced to the OFC solving stage. By carefully analyzing the characteristics of the optical
flow constraints and comparing the strengths and weaknesses of different robust regression
tools, we identified the least-trimmed-squares (LTS) technique as more appropriate for the
OFC stage. As a very similar information pooling step, derivative calculation has seldom
received proper attention in optical flow estimation. Crude derivative estimators are widely
used; as a consequence, robust OFC (one-stage robust) methods still break down near mo-
tion boundaries. Pointing out this limitation, we proposed to calculate derivatives from a
robust facet model. To reduce the computation overhead, we carried out the robust deriva-
tive stage adaptively according to a confidence measure of the flow estimate. Preliminary
experimental results show that the two-stage robust scheme permits correct flow recovery
even at immediate motion boundaries.
A deterministic high-breakdown robust method for visual reconstruction.
High-breakdown criteria are employed in both of the above regression problems. They
have no closed-form solutions and past research has resorted to certain approximation
schemes. So far all applications of high-breakdown robust methods in visual reconstruction
have adopted a random-sampling algorithm—the estimate with the best criterion value is
picked from a random pool of trial estimates. These methods uniformly apply the algorithms
to all pixels in an image disregarding the actual number of outliers, and suffer from heavy
computation as well as unstable accuracy. Taking advantage of the piecewise smoothness
property of the visual field and the selection capability of robust estimators, we proposed a
deterministic adaptive algorithm for high-breakdown local parametric estimation. Starting
from least-squares estimates, we iteratively choose neighbors’ values as trial solutions and
use robust criteria to adapt them to the local constraint. This method provides an estima-
tor whose complexity depends on the actual outlier contamination. It inherits the merits of
both least-squares and robust estimators and results in crisp boundaries as well as smooth
inner surfaces; it is also faster than algorithms based on random sampling.
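As a simplified illustration of this trial-propagation idea, the following hypothetical NumPy sketch smooths a 1D signal under a local constant model with a least-trimmed-squares criterion. The actual algorithm operates on 2D parametric models and starts from least-squares estimates; the window size, trial set, and iteration count below are illustrative choices.

```python
import numpy as np

def lts_cost(value, window):
    """Least-trimmed-squares cost of a constant model: sum of the
    smallest half (plus one) of the squared residuals over the window."""
    r2 = np.sort((window - value) ** 2)
    return r2[: len(r2) // 2 + 1].sum()

def deterministic_lts_smooth(y, half=2, n_iter=5):
    """Deterministic trial propagation: at each pixel, try the current
    estimate and the neighbours' values, keep the one with lowest LTS cost."""
    y = np.asarray(y, float)
    x = y.copy()
    n = len(y)
    for _ in range(n_iter):
        for i in range(n):
            lo, hi = max(0, i - half), min(n, i + half + 1)
            window = y[lo:hi]
            # Trial solutions: current estimate plus the neighbours' estimates
            trials = [x[i]] + [x[j] for j in (i - 1, i + 1) if 0 <= j < n]
            x[i] = min(trials, key=lambda t: lts_cost(t, window))
    return x
```

On a step signal with an outlier, the outlier is rejected by the trimmed criterion while the step edge stays crisp, mirroring the "crisp boundaries, smooth inner surfaces" behavior described above.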
Error analysis on gradient-based local flow.
Due to the intrinsic ambiguity of visual motion and modeling imperfections, an optical
flow estimate generally has spatially varying reliability. In order for subsequent applications
to make judicious use of the results, error statistics of the flow estimate have to be analyzed.
In our earlier work, we conducted error analysis for the least-squares-based local estimation
method using the covariance propagation theory for approximate linear systems and small
errors. In this thesis, we have generalized the results to the newer robust method. Our
analysis estimates image noise and derivative errors in an adaptive fashion, taking into
account correlation of derivative errors at adjacent positions. It is more complete, systematic
and reliable than previous efforts.
Piecewise-smooth optical flow from global matching and graduated optimiza-
tion.
By drawing information from the entire visual field, the global optimization approach
to optical flow estimation is conceptually more effective in handling the aperture problem
and outliers than the local approach. But its actual performance has been somehow dis-
appointing due to formulation defects and solution complexity. On one hand, approximate
formulations are frequently adopted for ease of computation, with the consequence that the
correct flow is unrecoverable even in ideal settings. On the other hand, more sophisticated
formulations typically involve large-scale nonconvex optimization problems, which are so
hard to solve that the practical accuracy might not be competitive with simpler methods.
The global optimization method we developed provides better solutions to both problems.
From a Bayesian perspective, we assume the flow field prior distribution to be a Markov
random field (MRF) and formulate the optimal optical flow as the maximum a posteriori
(MAP) estimate, which is equivalent to the minimum of a robust global energy function.
The novelty in our formulation lies mainly in two aspects. 1) Three-frame matching is
proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2)
The strengths of brightness and smoothness errors in the global energy are automatically
balanced according to local data variation, and consequently parameter tuning is reduced.
These features enable our method to achieve a higher accuracy upper-bound than previous
algorithms.
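The structure of such an objective can be written generically; the precise data and smoothness terms of our formulation are not reproduced in this summary, so ρD, ρS, λ and the neighborhood system N below are placeholders:

```latex
E(\mathbf{v}) \;=\; \sum_{p}\Big[\rho_D\big(I_{t-1}(p-\mathbf{v}_p)-I_t(p)\big)
      + \rho_D\big(I_{t+1}(p+\mathbf{v}_p)-I_t(p)\big)\Big]
   \;+\; \lambda \sum_{(p,q)\in\mathcal{N}} \rho_S\big(\mathbf{v}_p-\mathbf{v}_q\big)
```

The two data terms reflect the three-frame (backward-forward) matching, and the robust functions ρD, ρS limit the influence of occlusions and motion discontinuities.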
In order to solve the resultant energy minimization problem, we developed a hierarchical
three-step graduated optimization strategy. Step I is a high-breakdown robust local gradient
method with a deterministic iterative implementation, which provides a high-quality initial
flow estimate. Step II is a global gradient-based formulation solved by Successive Over-
Relaxation (SOR), which efficiently improves the flow field coherence. Step III minimizes
the original energy by greedy propagation. It corrects gross errors introduced by derivative
evaluation and pyramid operations. In this process, merits are inherited and drawbacks are
largely avoided in all three steps. As a result, high accuracy is obtained both on and off
motion boundaries.
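For reference, the relaxation in Step II has the same structure as generic SOR on a linear system. The toy sketch below illustrates the sweep; the actual solver operates on the much larger coupled flow equations.

```python
import numpy as np

def sor(A, b, omega=1.5, n_iter=200):
    """Successive Over-Relaxation for Ax = b (illustrative toy version)."""
    x = np.zeros_like(b, dtype=float)
    n = len(b)
    for _ in range(n_iter):
        for i in range(n):
            # Gauss-Seidel update, over-relaxed by omega in (0, 2)
            s = A[i] @ x - A[i, i] * x[i]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
    return x
```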
Performance of this technique was demonstrated on a number of homebrew and standard
test data sets. On Barron’s synthetic data, which have become the benchmark since the
publication of [7], this method achieved the best accuracy among all low-level techniques.
Close comparison with the well known Black and Anandan’s dense regularization technique
(BA) [14] showed that in all of our experiments the new method yields uniformly higher
accuracy at a similar computational cost.
A motion-based Bayesian approach to aerial point target detection and tracking.
In a visual surveillance project funded by the Boeing Company, we have investigated an
application of optical flow to airborne target detection and tracking. The greatest difficulty
in this problem lies in the extremely small target size, typically 2 × 1 to 3 × 3 pixels, which
makes results from most previous aerial visual surveillance studies inapplicable. Challenges
also arise from low image quality, substantial camera wobbling and plenty of background
clutter.
The proposed system consists of two components: a moving object detector identifies
objects by the statistical difference between their motions and the background motion, and
a Kalman filter tracks their dynamic behaviors in order to detect targets and update their
states. Both the detector and the tracker operate in a Bayesian mode, and they each benefit
from the other’s accuracy. The system exhibited excellent performance in experiments. On
an 1800-frame real video clip with heavy clutter and one true target (2 × 1 to 3 × 3 pixels
in size), it produced no false targets and tracked the true target from the second frame with
average position error below 1 pixel. This probabilistic approach reduces parameter tuning
to a minimum. It also facilitates data fusion from different information channels.
6.2 Open Questions and Future Work
Recovering optical flow from image sequences is a very challenging problem due to the
intrinsic ambiguity of visual motion, inevitable modeling imperfections, computational dif-
ficulties, and the interweaving of these issues. Although more effective methods have been
presented in this thesis to tackle these difficulties, our exploration is only a beginning; there
are a host of issues worth further investigation and attention.
Optical flow formulation.
The formulation of an optical flow estimation technique determines the accuracy upper
bound of that technique. For example, robust formulations model the noise effect more
realistically than least-squares formulations, and hence their accuracy uniformly surpasses
that of the latter in practice. Years of research effort have been devoted to developing more precise
formulations. A question that naturally arises is: “Is there an optimal formulation?”
Answering the question requires defining the best optical flow, which is largely application-
specific: if the application is 3D structure/dynamics analysis, the best flow should be iden-
tical to the projected velocity field, whereas if the application is motion-compensated video
coding, the best flow does not necessarily coincide with the projected motion as long as it
minimizes the coding cost. Low-level approaches do not aim at any specific application—it
is exactly their goal and strength to measure visual motion in a general setting—and there-
fore the best flow cannot be defined for them. It is usually implied in developing low-level
techniques that the best flow estimate is the projected 2D velocity field. But since the
projected motion is normally unknown, it cannot be used to derive the objective function.
For the above reasons, there is no optimal formulation for low-level approaches.
Progress in refining low-level formulations has been made by identifying more appropri-
ate models for each individual component. In more than two decades’ intensive research,
a large number of formulations have been studied, among them those considered the most
promising are formulations employing global optimization and robust criteria. Our work ad-
vances the state of the art in this direction by introducing three-frame matching to overcome the
visibility problem at occlusions and allowing local variation in the global scheme to reduce
parameter tuning and improve local adaptivity. For further improvement, problems such
as the modeling of the three-frame matching error, the choice of robust estimator and the
learning of parameters will be investigated in our future work. Developing more appropriate
formulations continues to be a significant topic in the field of optical flow estimation.
Energy minimization.
In refining optical flow formulations, computational complexity increases rapidly with
model sophistication. This is especially true for global approaches involving large-scale
nonconvex optimization problems. No practical numerical methods exist for finding the
global optimum; only a local optimum can be found. This fact causes great difficulties in
problem diagnosis: when a technique yields a poor estimate, it can be very hard to tell
whether it is due to formulation weaknesses or to the local-minimum nature of the solution.
Investigating more globally optimal solutions to the large-scale nonconvex problems is
one of our immediate future work directions. Towards this end, methods such as graph cuts
[23], which have yielded good results in stereo matching, full multigrid methods [102], Bayesian
belief propagation (BBP) [137], and local minimization methods alternative to SOR [102] are
worth studying. As many areas in computer vision are converging to energy minimization
formulations, progress in this research is expected to have impacts in a wide context.
Uncertainty analysis.
Systematic uncertainty analysis is a very crucial, yet very difficult problem. We made an
attempt to examine the uncertainty in the local gradient-based flow estimate and demon-
strated its effectiveness in a motion boundary detection application. Although our approach
is one step farther than previous efforts, it is based on propagating small perturbations
through approximately linear systems and breaks down when the estimate quality becomes
too low. How to make a system aware of its own failure remains an open issue. In addi-
tion, almost all previous error analyses were performed for local techniques; how to measure
the uncertainty of a global approach is another open issue.
Performance evaluation and comparison.
The significance of comparative evaluation cannot be overstated: it is necessary to
assess the performance of both established and new algorithms and to gauge the progress in
the field [7, 111]. So far the most popular evaluation method is to measure the difference be-
tween the flow estimate and the projected 2D velocity field. To maintain comparability with
previously published results, we followed the methodology of Barron et al. in this thesis and
reported certain statistics of the difference between our flow estimates and the synthesized
groundtruth. However, as we pointed out in Section 4.3.3, such evaluation methods are
flawed due to the aperture problem; they become problematic in textureless regions, where
the correctness of “groundtruth” becomes questionable and so does the authority of quan-
titative evaluation based on it. For this reason, together with the simplicity of synthetic
data and error measures, “quantitative” results should be considered as qualitative at best.
The above suggests that the inherent ambiguity of the optical flow should be taken into
account in quantitative evaluation—larger errors should be allowed in regions of less local
information. Developing more convincing evaluation methods deserves serious attention.
Bayesian framework.
The benefits of the Bayesian framework should be fully exploited. Among all criteria
from which the global energy may arise, we find the Bayesian approach most appealing in
both theoretic and practical interest. Estimating optical flow from a few images is inherently
ambiguous: areas with more appropriate textures have higher estimation certainty. This
indicates that the nature of the problem is probabilistic instead of deterministic. Further-
more, the Bayesian formulation may provide a graceful solution to two important problems:
global optimization and uncertainty analysis [126, 7, 144]. Interesting results from a global
optimization method, Bayesian belief propagation (BBP), have been shown on a limited do-
main of vision problems [137]. BBP propagates estimates together with their covariances. If
it converges, it converges in a small number of iterations with covariances as a by-product.
It will be interesting to see if ideas like BBP are applicable and beneficial to optical flow
estimation.
Applications.
As an accurate and efficient low-level approach to visual motion analysis, the new method
has great potential in a wide variety of applications. First of all, it provides a good starting
point for higher-level motion analysis. Our flow estimates already exhibit a layered appearance, and
the motion boundaries of the layers are closed curves. They can reliably initialize motion segmen-
tation [109], contour-based [86] and layered [131] representation. Model selection [130] is
a crucial problem in automatic scene analysis [16]; it is difficult because comparing a col-
lection of models on the raw image data involves formidable computation. Our results can
ease this task by supplying a stronger starting point for scene knowledge learning. The backward-
forward matching error, together with detected motion boundaries, can facilitate occlusion
reasoning [16]. It may also guide image warping to avoid smoothing across motion disconti-
nuities. Some success has been obtained in our preliminary experiments. This is important
to motion estimation as well as for novel view synthesis.
Visual reconstruction.
Motion estimation is one of many low-level visual reconstruction problems. Many con-
clusions from our work are also extendable to other low-level visual problems such as stereo
matching, 3D surface reconstruction and image restoration.
BIBLIOGRAPHY
[1] M.D. Abramoff, W. J. Niessen, and M. A. Viergever. Objective quantification of the
motion of soft tissues in the orbit. IEEE Trans. on Medical Imaging, 19(10):986–995,
2000.
[2] G. Adiv. Determining three-dimensional motion and structure from optical flow gen-
erated by several moving objects. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 7(4):384–401, 1985.
[3] P. Anandan. A computational framework and an algorithm for the measurement of
visual motion. International Journal of Computer Vision, 2:283–310, 1989.
[4] S. Ayer, P. Schroeter, and J. Bigun. Segmentation of moving objects by robust motion
parameter estimation over multiple frames. In Proc. European Conf. on Computer
Vision, volume 2, pages 316–327, 1994.
[5] A. Bab-Hadiashar and D. Suter. Robust optical flow estimation. International Journal
of Computer Vision, 29(1):59–77, 1998.
[6] Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press,
1988.
[7] J. L. Barron, S. S. Beauchemin, and D. J. Fleet. Performance of optical flow tech-
niques. International Journal of Computer Vision, 12(1):43–77, 1994.
[8] J.L. Barron. A survey of approaches for determining optic flow, environmental layout
and egomotion. In RBCV-TR, 1984.
[9] R. Battiti, E. Amaldi, and C. Koch. Computing optical flow across multiple scales: An
adaptive coarse-to-fine strategy. International Journal of Computer Vision, 6(2):133–
145, 1991.
[10] S. S. Beauchemin and J. L. Barron. The computation of optical flow. ACM Computing
Surveys, 27(3):433–467, 1995.
[11] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based
motion estimation. In Proc. European Conf. on Computer Vision, pages 237–252,
1992.
[12] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. A three-frame algorithm for esti-
mating two-component image motion. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 14(9):886–896, 1992.
[13] P. J. Besl, J. B. Birch, and L. T. Watson. Robust window operators. In Proc. European
Conf. on Computer Vision, pages 591–600, 1988.
[14] M. J. Black. Robust Incremental Optical Flow. Doctoral dissertation (research report),
Yale Univ., 1992.
[15] M. J. Black and P. Anandan. The robust estimation of multiple motions: paramet-
ric and piecewise-smooth flow fields. Computer Vision and Image Understanding,
63(1):75–104, 1996.
[16] M. J. Black and D. J. Fleet. Probabilistic detection and tracking of motion disconti-
nuities. International Journal of Computer Vision, 38:229–243, 2000.
[17] M. J. Black and A. Jepson. Estimating optical flow in segmented images using variable-
order parametric models with local deformations. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 18(10):972–986, 1996.
[18] M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejec-
tion, and robust statistics with applications in early vision. International Journal of
Computer Vision, 19:57–91, 1996.
[19] M. J. Black, G. Sapiro, D. Marimont, and D. Heeger. Robust anisotropic diffusion.
IEEE Trans. Image Processing: Special issue on Partial Differential Equations and
Geometry Driven Diffusion in Image Processing and Analysis, 7(3):421–432, 1998.
[20] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, Cambridge, MA, 1987.
[21] P. Bouthemy and E. Francois. Motion segmentation and qualitative dynamic
scene analysis from an image sequence. International Journal of Computer Vision,
10(2):157–182, 1993.
[22] P. Bouthemy and J. S. Rivero. A hierarchical likelihood approach for region segmen-
tation according to motion-based criteria. In Proc. International Conf. on Computer
Vision, pages 463–467, 1987.
[23] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via
graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11):1–18,
2001.
[24] M. Brooks, W. Chojnacki, D. Gawley, and A. van den Hengel. What value covari-
ance information in estimating vision parameters? In Proc. International Conf. on
Computer Vision, pages 302–308, 2001.
[25] K. Bubna and C. V. Stewart. Model selection and surface merging in reconstruction
algorithms. In Proc. International Conf. on Computer Vision, pages 895–902, 1998.
[26] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE
Trans. on Communication, 31:532–540, 1983.
[27] C. Fermuller, R. Pless, and Y. Aloimonos. Statistical biases in optical flow. In Proc.
Computer Vision and Pattern Recognition, volume 1, pages 561–566, 1999.
[28] C. Cafforio and F. Rocca. Methods for measuring small displacements of television
images. IEEE Trans. on Information Theory, (5):573–579, 1976.
[29] S. E. Chen and L. Williams. View interpolation for image synthesis. Computer
Graphics, 27(Annual Conference Series):279–288, 1993.
[30] C-Y. Chong, D. Garren, and T.P. Grayson. Ground target tracking: a historical
perspective. In IEEE Proc. Aerospace Conference, volume 3, pages 433–448, 2000.
[31] R. Cipolla, K. Okamoto, and Y. Kuno. Robust structure from motion using motion
parallax. In Proc. International Conf. on Computer Vision, pages 374–382, 1993.
[32] C. Colombo and A. del Bimbo. Generalized bounds for time to collision from first-
order image motion. In Proc. International Conf. on Computer Vision, pages 220–226,
1999.
[33] T. Darrell and A. Pentland. Cooperative robust estimation using layers of support.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5):474–487, 1995.
[34] A.M. Earnshaw and S.D. Blostein. The performance of camera translation direction
estimators from optical flow: Analysis, comparison, and theoretical limits. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 18(9):927–932, 1996.
[35] G. Farneback. Very high accuracy velocity estimation using orientation tensors, para-
metric motion, and simultaneous segmentation of the motion field. In Proc. Interna-
tional Conf. on Computer Vision, volume 1, pages 171–177, 2001.
[36] O. Faugeras. Three-dimensional computer vision: a geometric viewpoint. MIT Press,
1993.
[37] C.L. Fennema and W.B. Thompson. Velocity determination in scenes containing
several moving objects. Computer Graphics and Image Processing, 9:301–315, 1979.
[38] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local
phase information. International Journal of Computer Vision, 5(1):77–104, 1990.
[39] B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills. Recovering motion fields:
An evaluation of eight optical flow algorithms. In Proc. British Machine Vision Conf.,
volume 1, pages 195–204, 1998.
[40] T. Gandhi, M. Yang, R. Kasturi, O. Camps, and L. Coraor. Detection of obstacles
in the flight path of an aircraft. In Proc. Computer Vision and Pattern Recognition,
volume 2, pages 304–311, 2000.
[41] D. Geman and G. Reynolds. Constrained restoration and the recovery of disconti-
nuities. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(3):367–384,
1992.
[42] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence,
6(6):721–741, 1984.
[43] S. Geman and D. E. McClure. Statistical methods for tomographic image reconstruc-
tion. Bull. Int. Statist. Inst., 2(4):5–21, 1987.
[44] S. Ghosal and P. Vanek. A fast scalable algorithm for discontinuous optical flow
estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(2):181–
194, 1996.
[45] A. Giachetti, M. Campani, and V. Torre. The use of optical flow for the autonomous
navigation. In Proc. European Conf. on Computer Vision, pages A:146–151, 1994.
[46] N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to nonlinear/non-
Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2):107–113, 1993.
[47] N. Gupta and L. Kanal. Gradient based motion estimation without computing gra-
dients. International Journal of Computer Vision, 22:81–101, 1997.
[48] K.J. Hanna. Direct multi-resolution estimation of ego-motion and structure from
motion. In Proc. Workshop on Visual Motion, pages 156–162, 1991.
[49] R. M. Haralick. Computer vision theory: the lack thereof. CVGIP, 36:372–386, 1986.
[50] R. M. Haralick, editor. Proc. 1st International Workshop on Robust Computer Vision,
Seattle, WA, Oct. 1990.
[51] R. M. Haralick, editor. Workshop Proc. Performance vs. Methodology in Computer
Vision, Seattle, WA, June 1994.
[52] R. M. Haralick. Propagating covariance in computer vision. International Journal of
Pattern Recognition and Artificial Intelligence, 10(5):561–572, 1996.
[53] R. M. Haralick and J. S. Lee. The facet approach to optic flow. In Proc. Image
Understanding Workshop, pages 84–93, 1983.
[54] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision. Addison-Wesley
publishing company, 1992.
[55] R. M. Haralick and L. Watson. A facet model for image data. CVGIP, 15:113–129,
1981.
[56] D. Heeger. Optical flow using spatio-temporal filters. International Journal of Com-
puter Vision, 1(4):279–302, 1988.
[57] F. Heitz and P. Bouthemy. Multimodal estimation of discontinuous optical flow using
Markov random fields. IEEE Trans. on Pattern Analysis and Machine Intelligence,
15(12):1217–1232, 1993.
[58] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence,
17:185–203, 1981.
[59] Y. Huang, K. Palaniappan, X. Zhuang, and J. E. Cavanaugh. Optical flow field seg-
mentation and motion estimation using a robust genetic partitioning algorithm. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 17(12):1177–1190, 1995.
[60] M. Ioka and M. Kurokawa. Estimation of motion vectors and their application to
scene retrieval. MVA, 7(3):199–208, 1994.
[61] M. Irani. Multi-frame optical flow estimation using subspace constraints. In Proc.
International Conf. on Computer Vision, pages 626–633, 1999.
[62] M. Irani and P. Anandan. A unified approach to moving object detection in 2d and
3d scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(6):577–589,
1998.
[63] M. Irani, B. Rousso, and S. Peleg. Computing occluding and transparent motions.
International Journal of Computer Vision, 12:5–16, 1994.
[64] M. Irani, B. Rousso, and S. Peleg. Recovery of ego-motion using region alignment.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(3):268–272, 1997.
[65] M. Isard and A. Blake. Condensation — conditional density propagation for visual
tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[66] B. Jahne. Motion determination in space-time images. In Proc. European Conf. on
Computer Vision, pages 161–173, 1990.
[67] T. Jebara, A. Azarbayejani, and A. Pentland. 3d structure from 2d motion. IEEE
Signal Processing Magazine, pages 66–84, May 1999.
[68] A. Jepson and M. J. Black. Mixture models for optical flow computation. Tech.
Report, Res. in Biol. and Comp. Vision RBCV-TR-93-44, Univ. of Toronto, 1993.
[69] A. D. Jepson, D. J. Fleet, and T. El-Maraghi. Robust, on-line appearance models for
vision tracking. In Proc. Computer Vision and Pattern Recognition, volume 1, pages
415–422, 2001.
[70] J.-M. Jolion, P. Meer, and S. Bataouche. Robust clustering with applications in com-
puter vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(8):791–
802, 1991.
[71] S. X. Ju, M. J. Black, and A. D. Jepson. Skin and bones: multi-layer, locally affine,
optical flow and regularization with transparency. In Proc. Computer Vision and
Pattern Recognition, pages 307–314, 1996.
[72] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window:
theory and experiment. IEEE Trans. on Pattern Analysis and Machine Intelligence,
16(9):920–932, 1994.
[73] S. B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo.
In Proc. Computer Vision and Pattern Recognition, volume 1, pages 103–110, 2001.
[74] J. K. Kearney, W. B. Thompson, and D. L. Boley. Optical flow estimation: an error
analysis of gradient-based methods with local optimization. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 9(2):229–244, 1987.
[75] V. Koivunen. A robust nonlinear filter for image restoration. IEEE Trans. Image
Processing, 4(5):569–578, 1995.
[76] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion
reasoning. In Proc. European Conf. on Computer Vision, volume 1, pages 189–196,
1994.
[77] J. Konrad and E. Dubois. Bayesian estimation of motion vector fields. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 14(9):910–927, 1992.
[78] R. Kumar, P. Anandan, and K. Hanna. Direct recovery of shape from multiple views:
a parallax based approach. In Proc. International Conf. on Pattern Recognition, pages
685–688, 1994.
[79] J. O. Limb and J. A. Murphy. Estimating the velocity of moving images in television
signals. Computer Graphics and Image Processing, 4:311–327, 1975.
[80] M. I. A. Lourakis, A. A. Argyros, and S. C. Orphanoudakis. Independent 3d motion
detection using residual parallax normal flow fields. In Proc. International Conf. on
Computer Vision, pages 1012–1017, 1998.
[81] M. I. A. Lourakis and S. C. Orphanoudakis. Using planar parallax to estimate the
time-to-contact. In Proc. Computer Vision and Pattern Recognition, pages 640–645,
1999.
[82] B. D. Lucas and T. Kanade. An iterative image-registration technique with an ap-
plication to stereo vision. In Proc. Image Understanding Workshop, pages 121–130,
1981.
[83] D. Marr. On the purpose of low-level vision. In MIT AI Memo, 1974.
[84] L. H. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for es-
timating depth from image sequences. International Journal of Computer Vision,
3:209–236, 1989.
[85] P. Meer, D. Mintz, D. Y. Kim, and A. Rosenfeld. Robust regression methods for
computer vision: a review. International Journal of Computer Vision, 6(1):59–70,
1991.
[86] E. Memin and P. Perez. Dense estimation and object-based segmentation of the optical
flow with robust techniques. IEEE Trans. Image Processing, 7(5):703–719, 1998.
[87] M. Middendorf and H. H. Nagel. Estimation and interpretation of discontinuities in
optical flow fields. In Proc. International Conf. on Computer Vision, pages 178–183,
2001.
[88] D. W. Murray and B. F. Buxton. Scene segmentation from visual motion using global
optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):220–
228, 1987.
[89] H. H. Nagel. On the estimation of optical flow: Relations between different approaches
and some new results. Artificial Intelligence, 33(3):299–324, 1987.
[90] H. H. Nagel. Optical flow estimation and the interaction between measurement errors
at adjacent pixel positions. International Journal of Computer Vision, 15:271–288,
1995.
[91] H. H. Nagel and W. Enkelmann. An investigation of smoothness constraints for
the estimation of displacement vector fields from image sequences. IEEE Trans. on
Pattern Analysis and Machine Intelligence, 8(5):565–593, 1986.
[92] H.H. Nagel and M. Haag. Bias-corrected optical flow estimation for road vehicle
tracking. In Proc. International Conf. on Computer Vision, pages 1006–1011, 1998.
[93] H.H. Nagel, G. Socher, H. Kollnig, and M. Otte. Motion boundary detection in image
sequences by local stochastic tests. In Proc. European Conf. on Computer Vision,
volume 2, pages 305–315, 1994.
[94] P. Nesi, A. D. Bimbo, and D. Ben-Tzvi. A robust algorithm for optical flow estimation.
Computer Vision and Image Understanding, 62(1):59–68, 1995.
[95] L. Ng and V. Solo. Errors-in-variable modelling in optical flow problems. In Proc. Int.
Conf. on Acoustics Speech and Signal Processing, volume 5, pages 2773–2776, 1998.
[96] N. Ohta. Uncertainty models of the gradient constraint for optical flow computation.
IEICE Trans. Info. & Sys., E79-D(7):958–962, 1996.
[97] E. P. Ong and M. Spann. Robust optical flow computation based on least-median-of-
squares. International Journal of Computer Vision, 31(1):51–82, 1999.
[98] M. Otte and H.H. Nagel. Optical flow estimation: Advances and comparisons. In
Proc. European Conf. on Computer Vision, pages 51–60, 1994.
[99] N. Peterfreund. Robust tracking of position and velocity with kalman snakes. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 21(6):564–569, 1999.
[100] D. Piponi. Virtual cinematography in “The Matrix”.
http://www2.parc.com/ops/projects/forum/2000/forum-07-13.html, 2000.
[101] R. Pless, T. Brodsky, and Y. Aloimonos. Detecting independent motion: The statistics
of temporal continuity. IEEE Trans. on Pattern Analysis and Machine Intelligence,
22(8):768–773, 2000.
[102] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes
in C. Cambridge Univ. Press, 2nd edition, 1997.
[103] P. J. Rousseeuw and S. Van Aelst. Positive-breakdown robust methods in computer
vision. Computing Science and Statistics, 31:451–460, 1999.
[104] P. J. Rousseeuw and K. Van Driessen. Computing LTS regression for large data sets.
Tech. report (submitted), Univ. of Antwerp.
[105] P. J. Rousseeuw and M. Hubert. Recent developments in PROGRESS. In Y. Dodge, editor,
L1-Statistical Procedures and Related Topics, volume 31, pages 201–214. Institute of
Mathematical Statistics Lecture Notes-Monograph Series, 1997.
[106] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John
Wiley and Sons, 1987.
[107] Y. Rui and P. Anandan. Segmenting visual actions based on spatio-temporal motion
patterns. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 111–
118, 2000.
[108] H. S. Sawhney. 3d geometry from planar parallax. In Proc. Computer Vision and
Pattern Recognition, pages 929–934, 1994.
[109] H. S. Sawhney and S. Ayer. Compact representations of videos through dominant
and multiple motion estimation. IEEE Trans. on Pattern Analysis and Machine
Intelligence, 18(8):814–830, 1996.
[110] H. S. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3d scenes.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(10):1191–1199, 2000.
[111] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. International Journal of Computer Vision, 47(1):7–42,
2002.
[112] R.R. Schultz, L. Meng, and R.L. Stevenson. Subpixel motion estimation for super-
resolution image sequence enhancement. Journal of Visual Communication and Image
Representation, 9(1):38–50, 1998.
[113] B. G. Schunck. Image flow segmentation and estimation by constraint line clustering.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(10):1010–1027, 1989.
[114] H. Schweitzer. Occam algorithms for computing visual motion. IEEE Trans. on
Pattern Analysis and Machine Intelligence, 17(11):1033–1042, 1995.
[115] A. Shashua and N. Navab. Relative affine structure: theory and application to 3d
reconstruction from perspective views. In Proc. Computer Vision and Pattern Recog-
nition, pages 483–489, 1994.
[116] D. Shulman and J. Herve. Regularization of discontinuous flow fields. In Proc. Work-
shop on Visual Motion, pages 81–85, 1989.
[117] D-G. Sim and R-H. Park. Robust reweighted MAP motion estimation. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 20(4):353–365, 1998.
[118] E. P. Simoncelli. Distributed Analysis and Representation of Visual Motion. Doctoral
dissertation, MIT, 1993.
[119] E.P. Simoncelli, E.H. Adelson, and D. Heeger. Probability distributions of optical
flow. In Proc. Computer Vision and Pattern Recognition, pages 310–315, 1991.
[120] A. Singh. Optic Flow Computation: A Unified Perspective. IEEE Press, 1990.
[121] S. S. Sinha and B. G. Schunck. A two-stage algorithm for discontinuity-preserving
surface reconstruction. IEEE Trans. on Pattern Analysis and Machine Intelligence,
14(1):36–55, 1992.
[122] S. Srinivasan. In Proc. International Conf. on Computer Vision, volume 1.
[123] G. P. Stein and A. Shashua. Model-based brightness constraints: on direct estimation
of structure and motion. IEEE Trans. on Pattern Analysis and Machine Intelligence,
22(9):992–1015, 2000.
[124] C. V. Stewart. Expected performance of robust estimators near discontinuities. In
Proc. International Conf. on Computer Vision, pages 969–974, 1995.
[125] S. Sun, D. Haynor, and Y. Kim. Motion estimation based on optical flow with adaptive
gradients. In Proc. International Conf. on Image Processing, pages 852–855, 2000.
[126] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. Kluwer Academic
Pub., 1989.
[127] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In Proc.
International Conf. on Image Processing, pages 194–201, 1994.
[128] R. Szeliski and H.-Y. Shum. Motion estimation with quadtree splines. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 18(12):1199–1210, 1996.
[129] H. Tao, H.S. Sawhney, and R. Kumar. A global matching framework for stereo com-
putation. In Proc. International Conf. on Computer Vision, pages 532–539, 2001.
[130] P. H. S. Torr. Geometric motion segmentation and model selection. In J. Lasenby
et al., editor, Phil. Trans. Royal Society of London A, pages 1321–1340, 1998.
[131] P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer
extraction from image sequences. In Proc. International Conf. on Computer Vision,
pages 983–990, 1998.
[132] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment –
A modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision
Algorithms: Theory and Practice, LNCS, pages 298–375. Springer Verlag, 2000.
[133] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus, 2nd Edition.
Springer, 1997.
[134] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Trans.
Image Processing, 3(5):625–638, 1994.
[135] J. Weber and J. Malik. Robust computation of optical flow in a multi-scale differential
framework. International Journal of Computer Vision, 14(1):67–81, 1995.
[136] Y. Weiss. Bayesian belief propagation for image understanding. In Workshop on Sta-
tistical and Computational Theories of Vision 1999: Modeling, Learning, Computing,
and Sampling (submitted for publication), 1999.
[137] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical
models of arbitrary topology. Neural Comp., 13(10):2173–2200, 2001.
[138] R. R. Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Academic
Press, 1997.
[139] P. Willett, R. Niu, and Y. Bar-Shalom. Integration of Bayes detection with target
tracking. IEEE Trans. on Signal Processing, 49(1):17–30, 2000.
[140] Y. Xiong and S. A. Shafer. Moment and hypergeometric filters for high precision
computation of focus, stereo and optical flow. International Journal of Computer
Vision, 22(1):25–59, 1997.
[141] M. Ye. Image flow estimation using facet model and covariance propagation. M.S.
Thesis, Univ. of Washington, Seattle, WA, USA, 1999.
[142] M. Ye, M. Bern, and D. Goldberg. Document image matching and annotation lifting.
In Proc. International Conference on Document Analysis and Recognition, pages 753–
760, 2001.
[143] M. Ye and R. M. Haralick. Image flow estimation using facet model and covariance
propagation. In Vision Interface, pages 51–58, 1998.
[144] M. Ye and R. M. Haralick. Image flow estimation using facet model and covariance
propagation. In M. Cheriet and Y.H. Yang, editors, Vision Interface: Real World
Applications of Computer Vision, pages 209–241. World Sci., 2000.
[145] M. Ye and R. M. Haralick. Optical flow from a least-trimmed squares based adaptive
approach. In Proc. International Conf. on Pattern Recognition, pages 1052–1055,
2000.
[146] M. Ye and R. M. Haralick. Two-stage robust optical flow estimation. In Proc. Com-
puter Vision and Pattern Recognition, pages 623–628, 2000.
[147] M. Ye and R. M. Haralick. Local gradient global matching piecewise smooth optical
flow. In Proc. Computer Vision and Pattern Recognition, pages 712–717, 2001.
[148] M. Ye and R. M. Haralick. Point aerial target detection and tracking — a motion-
based bayesian approach. ISL Tech Report, Univ. of Washington, 2001.
[149] M. Ye, R. M. Haralick, and L. G. Shapiro. Estimating optical flow using a global
matching formulation and graduated optimization. In Proc. International Conf. on
Image Processing, Rochester, NY, September 2002. To appear.
[150] K. Zhang, M. Bober, and J. Kittler. Motion based image segmentation for video
coding. In Proc. International Conf. on Image Processing, pages 476–479, 1995.
[151] S. C. Zhu and A. Yuille. Region competition: unifying snakes, region growing, and
Bayes/MDL for multiband image segmentation. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 18(9):884–900, 1996.
VITA
Ming Ye was born in Chengdu, P.R. China, in January 1975. She received her B.S.
degree in Electrical Engineering from the University of Electronic Science and Technology
of China in June 1997. She then joined the Intelligent Systems Laboratory at the University
of Washington as a research assistant, where she obtained her M.S. degree in March 1999
and will receive her Ph.D. degree by December 2002, both in Electrical Engineering. She was
a research intern at the Xerox Palo Alto Research Center during the summer of 2000. Her
research is in the area of computer vision and image processing, with a focus on statistical
and robust approaches to visual motion analysis and applications.