
Generation and Visualization of Large-Scale Three-Dimensional Reconstructions from Underwater Robotic Surveys


Matthew Johnson-Roberson, Oscar Pizarro, Stefan B. Williams, and Ian Mahon
Australian Centre for Field Robotics, University of Sydney, Sydney 2006, New South Wales, Australia
e-mail: [email protected]

Received 30 January 2009; accepted 4 August 2009

Robust, scalable simultaneous localization and mapping (SLAM) algorithms support the successful deployment of robots in real-world applications. In many cases these platforms deliver vast amounts of sensor data from large-scale, unstructured environments. These data may be difficult to interpret by end users without further processing and suitable visualization tools. We present a robust, automated system for large-scale three-dimensional (3D) reconstruction and visualization that takes stereo imagery from an autonomous underwater vehicle (AUV) and SLAM-based vehicle poses to deliver detailed 3D models of the seafloor in the form of textured polygonal meshes. Our system must cope with thousands of images, lighting conditions that create visual seams when texturing, and possible inconsistencies between stereo meshes arising from errors in calibration, triangulation, and navigation. Our approach breaks down the problem into manageable stages by first estimating local structure and then combining these estimates to recover a composite georeferenced structure using SLAM-based vehicle pose estimates. A texture-mapped surface at multiple scales is then generated that is interactively presented to the user through a visualization engine. We adapt established solutions when possible, with an emphasis on quickly delivering approximate yet visually consistent reconstructions on standard computing hardware. This allows scientists on a research cruise to use our system to design follow-up deployments of the AUV and complementary instruments. To date, this system has been tested on several research cruises in Australian waters and has been used to reliably generate and visualize reconstructions for more than 60 dives covering diverse habitats and representing hundreds of linear kilometers of survey. © 2009 Wiley Periodicals, Inc.

Journal of Field Robotics 27(1), 21–51 (2010). Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/rob.20324

Additional Supporting Information (a video displaying the reconstructions generated in this paper) may be found in the online version.

1. INTRODUCTION

As robotic platforms are successfully deployed in scientific (Bajracharya, Maimone, & Helmick, 2008; German et al., 2008), industrial (Durrant-Whyte, 1996; Thrun et al., 2004), defense (Kim & Sukkarieh, 2004), and transportation (Thrun et al., 2006) applications, the ability to visualize and interpret the large amounts of data they can collect has become a pressing problem. High-resolution imaging of the seafloor using robotic systems presents a prime example of this issue. Optical imaging by robots has been used extensively to study hydrothermal vents (Kelley et al., 2005; Yoerger, Jakuba, Bradley, & Bingham, 2007), document ancient and modern wrecks (Ballard et al., 2000; Howland, 1999), characterize benthic habitats (Armstrong et al., 2006; Singh, Eustice, et al., 2004; Webster et al., 2008), and inspect underwater man-made structures (Walter, Hover, & Leonard, 2008). Optical imagery is rich in detail and is easily interpretable by scientists. However, it is often difficult to acquire high-quality georeferenced imagery underwater, given that water strongly attenuates electromagnetic waves [including light and radio frequency (RF) signals] (Duntley, 1963), which forces imaging close to the seafloor and precludes the use of high-bandwidth communications and global positioning system (GPS)–based positioning. Autonomous underwater vehicles (AUVs) can address the requirements for near-bottom, high-resolution imaging in a cost-effective manner. These robotic platforms closely follow rugged seafloor features to acquire well-illuminated imagery over controlled track lines covering hundreds to thousands of linear meters. Operating untethered and away from the surface also minimizes wave-induced motions, resulting in a steady, responsive sensor platform.

Although having thousands of georeferenced images of a site is useful, being able to easily visualize and interact with the imagery and associated structure at scales both larger and smaller than a single image can provide scientists with a powerful data exploration tool, potentially allowing them to observe patterns at scales much larger than that covered by a single image. Such a tool should allow users to quickly build an intuitive understanding of the spatial relationships between substrates, morphology, benthos, and depth. This might then be used to test hypotheses related to the distribution of benthic habitats that could inform further surveys and sampling.

Large-scale visualization underwater requires the creation of composite views through two-dimensional (2D) or three-dimensional (3D) reconstructions. Approaches for 2D mosaicking (Sawhney, Hsu, & Kumar, 1998; Sawhney & Kumar, 1999) are significantly simpler than 3D approaches and are easy to visualize at multiple scales but can produce strong distortions in the presence of 3D relief. In terms of large-scale underwater reconstructions, most mosaicking has been motivated largely by vision-based navigation and station keeping close to the seafloor (Fleisher, Wang, Rock, & Lee, 1996; Gracias & Santos-Victor, 2001; Negahdaripour, Xu, & Jin, 1999; Negahdaripour & Xun, 2002). Additionally, 2D mosaicking with stereo compensation has been explored (Negahdaripour & Firoozfam, 2006). Large-area mosaicking with low overlap under the assumption of planarity is addressed by Pizarro and Singh (2003).

Because AUVs can operate in very rugged terrain, we argue that a sounder approach is to account for 3D structure. In fact, AUV surveys are typically undertaken in environments that feature complex structure, such as reefs, canyons, and trenches, where a 2D seafloor model is not appropriate. The machinery to convert optical imagery into 3D representations of the environment has been studied extensively (Fitzgibbon & Zisserman, 1998; Hartley & Zisserman, 2000), including systems that operate reliably for large-scale environments (Pollefeys, Koch, Vergauwen, & Gool, 2000). Some promising work has gone into 3D image reconstruction underwater (Negahdaripour & Madjidi, 2003) using a stereo rig with high-overlap imagery in a controlled environment or single moving cameras (Nicosevici & Garcia, 2008; Pizarro, Eustice, & Singh, 2004). Underwater stereo 3D reconstruction is shown by Jenkin et al. (2008) and Saez, Hogue, Escolano, and Jenkin (2006) on high-frame-rate dense stereo imagery using simultaneous localization and mapping (SLAM) and energy minimization to produce consistent 3D maps but without explicitly addressing the fast reconstruction and visualization of thousands of images.

Most end-to-end systems for visualizing data collected by robotic systems have focused on reconstructing 3D models of urban environments (Fruh & Zakhor, 2004; Hu, You, & Neumann, 2003). The abundance of man-made structures supports strong priors on structure that result in simple or fast algorithms. One recent, state-of-the-art system uses video-rate imagery and a multiview dense stereo solution with poses derived from high-end navigation instruments (Pollefeys et al., 2008). The mesh fusion stage addresses minor inconsistencies, and the implicit assumption is that the quality of the local data and navigation are sufficient for modeling purposes.

Although there has been much work on outdoor vision-based SLAM (Agrawal, Konolige, & Bolles, 2007; Ho & Jarvis, 2007; Lemaire, Berger, Jung, & Lacroix, 2007; Steder et al., 2007), interactive visualization capabilities tend to be limited or nonexistent, with results being used to validate reconstruction methods rather than to explore and understand the reconstructions. For unstructured scenes, Shlyakhter et al. present some impressive results of 3D tree reconstruction, but these involve human input and operate at relatively small scales (Shlyakhter, Rozenoer, Dorsey, & Teller, 2001).

In this paper we present a robust, automated system for large-scale 3D reconstruction and visualization that combines stereo imagery with self-consistent vehicle poses to deliver dense, texture-mapped 3D terrain reconstructions. This work takes advantage of recent advances in visual SLAM techniques proposed by Eustice, Singh, Leonard, and Walter (2006) and extended by Mahon, Williams, Pizarro, and Johnson-Roberson (2008) that generate consistent estimates of the pose of an AUV during benthic survey missions. The novelty of this work arises from our capacity to process and render tens of thousands of images with sufficient speed to allow end-user interaction with the reconstruction in time to inform further data gathering missions. Our system is geared toward delivering fast, approximate reconstructions that can be used during a research cruise, and examples illustrating the utility of the reconstructions for deployment planning are discussed. Because of the system’s focus on delivering timely results, we also examine robustness issues and several instances requiring trade-offs between performance, accuracy, and the complexity of the reconstructed geometry.

The processing pipeline for our system can be broken down into the following main steps as shown in Figure 1:

1. Data Acquisition and Preprocessing. The stereo imagery is acquired by the AUV. The primary purpose of the preprocessing step is to partially compensate for lighting and wavelength-dependent color absorption. This allows improved feature extraction and matching during the next stage.

2. Stereo Depth Estimation. Extracts 2D feature points from each image pair, robustly proposes correspondences, and determines their 3D position by triangulation. The local 3D point clouds are converted into individual Delaunay triangulated meshes.

3. Mesh Aggregation. Places the individual stereo meshes into a common reference frame using SLAM-based poses and then fuses them into a single mesh using volumetric range image processing (VRIP) (Curless & Levoy, 1996). The total bounding volume is partitioned so that standard volumetric mesh integration techniques operate over multiple smaller problems while minimizing discontinuities between integrated meshes. This stage also produces simplified versions of the mesh to allow for fast visualization at broad scales.

4. Texturing. The polygons of the complete mesh are assigned textures based on the overlapping imagery that projects onto them. Lighting and misregistration artifacts are reduced by separating images into spatial frequency bands that are mixed over greater extents for lower frequencies (Burt & Adelson, 1983).

Figure 1. Processing modules and data flow for the reconstruction and visualization pipeline.

The remainder of this paper is structured around the main steps in this processing pipeline and is organized as follows. Section 2 presents the AUV platform and preprocessing that enable the acquisition of georeferenced stereo imagery. Section 3 presents our approach to generating local structure, and Section 4 describes how these local representations are merged into one consistent and readily viewable mesh. Section 5 details the application of visually consistent textures to the global mesh. Section 6 describes the practical considerations that enable the system to operate on the very large volumes of data collected by the vehicle. Section 7 illustrates the effectiveness of the system using data collected on a number of research cruises around Australia. Finally, Section 8 provides conclusions and discusses ongoing work.

2. DATA ACQUISITION AND PREPROCESSING

2.1. AUV-Based Imaging

The University of Sydney’s Australian Centre for Field Robotics (ACFR) operates an ocean-going AUV called Sirius capable of undertaking high-resolution, georeferenced survey work. Sirius is part of the Integrated Marine Observing System (IMOS) AUV Facility, with funding available on a competitive basis to support its deployment as part of marine studies in Australia. Sirius is a modified version of the SeaBED AUV built at the Woods Hole Oceanographic Institution (Singh, Can, et al., 2004). This class of AUV is designed specifically for near-bottom, high-resolution imaging and is passively stable in pitch and roll. In addition to a stereo camera pair and a multibeam sonar, the submersible is equipped with a full suite of oceanographic sensors (see Figure 2 and Table I). The two 1,360 × 1,024 cameras are configured as a down-looking pair with a baseline of approximately 7 cm and a 42 × 34 deg field of view (FOV), whereas the down-looking multibeam returns can be beamformed to 480 beams in a 120-deg fan across track.

The AUV is typically programmed to maintain an altitude of 2 m above the seabed while traveling at 0.5 m/s (1 kn approx.) during surveys. Missions last up to 5 h with 2-Hz stereo imagery and 5–10-Hz multibeam data, resulting in approximately 40 GB/h of raw imagery, sonar data, and navigation data.

The vehicle navigates using the Doppler velocity log (DVL) measurements of both velocity and altitude relative to the seafloor. Absolute orientation is measured using a magnetoinductive compass and inclinometers, and depth is obtained from a pressure sensor. Absolute position information from a GPS receiver is fused into the position estimate when on the surface. Acoustic observations of the range and bearing from the ship are provided by an ultra short baseline (USBL) tracking system that includes an integrated acoustic modem. USBL observations are communicated to the vehicle over the acoustic link, and the vehicle returns a short status message, including battery charge, estimated position, and mission progress, so that its performance can be monitored while it is underway.

Table I. Summary of the Sirius AUV specifications.

Vehicle
  Depth rating: 800 m
  Size: 2.0 m (L) × 1.5 m (H) × 1.5 m (W)
  Mass: 200 kg
  Maximum speed: 1.0 m/s
  Batteries: 1.5-kWh Li-ion pack
  Propulsion: 3 × 150-W brushless dc thrusters

Navigation
  Attitude + heading: Tilt ±0.5 deg, compass ±2 deg
  Depth: Digiquartz pressure sensor, 0.01%
  Velocity: RDI 1,200-kHz Navigator DVL, ±2 mm/s
  Altitude: RDI Navigator four-beam average
  USBL: TrackLink 1500HA (0.2-m range, 0.25 deg)
  GPS receiver: uBlox TIM-4S

Optical imaging
  Camera: Prosilica 12-bit, 1,360 × 1,024 charge-coupled device stereo pair
  Lighting: 2 × 4-J strobe
  Separation: 0.75 m between camera and lights

Acoustic imaging
  Multibeam sonar: Imagenex DeltaT, 260 kHz
  Obstacle avoidance: Imagenex 852, 675 kHz

Tracking and comms
  Radio: Freewave RF modem/Ethernet
  Acoustic modem: LinkQuest 1500HA integrated modem

Other sensors
  Conductivity and temperature (CT): Seabird 37SBI
  Chlorophyll-A, CDOM, and turbidity: Wetlabs Triplet Ecopuck


Figure 2. (a) The AUV Sirius being retrieved after a mission aboard the R/V Southern Surveyor, (b) layout of internal components, and (c) the imaging configuration of the stereo cameras’ 42 × 34 deg FOV (depicted in dark blue) and multibeam 120 × 0.75 deg FOV (depicted in teal).


2.2. Illumination Compensation

Range- and wavelength-dependent attenuation of light through water implies that the appearance of a scene point will have a strong dependence on the range to the light source(s) and camera. For example, underwater imagery typically has darker edges because of stronger attenuation associated with the viewing angle and longer path lengths (Jaffe, 1990). An image patch being tracked on a moving camera will therefore violate the brightness constancy constraint (BCC) that underlies many standard image matching algorithms.

Lighting compensation for underwater imagery has received some attention as a way of improving the general appearance of imagery or to aid in establishing correspondences between images (Garcia, Nicosevici, & Cufi, 2002). The simplest approaches increase contrast by stretching the histogram of intensities. This can offer some visual improvement over individual images but can result in significant changes in mean over a sequence of images. In the case of nonuniform lighting, the resulting histogram may already be broad, and stretching the whole image histogram may fail to adequately correct for illumination artifacts. Adaptive histogram equalization operates over subregions of the image and can be used to account to some extent for variation of illumination across an image (Zuiderveld, 1994). Homomorphic processing variants decompose images into a low-frequency component assumed to be related to the lighting pattern and invert that field before reassembling the image (Singh, Roman, Pizarro, Eustice, & Can, 2007). These techniques do not, however, enforce consistency across an ensemble of images, which would lead to seams in the texture maps used in our 3D reconstructions.

We have addressed the illumination issue in two ways: optimizing camera strobe configuration and performing postprocessing. In the current configuration the vehicle is programmed to maintain a constant altitude above the seafloor with the cameras pointed downward. The vehicle carries a pair of strobes separated along the length of the frame that are synchronized with the image capture. The fore–aft arrangement of strobes partially cancels shadowing effects while reducing the impact of backscatter in the water column between the cameras and the seafloor. In postprocessing, we construct an approximate model of the resulting lighting pattern by calculating the mean and variance for each pixel position and channel over a representative sample of images. A gain and offset for each pixel position and channel is then calculated to transform the distribution associated with that position and channel to a target distribution with high variance (i.e., contrast) and midrange mean. This is a form of the “gray world” assumption (Barnard, Cardei, & Funt, 2002), in which each pixel position and channel is treated independently and the samples of the world are acquired over many images. More sophisticated versions of this approach could identify multimodal distributions and correct them accordingly. However, we have found this straightforward approach to be sufficient for improving the feature matching process and the consistency of illumination of the resulting images for most situations. An example of a set of sample images and associated lighting pattern can be seen in Figure 3.

Figure 3. Illustration of a stack of more than 5,000 gray-scale images from a mission averaged across each pixel, creating the mean lighting pattern image on the right. (Note that the contrast has been stretched slightly to enhance viewing.)

In addition to improving the visual quality of the images from a user perspective, applying this normalization yields significant improvements in the reliability of feature matching. Feature extraction and description can be made robust or even invariant to some changes in lighting (Burghouts & Geusebroek, 2009). To illustrate the effect of lighting compensation, we apply the stereo matching algorithm described in Section 3 to two pairs of images, as shown in Figure 4. The first has no lighting correction applied to the images prior to matching of stereo features, and the second has had the proposed lighting correction algorithm applied. The results are displayed in Figures 4(c) and 4(d). As can be seen, feature matching performs significantly better when the illumination is corrected for, particularly in the dark corners where the contrast is poor.
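To make the preceding description concrete, the following is a minimal sketch of the per-pixel gain-and-offset normalization (Python/NumPy). The function names and the target mean and variance values are illustrative assumptions, not the parameters used on the vehicle:

```python
import numpy as np

def fit_lighting_model(images):
    """Per-pixel, per-channel mean and standard deviation over a
    representative sample of images (each an HxWx3 float array)."""
    stack = np.stack([np.asarray(im, dtype=np.float64) for im in images])
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def compensate(image, mean, std, target_mean=0.5, target_std=0.2):
    """Map each pixel/channel distribution to a common target distribution
    with midrange mean and high contrast (a per-pixel gain and offset)."""
    gain = target_std / std
    corrected = (np.asarray(image, dtype=np.float64) - mean) * gain + target_mean
    return np.clip(corrected, 0.0, 1.0)
```

Because the gain and offset are fixed per pixel position for the whole mission, the same correction is applied consistently across the ensemble of images, which is what keeps seams out of the downstream texture maps.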

3. STEREO DEPTH ESTIMATION

There is a large body of work dedicated to two-view and multiview stereo (Scharstein & Szeliski, 2002; Seitz, Curless, Diebel, Scharstein, & Szeliski, 2006), but dense stereo results tend to be too complex for our application and for limited overlap can produce incorrect surfaces. Other approaches have examined the use of structure from motion (SFM) (Hartley & Zisserman, 2000; Tomasi & Kanade, 1992) to recover scene structure, utilizing the location, matching, and tracking of feature points over sequences of images to recover the 3D structure of the underlying scene as well as the associated camera poses. SFM’s simple hardware requirements make it popular, but scale is lost if a single camera is used. It is difficult to build a robust system based solely on monocular SFM because it is sensitive to configurations of motions and surfaces that cannot be solved uniquely. SFM modified to rely on resection (Nicosevici & Garcia, 2008) or navigation instruments (Pizarro et al., 2004) has been applied successfully in an underwater context.

Figure 4. (a) Uncorrected image illustrating the lighting pattern induced by the strobes, with darker corners and a bright central region. Stronger attenuation of the red channel also causes the image to appear green (we have applied a constant gain to the three channels to brighten the overall image for easier viewing while preserving the relationship between channels and the lighting falloff toward the edges). (b) The lighting is considerably more consistent in the compensated image. (c) Left image from a stereo pair without lighting correction showing matched Harris corners as circles; green are valid; red have been rejected based on epipolar geometry. (d) Image feature matches with lighting compensation applied. Note the increased number of matches, especially in the corners. A wider range of color, in particular in the red channel, can be seen in the corrected image on the right. The histograms of the red, green, and blue channels of the uncorrected and corrected images appear in (e) and (f), respectively.

3.1. Sparse Feature-Based Stereo

For computational reasons we require a simple representation that captures the coarse structure of the scene. One approach would be to use a dense stereo algorithm and then simplify the resulting depth map using a mesh simplification technique such as quadric error metric simplification (Garland & Heckbert, 1998) that preserves detail at the expense of extra computations. Instead, we have chosen to extract a sparse set of 2D features, robustly triangulate their positions, and then fit a mesh to the 3D points using Delaunay triangulation (Hartley & Zisserman, 2000). By focusing on a sparse set of well-localized points, we expect to minimize gross errors down the pipeline while keeping computational demands low.
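As a small illustration of the last step, the following sketch fits a triangulated mesh to sparse 3D points by running a 2D Delaunay triangulation over their horizontal coordinates; SciPy is used here for convenience and is not necessarily the library used by the authors:

```python
import numpy as np
from scipy.spatial import Delaunay

def mesh_from_sparse_points(points_3d):
    """Fit a triangle mesh to sparse triangulated points: the connectivity is
    taken from a 2D Delaunay triangulation of the horizontal coordinates,
    while each vertex keeps its full 3D position."""
    points_3d = np.asarray(points_3d, dtype=float)   # shape (N, 3)
    tri = Delaunay(points_3d[:, :2])                 # triangulate over x, y
    return points_3d, tri.simplices                  # vertices, triangle indices
```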

The choice of feature in a correspondence-based stereo method is heavily dependent on the camera geometry. The cameras on the AUV are in a small baseline configuration, resulting in negligible change in scale of corresponding features and close to complete overlap between left and right frames. This means that the majority of pixels in one image frame should have matches in the corresponding view except where portions of the surface are occluded. The downside of the small baseline configuration is an increased uncertainty in depth. An overview of the feature matching process is as follows (a minimal code sketch follows the list):

1. Feature points are extracted from the left-side source imagery using a Harris corner detector (Harris & Stephens, 1988).

2. Correspondences are proposed using a Lucas–Kanade tracker (Lucas & Kanade, 1981), seeded into right-side images by intersecting the associated epipolar line with a plane at the altitude given by a sonar altimeter (the DVL).

3. Proposed matches that are not consistent with the epipolar geometry derived from stereo calibration, i.e., outliers, are then rejected.

4. Remaining feature points can then be triangulated using the midpoint method (Hartley & Zisserman, 2000).
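The sketch below condenses steps 1–4 using OpenCV as a stand-in implementation. The detector and tracker calls, the pixel threshold for epipolar rejection, and the use of OpenCV’s linear triangulation (rather than the midpoint method) are assumptions for illustration; the altitude-based seeding of the tracker is omitted:

```python
import cv2
import numpy as np

def sparse_stereo(left, right, F, P_left, P_right, max_feats=800):
    """Steps 1-4: Harris corners, LK correspondence proposal, epipolar
    rejection, triangulation. F is the fundamental matrix and P_* the 3x4
    projection matrices from the stereo calibration."""
    gray_l = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

    # 1. Harris corners in the left image.
    pts_l = cv2.goodFeaturesToTrack(gray_l, max_feats, qualityLevel=0.01,
                                    minDistance=7, useHarrisDetector=True)

    # 2. Propose correspondences in the right image with a Lucas-Kanade tracker.
    pts_r, status, _ = cv2.calcOpticalFlowPyrLK(gray_l, gray_r, pts_l, None)
    pts_l = pts_l[status.ravel() == 1].reshape(-1, 2)
    pts_r = pts_r[status.ravel() == 1].reshape(-1, 2)

    # 3. Reject matches inconsistent with the calibrated epipolar geometry
    #    (distance of the right point to the epipolar line of the left point).
    ones = np.ones((len(pts_l), 1))
    lines = (F @ np.hstack([pts_l, ones]).T).T
    d = np.abs(np.sum(lines * np.hstack([pts_r, ones]), axis=1))
    d /= np.linalg.norm(lines[:, :2], axis=1)
    inliers = d < 1.5                       # pixel threshold (assumed)

    # 4. Triangulate surviving matches (linear method; the paper uses midpoint).
    X = cv2.triangulatePoints(P_left, P_right,
                              pts_l[inliers].T, pts_r[inliers].T)
    return (X[:3] / X[3]).T                 # Nx3 points in the camera frame
```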

An example of Harris corners that have been matched from the left to right frame of a sample stereo pair can be seen in Figures 4(c) and 4(d). The red points correspond to rejected associations based on the constraints of epipolar geometry, and the green points represent features that have been successfully triangulated. This example illustrates the distribution of features typically recovered with our imagery. In most cases it is possible to recover on the order of 2,000–3,000 triangulated points per image pair with the camera geometry and distance to the seabed used by our system.

The density of features extracted is a crucial consideration in feature-based stereo. Too many features will result in complex meshes with large memory requirements, particularly when dealing with thousands of other images, and too few features will result in loss of detail in the relief and more ghosting when reprojecting images onto the oversimplified mesh. Figure 5 illustrates the change in the quality of the scene reconstruction when using various numbers of features. As the number of features decreases, the model is less able to capture variation in the terrain, and differences in the estimated depth of the scene points become more pronounced, particularly around areas of relatively high relief. A set of 100 randomly selected images were triangulated using between 200 and 1,700 features relative to a benchmark reconstruction using 2,000 points. The objective was to determine whether the effect of the sparsity of points is apparent across a number of sample images. As shown in Figure 6, there is an approximately linear relationship between the number of features and the error induced. Additionally there is a linear relationship in computation time, which increases with the number of features. In practice, 800 features strikes a balance between computation time and the quality of the output mesh that has been sufficient in our experience. If there were less emphasis on the computation time of the stereo calculation, a larger number of features could of course be triangulated and then simplified to the desired level, again at a higher computational cost than directly selecting the desired number of features to triangulate. Alternatively, we could extract dense stereo meshes and simplify them to the desired level of detail (LOD) in a fashion similar to that described in Section 4.3. We are currently investigating these approaches to characterize their performance.

3.2. Reconstruction Accuracy

The accuracy of the stereo camera triangulation is difficult to determine for general underwater imagery, as ground truth is not available for the natural underwater scenes we image. We present the results for the estimation of the corners of the checkerboard target used to calibrate our stereo rig. This calibration is undertaken in a pool prior to deployment of the vehicle. Figure 7 shows that the system is successfully able to estimate the positions of the corners on our calibration target, with a maximum error in the z position on the order of 2 cm. Although these results are generated using an ideal corner feature, this suggests that the calibration of our camera is of reasonable quality for the purposes of imaging the seafloor.

A complementary approach that would be applicable in the field would be to use multibeam sonar data to assist in the calibration or validation of the stereo system. We have performed a preliminary comparison between the 3D surface generated from the stereo imagery and the multibeam sonar. Although the results are in general agreement, we are still characterizing the performance of our sonar and as such we cannot draw strong conclusions from the comparison. Some of the issues we are addressing are the calibration of the camera–sonar offsets, the consistency of the sonar beam forming, and outlier rejection.

4. MESH GENERATION

The mesh generation stage transforms the individual stereo meshes into a common reference frame and integrates them into a single approximate mesh. This is a necessary step in the process of generating a model because the separate stereo meshes can have errors in the estimated structure and in georeferencing. There are also redundant data in the overlapping meshes that may fill holes resulting from occlusion or poor viewing geometry. It is therefore desirable to integrate several aligned meshes into a single global model.


Figure 5. An example of a 3D mesh and the effect of extracting a decreasing number of features on the surface geometry. As the mesh complexity decreases, greater differences in height are seen as the mesh is no longer able to model variability in terrain.



Figure 6. RMS error in reconstructed meshes vs. the number of extracted features. The results are generated using 100 different feature-based meshes. A 2,000-feature mesh is used as a reference in each case, and errors in height are calculated for meshes with between 400 and 2,000 vertices. The solid line (circles) represents the mean error in height, and the dotted lines are one sigma above and below the mean. Additionally, the time required to extract and triangulate the features is plotted in squares on the same graph. The relationship between the number of features and the error is near linear, as is the relationship between the number of features and computation time. On the basis of both these observations, the user may safely tune the desired number of features depending on time requirements. We typically use 800 features as a compromise between mesh quality and computational cost for constructing the local stereo meshes.

This stage is also responsible for generating multiple decimated meshes to allow for use in a level-of-detail system.

4.1. Georeferencing Stereo Meshes

An estimate of the vehicle’s trajectory is required to place all data collected by it into a common reference frame. Navigation underwater is a challenging problem because absolute position observations such as those provided by GPS are not readily available. Acoustic positioning systems (Yoerger et al., 2007) can provide absolute positioning but typically at lower precision and update rates than observations from environmental instruments onboard the AUV (i.e., cameras and sonars). Using a naive approach, the mismatch between navigation and sensor precision results in “blurred” maps. A more sophisticated approach uses the environment to aid in the navigation process and ensure poses that are consistent with observations of the environment. SLAM is the process of concurrently building a map of the environment and using this map to obtain estimates of the location of the vehicle using its onboard sensors. The SLAM problem has seen considerable interest from the mobile robotics community as a tool to enable fully autonomous navigation (Dissanayake, Newman, Clark, Durrant-Whyte, & Csorba, 2001; Durrant-Whyte & Bailey, 2006). Earlier work at the ACFR demonstrated SLAM machinery in an underwater setting (Williams & Mahon, 2004). Work at the Woods Hole Oceanographic Institution has also examined the application of SLAM (Eustice et al., 2006; Roman & Singh, 2007) and SFM (Pizarro et al., 2004) methods to data collected by remotely operated vehicles (ROVs) and AUVs.

To provide self-consistent, georeferenced reconstructions, the imagery and navigation data acquired by our AUV are processed by an efficient SLAM system to estimate the vehicle state and trajectory (Mahon, 2008; Mahon et al., 2008). Our approach extends the visual augmented navigation (VAN) methods proposed by Eustice et al. (2006). This technique uses an extended information filter (EIF) to estimate the current vehicle state along with a selection of past vehicle poses, typically the poses at the instant a stereo pair was acquired. An appealing property of this technique is that it does not rely explicitly on features being maintained within the filter framework, sidestepping the issue of deciding which features are likely to be revisited and used for loop-closure observations.

Figure 7. (a) Triangulation of calibration board corners from 24 views. Camera intrinsic and extrinsic parameters were estimated. Ellipses represent the three-sigma covariance of the triangulations. The points are the actual positions of the corners. (b) Histograms of triangulation errors in x, y, z for all 48 corners in 24 stereo views. Consistent with a narrow-baseline configuration, the highest variability is along z (depth away from the camera). The baseline is approximately parallel to the x axis, resulting in triangulation with higher uncertainty in x (along epipolar lines) than y.

The information matrix for a view-based SLAM problem is exactly sparse, resulting in a significant reduction in the computational complexity of maintaining the correlations in the state estimates when compared with dense covariance or information matrices caused by marginalizing past vehicle poses. Recovering the state estimates is required in the EIF prediction, observation, and update operations, whereas state covariances are required for data association or loop-closure hypothesis generation. Efficient state estimate and covariance recovery is performed using a modified Cholesky factorization to maintain a factor of the VAN information matrix (Mahon et al., 2008).
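In information form the state mean x satisfies Λx = η, where Λ is the (sparse) information matrix and η the information vector. The sketch below recovers the mean, and a single covariance column, with an off-the-shelf sparse factorization; SciPy’s sparse LU is used here as a stand-in for the modified Cholesky factorization described above:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def recover_state(information_matrix, information_vector):
    """Recover the state mean by solving Lambda x = eta with a sparse
    factorization (sparse LU here; a sparse Cholesky factor would further
    exploit the symmetry of the information matrix)."""
    factor = splu(csc_matrix(information_matrix))
    return factor, factor.solve(np.asarray(information_vector, dtype=float))

def covariance_column(factor, i, n):
    """Column i of the covariance (Lambda^-1 e_i), as needed for data
    association and loop-closure hypothesis generation."""
    e = np.zeros(n)
    e[i] = 1.0
    return factor.solve(e)
```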

4.1.1. Loop Closures: Wide-Baseline Matching

Visual feature extraction and matching is an expensive process relative to the entire pipeline. Therefore it is important to be able to evaluate whether a pair of poses are likely candidates for a loop closure. This evaluation will be run many times on a large number of candidate poses and therefore must be efficient. A simplified sensor, vehicle, and terrain model is used to assess the likelihood of overlap between stereo pairs. The terrain is assumed to be planar, the vehicle’s pitch and roll are assumed to be zero, and the FOV of the vehicle is treated as a cone. Using this model, the altitude and XY position of the vehicle define the overlap. As the vehicle’s position is uncertain in the VAN framework, the likelihood of image overlap is calculated by integrating the probability distribution of the 2D separation of the poses in question. This conservative test allows a large number of potential loop-closure candidates to be rejected without performing the feature extraction.
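One simple way to realize this test, under the stated simplifications, is a Monte Carlo integration of the 2D pose-separation distribution against a circular footprint; the sketch below is illustrative rather than the authors’ exact formulation:

```python
import numpy as np

def overlap_probability(rel_xy_mean, rel_xy_cov, altitude, half_fov_rad,
                        n_samples=1000, rng=None):
    """Probability that two image footprints overlap, treating each FOV as a
    cone over flat terrain: the footprints overlap when the horizontal pose
    separation is less than twice the footprint radius."""
    rng = np.random.default_rng() if rng is None else rng
    footprint_radius = altitude * np.tan(half_fov_rad)
    samples = rng.multivariate_normal(rel_xy_mean, rel_xy_cov, size=n_samples)
    return float(np.mean(np.linalg.norm(samples, axis=1) < 2.0 * footprint_radius))

# Pose pairs whose overlap probability exceeds a threshold are passed on to
# the much more expensive wide-baseline feature matching stage.
```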

Wide-baseline feature extraction and matching is a well-studied field, and several algorithms robust to changes in scale and rotation are now available. A number of techniques have been proposed to improve the speed of such techniques to make them applicable for real-time systems. Speeded-up robust features (SURF) (Bay, Ess, Tuytelaars, & Gool, 2008) is a wavelet-based extension (primarily for speed) of the popular scale-invariant feature transform (SIFT) algorithm (Lowe, 2004). When used after lighting correction, both SURF and SIFT features have been successfully used to identify loop-closure observations in our AUV benthic survey imagery. An example of such a loop closure is presented later in Figure 10(c).
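A minimal sketch of proposing loop-closure correspondences with SIFT descriptors and a ratio test is shown below (OpenCV; the ratio threshold is an assumed value):

```python
import cv2

def loop_closure_matches(img_a, img_b, ratio=0.8):
    """Propose correspondences between two loop-closure candidate images
    using SIFT keypoints and Lowe's ratio test."""
    sift = cv2.SIFT_create()
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(desc_a, desc_b, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    return [(kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt) for m in good]
```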

Recently there has been research into utilizing graphics processing units (GPUs) to provide additional speed improvements to SURF and SIFT (Cornelis & Gool, 2008; Sinha, Frahm, Pollefeys, & Genc, 2007). We have explored the benefits of GPU-based and multithreaded feature extraction to further increase the speed of this step in the pipeline. With this selection of tools, we can generate a fast SLAM loop-closure system on a variety of platforms. A comparison of the performances of the various systems is beyond the scope of this paper; however, all afford speedups over the single-threaded non-GPU solutions. This implies that missions can be renavigated on the order of tens of minutes, making a real-time SLAM system feasible.

4.1.2. Stereo Relative Pose Estimation

Once a loop closure has been hypothesized, the likelihood that the pairs of stereo images are imaging the same patch of seafloor is evaluated and, if a match is identified, the relative poses from which the images were acquired must be estimated. Wide-baseline feature descriptors allow matches to be proposed (i.e., corresponding features) between loop-closure images, but misassociations arising from visual self-similarity and low contrast are still possible. Whereas most proposed matches are correct, a few incorrect ones can create gross errors in pose estimation if not recognized as outliers (i.e., proposed correspondences that are inconsistent with a motion model or a geometric constraint such as the epipolar geometry). To address this problem, we generate relative pose estimates using a robust pipeline to process the stereo pairs. The steps involved in the process are illustrated in Figure 8 and are summarized here, with full details appearing in Mahon (2008); a sketch of the final registration step follows the list.

1. Features are extracted in the images using one of the wide-baseline feature descriptors discussed in Section 4.1.1.

2. Features are matched within each stereo pair, constrained by epipolar geometry, and the resulting 3D points are triangulated.

3. The features are then associated across the two stereo pairs using wide-baseline descriptors. The majority of outliers or misassociations can be rejected by applying epipolar constraints between each of the first and second pairs of images.


Figure 8. The stereo-vision relative pose estimation process.

4. Remaining outliers are rejected by calculating a robust relative pose estimate using the Cauchy ρ-function (Huber, 1981) and then using a Mahalanobis outlier rejection test (Matthies & Shafer, 1987) designed to accept 95% of inliers. The robust estimate is calculated using the random sample initialization method, in which each initial hypothesis is calculated by maximum likelihood 3D registration on a minimal set of three randomly selected features.

5. A final relative pose estimate and covariance is produced from the remaining inlier features using maximum likelihood 3D registration initialized at the robust relative pose estimate.
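As referenced above, the registration at the heart of steps 4 and 5 reduces, in the unweighted case, to least-squares rigid alignment of two sets of corresponding 3D points. The sketch below shows that alignment via SVD (Horn/Kabsch style); the Cauchy weighting and Mahalanobis gating of step 4 are omitted:

```python
import numpy as np

def rigid_registration(points_a, points_b):
    """Least-squares R, t such that points_b ~= R @ points_a + t, computed
    from corresponding 3D feature points in two stereo frames (SVD method)."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    H = (a - ca).T @ (b - cb)                        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                               # guard against reflections
    t = cb - R @ ca
    return R, t
```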

A loop-closure event comprises an observation of the relative pose between the current and a past pose. Given the availability of feature observations from our stereo cameras, we can compute a full six-degree-of-freedom (DOF) relationship between poses rather than a five-DOF constraint (attitude and direction of motion) available from a monocular camera.

4.1.3. Decoupling SLAM and Reconstruction

We have explicitly elected to decouple the SLAM and reconstruction steps in our pipeline. As shown in Figure 1, SLAM is considered as a first step in the process, in which the poses of the vehicle throughout the dive are estimated using the navigation sensors available on the vehicle and the constraints imposed using the matched features in the imagery. The reconstruction phase, described in Section 4.2, uses these poses to project the stereo meshes into space and to then compute a single, aggregate mesh. Any inconsistencies remaining at this point are assumed to be small and are dealt with using the mesh aggregation (Section 4.2) and texture blending (Section 5) techniques. Although in principle it is possible to formulate a representation of pose, structure, and visual appearance such that adjustments performed to the 3D structure and texture could propagate to corrections in pose estimates and calibration parameters, the complexity of such a problem is significantly greater than the one we are addressing in this paper. By decoupling the reconstruction problem from pose estimation (SLAM), stereo estimation, and texturing, the entire pipeline can be made to handle extremely large reconstruction problems, featuring on the order of 10,000 image pairs. As outlined previously, our goal was to produce a reconstruction system that can process whole missions in a timely manner, yielding approximate reconstructions in less time than they can be acquired, to allow future missions to be planned. We believe that some trade-offs were required in order to achieve these goals.

4.1.4. Sample SLAM Results

A comparison of the estimated trajectories produced by dead reckoning and SLAM is shown in Figure 9 for one of the deployments undertaken on the Great Barrier Reef (GBR) in October 2007. Both filters integrate the DVL velocity (relative to the bottom), attitude, and depth observations. The SLAM filter is also capable of incorporating loop-closure constraints from the stereo imagery. The dead-reckoning filter is not able to correct for drift that accumulates in the vehicle navigation solution. In contrast, loop closures identified in the imagery allow for this drift to be identified and for the estimated vehicle path to be corrected. This particular deployment comprised a total of more than 6,500 image pairs and a state vector that includes the six-state pose estimates for each image location. Loop-closure observations were applied to the SLAM filter, shown by the red lines joining observed poses. Applying the loop-closure observations results in a trajectory estimate that suggests that the vehicle drifted approximately 10 m north of the desired survey area. Figure 10 illustrates the role of SLAM in providing self-consistent camera poses for 3D model generation.

4.2. Mesh Aggregation

Once the individual stereo meshes have been placed in a common reference frame, they must be combined to create a single mesh from the set of georeferenced stereo meshes. Although care is taken to use self-consistent camera poses and to generate stereo meshes from feature points, errors in the structure and pose of the meshes are still possible. The main issues to address when aggregating meshes are as follows (Campbell & Flynn, 2001):

Error. Errors in the georeferencing and range estimation lead to inconsistencies in the overlapping meshes.

Redundancy. The set of stereo meshes includes redundant information, depending on the amount of overlap. A technique that removes this allows for more efficient storage and rendering.

Occlusion/aperture. Sensors with limited sensor aperture are not capable of capturing the entirety of an arbitrary scene. Any particular view of that scene may have occluded sections that result in “holes” in the associated mesh. Combining multiple views allows these holes to be filled in.

An example of a number of stereo meshes gathered from Sirius is illustrated in Figure 11. As can be seen, there are a number of strips of seabed that have been imaged from multiple positions, and there is some inconsistency in the estimated height of the seafloor, particularly around areas of high structure. Merging these multiple estimates of seabed height requires a technique for fusing multiple, noisy observations of the height in a consistent manner.

A number of techniques consider the problem of merging source geometry into a single surface. The most basic of these techniques simply generate a new mesh from all the available source points, resulting in a single mesh. Generating a single interpolated mesh that incorporates all the data may be achieved using Delaunay triangulation (Boissonnat, 1984), Voronoi diagrams (Amenta & Bern, 1999), or digital elevation map (DEM) greedy insertion (Fowler & Little, 1979); however, such approaches create jagged models in response to noise in the data. Other techniques stitch together a set of source meshes, remove overlapping vertices, and average out the resulting surface (Turk & Levoy, 1994), but again this tends to be sensitive to inconsistencies and noise. We investigated the use of these techniques but found that they lacked the robustness to the level of noise in our data and produced poor results. We therefore selected a class of volumetric techniques that provided the robustness to error that was required.

Volumetric techniques create a subdivision of the 3D space to integrate many views into a single volume. VRIP (Curless & Levoy, 1996) provides a weighted average of meshes in voxel space, creating an averaged surface from several noisy samples. This technique was used to generate a large, highly detailed model of Michelangelo’s David (Levoy et al., 2000). Their approach is used as a benchmark and as a standard tool for reconstruction (Kazhdan, Bolitho, & Hoppe, 2006; Seitz et al., 2006). This technique is limited by the constant resolution of the grid, requiring large amounts of memory to be capable of generating detailed models. This limitation inspired the use of adaptable grids to allow for greater memory efficiency. Ohtake, Belyaev, Alexa, Turk, and Seidel (2003) introduced octree structures as a means of adaptively subdividing space. The version of our system presented in this paper uses VRIP and a simple strategy to subdivide the total volume into manageable problems that include spatially adjacent meshes even if they are temporally distant (i.e., loop closures).

Figure 9. Comparison of dead-reckoning and SLAM vehicle trajectory estimates. The mission begins near (0, 0) and ends near (40, 60) and covers a distance of approximately 1.5 km. The SLAM trajectory is shown in black, with dots marking positions where a stereo pair was acquired, and the dead-reckoning estimates are shown in blue. The SLAM estimates suggest that the vehicle has drifted approximately 10 m north of the desired survey area. The red lines connect vehicle poses for which loop-closure constraints have been applied. The red circle shows a loop-closure area highlighted in Figure 10.

We have also explored the use of more recent techniques based on fast Fourier transform (FFT) convolution of the points with a filter solving for an implicit surface (Kazhdan, 2005). This work was later reformulated as solving for a Poisson distribution (Kazhdan et al., 2006). However, these techniques are more complex than VRIP and allowed us less flexibility in multithreading and control over the output. Both techniques were poorly suited as they are intended to estimate closed surfaces and make no use of the visibility space carving of VRIP and as such produce overconfident interpolations, creating large amounts of data where none exists. Another promising technique is the irregular triangular mesh representation (Rekleitis, Bedwani, & Dupuis, 2007), which provides a variable-resolution model that can incorporate overhangs. However, the outlier rejection in this case is based on edge length and therefore lacks the full 3D voxel averaging and outlier control of VRIP. For these reasons we decided to utilize VRIP for the integration of the meshes.

Figure 10. Mesh errors induced when local consistency is not enforced using SLAM. The red dots in (a) represent common features that have not been correctly placed in space due to drift in the estimated vehicle pose, and (b) shows the same intersections when SLAM has been used to correct for navigation errors. (c) Loop-closure feature associations. The first stereo pair is shown on top, with the second stereo pair below. The lines join the positions of associated features between left and right frames of the two pairs. The relative pose estimate based on these features is incorporated into the SLAM filter as an observation that constrains the vehicle’s trajectory.

Figure 11. Number of samples and standard deviation along the vertical (Z) axis in 10-cm cells.

When using VRIP, the quality of the integrated mesh is dependent on the selection of an appropriate ramp function used to weight meshes. The length of this ramp determines the distance over which points influence a voxel cell. The amount of noise in the data and the resolution of the grid help dictate the length of the ramp, trading off smoothness for detail. In other words, on noisy data averaging a large number of samples by using a large ramp will produce smoother results, whereas a short ramp averages only a few samples, thereby preserving high-frequency data. For our data, a ramp value was selected experimentally in proportion to our largest estimated misregistration. An example of the typical standard deviation in Z is shown in Figure 11(b). Grid resolution is another important factor to consider when using this algorithm. We chose to limit the onscreen polygon count to approximately 20,000 to guarantee smooth rendering on a laptop. Another requirement was to be able to view sections at least 10 m long (approx. 20 m²) at the highest LOD. A grid resolution of 33 mm produced meshes of 10-m transects with approximately 20,000 faces.
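The following is a heavily simplified, schematic sketch of the volumetric averaging idea: each surface sample contributes a ramp-weighted signed-distance observation to nearby voxels, and the fused surface for a grid column is read off near the zero crossing of the weighted average. It assumes down-looking geometry and is not the VRIP implementation:

```python
import numpy as np

class VoxelFusion:
    """Schematic weighted signed-distance fusion on a fixed-resolution grid.
    Down-looking geometry is assumed, so each surface sample only updates the
    voxels directly above and below it along Z (within the ramp length)."""

    def __init__(self, origin, size, resolution=0.033, ramp=0.1):
        self.origin = np.asarray(origin, dtype=float)
        self.res = resolution
        self.ramp = ramp
        self.dims = np.ceil(np.asarray(size, dtype=float) / resolution).astype(int)
        self.d = np.zeros(self.dims)      # accumulated weight * signed distance
        self.w = np.zeros(self.dims)      # accumulated weight

    def add_surface_sample(self, point, weight=1.0):
        """Fold one 3D surface point into the running per-voxel average."""
        i, j, k_surf = ((np.asarray(point, dtype=float) - self.origin) / self.res).astype(int)
        if not (0 <= i < self.dims[0] and 0 <= j < self.dims[1]):
            return
        n = int(self.ramp / self.res)
        for dk in range(-n, n + 1):
            k = k_surf + dk
            if 0 <= k < self.dims[2]:
                signed_dist = dk * self.res                   # voxel centre to surface
                w = weight * (1.0 - abs(signed_dist) / self.ramp)
                self.d[i, j, k] += w * signed_dist
                self.w[i, j, k] += w

    def fused_height(self, i, j):
        """Fused surface height for column (i, j): the voxel whose averaged
        signed distance is closest to the zero crossing."""
        valid = self.w[i, j] > 0
        if not valid.any():
            return None
        avg = np.where(valid, self.d[i, j] / np.where(valid, self.w[i, j], 1.0), np.inf)
        k = int(np.argmin(np.abs(avg)))
        return float(self.origin[2] + (k + 0.5) * self.res)
```

A longer ramp averages more samples per voxel (smoother, more robust to misregistration), while a shorter ramp preserves high-frequency relief, mirroring the trade-off described above.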

VRIP is a fixed-resolution technique and even thoughit uses run-length encoding of the voxel space, which canoffer a 1:10 savings in memory usage, integrating entiremission areas is infeasible. For a 4-h dive at 0.5 m/s, thevehicle covers 7,200 linear meters. Assuming a 2-m-wideswath and a 30-m vertical excursion, the volume of spaceto discretize is 432,000 m3. At 33-mm voxel resolution,there are 27,826 voxels per cubic meter, resulting in morethan 12 × 109 voxels for that volume. Even if a voxel wasencoded as 1 byte, this is already 12 GB of RAM, whichexceeds the limits of 32-bit systems. In addition, the bound-ing volume of a survey will grow with greater depth excur-sions and survey patterns that deviate from a simple lineartransect. One possible solution to this sparse problem is touse adaptive grids, such as octrees, to manage the compu-

We have started exploring an integration technique using quadtrees, the 2D analog of an octree. The quadtree method uses a 2.5-dimensional representation, which is a reasonable approximation given our imaging geometry (Johnson-Roberson, Pizarro, & Williams, 2009).
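The back-of-envelope numbers above can be reproduced directly. The sketch below is illustrative only; the function name and the one-byte-per-voxel assumption are ours, not part of the pipeline:

```python
# Rough estimate of the RAM needed to hold a dense VRIP voxel grid for a full
# dive (reproduces the 12 GB figure quoted above; illustrative only).

def vrip_grid_gigabytes(path_length_m=7200.0, swath_width_m=2.0,
                        vertical_excursion_m=30.0, voxel_size_m=0.033,
                        bytes_per_voxel=1):
    volume_m3 = path_length_m * swath_width_m * vertical_excursion_m   # 432,000 m^3
    voxels_per_m3 = (1.0 / voxel_size_m) ** 3                          # ~27,826
    total_voxels = volume_m3 * voxels_per_m3                           # ~1.2e10
    return total_voxels * bytes_per_voxel / 1e9

if __name__ == "__main__":
    print(f"Dense grid would need roughly {vrip_grid_gigabytes():.0f} GB of RAM")
```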

In this paper we present a more mature approach that uses constant-resolution grids but subdivides the problem into several subtasks to perform integration within available memory. A number of methods to achieve subdivision of the imaged space were considered. Splitting the meshes based on temporal constraints is not appropriate in this case, as many of the AUV deployments feature overlapping grids, and portions of the survey that are temporally separated may in fact be imaging spatially nearby regions. As shown in Figure 12, if two meshes are first merged based on temporal constraints, the resulting aggregate mesh features relatively large errors when the meshes are finally assembled. A spatial subdivision is therefore more appropriate, and two potential approaches were considered. The first is a trivial even division of space where grid lines are evenly distributed across the entire modeling space.

Figure 12. Mesh integrated using temporal splitting. The two crossing transects have first been merged individually. The resulting aggregate mesh features an area at the crossover that is not consistently merged when the final meshes are assembled. The inconsistency here can most likely be attributed to stereo triangulation errors or tidal effects that are currently not modeled.

An example can be seen in Figure 13(a). This can introduce errors if the subdivisions are laid down along a transect or at an intersection point, as portions of individual meshes may be integrated into different subdivisions.

This may introduce seams at the intersection when the final meshes are combined. The second approach uses a cost function that penalizes putting boundaries over loop closures, and an example subdivision can be seen in Figure 13(b). These loop-closure intersections are points at which there is considerable redundancy in the available meshes, and errors in georeferencing associated with the navigation solution may be most pronounced. By ensuring that subdivision of the meshes around these crossovers is avoided, VRIP is able to more consistently aggregate the meshes. This type of approach is appropriate only for overlapping grid and transect mission profiles. In the case of dense mission trajectories, the even splitting technique is used.
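As a rough illustration of the heuristic, the sketch below picks, along one axis, the candidate cut line farthest from any loop-closure crossover; the cost function, clearance threshold, and fallback are assumptions for illustration and not the exact criteria used in our implementation:

```python
import numpy as np

def choose_cut(candidate_cuts, crossover_positions, min_clearance=5.0):
    """Pick the candidate cut whose distance to the nearest loop-closure
    crossover is largest, i.e., penalize boundaries placed over crossovers."""
    crossovers = np.asarray(crossover_positions, dtype=float)
    best_cut, best_clearance = None, -1.0
    for cut in candidate_cuts:
        clearance = np.min(np.abs(crossovers - cut)) if crossovers.size else np.inf
        if clearance > best_clearance:
            best_cut, best_clearance = cut, clearance
    if best_clearance < min_clearance:
        # fall back to an even subdivision when no cut clears the crossovers
        return candidate_cuts[len(candidate_cuts) // 2]
    return best_cut

# Example: crossovers at 15 m and 45 m along the X axis of a 60-m survey area.
print(choose_cut(list(range(10, 55, 5)), [15.0, 45.0]))  # -> 30
```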

4.3. LOD Generation

LOD techniques enable the viewing of extremely large models with limited computational bandwidth (Clark, 1976). The underlying concept is to reduce the complexity of a 3D scene in proportion to the viewing distance or relative size in screen space. The scale and density of mission reconstructions require some LOD processing to allow for rendering the models on current hardware. Some AUV missions have upward of 10,000 pairs of images, which expand to hundreds of millions of vertices when the individual stereo pairs are processed. This would require multiple gigabytes of RAM if kept in core, which is impractical on conventional hardware. To view these data, a discrete paged LOD scheme is used in which several discrete simplifications of geometry and texture data are generated and stored on disk. These are paged in and out of memory based on the viewing distance to the object.
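A discrete paged LOD ultimately amounts to mapping a viewing distance to one of the precomputed simplification levels; the sketch below shows the idea with illustrative distance thresholds (the actual switching ranges in our viewer differ):

```python
def select_lod(view_distance_m, thresholds_m=(10.0, 50.0)):
    """Return the index of the discrete LOD to page in: 0 is the most detailed
    mesh, higher indices are coarser simplifications stored on disk."""
    for level, limit in enumerate(thresholds_m):
        if view_distance_m <= limit:
            return level
    return len(thresholds_m)  # coarsest level for distant overviews

print([select_lod(d) for d in (5.0, 25.0, 200.0)])  # [0, 1, 2]
```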

Figure 13. Subdivision of the mesh integration tasks to allow VRIP to operate on subsets of the problem. Divisions are made using (a) an even subdivision of space and (b) a heuristic-based subdivision of space that penalizes having boundaries along a transect or at an intersection. Stereo meshes are shown in blue, with crossover points depicted in black and red lines indicating where subdivisions have been created. (Both panels plot X against Y in meters.)

Integral to the discrete LOD scheme is a method of simplification of the full-resolution mesh. We use a quadric error method of decimation first introduced in Garland and Heckbert (1997) and extended in Garland and Heckbert (1998). It is based on collapsing edges, a process in which two vertices connected by an edge are collapsed into a new vertex location. This simplification step moves the vertices v1 and v2 to the new position v, connects all their incident edges to v, and deletes the vertices v1 and v2. The selection of which vertices to remove from a mesh is done using a quadric error metric, Q(v), that describes the cost of removing a vertex. This cost is equivalent to the distance from the vertex to all of its incident planes. In outline, the mesh simplification process is as follows (a simplified code sketch follows the list):

1. Compute the cost for all vertices.
2. Place vertices in a priority queue with the minimum-cost vertex at the root.
3. Collapse the root vertex and recompute all costs.
4. Repeat until the desired mesh complexity is reached.
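The following is a heavily simplified sketch of this loop, not the Garland and Heckbert implementation we use: it accumulates a plane quadric per vertex, keeps vertices in a min-cost heap, and greedily merges the cheapest vertex into its nearest surviving neighbor rather than performing a true edge collapse with optimal vertex placement:

```python
import heapq
import numpy as np

def plane_quadric(p0, p1, p2):
    """Quadric K = n n^T of the plane ax + by + cz + d = 0 through a triangle."""
    normal = np.cross(p1 - p0, p2 - p0)
    norm = np.linalg.norm(normal)
    if norm < 1e-12:                      # ignore degenerate triangles
        return np.zeros((4, 4))
    normal = normal / norm
    n = np.append(normal, -normal.dot(p0))
    return np.outer(n, n)

def vertex_cost(v, Q):
    """Quadric error Q(v): squared distance of v to its incident planes."""
    vh = np.append(v, 1.0)
    return float(vh @ Q @ vh)

def simplify(vertices, faces, target_vertex_count):
    """Greedily merge minimum-cost vertices into their closest neighbors until
    only target_vertex_count vertices remain; returns the surviving indices."""
    vertices = np.asarray(vertices, dtype=float)
    quadrics = [np.zeros((4, 4)) for _ in vertices]
    neighbors = [set() for _ in vertices]
    for f in faces:
        K = plane_quadric(*vertices[list(f)])
        for i in f:
            quadrics[i] += K
            neighbors[i].update(j for j in f if j != i)
    alive = set(range(len(vertices)))
    heap = [(vertex_cost(vertices[i], quadrics[i]), i) for i in alive]
    heapq.heapify(heap)
    while len(alive) > target_vertex_count and heap:
        _, i = heapq.heappop(heap)
        live = [j for j in neighbors[i] if j in alive] if i in alive else []
        if not live:
            continue
        # merge vertex i into its geometrically closest surviving neighbor j
        j = min(live, key=lambda n: np.linalg.norm(vertices[n] - vertices[i]))
        quadrics[j] += quadrics[i]
        neighbors[j].update(n for n in neighbors[i] if n != j)
        alive.remove(i)
        heapq.heappush(heap, (vertex_cost(vertices[j], quadrics[j]), j))
    return sorted(alive)

# Tiny example: a fan of three triangles reduced from five vertices to three.
verts = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0.5, 0.5, 0.01]]
tris = [(0, 1, 4), (1, 2, 4), (2, 3, 4)]
print(simplify(verts, tris, target_vertex_count=3))  # -> [2, 3, 4]
```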

For the current setup we generate three simplified versions from the original mesh at approximately 1,000, 100, and 10 polygons per square meter. Figure 14 shows an example of the mesh simplification process, demonstrating the reduction in complexity of the mesh as the number of triangles included is reduced by an order of magnitude at each step. Figure 14(c) shows a texture mapped version of the most simplified mesh, illustrating the fact that viewing the mesh from a distance can still be informative even at a relatively low level of mesh complexity. As the user zooms in on a particular section of mesh, the increasingly detailed meshes are loaded from disk and presented for detailed inspection of the seafloor structure.

Suitable LOD down-sampling ranges were selected to allow for satisfactory operation on laptops released in the past 3 years. This makes the system accessible to most users on a ship but requires that the system be capable of running on hardware with limited graphics processing power (as most laptops use integrated GPUs) and limited graphics RAM (usually 16–64 MB), but with access to most newer shader functionality (OpenGL 1.5 support or greater) (Segal & Akeley, 2003). These requirements dictate the tuning of the lower levels of detail; however, we wanted to maintain the ability to visualize single full-resolution images and all associated feature points at the highest LOD. When zoomed in, there is minimal loss of detail, minor compression artifacts, and full-resolution imagery. With respect to geometry, we have tuned the highest LODs to be at the limit of the GPUs at frame rate. Higher density models are possible but require a change in the target hardware platform.

5. TEXTURING AND TEXTURE BLENDING

It is often desirable to display detail beyond that which is modeled by the 3D shape of a mesh.

Texture mapping projects images onto the surface of a mesh, allowing finer detail than the structure contains to be displayed (Heckbert, 1986). Traditional techniques of visualizing AUV images utilize 2D image mosaicing to display the imagery in a spatially consistent fashion but eliminate structure, which may result in strong distortions (Singh, Howland, & Pizarro, 2004). Through parametric mapping of the imagery onto the meshes, we can effectively mosaic the images while accounting for the structure in the scene. The process determines the projective camera viewpoints that have imaged a particular triangle on the mesh and then assigns each vertex two varying parameters (u, v) that map it into the corresponding image.
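Assigning the (u, v) parameters amounts to projecting each vertex through the cameras that imaged it. The sketch below shows the projection for a single pinhole camera; the intrinsic matrix, image size, and function name are illustrative assumptions rather than our calibration values:

```python
import numpy as np

def vertex_to_uv(vertex_w, R, t, K, image_size):
    """Project a world-space vertex into one camera and return normalized
    texture coordinates (u, v) in [0, 1], or None if the point is behind the
    camera or falls outside the frame. R and t map world points into the
    camera frame; K is the 3x3 intrinsic matrix."""
    p_cam = R @ vertex_w + t
    if p_cam[2] <= 0:
        return None
    pixel = K @ (p_cam / p_cam[2])
    u, v = pixel[0] / image_size[0], pixel[1] / image_size[1]
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return None
    return u, v

# Example with an assumed 1360 x 1024 camera looking straight down the +Z axis.
K = np.array([[1400.0, 0.0, 680.0],
              [0.0, 1400.0, 512.0],
              [0.0, 0.0, 1.0]])
print(vertex_to_uv(np.array([0.1, -0.05, 2.0]),
                   np.eye(3), np.zeros(3), K, (1360, 1024)))
```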

Using survey images directly as texture maps for a 3D mesh can create distracting visual artifacts and destroy the impression of a photorealistic model. These issues arise primarily from visible lighting patterns and misregistration. Although our system compensates partially for nonideal moving light sources and strong lighting attenuation, any residual differences in appearance of the same scene point when viewed from a different viewpoint will produce seam-like artifacts when switching to textures from a different view. Radiance maps can restore the dynamic range of images (Debevec & Malik, 1997), which in part mitigates this problem for texturing, but they require highly redundant views at different exposure settings. This is impractical underwater as it would require significantly more lighting energy and data storage.

In the same way that lighting patterns cause visual inconsistency, registration error can also introduce artifacts in the reconstruction. Registration errors occur when the camera poses and 3D structure have errors that result in images of the same scene point being reprojected onto different parts of the 3D model. These errors are unavoidable when using approximate 3D structure in the form of meshes derived from a sparse set of features. This type of problem is common in mosaicking applications when camera motion induces parallax but the scene is assumed to be planar. To produce a visually consistent texture, most approaches exploit the redundancy in views by fusing the views in such a manner that high-frequency components have a narrow mixing region, reducing the chances of ghosting (Uyttendaele, Eden, & Szeliski, 2001). We adapt band-limited blending (Burt & Adelson, 1983) for use on 3D meshes, with calculations performed on a GPU and without having to explicitly determine the boundary on which blending is performed. This technique allows the blending to be computed in real time in a manner that is transparent to the rendering process.

Blending several textures at a vertex requires the calculation of the projection of that vertex into all cameras in which it was seen. In computer graphics this is the parameterization of images known as texture coordinates (Heckbert, 1986).

Figure 14. Meshes displaying the result of the simplification process described in Section 4.3. (a) Three meshes in wireframe displaying the reduction in mesh complexity; from left to right the meshes represent a reduction in triangles of one order of magnitude, where the first is 100%, followed by 10%, and finally 1%. Each corresponds to a LOD in the hierarchy described in Section 4.3. (b) Three increasingly simplified meshes in shaded relief, highlighting the loss of relief but the overall persistence of outline and shape. (c) A texture mapped version of the most simplified mesh (axes in meters). Although the mesh itself is relatively simple, the texture mapped images allow the gross structure to be inferred, even at a large viewing distance.

Because the mesh integration step combines several meshes to produce the final mesh, the original one-to-one mapping that existed between feature points and image coordinates has been lost. However, the new merged vertices can be assigned image coordinates in all cameras that view them through a process of backprojection. Naively one would traverse through all projections and check which were valid, but this is a costly procedure when performed on all mesh vertices. We create a bounding box tree (Smits, 1999) to allow all camera frames that view a mesh vertex to be quickly located. This bounding box tree contains all of the bounding volumes of the original triangulated meshes and produces a fast query of all of the camera views that have imaged a point. This operation is performed for every vertex of the mesh, generating multiple image coordinates for all cameras. This allows us to describe an image pixel's correspondence to the world for each view, which in turn allows us to blend all pixels associated with a particular face in the mesh. In the following section we discuss the blending in more detail.
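The sketch below illustrates the pre-filtering idea with a flat list of per-mesh bounding boxes rather than a full bounding box tree; the class, margin, and camera identifiers are invented for the example, and every candidate returned would still be verified by projecting the vertex into that camera:

```python
import numpy as np

class AabbIndex:
    """Flat list of per-camera mesh bounding boxes; a simplified stand-in for
    the bounding box tree used to shortlist cameras that may view a vertex."""

    def __init__(self):
        self.boxes = []  # (camera_id, min_corner, max_corner)

    def add_mesh(self, camera_id, mesh_vertices):
        pts = np.asarray(mesh_vertices, dtype=float)
        self.boxes.append((camera_id, pts.min(axis=0), pts.max(axis=0)))

    def cameras_viewing(self, vertex, margin=0.1):
        """Return candidate camera ids whose mesh bounding box contains the
        vertex (with a small margin)."""
        v = np.asarray(vertex, dtype=float)
        return [cid for cid, lo, hi in self.boxes
                if np.all(v >= lo - margin) and np.all(v <= hi + margin)]

index = AabbIndex()
index.add_mesh(0, [[0.0, 0.0, 0.0], [2.0, 1.0, 0.5]])
index.add_mesh(1, [[1.5, 0.0, 0.0], [3.5, 1.0, 0.5]])
print(index.cameras_viewing([1.8, 0.5, 0.2]))  # -> [0, 1]
```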

The mechanism that performs the blending is based on the use of image splines (Burt & Adelson, 1983).

The image spline is used to blend the image seam smoothly without losing fine detail from either image. Assuming that there is some misregistration between the images, blending can result in some ghosting (multiple instances of objects, particularly noticeable at higher spatial frequencies). Prior work has shown that for most real-world images it is impossible to select a single spline that appropriately blends all the frequency components of an image (Burt & Adelson, 1983). Therefore, in order to perform blending, the images must be decomposed into frequency bands and an appropriate blending spline selected for each band. We choose to blend three nonoverlapping frequency bands. Three bands were selected empirically after limiting factors such as floating-point precision and image size in the GPU showed diminishing returns for any additional bands. The frequency decomposition can be represented as a set of low-pass filters applied to a series of images in one-octave steps. Burt and Adelson propose the use of an image pyramid to perform the filtering using band-pass component images. We extend this work by using a novel implementation on a GPU that allows for efficient and simple processing to achieve a similar result. Graphics cards are set up to handle images for the purposes of texturing geometry and can quickly load and manipulate such texture data (Catmull, 1974; Oka, Tsutsui, Ohba, Kurauchi, & Tago, 1987). Specifically, the GPU's hardware mipmapping (Williams, 1983) is leveraged to create the texture frequency decomposition, and then a shader is used to perform the blending.

The steps used to calculate the color of a pixel on the mesh are shown in pseudocode in Algorithm 1. We use a weighting function that determines the degree to which different source pixels contribute to the blend. Because the images display significant radial distortion and illumination falloff away from the center (Jaffe, 1990), our weighting function favors pixels near the image centers.

Algorithm 1 The color of a pixel on the mesh.
for vert_i in the set of all vertices do
    for k = (5, 10, 50) do {each band-limited image}
        for (u, v)_j in the set of all projections of vert_i do
            {(u_center, v_center) is the center of the image for projection j}
            r = dist((u, v)_j − (u_center, v_center))
            Calculate the non-normalized weight B_k(r) using Eq. (1)
        end for
        Calculate the normalized weights W_k^i from all B_k for vert_i using Eq. (2)
        Sum all normalized weighted colors at (u, v)_j
    end for
    Recombine each band into the final pixel color
end for

The formula to derive the non-normalized weighting value B_k for an image at a pixel is shown in Eq. (1):

\[ B_k(r) = \frac{e^{-k \cdot r/R}}{1 + e^{-2 \cdot k \cdot r/R}}, \tag{1} \]

where r is the distance to the center of the image and R is a reference distance (typically the maximum distance to be considered). Each frequency band is shaped by k, with larger k inducing a sharper drop-off. Figure 15(a) illustrates the shapes of the weighting function for k = 5, 10, 50.

Figure 15. Plot of three weighting functions used to combine frequency bands. (a) Weighting as a function of normalized distance to center (r/R) for low-, medium-, and high-frequency bands. (b) Actual weightings for three images with centers at r/R = {0, 0.5, 1}. The weights for image 1 are centered at r/R = 0, image 2 at r/R = 0.5, and image 3 at r/R = 1. The weights for the low-frequency bands carry the L subscript, the medium-frequency bands M, and the high-frequency bands H. Notice the sharp transition zone for the high-frequency components, whereas there is a more gradual blending of the low-frequency components.

The actual weights applied to the textures are normalized by the total weights contributed by the nearby images, as shown in Eq. (2):

\[ W_k^i(r_i) = \frac{B_k^i(r_i)}{\sum_j B_k^j(r_j)}, \tag{2} \]

where W_k^i is the normalized weight for image i with drop-off shaped by k. Figure 15(b) shows an example of the normalized weights for three partially overlapping images.
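Equations (1) and (2) translate directly into code. The short sketch below evaluates the per-band weight and its normalization over the images that view a vertex; the example distances are made up:

```python
import numpy as np

def band_weight(r, R, k):
    """Non-normalized weight B_k(r) from Eq. (1): a smooth drop-off with the
    distance r of a projection from the image center, sharper for larger k."""
    x = r / R
    return np.exp(-k * x) / (1.0 + np.exp(-2.0 * k * x))

def normalized_weights(distances, R, k):
    """Normalized weights W_k^i from Eq. (2) for one vertex seen in several
    images, given the distance of its projection to each image center."""
    b = np.array([band_weight(r, R, k) for r in distances])
    return b / b.sum()

# A vertex projecting near the center of image 0 and near the edge of image 1.
for k in (5, 10, 50):  # low-, medium-, and high-frequency bands
    print(k, normalized_weights([0.1, 0.9], R=1.0, k=k))
```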

The GPU shader code for this technique is novel in its use of texture arrays, which allow simultaneous access to many 2D textures, enabling the blending to be performed in real time. The technique produces meshes with significantly fewer seams and inconsistencies in the texture maps, allowing the visual image data to be draped over the resulting surface models. Figure 16 presents three views of the same section of mesh, with a different texture blending algorithm applied to each. Figure 16(a) shows an unblended approach that selects the closest image, along with the characteristic seams that exist without blending when projectively texturing from multiple images. A naively blended mesh can be seen in Figure 16(b), where each pixel is the average of all views of that point. The results of the proposed technique are displayed in Figure 16(c). As can be seen, the proposed approach results in a blended, textured mesh with fewer visible seams and without a loss of high-detail texture. Figure 16(d) illustrates a short section of blended mesh.

The algorithm uses a single programmable shader and takes approximately 7 ms per render update using the GeForce 8600GS, a modest dedicated graphics card that supports texture arrays. It is possible to maintain more than 120 frames per second with this technique, making it more than suitable for real-time pipeline rendering; however, because the blending is static, this rendering is necessary only once, as the resulting texture is saved and replaces the original unblended texture. This means that the entire process requires only an additional one-time cost of 7 ms per image (along with the time to write the new images to disk) to produce a final blended mesh. This also removes the constraint of requiring a graphics card that supports the 2D texture array extension to display the blended results, making this technique accessible to virtually all modern computer hardware once the blended textures have been generated.

6. PRACTICAL CONSIDERATIONS

Although considerable work has gone into establishing the pipeline required to generate detailed, texture mapped models of the seafloor using the techniques described here, a number of practical considerations have been addressed in order to allow these models to be generated and displayed in a timely fashion. This section examines some of these issues and how they have affected the design decisions and performance of the system.

6.1. Texture Compression

One of the central goals of the visualization of large data sets is the ability to display all images and structure simultaneously. This is particularly challenging if the visualization is to run on commodity hardware. Two issues dominate the visualization of tens of thousands of images. First is keeping all images in system memory. At approximately 2 MB per image, loading 10,000 images would require 20 GB of system memory, which is beyond the capacity of most current desktops and laptops. Second, modern graphics cards have limited memory and processing power despite great advances in the past 5 years. Thus only 10 or 15 full-resolution images, as well as a limited number of vertices and faces, can be held in 32 MB of graphics memory.

Texture compression serves to increase the number of images that can be viewed (Liou, Huang, & Reynolds, 1990). Hardware implementations are now fairly ubiquitous in commodity graphics cards. We utilize the DXT1 variant of texture compression, which represents a 4 × 4 block of pixels using 64 bits. DXT1 is a block compression scheme in which each block of 16 pixels in the image is represented by a start and an end color (5 bits for red, 6 for green, and 5 for blue) and a 4 × 4 2-bit lookup table that determines the color level for each pixel. This compression algorithm achieves an 8:1 compression ratio.
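The 64-bit block layout fixes the compression ratio, which can be checked with a few lines; the 8:1 figure above assumes a 32-bit-per-pixel source image:

```python
def dxt1_compression_ratio(bits_per_source_pixel=32):
    """DXT1 stores each 4x4 block as two 16-bit (5:6:5) endpoint colors plus a
    2-bit index per pixel: 2*16 + 16*2 = 64 bits per block."""
    pixels_per_block = 16
    block_bits = 2 * 16 + pixels_per_block * 2          # 64 bits
    source_bits = pixels_per_block * bits_per_source_pixel
    return source_bits / block_bits

print(dxt1_compression_ratio())    # 8.0 for 32-bit sources
print(dxt1_compression_ratio(24))  # 6.0 for 24-bit RGB sources
```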

6.2. Texture Pyramid

With the limitations in system memory and GPU power, textures must be managed to maintain the performance of the system. Just as LOD schemes are used for geometry (Clark, 1976), textures can be represented in a multiresolution pyramid. An example of image sizes can be seen in Figure 17. Furthermore, a technique traditionally harnessed for distance-dependent display of textures, known as mipmapping, exists as a hardware feature of all modern GPUs. Mipmapping is the generation of the aforementioned image pyramid (used for texture blending) from a high-resolution texture at reduced resolutions (Strengert, Kraus, & Ertl, 2006; Williams, 1983). Traditionally these images are generated at quarter resolution. If the initial image is 256 × 256, a total of eight mipmap images will be generated at 128 × 128, 64 × 64, 32 × 32, 16 × 16, 8 × 8, 4 × 4, 2 × 2, and 1 × 1 pixels. By using the hardware-generated pyramid, computation time can be saved in the LOD generation step. These automatically generated texture pyramids are then stored in an explicit discrete LOD (DLOD) textured model. Levels are created prior to run time, and the system selects the LOD most appropriate for the viewing distance. This makes effective use of the screen's limited resolution when viewing large numbers of images.

Figure 16. Blended and unblended meshes displaying visual results from the proposed technique. (a) Note the seams highlighted in red on the unblended image that selects the closest texture for each surface. (b) A naive blending that averages the textures for each surface results in significant blurring. (c) The band-limited blending proposed here preserves significant detail while avoiding seams in the blended image. (d) Overview of a section of blended mesh.

Figure 17. An example of texture LOD images.

DLOD schemes can suffer from the introduction of visual error when switching levels when compared with some recent continuous LOD schemes (Ma, Wu, & Shen, 2007; Ramos, Chover, Ripolles, & Granell, 2006). However, we consider that these disadvantages are outweighed by the simpler requirements on hardware, making the visualization system more accessible.
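The mipmap chain sizes quoted above follow from repeatedly halving each dimension; a minimal sketch:

```python
def mipmap_chain(width, height):
    """Return the mip level sizes obtained by halving each dimension (with a
    floor of one pixel) until the 1x1 level is reached."""
    levels = []
    while width > 1 or height > 1:
        width, height = max(width // 2, 1), max(height // 2, 1)
        levels.append((width, height))
    return levels

print(mipmap_chain(256, 256))
# [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
```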

6.3. Binary Mesh Format Generation

We optimize the storage of these meshes in individual binary format meshes with textures stored internally in compressed format, allowing for a minimum of transformation to load data files into system and graphics memory. The images are stored in their natively compressed DirectDraw Surface (DDS) format, and the meshes can be streamed directly into vertex buffers on the graphics card. We also use the binary mesh format of OpenSceneGraph to aid in efficient geometry storage (OpenSceneGraph, n.d.). We utilize a multithreaded paging system that pulls each submesh created in Section 4.2 into memory when the viewing distance is close. This paging allows the entire mesh to be seen simultaneously while only high-detail sections are paged in when necessary. The final binary mesh with compressed textures takes up approximately 2.5 MB per image on average. A typical 100-image strip is about 25 MB, and a typical 19,000-image complete mission is about 4.8 GB, which is well within the storage capabilities of current computers.

6.4. GPU

GPUs are naturally suited to manipulating texture data, and significant speed gains can be achieved by reimplementing the texture processing segments of the pipeline in graphics hardware. In the current implementation, we perform all the texture compression and blending on the GPU. This allows for greater parallelization, as these tasks can be performed without overloading the CPU, leaving it free to continue processing the mesh geometry. We make use of NVIDIA CUDA for texture compression, which offers a speedup of more than 12× over CPU-based texture compression (NVIDIA, 2008). The texture blending (Section 5) is performed in real time and creates a negligible slowdown in processing.

6.5. Multithreading and Distributed Processing

The introduction of multicore CPUs to the desktop market has brought symmetric multiprocessing to the mainstream. The challenge is now to write code that can take advantage of this parallelism. We have taken several sections of the pipeline and made them parallel so that they can take advantage of these modern CPUs. In addition, we have taken this a step further and implemented a system for distributing the processing across multiple machines, further decreasing processing time. The basis of both techniques is that there is no need for synchronization between frames within each individual stage of the pipeline. The stereo processing of each pair is independent of the previous pair in the current implementation; therefore, the task can be completely divided along with the data. Thread synchronization is needed only between each pipeline step. The distribution of tasks uses a distributed file system (DFS), in this case NFS (Network File System), but almost any modern DFS will work. This is possible because all metadata are stored as a file maintained by a single synchronizing process. This process spawns distributed children, and these children need only read that metadata file and have access to the relevant source data for their section. As all data for a mission reside in this DFS, all machines with the binaries for processing can operate on any part of the mission and store their results back onto the same directories. For the following discussion, we refer to nodes that can be either threads or distributed processes. In the current implementation we multithread or distribute the pipeline using the method described as follows (a minimal sketch of the stereo fan-out follows the list):

• Stereo depth estimation: Each stereo pair is independent, so each node receives a list of image pairs and stores the resulting meshes in the DFS.
• Synchronization occurs, and meshes are transformed into global space and divided, as discussed in Section 4.2, into spatial regions for integration.
• Mesh aggregation: Each node receives a list of spatial regions to integrate.
• Synchronization occurs.
• Simplification: Each spatial region is simplified for LOD rendering.
• Synchronization occurs.
• Texturing: Each node again receives a spatial region and now a list of images to process, compress, and apply to the geometry within its region.
• Finally, synchronization occurs one last time, and the spatial regions are tied together in the LOD hierarchy for viewing.
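Because the stereo pairs are independent, the fan-out for the first stage can be expressed with an ordinary process pool. The sketch below is illustrative only: the per-pair function is a placeholder, and in the actual system the work is distributed across machines through the shared file system rather than a single pool:

```python
from multiprocessing import Pool

def process_stereo_pair(pair_paths):
    """Placeholder for the per-pair stereo step: load a left/right image pair,
    triangulate, and write the resulting mesh to the shared file system."""
    left, right = pair_paths
    # ... stereo matching and triangulation would happen here ...
    return f"mesh for {left} / {right}"

def run_stage(pairs, workers=4):
    """Fan independent stereo pairs out to worker processes and synchronize
    (by joining the pool) before the next pipeline stage begins."""
    with Pool(processes=workers) as pool:
        return pool.map(process_stereo_pair, pairs)

if __name__ == "__main__":
    pairs = [(f"left_{i:04d}.png", f"right_{i:04d}.png") for i in range(8)]
    print(run_stage(pairs))
```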

The results of this threading can be seen in Figure 18, where timing results show the effectiveness of dividing the problem. The tests were run on a 3.2-GHz quad-core CPU.

Figure 18. Timing results (in seconds) for each pipeline stage (stereo, VRIP, simplification, texture generation, and total) as a function of the number of threads on a 3.2-GHz quad-core CPU processing a mission of 1,623 images. Note that the step between one and two threads falls short of the theoretical improvement of halving the processing time due to overhead and management costs from the thread pool implementation, which is used only beyond one thread. This is compounded by a sync step that occurs when data are pushed onto the GPU.

Note the diminishing returns beyond three threads. This is attributable to limited disk access speed, with image loading becoming the bottleneck.

7. RESULTS

The techniques described in this paper have been used to produce seabed reconstructions on a number of AUV missions around Australia.

7.1. Timing Results

Speed was an important factor in the choices that went into developing this pipeline. One major requirement is the ability to turn source data into 3D models quickly so that this tool can be used during field deployments to rapidly perform habitat assessments and to help in planning subsequent dives. The stereo processing is by far the slowest processing step, as can be seen in Figure 19.

Figure 19. Timing results (in seconds) for each pipeline stage (stereo, VRIP, simplification, texture generation, and total) as a function of the number of images.

Table II. Timing results for selected missions with increasing numbers of images.

Parameter                      1,623 images   3,393 images   6,559 images   19,004 images
Mission duration (s)                    812          1,697          3,280           9,502
Time stereo (s)                         639          1,337          2,863           7,103
Time VRIP (s)                            44            145            328             401
Time simplification (s)                  16             98            113             212
Time mesh generation (s)                 77            216            430           1,231
Total (s)                               776          1,796          3,734           8,947

However, this is the simplest step to replace, and the stereo calculation is currently being cached, meaning that it is typically run once per mission. The optimization efforts have been focused on the subsequent steps of the pipeline, as illustrated by the performance results in Table II. As can be seen, generating the high-resolution 3D seafloor models takes a similar amount of time as the mission itself.

7.2. Use Cases

We present the results of two missions processed using this pipeline and illustrate the potential use of the visualization by progressively zooming into the reconstruction.

7.2.1. Drowned Reefs in the GBR

The AUV was part of a 3-week research cruise in September 2007 aboard the R/V Southern Surveyor documenting drowned shelf edge reefs at multiple sites in four areas along the GBR (Webster et al., 2008; Williams et al., 2008). We were able to document relict reefs up to 20,000 years old, formed during ice ages when sea levels were up to 70 m lower than today. The study of these structures may yield insights regarding potential future sea-level changes and their impact on sensitive reef areas such as the GBR. Figure 20(a) contains views from a reconstruction of a dive site at Noggin Reef, one of the study sites off the coast at Cairns in far north Queensland, Australia. The dive targeted a particular reef structure at a depth of 60 m and featured multiple crossovers covering both relatively flat bottoms and high-relief reef sections. The figure illustrates how a user might examine an overview of the dive profile before zooming in on particular areas of interest.

The manner in which these reconstructions can be used as tools for science is only beginning to be fully explored. One application that has emerged based on this work is the study of spatially correlated behaviors in benthic macrofauna (Byrne et al., 2009). It was discovered using AUV imagery that subsea dunes at Hydrographers Passage on the GBR support communities of the brittle star Ophiopsila pantherina.

Figure 20. (a) Usage example for a dive undertaken on the GBR in far north Queensland. By first examining an overview of the dive, one gets a good sense of the distribution of reefs before zooming in on particular details of the terrain structure. (b) Usage example for a dive completed on the Tasman Peninsula demonstrating how the system allows a user to look at an entire dive transect, in this case more than 2 km long, before zooming in to examine details. The imagery is shown here alongside the multibeam data collected by the vehicle, and a user is able to interactively turn off the texturing to examine the underlying mesh.

Figure 21. (a) The reconstruction of sand dunes from a GBR AUV mission to study Ophiopsila pantherina, also known as brittle stars. The brittle stars appear in the right inset image but not in the left. Note that the orthographic views of the scene do not allow the observer to understand the structure of the environment. Once the 3D mesh is viewed side-on, it becomes apparent where the brittle stars are living on the dunes. Additionally, the georeferenced positions of the mesh with brittle stars were used to acquire physical samples. (b) and (c) The effect of a conservative bottom-following controller. Notice the increase in stereo footprint as the vehicle comes off the slope and the bottom drops away. The lighting correction begins to break down as the assumptions about terrain height are violated, something we discovered upon visualizing the reconstruction.

The dunes covered approximately 340 ha of seafloor at 60–70 m, placing them beyond a depth at which routine SCUBA-based study methods are tractable. It was thought that O. pantherina take refuge on the lee side of the dunes; however, this is not apparent in the 2D image mosaics. The 3D reconstructions made it easy to observe that, as predicted, the brittle stars appeared only on the shielded slope of the dunes. This phenomenon can be seen in Figure 21(a). In addition, O. pantherina's appearance was quite patchy in the survey region. From the AUV navigation we were able to provide georeferenced locations for places in the reconstruction where brittle stars had been seen.

The dunes were resampled during a follow-up research cruise to the area, and the stars were found in the georeferenced locations predicted from the AUV dive completed 12 months earlier. Likewise, no brittle stars were found in the areas that were predicted to be free of them. This supported two important hypotheses: first, that the reconstruction is capable of providing georeferenced observations that can be used as a guide not only for additional AUV dives but also for other sampling methods (physical grabs in this case), and second, that biologically O. pantherina have locationally stable communities (Byrne et al., 2009).

7.2.2. Marine Habitats in the Tasman Peninsula

In October 2008 the AUV was used to survey several sites along the Tasman Peninsula in southeastern Tasmania as part of the Commonwealth Environmental Research Facility (CERF) Marine Biodiversity Research Hub (Williams, Pizarro, Jakuba, & Barrett, 2009). The science party used existing bathymetric maps to select sites of ecological importance beyond recreational diver depths. A total of 19 dives were completed over the course of 10 days of operation, with dives lasting on the order of 4–6 h. During these deployments, approximately one terabyte of raw data was collected. This was our first research cruise with an upgraded set of field computers, and a first pass of all the postprocessed data was delivered on completion of the research cruise, including georeferenced visual SLAM navigation estimates, 3D stereo meshes, and multibeam bathymetry.

Figure 20(b) illustrates a reconstruction from a dive undertaken on this research cruise. This particular dive featured a 2-km cross-shelf transect, starting on the reef at a depth of 35 m and transiting out to a flat bottom at a depth of approximately 75 m. The figure demonstrates how the entire dive transect can be viewed, zooming in to examine particular details of the seafloor relief. Multibeam data gathered by the AUV are included here to add context to the visual swath. There is good correspondence between the structure generated by our system and that recovered using the multibeam swaths, once again demonstrating the quality of the stereo reconstructions.

We have also found the tool to be useful in visualizing and debugging the performance of the vehicle. One example of this arises when the vehicle traverses complex terrain while bottom following. It uses a forward-looking obstacle avoidance sonar to identify obstacles in its path and rises to avoid them. The present trajectory generator of the vehicle leads it to remain high over the terrain if it travels off a ledge, taking time to return to the programmed altitude instead of dropping back down immediately. The effect of this, depicted in Figure 21(b), is a much wider stereo footprint and darker images. This consistent pattern, and its dependence on the magnitude of the drop-off, is readily visible in the 3D reconstructions but harder to observe solely from image sequences. The simple lighting correction approach adopted here breaks down in these high-altitude situations, and explicitly incorporating range-dependent lighting corrections is a direction of ongoing work.

8. CONCLUSION AND FUTURE WORK

Our system has performed reliably over multiple dives covering a broad range of marine habitats. The emphasis to this point has been on establishing a working pipeline capable of processing and rendering tens of thousands of images with sufficient speed to allow end-user interaction with the reconstruction in time to inform further data-gathering missions. The timing results show that the system can deliver the full reconstruction in a time comparable to the mission duration. These results also show that our stereo depth estimation stage accounts for most of that time. It is straightforward to try other stereo processing modules, and future work will evaluate recent promising approaches (Marks, Howard, Bajracharya, Cottrell, & Matthies, 2008; Seitz et al., 2006). It is reasonable to assume that both system memory and the processing power of CPUs and GPUs will continue to increase. This will allow us to revisit the use of dense stereo in order to generate even higher resolution terrain models.

Lighting correction and band-limited blending give the end user the impression of interacting with a visually consistent reconstruction. The underlying structure is only approximate, and so far we have only partially characterized the accuracy of the individual stereo reconstructions. Future work will include using ground-truth surfaces and objects at small to medium scales (requiring one to tens of stereo meshes) in water. One possible approach is to laser scan the scene in a tank after draining it (Pizarro et al., 2004).

Mesh aggregation using VRIP requires partitioning the bounding volume into sections that VRIP can handle effectively with available system memory. This bounds the use of memory and imposes a computational cost approximately linear in the number of images. The disadvantage of this approach is that inconsistencies are possible at partition boundaries. We are currently investigating other forms of mesh aggregation that scale to large volumes.

The current pipeline does not explicitly account for a dynamic world. In practice this has not been a significant drawback for most surveys for several reasons: the AUV is used mostly beyond routine scientific diving depths, which implies very little motion from surface waves, so any sessile organisms that can "sway" in the current are essentially static when imaged over a few seconds. Cases that present problems are motile organisms such as fish, though in general it seems that they get out of the way of the AUV and are rarely captured in the imagery.

Another problem occurs when closing loops after minutes or hours, when slow-moving organisms such as starfish have visibly moved, or when the tide has changed and anchored organisms are leaning in another direction. When they do occur, moving objects are generally sparse in an image frame, and robust matching still establishes correct loop closures. Even with these deficiencies, we argue that the representation allows the end user to increase his or her understanding of the environment at multiple scales. However, one could envision a system in which dynamic objects were detected and removed, or possibly even modeled and tracked. In possible future science applications, this might be a beneficial or even necessary extension of this work.

Being able to interact with a textured 3D reconstruction is only the first step in effectively using data acquired by robotic platforms. We are currently augmenting our system to allow it to incorporate multiple layers of textures that can represent overlays of quantitative information such as surface rugosity or classification of marine habitats. This should allow marine scientists to observe relevant spatial patterns at multiple scales, ideally facilitating their ability to identify the underlying processes that generate the patterns, as well as helping create testable hypotheses that will ultimately lead to an improved understanding of our oceans.

ACKNOWLEDGMENTS

This work is supported by the ARC Centre of Excellence programme, funded by the Australian Research Council (ARC) and the New South Wales State Government, and the Integrated Marine Observing System (IMOS) through the DIISR National Collaborative Research Infrastructure Scheme. A special thanks to the captain and crew of the R/V Southern Surveyor and R/V Challenger; without their determined assistance we could not have gathered the data used in this paper. We also acknowledge the help of all those who have contributed to the development and operation of the AUV, including Duncan Mercer, George Powell, Stephen Barkby, Ritesh Lal, Paul Rigby, Jeremy Randle, Bruce Crundwell, and the late Alan Trinder.

REFERENCES

Agrawal, M., Konolige, K., & Bolles, R.C. (2007, February). Localization and mapping for autonomous navigation in outdoor terrains: A stereo vision approach. In IEEE Workshop on Applications of Computer Vision, Austin, TX (pp. 7–13).

Amenta, N., & Bern, M. (1999). Surface reconstruction by Voronoi filtering. Discrete and Computational Geometry, 22(4), 481–504.

Armstrong, R.A., Singh, H., Torres, J., Nemeth, R.S., Can, A., Roman, C., Eustice, R., Riggs, L., & Garcia-Moliner, G. (2006). Characterizing the deep insular shelf coral reef habitat of the Hind Bank marine conservation district (US Virgin Islands) using the Seabed autonomous underwater vehicle. Continental Shelf Research, 26, 194–205.

Bajracharya, M., Maimone, M., & Helmick, D. (2008). Autonomy for Mars rovers: Past, present, and future. IEEE Computer, 41(12), 44–50.

Ballard, R., McCann, A., Yoerger, D., Whitcomb, L., Mindell, D., Oleson, J., Singh, H., Foley, B., Adams, J., & Piechota, D. (2000). The discovery of ancient history in the deep sea using advanced deep submergence technology. Deep-Sea Research Part I, 47(9), 1591–1620.

Barnard, K., Cardei, V., & Funt, B. (2002). A comparison of computational color constancy algorithms. I: Methodology and experiments with synthesized data. IEEE Transactions on Image Processing, 11, 972–984.

Bay, H., Ess, A., Tuytelaars, T., & Gool, L.V. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3), 346–359.

Boissonnat, J.-D. (1984). Geometric structures for three-dimensional shape representation. Transactions on Graphics, 3, 266–286.

Burghouts, G.J., & Geusebroek, J.-M. (2009). Performance evaluation of local colour invariants. Computer Vision and Image Understanding, 113, 48–62.

Burt, P., & Adelson, E. (1983). A multiresolution spline with application to image mosaics. ACM Transactions on Graphics, 2, 217–236.

Byrne, M., Williams, S., Woolsey, E., Davies, P., Bridge, T., Thornborough, K., Bridge, T., Beaman, R., Webster, J., & Pizarro, O. (2009). Flashing stars light up the reef's shelf. Ecos(150), 28–29.

Campbell, R., & Flynn, P. (2001). A survey of free-form object representation and recognition techniques. Computer Vision and Image Understanding, 81(2), 166–210.

Catmull, E.E. (1974). A subdivision algorithm for computer display of curved surfaces. Unpublished doctoral dissertation, University of Utah.

Clark, J. (1976). Hierarchical geometric models for visible-surface algorithms. Communications of the ACM, 19, 547–554.

Cornelis, N., & Gool, L.V. (2008, June). Fast scale invariant feature detection and matching on programmable graphics hardware. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPR Workshops 2008, Anchorage, AK (pp. 1–8).

Curless, B., & Levoy, M. (1996). A volumetric method for building complex models from range images. In ACM SIGGRAPH: Computer graphics and interactive techniques (pp. 303–312). New York: ACM.

Debevec, P.E., & Malik, J. (1997). Recovering high dynamic range radiance maps from photographs. In ACM SIGGRAPH: Computer graphics and interactive techniques (pp. 369–378). New York: ACM.

Dissanayake, M., Newman, P., Clark, S., Durrant-Whyte, H., & Csorba, M. (2001). A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3), 229–241.

Duntley, S. (1963). Light in the sea. Journal of the Optical Society of America, 53(2), 214–233.

Durrant-Whyte, H. (1996). An autonomous guided vehicle for cargo handling applications. International Journal of Robotics Research, 15(5), 407–440.

Durrant-Whyte, H., & Bailey, T. (2006). Simultaneous localisation and mapping (SLAM): Part I. The essential algorithms. Robotics and Automation Magazine, 13, 99–110.

Eustice, R., Singh, H., Leonard, J., & Walter, M. (2006). Visually mapping the RMS Titanic: Conservative covariance estimates for SLAM information filters. International Journal of Robotics Research, 25(12), 1223–1242.

Fitzgibbon, A., & Zisserman, A. (1998, June). Automatic camera recovery for closed or open image sequences. In Proceedings of the 5th European Conference on Computer Vision (pp. 311–326). Freiburg, Germany: Springer-Verlag.

Fleischer, S., Wang, H., Rock, S., & Lee, M. (1996, June). Video mosaicking along arbitrary vehicle paths. In Proceedings of the 1996 Symposium on Autonomous Underwater Vehicle Technology, Monterey, CA (pp. 293–299).

Fowler, R., & Little, J. (1979). Automatic extraction of irregular network digital terrain models. In ACM SIGGRAPH: Computer graphics and interactive techniques (pp. 199–207). New York: ACM.

Fruh, C., & Zakhor, A. (2004). An automated method for large-scale, ground-based city model acquisition. International Journal of Computer Vision, 60, 5–24.

Garcia, R., Nicosevici, T., & Cufi, X. (2002). On the way to solve lighting problems in underwater imaging. In MTS/IEEE Oceans (vol. 2, pp. 1018–1024).

Garland, M., & Heckbert, P. (1997, August). Surface simplification using quadric error metrics. In ACM SIGGRAPH: Computer graphics and interactive techniques, Los Angeles, CA (pp. 209–216). New York: ACM.

Garland, M., & Heckbert, P. (1998). Simplifying surfaces with color and texture using quadric error metrics. In Proceedings Visualization '98, Washington, DC (pp. 263–269).

German, C.R., Yoerger, D.R., Jakuba, M., Shank, T.M., Langmuir, C.H., & Nakamura, K.-I. (2008). Hydrothermal exploration with the autonomous benthic explorer. Deep-Sea Research Part I, 55, 203–219.

Gracias, N., & Santos-Victor, J. (2001, November). Underwater mosaicing and trajectory reconstruction using global alignment. In MTS/IEEE Oceans, Honolulu, HI (vol. 4, pp. 2557–2563).

Harris, C., & Stephens, M. (1988, August). A combined corner and edge detection. In Proceedings of the Fourth Alvey Vision Conference, Manchester, UK (pp. 147–151).

Hartley, R., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge, UK: Cambridge University Press.

Heckbert, P. (1986). Survey of texture mapping. IEEE Computer Graphics and Applications, 6, 56–67.

Ho, N., & Jarvis, R. (2007, May). Large scale 3D environmental modelling for stereoscopic walkthrough visualisation. In 3DTV Conference, Kos, Greece (pp. 1–4).

Howland, J. (1999). Digital data logging and processing, Derbyshire survey, 1997 (Tech. Rep. WHOI-99-08). Woods Hole, MA: Woods Hole Oceanographic Institution.

Hu, J., You, S., & Neumann, U. (2003). Approaches to large-scale urban modeling. IEEE Computer Graphics and Applications, 23, 62–69.

Huber, P.J. (1981). Robust statistics. Hoboken, NJ: Wiley.

Jaffe, J. (1990). Computer modeling and the design of optimal underwater imaging systems. IEEE Journal of Oceanic Engineering, 15(2), 101–111.

Jenkin, M., Hogue, A., German, A., Gill, S., Topol, A., & Wilson, S. (2008). Modeling underwater structures. International Journal of Cognitive Informatics and Natural Intelligence, 2(4), 1–14.

Johnson-Roberson, M., Pizarro, O., & Williams, S. (2009, May). Large scale optical and acoustic sensor integration for visualization. In MTS/IEEE Oceans, Bremen, Germany.

Kazhdan, M. (2005, July). Reconstruction of solid models from oriented point sets. In SGP '05: Proceedings of the Third Eurographics Symposium on Geometry Processing, Vienna, Austria.

Kazhdan, M., Bolitho, M., & Hoppe, H. (2006, June). Poisson surface reconstruction. In SGP '06: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, Cagliari, Sardinia (pp. 61–70).

Kelley, D.S., Karson, J.A., Fruh-Green, G.L., Yoerger, D.R., Shank, T.M., Butterfield, D.A., Hayes, J.M., Schrenk, M.O., Olson, E.J., Proskurowski, G., Jakuba, M., Bradley, A., Larson, B., Ludwig, K., Glickson, D., Buckman, K., Bradley, A.S., Brazelton, W.J., Roe, K., Elend, M.J., Delacour, A., Bernasconi, S.M., Lilley, M.D., Baross, J.A., Summons, R.E., & Sylva, S.P. (2005). A serpentinite-hosted ecosystem: The Lost City Hydrothermal Field. Science, 307, 1428–1434.

Kim, J., & Sukkarieh, S. (2004). Autonomous airborne navigation in unknown terrain environments. IEEE Transactions on Aerospace and Electronic Systems, 40, 1031–1045.

Lemaire, T., Berger, C., Jung, I.-K., & Lacroix, S. (2007). Vision-based SLAM: Stereo and monocular approaches. International Journal of Computer Vision, 74, 343–364.

Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., & Fulk, D. (2000, July). The Digital Michelangelo Project: 3D scanning of large statues. In Proceedings of ACM SIGGRAPH, New Orleans, LA (pp. 131–144). New York: ACM.

Liou, D., Huang, Y., & Reynolds, N. (1990, September). A new microcomputer based imaging system with C3 technique. In TENCON 90. 1990 IEEE Region 10 Conference on Computer and Communication Systems, Hong Kong (vol. 2, pp. 555–559).

Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

Lucas, B., & Kanade, T. (1981, August). An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI), Vancouver, Canada (pp. 674–679).

Ma, X., Wu, J., & Shen, X. (2007). A new progressive mesh with adaptive subdivision for LOD models. Lecture Notes in Computer Science. Heidelberg, Germany: Springer-Verlag.

in Computer Science. Heidelberg, Germany: Springer-Verlag.

Mahon, I. (2008). Vision-based navigation for autonomousunderwater vehicles. Unpublished doctoral dissertation,University of Sydney.

Mahon, I., Williams, S., Pizarro, O., & Johnson-Roberson, M.(2008). Efficient view-based SLAM using visual loop clo-sures. IEEE Transactions on Robotics, 24, 1002–1014.

Marks, T., Howard, A., Bajracharya, M., Cottrell, G., &Matthies, L. (2008, May). Gamma-SLAM: Using stereo vi-sion and variance grid maps for SLAM in unstructuredenvironments. In International Conference on Roboticsand Automation, Pasadena, CA (pp. 3717–3724).

Matthies, L., & Shafer, S. (1987). Error modeling in stereo nav-igation. IEEE Journal of Robotics and Automation, 3(3),239–248.

Negahdaripour, S., & Firoozfam, P. (2006). An ROV stereo-vision system for ship-hull inspection. IEEE Journal ofOceanic Engineering, 31(3), 551–564.

Negahdaripour, S., & Madjidi, H. (2003). Stereovision imag-ing on submersible platforms for 3-D mapping of benthichabitats and sea-floor structures. IEEE Journal of OceanicEngineering, 28(4), 625–650.

Negahdaripour, S., Xu, X., & Jin, L. (1999). Direct esti-mation of motion from seafloor images for automaticstation-keeping of submersible platforms. IEEE Journal ofOceanic Engineering, 24(3), 370–382.

Negahdaripour, S., & Xun, X. (2002). Mosaic-based position-ing and improved motion-estimation methods for auto-matic navigation of submersible vehicles. IEEE Journal ofOceanic Engineering, 27(1), 79–99.

Nicosevici, T., & Garcia, R. (2008, September). Online robust 3Dmapping using structure from motion cues. In MTS/IEEEOceans, Quebec, Canada (pp. 1–7).

NVIDIA, C. (2008). NVIDIA texture tools 2. (http://developer.nvidia.com/object/texture tools.html). Retrieved 15 May2009.

Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., & Seidel,H.-P. (2003). Multi-level partition of unity implicits. ACMTransactions on Graphics, 22(3), 463–470.

Oka, M., Tsutsui, K., Ohba, A., Kurauchi, Y., & Tago, T. (1987,July). Real-time manipulation of texture-mapped surfaces.In ACM SIGGRAPH: Computer graphics and interactivetechniques, Anaheim, CA (vol. 21, pp. 181–188). NewYork: ACM.

OpenSceneGraph. (n.d.). http://www.openscenegraph.org/projects/osg. Accessed 10 May 2009.

Pizarro, O., Eustice, R., & Singh, H. (2004, November). Largearea 3D reconstructions from underwater surveys. InMTS/IEEE Oceans, Kobe, Japan (vol. 2, pp. 678–687).

Pizarro, O., & Singh, H. (2003). Toward large-area mosaic-ing for underwater scientific applications. IEEE Journal ofOceanic Engineering, 28(4), 651–672.

Pollefeys, M., Koch, R., Vergauwen, M., & Gool, L.V. (2000).Automated reconstruction of 3D scenes from sequencesof images. ISPRS Journal of Photogrammetry and RemoteSensing, 55, 251–267.

Pollefeys, M., Nister, D., Frahm, J.M., Akbarzadeh, A.,Mordohai, P., Clipp, B., et al. (2008). Detailed real-time ur-ban 3D reconstruction from video. International Journal ofComputer Vision, 78, 143–167.

Ramos, F., Chover, M., Ripolles, O., & Granell, C. (2006).Continuous level of detail on graphics hardware. Lec-ture Notes in Computer Science, 4245 (pp. 460–469).Heidelberg, Germany: Springer-Verlag.

Rekleitis, I., Bedwani, J., & Dupuis, E. (2007, October). Over-the-horizon, autonomous navigation for planetary explo-ration. In IEEE/RSJ International Conference on Intelli-gent Robots and Systems, San Diego, CA (pp. 2248–2255).

Roman, C., & Singh, H. (2007). A self-consistent bathymetricmapping algorithm. Journal of Field Robotics, 24(1–2), 23–50.

Saez, J., Hogue, A., Escolano, F., & Jenkin, M. (2006, May).Underwater 3D SLAM through entropy minimization. InInternational Conference on Robotics and Automation,Orlando, FL (pp. 3562–3567).

Sawhney, H., Hsu, S., & Kumar, R. (1998, June). Robust videomosaicing through topology inference and local to globalalignment. In European Conference on Computer Vision,Freiburg, Germany (pp. 103–119).

Sawhney, H., & Kumar, R. (1999). True multi-image alignmentand its application to mosaicing and lens distortion cor-rection. IEEE Transactions on Pattern Analysis and Ma-chine Intelligence, 21(3), 235–243.

Scharstein, D., & Szeliski, R. (2002). A taxonomy and eval-uation of dense two-frame stereo correspondence algo-rithms. International Journal of Computer Vision, 47, 7–42.

Segal, M., & Akeley, K. (2003). The OpenGL graphics system:A specification (Version 1.5). URL: http://www.opengl.org/documentation/specs/version1.5/glspec15.pdf. Re-trieved 17 May 2009.

Seitz, S., Curless, B., Diebel, J., Scharstein, D., & Szeliski, R.(2006, June). A comparison and evaluation of multi-viewstereo reconstruction algorithms. In IEEE Computer Soci-ety Conference on Computer Vision and Pattern Recogni-tion, 2006, New York, NY (vol. 1, pp. 519–528).

Shlyakhter, I., Rozenoer, M., Dorsey, J., & Teller, S. (2001).Reconstructing 3D tree models from instrumented pho-tographs. IEEE Computer Graphics and Applications, 21,53–61.

Singh, H., Can, A., Eustice, R., Lerner, S., McPhee, N.,Pizarro, O., & Roman, C. (2004). SeaBED AUV offers newplatform for high-resolution imaging. EOS, Transactions,American Geophysical Union, 85(31), 289, 294–295.

Singh, H., Armstrong, R., Gilbes, F., Eustice, R., Roman, C., Pizarro, O., & Torres, J. (2004). Imaging coral I: Imaging coral habitats with the SeaBED AUV. Subsurface Sensing Technologies and Applications, 5(1), 25–42.

Singh, H., Howland, J., & Pizarro, O. (2004). Advances in large-area photomosaicking underwater. IEEE Journal of Oceanic Engineering, 29(3), 872–886.

Singh, H., Roman, C., Pizarro, O., Eustice, R., & Can, A. (2007). Towards high-resolution imaging from underwater vehicles. International Journal of Robotics Research, 26(1), 55–74.

Sinha, S., Frahm, J.-M., Pollefeys, M., & Genc, Y. (2007). Feature tracking and matching in video using programmable graphics hardware. Machine Vision and Applications, DOI 10.1007/s00138-007-0105-7 (online).

Smits, B. (1999). Efficiency issues for ray tracing. Journal of Graphics Tools, 6, 1–14.

Steder, B., Grisetti, G., Grzonka, S., Stachniss, C., Rottmann, A., & Burgard, W. (2007, October). Learning maps in 3D using attitude and noisy vision sensors. In IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA (pp. 644–649).

Strengert, M., Kraus, M., & Ertl, T. (2006, November). Pyramid methods in GPU-based image processing. In Proceedings of Vision, Modeling, and Visualization, Aachen, Germany (pp. 169–176).

Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., Fong, P., Gale, J., Halpenny, M., Hoffmann, G., Lau, K., Oakley, C., Palatucci, M., Pratt, V., Stang, P., Strohband, S., Dupont, C., Jendrossek, L.-E., Koelen, C., Markey, C., Rummel, C., van Niekerk, J., Jensen, E., Alessandrini, P., Bradski, G., Davies, B., Ettinger, S., Kaehler, A., Nefian, A., & Mahoney, P. (2006). Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9), 661–692.

Thrun, S., Thayer, S., Whittaker, W., Baker, C., Burgard, W., Ferguson, D., Haehnel, D., Montemerlo, M., Morris, A.C., Omohundro, Z., Reverte, C., & Whittaker, W.L. (2004). Autonomous exploration and mapping of abandoned mines. IEEE Robotics and Automation Magazine, 11(4), 79–91.

Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9, 137–154.

Turk, G., & Levoy, M. (1994). Zippered polygon meshes from range images. In ACM SIGGRAPH: Computer graphics and interactive techniques (pp. 311–318). New York: ACM.

Uyttendaele, M., Eden, A., & Szeliski, R. (2001, December). Eliminating ghosting and exposure artifacts in image mosaics. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI (vol. 2, pp. II-509–II-516).

Walter, M., Hover, F., & Leonard, J. (2008, May). SLAM for ship hull inspection using exactly sparse extended information filters. In International Conference on Robotics and Automation, Pasadena, CA (pp. 1463–1470).

Webster, J.M., Beaman, R.J., Bridge, T., Davies, P.J., Byrne, M., Williams, S., Manning, P., Pizarro, O., Thornborough, K., Woolsey, E., Thomas, A., & Tudhope, S. (2008). From corals to canyons: The Great Barrier Reef margin. EOS, Transactions, American Geophysical Union, 89, 217.

Williams, L. (1983). Pyramidal parametrics. ACM SIGGRAPH Computer Graphics, 17, 1–11.

Williams, S., & Mahon, I. (2004, April). Simultaneous localisation and mapping on the Great Barrier Reef. In International Conference on Robotics and Automation, New Orleans, LA (vol. 2, pp. 1771–1776).

Williams, S., Pizarro, O., Jakuba, M., & Barrett, N. (2009, July). AUV benthic habitat mapping in south eastern Tasmania. In International Conference on Field and Service Robotics, Cambridge, MA.

Williams, S.B., Pizarro, O., Webster, J., Beaman, R., Johnson-Roberson, M., Mahon, I., & Bridge, T. (2008, September). AUV-assisted surveying of relic reef sites. In MTS/IEEE Oceans, Quebec, Canada.

Yoerger, D.R., Jakuba, M., Bradley, A.M., & Bingham, B. (2007). Techniques for deep sea near bottom survey using an autonomous underwater vehicle. International Journal of Robotics Research, 26, 41–54.

Zuiderveld, K. (1994). Contrast limited adaptive histogram equalization. In Graphics Gems IV (pp. 478–485). Academic Press.


