
Journal of Machine Learning Research (2006) Submitted 3/2006; Published ?

Towards Learned Traversability for Robot Navigation:

From Underfoot to the Far Field

Andrew Howard, Michael Turmon, Anelia Angelova, Larry Matthies, Benyang Tang [email protected]
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 91109, USA

Eric Mjolsness [email protected]
Department of Computer Science, University of California, Irvine, CA, 92697, USA

Editors: Jane Mulligan, Greg Grudic

Abstract

Autonomous off-road navigation of robotic ground vehicles has important applications on Earth and in space exploration. Progress in this domain has been retarded by the limited lookahead range of 3-D sensors and by the difficulty of heuristically programming systems to understand the traversability of the wide variety of terrain they can encounter. Enabling robots to learn from experience may alleviate both of these problems. We define two paradigms for this, learning from 3-D geometry and learning from proprioception, and describe initial instantiations of them as developed under DARPA and NASA programs. Field test results show promise for learning traversability of vegetated terrain, learning to extend the lookahead range of the vision system, and learning how slip varies with slope.

1. Introduction

Robotic ground vehicles for outdoor applications have achieved some remarkable successes, notably in autonomous highway following (Dickmanns, 1992; Pomerleau, 1996), planetary exploration (Bapna, 1998; Biesiadecki, 2005; Leger, 2005; Maimone, 2006), and off-road navigation on Earth (Lacaze, 2002; Bodt, 2004; Krotkov, 2006). Nevertheless, major challenges remain to enable reliable, high-speed, autonomous navigation in a wide variety of complex, off-road terrain. 3-D perception of terrain geometry with imaging range sensors is the mainstay of off-road driving systems. However, the stopping distance at high speed exceeds the effective lookahead distance of existing range sensors. Moreover, sensing only terrain geometry fails to reveal mechanical properties of terrain that are critical to assessing its traversability, such as potential for slippage, sinkage, and the degree of compliance of potential obstacles. Rovers in the Mars Exploration Rover (MER) mission have become stuck in sand dunes and experienced significant downhill slippage in the vicinity of large rock hazards.


[Figure 1 diagram: stereo cameras → range image → elevation map → local cost map → global cost map → path planning → steering commands]

Figure 1: LAGR robot (left), Rocky 8 robot (center), and a simple view of their baseline navigation software architecture (right). Both robots are just over 1 meter long.

Earth-based off-road robots today have very limited ability to discriminate traversable vegetation from non-traversable vegetation or rough ground. It is impossible today to preprogram a system with knowledge of these properties for all types of terrain and weather conditions that might be encountered. The 2005 DARPA Grand Challenge robot race, despite its impressive success, faced few of these issues, since the route was largely or completely on smooth, hard, relatively low-slip surfaces with sparse obstacles and no dense, vegetated ground cover on the route itself.

Learning may alleviate these limitations. In particular, 3-D geometric properties of obstacle vs. drivable terrain are often correlated with terrain appearance (e.g., color and texture) in 2-D imagery. A close-range 3-D terrain analysis could then produce training data sufficient to estimate the traversability of terrain beyond 3-D sensing range based only on its appearance in imagery. We call this learning from 3-D geometry (Lf3D). In principle, information about mechanical properties of terrain is available from low-level sensor feedback as a robot drives over the terrain, for example from contact switches on bumpers, slip measurements produced by wheel encoders and other sensors, and roughness measurements produced by gyros and accelerometers in the robot's inertial measurement unit (IMU). Recording associations between such low-level traversability feedback and visual appearance may allow prediction of these mechanical properties from visual appearance alone; we call this learning from proprioception (LfP). While learning-related methods have a long, extensive history of use for image classification and robot road-following (e.g., Pomerleau (1989)), work in the paradigms described here is quite limited. LfP has been addressed recently in formulations aimed at estimating where the ground surface lies under vegetation and closely related work (Wellington, 2005). We are unaware of work closely related to the Lf3D paradigm.

This paper outlines some key issues, approaches, and initial results for learning for off-road navigation. We describe work in the DARPA-funded Learning Applied to Ground Robotics (LAGR) program and the NASA-funded Mars Technology Program (MTP). Both use wheeled robotic vehicles with stereo vision as the primary 3-D sensor, augmented by an IMU, wheel encoders, and in LAGR, GPS; they also use similar software architectures for autonomous navigation (Figure 1). Section 2 outlines these architectures and how they need to change to address Lf3D and LfP. Sections 3, 4, and 5 present results of our initial work on Lf3D and two flavors of LfP, one aimed at learning about vegetation and the other aimed at learning about slip. Our work to date necessarily stresses simple methods with real-time performance, due to the demonstration-oriented nature of the LAGR and MTP programs; nevertheless, the results justify the value of our approaches and their potential to evolve to more sophisticated methods.

2. Architectures and issues

The baseline navigation software architecture in both the LAGR and MTP programs operates roughly as follows (Figure 1, right panel). Stereo image pairs are processed into range imagery, which is converted to local elevation maps on a ground plane grid with cells roughly 20 cm square covering five to ten meters in front of the vehicle, depending on camera height and resolution. The image and the map are the two basic coordinate systems used, but only pixels with nonzero stereo disparity can be placed into the map. Geometry-based traversability analysis heuristics are used to produce local, grid-based, “traversability cost” maps over the local map area, with a real number representing traversability in each map cell. The local elevation and cost maps are accumulated in a global map as the robot drives. Path planning algorithms for local obstacle avoidance and global route planning are applied to the global map; the resulting path is used to derive steering commands sent to the motor controllers.
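To make the data flow concrete, the sketch below shows one way the mapping step could be implemented (Python; the array shapes, cell indexing, and the height-based cost rule are illustrative assumptions, not the LAGR or MTP code).

import numpy as np

CELL = 0.20  # map cell size in meters, as described in the text

def elevation_and_cost(points_xyz, map_extent=10.0):
    """Bin 3-D stereo points (x forward, y left, z up, robot frame) into a
    ground-plane grid and apply a simple height-based cost heuristic."""
    n = int(map_extent / CELL)
    elev = np.full((n, n), np.nan)  # per-cell maximum elevation
    for x, y, z in points_xyz:
        i, j = int(x / CELL), int(y / CELL + n / 2)
        if 0 <= i < n and 0 <= j < n:
            elev[i, j] = z if np.isnan(elev[i, j]) else max(elev[i, j], z)
    # Placeholder heuristic: taller cells get higher cost; unseen cells stay NaN.
    cost = np.clip(elev / 0.5, 0.0, 1.0)
    return elev, cost

A real system would accumulate these local maps into the global map and hand the per-cell costs to the planner.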

This description illustrates both the source of the myopia and the lack of in-depth terrain understanding of traditional systems: 1) the extent of the elevation map is limited to the distance at which stereo (or ladar) gets range data with acceptable resolution on the ground plane, and 2) the local map encodes only elevation, possibly enhanced with terrain class information derived from color or other properties of the image or range data, but with at best only crude prior knowledge of the mechanical properties of each terrain class. Architectures similar to this have dominated DARPA, Army, and NASA robotic vehicle programs to date, though details in each box vary (Stentz, 1995; NRC, 2002; Lacaze, 2002; Maimone, 2006; Krotkov, 2006).

Figure 2 schematically illustrates the proprioceptive and stereo information available to the robot in image and map coordinates, and how this information relates to Lf3D, LfP, and richer local map representations. We divide the scene into four regions — underfoot, near-field, mid-field, and far-field — and use s to indicate position. Underfoot, the robot has proprioceptive sensors (accelerometers, gyroscopes, wheel encoders and bumpers) that provide direct measurements u of the terrain beneath the robot. Additionally, the terrain geometry underfoot is known because it is present in previous maps; however, imagery is not available.


[Figure 2 panels. Image space: sky 21%, near-field 70%, mid-field 7%, far-field 2%. Map space: underfoot (< 1 m; {s}, u), near-field (1–10 m; {s}, v), mid-field (10–50 m; s, v), far-field (> 50 m; v)]

Figure 2: Typical information zones from proprioception and stereo (image space, left; map space, right), with specific numbers for the LAGR robot. See text for discussion.

In the near-field, stereo vision gets range data of sufficient density and accuracy to build a gridded local elevation map, where the grid spacing is set by the robot's size. The near-field is distinguished by the property that enough sites S = {s} land in one map cell to collect meaningful height or roughness statistics at the scale of the robot's footprint. For example, roughness could be measured by the standard deviation of the height component of s ∈ S. In the near-field, color and texture information (collectively, “visual appearance” v) is also available for insertion into the map. In the mid-field, range data and visual appearance are available. However, the range data samples the ground too sparsely to create a useful elevation map: we have range s but not range statistics {s}. The far-field region is beyond the range of stereo vision (it has zero disparity), so only visual appearance is available.
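As a concrete example of a near-field statistic, the roughness measure mentioned above (the standard deviation of point heights within a cell) could be computed roughly as follows; the dictionary-of-cells layout and the minimum point count are assumptions for illustration.

from collections import defaultdict
import numpy as np

def cell_roughness(points_xyz, cell=0.20, min_points=10):
    """Group stereo points by map cell and return the per-cell standard
    deviation of height, i.e. a roughness statistic over the sites S = {s}."""
    cells = defaultdict(list)
    for x, y, z in points_xyz:
        cells[(int(x // cell), int(y // cell))].append(z)
    return {key: float(np.std(zs)) for key, zs in cells.items() if len(zs) >= min_points}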

Image pixels relate nonlinearly to ground areas, magnifying the importance of the mid-field and far-field to long-range planning. To make things concrete, in LAGR, the near-field is about 70% of the image, the mid-field is 7%, and the far-field is 2%. However, on the ground plane, the near-field covers about 1–10 m, the mid-field from 10–50 m, and the far-field from 50 m to infinity (right side of Figure 2).

Given this view, our problem can be cast as transferring knowledge between the adjacent distance regimes in Figure 2: (1) between underfoot and near-field (proprioception vs. appearance plus rich geometry), (2) between near-field and mid-field (appearance plus rich geometry vs. appearance plus poor geometry), and (3) between mid-field and far-field (appearance plus poor geometry vs. appearance only). Learning will extend the effective lookahead distance of the sensors by using the learned correlations to ascribe properties sensed proprioceptively or geometrically in the closer zones to regions sensed just by appearance or weaker geometric perception in the more distant zones. The same obstacle classes, and sometimes the same obstacle, will be present across all zones. Our ultimate goal is to jointly estimate terrain traversability across zones, unifying the Lf3D and LfP concepts, and encompassing slippage, sinkage, and obstacle compliance in the notion of traversability.


Proxies and learned estimators of traversability

Traversability T is a random variable associated with a certain site s, either a pixel in the scene or a cell in the map. When T_s is associated with a pixel, it must be placed in the map to affect the route planner (Figure 1, right panel); for more on this, see the end of Section 4. T_s always takes values in the unit interval, but depending on context, we may take it to be binary (e.g., bumper hits) or real-valued (e.g., wheel slip). In making the link to path planning, it may be helpful to define T_s as the probability that the robot can successfully move out of a map cell s after deciding to do so. We could imagine a physics-based simulation that would determine this exit probability given vehicle and terrain parameters. Accumulating this T over a path would then yield the cumulative probability of a successful sequence of moves.
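One way to write the accumulation mentioned above: if the per-cell exit probabilities are treated as independent (our simplifying assumption, made only for illustration), the probability of completing a path through cells s_1, ..., s_n factors as

P(\text{path } s_1, \dots, s_n \text{ succeeds}) \;=\; \prod_{i=1}^{n} T_{s_i}.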

Lacking such a model, we view T as a random variable to be estimated from correlated information, where the estimator is in turn learned from training data. To learn traversability in a given zone, our strategy is to use high-quality input examples (typically, from a zone nearer the robot, as in Figure 2) to produce training labels T̃, which serve as proxies for the unknown T. This labeled data can be regarded as produced by a (noisy) membership ‘oracle’ (Valiant, 1984). The proxy labels are given to a learning algorithm which trains a regression model T̂(·) that approximates T̃. The regression model is then used to drive the robot.

In Lf3D, terrain geometry, measured through local elevation statistics like roughness and slope, is used to provide the proxy T̃, which is estimated using appearance information (normalized color in the work reported here). In LfP, the proprioceptive inputs (e.g., bumper hits and slip) are used to generate the proxy T̃, which is then estimated using the available appearance and geometry information from stereo images. Other approaches also fit into this framework. For example, back-tracking bumped objects through past frames can be viewed as using prior appearances of the object to compute the proxy T̃. Fundamentally, we construct a proxy for traversability using higher-quality training data, accumulate a training set, and then select a regressor that is a function of lower-quality data at greater range. The next sections show how this idea is used in three different extrapolation schemes.

3. Learning near-field traversability from proprioception

In the LAGR program, we are using the LfP paradigm to address the key problem of learning about traversability of vegetation. For robots in general, the bumper, IMU, and slip measurements ultimately will all be important in assessing traversability underfoot. In practice, for the robot and terrain used in the LAGR program to date, the bumper provides most of the information, so we currently take the proxy T̃ to be a 0/1 quantity. Operationally, we can gather samples of T̃ by recording the geometric and visual characteristics ({s}, v) of objects we can and cannot push through. In principle, each bumper hit provides several frames of prior presentations of the offending object. However, due to limitations of localization, especially under conditions of partial slip, bumper hits are a very sparse source of data.


Also, because bumper hits temporarily disable the LAGR robot, gathering non-traversable examples is expensive.

Furthermore, a technical problem of blame attribution arises because roughly six map cells are overlapped by the bumper at any time, so the nontraversable samples are contaminated with data from traversable cells. Heuristics alone may prove sufficient to narrow down blame to one cell, or a constrained clustering approach may be needed to separate these two classes. In these experiments, we have sidestepped the blame attribution problem by obtaining training data from hand-labeled image sequences: a human identifies sets of traversable and untraversable map cells.

Terrain representation

Elevation maps per se do not adequately capture the geometry of vegetated and forested terrain. Three-dimensional voxel density representations have been used successfully with range data from ladar (Lacaze, 2002). We are experimenting with such a representation for range data from stereo vision. The space around the robot is represented by a regular three-dimensional grid of 20 cm × 20 cm × 10 cm high voxels (Figure 3, top left). Intuitively, we expect that only low-density voxels will be penetrable. The voxel density grid is constructed from range images by ray-tracing: for each voxel, we record both the number of passes (rays that intersect the voxel) and the number of hits (rays that terminate in the voxel). The per-voxel density ρ equals the ratio of hits to passes. Since the ground may be non-planar, we also identify a ground voxel g in each voxel column; we assume that this voxel represents the surface of support for a robot traversing this column. The ground voxel is determined using a simple heuristic that locates the lowest voxel whose density exceeds some preset threshold. Although calculating it is relatively complex, in practice the density estimate is robust and rich in information.
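A minimal sketch of the hits/passes computation, the ground-voxel heuristic, and the reduced features introduced just below is given here; the ray marching is a coarse stand-in for exact voxel ray tracing, and the grid shape, step size, and density threshold are made-up placeholders.

import numpy as np

VOXEL = np.array([0.20, 0.20, 0.10])  # voxel size in meters, as in the text
GROUND_DENSITY_THRESH = 0.5           # placeholder threshold (assumption)

def voxel_density(origin, endpoints, shape=(50, 50, 32), step=0.05):
    """Accumulate per-voxel hits and passes by marching along each stereo ray,
    then return density = hits / passes (0 where a voxel was never crossed).
    Coordinates are assumed non-negative in the local map frame."""
    hits, passes = np.zeros(shape), np.zeros(shape)
    origin = np.asarray(origin, dtype=float)
    for end in endpoints:
        end = np.asarray(end, dtype=float)
        length = np.linalg.norm(end - origin)
        direction = (end - origin) / max(length, 1e-9)
        for t in np.arange(0.0, length, step):  # voxels the ray passes through
            idx = tuple(int(i) for i in (origin + t * direction) // VOXEL)
            if all(0 <= i < s for i, s in zip(idx, shape)):
                passes[idx] += 1
        idx = tuple(int(i) for i in end // VOXEL)  # voxel the ray terminates in
        if all(0 <= i < s for i, s in zip(idx, shape)):
            hits[idx] += 1
            passes[idx] += 1
    return np.where(passes > 0, hits / np.maximum(passes, 1), 0.0)

def ground_voxel(column):
    """Lowest voxel in a column whose density exceeds the preset threshold."""
    above = np.nonzero(column > GROUND_DENSITY_THRESH)[0]
    return int(above[0]) if above.size else None

def density_features(column, g):
    """Reduced feature set from the above-ground part of a column: the largest
    density (rho*) and the next-highest (rho**), as used below."""
    top = np.sort(column[g + 1:])[::-1]
    return (float(top[0]), float(top[1])) if top.size >= 2 else (0.0, 0.0)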

Each map cell s has an above-ground density column [ρ_{g(s)+1}, ρ_{g(s)+2}, ..., ρ_{32}]. For simplicity, we have started with the following reduced feature set: ρ*, the maximum density; ρ**, the next-highest; i*, the height of ρ*; and i**, the height of ρ**. We have used ρ* and ρ** below, but ρ* and i* provide similar performance when used together. The average color within a map cell would also be a good feature, but we have not used this in classifications yet.

Learning algorithm

Initially, we wanted to validate the use of the density features and to replace our existing hand-coded, geometry-based traversability cost heuristic with a learned value. In this offline context, training time is not an issue, so we use a Support Vector Machine (SVM) classifier. We used a radial basis function kernel, with the SVM hyper-parameters estimated by cross-validation. The training data consisted of 2000 traversable and 2000 non-traversable examples, and the resulting model has 784 support vectors (SVs).


[Figure 3 panels: voxel density schematic (camera, ground voxels, ρ = 0 penetrable, 0 < ρ < 1, ρ = 1 impenetrable); cost lookup table over ρ* ('max1') vs. ρ** ('max2') with values from −1.0 to 1.0]

Figure 3: Learning from proprioception. Left: schematic illustrating the voxel density map representation (left), sample camera image (middle), and its projection onto a local map (each map cell is colored with the mean of all pixels projecting into it). Bottom left: learned cost lookup table as a function of ρ* and ρ**. Bottom right: cost map computed from voxel densities; green is traversable, red is not.

The large number of SVs for a relatively modest two-dimensional problem indicates a considerable degree of overlap between the classes (which is borne out if one generates scatter plots of the data). Tests were performed on an independent image sequence which contains roughly 2000 examples. We achieved a classification error rate of 14% on the test set, again indicative of strong class overlap from these limited features.

Classification is done at frame rates of 2–5 Hz, so SVM query time would be prohibitive. We therefore coded the SVM into a lookup table (LUT) for speed and simplicity, but a reduced-set SVM would be another alternative (Scholkopf, 1999).
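The train-then-tabulate step could look roughly like the following sketch; scikit-learn is used purely for illustration (the paper does not say which SVM implementation was used), and the grid resolution and the linear mapping from decision value to traversability are assumptions.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X, y):
    """RBF-kernel SVM, hyper-parameters chosen by cross-validation, trained on
    (rho*, rho**) features with 0/1 traversability labels."""
    grid = {"C": [1, 10, 100], "gamma": [0.1, 1.0, 10.0]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_

def build_lut(svm, resolution=64):
    """Tabulate the SVM decision value on a regular grid over the unit square,
    so that classification at 2-5 Hz reduces to a table lookup."""
    axis = np.linspace(0.0, 1.0, resolution)
    g1, g2 = np.meshgrid(axis, axis, indexing="ij")
    scores = svm.decision_function(np.column_stack([g1.ravel(), g2.ravel()]))
    return scores.reshape(resolution, resolution)

def lut_traversability(lut, rho1, rho2):
    """Assumed linear squashing of the tabulated SVM output into [0, 1]."""
    n = lut.shape[0] - 1
    s = lut[int(rho1 * n), int(rho2 * n)]
    return float(np.clip(0.5 + 0.5 * s, 0.0, 1.0))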


Figure 4: Back-projected LfP results for three frames of a LAGR trial. Pixels corresponding to traversable map cells are shown with a green overlay; pixels corresponding to non-traversable cells are shown in purple and white.

The continuous output of the SVM (Figure 3, right) is turned into a traversability measure through a simple linear function.

The results of SVM classification and the LUT are shown in Figure 4. The learned classifier is able to distinguish between traversable areas (green) and non-traversable areas (purple); intermediate values are also shown (white). Unfortunately, the design of the LAGR tests to date is such that the potential of this approach has not been fully explored. The test courses, in particular, have been almost entirely ‘binary’, consisting of close-cropped grass or unvegetated terrain (traversable) and tall bushes and trees (non-traversable). At no point has the robot been required to push through low vegetation in order to reach the goal. Under these conditions, simpler approaches (such as those based on pure elevation) may suffice.

4. Learning mid- and far-field traversability from near-field 3-D geometry

To address another goal of the LAGR program, we are using the Lf3D paradigm to extend near-field range-based proxies T̃ to mid-field and far-field traversability estimates T̂. Here T̃ is a function of the heights of all pixels landing in a (20 cm)² map cell. When at least ten pixels land in one cell, their average height z̄ above a nominal ground plane becomes resolvable: a large value indicates rough ground or obstacles. We compute a traversability proxy T̃ = f(z̄) for the cell, which is associated with the visual appearance v of all pixels mapping into that cell, thus providing a training set 𝒯 of (v, T̃) pairs. We use this 𝒯 to select an extrapolating function T̂ = T̂(v) from visual appearance to traversability.

We currently use two appearance-based features: the normalized R and G components of the RGB color; i.e., v1 = R/(R+G+B), v2 = G/(R+G+B). These features are chosen to provide some degree of robustness to variable lighting conditions and shadows. We also have a choice for the amount of training data we use: at one extreme, we can train and extrapolate within a single frame only; at the other extreme, we can train over many hundreds or thousands of frames, and use the learned regressor over all subsequent frames. Given that the current feature set is relatively weak, we have mainly pursued the former approach. We thus assume that appearance and traversability are well-correlated within a single image, but do not assume that they are well-correlated over time.
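A minimal sketch of how the near-field training pairs could be assembled from these ingredients (per-cell mean height as the proxy, normalized color as the appearance features); the per-pixel bookkeeping is simplified and f is taken to be the identity, both assumptions for illustration.

from collections import defaultdict
import numpy as np

def lf3d_training_set(pixels, cell=0.20, min_pix=10):
    """pixels: iterable of (x, y, z, r, g, b) for near-field stereo pixels.
    Returns (v, proxy) pairs: normalized color features vs. a height proxy."""
    cells = defaultdict(list)
    for x, y, z, r, g, b in pixels:
        cells[(int(x // cell), int(y // cell))].append((z, r, g, b))
    samples = []
    for values in cells.values():
        if len(values) < min_pix:            # need enough pixels to resolve z_bar
            continue
        proxy = float(np.mean([v[0] for v in values]))   # f(z_bar) = z_bar here
        for z, r, g, b in values:
            s = float(r + g + b) or 1.0
            samples.append(((r / s, g / s), proxy))      # v1, v2 as in the text
    return samples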

For the single-frame case, speedy training and evaluation are required, prompting the reduction of T̂ to a parameterized model. Below, we consider two approaches: unsupervised k-means clustering followed by regression, and supervised discriminant analysis with Mixtures of Gaussians (MoG); a close variant of the first was used in LAGR Test 7.

Unsupervised K-means regression

The geometry-based proxy is itself heuristic, so we might prefer to use T̃ somewhat weakly. We had success with unsupervised clustering of the input pixel appearance, followed by deducing the per-cluster traversability from the average proxy value within each cluster. That is, we discard the T̃ labels within 𝒯 and perform a k-means clustering with K = 4. The traversability estimate is a weighted average of per-cluster traversabilities

\hat{T}(v) = \frac{\sum_{k=1}^{K} \tilde{T}_k \exp(-\|v - \mu_k\|^2 / 2\sigma^2)}{\sum_{k=1}^{K} \exp(-\|v - \mu_k\|^2 / 2\sigma^2)},

where T̃_k is the average traversability proxy value per cluster, µ_k is the kth cluster center, and ‖·‖ is the Euclidean norm. This can be viewed as nearest-neighbors regression which uses a k-means data compression step to allow fast evaluation of T̂(v) at classification time. It also corresponds to a mixture of factor analyzers structure (Ghahramani and Hinton, 1997) which uses only constant factors, a view which clarifies how it could be extended to higher-dimensional feature vectors without a prohibitive increase in evaluation time.
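A direct NumPy/scikit-learn rendering of this clustering-plus-weighted-average step is sketched below (K and the kernel width sigma are free parameters; this is an illustration of the displayed formula, not the field code).

import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_regressor(V, proxy, K=4):
    """Cluster appearance features V (N x 2) and attach to each cluster the mean
    traversability proxy of its members."""
    km = KMeans(n_clusters=K, n_init=10).fit(V)
    T_k = np.array([proxy[km.labels_ == k].mean() for k in range(K)])
    return km.cluster_centers_, T_k

def predict_traversability(v, centers, T_k, sigma=0.1):
    """Weighted average of per-cluster traversabilities (the formula above)."""
    w = np.exp(-np.sum((centers - np.asarray(v)) ** 2, axis=1) / (2.0 * sigma ** 2))
    return float(np.sum(T_k * w) / np.sum(w))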

We have used this method in frame-by-frame on-line learning with results that are illustrated, for three frames, in Figure 5. For each frame, the images on the bottom row show the rectified RGB image, the elevation training data, and the k-means regression result. Note that the training data has a somewhat ‘blocky’ structure, due to the fact that this data is back-projected from a 2D map (i.e., for each pixel in the image, we determine the corresponding cell in a 2D map and assign to the pixel the z̄ value for that cell). In the results image, very light and very dark pixels have been thresholded, which has the effect of removing both the sky (very bright) and some of the foreground bushes (very dark).

The top row of plots for each frame is generated in the feature space, and shows training data and k-means cluster centers (annotated with z̄), the alternate Mixture of Gaussians results (discussed in the next section), and the final k-means regression function. In the left-hand plots, the coherence of the training data indicates that the appearance-based clusters do indeed capture the traversability structure (“data compression” without too much lossiness). Comparing the right-hand plots across the three different frames, one can also observe that the relationship between appearance and traversability does indeed change over time (in the first two frames, reddish pixels are traversable; in the final frame, they are non-traversable).

Figure 7 (left) shows the learning error rates as a function of frame number over a single trial. Four curves are shown: training and test error with four classes, and training and test error with eight classes; all curves depict the RMS error between the regressed and measured elevation with σ = 0.1. Test data was generated by dividing each frame into bands of 32 columns, and using alternating bands for training and testing.

Two key features should be noted. First, the results are insensitive to the number of classes, suggesting that the limiting factor on training error is not the learning algorithm itself, but rather the relatively weak set of features (normalized R and G). Second, the training and test error rates track quite closely (up to some offset). This suggests that we may use the training error as a measure of reliability for the regressor, and selectively ignore the k-means traversability predictions on frames where the error is large.

Supervised MoG-based discriminants

It may be preferable to constrain the cluster membership a priori (using T̃ up front) rather than extracting clusters after the fact. At the expense of some reliance on a prior rule about association based on T̃, we may extract more stable and homogeneous clusters. We have experimented with three approaches: introducing T̃-based cannot-link constraints into k-means (Wagstaff, 2001), stratifying the cluster memberships according to T̃ within the EM algorithm in a semi-supervised framework (McLachlan, 2000, sec. 2.20), and adopting a two-class discriminant-based approach with populations determined by T̃. We describe the last approach, which is simple and effective for the problems we have seen.

In the discriminant method, rough thresholds are used to form sets of traversable and nontraversable examples: 𝒯₀ = {(v, T̃) : T̃ ≤ θ₀} and 𝒯₁ = {(v, T̃) : T̃ > θ₁}. We selected θ₀ = 0.1 m and θ₁ = 0.2 m: obstacles lower than θ₀ are very likely traversable, those higher than θ₁ are very likely not, and we remain agnostic about those in between. Two separate MoGs p0(v) and p1(v) are fit to the two training sets, and we declare a pixel traversable if p1(v)/p0(v) exceeds a threshold, which is set with reference to error rates on the training set.
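A rough equivalent using scikit-learn Gaussian mixtures is sketched below; the decision threshold (tuned against training-set error rates in the text) is left as a placeholder, and K = 3 components per class follows the next paragraph.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mog_discriminant(V0, V1, K=3):
    """Fit full-covariance mixtures p0 (traversable set) and p1 (non-traversable set)."""
    p0 = GaussianMixture(n_components=K, covariance_type="full").fit(V0)
    p1 = GaussianMixture(n_components=K, covariance_type="full").fit(V1)
    return p0, p1

def likelihood_ratio_class(v, p0, p1, log_threshold=0.0):
    """Compare log p1(v) - log p0(v) to a threshold (placeholder value) and
    return which side of the discriminant the pixel falls on (0 or 1)."""
    v = np.atleast_2d(v)
    return int(p1.score_samples(v)[0] - p0.score_samples(v)[0] > log_threshold)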

We have used K = 3 component, full-covariance Gaussian mixtures to parameterize each of the two distributions, and fit the parameters by maximum likelihood using the EM algorithm. The results are similar for 2 ≤ K ≤ 6. Figure 6 shows results for this method when used in a frame-by-frame mode. We found that using a full three-dimensional RGB feature v gave superior results to normalized color. The explanation may be that the full covariance structure in the mixtures can accommodate the spread caused by varying illumination. On the other hand, the illumination-stretched clumps do not mesh well with the implicit spherical assumption of k-means. We project the three-dimensional mixtures down into the R − G vs. G − B plane for purposes of visualization.

In each of the frame-by-frame results of Figure 6, there is reasonably good separation between the two classes. The classes sometimes have internal structure that would not be well captured by a single Gaussian. The test error achieved is 6%. Training time for N = 1000, three-dimensional data, and K = 3 is about 40 ms in unoptimized code. Evaluation time for 5000 pixels (about 9% of a 192 × 256 pixel image) is less than 10 ms, which easily permits training and evaluation at our path-planning rates of 2–5 Hz.

Putting traversability in the map

Note that both approaches classify terrain in the image space: each pixel is assigned a traversability estimate T̂. To use this result for navigation, these values must be projected into the map. There are two issues: how to combine traversability estimates and proxies, and how to determine the 3D location of image pixels. For data fusion, we currently allow traversability proxies derived from geometry (the near-field training set) to override traversability estimates inferred from appearance (the mid- and far-field query set). To project pixels from the image into the map, when the pixel has a non-zero disparity (near- and mid-field), 3D locations are computed by triangulation. Because of range uncertainty, this leads to maps that blur more with increasing range; at present, this is unavoidable. When disparity is zero (far-field), pixels can in principle be projected onto a nominal ground plane; currently, we ignore these pixels.
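The projection and fusion rules described here amount to a few lines per pixel; the sketch below uses an idealized pinhole/stereo model with no rectification or robot-frame transform, so the parameter names and the cell indexing are purely illustrative.

def project_pixel_to_map(u, v, disparity, fx, cx, cy, baseline, cell=0.20):
    """Triangulate a pixel with non-zero disparity and bin it into a ground-plane
    cell; zero-disparity (far-field) pixels are skipped, as in the current system."""
    if disparity <= 0:
        return None                              # far field: no range available
    depth = fx * baseline / disparity            # standard stereo triangulation
    lateral = (u - cx) * depth / fx
    return (int(depth // cell), int(lateral // cell))

def fuse_cell(appearance_estimate, geometry_proxy=None):
    """Geometry-derived proxies (near field) override appearance-based estimates
    (mid and far field), the data-fusion rule used at present."""
    return geometry_proxy if geometry_proxy is not None else appearance_estimate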

Figure 7 (right) shows the relative contribution of these three pixel classes, plotted as a function of frame number over a single trial. From a training set containing around 50% of image pixels, the k-means algorithm is able to regress over 90% of the image. Of these pixels, however, only 60% have range data, and are thus projected into the map. While the overall improvement is small when measured in the image, it translates to a very significant improvement in the effective sensor range (between 50 and 100%).

Nevertheless, given the difference between the number of pixels regressed and the number of pixels projected into the map, it is clear that the map-based navigation paradigm is not able to fully exploit the results of image-based learning.

LAGR Test 7 Results

A close variant of the k-means Lf3D algorithm described above was used in Test 7 of the LAGR program. This test was carefully designed such that near-sighted behavior would lead the robot into a maze-like area of scrubby brush, while a far-sighted robot would take the obvious (and much shorter) path to the goal. Figure 8 shows the view from the start of the test course, along with the maps generated with and without Lf3D. Clearly, Lf3D extends the effective range significantly: the first map shows the incorrect route (through scrubby terrain) as a cul-de-sac, whereas the non-Lf3D map provides no information.


[Figure 5 plot panels: 'k-means data' (examples and clusters) and 'k-means cost' (clusters), plotted over R/(R+G+B) vs. G/(R+G+B), with color scales from 0 to 0.7 (0 to 1 for the final frame)]

Figure 5: Learned traversability for three frames, each illustrated via two rows of plots. Upper row: k-means clusters and elevation-coded scatterplot (left); learned regression model (right). Lower row: rectified RGB image (left); image-plane elevation training data (center); learned k-means regression (right).


[Figure 6 plot panels: training scatter, 'Traversable' mixture, and 'Not Traversable' mixture for three frames, each with three numbered components]

Figure 6: Learned traversability as in Figure 5, but for supervised MoG classifier. Upper row: training set, with blue for traversable and red for not (left); mixture models for p0(v) (center) and p1(v) (right). These plots are projections of RGB features into R − G (abscissa) and G − B (ordinate) coordinates. Lower row: rectified RGB image (left); image-plane elevation training data (center); learned classification (right).


[Figure 7 plots: left, RMS error (meters) vs. frame number with legend '2 class train error', '4 class train error', '2 class test error', '4 class test error'; right, fraction of image pixels vs. frame number for 'Training', 'Class', and 'Class and disparity']

Figure 7: Learning across many frames. Left: RMS error on the training set for the k-means regressor (k = 2, k = 4), and on a test set disjoint from the training set. Error is insensitive to k. Right: fraction of image pixels in the training set (solid line), fraction for which a traversability classification is available (dashed line), and fraction for which both classification and range are available (dotted line). Many sites have known traversability, but cannot be placed into the map.

Unfortunately, these maps also highlight one of the weaknesses of the approach as currently implemented. In order to learn that green bushes are non-traversable, the robot must first acquire some examples to train on. Since there are no bushes at the start of the course, the robot drove to within a few meters of the first rank of bushes before recognizing them as obstacles, turning around, and taking the correct route to the goal.

5. Learning slip from proprioception

While the main terrain unknown so far in LAGR has been the presence or absence of obstacles, slippage on slopes is one of the most important unsolved traversability issues on Mars. Slip measures the lack of progress of the vehicle and can be defined as the difference between the commanded velocity and the actual velocity (Helmick, 2004). The commanded velocity is computed by the vehicle's kinematics, while the actual velocity is estimated here using Visual Odometry (“VO”) (Matthies, 1987). Slip influences total traversability cost: the robot's mobility on certain terrains significantly degrades, especially as slope angle increases (Lindemann, 2005). Thus, we seek to improve path planning by predicting slip before entering a given terrain. We ultimately intend to address compliance and slip in a unified framework, but for now we are addressing them separately in each domain.


Figure 8: Top: LAGR Test 7, robot view from the start location. The obvious path leads to the goal at the top of the rise; a shortest-distance path takes the robot into the scrubby terrain on the left. Middle: Cost map generated over a successful trial using Lf3D. Green cells are traversable, purple are “lethal”, gray are intermediate. The red line denotes the path of the robot. Bottom: Corresponding map generated using geometry data only.


Slip depends on terrain slopes, but the precise relationship varies with the terrain type (Bekker, 1969), so both geometry and appearance must be considered. Slip learning fits into our proprioceptive learning framework: information about the terrain geometry x and appearance v of pixels within a map cell, collectively referred to as {(x, v)}, is measured from stereo imagery. At training time, this information is correlated to the traversability proxy, in this case the robot's slip, as the robot traverses the cell. At query time, slope and appearance alone are used to estimate slip. Slip prediction is intended for the near to mid-field ranges, as range information is needed to obtain reliable slope estimates.

Slip learning

Slip is learned in the following framework. First, terrain is classified using appearance information. Then, conditioning on terrain class, slip is learned as a function of terrain slopes (Angelova, 2006). The rationale for this decomposition is: 1) terrain type and appearance are approximately independent of slope; 2) introducing this structure helps constrain learning to better balance limited training data and a potentially large set of texture features. We focus on the individual components of slip learning: learning the terrain type from appearance and learning slip as a function of slopes when the terrain type is known.

As slip is a nonlinear function of terrain slopes (Lindemann, 2005), we use the Locally Weighted Projection Regression method (Vijayakumar, 2005). We prefer it to other nonlinear approximation methods, like Neural Networks, because it can be easily extended to online learning. The slip T̂ is estimated in terms of input slopes x via

\hat{T}(x) = \sum_{c=1}^{C} K(x, x_c)\Big(b_0^c + \sum_{i=1}^{R} b_i^c \langle d_i^c, x \rangle\Big),

where K(x, x_c) = exp(−‖x − x_c‖²/σ) is a receptive field centered about x_c, controlling the dominant local linear regression model, and R is the number of linear projections (here R ≤ 2). The parameters to be learned are the receptive field centers x_c, 1 ≤ c ≤ C, and the linear regression parameters b_i^c, d_i^c, 1 ≤ i ≤ R, 1 ≤ c ≤ C. Learning proceeds by assigning receptive fields to cover populated regions of the input domain and then fitting a linear regression (i.e., estimating factor loadings b_i^c and directions d_i^c) locally in each receptive field. This fit weights all training points by their corresponding distances to the receptive field center x_c, thus giving more influence to the nearest points (Angelova, 2006, has details). The receptive field size, σ > 0, is selected using a validation set and varies depending on terrain.
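Once its parameters are known, the displayed model can be evaluated directly; a compact NumPy version is sketched below (fitting the receptive fields and local regressions, which LWPR does incrementally, is not shown, and the array layout is an assumption).

import numpy as np

def predict_slip(x, centers, b0, B, D, sigma):
    """Evaluate the locally weighted model above.
    centers: (C, d) receptive-field centers x_c
    b0:      (C,)   offsets b_0^c
    B:       (C, R) projection coefficients b_i^c
    D:       (C, R, d) projection directions d_i^c
    sigma:   receptive-field size (> 0)"""
    x = np.asarray(x, dtype=float)
    K = np.exp(-np.sum((centers - x) ** 2, axis=1) / sigma)    # kernel weights
    local = b0 + np.einsum("cr,crd,d->c", B, D, x)             # b_0^c + sum_i b_i^c <d_i^c, x>
    return float(np.sum(K * local))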

The terrain classification is done in the image plane. We apply a texton-based approach (Varma, 2005) for that purpose. Small patches (of size roughly corresponding to image-plane regions falling in one map cell at close range) are used for training. A feature space of a 5 × 5 pixel neighborhood in all three color channels is considered, and a dictionary of cluster centers (textons) is learned from the data. The frequency of texton occurrence in a query image patch is compared to the texton frequencies of the training examples.


[Figure 9 plot panels: slip X (%) and slopes (deg, pitch and roll) vs. step number; soil: train RMS 7.23%, test RMS 11.8%; gravel: train RMS 7.16%, test RMS 27.5%; test plots show ground-truth and predicted slip]

Figure 9: Learning slip as a function of terrain slopes: soil (left), gravel (right). The predicted and measured (ground truth) slip are shown for each frame in long image sequences. Both training and test modes are shown. The average root mean squared (RMS) error is given atop each plot.

The k-Nearest Neighbor method (k = 3) is used to retrieve the closest training examples, and the χ² criterion is used as the distance measure (Varma, 2005, has details). This particular representation has been selected because it facilitates the discrimination between visually similar textures, such as sand and soil, especially considering the textures at a map cell resolution.
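A schematic version of the texton pipeline (dictionary from k-means over 5 x 5 x 3 patches, per-window texton histograms, chi-squared nearest neighbours) is sketched below; patch extraction, the dictionary size, and the histogram normalization are simplified assumptions.

import numpy as np
from sklearn.cluster import KMeans

def learn_textons(patches, n_textons=32):
    """patches: (N, 75) array of flattened 5x5x3 color neighborhoods.
    Returns the texton dictionary (cluster centers)."""
    return KMeans(n_clusters=n_textons, n_init=10).fit(patches).cluster_centers_

def texton_histogram(patches, textons):
    """Frequency of the nearest texton over all patches in an image window."""
    d = np.linalg.norm(patches[:, None, :] - textons[None, :, :], axis=2)
    hist = np.bincount(np.argmin(d, axis=1), minlength=len(textons)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between two texton histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def classify_terrain(query_hist, train_hists, train_labels, k=3):
    """k-nearest-neighbour vote (k = 3 in the text) under the chi-squared distance."""
    d = np.array([chi2(query_hist, h) for h in train_hists])
    nearest = np.argsort(d)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)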

Results

In this section we show training and testing results separately for the two main components of the slip prediction framework, namely terrain type classification and learning slip as a function of slopes. Learning and prediction of slip in an integrated framework is the subject of our current work.

We compute slopes within a 2D map cell representation. Cells are 0.2 m × 0.2 m within a 15 m × 15 m robot-centered map. The minimum squared-error plane fit at cell s is computed using the mean elevations of cells in the 6 × 6-cell neighborhood of s. The terrain's slope is then decomposed into a longitudinal (along the direction of motion) and a perpendicular lateral component, corresponding respectively to the pitch and roll of the vehicle. VO is used for localization, and the vehicle's attitude (received from the IMU) gives an initial gravity-leveled frame to retrieve correct longitudinal and lateral slope angles from the terrain (Angelova, 2006). In these experiments we consider terrain classification in the image plane only. Each map cell contains information about the images which have observed it, and therefore it is straightforward to retrieve the predicted terrain type information per cell from the image plane prediction.
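The slope computation reduces to a least-squares plane fit over the 6 x 6 neighborhood of mean cell elevations, followed by reading off two angles; the sketch below assumes the direction of motion is aligned with the grid's x axis (in practice the IMU-derived attitude supplies the gravity-leveled frame).

import numpy as np

def cell_slopes(elev, i, j, cell=0.20, half=3):
    """Fit z = a*x + b*y + c over the 6x6-cell neighborhood of mean elevations
    around cell (i, j); return (longitudinal, lateral) slope angles in degrees."""
    xs, ys, zs = [], [], []
    for di in range(-half, half):
        for dj in range(-half, half):
            z = elev[i + di, j + dj]
            if not np.isnan(z):
                xs.append(di * cell); ys.append(dj * cell); zs.append(z)
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    (a, b, _), *_ = np.linalg.lstsq(A, np.array(zs), rcond=None)
    # Longitudinal slope corresponds to pitch (along the direction of motion),
    # lateral slope to roll, under the alignment assumption stated above.
    return float(np.degrees(np.arctan(a))), float(np.degrees(np.arctan(b)))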


[Figure 10 panels: example images and classification overlays for sand, soil, grass, gravel, asphalt, and woodchip; bar chart of % correct per terrain ('Terrain classification results') for a 15 m map (75.2%) and a 12 m map (77.36%)]

Figure 10: Example terrain classification results: sand, soil, and gravel. A sliding window of size 100 × 30 pixels is positioned at uniform pixel locations in the image and the predicted terrain class is drawn with the corresponding color. Predicted class colors are overlaid on the images to the right. The assigned colors are shown on the bottom row. Summary terrain classification results are at the bottom right.

The performance of the slip learning algorithm has been evaluated on several long frame sequences (400–1000 frames), each one collected on a particular terrain type in a neighboring park. We train on the first portion of the traverse and test on a later nonoverlapping portion, holding out a small validation set as well. This testing scenario is challenging because the test and training frames can be very dissimilar. Figure 9 shows learning and prediction of longitudinal slip as a function of slopes for soil and gravel terrains. The test errors for the soil and gravel datasets are 11.8% and 27.5% respectively (see Angelova, 2006, for more results). Note the significant noise involved in measuring slip (Figure 9). Slip is normalized by the average velocity per step to give results in percent. It is apparent that the correct qualitative relationship between slope and slip has been captured. Note that there are (roll, pitch) angle combinations in the gravel test data which were not seen during training, which requires good generalization. The results are very promising given the noise level and the limitations of the training data.

The performance of the terrain classification has been evaluated on shorter sequences (about 100 frames) in which the ground truth terrain classification has been provided by a human operator. The test consists of a sequence of 1000 frames, and evaluation is done on every tenth frame, so as not to consider frames which are similar to one another. We had sequences available on five different terrain types (soil, sand, gravel, asphalt and wood chips), each one containing 15–22 test frames. Figure 10 shows example results of the terrain classification algorithm and summary terrain classification results on the five test sequences. The average test error is 23–25% relative to the human operator, considering only the pixels in the images which correspond to a cell map of size 12 × 12 m or 15 × 15 m, thus excluding distant and sky pixels. Despite some classification errors, the method is successful in discriminating visually similar terrains at close range, which serves the purposes of slip prediction. For now, the system is working offline, but we are exploring methods to speed up the terrain classification algorithm and integrate it into the navigation system.

Several issues regarding slip prediction are worth mentioning. There are other factors, such as wheel sinkage, unequal vehicle weight distribution, and unequal traction across different wheels, which also influence slip. This makes slip prediction a hard problem a priori, both from a mechanical and a machine vision point of view. Indeed, the slip measurements themselves are quite noisy because of random effects coming from the terrain, measurement errors, etc. While collecting slip data, we have encountered the problem of how to factor out the dependence of slip on the vehicle's velocity, which we addressed by forcing the vehicle to drive at relatively low constant speeds. This method assumes good vehicle localization and, while VO provides satisfactory results here, robot localization is still a topic of ongoing research.

6. Discussion

In Section 2, we laid out a conceptual framework for learning to extrapolate traversability knowledge from underfoot to the far field. This progression attempts to exploit correlations among sensor modalities to use richer sensor data closer to the robot to enhance the interpretation of poorer sensor data far from the robot. Sections 3, 4, and 5 showed some initial steps toward instantiating that framework and described how these steps have been tested in the DARPA LAGR and NASA MTP programs. There is still a very long way to go to fully develop this framework and establish its value experimentally, but we believe that our results to date justify pursuing this path.

A number of thorny system and infrastructure-related problems arose in both the LAGR and MTP programs that had to be solved before progress could be made on learning. Localization and data registration are key issues that impact any effort to learn by associating perceptions at one point in time with experience at another point in time, such as in LfP. Localization problems we encountered include wheel slip and GPS jumps that corrupted position estimates, time-stamping latencies that introduced attitude errors between IMU coordinate frames and image pose stamps, and stereo camera calibration errors that caused map misalignments even with perfect vehicle state knowledge. Improving solutions to these problems dramatically helped us come to grips with the learning issues; we will not elaborate on those solutions here, and only note that this was a major effort in itself.

19

Howard, Turmon, Angelova et al.

Characteristics of stereo vision as a range sensor also have important implications for off-road perception on Earth. Specifically, stereo vision currently does not work well on sparse, nearby vegetation — it fails to produce range data in this case — and it does not “penetrate” vegetation even to the limited degree that ladar does (Matthies, 2003). This put some limits on what could be done with the voxel density-based terrain representation we used for LfP. This situation will improve as real-time stereo vision progresses to higher-resolution cameras and to correlation algorithms that are more tolerant of range discontinuities. Although we remain convinced that learning the traversability of different types of vegetation is a key open problem for off-road robotics, such terrain has not been the focus of the LAGR program to date, so our efforts in this direction have been constrained by the need to address other priorities.

Lf3D has proven to be quite successful at extrapolating traversability information beyond the range of the local map for the kind of terrain we have faced so far in the LAGR program. Nevertheless, there are still key open issues in how to use the extrapolated information effectively for path planning. As range increases, the number of image pixels per map cell, for fixed-size map cells, still decreases rapidly, so different terrain representations may be more appropriate for planning than Cartesian maps with fixed cell size. A number of other possibilities exist, including obvious candidates that have been explored in the past, like multi-resolution Cartesian maps. Also, we have only scratched the surface on visual features that could be used in Lf3D-like strategies; key topics for future work include exploring texture features and designing features that are invariant to changes in illumination and range – not to mention season.

Learning slip is a kind of proprioceptive learning that we believe will be important in Earth-based applications, though we are currently addressing it for Mars. A key question here was whether slip could in fact be predicted with a useful degree of accuracy, given the inherent variability of terrain and other sources of noise in the system. Based on our results to date, we feel the answer to that question is yes. The results reported here focused on learning to predict slip from slope, assuming the terrain type was known. We are currently addressing the other side of the problem, using visual appearance to determine the terrain type, with promising initial results. Given that sandy, muddy, and other sorts of slippery terrain exist in Earth-based outdoor applications, we expect that it will become important to integrate this work more closely with the rest of the learning framework we have outlined here.

Other areas for future work include blame attribution, confidence assessment, joint estimation of traversability across all regimes, and strategic navigation, that is, choosing to push an obstacle across a regime boundary to gain appearance or geometric information about it.

Acknowledgments


The research described here was carried out by the Jet Propulsion Laboratory, California Institute of Technology, with funding from the DARPA LAGR and NASA MTP programs. We thank Nathan Koenig for data collection and the rest of the JPL LAGR team.

References

Angelova, A., Matthies, L., Helmick, D., Sibley, G., Perona, P., Learning to predict slip for ground robots, IEEE International Conference on Robotics and Automation, May 2006

Bapna, D., Rollins, E., Murphy, J., Maimone, M., Whittaker, W., Wettergreen, D., The Atacama desert trek: Outcomes, IEEE International Conference on Robotics and Automation, May 1998

Bekker, M., Introduction to Terrain-Vehicle Systems, Univ. of Michigan Press, 1969

Biesiadecki, J., et al., Mars Exploration Rover surface operations: driving Opportunity at Meridiani Planum, IEEE Conference on Systems, Man, and Cybernetics, October 2005

Bodt, B., Camden, R., Technology readiness level six and autonomous mobility, Proc. SPIE Vol. 5083: Unmanned Ground Vehicle Technology V, September 2004

Dickmanns, E., Mysliwetz, B., Recursive 3-D road and relative ego state recognition, IEEE Trans. PAMI, Vol. 14, No. 2, 1992

Ghahramani, Z., Hinton, G., The EM algorithm for mixtures of factor analyzers, Tech. Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 1997

Helmick, D., Cheng, Y., Clouse, D., Matthies, L., Path following using visual odometry for a Mars rover in high-slip environments, IEEE Aerospace Conf., Big Sky, Montana, March 2004

Krotkov, E., Fish, S., Jackel, L., McBride, B., Pershbacher, M., Pippine, J., The DARPA PerceptOR evaluation experiments, International Journal of Robotics Research, to appear, 2006

Lacaze, A., Murphy, K., DelGiorno, M., Autonomous mobility for the Demo III experimental unmanned vehicles, AUVSI Conf. on Unmanned Vehicles, July 2002

Leger, C., et al., Mars Exploration Rover surface operations: driving Spirit at Gusev Crater, IEEE Conference on Systems, Man, and Cybernetics, October 2005

Lindemann, R., Voorhees, C., Mars Exploration Rover mobility assembly design, test and performance, IEEE Intl. Conf. Systems, Man and Cybernetics, 2005

McLachlan, G., Peel, D., Finite Mixture Models, Wiley, 2000

Maimone, M., Biesiadecki, J., Tunstel, E., Cheng, Y., Leger, C., Surface navigation and mobility intelligence on the Mars Exploration Rovers, Intelligence for Space Robotics, TSI Press, Albuquerque, NM, 2006

Matthies, L., Schafer, S., Error modeling in stereo navigation, IEEE Journal of Robotics and Automation, Vol. RA-3, No. 3, June 1987

Matthies, L., Bergh, C., Castano, A., Macedo, J., Manduchi, R., Obstacle detection in foliage with ladar and radar, Proc. 11th International Symposium of Robotics Research, Siena, Italy, October 2003

National Research Council, Technology Development for Army Unmanned Ground Vehicles, The National Academies Press, 2002

Pomerleau, D., ALVINN: An autonomous land vehicle in a neural network, NIPS 1, pages 305–313, Morgan-Kaufmann, 1989

Pomerleau, D., Jochem, T., Rapidly adapting machine vision for automated vehicle steering, IEEE Expert: Special Issue on Intelligent Systems and their Applications, Vol. 11, No. 2, April 1996

Scholkopf, B., Mika, S., Burges, C., Knirsch, P., Muller, K.-R., Ratsch, G., Smola, A., Input space versus feature space in kernel-based methods, IEEE Trans. Neural Networks, Vol. 10, No. 5, 1999

Stentz, A., Hebert, M., A complete navigation system for goal acquisition in unknown environments, Proc. IEEE/RSJ International Conference on Intelligent Robotic Systems (IROS), August 1995

Valiant, L. G., A theory of the learnable, Communications of the ACM, Vol. 27, No. 11, pages 1134–1142, 1984

Varma, M., Zisserman, A., A statistical approach to texture classification from single images, Int. Journal of Computer Vision, Vol. 62, 2005

Vijayakumar, S., D'Souza, A., Schaal, S., Incremental online learning in high dimensions, Neural Computation, 2005

Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S., Constrained K-means clustering with background knowledge, Proc. Intl. Conf. Machine Learning, 2001

Wellington, C., Courville, A., Stentz, A., Interacting Markov Random Fields for simultaneous terrain modeling and obstacle detection, Robotics Science and Systems, 2005


