Research Article

Self-Supervised Sensor Learning and Its Application: Building 3D Semantic Maps Using Terrain Classification

International Journal of Distributed Sensor Networks, Volume 2014, Article ID 394942, 10 pages, http://dx.doi.org/10.1155/2014/394942

    Chuho Yi,1 Donghui Song,2 and Jungwon Cho3,4

1 Future IT R&D Lab, LG Electronics, Seoul 137-724, Republic of Korea
2 Department of HCI and Robotics, University of Science and Technology, Daejeon 305-350, Republic of Korea
3 Department of Computer Education, Jeju National University, Jeju 690-756, Republic of Korea
4 Department of Mathematics and Computer Science, Salisbury University, Salisbury, MD 21801, USA

    Correspondence should be addressed to Jungwon Cho; [email protected]

    Received 2 January 2014; Accepted 28 February 2014; Published 7 April 2014

    Academic Editor: Tai-hoon Kim

Copyright © 2014 Chuho Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An autonomous robot in an outdoor environment needs to recognize the surrounding environment to move to a desired location safely; that is, a map is needed to classify/perceive the terrain. This paper proposes a method that enables a robot to classify a terrain in various outdoor environments using terrain information that it recognizes without the assistance of a user; then, it creates a three-dimensional (3D) semantic map. The proposed self-supervised learning system stores data on the appearance of the ground using image features extracted by observing the movement of humans and vehicles while the robot is stopped. It learns about the surrounding environment using a support vector machine with the stored data, which is divided into terrain where people or vehicles have moved and other regions. This makes it possible to learn which terrain an object can travel on using self-supervised learning and image-processing methods. Then the robot can recognize the current environment and simultaneously build a 3D map using the RGB-D iterative closest point algorithm with an RGB-D sensor (Kinect). To complete the 3D semantic map, it adds semantic terrain information to the map.

    1. Introduction

As seen in the Defense Advanced Research Projects Agency (DARPA) Grand Challenge, USA, robotics has progressed markedly from industrial robots that perform only given tasks to autonomous mobile robots that determine how to travel to a target. For robots to operate more intelligently, research on mobile robots needs to consider the following: (1) how to recognize certain objects, humans, or specific patterns; (2) simultaneous localization and mapping (SLAM) [1]; and (3) navigation to a specific destination. When an autonomous robot needs to reach a destination, the most important basic process before moving is for it to assess the safety of the surrounding environment and find a safe path for movement [2]. This process could incorporate global positioning system (GPS) and mapping technology, but a GPS system is not accurate enough to identify an exact position, and maps do not include all objects, especially moving objects. Therefore, a moving robot must be able to recognize the surrounding environment. One recognition method is terrain classification, in which a robot uses sensor responses to recognize the surrounding environment and determine possible safe pathways.

This paper presents a method by which a robot equipped with a vision sensor can stop and classify terrain using a self-supervised learning system and identify roads and sidewalks from passing vehicles and humans. It then introduces the method used to create a three-dimensional (3D) map using a red/green/blue-depth (RGB-D) sensor (Kinect) and to add terrain information to create a 3D semantic map.

Figure 1(a) shows the environment of interest for this paper. It is an urban environment with a road for vehicles and a sidewalk for people. Figure 1(b) is the expected 3D semantic map after adding semantic information to the dense map.

Figure 1: (a) Example of an urban environment with a road, sidewalk, and buildings. (b) Example of a 3D semantic map (blue region: road; green region: sidewalk; red region: obstacles).

The remainder of this paper is organized as follows. We summarize related work in Section 2 and introduce the terrain-classification method and the learning method for distinguishing between road and sidewalk through observations using a camera sensor in Section 3. Section 4 introduces the RGB-D iterative closest point (ICP) method, which uses an RGB-D sensor to make a 3D dense map and to develop a 3D semantic map by integrating the results of Section 3. Experimental results showing the effectiveness of the proposed method are presented in Section 5, and the conclusions and future research are presented in Section 6.

    2. Related Works

2.1. Terrain Classification. Over the past years, several studies have proposed methods for terrain classification. One study developed a method that involved searching for possible obstacles using a stereo camera, eliminating candidates based on texture and color clues, and then modeling the terrain after obstacles had been defined [3]. Another study focused on avoiding trees in a forest; it used a stereo camera to recognize trees and classify terrain to find a safe pathway [4]. Other studies used a vibration sensor to classify terrain that had already been traveled, based on various vibration frequencies [5].

However, these techniques only work in specific environments. Robots need to be able to learn about unknown terrain. Some studies have focused on supervised learning that requires human intervention when a robot reaches an unknown area. Due to the limitations of supervised learning, many researchers are now working on self-supervised or unsupervised techniques, in which a robot can learn about an environment on its own, without any human supervision.

One recent study developed a technique in which a robot calculates the depth of the ground plane using a depth map generated by a stereo camera and can classify and learn about the ground and obstacles within 12 m; based on these data, it can recognize very distant regions, as far as 30–40 m [6]. Another study developed an unsupervised learning method that deletes incorrect detections for a wide variety of terrain types (e.g., trees, rocks, tall grass, bushes, and logs) while the robot navigates and collects data [7]. Yet another study involved self-supervised classification using two classifiers: an offline classifier that used vibration frequencies to provide the other classifier, an online visual classifier, with labels for various observed terrains. This allowed the visual classifier to learn about, and recognize, new environments [8].

However, some of these methods require more than one sensor; some use stereo cameras or vibration sensors with monocular cameras, and most assume either that the robot is facing a flat plane through which it can navigate or that the robot will learn about the terrain after it navigates through it.

2.2. 3D Map Building. The construction of 3D maps using various sensors has been studied, including range scanners [9], stereo cameras [10], and single cameras. The biggest problem in constructing a 3D map is the alignment of the captured images. To process 3D laser data, the ICP algorithm is widely used [9]. This algorithm iteratively finds the rigid transformation that minimizes the distance between corresponding points in the two frames. A passive stereo system can extract depth data for features from the paired images; the feature points may then be combined via an optimization similar to the iterative ICP process. Additional algorithms, such as the random sample consensus (RANSAC) algorithm, can then be used to solve the consistency problem [10].

Recent research on creating 3D maps has obtained depth information for each pixel using a sensor that simultaneously extracts a color-depth image and a general video, such as by combining Kinect with a time-of-flight (TOF) camera [11]. Kim et al. [12] built a 3D map using a fixed TOF camera that had frames unrelated to the order of time. In contrast, Henry et al. [11] proposed a 3D map-building method that used a freely moving RGB-D sensor and information on time, shape, and appearance simultaneously.

    3. Self-Supervised Terrain Classification

The proposed method is based on a self-supervised framework. The robot observes moving objects and determines their movement along roads and sidewalks. From these data it extracts image data (e.g., patches) about the terrain and classifies it as one of three classes (road, sidewalk, or background). This framework consists of three parts: detection and tracking of moving objects, recognition of paths taken by moving objects, and learning the terrain patches that were extracted from the paths taken by moving objects and classifying the environment.

Figure 2: Detection, classification, and tracking of moving objects ((a) the result of background subtraction, (b) the result of object classification (blue square: human, green: vehicle) and tracking based on size and location).

The proposed method has two classifiers. One is for classifying moving objects; this uses offline, supervised learning. The other is for classifying terrain; this uses self-supervised learning, and it learns image patches based on labels generated by the object classifier. Both classifiers are Support Vector Machines (SVMs) as proposed by Vapnik [13], and both use only visual features.

3.1. Detection and Tracking of Moving Objects. For this study, we assumed that only humans and vehicles move in the outdoor environment. Thus, moving objects are defined in two classes: human and vehicle. Background subtraction is used to detect moving objects; this involves a mixture of adaptive Gaussians [14]. The system tracks objects based on their size and location. Figure 2 shows the results of detection, classification, and tracking of moving objects.
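As a concrete illustration of this step (not the authors' code), the following OpenCV sketch combines an adaptive mixture-of-Gaussians background model with a naive size/position tracker; the video source, blob-size threshold, and association tolerance are assumptions.

```python
import cv2

# Hypothetical sketch of Section 3.1: adaptive mixture-of-Gaussians background
# subtraction [14] plus naive size/position tracking. All thresholds and the
# input file are assumptions, not the authors' settings.
cap = cv2.VideoCapture("street_scene.avi")        # assumed input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
tracks = []                                       # each track: {"bbox": (x, y, w, h), "hits": n}

def same_object(a, b, tol=30):
    """Associate detections whose position and size are similar."""
    return all(abs(a[i] - b[i]) < tol for i in range(4))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                            # foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # remove speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        bbox = cv2.boundingRect(c)                            # (x, y, w, h)
        if bbox[2] * bbox[3] < 400:                           # ignore tiny blobs (assumed threshold)
            continue
        for track in tracks:
            if same_object(track["bbox"], bbox):
                track["bbox"], track["hits"] = bbox, track["hits"] + 1
                break
        else:
            tracks.append({"bbox": bbox, "hits": 1})
```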

3.2. Object Classification. To identify whether a detected object is a human or a vehicle, an SVM is selected as the object classifier. Data about the object's edges are used as the feature vector. This classifier also provides information about what class of terrain is involved.

    (1) Classifier. The first SVM in this system classifies objectsas either human or vehicle. This binary classification can beexpressed as (1) where the classification function is : R 1,

    is a training vector, and

    is a label of class. For

    example, is an objects edge feature vector and

    shows that

    a positive object is classified as human and a negative objectis classified as vehicle:

    () =

    =1

    (,

    ) + , (1)

    where is the number of total training data and () is akernel function and we used the Radial Basis Function (RBF)which is (,

    ) = /2

    2

    as the kernel function. and

    are weights that reduce numbers of wrong classificationby making a distance between a hyper plane and support

    Figure 3: Path extraction about moving objects and lines shows itspath (blue: human, green: vehicles).

    vectors far and these weights are calculated by changing toan optimized problem using

    max

    =1

    1

    2

    =1

    =1

    (, )

    subject to

    =1

    = 0, 0

    ,

    (2)

    where is a constant tominimize incorrect classification and is calculated with

    and support vector using (1) [15, 16].

    This SVM is implemented using LIBSVM [17, 18].
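As a reproducibility aid (not the authors' code), the sketch below trains an equivalent RBF-kernel binary SVM with scikit-learn, whose SVC is itself built on LIBSVM; the feature arrays and the gamma and C values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical sketch of the human/vehicle classifier in (1)-(2).
# X holds edge-histogram feature vectors (see (3)); y holds labels:
# +1 = human, -1 = vehicle. Values are placeholders, not the paper's data.
X = np.random.rand(200, 5)          # 200 training objects, 5-D edge features
y = np.repeat([1, -1], 100)         # first 100 "human", last 100 "vehicle"

clf = SVC(kernel="rbf", gamma=0.5, C=1.0)   # RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)
clf.fit(X, y)

new_object = np.random.rand(1, 5)
label = clf.predict(new_object)[0]          # +1 -> human, -1 -> vehicle
print("human" if label > 0 else "vehicle")
```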

(2) Features. The feature vector for the object classifier is a global edge histogram of the object's region [19, 20]; it is shown in (3) and consists of responses for four edge orientations plus a nonedge response:

x_O = [v_v, v_h, v_{45}, v_{135}, v_{non}],   (3)

where x_O is the feature vector of edge responses of an object region O, v_v expresses the response for the vertical orientation of the object's region, v_h is for the horizontal orientation, v_{45} is for 45°, v_{135} is for 135°, and v_{non} is the response for nonedges. These feature values are normalized from 0 to 1.
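As an illustration of the global edge histogram behind (3), the following sketch computes a simplified, Sobel-based five-bin approximation (vertical, horizontal, 45°, 135°, nonedge); it is a stand-in for the MPEG-7-style descriptor of [19], and the edge threshold is an assumption.

```python
import cv2
import numpy as np

def global_edge_histogram(gray, edge_thresh=30.0):
    """Simplified 5-bin global edge histogram (v, h, 45, 135, nonedge),
    normalized to [0, 1]; a Sobel-based stand-in for the descriptor in [19]."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edges = np.hypot(gx, gy) > edge_thresh

    vertical = edges & (np.abs(gx) > 2 * np.abs(gy))    # strong horizontal gradient -> vertical edge
    horizontal = edges & (np.abs(gy) > 2 * np.abs(gx))  # strong vertical gradient -> horizontal edge
    diagonal = edges & ~vertical & ~horizontal
    diag_a = diagonal & (np.sign(gx) == np.sign(gy))    # one diagonal orientation (by gradient sign)
    diag_b = diagonal & ~diag_a                         # the other diagonal orientation

    hist = np.array([vertical.sum(), horizontal.sum(), diag_a.sum(),
                     diag_b.sum(), (~edges).sum()], dtype=np.float32)
    return hist / hist.sum()

# Example usage on a placeholder grayscale region of an object.
region = np.random.randint(0, 256, (64, 32), dtype=np.uint8)
print(global_edge_histogram(region))
```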

3.3. Data Association and Path Extraction. To extract human and vehicle movement paths (see Figure 3), the system saves data about the detected objects at every frame during the classification and tracking of moving objects.

Figure 3: Path extraction for moving objects; the lines show the extracted paths (blue: human, green: vehicles).


(1) Observe the moving object O_t (the object detected at the current frame).
(2) Confirm whether the presently detected object was seen before by comparing it with every object O_i that was already detected, based on the size and position data of the objects.
(2a) If the object was seen before (e.g., the volume V_i and the last position (x, y)_i of O_i are similar to those of O_t), update the data of O_i: append the new position to P_i, replace the volume V_i with {w, h}, increase the tracking count n_i by one, and add the classification result of O_t to the accumulated label l_i.
(2b) If the object is seen for the first time, create new object data O_{M+1} from O_t; thus, the total number of detected objects increases to M + 1.
(3) Check whether n_i is a multiple of T.
(3a) If it is, do path extraction.
(3b) If it is not, keep observing until n_i becomes the next multiple of T.

Algorithm 1: Data collection and association.

M = the total number of detected objects
τ = the threshold for defining a moving object
Sidewalk map: a map that includes the total paths of human movement
Road map: a map that includes the total paths of vehicle movement
Input: all detected objects O_i (i = 1, 2, ..., M)
Goal: find all paths on which humans and vehicles move (Sidewalk: P_S, Road: P_R)
(0) Initialize P_S, P_R, the Sidewalk map, and the Road map
(1) for i = 1 to M do
(2)   if n_i ≥ τ then
(3)     O_i is a moving object
(4)     for j = 1 to n_i do
(5)       (x, y)_{i,j} ← (x_{i,j}, y_{i,j} − h_i/2)
(6)     end for
(7)     if l_i > 0 then
(8)       for j = 1 to (n_i − 1) do
(9)         Draw a line from (x, y)_{i,j} to (x, y)_{i,j+1} into the Sidewalk map
(10)      end for
(11)    else
(12)      for j = 1 to (n_i − 1) do
(13)        Draw a line from (x, y)_{i,j} to (x, y)_{i,j+1} into the Road map
(14)      end for
(15)  else
(16)    O_i is noise
(17) end for
(18) for all pixels (x, y) of the Sidewalk and Road maps do
(19)   if Sidewalk(x, y) != 0 then P_S ← P_S ∪ {(x, y)}
(20)   if Road(x, y) != 0 then P_R ← P_R ∪ {(x, y)}
(21) end for

Algorithm 2: Path extraction.

Here, the subscript t denotes the t-th appearance of a moving object during the observation. The stored data include the i-th object's locations in image coordinates, P_i, and its volume V_i (which consists of the object's width w_i and height h_i), together with the number of times n_i that the i-th object has been tracked and the accumulated label value l_i generated by classification. These data are expressed as follows:

O_i = (P_i, V_i, n_i, l_i),
P_i = \{(x, y)_{i,1}, (x, y)_{i,2}, \ldots, (x, y)_{i,n_i}\},
V_i = \{w_i, h_i\},  (i = 1, 2, \ldots, M),   (4)

where M is the total number of objects detected as moving objects; our system defines the bottom-left of the image as zero in the image coordinates. Algorithm 1 shows the entire process of data collection and association for a moving object.

Algorithm 1 (2a) shows that the location data P_i grow by one element, from (x, y)_{i,n} to (x, y)_{i,n+1}; the object's volume V_i is replaced by {w, h}; the number of iterative trackings n_i is increased by one; and the label value l_i, which indicates the class of the object, is accumulated with the classification result of the newly detected object O_t.

We can use the data about moving objects to identify movement paths, in which the sidewalk and the road are each a set of image coordinates. Algorithm 1 (3) determines when the data are sufficient for the system to extract a path. Here, T = 5, which means that path extraction proceeds each time an object has been observed for another multiple of five observations. Algorithm 2 shows the process of path extraction in pseudocode.
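To make the reconstructed notation in (4) and the update rule of Algorithm 1 (2a) concrete, here is a minimal, hypothetical Python sketch of the per-object record; the field and method names are illustrative, not the authors'.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the per-object record in (4) and the update of
# Algorithm 1 (2a); names follow the reconstructed notation used above.
@dataclass
class TrackedObject:
    positions: list = field(default_factory=list)  # P_i: [(x, y), ...] image coordinates
    width: int = 0                                  # w_i
    height: int = 0                                 # h_i
    n_tracked: int = 0                              # n_i: number of times tracked
    label_sum: int = 0                              # l_i: accumulated classifier output (+1/-1)

    def update(self, x, y, w, h, label):
        """Append the new position, replace the volume, and accumulate the label."""
        self.positions.append((x, y))
        self.width, self.height = w, h
        self.n_tracked += 1
        self.label_sum += label

    def is_human(self):
        return self.label_sum > 0   # positive accumulated label -> human, negative -> vehicle
```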


In Algorithm 2, lines (2), (3), (15), and (16) show how observed objects are defined as useful or not, based on the assumption that a useful object is observed over a certain number of iterative frames (τ = 30) with a specified number of iterative trackings n_i. Line (5) shows that the y-coordinate of the bottom of the object becomes the path's y-coordinate, while the path's x-coordinate is the same as the object's x-coordinate. Lines (7) and (11) show whether an object observed over a certain number of iterative frames is a human or a vehicle based on its accumulated label value l_i: when l_i is positive, the object is a human; when l_i is negative, the object is a vehicle. Lines (18)-(21) show how the system obtains the path data P_S and P_R from the Sidewalk map and the Road map (cf. Sidewalk(x, y) refers to the intensity of the map at image coordinates (x, y); it is not zero when a point of a drawn line appears at these coordinates).

Additionally, we use maps, which are one-channel image spaces, to save the objects' movements, and we conduct random sampling of patches based on the path data generated from the maps. If we used raw position data instead of path data, patch sampling would depend heavily on the objects' locations, rather than being random, because objects appear at similar locations in an image when they are located beyond the camera's focal length. We can solve this problem by drawing lines onto the maps and using these to generate the path data. The number of elements of each path data set should be more than 0.4 times the length of the image diagonal; if it is not, the system returns to the observation step.
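The following hypothetical OpenCV sketch illustrates the rasterization idea of Algorithm 2: trajectories are drawn as lines into one-channel maps, path pixels are read back out, and the diagonal-length test above is applied. The example trajectory and image size are placeholders.

```python
import cv2
import numpy as np

# Hypothetical sketch of Algorithm 2: rasterize object trajectories into
# one-channel maps and read the path pixels back out.
H, W = 360, 640                                    # assumed image size (cf. Section 5.1)
sidewalk_map = np.zeros((H, W), dtype=np.uint8)
road_map = np.zeros((H, W), dtype=np.uint8)

def add_path(track_points, label_sum):
    """Draw consecutive bottom-of-object points as lines into the proper map."""
    target = sidewalk_map if label_sum > 0 else road_map
    for p, q in zip(track_points[:-1], track_points[1:]):
        cv2.line(target, p, q, color=255, thickness=1)

# Example trajectory (placeholder data): a human moving left to right.
add_path([(50, 300), (120, 298), (200, 301)], label_sum=+3)

# P_S: set of (x, y) image coordinates where the sidewalk map is nonzero.
ys, xs = np.nonzero(sidewalk_map)
path_sidewalk = list(zip(xs.tolist(), ys.tolist()))

# Only keep the path if it is long enough relative to the image diagonal.
diag = np.hypot(H, W)
if len(path_sidewalk) < 0.4 * diag:
    print("not enough path data yet; keep observing")
```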

3.4. Terrain Data Extraction. Terrain data about sidewalks and roads are randomly extracted from each path P_S and P_R. To extract nonpathway regions (in this study, these regions are defined as background), the means \mu_{k,d} and standard deviations \sigma_{k,d} of the feature values (see (8)) of the extracted sidewalk and road data are calculated, and background candidates are extracted using global random sampling. Here, k is a terrain index denoting either the sidewalk (S) or the road (R), and d is a feature index ranging from 1 to D, where D is the dimension of the feature vector (i.e., 11). The background candidates are clustered using the Mean-Shift algorithm [21], and \mathrm{mod}_{c,d} denotes the d-th feature's mode of the c-th cluster.

The c-th cluster of background candidates is selected as background data, that is, BG_c = 1, when the gap \mathrm{Gap}_{k,c} between the mode of the cluster and the mean of the extracted sidewalk and road data is larger than a distance \mathrm{Dist}_k calculated from the standard deviations \sigma_{k,d} (see Figure 4):

\mathrm{Gap}_{k,c} = \sqrt{\sum_{d=1}^{D} (\mathrm{mod}_{c,d} - \mu_{k,d})^2},   (5)

\mathrm{Dist}_k = w \sqrt{\sum_{d=1}^{D} (n \sigma_{k,d})^2},   (6)

\mathrm{BG}_c = 1 if \mathrm{Gap}_{S,c} > \mathrm{Dist}_S and \mathrm{Gap}_{R,c} > \mathrm{Dist}_R, and \mathrm{BG}_c = 0 otherwise.   (7)

Figure 4: Example of 2D feature clustering and comparison of the clusters with the sidewalk and road data (deep blue: distribution of the sidewalk data; deep red: distribution of the road data; small dots: background candidates; the same color represents the same cluster).

In this study, we assumed that the distributions of the extracted sidewalk and road data were Gaussian. Thus, n in (6) is the multiple of the standard deviation that specifies the width of the confidence interval (i.e., n = 1: 68%, n = 2: 95%, n = 3: 99.7%) and is set to 2. w is a weight that influences the distance; it is likewise chosen for the 95% confidence interval and set to 2.

Figure 4 shows an example of 2D feature clustering and the notation.

The size of the terrain data, that is, the patch size, is set at 12 × 12 pixels; the numbers of extracted patches are set at 200 each for the sidewalk and road classes and 400 for the background class, based on experimental results on classification performance with various numbers of training data.

Figure 5 shows experiments on clustering the background candidates, in which clusters were selected as background or not.

Figure 5: (a) The result of clustering (small dots: candidates of the background; the same color represents the same cluster). (b) The result of terrain data extraction (red dots: background; blue dots: sidewalk; green dots: road; the others: clusters of candidates excluded from the background class).
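The selection rule in (5)-(7) can be sketched as follows with scikit-learn's MeanShift; the feature arrays are placeholders, and using each cluster center as the cluster mode is an assumption of this illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical sketch of the background selection rule in (5)-(7).
# candidates: globally sampled feature vectors; sidewalk/road: feature vectors
# sampled along the extracted paths (all placeholder arrays).
rng = np.random.default_rng(0)
candidates = rng.random((400, 11))
sidewalk = rng.random((200, 11))
road = rng.random((200, 11))

n, w = 2, 2                                   # interval multiple and weight (both set to 2)
mu = {"S": sidewalk.mean(axis=0), "R": road.mean(axis=0)}
dist = {k: w * np.linalg.norm(n * v.std(axis=0))            # Dist_k as in (6)
        for k, v in {"S": sidewalk, "R": road}.items()}

clustering = MeanShift().fit(candidates)      # Mean-Shift clustering of background candidates [21]
background = []
for c in np.unique(clustering.labels_):
    members = candidates[clustering.labels_ == c]
    mode = clustering.cluster_centers_[c]     # cluster center used as the cluster mode
    gap = {k: np.linalg.norm(mode - mu[k]) for k in ("S", "R")}   # Gap_{k,c} as in (5)
    if gap["S"] > dist["S"] and gap["R"] > dist["R"]:             # BG_c = 1 in (7)
        background.append(members)
```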

3.5. Terrain Classification. The terrain classifier learns the terrain data generated from the observed paths as feature vectors (see (8)) and determines where a robot can move within the surrounding environment, as a self-supervised learning method. When the classes have different numbers of training data, we can weight each terrain class according to the number of data in its own class relative to the total number of data in all classes. The use of features and classifiers to classify image data into class labels has been popular in recent years [22].

(1) Classifier. We defined terrain using three classes (sidewalk, road, and background) and used a multiclass SVM for terrain classification.

SVM systems were initially designed for binary classification, but two common methods allow SVM systems to classify more than two classes: one method involves combining numerous binary SVM systems; the other approaches the problem as an optimization problem and considers all data simultaneously.

We selected the former approach, specifically the one-against-one method. It constructs k(k − 1)/2 binary classifiers to handle k classes and then uses Max Wins, a voting scheme, to predict the final class as the one with the most votes [23].
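For reference, scikit-learn's SVC realizes this one-against-one decomposition with Max Wins style voting out of the box, so a terrain classifier along these lines could look like the following sketch; the data, kernel parameters, and the balanced class weighting (cf. the per-class weighting mentioned above) are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical sketch: 3-class terrain SVM (sidewalk / road / background)
# using the one-against-one decomposition, i.e., k(k-1)/2 = 3 binary SVMs
# combined by voting (SVC's default multiclass strategy).
X = np.random.rand(800, 11)                                     # placeholder 11-D features, see (8)
y = np.repeat(["sidewalk", "road", "background"], [200, 200, 400])

clf = SVC(kernel="rbf", decision_function_shape="ovo", class_weight="balanced")
clf.fit(X, y)                                                   # trains 3 pairwise classifiers internally
print(clf.predict(np.random.rand(1, 11)))                       # class with the most pairwise votes
```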

(2) Features. We used two visual features, color and texture, for the terrain classifier. We used the RGB and Lab color spaces for the color features, which have been proven to be good features for scene classification [24], and a global edge histogram for the texture features. Equation (8) presents the visual feature vector used for the terrain classifier:

x_p = [R, G, B, L, a, b, v_v, v_h, v_{45}, v_{135}, v_{non}],   (8)

where x_p is the feature vector of an image patch p; R, G, B, L, a, and b are the mean values of each channel of the RGB and Lab color spaces; v_v expresses the edge response for the vertical orientation of the patch, v_h is for the horizontal orientation, v_{45} is for 45°, v_{135} is for 135°, and v_{non} is the response for nonedges. These feature values are normalized from 0 to 1.

    4. 3D Semantic Map Building

This paper constructs a 3D semantic map that shows regions where people or vehicles can move. To build the 3D map, it is necessary to determine the position of the sensor at every time point. It is possible to estimate the location with the sensor or to do so more precisely using multiple sensors. Normal position estimation can use the motion model of the robot, and it is also possible to make use of GPS or image processing. We build the 3D map using the position of the robot estimated by matching the point clouds and features from the image and odometry data.

4.1. Ground Plane Estimation with a Vertical Disparity Map. Even when the environment is learned from the trajectories of passing vehicles and humans, it is still possible to classify some terrain incorrectly because its texture and color differ completely from the training data. For example, as shown in Figure 6(a), the sidewalk might be recognized as an obstacle because there are so many leaves with different colors and textures on it. However, we can assume that a sidewalk or road lies in a single plane once the ground plane has been obtained. We therefore resolve this drawback by assigning the plane to the class that forms the principal part of the plane.

To estimate the plane, the V-disparity map is widely used to detect obstacles and the ground plane [25]. The horizontal axis of the V-disparity map indicates the depth, and the vertical axis is built by accumulating the pixels with the same depth along the horizontal axis of the depth data. A plane in the 3D world appears as a line in the V-disparity map; therefore, extracting a strong line from the V-disparity map yields the ground plane, as shown in Figure 6(b). The class of the ground plane is determined with the Max Wins voting method, using the votes of the terrain-class labels of the parts belonging to the plane. A value exceeding the plane in the column direction of the V-disparity map is an obstacle [26].

Figure 6: (a) A sidewalk with complex texture, such as fallen leaves. (b) The road plane estimated using a V-disparity map.
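The following sketch illustrates one common way to realize this step: accumulate a V-disparity map from a disparity (or inverse-depth) image and extract its dominant line with a probabilistic Hough transform. The disparity range and the Canny/Hough parameters are assumptions, not the authors' choices.

```python
import cv2
import numpy as np

def ground_line_from_v_disparity(disparity, max_disp=64):
    """Hypothetical sketch: build a V-disparity map (rows x disparity bins) and
    return its longest line segment, which corresponds to the ground plane [25]."""
    rows = disparity.shape[0]
    v_disp = np.zeros((rows, max_disp), dtype=np.uint8)
    for r in range(rows):
        row = disparity[r]
        valid = row[(row > 0) & (row < max_disp)]
        hist, _ = np.histogram(valid, bins=max_disp, range=(0, max_disp))
        v_disp[r] = np.clip(hist, 0, 255)            # accumulate pixels with the same disparity

    edges = cv2.Canny(v_disp, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                            minLineLength=rows // 4, maxLineGap=10)
    if lines is None:
        return None
    return max(lines[:, 0, :],
               key=lambda l: np.hypot(l[2] - l[0], l[3] - l[1]))   # longest detected segment

# Pixels whose disparity exceeds the ground line's disparity at their row
# (values above the plane in the column direction) can be treated as obstacles [26].
```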

4.2. Map Building Using RGB-D Iterative Closest Point. The ICP algorithm searches for the rigid transformation that minimizes the distance between the source point cloud P_s and the target point cloud P_t. The process repeats the optimization and alignment of the cloud data until convergence.


(0) F_s = Extract_RGB_Point_Features(P_s)
(1) F_t = Extract_RGB_Point_Features(P_t)
(2) (T*, A_f) = Perform_RANSAC_Alignment(F_s, F_t)
(3) repeat
(4)   A_d = Compute_Closest_Points(P_s, P_t, T*)
(5)   T* = argmin_T [ α (1/|A_f|) Σ_{(i,j) ∈ A_f} ||T(f_s^i) − f_t^j||² + (1 − α) (1/|A_d|) Σ_{(i,j) ∈ A_d} ||T(p_s^i) − p_t^j||² ]
(6) until (Error_Change(T*) ≤ θ) or (maxIter reached)
(7) return T*

Algorithm 3: RGBD-ICP algorithm.

The ICP algorithm is useful when the two point clouds are already roughly aligned. Conversely, when the data association between the two point clouds is not accurate, the algorithm converges to a local minimum, which is an inappropriate convergence [11]. In contrast, visual alignment makes use of feature point matching. The main advantage of using visual features is that it does not require an initialization process and can be corrected using the RANSAC algorithm. However, due to the inaccuracy of scale, two-dimensional (2D) matching does not guarantee an optimal solution.

In this paper, we apply the RGB-D ICP algorithm, which has the advantage of matching RGB-D data in both ways. The algorithm is described in Algorithm 3. It takes the input source P_s and target P_t from the RGB-D sensor and obtains a rigid transformation of the camera. Lines (1) and (2) in Algorithm 3 describe the extraction of visual features and the association between F_s and F_t. We use Harris corner points as the visual features. To calculate the optimal rigid transformation between the two feature point sets, the RANSAC algorithm is applied: Perform_RANSAC_Alignment searches for the transformation giving the best alignment by repeatedly selecting three matched feature points, estimating a transformation, and counting the number of inliers among the remaining feature points. The transformation with the most inlier pairs is kept, and the association A_f of matched feature points is determined through this transformation.

Lines (3) to (6) are the main loop of the ICP algorithm. The association between the point clouds is calculated at line (4); this process transforms the source point cloud using the current transformation T*, where the first initialization is the visual RANSAC transformation. Consequently, it is possible to match the point clouds without knowing their relative orientation. Line (5) minimizes the alignment error of the point-cloud association and of the visual feature matching between the clouds: the first term of the error function represents the average distance of the visual feature associations, and the second represents the error distance of the associations between the point clouds, with the two errors weighted by α. The process ends when the error has been reduced sufficiently or the predetermined maximum number of iterations is reached. When RANSAC does not find sufficient inliers, the ICP transformation is initialized using the odometry data of the robot.

Figure 7 shows a 3D map created using the RGB-D ICP algorithm in an outdoor environment. This map will become a 3D semantic map when the results of terrain classification are added.

Figure 7: An example of a 3D map built using the RGB-D ICP algorithm.

Figure 8: Classification performance with various numbers of training data over the whole dataset (y-axis: mean true positive (%); x-axis: number of training data for the sidewalk/road classes and the background class).


    5. Experiments

5.1. Experimental Environment. We conducted experiments at four locations in Seongdong-gu, Seoul, Republic of Korea, and captured test datasets using a camera at a resolution of 640 × 360.

The experiments involved a robot observing moving objects and learning about the surrounding terrain based on the paths of the moving objects; the robot was motionless, with the assumption that it was in an unknown location.

To extract a reasonable number of patches, tests were conducted to classify terrain with various numbers of training data. Figure 8 presents the results.


Figure 9: Experimental results. Columns present the results and processes of the experiments over the whole dataset. (a) Experimental environments, (b) extraction of training patches based on paths after observing moving objects, (c) ground-truth images for terrain classification (red: background, green: road, blue: sidewalk), (d) supervised terrain classification based on the ground-truth images, (e) classification results of the proposed method.

We evaluated the proposed method by comparing its results with those of supervised learning, which ensured correct labels using ground-truth images. Both methods randomly extracted training data for each terrain class (sidewalk: 200, road: 200, background: 400) based on Figure 9. To train the object classifier, which involves offline learning, about 1,968 units of training data were obtained from NICTA [27] and UIUC [28]. Each class of object (i.e., humans and vehicles) had the same amount of data; the RBF kernel parameter used for training was 23.5 and was generated using a tool provided by LIBSVM [17]. The performance of the object classifier was 94.5408% based on 10-fold cross-validation, which was also provided by the LIBSVM tool.

5.2. Results of Self-Supervised Terrain Classification. In this study, self-supervised classification was compared to supervised classification, and the classification error rates of both methods were computed against ground-truth images. Table 1 lists the terrain classification error rates as mean and standard deviation over the four datasets. Tests were conducted 10 times for patch extraction, training, and classification for each dataset.

Table 1 shows that the results of the proposed method were approximately 3–9% different from those of supervised classification.

Figure 10: (a) 3D map of the university; (b) RGB images of the same scene.


Figure 11: (a) The 3D semantic map. (b) Magnification of part of the 3D semantic map (red: obstacles; blue: road; green: sidewalk).

Table 1: Comparison of self-supervised classification and the supervised learning method (mean error rate, with standard deviation in parentheses).

Dataset | Supervised classification | Self-supervised classification
1 | 4.58% (0.32%) | 8.79% (1.14%)
2 | 5.23% (0.29%) | 14.39% (0.56%)
3 | 3.30% (0.73%) | 6.08% (0.36%)
4 | 8.00% (0.24%) | 17.11% (1.41%)

The biggest difference between the supervised and the proposed classification results, roughly 9 percentage points, occurred in dataset 2. As shown in Figure 9, most misclassifications appeared at the boundaries between roads and sidewalks, because no humans or vehicles passed through these regions and because their colors and textures differ from both the sidewalk and the road. The same effect can be seen in datasets 3, 4, and 1, which produced the second, third, and fourth worst classification results. To sum up, the performance of the proposed method is nearly the same as that of the supervised method, which requires human intervention, apart from the boundary-region problem that critically affected dataset 2.

Figure 9 presents, per dataset, the experimental environments, the extraction of patches for the terrain classes based on the paths of moving objects, the ground-truth images of the terrain, the supervised terrain classification, and the results of the proposed method.

5.3. Results of 3D Map Building with Terrain Classification. After self-supervised learning by observing objects moving in the unknown environment, the robot moves around to build a 3D map with the RGB-D sensor. The result is shown in Figure 10. Finally, Figure 11 shows a 3D semantic map that includes the result of the terrain classification. Compared to Figure 10, the map was created by classifying terrain regions as road, sidewalk, and obstacles that the robot cannot move through, such as trees and buildings. However, the proposed terrain classification has a disadvantage in that the region between the sidewalk and the road is classified as an obstacle that the robot cannot pass through.

    6. Conclusions

We proposed a self-supervised terrain classification framework in which a robot observes moving objects and learns about its environment based on the paths of the moving objects captured using a monocular camera. The results were similar (ca. 3–9% worse) to those of supervised terrain classification methods, which is sufficient for an autonomous robot to recognize its environment. In addition, we built a 3D dense map with an RGB-D sensor using the ICP algorithm. Finally, the 3D semantic map was created by adding the results of the terrain classification to the 3D map.

In future research, we will focus on better ways to extract background data, using a vanishing point or depth data, and on a total framework for navigation. We will also conduct tests in various environments, such as indoors and unstructured outdoor environments.

    Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

    Acknowledgment

This research was supported by the 2014 Scientific Promotion Program funded by Jeju National University.

    References

[1] C. Yi, S. Jeong, and J. Cho, "Map representation for robots," Smart Computing Review, vol. 2, no. 1, pp. 18–27, 2012.

[2] X. Li and B. J. Choi, "Design of obstacle avoidance system for mobile robot using fuzzy logic systems," International Journal of Smart Home, vol. 7, no. 3, pp. 321–328, 2013.

[3] A. Talukder, R. Manduchi, R. Castano et al., "Autonomous terrain characterisation and modelling for dynamic control of unmanned vehicles," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 708–713, October 2002.

[4] A. Huertas, L. Matthies, and A. Rankin, "Stereo-based tree traversability analysis for autonomous off-road navigation," in Proceedings of the 7th IEEE Workshop on Applications of Computer Vision (WACV '05), pp. 210–217, January 2005.


[5] C. A. Brooks and K. Iagnemma, "Vibration-based terrain classification for planetary exploration rovers," IEEE Transactions on Robotics, vol. 21, no. 6, pp. 1185–1191, 2005.

[6] M. J. Procopio, J. Mulligan, and G. Grudic, "Learning terrain segmentation with classifier ensembles for autonomous robot navigation in unstructured environments," Journal of Field Robotics, vol. 26, no. 2, pp. 145–175, 2009.

[7] D. Kim, J. Sun, S. M. Oh, J. M. Rehg, and A. F. Bobick, "Traversability classification using unsupervised on-line visual learning for outdoor robot navigation," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '06), pp. 518–525, May 2006.

[8] A. B. Christopher and K. Iagnemma, "Self-supervised terrain classification for planetary rovers," in Proceedings of the NASA Science Technology Conference, 2007.

[9] S. May, D. Droeschel, D. Holz et al., "Three-dimensional mapping with time-of-flight cameras," Journal of Field Robotics, vol. 26, no. 11-12, pp. 934–965, 2009.

[10] K. Konolige and M. Agrawal, "FrameSLAM: from bundle adjustment to real-time visual mapping," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1066–1077, 2008.

[11] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments," in Proceedings of the International Symposium on Experimental Robotics (ISER '10), 2010.

[12] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, and S. Thrun, "Multi-view image and ToF sensor fusion for dense 3D reconstruction," in Proceedings of the IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops '09), pp. 1542–1546, October 2009.

[13] V. Vapnik, Statistical Learning Theory, 1998.

[14] M. Piccardi, "Background subtraction techniques: a review," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '04), vol. 4, pp. 3099–3104, October 2004.

[15] H. Sidenbladh, "Detecting human motion with support vector machines," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 188–191, August 2004.

[16] D. Mezghani, S. Boujelbene, and N. Ellouze, "Evaluation of SVM kernels and conventional machine learning algorithms for speaker identification," International Journal of Hybrid Information Technology, vol. 3, no. 3, pp. 23–34, 2010.

[17] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2013, http://www.csie.ntu.edu.tw/cjlin/libsvm.

[18] J.-C. Liu, C.-H. Lin, J.-L. Yu, W.-S. Lai, and C.-H. Ho, "Anomaly detection using LibSVM training tools," International Journal of Security and Its Applications, vol. 2, no. 4, pp. 89–98, 2008.

[19] C. S. Won, D. K. Park, and S.-J. Park, "Efficient use of MPEG-7 edge histogram descriptor," ETRI Journal, vol. 24, no. 1, pp. 23–30, 2002.

[20] I. Sarker and S. Iqbal, "Content-based image retrieval using haar wavelet transform and color moment," Smart Computing Review, vol. 3, no. 3, pp. 155–165, 2013.

[21] D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.

[22] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[23] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 30, no. 2, pp. 415–425, 2002.

[24] S. Achar, B. Sankaran, S. Nuske, S. Scherer, and S. Singh, "Self-supervised segmentation of river scenes," in Proceedings of the IEEE International Conference on Robotics and Automation, 2011.

[25] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real time obstacle detection in stereovision on non-flat road geometry through v-disparity representation," in Proceedings of the IEEE Intelligent Vehicle Symposium, vol. 2, pp. 646–651, June 2002.

[26] Y. Cong, J.-J. Peng, J. Sun, L.-L. Zhu, and Y.-D. Tang, "V-disparity based UGV obstacle detection in rough outdoor terrain," Acta Automatica Sinica, vol. 36, no. 5, pp. 667–673, 2010.

[27] G. Overett, L. Petersson, N. Brewer, L. Andersson, and N. Pettersson, "A new pedestrian dataset for supervised learning," in Proceedings of the IEEE Intelligent Vehicles Symposium (IV '08), pp. 373–378, June 2008.

[28] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1475–1490, 2004.
