
MVGrasp: Real-Time Multi-View 3D Object Grasping in Highly Cluttered Environments

Hamidreza Kasaei1, Mohammadreza Kasaei2

Abstract—Nowadays robots play an increasingly important role in our daily life. In human-centered environments, robots often encounter piles of objects, packed items, or isolated objects. Therefore, a robot must be able to grasp and manipulate different objects in various situations to help humans with daily tasks. In this paper, we propose a multi-view deep learning approach to handle robust object grasping in human-centric domains. In particular, our approach takes a point cloud of an arbitrary object as input and then generates orthographic views of the given object. The obtained views are finally used to estimate a pixel-wise grasp synthesis for each object. We train the model end-to-end using a small object grasp dataset and test it on both simulated and real-world data without any further fine-tuning. To evaluate the performance of the proposed approach, we performed extensive sets of experiments in three scenarios, including isolated objects, packed items, and piles of objects. Experimental results show that our approach performed very well in all simulation and real-robot scenarios and is able to achieve reliable closed-loop grasping of novel objects across various scene configurations.

I. INTRODUCTION

Industrial robots are mainly designed to perform repetitive tasks in controlled environments. In recent years, there has been increasing interest in the deployment of service robots in human-centric environments. In such unstructured environments, object grasping is a challenging task due to the high demand for grasping a vast number of objects with a wide variety of shapes and sizes under various clutter and occlusion conditions (see Fig. 1). The robot is also expected to accomplish a given task as quickly as possible.

Although several grasping approaches have been developed successfully, many challenges remain. Recent works in grasp synthesis have mainly focused on developing end-to-end deep convolutional learning approaches to plan grasp configurations directly from sensor data. Although it has been proven that these approaches can outperform hand-crafted grasping methods, grasp planning is mainly constrained to top-down grasps from a single depth sensor [1]–[4]. These approaches assume a favorable global camera placement and force the robot to grasp objects from a single direction, which is perpendicular to the image plane (4D grasp). Such constraints limit the flexibility of the robot, and the robot will not be able to grasp a range of household objects, e.g., bottles, boxes, etc. Furthermore, certain objects have convenient parts to be grasped, e.g., the handle of a mug, and in some situations it is

1 Department of Artificial Intelligence, University of Groningen, The Netherlands. Email: [email protected]

2 School of Informatics, University of Edinburgh, UK. Email: [email protected]

Fig. 1. In human-centric environments, a robot often has to deal with three scenarios: (top row) isolated cluttered objects, (bottom left) packed items, and (bottom right) piles of objects. The robot should be able to predict a feasible grasp configuration for the target object based on the poses of the target object and the other objects in the scene.

easier to approach an object from different directions. Some of the deep-learning-based approaches take a very long time to sample and rank grasp candidates (e.g., [1], [5]), while others need to first explore the environment to acquire a full model of the scene and then generate point-wise 6D grasp configurations (e.g., the Volumetric Grasping Network (VGN) [9]). Such 6D grasping approaches are mainly used in open-loop control scenarios and are not suitable for closed-loop scenarios.

In this work, we propose a real-time multi-view deep learning approach to handle object grasping in cluttered environments. Our approach takes as input a partial point cloud and generates multi-view depth images for each of the objects present in the scene. The obtained views are then passed to a view selection function. The best view is then fed to a deep network to estimate a pixel-wise grasp synthesis. Figure 2 depicts an overview of our work. In summary, our key contributions are threefold:

• We propose a new deep learning architecture that receives a depth image as input and produces a pixel-wise antipodal grasp synthesis as output for each object individually. We train the model end-to-end using a small object grasp dataset and test it on both simulated and real-world data without any further fine-tuning.

• We perform extensive sets of experiments in both simulated and real-robot settings to validate the performance of the proposed approach. In particular, we evaluate the performance of the proposed method on three common


Fig. 2. Overview of the proposed approach: we designed a mixed autoencoder for multi-view object grasping to be used in both isolated and highly crowded scenarios. First, multiple views of a given object are generated. The view selection module selects the best view in terms of grasping and feeds it to the grasp network to obtain a pixel-wise grasp synthesis. As shown in the rightmost part of the figure, the best grasp configurations are finally ranked and transformed from 2D to 3D using a set of known transformations.

everyday situations: isolated objects, packed items, and piles of objects. We demonstrate that our approach outperforms previous approaches and achieves a success rate of > 91% in all simulated and real scenarios, except for the simulated pile of objects, where it is 80%.

• We show that the proposed approach is able to achieve reliable closed-loop grasping of novel objects across various scenes and domains. Our approach, on average, could estimate stable grasp configurations in less than 25 ms. Therefore, the proposed approach is suitable for real-time robotic applications that need closed-loop grasp planning.

II. RELATED WORK

Traditional object grasping approaches explicitly model how to grasp different objects, mainly by considering prior knowledge about object shape and pose [6]. It has been proven that it is hard to obtain such prior information for never-seen-before objects in human-centric environments [7]. Recent approaches address this limitation by formulating object grasping as an object-agnostic problem, in which grasp synthesis is detected from visual features without taking prior object-specific information into account. Therefore, these approaches are able to generalize the learned grasping features to novel objects. In this vein, much attention has been given to object grasping approaches based on Convolutional Neural Networks (CNNs) [1], [2], [8], [9]. Deep-learning approaches for object grasping fall into two main categories depending on the input to the network. We briefly review recent approaches in each category.

Point-based approaches: In this category, objects are represented as either a 3D voxel grid or point cloud data and then fed into a CNN with 3D filter banks [5], [9]–[12]. Some approaches first estimate the complete shape of the target object using a variational autoencoder network (e.g., [9], [12]). In other approaches, the robot first moves to various positions to capture different views of the scene, and the obtained views are then combined to create a complete 3D model of the scene (e.g., [10]). Other methods use machine learning to predict the grasp configuration from a partial view of the scene (e.g., [5], [11], [13]). Unlike our approach, these approaches are often computationally expensive and not suitable for real-time closed-loop robotic applications. Furthermore, training such networks requires an enormous amount of data.

Other approaches in this category use point cloud data directly [5], [13], [14]. One of the biggest bottlenecks of these approaches is the execution time and the sensitivity to point cloud resolution. Unlike these methods, our approach generates virtual depth images of the object and then generates grasp syntheses for the obtained object views.

View-based methods: As input to the network, some approaches use depth images. For instance, DexNet [1] and QT-Opt [7] learn only top-down grasping based on depth images from a fixed static camera. Morrison et al. [2] proposed the Generative Grasping CNN (GG-CNN), a small neural network which generates pixel-wise grasp configurations for a given depth image. Kumra et al. [4] developed GR-ConvNet, a large deep network that generates pixel-wise grasp configurations for input RGB and depth data. Although GR-ConvNet performed well on public grasp datasets, illumination and brightness play a role in its performance. Our approach predicts grasp configurations for each object, whereas these approaches generate grasp maps per scene. Additionally, all the reviewed view-based approaches only work for top-down camera settings and are mainly focused on solving 4DoF grasping, which forces the gripper to approach objects from above. The major drawback of these approaches is the inevitably restricted way of interacting with objects. Moreover, the robot is not able to immediately generalize to different task configurations without extensive retraining. We tackle these problems by proposing a multi-view approach for object grasping in highly crowded scenarios. We show that our model can be trained on a small grasp dataset in an end-to-end manner and performs well on both simulated and real-world data without any further fine-tuning.

III. PROBLEM FORMULATION

In this work, the robot uses a single Kinect camera to perceive the environment. We formulate grasp synthesis as a learning problem of planning parallel-jaw grasps for objects in clutter. In particular, we intend to learn a function that receives a collection of virtual depth images of a 3D object as input and returns as output the best view to approach the object and a grasp map, which represents a pixel-wise grasp configuration for the selected view.

A. Generating multiple views of objects

Fig. 3. Two examples of generating the bounding box, local reference frame, and three projected views for: (left) a glass-cleaner; (right) a juice-box.

A three-dimensional (3D) object is usually represented as a point cloud, p_i : i ∈ {1, . . . , n}, where each point is described by its 3D coordinates [x, y, z]. To capture depth images of a 3D object, we need to set up a set of virtual cameras around the object, where the Z axes of the cameras point towards the centroid of the object. We first calculate the geometric center of the object as the average of all points of the object. Then, a local reference frame is created by applying principal component analysis to the normalized covariance matrix, Σ, of the object, i.e., ΣV = EV, where E = diag(e1, e2, e3) contains the eigenvalues sorted in descending order and V = (v1, v2, v3) contains the eigenvectors. Therefore, v1 corresponds to the direction of largest variance of the points of the object. In this work, v1 and the negative of the gravity vector are considered as the X and Z axes, respectively. We define the Y axis as the cross product v1 × Z. The object is then transformed to be placed in this reference frame.
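To make the frame construction concrete, the following NumPy sketch builds the object-centric reference frame from a partial point cloud. It follows the description above; re-orthogonalizing the X axis via a second cross product and ignoring the degenerate case where the principal axis is vertical are our own simplifications, not details taken from the paper.

```python
import numpy as np

def object_reference_frame(points):
    """Build an object-centric frame from an (n, 3) point cloud.

    X follows the largest-variance direction v1, Z is the negative gravity
    vector, and Y = v1 x Z, as described above.
    """
    centroid = points.mean(axis=0)              # geometric center of the object
    centered = points - centroid
    cov = np.cov(centered, rowvar=False)        # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                         # direction of largest variance
    z_axis = np.array([0.0, 0.0, 1.0])          # negative gravity vector (world up)
    y_axis = np.cross(v1, z_axis)
    y_axis /= np.linalg.norm(y_axis)            # assumes v1 is not vertical
    x_axis = np.cross(y_axis, z_axis)           # re-orthogonalized X (our assumption)
    R = np.stack([x_axis, y_axis, z_axis], axis=1)
    return centered @ R, R, centroid            # points expressed in the object frame
```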

From each camera pose, we map the point cloud of the object into a virtual depth image based on z-buffering. In particular, we first project the object onto a square plane, M, centered on the camera's center. The projection area is then divided into an l × l square grid, where each bin is considered as a pixel. Finally, the minimum z of all points falling into a bin is taken as the pixel value. In the case of object-agnostic grasping, since the grasp configurations depend on the pose and size of the target object, a view of the object should not be scale-invariant; we therefore use a fixed-size projection plane (l × l).
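A minimal sketch of the z-buffered orthographic projection described above; the grid resolution l and the physical plane size are illustrative values, and the input points are assumed to already be expressed in the virtual camera frame (Z pointing towards the object).

```python
import numpy as np

def orthographic_depth_view(points_cam, l=64, plane_size=0.3):
    """Project points (camera frame, metres) onto an l x l depth image."""
    img = np.full((l, l), np.inf)
    half = plane_size / 2.0
    # map x/y coordinates onto the fixed-size square plane M
    u = np.floor((points_cam[:, 0] + half) / plane_size * l).astype(int)
    v = np.floor((points_cam[:, 1] + half) / plane_size * l).astype(int)
    valid = (u >= 0) & (u < l) & (v >= 0) & (v < l)
    for ui, vi, z in zip(u[valid], v[valid], points_cam[valid, 2]):
        img[vi, ui] = min(img[vi, ui], z)        # keep the minimum z per bin (z-buffer)
    img[np.isinf(img)] = 0.0                     # empty bins -> background value
    return img
```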

B. View selection for grasping

View selection is crucial to make a multi-view approach computationally efficient. Although it is possible to pass all the views of the object into the network and then execute the grasp with the maximum score (Fig. 4, left), such approaches are computationally expensive. In contrast, choosing a view that covers more of the target object's surface not only reduces the computation time but also increases the likelihood of grasping the object successfully. Information theory provides a range of metrics (variance, entropy, etc.) from which the expected information gain can be calculated. Among these metrics, viewpoint entropy is a good proxy for expected information gain [15]. In particular, viewpoints that observe areas of high entropy are likely to be more informative than those that observe low-entropy areas. Therefore, we formulate our view ranking procedure based on viewpoint entropy, which covers both the number of occupied pixels and the pixels' values. In particular, we calculate the entropy of a normalized projection view, v, as H(v) = −Σ p_k log₂(p_k), where the sum runs over all l² pixels of the view, p_k is the normalized value of pixel k, and Σ p_k = 1. The view with the highest entropy is considered the best view for grasping and is then fed to the network to predict a pixel-wise grasp configuration. We also consider the kinematic feasibility and the distance that the robot would require to travel in configuration space. In the case of large objects, or a pile of objects, there is a clear advantage (e.g., being collision-free) to grasping from above (see Fig. 4, center), while for an isolated object, the direction of approaching the object depends entirely on the pose of the object relative to the camera (see Fig. 4, right). The gripper approaches the object from a direction orthogonal to the projection. It should be noted that the view selection function can be easily adapted to other task criteria.

Fig. 4. Examples of grasping objects in different situations: (left) predicted grasp configurations for the given scene; (center) grasps that are both kinematically feasible and collision-free; (right) the best grasp configuration for grasping a given coke-can object in two different situations.
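The entropy-based view ranking described above can be summarized in a few lines; this sketch normalizes the pixel values of each projected view so that they sum to one and picks the view with the highest entropy (empty pixels contribute nothing to the sum).

```python
import numpy as np

def viewpoint_entropy(view):
    """Entropy H(v) of a projected depth view, computed over its pixel values."""
    total = view.sum()
    if total <= 0:
        return 0.0
    p = view[view > 0].astype(float) / total     # normalized pixel values, sum to 1
    return float(-(p * np.log2(p)).sum())

def select_best_view(views):
    """Return the index of the most informative (highest-entropy) view."""
    return int(np.argmax([viewpoint_entropy(v) for v in views]))
```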

C. Network architecture

We aim to learn a function that maps an input object view to multiple outputs representing pixel-wise antipodal grasp configurations, f_θ : X → Y. Towards this goal, we designed a convolutional autoencoder that receives a depth image with height h and width w pixels as input, x_(i,j) ∈ R^{h×w}, and returns a pixel-wise grasp configuration map, G, i.e., y_(i,j) = [G_(i,j)]. The network is parameterized by its weights θ. Our model is a single-input, multiple-output network and is constructed using three types of layers: convolution, transposed convolution (deconvolution), and batch normalization.

The encoder part is composed of six convolutional layers (C1 to C6). We use a Rectified Linear Unit (ReLU) as the activation function in all layers of the encoder to force negative values to zero and to eliminate the vanishing gradient problem observed with other types of activation functions. We added a batch normalization layer after each convolutional layer to stabilize the learning process and reduce the number of training epochs by keeping the mean output and standard deviation close to 0 and 1, respectively. The decoder part is composed of six transposed convolutional layers (T1 to T6), followed by three separate output layers for predicting grasp quality, width, and rotation. Similar to the encoder, we use the ReLU activation function in all layers and add batch normalization after each layer. We use the same padding in all convolution and transposed convolution layers to make the input and output the same size (see Section IV).

D. Grasp execution

In this work, an antipodal grasp point is represented as a tuple, g_i = ⟨(u, v), φ_i, w_i, q_i⟩, where (u, v) stands for the center of the grasp in the image frame, φ_i represents the rotation of the gripper around the Z axis in the range [−π/2, π/2], w_i shows the width of the gripper, where w_i ∈ [0, w_max], and the success probability of the grasp is represented by q_i ∈ [0, 1]. Given an input view, the network generates multiple outputs, (φ, W, Q) ∈ R^{h×w}, where the pixel values of the images indicate φ_i, w_i, and q_i, respectively. Therefore, from f_θ(I_i) = G_i, the best grasp configuration, g*, is the one with maximum quality, and its coordinates indicate the center of the grasp, i.e., (u, v) ← g* = argmax_Q G_i. Given a grasp object dataset, D, containing n_d images, D = {(x_i, y_i) | 1 ≤ i ≤ n_d}, our model can be trained end-to-end to learn f_θ(·).

After obtaining the grasp map of an input view, the Cartesian position of the selected grasp point, (u, v), can be transformed to the object's reference frame since the transformation of the orthographic view relative to the object is known. The depth value of the grasp point is estimated based on the minimum depth value of the neighbors of (u, v) that lie within a radius of ∆ = 5 mm. Finally, the robot is instructed to perform the grasping action.
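Putting the grasp-map outputs together, a hedged sketch of how the best grasp could be read off the predicted maps: the pixel with the maximum quality gives (u, v), the angle and width maps are indexed at the same pixel, and the depth is taken as the minimum depth among neighbours within the ∆ radius (the conversion of ∆ = 5 mm into a pixel radius is an assumption).

```python
import numpy as np

def extract_best_grasp(quality, angle, width, depth_view, delta_px=3):
    """Select the highest-quality grasp from pixel-wise maps (Q, phi, W)."""
    v, u = np.unravel_index(np.argmax(quality), quality.shape)
    phi, w, q = float(angle[v, u]), float(width[v, u]), float(quality[v, u])
    # minimum depth among neighbours of (u, v) within a small radius (~ Delta)
    patch = depth_view[max(0, v - delta_px):v + delta_px + 1,
                       max(0, u - delta_px):u + delta_px + 1]
    occupied = patch[patch > 0]
    z = float(occupied.min()) if occupied.size else float(depth_view[v, u])
    # (u, v, z) is then transformed to the object/world frame using the
    # known orthographic-view transformation before execution
    return (u, v, z), phi, w, q
```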

IV. RESULTS AND DISCUSSIONS

We performed several rounds of simulation and real-robot experiments to evaluate the performance of the proposed approach. We pursued three goals in these experiments: (i) evaluating the performance of object grasping in three scenarios; (ii) investigating the usefulness of formulating object grasping as an object-agnostic problem for general-purpose tasks; (iii) determining whether the same network can be used in both simulation and real-robot systems without additional fine-tuning. Towards this goal, we employed the same code and network (trained on the Cornell dataset) in both simulation and real-robot experiments.

A. Experimental setup

Figure 5 shows our experimental setup in simulation (left) and on the real robot (right). In this work, we developed a simulation environment in Gazebo similar to our real-robot setup to extensively evaluate the performance of our approach as well as its ability to adapt to different test environment settings. The robot and the camera in the simulated environment were placed according to the real-robot setup to obtain consistent performance. Our setup consists of a Universal Robots UR5e with a two-fingered Robotiq 2F-140 gripper, a Kinect camera

Fig. 5. Our experimental setups in (left) the simulation environment and (right) the real-world setting. It should be noted that both the simulated and real sets of objects used for evaluation are shown in these figures.

mounted on a tripod, and a user interface to start and stop the experiments.

To assess grasp performance, we designed a clear-table task (see Fig. 1), where the robot has to pick up all objects from its workspace and put them into a basket. In all experiments, the robot knows the pose of the basket (the placing position) in advance. We remove the table plane from the measured point cloud and then cluster the remaining points [16], [17]. We consider each cluster as an object candidate. Note that object segmentation is beyond the scope of this paper and more advanced techniques could be considered, e.g., [18].
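As a rough illustration of this pre-processing step, the sketch below removes the dominant plane with RANSAC and clusters the remaining points with DBSCAN, using Open3D as a stand-in for the PCL-based pipeline of [16], [17]; all thresholds are illustrative values.

```python
import numpy as np
import open3d as o3d

def segment_object_candidates(pcd, plane_dist=0.01, eps=0.02, min_points=50):
    """Remove the table plane and return one point cloud per object candidate."""
    _, plane_idx = pcd.segment_plane(distance_threshold=plane_dist,
                                     ransac_n=3, num_iterations=1000)
    remaining = pcd.select_by_index(plane_idx, invert=True)   # points above the table
    labels = np.asarray(remaining.cluster_dbscan(eps=eps, min_points=min_points))
    n_clusters = int(labels.max()) + 1 if labels.size else 0
    return [remaining.select_by_index(np.where(labels == k)[0].tolist())
            for k in range(n_clusters)]                        # each cluster = one candidate
```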

At the beginning of each experiment, we set the robot to a pre-defined configuration and randomly place objects on the table. Afterward, the robot needs to predict grasp syntheses, select the best graspable pose of the target object, pick it up, and put it in the basket. This procedure is repeated until all objects are removed from the table, or the robot cannot find an executable grasp point for the remaining objects. We use an RRT motion planner to check for a collision-free path to each grasp pose and execute the one with the highest quality score. We assessed the performance of our approach by measuring the success rate, i.e., the number of successful grasps divided by the number of attempts. If the robot places the object inside the basket, we consider the attempt a success; otherwise, it is a failure. If none of the predicted grasp points can be executed, we also consider the trial a failure. For grasping experiments in simulation, we used 20 simulated objects imported from different resources (e.g., the YCB dataset [19], the Gazebo repository, etc.). For real-robot experiments, we used 20 daily-life objects with different materials, shapes, sizes, and weights (see Fig. 5). All objects were inspected to make sure that at least one side fits within the gripper.

We benchmark our approach against three single-modality (depth-only) deep learning approaches, DexNet [1], GG-CNN [2], and Morrison et al. [3], and one analytical approach, Grasp Pose Detection (GPD) [13]. In our experiments, DexNet [1], GG-CNN [2], and Morrison et al. [3] have access to a globally projected top-down view of the full scene, while our approach uses projected views of the target object. GPD uses the partial point cloud of the object as input. All tests were performed on a PC running Ubuntu 18.04 with a 3.20 GHz Intel Xeon(R) i7 CPU and an NVIDIA Quadro P5000 GPU.

B. Ablation Studies

We study the impact of each of the key components of our approach, i.e., the network architecture and the view selection strategy, through several ablation experiments.

1) Network architecture: To train the proposed network, we used the extended version of the Cornell dataset [20], containing 1035 RGB-D images of 240 household objects, for optimizing the architecture and parameters of our network. In particular, we considered the 5110 positive grasp configurations and discarded all the negative labels. Furthermore, since the Cornell dataset is small, we augmented the data by zooming, random cropping, and rotating to generate 51100 images. 80% of the augmented data is used for training and the remaining 20% is used as the evaluation set.


We trained several networks with the proposed architecture but different parameters, including filter size, dropout rate, loss function, optimizer, and various learning rates, for 100 epochs each. We used the Intersection over Union (IoU) metric. In particular, a grasp pose is considered a valid grasp if the intersection of the predicted grasp rectangle and the ground-truth rectangle is more than 25%, and the orientation difference between the predicted and ground-truth grasp rectangles is less than 30 degrees. The final architecture is shaped as: C(9×9×8, S3), C(5×5×16, S2), C(5×5×16, S2), C(3×3×32), C(3×3×32), C(3×3×32), T(3×3×32), T(3×3×32), T(3×3×32), T(5×5×16, S2), T(5×5×32, S2), T(9×9×32, S3), where S stands for stride. We used the Adam optimizer with a learning rate of 0.001 and Mean Squared Error as the loss function.
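For reference, a Keras sketch of a network matching the layer specification above (ReLU and batch normalization after every convolution and transposed convolution, 'same' padding, three map-sized output heads, Adam with a learning rate of 0.001, and an MSE loss). The 300 × 300 input size, the 1 × 1 convolutions used for the output heads, and the sigmoid on the quality head are our assumptions, not details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_bn(x, filters, kernel, stride=1):
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def deconv_bn(x, filters, kernel, stride=1):
    x = layers.Conv2DTranspose(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

inp = layers.Input(shape=(300, 300, 1))        # depth view; size chosen so the strides compose
x = conv_bn(inp, 8, 9, 3)                      # C(9x9x8, S3)
x = conv_bn(x, 16, 5, 2)                       # C(5x5x16, S2)
x = conv_bn(x, 16, 5, 2)                       # C(5x5x16, S2)
x = conv_bn(x, 32, 3)                          # C(3x3x32) x 3
x = conv_bn(x, 32, 3)
x = conv_bn(x, 32, 3)
x = deconv_bn(x, 32, 3)                        # T(3x3x32) x 3
x = deconv_bn(x, 32, 3)
x = deconv_bn(x, 32, 3)
x = deconv_bn(x, 16, 5, 2)                     # T(5x5x16, S2)
x = deconv_bn(x, 32, 5, 2)                     # T(5x5x32, S2)
x = deconv_bn(x, 32, 9, 3)                     # T(9x9x32, S3)

quality = layers.Conv2D(1, 1, activation="sigmoid", name="quality")(x)  # head design: assumption
width   = layers.Conv2D(1, 1, name="width")(x)
angle   = layers.Conv2D(1, 1, name="angle")(x)

model = Model(inp, [quality, width, angle])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```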

TABLE I
RESULT OF OBJECT GRASPING ON THE CORNELL DATASET [20].

Approach              Input data   IoU (%)
DexNet [1]*           depth        89.0
GG-CNN [2]*           depth        75.05
Morrison et al. [3]*  depth        78.41
Ours                  depth        89.51

* retrained from scratch.

We compared our approach with three depth-only grasp prediction baselines, including DexNet [1], GG-CNN [2], and Morrison et al. [3]. Table I provides a summary of the results obtained. By comparing all approaches, it is clear that our approach outperformed all the selected approaches. Concerning the IoU metric, our approach achieved 89.51%, which was 0.51, 14.46, and 11.10 percentage points (p.p.) better than DexNet [1], GG-CNN [2], and Morrison et al. [3], respectively. Furthermore, we found that while our approach, GG-CNN, and Morrison et al. [3] are suitable for closed-loop real-time scenarios (> 45 Hz, tested on the mentioned hardware), DexNet often took a long time to predict grasp candidates for a given input. This is because the DexNet model contains a sampling routine that often takes a long time depending on the complexity of the scene (e.g., the shape of the object and the number of objects in the scene).

2) View selection for grasping: Another key component of the proposed method is the view selection strategy. We therefore performed extensive sets of experiments in the context of an isolated object removal task to study the impact of view selection on grasping. For these experiments, we randomly placed an arbitrary object inside the robot's workspace and instructed the robot to grasp the object and put it into the basket.

Each object was tested 50 times in simulation and 10 times in the real environment. To speed up the real-robot experiments, we randomly placed five objects on the table. In each execution cycle, the robot selected the object nearest to its base and tried to remove it from the table. Results are reported in Table II. Comparing both real and simulation experiments, it is evident that our approach outperformed GPD by a large margin. In particular, in the simulation experiments, we achieved a grasp success rate of 89.7% (i.e., 897 successes out of 1000 trials), while GPD and GG-CNN obtained 78.7% and 72.6%, respectively. We visualize the best grasp configuration for 10 simulated objects in Fig. 6.

In the real-robot experiments, the success rate of our approach was 90.5% (181 successes out of 200 attempts), which was 9.5 and 12.0 percentage points better than GPD and GG-CNN, respectively.

Fig. 6. Qualitative results for 10 never-seen-before household objects in the isolated scenario: visualizing the objects in Gazebo and their best grasp configurations. These results show that our approach learned the intended object-agnostic grasp function very well.

This was due to the fact that the proposed approach generated pixel-wise grasp configurations for the most informative view of the target object, resulting in a variety of grasp options. This was not the case for the GPD and GG-CNN approaches.

TABLE II
ISOLATED SCENARIO

Method           Type   Success rate (%)
GPD              sim    78.7 (787/1000)
GG-CNN           sim    72.6 (726/1000)
Ours (top-down)  sim    73.2 (732/1000)
Ours (random)    sim    51.3 (513/1000)
Ours             sim    89.7 (897/1000)
GPD              real   81.0 (162/200)
GG-CNN           real   78.5 (157/200)
Ours (top-down)  real   67.5 (135/200)
Ours (random)    real   49.0 (98/200)
Ours             real   90.5 (181/200)

In particular, GPD often returned only a few grasp configurations and GG-CNN only returned top-down configurations. Occasionally, none of the returned configurations led to a successful grasp. In the case of GG-CNN and our approach with top-down-only view selection, failures mainly happened when grasping the soda-can, bottle, human toy, and mustard objects, since the supporting area around the selected grasp point was too small and, therefore, the object slipped and fell during manipulation. In the case of our approach with random view selection, the main failures were due to collisions with the table, e.g., grasping a toppled soda-can from the side. Some failures also occurred when one of the fingers of the gripper was tangent to the surface of the target object, which led to pushing the object away. In the case of our approach with view selection, the failed attempts were mainly due to inaccurate bounding box estimation, some objects in specific poses having a very low grasp quality score, and collisions between the object and the bin (which mainly happened for large objects, e.g., Pringles and the juice box). The experiments indicated that the proposed approach worked well for grasping isolated objects in both simulation and real-world environments without fine-tuning. In the following subsections, we test the performance of the approach in two challenging cluttered scenarios.

C. Grasp evaluation in cluttered environments

For these experiments, we randomly generate an evaluation scene consisting of four to six objects for the packed objects and pile of objects scenarios. To generate a simulated scene containing a pile of objects, we randomly spawn objects into a box placed on top of the table. We wait for a couple of seconds


Fig. 7. Illustrative examples of the two cluttered evaluation scenarios in Gazebo: (left) packed objects; (right) pile of objects.

until all objects become stable, and then remove the box, resulting in a cluttered pile of objects. To generate a packed scenario, we iteratively place a set of objects next to each other in the workspace. An example of each scenario is shown in Fig. 7 (left and right). In the case of real-robot experiments, we randomly put four to six objects in a box, shake the box to remove bias, and finally pour the objects in front of the robot to make a pile of objects. For the packed experiments, we manually designed scenes by putting several objects tightly together. In this round of evaluation, in addition to the success rate, we report the average percentage of objects removed from the workspace. An experiment continues until either all objects are removed from the workspace, three failures occur consecutively, or the quality of the best grasp candidate is lower than a pre-defined threshold, τ ∈ {0.8, 0.9}. In each execution cycle, the robot must execute the best possible grasp synthesis. Moreover, we also performed experiments with GPD and GG-CNN to compare their results with ours.
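The evaluation protocol for the cluttered scenarios can be summarized as the loop below; `robot` is a hypothetical interface used only for illustration, and the stopping conditions mirror the description above (workspace cleared, three consecutive failures, or best grasp quality below τ).

```python
def run_clutter_trial(robot, tau=0.8, max_consecutive_failures=3):
    """Clear-table trial used in the packed/pile experiments (sketch)."""
    successes = attempts = consecutive_failures = 0
    while robot.objects_remaining() and consecutive_failures < max_consecutive_failures:
        grasp = robot.predict_best_grasp()            # highest-quality feasible grasp
        if grasp is None or grasp.quality < tau:      # stop below the quality threshold
            break
        attempts += 1
        if robot.execute_and_place_in_basket(grasp):  # success = object ends up in the basket
            successes += 1
            consecutive_failures = 0
        else:
            consecutive_failures += 1
    return successes, attempts                        # success rate = successes / attempts
```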

1) Packed experiments: Results are reported in Table III. It is clear from the results that setting τ to 0.9 allowed the robot to remove fewer objects from the workspace while achieving a higher success rate, whereas setting τ to 0.8 led to a good balance between success rate and percentage of objects removed.

TABLE III
PERFORMANCE ON PACKED SCENARIO

Method    Type   Success rate     Percent cleared
GPD       sim    0.54 (139/256)   0.70 (139/200)
GG-CNN    sim    0.55 (144/262)   0.72 (144/200)
τ = 0.8   sim    0.84 (176/210)   0.88 (175/200)
τ = 0.9   sim    0.94 (141/150)   0.71 (141/200)
GPD       real   0.64 (31/48)     0.76 (31/40)
GG-CNN    real   0.46 (25/54)     0.63 (25/40)
τ = 0.8   real   0.91 (38/42)     0.90 (36/40)

Our approach outperformed both GPD and GG-CNN by a large margin in both the simulated and real-robot experiments (> 25%). On closer inspection of the real-robot experiments, we found that the proposed method could successfully grasp 38 objects out of 42 attempts, resulting in a 91% success rate and 90% percent cleared, while GPD resulted in a 64% success rate and 76% percent cleared. In the case of GG-CNN, the success rate and percent cleared degraded to 46% and 63%, respectively, with 29 unsuccessful grasp attempts.

We found that mug-like objects and bottle-like objects are difficult to grasp for GPD and GG-CNN, respectively, as the target object mostly slipped out of the gripper during the manipulation phase. We observed that the proposed approach is able to predict robust grasp quality scores for a given object. Figure 8 illustrates four examples of packed removal experiments.

Fig. 8. Qualitative results on packed scenarios: visualizing the top-three grasp configurations on four different densely-packed objects.

2) Pile experiments: The obtained results are summarized in Table IV. We use an example to explain the results. Figure 9 depicts a successful sequence of removing a pile of four objects using the proposed approach. It was observed that after removing the Mustard and Colgate objects from the workspace (Fig. 9 a, b), the complexity of the scene reduced significantly. Therefore, the robot could find more grasp configurations whose grasp quality exceeded the threshold (Fig. 9 c, d). As shown in this example, while the robot was interacting with an object, the poses of the other objects changed completely, resulting in situations in which the target object was not graspable (e.g., the toppled Oreo box). Such situations were one of the main reasons for unsuccessful attempts.

TABLE IV
PERFORMANCE ON PILE SCENARIO

Method    Type   Success rate     Percent cleared
GPD       sim    0.61 (131/214)   0.66 (131/200)
GG-CNN    sim    0.64 (143/223)   0.72 (143/200)
τ = 0.8   sim    0.80 (153/192)   0.77 (153/200)
GPD       real   0.72 (31/43)     0.78 (31/40)
GG-CNN    real   0.78 (32/41)     0.80 (32/40)
τ = 0.8   real   0.92 (35/38)     0.88 (35/40)

Some other failures occurred due to a lack of friction, applying limited force to the object, collisions with other objects, and unstable grasp predictions.

By comparing the results obtained from the different rounds of experiments (isolated, packed, and pile), we found that top-down grasps performed poorly for isolated objects (see Table II) but worked well for packed/piled objects (see Tables III and IV). The underlying reason is that, in the pile and packed scenarios, the center of the packed items and the center of the pile of objects were defined such that all objects could be reached from above, whereas in the isolated object scenario, objects were randomly placed within the workspace. Consequently, in some cases the target object could not be reached from above. A video of these experiments is available online: https://youtu.be/P9ehBDbGLnY

Fig. 9. An example of a successful sequence of removing a pile of objects: in this experiment, the robot successfully removed the Mustard, Colgate, Juice box, and Coke can objects one by one. In each execution cycle, grasp configurations that are collision-free and kinematically feasible are shown in green.


V. CONCLUSION

In this paper, we proposed a deep learning approach for real-time multi-view 3D object grasping in highly cluttered environments. We trained the approach in an end-to-end manner. The proposed approach allows robots to robustly interact with the environment in both isolated and highly crowded scenarios. In particular, for a given scene, our approach first generates three orthographic views. The best view is then selected and fed to the network to predict a pixel-wise grasp configuration for the given object. The robot is finally commanded to execute the highest-ranked grasp synthesis. To validate the performance of the proposed method, we performed extensive sets of real and simulation experiments in three scenarios: isolated, packed, and pile of objects. Experimental results showed that the proposed method worked very well in all three scenarios and outperformed the selected state-of-the-art approaches. In the continuation of this work, we would like to investigate the possibility of improving grasp performance by learning a shape completion function that receives a partial point cloud of a target object and generates a complete model. We would then use the full model of the object to estimate the grasp synthesis map. Another direction would be to extend the proposed approach with an eye-in-hand system to gain more flexibility in reconstructing the parts of the environment that we are interested in. In particular, by moving the sensor to the desired areas, we can capture significantly more detail that would otherwise not be visible. This information can be very helpful and lead to better grasp planning.

REFERENCES

[1] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017.

[2] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” in Proc. of Robotics: Science and Systems (RSS), 2018.

[3] ——, “Learning robust, real-time, reactive robotic grasping,” The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 183–201, 2020.

[4] S. Kumra, S. Joshi, and F. Sahin, “Antipodal robotic grasping using generative residual convolutional neural network,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 9626–9633.

[5] A. Mousavian, C. Eppner, and D. Fox, “6-DoF GraspNet: Variational grasp generation for object manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2901–2910.

[6] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,” IEEE Transactions on Robotics, vol. 30, no. 2, pp. 289–309, 2013.

[7] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning, 2018.

[8] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.

[9] M. Breyer, J. J. Chung, L. Ott, R. Siegwart, and J. Nieto, “Volumetric grasping network: Real-time 6 DOF grasp detection in clutter,” in Conference on Robot Learning, 2020.

[10] J. Lundell, F. Verdoja, and V. Kyrki, “Beyond top-grasps through scene completion,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 545–551.

[11] Y. Li, L. Schomaker, and S. H. Kasaei, “Learning to grasp 3D objects using deep residual U-nets,” in 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2020, pp. 781–787.

[12] J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen, “Shape completion enabled robotic grasping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 2442–2447.

[13] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 598–605.

[14] H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang, “PointNetGPD: Detecting grasp configurations from point sets,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3629–3635.

[15] S. Thrun, “Probabilistic robotics,” Communications of the ACM, vol. 45, no. 3, pp. 52–57, 2002.

[16] R. B. Rusu and S. Cousins, “3D is here: Point Cloud Library (PCL),” in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 1–4.

[17] S. H. Kasaei, M. Oliveira, G. H. Lim, L. S. Lopes, and A. M. Tomé, “Towards lifelong assistive robotics: A tight coupling between object perception and manipulation,” Neurocomputing, vol. 291, pp. 151–166, 2018.

[18] Y. Xiang, C. Xie, A. Mousavian, and D. Fox, “Learning RGB-D feature embeddings for unseen object instance segmentation,” in Conference on Robot Learning (CoRL), 2020.

[19] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Yale-CMU-Berkeley dataset for robotic manipulation research,” The International Journal of Robotics Research, vol. 36, no. 3, pp. 261–268, 2017.

[20] Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from RGBD images: Learning using a new rectangle representation,” in IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 3304–3311.

