Object detection, shape recovery, and 3D modelling by depth-encoded Hough voting☆
Introduction
Detecting objects and estimating their geometric properties are crucial problems in many application domains such as robotics, autonomous navigation, high-level visual scene understanding, surveillance, gaming, object modelling, and augmented reality. For instance, if one wants to design a robotic system for grasping and manipulating objects, it is of paramount importance to encode the ability to accurately estimate object orientation (pose) from the camera viewpoint as well as recover structural properties such as its 3D shape. This information will help the robotic arm grasp the object at the right location and successfully interact with it. Moreover, if one wants to augment the observation of an environment with virtual objects, the ability to reconstruct visually pleasing 3D models for object categories is very important.
This paper addresses the above needs, and tackles the following challenges: (i) Learn models of object categories by combining view-specific depth maps along with the associated 2D images of object instances of the same class from different vantage points. Depth maps with registered RGB images can be easily collected using sensors such as the Kinect [5]. We demonstrate that combining imagery with 3D information helps build richer models of object categories that can in turn make detection and pose estimation more accurate. (ii) Design a coherent and principled scheme for detecting objects and estimating their pose from either just a single image (when no depth maps are available in testing) (Fig. 1b), or a single image augmented with depth maps (when these are available in testing). In the latter case, 3D information can be conveniently used by the detection scheme to make detection and pose estimation more robust than in the single-image case. (iii) Have our detection scheme reconstruct the 3D model of the object from just a single uncalibrated image (when no 3D depth maps are available in testing) (Fig. 1c–g) and without having seen the object instance during training.
In this paper, we propose a two-stage approach to address the above challenges (Fig. 2). In the first stage, our approach seeks to (i) detect the object in the image, (ii) estimate its pose, and (iii) recover a rough estimate of the object's 3D structure (if no depth maps are available in testing). This is achieved by introducing a new formulation of the Implicit Shape Model (ISM) [1] and generalized Hough voting scheme [7]. In our formulation, depth information is incorporated into the process of learning distributions of object image patches that are compatible with the underlying object location (shape) in the image plane. We call our scheme DEHV – Depth-Encoded Hough Voting scheme (Section 3.1). DEHV addresses an intrinsic weakness of existing Hough voting schemes [1], [8], [9], [10], where errors in estimating the scale of each image object patch directly affect the ability of the algorithm to cast consistent votes for the object's existence. To resolve this ambiguity, we take advantage of the interplay between the scale of each object patch in the image and the distance (depth) of the corresponding physical patch on the 3D object from the camera, and specifically use the fact that objects (or object parts) that are closer to the camera result in image patches with larger scales. Depth is encoded in training by using available depth maps of the object from a number of viewpoints. At recognition time, DEHV is applied to detect objects (Fig. 1b), estimate their pose, and simultaneously infer their 3D structure given hypotheses of detected objects (Fig. 1c). The object's 3D structure is inferred at recognition time by estimating (decoding) the depth (distance) from the camera center of each image patch involved in the voting. Critically, depth decoding can be achieved even if just a single test image is provided. If depth maps are available in testing, the additional information can be used to further validate whether a given detection hypothesis is correct.
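The scale–depth interplay at the core of this voting scheme can be illustrated with a minimal sketch. The codebook structure, numbers, and function names below are our own illustrative assumptions, not the paper's implementation: each codebook entry stores a patch's offset to the object center, the scale at which it was observed in training, and its training depth; at test time, the observed patch scale both rescales the center vote and decodes the patch's depth (closer patches appear larger, so depth varies inversely with observed scale).

```python
import numpy as np

# Hypothetical codebook: each entry stores the patch's 2D offset to the
# object center, the scale at which it was observed in training, and its
# depth (distance from the camera) read from the training depth map.
codebook = [
    {"offset": np.array([-20.0, 0.0]), "scale_train": 16.0, "depth_train": 2.0},
    {"offset": np.array([20.0, 10.0]), "scale_train": 8.0, "depth_train": 4.0},
]

def cast_vote(patch_xy, scale_test, entry):
    """Vote for the object center and decode the patch's depth.

    A patch closer to the camera appears at a larger scale, so depth
    scales inversely with the observed patch scale:
        depth_test = depth_train * scale_train / scale_test
    The stored offset is rescaled by the same ratio before voting.
    """
    ratio = entry["scale_train"] / scale_test
    center_vote = np.asarray(patch_xy) + entry["offset"] / ratio
    depth_test = entry["depth_train"] * ratio
    return center_vote, depth_test

# A test patch matched to codebook entry 0, observed at twice the
# training scale, i.e. at half the training depth.
center, depth = cast_vote([100.0, 50.0], 32.0, codebook[0])
print(center, depth)  # depth = 2.0 * 16/32 = 1.0
```

In a full detector, many such votes would be accumulated in a Hough space and local maxima taken as detection hypotheses; the decoded depths of the patches supporting a maximum then yield the sparse 3D structure described above.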
We summarize the inferred quantities in Fig. 3 and the required supervision in Fig. 4. Notice that the inferred object 3D structure from stage one is partial (it does not account for the portions of the object that are not visible from the query image) and sparse (it only recovers depth for each voting patch). The goal of the second stage is to obtain a full 3D object model where both 3D structure and albedo properties (texture) are also recovered.
In the second stage, the information inferred from stage one (object location in the image, scale, pose, and rough 3D structure) is used to obtain a full 3D model of the object. Specifically, we consider a 3D modelling stage where a full 3D model of the object is obtained by 3D shape recovery and texture completion (Section 3.2). We carry out 3D shape recovery (i.e., infer the shape of the unseen regions) by: (i) utilizing 3D shape exemplars from a database of 3D CAD models, which can be collected from [11] and other online 3D warehouses, or obtained by shape from silhouette [12], and (ii) applying a novel 2D + 3D iterative closest point (ICP) matching algorithm which jointly registers the best 3D CAD model to the inferred 3D shape and the occlusion boundaries of the back-projected 3D CAD model to object contours in the image. By choosing the best fit, our system obtains a plausible full reconstruction of the object's 3D shape (Section 3.3) (Fig. 1d). Object appearance is rendered by texture mapping the object image onto the 3D shape. Such texture is clearly incomplete, as non-visible object surface areas cannot be texture mapped (Fig. 1e). Thus, we perform texture completion by: (i) transferring texture to such non-visible object surface areas by taking advantage of the fact that some object categories are symmetric (when possible) (Fig. 1f) and (ii) using an error-tolerant image compositing technique inspired by [6] to fill the un-textured regions (i.e., holes) (Section 3.4) (Fig. 1g). We summarize the required supervision in Fig. 4.
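The registration step can be sketched with a minimal point-to-point rigid ICP. This is a simplification under our own assumptions: the paper's 2D + 3D variant additionally matches the occlusion boundaries of the back-projected CAD model to image contours, which is omitted here, and all names and data are illustrative.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (Kabsch algorithm via SVD), given one-to-one correspondences."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def icp(model, scene, n_iter=20):
    """Iteratively register `model` points (e.g., a CAD exemplar) to
    `scene` points (e.g., the sparse 3D structure inferred in stage one)."""
    cur = model.copy()
    for _ in range(n_iter):
        # Brute-force nearest-neighbour correspondences, for clarity.
        d2 = ((cur[:, None, :] - scene[None, :, :]) ** 2).sum(axis=-1)
        matched = scene[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur

# Synthetic check: the scene is the model under a small known rotation
# about z plus a translation; ICP should recover the alignment.
rng = np.random.default_rng(0)
model = rng.normal(size=(50, 3))
a = 0.05  # radians
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
scene = model @ Rz.T + np.array([0.02, -0.03, 0.05])
aligned = icp(model, scene)
```

In a full system, the residual of the converged fit would be one term in the score used to choose the best CAD exemplar among the candidates.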
Extensive experimental analysis on a number of public datasets (including car Pascal VOC07 [2], mug ETHZ Shape [3], and mouse and stapler 3D object dataset [13]) and two in-house datasets (comprising at most five object categories), where ground truth 3D information is available, is used to validate our claims (Section 4). Experiments with the in-house datasets demonstrate that our DEHV scheme: (i) achieves better detection rates (compared to the traditional Hough voting scheme), with further improvement observed when depth maps are available in testing; (ii) produces convincing 3D reconstructions from single images, whose accuracy has been qualitatively assessed with respect to ground truth depth maps; (iii) achieves accurate 3D shape recovery and visually pleasing texture completion results. Experiments with the public datasets demonstrate that DEHV successfully scales to different types of categories and works in challenging conditions (severe background clutter, occlusions). DEHV achieves state-of-the-art detection results on several categories in the ETHZ Shape dataset [3], and competitive pose estimation results on the 3D object dataset [13]. We also evaluate the accuracy of shape completion and the quality of texture completion on the 3D modelling dataset (Section 3.2). Finally, we show typical results demonstrating that DEHV is capable of producing convincing 3D reconstructions from single uncalibrated images using the Pascal VOC07 [2], ETHZ Shape [3], and 3D object [13] datasets in Figs. 15 and 19.
Previous work
In the last decade, the vision community has made substantial progress addressing the problem of object categorization from 2D images. While most of the work has focussed on representing objects as 2D models [14], [1], [15] or collections of 2D models [16], very few methods have tried to combine in a principled way the appearance information that is captured by images and the intrinsic 3D structure of an object category. Works by [17], [13], [4] have proposed solutions for modelling the way how …
Our method
To summarize, our method can be roughly decomposed into a recognition/reconstruction stage and a 3D modelling stage.
In the recognition/reconstruction stage, Depth-Encoded Hough Voting (DEHV) detectors [64], trained with both object 3D shape and local diagnostic appearance information, identify object locations and classes, and recover approximate and partial 3D structure information from a single query image (Section 3.1) (Fig. 1(a–c)).
Because we obtain only a partial reconstruction (object …
Experiment
We conduct experiments to evaluate the object detection and shape recovery performance of our DEHV algorithm in Section 4.1, and the quality of 3D modelling in terms of both shape recovery and texture completion in Section 4.2. Typical failure cases of the object detector and the 3D ICP are shown in Fig. 18(a) and (b), respectively.
Conclusion
We proposed a new detection scheme called DEHV which can successfully detect objects and estimate their pose from either a single 2D image or a 2D image combined with depth information. Moreover, we demonstrated that DEHV is capable of recovering the 3D shape of object categories from just a single uncalibrated image. Given such a partial 3D shape of the object, we showed that novel 3D shape recovery and texture completion techniques can be applied to fully reconstruct the 3D model of the object.
Acknowledgments
We acknowledge the support of the NSF (Grant CNS 0931474); the Gigascale Systems Research Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity; a Google Research Award (SC347174); and Willow Garage Inc. for collecting the 3D table-top object category dataset.
References (74)
- B. Leibe, A. Leonardis, B. Schiele, Combined object categorization and segmentation with an implicit shape model, in:...
- M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007...
- et al., Groups of adjacent contour segments for object detection, IEEE Trans. PAMI (2008)
- S. Savarese, L. Fei-Fei, View synthesis for recognizing unseen poses of object classes, in: ECCV,...
- Microsoft Corp. Redmond WA, Kinect for Xbox...
- M.W. Tao, M.K. Johnson, S. Paris, Error-tolerant image compositing, in: ECCV,...
- D.H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern...
- J. Gall, V. Lempitsky, Class-specific Hough forests for object detection, in: CVPR,...
- S. Maji, J. Malik, Object detection using a max-margin Hough transform, in: CVPR,...
- B. Ommer, J. Malik, Multi-scale object detection by clustering lines, in: ICCV,...
- The visual hull concept for silhouette-based image understanding, IEEE Trans. PAMI
- Using multi-view recognition and meta-data annotation to guide a robot's attention, Int. J. Rob. Res.
- Recognizing solid objects by alignment with an image, IJCV
- Representations and Techniques for 3D Object Recognition and Scene Interpretation
- Visual modeling with a hand-held camera, IJCV
- Modelling and interpretation of architecture from several images, IJCV
- Calibrated, registered images of an extended urban area, IJCV
☆ This paper has been recommended for acceptance by Carlo Colombo.