Object detection, shape recovery, and 3D modelling by depth-encoded Hough voting☆
Introduction
Detecting objects and estimating their geometric properties are crucial problems in many application domains such as robotics, autonomous navigation, high-level visual scene understanding, surveillance, gaming, object modelling, and augmented reality. For instance, if one wants to design a robotic system for grasping and manipulating objects, it is of paramount importance to encode the ability to accurately estimate object orientation (pose) from the camera viewpoint as well as recover structural properties such as its 3D shape. This information will help the robotic arm grasp the object at the right location and successfully interact with it. Moreover, if one wants to augment the observation of an environment with virtual objects, the ability to reconstruct visually pleasing 3D models for object categories is very important.
This paper addresses the above needs, and tackles the following challenges: (i) Learn models of object categories by combining view-specific depth maps along with the associated 2D images of object instances of the same class from different vantage points. Depth maps with registered RGB images can be easily collected using sensors such as the Kinect [5]. We demonstrate that combining imagery with 3D information helps build richer models of object categories that can in turn make detection and pose estimation more accurate. (ii) Design a coherent and principled scheme for detecting objects and estimating their pose from either just a single image (when no depth maps are available in testing) (Fig. 1b), or a single image augmented with depth maps (when these are available in testing). In the latter case, 3D information can be conveniently used by the detection scheme to make detection and pose estimation more robust than in the single-image case. (iii) Have our detection scheme reconstruct the 3D model of the object from just a single uncalibrated image (when no 3D depth maps are available in testing) (Fig. 1c–g) and without having seen the object instance during training.
In this paper, we propose a two-stage approach to address the above challenges (Fig. 2). In the first stage, our approach seeks to (i) detect the object in the image, (ii) estimate its pose, and (iii) recover a rough estimate of the object's 3D structure (if no depth maps are available in testing). This is achieved by introducing a new formulation of the Implicit Shape Model (ISM) [1] and generalized Hough voting scheme [7]. In our formulation, depth information is incorporated into the process of learning distributions of object image patches that are compatible with the underlying object location (shape) in the image plane. We call our scheme DEHV – Depth-Encoded Hough Voting scheme (Section 3.1). DEHV addresses an intrinsic weakness of existing Hough voting schemes [1], [8], [9], [10], where errors in estimating the scale of each image object patch directly affect the ability of the algorithm to cast consistent votes for the object's existence. To resolve this ambiguity, we take advantage of the interplay between the scale of each object patch in the image and the distance (depth) of the corresponding physical patch on the 3D object from the camera, and specifically use the fact that objects (or object parts) that are closer to the camera result in image patches with larger scales. Depth is encoded in training by using available depth maps of the object from a number of viewpoints. At recognition time, DEHV is applied to detect objects (Fig. 1b), estimate their pose, and simultaneously infer their 3D structure given hypotheses of detected objects (Fig. 1c). The object's 3D structure is inferred at recognition time by estimating (decoding) the depth (distance) from the camera center of each image patch involved in the voting. Critically, depth decoding can be achieved even if just a single test image is provided. If depth maps are available in testing, the additional information can be used to further validate whether a given detection hypothesis is correct.
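The scale–depth interplay at the core of this voting scheme can be illustrated with a minimal sketch. The codebook structure, numbers, and function names below are our own illustrative assumptions, not the paper's implementation: each codebook entry stores a patch's offset to the object center, the scale at which it was observed in training, and its training depth; at test time, the observed patch scale both rescales the center vote and decodes the patch's depth (closer patches appear larger, so depth varies inversely with observed scale).

```python
import numpy as np

# Hypothetical codebook: each entry stores the patch's 2D offset to the
# object center, the scale at which it was observed in training, and its
# depth (distance from the camera) read from the training depth map.
codebook = [
    {"offset": np.array([-20.0, 0.0]), "scale_train": 16.0, "depth_train": 2.0},
    {"offset": np.array([20.0, 10.0]), "scale_train": 8.0, "depth_train": 4.0},
]

def cast_vote(patch_xy, scale_test, entry):
    """Vote for the object center and decode the patch's depth.

    A patch closer to the camera appears at a larger scale, so depth
    scales inversely with the observed patch scale:
        depth_test = depth_train * scale_train / scale_test
    The stored offset is rescaled by the same ratio before voting.
    """
    ratio = entry["scale_train"] / scale_test
    center_vote = np.asarray(patch_xy) + entry["offset"] / ratio
    depth_test = entry["depth_train"] * ratio
    return center_vote, depth_test

# A test patch matched to codebook entry 0, observed at twice the
# training scale, i.e. at half the training depth.
center, depth = cast_vote([100.0, 50.0], 32.0, codebook[0])
print(center, depth)  # depth = 2.0 * 16/32 = 1.0
```

In a full detector, many such votes would be accumulated in a Hough space and local maxima taken as detection hypotheses; the decoded depths of the patches supporting a maximum then yield the sparse 3D structure described above.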
We summarize the inferred quantities in Fig. 3 and the required supervision in Fig. 4. Notice that the inferred object 3D structure from stage one is partial (it does not account for the portions of the object that are not visible from the query image) and sparse (it only recovers depth for each voting patch). The goal of the second stage is to obtain a full 3D object model where both 3D structure and albedo properties (texture) are also recovered.
In the second stage, the information inferred from stage one (object location in the image, scale, pose, and rough 3D structure) is used to obtain a full 3D model of the object. Specifically, we consider a 3D modelling stage where a full 3D model of the object is obtained by 3D shape recovery and texture completion (Section 3.2). We carry out 3D shape recovery (i.e., infer the shape of the unseen regions) by: (i) utilizing 3D shape exemplars from a database of 3D CAD models, which can be collected from [11] and other online 3D warehouses, or obtained by shape from silhouette [12], and (ii) applying a novel 2D + 3D iterative closest point (ICP) matching algorithm which jointly registers the best 3D CAD model to the inferred 3D shape and the occlusion boundaries of the back-projected 3D CAD model to object contours in the image. By choosing the best fit, our system obtains a plausible full reconstruction of the object's 3D shape (Section 3.3) (Fig. 1d). Object appearance is rendered by texture mapping the object image onto the 3D shape. Such texture is clearly incomplete, as non-visible object surface areas cannot be texture mapped (Fig. 1e). Thus, we perform texture completion by: (i) transferring texture to such non-visible object surface areas by taking advantage of the fact that some object categories are symmetric (when possible) (Fig. 1f) and (ii) using an error-tolerant image compositing technique inspired by [6] to fill the un-textured regions (i.e., holes) (Section 3.4) (Fig. 1g). We summarize the required supervision in Fig. 4.
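The registration step can be sketched with a minimal point-to-point rigid ICP. This is a simplification under our own assumptions: the paper's 2D + 3D variant additionally matches the occlusion boundaries of the back-projected CAD model to image contours, which is omitted here, and all names and data are illustrative.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (Kabsch algorithm via SVD), given one-to-one correspondences."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def icp(model, scene, n_iter=20):
    """Iteratively register `model` points (e.g., a CAD exemplar) to
    `scene` points (e.g., the sparse 3D structure inferred in stage one)."""
    cur = model.copy()
    for _ in range(n_iter):
        # Brute-force nearest-neighbour correspondences, for clarity.
        d2 = ((cur[:, None, :] - scene[None, :, :]) ** 2).sum(axis=-1)
        matched = scene[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur

# Synthetic check: the scene is the model under a small known rotation
# about z plus a translation; ICP should recover the alignment.
rng = np.random.default_rng(0)
model = rng.normal(size=(50, 3))
a = 0.05  # radians
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
scene = model @ Rz.T + np.array([0.02, -0.03, 0.05])
aligned = icp(model, scene)
```

In a full system, the residual of the converged fit would be one term in the score used to choose the best CAD exemplar among the candidates.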
Extensive experimental analysis on a number of public datasets (including car Pascal VOC07 [2], mug ETHZ Shape [3], and mouse and stapler 3D object dataset [13]) and two in-house datasets (comprising at most five object categories), where ground truth 3D information is available, is used to validate our claims (Section 4). Experiments with the in-house datasets demonstrate that our DEHV scheme: (i) achieves better detection rates (compared to the traditional Hough voting scheme), with further improvement observed when depth maps are available in testing; (ii) produces convincing 3D reconstructions from single images, whose accuracy has been qualitatively assessed with respect to ground truth depth maps; (iii) achieves accurate 3D shape recovery and visually pleasing texture completion results. Experiments with the public datasets demonstrate that DEHV successfully scales to different types of categories and works in challenging conditions (severe background clutter, occlusions). DEHV achieves state-of-the-art detection results on several categories in the ETHZ Shape dataset [3], and competitive pose estimation results on the 3D object dataset [13]. We also evaluate the accuracy of shape completion and the quality of texture completion on the 3D modelling dataset (Section 3.2). Finally, we show typical results demonstrating that DEHV is capable of producing convincing 3D reconstructions from single uncalibrated images using the Pascal VOC07 [2], ETHZ Shape [3], and 3D object [13] datasets in Figs. 15 and 19.
Previous work
In the last decade, the vision community has made substantial progress addressing the problem of object categorization from 2D images. While most of the work has focussed on representing objects as 2D models [14], [1], [15] or collections of 2D models [16], very few methods have tried to combine in a principled way the appearance information that is captured by images and the intrinsic 3D structure of an object category. Works by [17], [13], [4] have proposed solutions for modelling the way how …
Our method
To summarize, our method can be roughly decomposed into a recognition/reconstruction stage and a 3D modelling stage.
In the recognition/reconstruction stage, Depth-Encoded Hough Voting (DEHV) detectors [64], trained with both object 3D shape and local diagnostic appearance information, identify object locations and classes, and recover approximate and partial 3D structure information from a single query image (Section 3.1) (Fig. 1(a–c)).
Because we obtain only a partial reconstruction (object …
Experiment
We conduct experiments to evaluate the object detection and shape recovery performance of our DEHV algorithm in Section 4.1, and the quality of 3D modelling in terms of both shape recovery and texture completion in Section 4.2. Typical failure cases of the object detector and the 3D ICP are shown in Fig. 18(a) and (b), respectively.
Conclusion
We proposed a new detection scheme called DEHV which can successfully detect objects and estimate their pose from either a single 2D image or a 2D image combined with depth information. Moreover, we demonstrated that DEHV is capable of recovering the 3D shape of object categories from just a single uncalibrated image. Given such a partial 3D shape of the object, we showed that novel 3D shape recovery and texture completion techniques can be applied to fully reconstruct the 3D model of the object.
Acknowledgments
We acknowledge the support of the NSF (Grant CNS 0931474); the Gigascale Systems Research Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity; a Google Research Award (SC347174); and Willow Garage Inc. for collecting the 3D table-top object category dataset.
References (74)
- B. Leibe, A. Leonardis, B. Schiele, Combined object categorization and segmentation with an implicit shape model, in:...
- M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007...
- et al., Groups of adjacent contour segments for object detection, IEEE Trans. PAMI (2008)
- S. Savarese, L. Fei-Fei, View synthesis for recognizing unseen poses of object classes, in: ECCV,...
- Microsoft Corp. Redmond WA, Kinect for Xbox...
- M.W. Tao, M.K. Johnson, S. Paris, Error-tolerant image compositing, in: ECCV,...
- D.H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern...
- J. Gall, V. Lempitsky, Class-specific Hough forests for object detection, in: CVPR,...
- S. Maji, J. Malik, Object detection using a max-margin Hough transform, in: CVPR,...
- B. Ommer, J. Malik, Multi-scale object detection by clustering lines, in: ICCV,...
- The visual hull concept for silhouette-based image understanding, IEEE Trans. PAMI
- Using multi-view recognition and meta-data annotation to guide a robot's attention, Int. J. Rob. Res.
- Recognizing solid objects by alignment with an image, IJCV
- Representations and Techniques for 3D Object Recognition and Scene Interpretation
- Visual modeling with a hand-held camera, IJCV
- Modelling and interpretation of architecture from several images, IJCV
- Calibrated, registered images of an extended urban area, IJCV
☆ This paper has been recommended for acceptance by Carlo Colombo.