
3DNN: 3D Nearest Neighbor

Data-Driven Geometric Scene Understanding Using 3D Models

  • Published in: International Journal of Computer Vision

Abstract

In this paper, we describe a data-driven approach to leverage repositories of 3D models for scene understanding. Our ability to relate what we see in an image to a large collection of 3D models allows us to transfer information from these models, creating a rich understanding of the scene. We develop a framework for auto-calibrating a camera, rendering 3D models from the viewpoint from which an image was taken, and computing a similarity measure between each 3D model and an input image. We demonstrate this data-driven approach in the context of geometry estimation and show the ability to find the identities, poses, and styles of objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need for training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and to recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation, as well as two novel applications: affordance estimation and photorealistic object insertion.
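At a high level, the pipeline in the abstract amounts to a nearest-neighbor search over renderings of 3D models from the input image's estimated viewpoint. The sketch below is a toy illustration only: the pre-rendered binary masks and the mask-IoU score are hypothetical stand-ins for the paper's camera auto-calibration, rendering, and geometric similarity measure.

```python
import numpy as np

def mask_iou(a, b):
    # Intersection-over-union between two binary masks; a stand-in for
    # the similarity between a rendered 3D model and an input image.
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def nearest_3d_model(image_mask, rendered_masks):
    """Return (index, score) of the rendering most similar to the image.

    `rendered_masks` stands in for 3D scene models already rendered from
    the auto-calibrated viewpoint of the input image. Because each model
    is re-rendered per image, one 3D model can match images taken from
    any viewpoint -- the key advantage over 2D nearest neighbors.
    """
    scores = [mask_iou(image_mask, m) for m in rendered_masks]
    best = int(np.argmax(scores))
    return best, scores[best]

# Toy usage: three candidate "renderings" scored against one image mask.
img = np.zeros((10, 10), dtype=bool)
img[2:8, 2:8] = True
candidates = [np.zeros((10, 10), dtype=bool), img.copy(), ~img]
idx, score = nearest_3d_model(img, candidates)
```

In the actual system the comparison is over rendered scene geometry (e.g., surfaces and depth), not binary masks, but the search structure is the same.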




Notes

  1. Training was performed using 10-fold cross validation on a subset of the SUN Database (Xiao et al. 2010), for which there exist LabelMe annotations (Torralba et al. 2010).

  2. GIST: \(4\times 4\) blocks, 8 orientations (code from Oliva and Torralba (2006)). HoG: \(20 \times 20\) blocks, 8 orientations (code from Vondrick et al. (2013)).

  3. Note: In Gupta et al. (2011), the proposed approach outperformed the appearance baseline. This is due to a selection bias in creating the dataset for that paper: only images with accurate auto-calibration estimates from Hedau et al. (2009) were used. For these experiments, however, an unbiased sample of images was selected.

  4. Note: The object insertion renderings included in this section were created by Kevin Karsch at the University of Illinois at Urbana-Champaign during a collaboration with the author.

    Fig. 30 Examples of synthetically inserted objects. Note the accurate occlusion and depth ordering of the dresser, and the realistic reflections. (a) Input image, (b) synthetically inserted objects.
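The blocked orientation features described in note 2 can be sketched concretely. The function below is a simplified stand-in for the cited GIST and HoG implementations (no Gabor filtering, cell normalization, or Gaussian weighting); the block grid and orientation-bin counts mirror the parameters quoted in the note.

```python
import numpy as np

def block_orientation_histogram(img, blocks=20, n_orient=8):
    # HOG-style descriptor: per-block histograms of gradient orientation,
    # weighted by gradient magnitude. `img` is a 2D grayscale array.
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation quantized into n_orient bins over [0, pi).
    ori = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((ori / np.pi * n_orient).astype(int), n_orient - 1)
    h, w = img.shape
    desc = np.zeros((blocks, blocks, n_orient))
    ys = np.linspace(0, h, blocks + 1).astype(int)
    xs = np.linspace(0, w, blocks + 1).astype(int)
    for i in range(blocks):
        for j in range(blocks):
            b = bins[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            for k in range(n_orient):
                desc[i, j, k] = m[b == k].sum()
    return desc.ravel()

# Toy usage: a 4x4 grid of blocks with 8 orientations yields a
# 4 * 4 * 8 = 128-dimensional descriptor.
rng = np.random.default_rng(0)
demo = rng.random((40, 40))
feat = block_orientation_histogram(demo, blocks=4, n_orient=8)
```

With `blocks=20`, the descriptor length is 20 x 20 x 8 = 3200, matching the HoG configuration quoted in note 2.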

References

  • Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(5), 898–916.

  • Baatz, G., Saurer, O., Köser, K., & Pollefeys, M. (2012). Large scale visual geo-localization of images in mountainous terrain. Proceedings of the European Conference on Computer Vision (ECCV).

  • Baboud, L., Cadik, M., Eisemann, E., & Seidel, H.-P. (2011). Automatic photo-to-terrain alignment for the annotation of mountain pictures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Bao, S. Y., Sun, M., & Savarese, S. (2010). Toward coherent object detection and scene layout understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Brooks, R. (1981). Symbolic reasoning among 3D models and 2D images. Artificial Intelligence, Vol. 17.

  • Choi, W., Pantofaru, C., & Savarese, S. (2013). Understanding indoor scenes using 3D geometric phrases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Colburn, A., Agarwala, A., Hertzmann, A., Curless, B., & Cohen, M. F. (2013). Image-based remodeling. IEEE Transactions on Visualization and Computer Graphics, 19(1), 56–66.


  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Debevec, P. (1998). Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. Proceedings of the 25th annual conference on Computer graphics and interactive techniques, SIGGRAPH ’98.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.


  • Fisher, M., & Hanrahan, P. (2010). Context-based search for 3D models. ACM Transactions on Graphics, 29, 182:1–182:10.


  • Fisher, M., Savva, M., & Hanrahan, P. (2011). Characterizing structural relationships in scenes using graph Kernels. ACM Transactions on Graphics, 30, 34:1–34:12.


  • Fouhey, D., Gupta, A., & Hebert, M. (2013). Data-driven 3D primitives for single image understanding. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Fouhey, D.F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., & Sivic, J. (2012). People watching: Human actions as a cue for single-view geometry. Proceedings of the European Conference on Computer Vision (ECCV).

  • Girshick, R.B., Felzenszwalb, P.F., & McAllester, D. (2012). Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/rbg/latent-release5/.

  • Google Inc. (2000). Google SketchUp. http://sketchup.google.com/.

  • Grabner, H., Gall, J., & Gool, L.J.V. (2011). What makes a chair a chair? Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1529–1536.

  • Grimson, W., Lozano-Pérez, T., & Huttenlocher, D. (1990). Object recognition by computer. Cambridge: MIT Press.


  • Grosse, R., Johnson, M. K., Adelson, E. H., & Freeman, W. T. (2009). Ground truth dataset and baseline evaluations for intrinsic image algorithms. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2335–2342.

  • Guo, R., Dai, Q., & Hoiem, D. (2011). Single-image shadow detection and removal using paired regions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Gupta, A., Efros, A. A., & Hebert, M. (2010). Blocks world revisited: Image understanding using qualitative geometry and mechanics. Proceedings of the European Conference on Computer Vision (ECCV).

  • Gupta, A., Satkin, S., Efros, A. A., & Hebert, M. (2011). From 3D scene geometry to human workspace. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Hays, J., & Efros, A. A. (2007). Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007). doi:10.1145/1275808.1276382.

  • Hays, J., & Efros, A. A. (2008). im2gps: estimating geographic information from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking inside the box: Using appearance models and context based on room geometry. Proceedings of the European Conference on Computer Vision (ECCV).

  • Hedau, V., Hoiem, D., & Forsyth, D. (2012). Recovering free space of indoor scenes from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Herbrich, R., Graepel, T., & Obermayer, K. (1999). Support vector learning for ordinal regression. Proceedings of the International Conference on Artificial Neural Networks (ICANN), 1, 97–102.


  • Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision (IJCV), 75(1), 151–172.


  • Hoiem, D., Efros, A. A., & Hebert, M. (2008). Closing the loop on scene interpretation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • IKEA (2013). IKEA Catalog.

  • Karsch, K., Hedau, V., Forsyth, D., & Hoiem, D. (2011). Rendering synthetic objects into legacy photographs. Proceedings of the 2011 SIGGRAPH Asia Conference. ACM.

  • Karsch, K., Sunkavalli, K., Hadap, S., Carr, N., Jin, H., Fonte, R., et al. (2014). Automatic scene inference for 3D object compositing. ACM Transactions on Graphics (to appear).

  • Lai, K., & Fox, D. (2009). 3D laser scan classification using web data and domain adaptation. Proceedings of Robotics: Science and Systems, Seattle, USA.

  • Lalonde, J.-F., Hoiem, D., Efros, A. A., Rother, C., Winn, J., & Criminisi, A. (2007). Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 3.


  • Lee, D., Gupta, A., Hebert, M., & Kanade, T. (2010). Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. Proceedings of the Conference on Neural Information Processing Systems (NIPS).

  • Lee, D., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Levin, A., Rav-Acha, A., & Lischinski, D. (2008). Spectral matting. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(10), 1699–1712.


  • Li, L.-J., Su, H., Xing, E. P., & Fei-Fei, L. (2010). Object bank: A high-level image representation for scene classification & semantic feature sparsification. Neural Information Processing Systems (NIPS), 2(3), 5.


  • Lim, J. J., Pirsiavash, H., & Torralba, A. (2013). Parsing IKEA objects: Fine pose estimation. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Liu, C., Yuen, J., Torralba, A., Sivic, J., & Freeman, W. T. (2008). SIFT flow: Dense correspondence across different scenes. Proceedings of the European Conference on Computer Vision (ECCV).

  • Liu, M.-Y., Tuzel, O., Veeraraghavan, A., & Chellappa, R. (2010). Fast directional chamfer matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Longuet-Higgins, H. C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, 293, 133–135.

  • Lowe, D. (1987). Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31(3), 355–395.


  • Microsoft Corporation (2010). Kinect for Xbox 360.

  • Munoz, D., Bagnell, J. A., & Hebert, M. (2010). Stacked hierarchical labeling. Proceedings of the European Conference on Computer Vision (ECCV).

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42, 145–175.


  • Oliva, A., & Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research, 155, 23–36.


  • Pero, L. D., Bowdish, J., Fried, D., Kermgard, B., Hartley, E. & Barnard, K. (2012). Bayesian geometric modeling of indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Pero, L. D., Bowdish, J., Hartley, E., Kermgard, B., & Barnard, K. (2013). Understanding Bayesian rooms using composite 3D object models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Pero, L. D., Guan, J., Brau, E., Schlecht, J., & Barnard, K. (2011). Sampling bedrooms. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Polantis (2010). Polantis 3D Catalog. http://www.polantis.com/ikea.

  • Ramalingam, S., Bouaziz, S., Sturm, P., & Brand, M. (2010). SKYLINE2GPS: Localization in Urban Canyons using Omni-Skylines. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

  • Satkin, S. (2013). Data-Driven Geometric Scene Understanding. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, August.

  • Satkin, S., & Hebert, M. (2013). 3DNN: Viewpoint invariant 3D geometry matching for scene understanding. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Satkin, S., Lin, J., & Hebert, M. (2012). CMU 3D-Annotated Scene Database. http://cmu.satkin.com/bmvc2012/.

  • Satkin, S., Lin, J., & Hebert, M. (2012). Data-driven scene understanding from 3D models. Proceedings of the 23rd British Machine Vision Conference (BMVC).

  • Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(5), 824–840.


  • Schwing, A. G., & Urtasun, R. (2012). Efficient Exact Inference for 3D Indoor Scene Understanding. Proceedings of the European Conference on Computer Vision (ECCV).

  • Shao, T., Xu, W., Zhou, K., Wang, J., Li, D., & Guo, B. (2012). An interactive approach to semantic modeling of indoor scenes with an RGBD camera. ACM Transactions on Graphics, 31(6), 136:1–136:11.


  • Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. Proceedings of the European Conference on Computer Vision (ECCV).

  • Singh, G., & Košecká, J. (2013). Nonparametric scene parsing with adaptive feature relevance and semantic context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. Proceedings of the European Conference on Computer Vision (ECCV).

  • Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(11), 1958–1970.

  • Torralba, A., Russell, B. C., & Yuen, J. (2010). LabelMe: Online image annotation and applications. Proceedings of the IEEE, 98(8), 1467–1484.


  • Trimble Inc. (2012). Trimble 3D Warehouse. http://sketchup.google.com/3dwarehouse.

  • Vondrick, C., Khosla, A., & Malisiewicz, T., & Torralba, A. (2013). Inverting and visualizing features for object detection. MIT Technical Report.

  • Wang, H., Gould, S., & Koller, D. (2010). Discriminative learning with latent variables for cluttered indoor scene understanding. Proceedings of the European Conference on Computer Vision (ECCV).

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yu, S., Zhang, H., & Malik, J. (2008). Inferring spatial layout from a single image via depth-ordered grouping. Proceedings of the IEEE Workshop on Perceptual Organization in Computer Vision (CVPR-POCV).

  • Zhao, Y., & Zhu, S.-C. (2013). Scene parsing by integrating function, geometry and appearance models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).


Acknowledgments

This work was supported in part by ONR MURI N000141010934.

Author information

Correspondence to Scott Satkin.

Additional information

Communicated by Cordelia Schmid.


About this article


Cite this article

Satkin, S., Rashid, M., Lin, J. et al. 3DNN: 3D Nearest Neighbor. Int J Comput Vis 111, 69–97 (2015). https://doi.org/10.1007/s11263-014-0734-4
