

When and how convolutional neural networks generalize to out-of-distribution category–viewpoint combinations

A preprint version of the article is available at arXiv.

Abstract

Object recognition and viewpoint estimation lie at the heart of visual understanding. Recent studies have suggested that convolutional neural networks (CNNs) fail to generalize to out-of-distribution (OOD) category–viewpoint combinations, that is, combinations not seen during training. Here we investigate when and how such OOD generalization may be possible by evaluating CNNs trained to classify both object category and three-dimensional viewpoint on OOD combinations, and identifying the neural mechanisms that facilitate such OOD generalization. We show that increasing the number of in-distribution combinations (data diversity) substantially improves generalization to OOD combinations, even with the same amount of training data. We compare learning category and viewpoint in separate and shared network architectures, and observe starkly different trends on in-distribution and OOD combinations: while shared networks are helpful in-distribution, separate networks significantly outperform shared ones on OOD combinations. Finally, we demonstrate that such OOD generalization is facilitated by the neural mechanism of specialization, that is, the emergence of two types of neuron: neurons selective to category and invariant to viewpoint, and vice versa.
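The two architecture families compared in the abstract can be illustrated with a short sketch. The code below is a minimal PyTorch example, not the authors' released implementation (see the Code availability section for that): a "Shared" model with one ResNet-18 backbone feeding two task heads, and a "Separate" model with an independent ResNet-18 per task. The class counts, layer handling and use of torchvision's ResNet-18 are illustrative assumptions, and a recent torchvision (0.13 or later) is assumed for the `weights=None` argument.

```python
# Illustrative sketch of "Shared" vs "Separate" multi-task CNNs for
# category recognition and viewpoint estimation (assumptions, not the
# authors' released code).
import torch
import torch.nn as nn
from torchvision.models import resnet18


class SharedBackbone(nn.Module):
    """One ResNet-18 trunk feeding two linear heads (category + viewpoint)."""

    def __init__(self, num_categories: int = 10, num_viewpoints: int = 10):
        super().__init__()
        trunk = resnet18(weights=None)
        feat_dim = trunk.fc.in_features
        trunk.fc = nn.Identity()  # drop the ImageNet classifier layer
        self.trunk = trunk
        self.category_head = nn.Linear(feat_dim, num_categories)
        self.viewpoint_head = nn.Linear(feat_dim, num_viewpoints)

    def forward(self, x):
        z = self.trunk(x)  # features shared by both tasks
        return self.category_head(z), self.viewpoint_head(z)


class SeparateBackbones(nn.Module):
    """Two independent ResNet-18s, one per task (no weight sharing)."""

    def __init__(self, num_categories: int = 10, num_viewpoints: int = 10):
        super().__init__()
        self.category_net = resnet18(weights=None, num_classes=num_categories)
        self.viewpoint_net = resnet18(weights=None, num_classes=num_viewpoints)

    def forward(self, x):
        return self.category_net(x), self.viewpoint_net(x)


if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    cat_logits, view_logits = SharedBackbone()(x)
    print(cat_logits.shape, view_logits.shape)  # torch.Size([2, 10]) twice
```

In the paper's terms, the Shared variant forces the two tasks to use a common feature representation, whereas the Separate variant removes all weight sharing between them; the abstract reports that the former helps in-distribution while the latter generalizes better to OOD category–viewpoint combinations.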


Fig. 1: Category–viewpoint datasets.
Fig. 2: Architectures for category recognition and viewpoint estimation.
Fig. 3: Generalization performance for Shared and Separate ResNet-18 as in-distribution combinations are increased for all datasets.
Fig. 4: Generalization performance for different architectures and backbones as in-distribution combinations are increased for iLab and Biased-Cars datasets.
Fig. 5: Specialization to category recognition and viewpoint estimation.
Fig. 6: Neuron specialization (selectivity to category and invariance to viewpoint, and vice versa) in the Biased-Cars dataset.
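Figures 5 and 6 refer to neuron specialization, characterized through selectivity and invariance. As a rough illustration only (the exact metrics are defined in the paper and its supplementary material, not reproduced here), the sketch below scores a single neuron from its mean activations on a category × viewpoint grid: the share of activation variance explained by category alone versus by viewpoint alone. The function name, normalization and toy data are assumptions made for this example.

```python
# Illustrative specialization scores for one neuron (an assumption, not the
# paper's exact metric): high variance across categories with low variance
# across viewpoints suggests a category-selective, viewpoint-invariant neuron,
# and vice versa.
import numpy as np


def specialization_scores(act: np.ndarray) -> tuple[float, float]:
    """act[c, v] = mean activation of one neuron for category c at viewpoint v.

    Returns (category_selectivity, viewpoint_selectivity), each in [0, 1]:
    the fraction of the neuron's variance explained by one factor alone.
    """
    total_var = act.var() + 1e-12
    cat_var = act.mean(axis=1).var()    # variance across categories, viewpoints averaged out
    view_var = act.mean(axis=0).var()   # variance across viewpoints, categories averaged out
    return cat_var / total_var, view_var / total_var


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy neuron that mostly codes category (rows) and largely ignores viewpoint (columns).
    acts = np.outer(rng.random(5), np.ones(8)) + 0.05 * rng.random((5, 8))
    cat_score, view_score = specialization_scores(acts)
    print(f"category selectivity: {cat_score:.2f}, viewpoint selectivity: {view_score:.2f}")
```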


Data availability

To access and cite the Biased-Cars dataset, please visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/F1NQ3R&faces-redirect=true.

Code availability

Source code and demos are available on GitHub at https://github.com/Spandan-Madan/generalization_to_OOD_category_viewpoint_combinations.


Acknowledgements

We are grateful to T. Poggio and P. Sinha for their insightful advice and warm encouragement. This work has been partially supported by NSF grant IIS-1901030, a Google Faculty Research Award, the Toyota Research Institute, the Center for Brains, Minds and Machines (funded by NSF STC award CCF-1231216), Fujitsu Laboratories (contract no. 40008819) and the MIT-Sensetime Alliance on Artificial Intelligence. We also thank K. Gupta for help with the figures, and P. Sharma for insightful discussions.

Author information


Contributions

S.M., T.H., J.D. and X.B. conceived, designed and implemented the experiments and carried out the analysis, with contributions from T.S., F.D. and H.P.; S.M., H.H., N.B. and F.D. designed and implemented the Biased-Cars dataset; S.M., T.S. and X.B. wrote the manuscript with contributions from F.D. and H.P.; T.S., F.D., H.P. and X.B. supervised the study.

Corresponding authors

Correspondence to Spandan Madan or Xavier Boix.

Ethics declarations

Competing interests

This study received funding from Fujitsu Laboratories. The funder, through T.S., was involved in the conception of the experiments, the writing of this article and the supervision of the study. All other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary results, methods and discussions.


About this article


Cite this article

Madan, S., Henry, T., Dozier, J. et al. When and how convolutional neural networks generalize to out-of-distribution category–viewpoint combinations. Nat Mach Intell 4, 146–153 (2022). https://doi.org/10.1038/s42256-021-00437-5


