Abstract
Object recognition and viewpoint estimation lie at the heart of visual understanding. Recent studies have suggested that convolutional neural networks (CNNs) fail to generalize to out-of-distribution (OOD) category–viewpoint combinations, that is, combinations not seen during training. Here we investigate when and how such OOD generalization may be possible by evaluating CNNs trained to classify both object category and three-dimensional viewpoint on OOD combinations, and identifying the neural mechanisms that facilitate such OOD generalization. We show that increasing the number of in-distribution combinations (data diversity) substantially improves generalization to OOD combinations, even with the same amount of training data. We compare learning category and viewpoint in separate and shared network architectures, and observe starkly different trends on in-distribution and OOD combinations: while shared networks help on in-distribution combinations, separate networks significantly outperform shared ones on OOD combinations. Finally, we demonstrate that such OOD generalization is facilitated by the neural mechanism of specialization, that is, the emergence of two types of neuron: neurons selective to category and invariant to viewpoint, and neurons selective to viewpoint and invariant to category.
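To make the shared-versus-separate comparison concrete, the snippet below is a minimal sketch of the two designs in PyTorch. It is not the authors' released implementation (see Code availability for that); the ResNet-18 backbone, the class names and the numbers of categories and viewpoint bins are illustrative assumptions.

```python
# Minimal sketch (PyTorch, torchvision >= 0.13) of the two architectures
# compared in the paper: a "shared" network with one convolutional trunk and
# two task heads, versus two fully "separate" networks, one per task.
# Category/viewpoint counts are placeholders, not the paper's exact settings.
import torch
import torch.nn as nn
from torchvision import models

NUM_CATEGORIES = 10   # placeholder number of object categories
NUM_VIEWPOINTS = 36   # placeholder number of discretized viewpoint bins


class SharedNet(nn.Module):
    """One shared backbone with a separate linear head per task."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()  # keep all layers except the classifier
        self.backbone = backbone
        self.category_head = nn.Linear(feat_dim, NUM_CATEGORIES)
        self.viewpoint_head = nn.Linear(feat_dim, NUM_VIEWPOINTS)

    def forward(self, x):
        features = self.backbone(x)
        return self.category_head(features), self.viewpoint_head(features)


class SeparateNets(nn.Module):
    """Two independent networks, one for each task."""

    def __init__(self):
        super().__init__()
        self.category_net = models.resnet18(weights=None, num_classes=NUM_CATEGORIES)
        self.viewpoint_net = models.resnet18(weights=None, num_classes=NUM_VIEWPOINTS)

    def forward(self, x):
        return self.category_net(x), self.viewpoint_net(x)


if __name__ == "__main__":
    images = torch.randn(4, 3, 224, 224)
    cat_labels = torch.randint(0, NUM_CATEGORIES, (4,))
    view_labels = torch.randint(0, NUM_VIEWPOINTS, (4,))
    criterion = nn.CrossEntropyLoss()

    for model in (SharedNet(), SeparateNets()):
        cat_logits, view_logits = model(images)
        # Both tasks are trained jointly by summing the per-task losses.
        loss = criterion(cat_logits, cat_labels) + criterion(view_logits, view_labels)
        print(type(model).__name__, float(loss))
```

Under the trends reported above, the shared design would be expected to do better on in-distribution category–viewpoint combinations, while the fully separate design would be expected to generalize better to OOD combinations.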
Data availability
To access and cite the Biased-Cars dataset, please visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/F1NQ3R&faces-redirect=true.
Code availability
Source code and demos are available on GitHub at https://github.com/Spandan-Madan/generalization_to_OOD_category_viewpoint_combinations.
Acknowledgements
We are grateful to T. Poggio and P. Sinha for their insightful advice and warm encouragement. This work has been partially supported by NSF grant IIS-1901030, a Google Faculty Research Award, the Toyota Research Institute, the Center for Brains, Minds and Machines (funded by NSF STC award CCF-1231216), Fujitsu Laboratories (contract no. 40008819) and the MIT-Sensetime Alliance on Artificial Intelligence. We also thank K. Gupta for help with the figures, and P. Sharma for insightful discussions.
Author information
Contributions
S.M., T.H., J.D. and X.B. conceived, designed and implemented the experiments and carried out the analysis, with contributions from T.S., F.D. and H.P.; S.M., H.H., N.B. and F.D. designed and implemented the Biased-Cars dataset; S.M., T.S. and X.B. wrote the manuscript with contributions from F.D. and H.P.; T.S., F.D., H.P. and X.B. supervised the study.
Ethics declarations
Competing interests
This study received funding from Fujitsu Laboratories. The funder, through T.S., was involved in the conception of the experiments, the writing of this article and the supervision of the study. All other authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary results, methods and discussions.
About this article
Cite this article
Madan, S., Henry, T., Dozier, J. et al. When and how convolutional neural networks generalize to out-of-distribution category–viewpoint combinations. Nat Mach Intell 4, 146–153 (2022). https://doi.org/10.1038/s42256-021-00437-5
DOI: https://doi.org/10.1038/s42256-021-00437-5