Abstract
Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, despite their good object localization and segmentation capabilities, their object encodings may still be suboptimal for downstream reasoning tasks, such as reinforcement learning. To overcome this, we propose to exploit object motion and continuity (objects do not pop in and out of existence). This is accomplished through two mechanisms: (i) providing temporal loss-based priors on object locations, and (ii) a contrastive object continuity loss across consecutive frames. Rather than developing an explicit deep architecture, the resulting unsupervised Motion and Object Continuity (MOC) training scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performance of variational and slot-based models in terms of object discovery, convergence speed and overall latent object representations, particularly for playing Atari games. Overall, we show clear benefits of integrating motion and object continuity for downstream reasoning tasks, moving beyond object representation learning based only on reconstruction, as well as evaluation based only on instance segmentation quality.
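The second mechanism can be illustrated with a small sketch. Below is a minimal, hypothetical NumPy version of a contrastive object continuity loss of the kind described above: object encodings matched across two consecutive frames act as positive pairs, while all other objects in the next frame serve as negatives (InfoNCE-style). The function name, the `matches` input, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def object_continuity_loss(z_t, z_t1, matches, temperature=0.1):
    """Contrastive loss pulling matched object encodings from
    consecutive frames together and pushing unmatched ones apart.

    z_t, z_t1: (N, D) arrays of object encodings at frames t and t+1.
    matches:   matches[i] = j means object i at frame t corresponds to
               object j at frame t+1 (a hypothetical matching, e.g.
               obtained from overlap of predicted bounding boxes).
    """
    # Normalize encodings so the dot product is a cosine similarity.
    a = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    b = z_t1 / np.linalg.norm(z_t1, axis=1, keepdims=True)
    sims = (a @ b.T) / temperature            # (N, N) similarity matrix

    # InfoNCE: each object's positive is its match at t+1; all other
    # objects in frame t+1 act as negatives.
    logits = sims - sims.max(axis=1, keepdims=True)   # numeric stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean([log_probs[i, j] for i, j in enumerate(matches)])
```

When matched objects keep (nearly) identical encodings across frames, the loss is close to zero; if encodings drift arbitrarily between frames, the loss grows, which is the continuity pressure the training scheme applies on top of the reconstruction objective.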
Acknowledgements
The authors thank the anonymous reviewers for their valuable feedback. This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) within their joint support of the National Research Center for Applied Cybersecurity ATHENE, via the “SenPai: XReLeaS” project. It also benefited from the HMWK cluster projects “The Third Wave of AI” and “The Adaptive Mind” as well as the Hessian research priority program LOEWE within the project “WhiteBox”.
Ethics declarations
Ethical Statement
Our work aims to improve the object representations of object discovery models, specifically targeting their use by downstream reasoning modules. With the improvements of our training scheme, it becomes feasible to integrate unsupervised object discovery methods into practical use cases. A main motivation, as stated in our introduction, is that integrating high-quality object-centric representations is beneficial for more human-centric AI. Humans perceive, communicate and explain the world at the level of objects; equipping AI agents with this level of abstraction and representation is a necessary step towards fruitful and reliable human-AI interaction.
Obviously, our work is not unaffected by the dual-use dilemma of foundational (AI) research, and a watchful eye should be kept on object detection research in particular, which can easily be misused, e.g. for involuntary human surveillance. However, to the best of our knowledge, our work and its implications do not pose an obvious direct threat to any individual or to society in general.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Delfosse, Q., Stammer, W., Rothenbächer, T., Vittal, D., Kersting, K. (2023). Boosting Object Representation Learning via Motion and Object Continuity. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_36
DOI: https://doi.org/10.1007/978-3-031-43421-1_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science, Computer Science (R0)