Abstract
Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, despite their good object localization and segmentation capabilities, their object encodings may still be suboptimal for downstream reasoning tasks, such as reinforcement learning. To overcome this, we propose to exploit object motion and continuity (objects do not pop in and out of existence). This is accomplished through two mechanisms: (i) providing temporal loss-based priors on object locations, and (ii) a contrastive object continuity loss across consecutive frames. Rather than developing an explicit deep architecture, the resulting unsupervised Motion and Object Continuity (MOC) training scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performance of variational and slot-based models in terms of object discovery, convergence speed and overall latent object representations, particularly for playing Atari games. Overall, we show clear benefits of integrating motion and object continuity for downstream reasoning tasks, moving beyond object representation learning based only on reconstruction, as well as evaluation based only on instance segmentation quality.
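The second mechanism can be illustrated with a small sketch. Below is a minimal, hypothetical NumPy version of a contrastive object continuity loss of the kind described above: object encodings matched across two consecutive frames act as positive pairs, while all other objects in the next frame serve as negatives (InfoNCE-style). The function name, the `matches` input, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def object_continuity_loss(z_t, z_t1, matches, temperature=0.1):
    """Contrastive loss pulling matched object encodings from
    consecutive frames together and pushing unmatched ones apart.

    z_t, z_t1: (N, D) arrays of object encodings at frames t and t+1.
    matches:   matches[i] = j means object i at frame t corresponds to
               object j at frame t+1 (a hypothetical matching, e.g.
               obtained from overlap of predicted bounding boxes).
    """
    # Normalize encodings so the dot product is a cosine similarity.
    a = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    b = z_t1 / np.linalg.norm(z_t1, axis=1, keepdims=True)
    sims = (a @ b.T) / temperature            # (N, N) similarity matrix

    # InfoNCE: each object's positive is its match at t+1; all other
    # objects in frame t+1 act as negatives.
    logits = sims - sims.max(axis=1, keepdims=True)   # numeric stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean([log_probs[i, j] for i, j in enumerate(matches)])
```

When matched objects keep (nearly) identical encodings across frames, the loss is close to zero; if encodings drift arbitrarily between frames, the loss grows, which is the continuity pressure the training scheme applies on top of the reconstruction objective.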
Acknowledgements
The authors thank the anonymous reviewers for their valuable feedback. This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) within their joint support of the National Research Center for Applied Cybersecurity ATHENE, via the “SenPai: XReLeaS” project. It also benefited from the HMWK cluster projects “The Third Wave of AI” and “The Adaptive Mind” as well as the Hessian research priority program LOEWE within the project “WhiteBox”.
Ethics declarations
Ethical Statement
Our work aims to improve the object representations of object discovery models, specifically targeting their use by downstream reasoning modules. With the improvements of our training scheme, it becomes feasible to integrate unsupervised object discovery methods into practical use cases. A main motivation, as stated in our introduction, is that integrating high-quality object-centric representations is beneficial for more human-centric AI. Humans perceive, communicate and explain the world at the level of objects; equipping AI agents with this level of abstraction and representation is a necessary step towards fruitful and reliable human-AI interaction.
Obviously, our work is not unaffected by the dual-use dilemma of foundational (AI) research, and a watchful eye should be kept on object detection research in particular, which can easily be misused, e.g. for involuntary human surveillance. However, to the best of our knowledge, our work and its implications do not pose an obvious direct threat to any individual or to society in general.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Delfosse, Q., Stammer, W., Rothenbächer, T., Vittal, D., Kersting, K. (2023). Boosting Object Representation Learning via Motion and Object Continuity. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_36
DOI: https://doi.org/10.1007/978-3-031-43421-1_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science, Computer Science (R0)