Abstract
Current slot-oriented approaches to compositional scene segmentation from images and videos rely on provided background information or slot assignments. We present a segmented location and identity tracking system, Loci-Segmented (Loci-s), which requires neither. It learns to dynamically segment scenes into interpretable background and slot-based object encodings, separating RGB, mask, location, and depth information for each. The results reveal largely superior video decomposition performance on the MOVi datasets and on another established dataset collection targeting scene segmentation. The system’s interpretable, compositional latent encodings may serve as a foundation model for downstream tasks.
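To make the idea of per-slot RGB and mask outputs concrete, the following is a minimal sketch of how slot-based models commonly recombine such outputs into one image: a softmax over mask logits, taken across the object slots and the background, gives per-pixel blending weights. This is a generic illustration, not the Loci-s decoder. The function name `composite`, the argument names, and the array shapes are all assumptions made for the example, and depth and location are left out.

```python
import numpy as np


def composite(slot_rgb, slot_mask_logits, bg_rgb, bg_mask_logits):
    """Blend K object slots and one background into a single image.

    Hypothetical shapes (assumed for this sketch, not taken from the paper):
      slot_rgb:         (K, H, W, 3) per-slot RGB reconstructions
      slot_mask_logits: (K, H, W)    per-slot unnormalised mask logits
      bg_rgb:           (H, W, 3)    background RGB reconstruction
      bg_mask_logits:   (H, W)       background mask logits
    """
    # Treat the background as one extra layer stacked after the K slots.
    rgb = np.concatenate([slot_rgb, bg_rgb[None]], axis=0)
    logits = np.concatenate([slot_mask_logits, bg_mask_logits[None]], axis=0)

    # Numerically stable softmax over the layer axis: at every pixel the
    # K + 1 masks are non-negative and sum to one.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    masks = weights / weights.sum(axis=0, keepdims=True)

    # Per-pixel convex combination of the layer colours.
    image = (masks[..., None] * rgb).sum(axis=0)
    return image, masks


# Usage with random placeholder data (3 slots, 4x5 pixels).
rng = np.random.default_rng(0)
K, H, W = 3, 4, 5
image, masks = composite(
    rng.random((K, H, W, 3)),
    rng.normal(size=(K, H, W)),
    rng.random((H, W, 3)),
    np.zeros((H, W)),
)
```

Because the masks form a partition of unity at each pixel, taking the argmax over the first axis of `masks` yields a hard segmentation map, which is the kind of output that segmentation metrics are typically computed on.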
Acknowledgments
This work received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC number 2064/1 - Project number 390727645 as well as from the Cyber Valley in Tübingen, CyVy-RF-2020-15. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Manuel Traub and Frederic Becker, and the Alexander von Humboldt Foundation for supporting Martin Butz and Sebastian Otte.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Traub, M., Becker, F., Sauter, A., Otte, S., Butz, M.V. (2024). Loci-Segmented: Improving Scene Segmentation Learning. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15018. Springer, Cham. https://doi.org/10.1007/978-3-031-72338-4_4
DOI: https://doi.org/10.1007/978-3-031-72338-4_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72337-7
Online ISBN: 978-3-031-72338-4
eBook Packages: Computer Science, Computer Science (R0)