More Diverse Training, Better Compositionality! Evidence from Multimodal Language Learning

Volquardsen, Caspar; Lee, Jae Hee; Weber, Cornelius; Wermter, Stefan

doi:10.1007/978-3-031-15934-3_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13531)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Artificial neural networks still fall short of human-level generalization and require a very large number of training examples to succeed. Which model architectures can further improve generalization therefore remains an open research question. We created a multimodal dataset from simulation for measuring the compositional generalization of neural networks in multimodal language learning. The dataset consists of sequences showing a robot arm interacting with objects on a table in a simple 3D environment; the task is to describe the interaction. Compositional object features, multiple actions, and distracting objects pose challenges to the model. We show that an LSTM encoder-decoder architecture trained jointly with a vision encoder surpasses previous performance and handles multiple visible objects. Visualization of important input dimensions shows that a model trained with multiple objects, unlike a model trained on just one object, has learned to ignore irrelevant objects. Furthermore, we show that additional modalities in the input improve the overall performance. We conclude that the underlying training data has a significant influence on the model’s capability to generalize compositionally.
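To make the notion of a compositional split concrete, the following is a minimal sketch of how held-out combinations could be separated from training combinations. It is not the authors' data generation code (that is in the repository listed under Notes). The vocabularies, the function name `compositional_split`, and the "diagonal" hold-out rule are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical vocabularies; the paper's actual actions, colors, and
# objects may differ.
ACTIONS = ["push", "pull", "slide"]
COLORS = ["red", "green", "blue"]
OBJECTS = ["apple", "banana", "cup"]


def compositional_split(actions, colors, objects, held_out):
    """Split all (action, color, object) triples into train and test lists.

    Triples in `held_out` go to the test list. The split is rejected if any
    single word would never be seen during training, because the test should
    probe new combinations of known words, not unknown words.
    """
    all_triples = list(product(actions, colors, objects))
    train = [t for t in all_triples if t not in held_out]
    test = [t for t in all_triples if t in held_out]

    for position, vocabulary in enumerate((actions, colors, objects)):
        seen = {triple[position] for triple in train}
        missing = set(vocabulary) - seen
        if missing:
            raise ValueError(f"words never seen in training: {sorted(missing)}")
    return train, test


# Assumed hold-out rule: reserve one "diagonal" of the combination cube,
# i.e. every triple whose vocabulary indices sum to a multiple of three.
HELD_OUT = {
    (action, color, obj)
    for (i, action), (j, color), (k, obj) in product(
        enumerate(ACTIONS), enumerate(COLORS), enumerate(OBJECTS)
    )
    if (i + j + k) % 3 == 0
}

train, test = compositional_split(ACTIONS, COLORS, OBJECTS, HELD_OUT)
print(len(train), "training combinations,", len(test), "held-out combinations")
```

With three values per slot, this rule reserves 9 of the 27 combinations for testing, and every individual word still occurs in training. A model that scores well on the held-out triples must therefore recombine familiar parts instead of recalling whole descriptions.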

The authors acknowledge support from the German Research Foundation DFG under project CML (TRR 169) and from the BMWK under project SiDiMo.

Notes

1. The source code for the model and the data generation can be found at this link: https://github.com/Casparvolquardsen/Compositional-Generalization-in-Multimodal-Language-Learning.

Author information

Authors and Affiliations

Knowledge Technology, Department of Informatics, University of Hamburg, Hamburg, Germany
Caspar Volquardsen, Jae Hee Lee, Cornelius Weber & Stefan Wermter

Corresponding author

Correspondence to Caspar Volquardsen.

Editor information

Editors and Affiliations

University of the West of England, Bristol, UK
Elias Pimenidis
Lancaster University, Lancaster, UK
Plamen Angelov
Digital Innovation, Teesside University, Middlesbrough, UK
Chrisina Jayne
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
The University of the West of England, Bristol, UK
Mehmet Aydin

About this paper

Cite this paper

Volquardsen, C., Lee, J.H., Weber, C., Wermter, S. (2022). More Diverse Training, Better Compositionality! Evidence from Multimodal Language Learning. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13531. Springer, Cham. https://doi.org/10.1007/978-3-031-15934-3_35

DOI: https://doi.org/10.1007/978-3-031-15934-3_35
Published: 15 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15933-6
Online ISBN: 978-3-031-15934-3
eBook Packages: Computer Science, Computer Science (R0)
