Latent-Space Data Augmentation for Visually-Grounded Language Understanding

  • Conference paper

Advances in Artificial Intelligence (JSAI 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1128)

Abstract

This paper is an extended version of a selected paper from JSAI 2019. In this paper, we study data augmentation for visually-grounded language understanding in the context of a picking task. A typical picking task consists of predicting a target object specified by an ambiguous instruction, e.g., “Pick up the yellow toy near the bottle”. We specifically show that existing methods for understanding such instructions can be improved by data augmentation. More explicitly, the MTCM [1] and MTCM-GAN [2] achieve better results with data augmentation when it is applied to latent-space features instead of raw features. Additionally, our results show that latent-space data augmentation improves network accuracy more than regularization methods do.
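To make the approach concrete, the sketch below shows one simple instance of latent-space data augmentation: instead of perturbing raw inputs (images or sentences), feature vectors already extracted by an encoder (e.g., CNN visual features or linguistic embeddings) are perturbed with additive Gaussian noise, and the noisy copies inherit the labels of their originals. This is a minimal illustration under assumed settings; the function name augment_latent, the noise model, and the noise scale noise_std are illustrative choices, not the exact scheme used in the paper.

```python
import numpy as np

def augment_latent(features, n_copies=4, noise_std=0.05, seed=0):
    """Hypothetical latent-space augmentation: add Gaussian noise to
    pre-extracted feature vectors of shape (N, D), not to raw inputs.

    Returns the originals plus n_copies noisy variants, i.e. an array
    of shape (N * (n_copies + 1), D). Each noisy copy keeps the label
    of the sample it was derived from.
    """
    rng = np.random.default_rng(seed)
    copies = [features]
    for _ in range(n_copies):
        # Perturb in latent space; noise_std is an assumed hyperparameter.
        copies.append(features + rng.normal(0.0, noise_std, size=features.shape))
    return np.concatenate(copies, axis=0)

# Example with dummy 512-dim latent features for 32 samples.
latent = np.random.default_rng(1).random((32, 512), dtype=np.float32)
labels = np.arange(32)
augmented = augment_latent(latent)
augmented_labels = np.tile(labels, 5)  # replicate labels for the 5 stacked copies
print(augmented.shape, augmented_labels.shape)  # (160, 512) (160,)
```

One design note: perturbing compact latent vectors is cheap and independent of the encoder architecture, whereas raw-input augmentation must be tailored to each modality (e.g., cropping for images, paraphrasing for sentences).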

References

  1. Magassouba, A., Sugiura, K., Kawai, H.: A multimodal target-source classifier model for object fetching from natural language instructions. In: Proceedings of the National Congress of the Japanese Society for Artificial Intelligence, pp. 2D3E403–2D3E403 (2019). (in Japanese)

  2. Magassouba, A., Sugiura, K., Trinh Quoc, A., Kawai, H.: Understanding natural language instructions for fetching daily objects using GAN-based multimodal target-source classification. IEEE RA-L 4(4), 3884–3891 (2019)

  3. Iocchi, L., Holz, D., Ruiz-del-Solar, J., Sugiura, K., van der Zant, T.: RoboCup@Home: analysis and results of evolving competitions for domestic and service robots. Artif. Intell. 229, 258–281 (2015)

  4. Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: CVPR, vol. 2 (2017)

  5. Magassouba, A., Sugiura, K., Kawai, H.: A multimodal classifier generative adversarial network for carry and place tasks from ambiguous language instructions. IEEE RA-L 3(4), 3113–3120 (2018)

  6. Magassouba, A., Sugiura, K., Kawai, H.: Multimodal attention branch network for perspective-free sentence generation. In: Conference on Robot Learning (CoRL) (2019)

  7. Cohen, V., Burchfiel, B., Nguyen, T., Gopalan, N., Tellex, S., Konidaris, G.: Grounding language attributes to objects using Bayesian eigenobjects. In: IEEE IROS (2019)

  8. Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: ECCV, pp. 792–807 (2016)

  9. Hatori, J., et al.: Interactively picking real-world objects with unconstrained spoken language instructions. In: IEEE ICRA, pp. 3774–3781 (2018)

  10. Shridhar, M., Hsu, D.: Interactive visual grounding of referring expressions for human-robot interaction. In: RSS (2018)

  11. Springenberg, J.T.: Unsupervised and semi-supervised learning with categorical generative adversarial networks. In: ICLR (2016)

  12. Sugiura, K., Kawai, H.: Grounded language understanding for manipulation instructions using GAN-based classification. In: IEEE ASRU (2017)

  13. Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier GANs. In: ICML, pp. 2642–2651 (2017)

  14. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: CVPR, pp. 3674–3683 (2018)

  15. Bousmalis, K., et al.: Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In: Proceedings of IEEE ICRA, pp. 4243–4250 (2018)

  16. Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., Vanhoucke, V.: Sim-to-real: learning agile locomotion for quadruped robots. In: RSS (2018)

  17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)

  18. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)

  19. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. In: NIPS, pp. 5769–5779 (2017)

  20. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: ICLR (2018)


Acknowledgement

This work was partially supported by JST CREST, SCOPE and NEDO.

Author information

Correspondence to Aly Magassouba.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Magassouba, A., Sugiura, K., Kawai, H. (2020). Latent-Space Data Augmentation for Visually-Grounded Language Understanding. In: Ohsawa, Y., et al. Advances in Artificial Intelligence. JSAI 2019. Advances in Intelligent Systems and Computing, vol 1128. Springer, Cham. https://doi.org/10.1007/978-3-030-39878-1_17
