Abstract
Assembly, biochemical experiments, and cooking are representative tasks that create new value from multiple materials through multiple processes. If a machine can computationally understand such manufacturing tasks, we will have various options for human-machine collaboration on those tasks, from video scene retrieval to robots that act on behalf of humans. As one form of such understanding, this paper introduces a series of our studies that aim to associate visual observations of the processes with the procedural texts that instruct those processes. In those studies, captioning is the key task: the input is an image sequence or a set of video clips, and our methods remain state of the art. Through the explanation of these techniques, we give an overview of machine learning technologies that deal with the contextual information of manufacturing tasks.
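For concreteness, the following is a minimal sketch, in PyTorch, of the captioning setup sketched above: one feature per step image is encoded with cross-step context, and a recurrent decoder emits one instruction sentence per step. This is not the authors' implementation; the module choices, dimensions, and vocabulary size are illustrative assumptions.

# A minimal sketch (not the authors' method) of procedural captioning:
# a photo sequence in, one instruction sentence per step out.
# All names, dimensions, and the vocabulary size are assumptions.
import torch
import torch.nn as nn

class ProceduralCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project per-image features (e.g., the pooled output of a CNN).
        self.proj = nn.Linear(feat_dim, hidden_dim)
        # Context encoder over the photo sequence captures cross-step context.
        self.context = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Word decoder, initialized with each step's contextual feature.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats:    (batch, n_steps, feat_dim)  one feature per step image
        # captions: (batch, n_steps, n_words)   token ids (teacher forcing)
        ctx, _ = self.context(self.proj(feats))      # (B, S, H)
        B, S, W = captions.shape
        words = self.embed(captions).reshape(B * S, W, -1)
        # Each step's decoder starts from its contextual image feature.
        h0 = ctx.reshape(1, B * S, -1)
        dec, _ = self.decoder(words, h0)
        return self.out(dec).reshape(B, S, W, -1)    # per-token logits

model = ProceduralCaptioner()
feats = torch.randn(2, 4, 2048)             # 2 recipes, 4 step photos each
caps = torch.randint(0, 10000, (2, 4, 12))  # 12 tokens per step caption
print(model(feats, caps).shape)             # torch.Size([2, 4, 12, 10000])

At inference time, the decoder would instead generate each step's sentence token by token (greedily or with beam search) rather than consuming ground-truth captions.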
Acknowledgement
This work was supported by JSPS KAKENHI Grant Numbers JP21J20250 and JP20H04210, and partially supported by JP21H04910, JP17H06100, and JST-Mirai Program Grant Number JPMJMI21G2.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hashimoto, A., Nishimura, T., Ushiku, Y., Kameko, H., Mori, S. (2022). Cross-modal Representation Learning for Understanding Manufacturing Procedure. In: Rau, PL.P. (eds) Cross-Cultural Design. Applications in Learning, Arts, Cultural Heritage, Creative Industries, and Virtual Reality. HCII 2022. Lecture Notes in Computer Science, vol 13312. Springer, Cham. https://doi.org/10.1007/978-3-031-06047-2_4
DOI: https://doi.org/10.1007/978-3-031-06047-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06046-5
Online ISBN: 978-3-031-06047-2
eBook Packages: Computer Science, Computer Science (R0)