Abstract
Assembly, biochemical experiments, and cooking are representative tasks that create new value from multiple materials through multiple processes. If a machine can computationally understand such manufacturing tasks, we will have various options for human-machine collaboration on those tasks, from video scene retrieval to robots that act on behalf of humans. As one form of such understanding, this paper introduces a series of our studies that aim to associate visual observations of the processes with the procedural texts that instruct those processes. In those studies, captioning is the key task: the input is an image sequence or a set of video clips, and our methods remain state of the art. Through the explanation of these techniques, we give an overview of machine learning technologies that deal with the contextual information of manufacturing tasks.
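For concreteness, the following is a minimal sketch, in PyTorch, of the captioning setup sketched above: one feature per step image is encoded with cross-step context, and a recurrent decoder emits one instruction sentence per step. This is not the authors' implementation; the module choices, dimensions, and vocabulary size are illustrative assumptions.

# A minimal sketch (not the authors' method) of procedural captioning:
# a photo sequence in, one instruction sentence per step out.
# All names, dimensions, and the vocabulary size are assumptions.
import torch
import torch.nn as nn

class ProceduralCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project per-image features (e.g., the pooled output of a CNN).
        self.proj = nn.Linear(feat_dim, hidden_dim)
        # Context encoder over the photo sequence captures cross-step context.
        self.context = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Word decoder, initialized with each step's contextual feature.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats:    (batch, n_steps, feat_dim)  one feature per step image
        # captions: (batch, n_steps, n_words)   token ids (teacher forcing)
        ctx, _ = self.context(self.proj(feats))      # (B, S, H)
        B, S, W = captions.shape
        words = self.embed(captions).reshape(B * S, W, -1)
        # Each step's decoder starts from its contextual image feature.
        h0 = ctx.reshape(1, B * S, -1)
        dec, _ = self.decoder(words, h0)
        return self.out(dec).reshape(B, S, W, -1)    # per-token logits

model = ProceduralCaptioner()
feats = torch.randn(2, 4, 2048)             # 2 recipes, 4 step photos each
caps = torch.randint(0, 10000, (2, 4, 12))  # 12 tokens per step caption
print(model(feats, caps).shape)             # torch.Size([2, 4, 12, 10000])

At inference time, the decoder would instead generate each step's sentence token by token (greedily or with beam search) rather than consuming ground-truth captions.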
Acknowledgement
This work was supported by JSPS KAKENHI Grant Numbers JP21J20250 and JP20H04210, and partially supported by JP21H04910, JP17H06100, and JST-Mirai Program Grant Number JPMJMI21G2.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hashimoto, A., Nishimura, T., Ushiku, Y., Kameko, H., Mori, S. (2022). Cross-modal Representation Learning for Understanding Manufacturing Procedure. In: Rau, PL.P. (eds) Cross-Cultural Design. Applications in Learning, Arts, Cultural Heritage, Creative Industries, and Virtual Reality. HCII 2022. Lecture Notes in Computer Science, vol 13312. Springer, Cham. https://doi.org/10.1007/978-3-031-06047-2_4
DOI: https://doi.org/10.1007/978-3-031-06047-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06046-5
Online ISBN: 978-3-031-06047-2
eBook Packages: Computer Science, Computer Science (R0)