
Abstract

Assembly, biochemical experiments, and cooking are representative tasks that create new value from multiple materials through multiple processes. If a machine can computationally understand such manufacturing tasks, various options for human-machine collaboration open up, from video scene retrieval to robots that act on behalf of humans. As one form of such understanding, this paper introduces a series of our studies that aim to associate visual observations of a process with the procedural text that instructs it. In these studies, captioning is the key task: the input is an image sequence or a set of video clips, and our methods remain state of the art. Through the explanation of these techniques, we give an overview of machine learning technologies that deal with the contextual information of manufacturing tasks.
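
To make the captioning setting concrete, the sketch below shows a minimal, generic encoder-decoder captioner that maps a sequence of step photos to instruction text. It is only an illustration of the task interface, not the authors' model; the backbone choice, dimensions, vocabulary size, and the `StepCaptioner` name are assumptions for this sketch.

```python
# Minimal illustrative sketch (not the authors' method): encode one photo per
# procedure step with a CNN, then decode the procedural text with a Transformer.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class StepCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, n_layers=2, n_heads=8):
        super().__init__()
        backbone = resnet50(weights=None)       # visual encoder; weights choice is an assumption
        backbone.fc = nn.Identity()             # keep the 2048-d pooled feature
        self.encoder = backbone
        self.proj = nn.Linear(2048, d_model)    # image feature -> decoder dimension
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (batch, steps, 3, 224, 224), one photo per procedure step
        # tokens: (batch, seq_len), word ids of the shifted target instruction text
        b, s = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1))     # (b*s, 2048)
        memory = self.proj(feats).view(b, s, -1)       # (b, steps, d_model)
        tgt = self.embed(tokens)                       # (b, seq_len, d_model)
        seq_len = tokens.size(1)
        causal = torch.triu(                           # mask future words during training
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                       # next-word logits per position


if __name__ == "__main__":
    model = StepCaptioner()
    images = torch.randn(2, 4, 3, 224, 224)            # 2 procedures, 4 step photos each
    tokens = torch.randint(0, 10000, (2, 12))          # dummy target word ids
    print(model(images, tokens).shape)                 # torch.Size([2, 12, 10000])
```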



Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers JP21J20250 and JP20H04210, and partially supported by JP21H04910, JP17H06100, and JST-Mirai Program Grant Number JPMJMI21G2.

Author information


Corresponding author

Correspondence to Atsushi Hashimoto.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hashimoto, A., Nishimura, T., Ushiku, Y., Kameko, H., Mori, S. (2022). Cross-modal Representation Learning for Understanding Manufacturing Procedure. In: Rau, PL.P. (eds) Cross-Cultural Design. Applications in Learning, Arts, Cultural Heritage, Creative Industries, and Virtual Reality. HCII 2022. Lecture Notes in Computer Science, vol 13312. Springer, Cham. https://doi.org/10.1007/978-3-031-06047-2_4


  • DOI: https://doi.org/10.1007/978-3-031-06047-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06046-5

  • Online ISBN: 978-3-031-06047-2

  • eBook Packages: Computer Science, Computer Science (R0)
