Research Article
DOI: 10.1145/3474085.3475322

State-aware Video Procedural Captioning

Published: 17 October 2021

Abstract

Video procedural captioning (VPC), which generates procedural text from instructional videos, is an essential task for scene understanding and real-world applications. The main challenge of VPC is to describe how to manipulate materials accurately. This paper focuses on this challenge by designing a new VPC task: generating a procedural text from the clip sequence of an instructional video and its material list. In this task, the state of materials is sequentially changed by manipulations, yielding state-aware visual representations (e.g., eggs are transformed into cracked, stirred, then fried forms). The essential difficulty is to convert such visual representations into textual representations; that is, a model should track the material states after each manipulation to better associate the cross-modal relations. To achieve this, we propose a novel VPC method that adapts an existing textual simulator for tracking material states into a visual simulator and incorporates it into a video captioning model. Our experimental results show the effectiveness of the proposed method, which outperforms state-of-the-art video captioning models. We further analyze the learned embeddings of materials to demonstrate that the simulators capture their state transitions. The code and dataset are available at https://github.com/misogil0116/svpc.
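The abstract describes the method only at a high level: a visual simulator tracks how material-state representations evolve from clip to clip, and these states condition a captioning decoder. As a rough illustration of that idea, the sketch below is a minimal, hypothetical PyTorch-style implementation; all module names, shapes, and design choices are assumptions made for this example and do not reflect the authors' released code (see the GitHub repository above for the actual implementation).

import torch
import torch.nn as nn


class VisualSimulator(nn.Module):
    """Hypothetical state tracker: updates material embeddings per clip.

    Each material attends to the current clip feature and its embedding is
    revised with a gated (GRU-style) transition, mimicking the idea of
    simulating state changes (e.g., crack -> stir -> fry).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, clip_feat: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
        # clip_feat: (B, dim)     feature of the current clip
        # states:    (B, M, dim)  current embeddings of the M listed materials
        B, M, D = states.shape
        # Each material queries the clip feature (treated as a length-1 sequence).
        ctx, _ = self.attn(states, clip_feat.unsqueeze(1), clip_feat.unsqueeze(1))
        # Gated update of every material's state embedding.
        new_states = self.update(ctx.reshape(B * M, D), states.reshape(B * M, D))
        return new_states.view(B, M, D)


class StateAwareCaptioner(nn.Module):
    """Generates one sentence per clip, conditioned on the clip and material states."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.simulator = VisualSimulator(dim)
        self.embed = nn.Embedding(vocab_size, dim)
        decoder_layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, clips: torch.Tensor, materials: torch.Tensor, tokens: torch.Tensor):
        # clips:     (B, T, dim)  clip-level visual features
        # materials: (B, M, dim)  initial material embeddings (from the material list)
        # tokens:    (B, T, L)    gold caption tokens per clip (teacher forcing)
        states = materials
        logits_per_clip = []
        for t in range(clips.size(1)):
            states = self.simulator(clips[:, t], states)             # simulate state change
            memory = torch.cat([clips[:, t:t + 1], states], dim=1)   # cross-modal memory
            tgt = self.embed(tokens[:, t])                           # (B, L, dim)
            dec = self.decoder(tgt, memory)
            logits_per_clip.append(self.out(dec))                    # (B, L, vocab)
        return torch.stack(logits_per_clip, dim=1)                   # (B, T, L, vocab)


# Smoke test with random tensors (shapes only; no real video features).
if __name__ == "__main__":
    model = StateAwareCaptioner(vocab_size=1000, dim=256)
    clips = torch.randn(2, 4, 256)       # 2 videos, 4 clips each
    materials = torch.randn(2, 5, 256)   # 5 listed materials per video
    tokens = torch.randint(0, 1000, (2, 4, 12))
    print(model(clips, materials, tokens).shape)  # torch.Size([2, 4, 12, 1000])

In this sketch the material embeddings act as a recurrent memory that each clip overwrites through a gated update; this is the kind of learned material representation whose state transitions the paper analyzes, though the actual architecture may differ.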

Supplementary Material

ZIP File (mfp0892aux.zip)
This PDF (contained in the ZIP archive) provides supplementary material, including implementation details, the annotation process, and examples of generated procedural texts.
M4V File (presentation_video.m4v)



Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. instructional video
  2. procedural text
  3. simulator

Qualifiers

  • Research-article

Funding Sources

  • JSPS KAKENHI Grant

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Recipe Generation from Unsegmented Cooking Videos. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3649137. Online publication date: 21-Feb-2024.
  • (2024) Spatial–Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation. IEEE Transactions on Image Processing, 33, 556–568. DOI: 10.1109/TIP.2023.3345652. Online publication date: 1-Jan-2024.
  • (2024) Video Frame-wise Explanation Driven Contrastive Learning for Procedural Text Generation. Computer Vision and Image Understanding, 241(C). DOI: 10.1016/j.cviu.2024.103954. Online publication date: 2-Jul-2024.
  • (2024) COM Kitchens: An Unedited Overhead-View Video Dataset as a Vision-Language Benchmark. Computer Vision – ECCV 2024, 123–140. DOI: 10.1007/978-3-031-73650-6_8. Online publication date: 21-Nov-2024.
  • (2024) Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot. Intelligent Autonomous Systems 18, 547–560. DOI: 10.1007/978-3-031-44851-5_42. Online publication date: 25-Apr-2024.
  • (2023) Research History on "BioVL2: An Egocentric Biochemical Video-and-Language Dataset". Journal of Natural Language Processing, 30(2), 833–838. DOI: 10.5715/jnlp.30.833. Online publication date: 2023.
  • (2023) Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows. Journal of Natural Language Processing, 30(3), 1042–1060. DOI: 10.5715/jnlp.30.1042. Online publication date: 2023.
  • (2023) Video Scene Graph Generation with Spatial-Temporal Knowledge. Proceedings of the 31st ACM International Conference on Multimedia, 9340–9344. DOI: 10.1145/3581783.3613433. Online publication date: 26-Oct-2023.
  • (2023) State-aware video procedural captioning. Multimedia Tools and Applications, 82(24), 37273–37301. DOI: 10.1007/s11042-023-14774-7. Online publication date: 20-Mar-2023.
  • (2022) BioVL2: An Egocentric Biochemical Video-and-Language Dataset. Journal of Natural Language Processing, 29(4), 1106–1137. DOI: 10.5715/jnlp.29.1106. Online publication date: 2022.
