Research Article
DOI: 10.1145/3474085.3475322

State-aware Video Procedural Captioning

Published: 17 October 2021

Abstract

Video procedural captioning (VPC), which generates procedural text from instructional videos, is an essential task for scene understanding and real-world applications. The main challenge of VPC is to describe how to manipulate materials accurately. This paper focuses on this challenge by designing a new VPC task: generating a procedural text from the clip sequence of an instructional video and its material list. In this task, the state of materials is sequentially changed by manipulations, yielding state-aware visual representations (e.g., eggs are transformed into cracked, stirred, then fried forms). The essential difficulty is to convert such visual representations into textual representations; that is, a model should track the material states after each manipulation to better associate the cross-modal relations. To achieve this, we propose a novel VPC method that adapts an existing textual simulator for tracking material states into a visual simulator and incorporates it into a video captioning model. Our experimental results show the effectiveness of the proposed method, which outperforms state-of-the-art video captioning models. We further analyze the learned embeddings of materials to demonstrate that the simulators capture their state transitions. The code and dataset are available at https://github.com/misogil0116/svpc.
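The abstract describes the method only at a high level: a visual simulator tracks how material-state representations evolve from clip to clip, and these states condition a captioning decoder. As a rough illustration of that idea, the sketch below is a minimal, hypothetical PyTorch-style implementation; all module names, shapes, and design choices are assumptions made for this example and do not reflect the authors' released code (see the GitHub repository above for the actual implementation).

import torch
import torch.nn as nn


class VisualSimulator(nn.Module):
    """Hypothetical state tracker: updates material embeddings per clip.

    Each material attends to the current clip feature and its embedding is
    revised with a gated (GRU-style) transition, mimicking the idea of
    simulating state changes (e.g., crack -> stir -> fry).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, clip_feat: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
        # clip_feat: (B, dim)     feature of the current clip
        # states:    (B, M, dim)  current embeddings of the M listed materials
        B, M, D = states.shape
        # Each material queries the clip feature (treated as a length-1 sequence).
        ctx, _ = self.attn(states, clip_feat.unsqueeze(1), clip_feat.unsqueeze(1))
        # Gated update of every material's state embedding.
        new_states = self.update(ctx.reshape(B * M, D), states.reshape(B * M, D))
        return new_states.view(B, M, D)


class StateAwareCaptioner(nn.Module):
    """Generates one sentence per clip, conditioned on the clip and material states."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.simulator = VisualSimulator(dim)
        self.embed = nn.Embedding(vocab_size, dim)
        decoder_layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, clips: torch.Tensor, materials: torch.Tensor, tokens: torch.Tensor):
        # clips:     (B, T, dim)  clip-level visual features
        # materials: (B, M, dim)  initial material embeddings (from the material list)
        # tokens:    (B, T, L)    gold caption tokens per clip (teacher forcing)
        states = materials
        logits_per_clip = []
        for t in range(clips.size(1)):
            states = self.simulator(clips[:, t], states)             # simulate state change
            memory = torch.cat([clips[:, t:t + 1], states], dim=1)   # cross-modal memory
            tgt = self.embed(tokens[:, t])                           # (B, L, dim)
            dec = self.decoder(tgt, memory)
            logits_per_clip.append(self.out(dec))                    # (B, L, vocab)
        return torch.stack(logits_per_clip, dim=1)                   # (B, T, L, vocab)


# Smoke test with random tensors (shapes only; no real video features).
if __name__ == "__main__":
    model = StateAwareCaptioner(vocab_size=1000, dim=256)
    clips = torch.randn(2, 4, 256)       # 2 videos, 4 clips each
    materials = torch.randn(2, 5, 256)   # 5 listed materials per video
    tokens = torch.randint(0, 1000, (2, 4, 12))
    print(model(clips, materials, tokens).shape)  # torch.Size([2, 4, 12, 1000])

In this sketch the material embeddings act as a recurrent memory that each clip overwrites through a gated update; this is the kind of learned material representation whose state transitions the paper analyzes, though the actual architecture may differ.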

Supplementary Material

ZIP File (mfp0892aux.zip)
This PDF (contained in the ZIP archive) provides supplementary material, including implementation details, the annotation process, and examples of generated procedural texts.
M4V File (presentation_video.m4v)



Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. instructional video
  2. procedural text
  3. simulator

Qualifiers

  • Research-article

Funding Sources

  • JSPS KAKENHI Grant

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Recipe Generation from Unsegmented Cooking Videos. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3649137. Online publication date: 21-Feb-2024.
  • (2024) Spatial–Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation. IEEE Transactions on Image Processing, 33, 556–568. DOI: 10.1109/TIP.2023.3345652. Online publication date: 1-Jan-2024.
  • (2024) Video Frame-wise Explanation Driven Contrastive Learning for Procedural Text Generation. Computer Vision and Image Understanding, 241(C). DOI: 10.1016/j.cviu.2024.103954. Online publication date: 2-Jul-2024.
  • (2024) COM Kitchens: An Unedited Overhead-View Video Dataset as a Vision-Language Benchmark. Computer Vision – ECCV 2024, 123–140. DOI: 10.1007/978-3-031-73650-6_8. Online publication date: 21-Nov-2024.
  • (2024) Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot. Intelligent Autonomous Systems 18, 547–560. DOI: 10.1007/978-3-031-44851-5_42. Online publication date: 25-Apr-2024.
  • (2023) Research History on "BioVL2: An Egocentric Biochemical Video-and-Language Dataset". Journal of Natural Language Processing, 30(2), 833–838. DOI: 10.5715/jnlp.30.833. Online publication date: 2023.
  • (2023) Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows. Journal of Natural Language Processing, 30(3), 1042–1060. DOI: 10.5715/jnlp.30.1042. Online publication date: 2023.
  • (2023) Video Scene Graph Generation with Spatial-Temporal Knowledge. Proceedings of the 31st ACM International Conference on Multimedia, 9340–9344. DOI: 10.1145/3581783.3613433. Online publication date: 26-Oct-2023.
  • (2023) State-aware video procedural captioning. Multimedia Tools and Applications, 82(24), 37273–37301. DOI: 10.1007/s11042-023-14774-7. Online publication date: 20-Mar-2023.
  • (2022) BioVL2: An Egocentric Biochemical Video-and-Language Dataset. Journal of Natural Language Processing, 29(4), 1106–1137. DOI: 10.5715/jnlp.29.1106. Online publication date: 2022.
