Flow Graph to Video Grounding for Weakly-Supervised Multi-step Localization

Dvornik, Nikita; Hadji, Isma; Pham, Hai; Bhatt, Dhaivat; Martinez, Brais; Fazly, Afsaneh; Jepson, Allan D.

doi:10.1007/978-3-031-19833-5_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13695)

Included in the following conference series:

European Conference on Computer Vision

Abstract

In this work, we consider the problem of weakly-supervised multi-step localization in instructional videos. An established approach to this problem is to rely on a given list of steps. In reality, however, there is often more than one way to execute a procedure successfully, by following the same set of steps in slightly varying orders. Thus, for successful localization in a given video, recent works require the actual order of procedure steps in the video to be provided by human annotators at both training and test time. Here, we instead rely only on generic procedural text that is not tied to a specific video. We represent the various ways to complete the procedure by transforming the list of instructions into a procedure flow graph, which captures the partial order of steps. Using flow graphs reduces both training-time and test-time annotation requirements. To this end, we introduce the new problem of flow graph to video grounding. In this setup, we seek the optimal step ordering consistent with the procedure flow graph and a given video. To solve this problem, we propose a new algorithm, Graph2Vid, that infers the actual ordering of steps in the video and simultaneously localizes them. To show the advantage of our proposed formulation, we extend the CrossTask dataset with procedure flow graph information. Our experiments show that Graph2Vid is both more efficient than the baselines and yields strong step localization results, without the need for step order annotation.
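The grounding problem stated in the abstract can be illustrated with a brute-force sketch. This is not Graph2Vid: it enumerates every step ordering allowed by a toy flow graph and aligns each one to the video separately, which is the kind of exhaustive search an efficient algorithm would avoid. The step names, the cost matrix, and the helper names (`topological_orderings`, `alignment_cost`, `ground`) are all hypothetical, and the alignment is a plain monotone segmentation in which every frame is assigned to exactly one step.

```python
def topological_orderings(nodes, edges):
    """Yield every ordering of `nodes` that respects the edges (u must precede v)."""
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)

    def extend(prefix, placed):
        if len(prefix) == len(nodes):
            yield tuple(prefix)
            return
        for n in nodes:
            # A step may come next only once all of its prerequisites are placed.
            if n not in placed and preds[n] <= placed:
                yield from extend(prefix + [n], placed | {n})

    yield from extend([], frozenset())


def alignment_cost(order, cost):
    """Minimum cost of splitting the frames into consecutive segments, one per step.

    cost[step][t] is the (hypothetical) mismatch between `step` and frame t.
    Every step must cover at least one frame, and steps appear in `order`.
    """
    num_frames = len(cost[order[0]])
    inf = float("inf")
    prev = [inf] * num_frames
    for k, step in enumerate(order):
        cur = [inf] * num_frames
        for t in range(num_frames):
            if k == 0:
                best = 0.0 if t == 0 else cur[t - 1]
            elif t == 0:
                best = inf  # a later step cannot start at the first frame
            else:
                # Either stay in the current step or advance from the previous one.
                best = min(cur[t - 1], prev[t - 1])
            cur[t] = cost[step][t] + best
        prev = cur
    return prev[-1]


def ground(nodes, edges, cost):
    """Return the (ordering, cost) pair with the lowest alignment cost."""
    return min(
        ((order, alignment_cost(order, cost)) for order in topological_orderings(nodes, edges)),
        key=lambda pair: pair[1],
    )


# Toy flow graph: two independent preparation steps, then two sequential steps.
steps = ["pour milk", "whisk eggs", "mix batter", "fry"]
flow_edges = [("pour milk", "mix batter"), ("whisk eggs", "mix batter"), ("mix batter", "fry")]

# Hypothetical per-frame costs for a 5-frame video (0 = good match, 1 = poor match).
frame_cost = {
    "pour milk":  [1, 0, 1, 1, 1],
    "whisk eggs": [0, 1, 1, 1, 1],
    "mix batter": [1, 1, 0, 0, 1],
    "fry":        [1, 1, 1, 1, 0],
}

best_order, best_cost = ground(steps, flow_edges, frame_cost)
print(best_order, best_cost)
```

In this toy video the eggs are whisked before the milk is poured, so the ordering that starts with "whisk eggs" aligns at zero cost while the other valid ordering does not. The number of orderings grows quickly with the number of independent threads, which is why enumerating them is only workable for very small graphs.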

Notes

1. The size of the tSort graph is polynomial for an assumed subset of flow graphs with a fixed maximum number of threads.
2. We obtain the procedure text for CrossTask from www.wikihow.com.
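The intuition behind Note 1 can be checked on a simplified model. The model is an assumption made here for illustration and is not the paper's definition of the tSort graph: the flow graph is taken to be a set of independent threads, each a chain of steps, and a "progress state" records how many steps of each thread have been completed. The function names are hypothetical. Under this model the number of progress states is polynomial in the thread length when the number of threads is fixed, while the number of valid step orderings grows much faster.

```python
from math import factorial


def num_orderings(threads, length):
    """Count the interleavings of `threads` independent chains of `length` steps each."""
    # Multinomial coefficient: (threads * length)! / (length!)^threads
    return factorial(threads * length) // factorial(length) ** threads


def num_progress_states(threads, length):
    """Count the progress states: each thread has completed 0..length of its steps."""
    return (length + 1) ** threads


for threads, length in [(2, 2), (2, 5), (3, 5)]:
    print(threads, length, num_orderings(threads, length), num_progress_states(threads, length))
```

With three threads of five steps each, this gives 756756 orderings but only 216 progress states, which suggests why a compact graph over states is preferable to listing orderings one by one.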

Acknowledgements

We thank Ran Zhang for help with flow graph creation and processing.

Author information

Authors and Affiliations

Samsung AI Center, New York, USA
Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly & Allan D. Jepson

Corresponding author

Correspondence to Nikita Dvornik.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

About this paper

Cite this paper

Dvornik, N. et al. (2022). Flow Graph to Video Grounding for Weakly-Supervised Multi-step Localization. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_19

DOI: https://doi.org/10.1007/978-3-031-19833-5_19
Published: 04 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5
eBook Packages: Computer Science, Computer Science (R0)
