Abstract
Localizing actions in instructional web videos is challenging because such videos contain background segments unrelated to the task being demonstrated. Separating backgrounds from actions can reduce incorrect predictions of action step labels, yet discriminating actions from backgrounds is difficult because the same activity can be performed in many different styles. In this study, we aim to improve action localization by learning the actionness of video clips, i.e., the likelihood that a clip contains an action. We present a method that learns an actionness score for each video clip and uses it to post-process baseline clip-to-step-label assignment scores. We further propose an auxiliary representation, formed from the baseline assignment scores, to reinforce the discrimination of video clips. Experiments on the CrossTask and COIN datasets show that our actionness score improves the performance of both action step localization and action segmentation.
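As a rough illustration of the post-processing idea (the shapes, threshold, and multiplicative combination rule here are illustrative assumptions, not the paper's exact formulation), per-clip actionness can be used to rescale baseline clip-to-step assignment scores and to suppress background clips:

```python
import numpy as np

# Toy sketch of actionness-based post-processing.
rng = np.random.default_rng(0)

T, K = 8, 5                        # T video clips, K step labels
assignment = rng.random((T, K))    # baseline clip-to-step assignment scores
actionness = rng.random(T)         # learned per-clip actionness in [0, 1]

# Scale each clip's step scores by its actionness, suppressing
# background clips before step localization.
refined = actionness[:, None] * assignment

# Mark low-actionness clips as background (-1); otherwise take the
# highest-scoring step label for each clip.
background = actionness < 0.5
steps = np.where(background, -1, refined.argmax(axis=1))
```

This sketch uses a hypothetical 0.5 threshold for background suppression; in practice such a threshold (and the way actionness is combined with the assignment scores) would be chosen per the method and dataset.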
Data availability
The datasets used in the current study are available in the following repositories: CrossTask: https://github.com/DmZhukov/CrossTask. COIN: https://coin-dataset.github.io/ (the COIN dataset is available upon request to the corresponding organization via tys15@tsinghua.org.cn).
Acknowledgements
This study is partially supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 118E283 and is part of the doctoral thesis of the first author.
Funding
Partial financial support was received from The Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No. 118E283.
Author information
Authors and Affiliations
Contributions
All authors contributed to writing and reviewing the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yalcinkaya Simsek, O., Russakovsky, O. & Duygulu, P. Learning actionness from action/background discrimination. SIViP 17, 1599–1606 (2023). https://doi.org/10.1007/s11760-022-02369-y