Abstract
Localizing actions in instructional web videos is challenging because such videos contain background segments unrelated to the task being demonstrated. Separating backgrounds from actions can reduce incorrect predictions of action step labels, yet discriminating actions from backgrounds is difficult because the same activity can be performed in many different styles. In this study, we aim to improve action localization by learning the actionness of video clips, i.e., the likelihood that a clip contains an action. We present a method that learns an actionness score for each video clip and uses it to post-process baseline clip-to-step-label assignment scores. We further propose an auxiliary representation, formed from the baseline assignment scores, to reinforce the discrimination of video clips. Experiments on the CrossTask and COIN datasets show that our actionness score improves the performance of both action step localization and action segmentation.
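As a rough illustration of the post-processing idea (the shapes, threshold, and multiplicative combination rule here are illustrative assumptions, not the paper's exact formulation), per-clip actionness can be used to rescale baseline clip-to-step assignment scores and to suppress background clips:

```python
import numpy as np

# Toy sketch of actionness-based post-processing.
rng = np.random.default_rng(0)

T, K = 8, 5                        # T video clips, K step labels
assignment = rng.random((T, K))    # baseline clip-to-step assignment scores
actionness = rng.random(T)         # learned per-clip actionness in [0, 1]

# Scale each clip's step scores by its actionness, suppressing
# background clips before step localization.
refined = actionness[:, None] * assignment

# Mark low-actionness clips as background (-1); otherwise take the
# highest-scoring step label for each clip.
background = actionness < 0.5
steps = np.where(background, -1, refined.argmax(axis=1))
```

This sketch uses a hypothetical 0.5 threshold for background suppression; in practice such a threshold (and the way actionness is combined with the assignment scores) would be chosen per the method and dataset.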
Data availability
The datasets used in the current study are available in the following repositories: CrossTask: https://github.com/DmZhukov/CrossTask. COIN: https://coin-dataset.github.io/ (the COIN dataset is available upon request to the corresponding organization via tys15@tsinghua.org.cn).
Acknowledgements
This study is partially supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 118E283 and is part of the doctoral thesis of the first author.
Funding
Partial financial support was received from The Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No. 118E283.
Author information
Authors and Affiliations
Contributions
All authors contributed to writing and reviewing the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yalcinkaya Simsek, O., Russakovsky, O. & Duygulu, P. Learning actionness from action/background discrimination. SIViP 17, 1599–1606 (2023). https://doi.org/10.1007/s11760-022-02369-y