Learning actionness from action/background discrimination

Original Paper · Signal, Image and Video Processing

Abstract

Localizing actions in instructional web videos is a complex problem because such videos contain background scenes that are unrelated to the task they describe. Incorrect prediction of action step labels can be reduced by separating backgrounds from actions. Yet, discriminating actions from backgrounds is challenging because the same activity can be performed in various styles. In this study, we aim to improve action localization results by learning the actionness of video clips, i.e., the likelihood that a clip contains an action. We present a method that learns an actionness score for each video clip, which is then used to post-process the baseline clip-to-step-label assignment scores. We further propose an auxiliary representation, formed from the baseline assignment scores, to reinforce the discrimination of video clips. Experiments on the CrossTask and COIN datasets show that our actionness score improves the performance of both action step localization and action segmentation.
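
Below is a minimal sketch, in Python with NumPy, of the post-processing idea described above: a learned per-clip actionness score reweights the baseline clip-to-step-label assignment scores and suppresses likely background clips. The function name, the multiplicative reweighting, and the background threshold are illustrative assumptions for this sketch, not the exact formulation used in the paper.

import numpy as np

def postprocess_with_actionness(baseline_scores: np.ndarray,
                                actionness: np.ndarray,
                                background_threshold: float = 0.5) -> np.ndarray:
    """Reweight clip-to-step assignment scores by per-clip actionness (illustrative sketch)."""
    # baseline_scores: (num_clips, num_steps) assignment scores from a baseline localizer.
    # actionness: (num_clips,) learned scores in [0, 1]; higher means more likely an action.
    reweighted = baseline_scores * actionness[:, None]
    # Treat clips with low actionness as background and suppress their step scores.
    reweighted[actionness < background_threshold] = 0.0
    return reweighted

# Example usage with random scores for 8 clips and 5 action steps.
rng = np.random.default_rng(0)
scores = rng.random((8, 5))
act = rng.random(8)
refined = postprocess_with_actionness(scores, act)
predicted_steps = refined.argmax(axis=1)  # step label per clip after post-processing

In this simple reading, clips with high actionness keep their baseline step scores largely unchanged, while low-actionness clips are pushed toward the background, which is the effect the abstract attributes to the proposed actionness score.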

Data availability

The datasets used in the current study are available in the following repositories: CrossTask: https://github.com/DmZhukov/CrossTask; COIN: https://coin-dataset.github.io/ (the COIN dataset is available upon request to the corresponding organization via tys15@tsinghua.org.cn).

Acknowledgements

This study was partially supported by The Scientific and Technological Research Council of Turkey under Grant No. 118E283 and is part of the doctoral thesis of the first author.

Funding

Partial financial support was received from The Scientific and Technological Research Council of Turkey under Grant No. 118E283.

Author information

Contributions

All authors contributed to writing and reviewing the manuscript.

Corresponding author

Correspondence to Ozge Yalcinkaya Simsek.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yalcinkaya Simsek, O., Russakovsky, O. & Duygulu, P. Learning actionness from action/background discrimination. SIViP 17, 1599–1606 (2023). https://doi.org/10.1007/s11760-022-02369-y
