
Prototypical Transformer for Weakly Supervised Action Segmentation

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14430)


Abstract

Weakly supervised action segmentation aims to recognize the sequence of actions in a video, using only the action ordering as supervision for model training. Existing methods either predict action labels to construct discriminative losses or segment the video based on action prototypes. In this paper, we propose a novel Prototypical Transformer (ProtoTR) to address the limitations of both approaches. The motivation behind ProtoTR is to endow the prototype-based method with more discriminative power and thereby achieve superior segmentation results. Specifically, the Prediction Decoder of ProtoTR translates the visual input into an action ordering, while its Video Encoder segments the video using action prototypes. As a unified model, the encoder and decoder are jointly optimized over the same set of action prototypes. The effectiveness of the proposed method is demonstrated by its state-of-the-art performance on several benchmark datasets.
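The prototype-based segmentation step described above can be illustrated with a minimal sketch: each frame feature is assigned the label of its nearest action prototype. Note this is a hypothetical illustration only; ProtoTR learns the prototypes jointly with a Transformer encoder and decoder, and the function name and toy data below are assumptions, not the paper's implementation.

```python
def segment_by_prototypes(frames, prototypes):
    """Assign each frame feature vector the index of its nearest prototype.

    frames:     list of T feature vectors (lists of floats)
    prototypes: list of K action-prototype vectors of the same dimension
    returns:    list of T prototype indices (per-frame action labels)
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # For each frame, pick the prototype with the smallest distance.
    return [
        min(range(len(prototypes)), key=lambda k: sq_dist(f, prototypes[k]))
        for f in frames
    ]

# Toy example: 2-D features and two action prototypes.
prototypes = [[0.0, 0.0], [10.0, 10.0]]
frames = [[0.1, -0.2], [0.0, 0.3], [9.8, 10.1], [10.2, 9.9]]
print(segment_by_prototypes(frames, prototypes))  # → [0, 0, 1, 1]
```

In the full model, the same prototype set would also parameterize the decoder's ordering prediction, which is what couples the discriminative and prototype-based objectives.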



Acknowledgement

This work was supported by the National Science Foundation for Young Scientists of China (62106289).


Corresponding author

Correspondence to Xiaobin Chang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Lin, T., Chang, X., Sun, W., Zheng, W. (2024). Prototypical Transformer for Weakly Supervised Action Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_16


  • DOI: https://doi.org/10.1007/978-981-99-8537-1_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8536-4

  • Online ISBN: 978-981-99-8537-1

  • eBook Packages: Computer Science; Computer Science (R0)
