
Prototypical Transformer for Weakly Supervised Action Segmentation

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14430)


Abstract

Weakly supervised action segmentation aims to recognize the sequence of actions in a video, using only the action ordering as supervision for model training. Existing methods either predict action labels to construct discriminative losses or segment the video based on action prototypes. In this paper, we propose a novel Prototypical Transformer (ProtoTR) to address the limitations of both approaches. The motivation behind ProtoTR is to endow the prototype-based method with more discriminative power and thereby achieve superior segmentation results. Specifically, the Prediction Decoder of ProtoTR translates the visual input into an action ordering, while its Video Encoder segments the video using action prototypes. As a unified model, the encoder and decoder are jointly optimized over the same set of action prototypes. The effectiveness of the proposed method is demonstrated by its state-of-the-art performance on several benchmark datasets.
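The prototype-based segmentation step described above can be illustrated with a minimal sketch: each frame feature is assigned the label of its nearest action prototype. Note this is a hypothetical illustration only; ProtoTR learns the prototypes jointly with a Transformer encoder and decoder, and the function name and toy data below are assumptions, not the paper's implementation.

```python
def segment_by_prototypes(frames, prototypes):
    """Assign each frame feature vector the index of its nearest prototype.

    frames:     list of T feature vectors (lists of floats)
    prototypes: list of K action-prototype vectors of the same dimension
    returns:    list of T prototype indices (per-frame action labels)
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # For each frame, pick the prototype with the smallest distance.
    return [
        min(range(len(prototypes)), key=lambda k: sq_dist(f, prototypes[k]))
        for f in frames
    ]

# Toy example: 2-D features and two action prototypes.
prototypes = [[0.0, 0.0], [10.0, 10.0]]
frames = [[0.1, -0.2], [0.0, 0.3], [9.8, 10.1], [10.2, 9.9]]
print(segment_by_prototypes(frames, prototypes))  # → [0, 0, 1, 1]
```

In the full model, the same prototype set would also parameterize the decoder's ordering prediction, which is what couples the discriminative and prototype-based objectives.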



Acknowledgement

This work was supported by the National Science Foundation for Young Scientists of China (62106289).


Corresponding author

Correspondence to Xiaobin Chang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Lin, T., Chang, X., Sun, W., Zheng, W. (2024). Prototypical Transformer for Weakly Supervised Action Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_16


  • DOI: https://doi.org/10.1007/978-981-99-8537-1_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8536-4

  • Online ISBN: 978-981-99-8537-1

  • eBook Packages: Computer Science; Computer Science (R0)
