Bottom-up improved multistage temporal convolutional network for action segmentation

Applied Intelligence

Abstract

Action segmentation involves locating and classifying human action segments in an untrimmed video, which is important for understanding human activities. Segmenting actions in videos is challenging because of ambiguous frames. Previous studies usually required additional inputs or built highly complicated network structures to achieve good performance. However, these additional inputs are not easy to obtain, and complicated network structures increase computation and storage costs. To mitigate these problems, we propose a bottom-up improved multistage temporal convolutional network (BUIMS-TCN) for action segmentation. First, we propose a smoothed dilated 1D convolution to learn the inherent local temporal dependencies. Second, we design an adaptive temporal fusion module (ATFM), a simple yet effective module that fuses multiscale temporal-context information, to obtain better semantic feature representations. Finally, we introduce a new loss function to address the imbalance between easy and hard samples. To the best of our knowledge, this is the first time that these improvements have been incorporated into the action segmentation task. Extensive experiments verify that our model significantly outperforms the state-of-the-art baselines on three challenging benchmark datasets: Georgia Tech Egocentric Activities (GTEA), 50Salads, and Breakfast.
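
The abstract names a smoothed dilated 1D convolution and a loss that rebalances easy and hard samples but gives no formulas. The sketch below is only an illustration of the two general ideas, not the BUIMS-TCN implementation. It assumes a moving-average pre-smoothing step of width `2 * dilation - 1` and uses the well-known focal loss as a stand-in for the unspecified loss. All function names are hypothetical.

```python
import math


def dilated_conv1d(x, w, dilation):
    """Zero-padded, same-length dilated 1D convolution (cross-correlation form)."""
    half = (len(w) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(w):
            idx = t + (j - half) * dilation  # kernel taps sit `dilation` frames apart
            if 0 <= idx < len(x):
                acc += wj * x[idx]
        out.append(acc)
    return out


def smoothed_dilated_conv1d(x, w, dilation):
    """Assumed smoothing: moving average of width 2*dilation - 1, then dilate.

    The averaging lets frames that fall between the dilated taps contribute,
    which is one way to restore local temporal dependencies.
    """
    width = 2 * dilation - 1
    smooth = [1.0 / width] * width
    return dilated_conv1d(dilated_conv1d(x, smooth, 1), w, dilation)


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one frame; p_true is the probability of its true class, in (0, 1].

    Stand-in for the paper's loss: (1 - p_true)**gamma shrinks the loss of
    easy (high-confidence) frames; gamma=0 recovers plain cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# A single active frame in a 9-frame sequence, filtered with dilation 2.
impulse = [0.0] * 9
impulse[4] = 1.0
plain = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2)
smoothed = smoothed_dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2)
```

Here `plain` responds only at frames 2, 4, and 6, so neighbouring output frames see disjoint inputs, whereas `smoothed` responds at every frame from 1 to 7. With `gamma=2.0`, a frame predicted at `p_true=0.9` keeps about 1% of its cross-entropy loss, while one at `p_true=0.1` keeps about 81%.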

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61907007 and 62107009, in part by the National Key R&D Program of China under Grant 2020YFA0714102, in part by the Fund of the Jilin Provincial Science and Technology Department under Grants 20210201077GX, 20200201199JC, 20200401086GX and 20200401081GX, and in part by the Fundamental Research Funds for the Central Universities under Grants 2412020FZ029, 2412020FZ031, and 2412019FZ049.

Author information

Corresponding authors

Correspondence to Jun Kong or Caixia Zheng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

ESM 1 (DOCX 188 kb)

About this article

Cite this article

Chen, W., Chai, Y., Qi, M. et al. Bottom-up improved multistage temporal convolutional network for action segmentation. Appl Intell 52, 14053–14069 (2022). https://doi.org/10.1007/s10489-022-03382-x

  • DOI: https://doi.org/10.1007/s10489-022-03382-x
