
Adaptive receptive field U-shaped temporal convolutional network for vulgar action segmentation

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Temporally detecting and classifying action segments in untrimmed videos is important for many applications, especially for detecting vulgar actions such as sucking and caressing in video-platform supervision and surveillance. At present, vulgar action segmentation suffers from indistinct spatial features and complex temporal structure in the video, both of which limit detection accuracy. This paper therefore proposes an effective Adaptive receptive field U-shaped Temporal Convolutional Network (AU-TCN) for the automatic and accurate detection of vulgar actions in video. First, building on the strength of temporal convolutional networks at temporal feature extraction, AU-TCN uses an adaptive receptive field convolution kernel to handle the large differences in average duration between different types of actions in Internet videos, thereby realizing a temporal attention mechanism. Second, a U-shaped structure built on the temporal convolutional network is introduced to analyze both high-level and low-level temporal features, addressing the problem that the spatial features of vulgar actions are not distinctive. Finally, extensive experiments on multiple datasets, including public datasets and a self-built vulgar action dataset, verify the effectiveness of the proposed model. Our method achieves state-of-the-art results on the vulgar action dataset, which is of great significance for purifying the Internet environment.
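The abstract does not give implementation details, but the two ideas it names (an adaptive receptive field convolution acting as temporal attention, and a U-shaped temporal convolutional network fusing high- and low-level features) can be made concrete. Below is a minimal PyTorch sketch, not the authors' AU-TCN: the module names, channel sizes, dilation set (1, 2, 4), max-pooling/linear-upsampling choices, and the selective-kernel-style branch weighting are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveReceptiveFieldConv(nn.Module):
    """Fuse parallel dilated temporal convolutions with learned attention
    weights, so the effective receptive field adapts per clip to actions
    of very different average durations (selective-kernel-style sketch)."""

    def __init__(self, channels, dilations=(1, 2, 4)):  # dilations are an assumption
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # Squeeze global temporal context, then predict one weight per
        # branch; softmax normalizes the weights across branches.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, len(dilations), kernel_size=1),
        )

    def forward(self, x):  # x: (batch, channels, time)
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, T)
        weights = F.softmax(self.attn(x), dim=1)                   # (B, K, 1)
        return F.relu((feats * weights.unsqueeze(2)).sum(dim=1))   # (B, C, T)


class UShapedTCN(nn.Module):
    """U-shaped encoder-decoder over the time axis: the encoder pools frame
    features to coarser temporal scales (high-level features), the decoder
    upsamples and fuses skip connections (low-level features), and a 1x1
    convolution emits per-frame class logits for action segmentation."""

    def __init__(self, in_dim, hidden, n_classes, depth=3):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, kernel_size=1)
        self.enc = nn.ModuleList(AdaptiveReceptiveFieldConv(hidden) for _ in range(depth))
        self.dec = nn.ModuleList(AdaptiveReceptiveFieldConv(hidden) for _ in range(depth))
        self.out = nn.Conv1d(hidden, n_classes, kernel_size=1)

    def forward(self, x):  # x: (batch, in_dim, time)
        x = self.inp(x)
        skips = []
        for block in self.enc:              # contracting path: halve T each level
            x = block(x)
            skips.append(x)
            x = F.max_pool1d(x, kernel_size=2)
        for block, skip in zip(self.dec, reversed(skips)):
            # Expanding path: restore the skip's temporal length, then fuse.
            x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
            x = block(x + skip)
        return self.out(x)                  # (batch, n_classes, time) frame-wise logits


# Hypothetical usage: 2 clips of 300 frames with 2048-D per-frame features.
frames = torch.randn(2, 2048, 300)
logits = UShapedTCN(in_dim=2048, hidden=64, n_classes=5)(frames)
print(logits.shape)  # torch.Size([2, 5, 300])
```

The design choice to weight dilated branches by a softmax over globally pooled context is one plausible reading of "adaptive receptive field"; the paper's actual mechanism may differ.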





Acknowledgements

This work was supported by the Zhejiang Provincial Natural Science Foundation of China (Nos. LY21F020015, LY20F020015, LY21F030005), the National Natural Science Foundation of China (Nos. 61972121, 61971173), and the Fundamental Research Funds for the Provincial Universities of Zhejiang (No. GK209907299001-008). The authors thank the reviewers in advance for their comments and suggestions.

Author information


Corresponding author

Correspondence to Feiwei Qin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cao, J., Xu, R., Lin, X. et al. Adaptive receptive field U-shaped temporal convolutional network for vulgar action segmentation. Neural Comput & Applic 35, 9593–9606 (2023). https://doi.org/10.1007/s00521-022-08190-5

