Weakly supervised temporal action localization with proxy metric modeling

  • Research Article
Frontiers of Computer Science

Abstract

Temporal action localization is crucial for recognizing actions in videos. Because manual temporal annotation of videos is expensive and time-consuming, localization from weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address the lack of temporal annotations, we train the model using proxies for each action class. The proxies are used to measure the distances between action segments and the different original action features. We use a proxy-based metric to cluster segments of the same action together and to separate actions from background. Compared with state-of-the-art methods, our method achieves competitive results on the THUMOS14 and ActivityNet1.2 datasets.
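
The abstract does not give the exact formulation, so the following is only an illustrative sketch of the general idea of a proxy-based metric, not the authors' method. It assumes one proxy vector per action class, cosine similarity as the metric, and a fixed similarity threshold below which a segment is treated as background; the names `cosine_similarity`, `label_segments`, and `threshold`, as well as the toy two-dimensional features, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def label_segments(segments, proxies, threshold=0.5):
    """Assign each segment to its most similar class proxy.

    A segment whose best similarity does not exceed `threshold`
    (an assumed hyperparameter) is labeled 'background'.
    """
    labels = []
    for feat in segments:
        best_class, best_sim = "background", threshold
        for name, proxy in proxies.items():
            sim = cosine_similarity(feat, proxy)
            if sim > best_sim:
                best_class, best_sim = name, sim
        labels.append(best_class)
    return labels


# Toy data (assumed): one proxy per action class, three segment features.
proxies = {"jump": [1.0, 0.0], "throw": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
labels = label_segments(segments, proxies)
print(labels)
```

In this toy setting the first two segments fall near the "jump" and "throw" proxies respectively, while the third is far from both and is labeled as background. In a trained model the proxies would be learned jointly with the segment features rather than fixed by hand.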

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018AAA0100104 and 2018AAA0100100), the National Natural Science Foundation of China (Grant No. 61702095), the Natural Science Foundation of Jiangsu Province (BK20211164, BK20190341, and BK20210002), and the Big Data Computing Center of Southeast University.

Author information

Corresponding author

Correspondence to Siya Mi.

Additional information

Hongsheng Xu received the BSc degree from Southeast University, Nanjing, China in 2009, and the PhD degree in electrical engineering from Iowa State University, USA in 2015. He is currently a Machine Learning Scientist with NARI Research Institute, NARI Group Corporation, China. His current research interests include the development and application of deep reinforcement learning in smart grids and energy markets, as well as deep learning approaches for operation and maintenance in power systems.

Zihan Chen received the BS degree in computer science and technology from the University of Electronic Science and Technology of China, China. He is now a master's student at the School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and computer vision.

Yu Zhang received the BS and MS degrees in telecommunications engineering from Xidian University, China, and the PhD degree in computer engineering from Nanyang Technological University, Singapore. He has been a postdoctoral fellow at the Bioinformatics Institute, A*STAR, Singapore. He is now an Associate Professor at Southeast University, China. His research interest is computer vision.

Xin Geng is currently a professor and the dean of the School of Computer Science and Engineering at Southeast University, China. He received the BSc (2001) and MSc (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research interests include machine learning, pattern recognition, and computer vision.

Siya Mi received a double BS degree from the Beijing University of Posts and Telecommunications, China, and the University of London, UK in 2010, and the MS and PhD degrees from Nanyang Technological University, Singapore in 2011 and 2018, respectively. She is currently a lecturer at Southeast University, China. Her research interests include data processing and computer vision for cyber security.

Zhihong Yang received the BSc degree from Nanjing University, China in 1990, and the MSc degree from Southeast University, China in 1998, both in computer science. He was with the NARI Group Corporation, China, for 22 years. He has been the vice president of NARI Research Institute, NARI Group Corporation, China since 2018. He led the development of novel automation technologies that have been developed into product series used extensively in the grid dispatching industry. His research interests include power system automation, integrated energy systems, big data analysis, and AI applications in power systems. He is also a member of the National Power System Management and Information Exchange Standardization Technical Committee.

About this article

Cite this article

Xu, H., Chen, Z., Zhang, Y. et al. Weakly supervised temporal action localization with proxy metric modeling. Front. Comput. Sci. 17, 172309 (2023). https://doi.org/10.1007/s11704-022-1154-1
