Weakly supervised temporal action localization with proxy metric modeling

Xu, Hongsheng; Chen, Zihan; Zhang, Yu; Geng, Xin; Mi, Siya; Yang, Zhihong

doi:10.1007/s11704-022-1154-1

Research Article
Published: 08 August 2022

Volume 17, article number 172309 (2023)
Frontiers of Computer Science

Hongsheng Xu¹†, Zihan Chen²†, Yu Zhang², Xin Geng², Siya Mi³,⁴ & Zhihong Yang¹

Abstract

Temporal localization is crucial for action recognition in videos. Because manually annotating videos is expensive and time-consuming, temporal localization with weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address the lack of segment-level annotations, we train the model on proxies of each action class. The proxies are used to measure the distances between action segments and the original features of different actions. We use a proxy-based metric to cluster segments of the same action together and to separate actions from background. Compared with state-of-the-art methods, our method achieves competitive results on the THUMOS14 and ActivityNet1.2 datasets.
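
As a rough illustration of the proxy idea described in the abstract, the sketch below scores a video segment against one proxy vector per class and penalizes segments that sit closer to the wrong proxy. It is not the authors' implementation: the loss is a generic softmax over scaled cosine similarities, and the 2-D proxies, the explicit "background" proxy, the class names, and the `scale` parameter are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def proxy_loss(segment, label, proxies, scale=10.0):
    """Softmax cross-entropy over scaled segment-to-proxy similarities.

    The loss is small when the segment is much closer to the proxy of its
    own class than to every other proxy (assumed loss form, for illustration).
    """
    logits = {name: scale * cosine(segment, p) for name, p in proxies.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits.values()))
    return log_norm - logits[label]


def nearest_proxy(segment, proxies):
    """Return the name of the proxy most similar to the segment."""
    return max(proxies, key=lambda name: cosine(segment, proxies[name]))


# Hypothetical proxies: two action classes plus one background proxy.
proxies = {
    "high_jump": [1.0, 0.0],
    "diving": [0.0, 1.0],
    "background": [-1.0, -1.0],
}

segment = [0.9, 0.1]  # hypothetical segment embedding
print(nearest_proxy(segment, proxies))
print(proxy_loss(segment, "high_jump", proxies))
print(proxy_loss(segment, "diving", proxies))
```

In a trainable model the proxies would be learned parameters updated together with the segment embeddings; here they are fixed so that the geometry is easy to inspect.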

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018AAA0100104 and 2018AAA0100100), the National Natural Science Foundation of China (Grant No. 61702095), the Natural Science Foundation of Jiangsu Province (BK20211164, BK20190341, and BK20210002), and the Big Data Computing Center of Southeast University.

Author information

Hongsheng Xu and Zihan Chen contributed equally to this work.

Authors and Affiliations

NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing, 211106, China
Hongsheng Xu & Zhihong Yang
School of Computer Science and Engineering, and the Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing, 211189, China
Zihan Chen, Yu Zhang & Xin Geng
School of Cyber Science and Engineering, Southeast University, Nanjing, 211189, China
Siya Mi
Purple Mountain Laboratories, Nanjing, 211111, China
Siya Mi

Corresponding author

Correspondence to Siya Mi.

Additional information

Hongsheng Xu received the BSc degree from Southeast University, Nanjing, China in 2009, and the PhD degree in electrical engineering research from Iowa State University, USA in 2015. He is currently a Machine Learning Scientist with NARI Research Institute, NARI Group Corporation, China. His current research interests include the development and application of deep reinforcement learning in smart grids and energy markets, as well as deep learning approaches for operation and maintenance in power systems.

Zihan Chen received the BS degree in computer science and technology from the University of Electronic Science and Technology of China, China. He is now a master's student at the School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and computer vision.

Yu Zhang received the BS and MS degrees in telecommunications engineering from Xidian University, China, and the PhD degree in computer engineering from Nanyang Technological University, Singapore. He has been a postdoctoral fellow at the Bioinformatics Institute, A*STAR, Singapore. He is now an Associate Professor at Southeast University, China. His research interest is computer vision.

Xin Geng is currently a professor and the dean of the School of Computer Science and Engineering at Southeast University, China. He received the BSc (2001) and MSc (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research interests include machine learning, pattern recognition, and computer

Siya Mi received a double BS degree from the Beijing University of Posts and Telecoms, China, and the University of London, UK in 2010, and the MS and PhD degrees from Nanyang Technological University, Singapore in 2011 and 2018, respectively. She is currently a lecturer at Southeast University, China. Her research interests include data processing and computer vision for cyber security.

Zhihong Yang received the BSc degree from Nanjing University, China in 1990, and the MSc degree from Southeast University, China in 1998, both in computer science. He was with the NARI Group Corporation, China, for 22 years. He has been the vice president of NARI Research Institute, NARI Group Corporation, China since 2018. He led the development of novel automation technologies that have been turned into product series used extensively in the grid dispatching industry. His research interests include power system automation, integrated energy systems, big data analysis, and AI applications in power systems. He is also a member of the National Power System Management and Information Exchange Standardization Technical Committee.

Electronic supplementary material