Abstract
Temporal localization is crucial for action video recognition. Since the manual annotations are expensive and time-consuming in videos, temporal localization with weak video-level labels is challenging but indispensable. In this paper, we propose a weakly-supervised temporal action localization approach in untrimmed videos. To settle this issue, we train the model based on the proxies of each action class. The proxies are used to measure the distances between action segments and different original action features. We use a proxy-based metric to cluster the same actions together and separate actions from backgrounds. Compared with state-of-the-art methods, our method achieved competitive results on the THUMOS14 and ActivityNet1.2 datasets.
Similar content being viewed by others
References
Ronchetti F, Quiroga F, Lanzarini L, Estrebou C. Distribution of action movements (DAM): a descriptor for human action recognition. Frontiers of Computer Science, 2015, 9(6): 956–965
Chen K, Ding G, Han J. Attribute-based supervised deep learning model for action recognition. Frontiers of Computer Science, 2017, 11(2): 219–229
Wang J, Chen D, Yang J. Human behavior classification by analyzing periodic motions. Frontiers of Computer Science, 2010, 4(4): 580–588
Zhu X, Liu Z. Human behavior clustering for anomaly detection. Frontiers of Computer Science in China, 2011, 5(3): 279–289
Chebieb A, Ameur Y A. A formal model for plastic human computer interfaces. Frontiers of Computer Science, 2018, 12(2): 351–375
Chen W, Zhu S, Wan H, Feng J. Dual quaternion based virtual hand interaction modeling. Science China Information Sciences, 2013, 56(3): 1–11
Shou Z, Wang D, Chang S F. Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1049–1058
Shou Z, Chan J, Zareian A, Miyazawa K, Chang S F. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 1417–1426
Xu H, Das A, Saenko K. R-C3D: Region convolutional 3D network for temporal activity detection. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 5794–5803
Chao Y W, Vijayanarasimhan S, Seybold B, Ross D A, Deng J, Sukthankar R. Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 1130–1139
Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D. Temporal action detection with structured segment networks. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 2933–2942
Lin T, Liu X, Li X, Ding E, Wen S. BMN: boundary-matching network for temporal action proposal generation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 3888–3897
Nguyen P, Han B, Liu T, Prasad G. Weakly supervised action localization by sparse temporal pooling network. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 6752–6761
Islam A, Radke R J. Weakly supervised temporal action localization using deep metric learning. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 536–545
Paul S, Roy S, Roy-Chowdhury A K. W-TALC: weakly-supervised temporal activity localization and classification. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 588–607
Liu D, Jiang T, Wang Y. Completeness modeling and context separation for weakly supervised temporal action localization. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, 1298–1307
Shi B, Dai Q, Mu Y, Wang J. Weakly-supervised action localization by generative attention modeling. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 1006–1016
Fernando B, Chet C T Y, Bilen H. Weakly supervised Gaussian networks for action detection. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 526–535
Huang L, Huang Y, Ouyang W, Wang L. Relational prototypical network for weakly supervised temporal action localization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 11053–11060
Rashid M, Kjellström H, Lee Y J. Action graphs: weakly-supervised action localization with graph convolution networks. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 604–613
Wang L, Xiong Y, Lin D, Van Gool L. UntrimmedNets for weakly supervised action recognition and detection. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 6402–6411
Narayan S, Cholakkal H, Khan F S, Shao L. 3C-Net: category count and center loss for weakly-supervised action localization. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 8678–8686
Kim S, Kim D, Cho M, Kwak S. Proxy anchor loss for deep metric learning. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 3235–3244
Carreira J, Zisserman A. Quo Vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 4724–4733
Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1933–1941
Bendale A, Boult T E. Towards open set deep networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1563–1572
Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proceedings of Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. 2017, 6402–6413
Lee P, Wang J, Lu Y, Byun H. Weakly-supervised temporal action localization by uncertainty modeling. 2020, arXiv preprint arXiv: 2006.07006
Movshovitz-Attias Y, Toshev A, Leung T K, Ioffe S, Singh S. No fuss distance metric learning using proxies. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 360–368
Idrees H, Zamir A R, Jiang Y G, Gorban A, Laptev I, Sukthankar R, Shah M. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 2017, 155: 1–23
Heilbron F C, Escorcia V, Ghanem B, Niebles J C. ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, 961–970
Shou Z, Gao H, Zhang L, Miyazawa K, Chang S F. AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 162–179
Lee P, Uh Y, Byun H. Background suppression network for weakly-supervised temporal action localization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 11320–11327
McInnes L, Healy J, Melville J. UMAP: uniform Manifold Approximation and Projection for Dimension Reduction, 2018, arXiv preprint arXiv:1802.03426v2
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2018AAA0100104 and 2018AAA0100100), the National Natural Science Foundation of China (Grant No. 61702095), Natural Science Foundation of Jiangsu Province (BK20211164, BK20190341, and BK20210002), and the Big Data Computing Center of Southeast University.
Author information
Authors and Affiliations
Corresponding author
Additional information
Hongsheng Xu received the BSc degree from the Southeast University, Nanjing, China in 2009, and the PhD degree in electrical engineering research from the Iowa State University, USA in 2015. He is currently a Machine Learning Scientist with NARI Research Institute, NARI Group Corporation, China. His current research interests include development and application of deep reinforcement learning in smart grids and energy markets as well as deep learning approaches for the application of operation and maintenance in power systems.
Zihan Chen received the BS degree in computer science and technology from University of Electronic Science and Technology of China, China. Now he is a master student at School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and computer vision.
Yu Zhang received the BS and MS degrees in telecommunications engineering from Xidian University, China, and his PhD degree in computer engineering from Nanyang Technological University, Singapore. He has been a postdoctoral fellow in the Bioinformatics Institute, A*STAR, Singapore. He is now an Associate Professor in Southeast University, China. His research interest is computer vision.
Xin Geng is currently a professor and the dean of School of Computer Science and Engineering at Southeast University, China. He received the BSc (2001) and MSc (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research machine learning, pattern recognition, and computer
Siya Mi received the double BS degree from the Beijing University of Posts and Telecoms, China, and the University of London, UK in 2010, and the MS and PhD degrees from Nanyang Technological University, Singapore in 2011 and 2018, respectively. She is currently a lecturer in the Southeast University, China. Her research interests include the data processing and computer vision for cyber security.
Zhihong Yang received the BSc degree from the Nanjing University, China in 1990, and the MSc degree from the Southeast University, China in 1998, all in Computer Science. He was with the NARI Group Corporation, China, for 22 years. He has been the vice president of NARI Research Institute, NARI Group Corporation, China since 2018. He led the development of novel automation technologies that have been developed as series products extensively used in grid dispatching industry. His research interests include power system automation, integrated energy system, big data analysis and AI application in power system. He is also a member of National Power System Management and Information Exchange Standardization Technical Committee.
Electronic supplementary material
Rights and permissions
About this article
Cite this article
Xu, H., Chen, Z., Zhang, Y. et al. Weakly supervised temporal action localization with proxy metric modeling. Front. Comput. Sci. 17, 172309 (2023). https://doi.org/10.1007/s11704-022-1154-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-022-1154-1