ABSTRACT
Temporal action detection aims to locate the segments of action instances in an untrimmed video. Most existing approaches extract the features of all candidate video segments and then classify each candidate separately, neglecting the underlying relationships among candidates. In this paper, we propose a novel model termed Candidate-Aware Aggregation (CAA) to tackle this problem. In CAA, we design a Global Awareness (GA) module that exploits long-range relations among all candidates from a global perspective, enhancing the features of action instances. The GA module is then embedded into a multi-level hierarchical network named FENet, which aggregates local features of adjacent candidates to suppress background noise. As a result, the relationships among candidates are explicitly captured from both local and global perspectives, yielding more accurate predictions for the candidates. Extensive experiments on two popular benchmarks, ActivityNet-1.3 and THUMOS-14, demonstrate the superiority of CAA compared with state-of-the-art methods.
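The core idea of the GA module — letting every candidate segment attend to all other candidates so that its features encode long-range relations — can be illustrated with a minimal self-attention sketch. This is an illustrative NumPy approximation under assumed shapes (N candidates, D-dimensional features) with randomly initialized projection matrices standing in for learned weights; it is not the paper's exact GA module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_awareness(candidates, rng=None):
    """Enhance each candidate's features by attending over all candidates.

    candidates: (N, D) array, one feature vector per candidate segment.
    Returns an (N, D) array of globally aggregated features.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = candidates.shape
    # Hypothetical query/key/value projections; learned in a real model,
    # random here purely for illustration.
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = candidates @ w_q, candidates @ w_k, candidates @ w_v
    # (N, N) matrix of pairwise candidate-to-candidate relations.
    attn = softmax(q @ k.T / np.sqrt(d))
    # Residual connection preserves each candidate's original evidence.
    return candidates + attn @ v

feats = np.random.default_rng(1).standard_normal((8, 16))  # 8 candidates, 16-dim features
out = global_awareness(feats)
assert out.shape == feats.shape
```

Each row of `attn` sums to one, so every output vector is the original candidate feature plus a convex combination of all candidates' value projections — the "global perspective" the abstract describes.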
CAA: Candidate-Aware Aggregation for Temporal Action Detection