DOI: 10.1145/3474085.3475616
research-article

CAA: Candidate-Aware Aggregation for Temporal Action Detection

Published: 17 October 2021

ABSTRACT

Temporal action detection aims to locate the temporal segments of action instances in an untrimmed video. Most existing approaches extract features for all candidate video segments and then classify each candidate in isolation, overlooking the underlying relationships among candidates. In this paper, we propose a novel model, termed Candidate-Aware Aggregation (CAA), to tackle this problem. In CAA, we design a Global Awareness (GA) module that exploits long-range relations among all candidates from a global perspective, enhancing the features of action instances. The GA module is then embedded into a multi-level hierarchical network, named FENet, which aggregates local features of adjacent candidates to suppress background noise. As a result, the relationships among candidates are explicitly captured from both local and global perspectives, yielding more accurate predictions for the candidates. Extensive experiments on two popular benchmarks, ActivityNet-1.3 and THUMOS-14, demonstrate the superiority of CAA compared with state-of-the-art methods.
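
The abstract does not spell out how the GA module models long-range relations among candidates, but candidate-level global aggregation of this kind is commonly realized as scaled dot-product self-attention over the candidate features. The sketch below is an illustrative assumption, not the paper's actual design: the function name `global_awareness`, the single-head formulation, and the residual connection are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_awareness(candidates):
    """Hypothetical sketch of candidate-level global aggregation.

    candidates: (N, C) array, one C-dim feature per candidate segment.
    Each candidate attends to every other candidate, so its enhanced
    feature carries global context across the whole candidate set.
    """
    n, c = candidates.shape
    scores = candidates @ candidates.T / np.sqrt(c)  # (N, N) pairwise affinities
    weights = softmax(scores, axis=-1)               # normalized per candidate
    context = weights @ candidates                   # globally aggregated features
    return candidates + context                      # residual enhancement

feats = np.random.default_rng(0).normal(size=(8, 16))
out = global_awareness(feats)
print(out.shape)  # (8, 16): same shape as the input candidate features
```

The residual form keeps the original candidate feature intact while adding global context, which is a common design when an attention module is inserted into an existing feature pyramid.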


Published in: MM '21: Proceedings of the 29th ACM International Conference on Multimedia, October 2021, 5796 pages. ISBN: 9781450386517. DOI: 10.1145/3474085.

Copyright © 2021 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 995 of 4,171 submissions, 24%
