Research Article
DOI: 10.1145/3394171.3413646

Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation

Published: 12 October 2020

Abstract

Learning spatiotemporal features is effective but challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" the action happens, for learning discriminative spatiotemporal representations in videos. The contribution of MGMA is three-fold. First, by devising a new spatiotemporal separable attention mechanism, it learns temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Second, through a novel multi-group structure, it better captures multi-attention rendered spatiotemporal features. Finally, the MGMA module is lightweight, flexible, and effective, so it can be easily embedded into any 3D Convolutional Neural Network (3D-CNN) architecture. We embed multiple MGMA modules into a 3D-CNN to train an end-to-end, RGB-only model and evaluate it on four popular benchmarks: UCF101, HMDB51, and Something-Something V1 and V2. Ablation studies and experimental comparisons demonstrate the strength of MGMA, which achieves superior performance compared to state-of-the-art methods. Our code is available at https://github.com/zhenglab/mgma.
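To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of a multi-group, spatiotemporally separable attention block. It is not the authors' implementation (see the linked repository for that): the module name MGMABlock, the group count, and the 3x1x1 / 1x3x3 attention kernels are illustrative assumptions. The sketch only shows the pattern described above: channels are split into groups, each group learns a temporal attention map and a spatial attention map separately, their product gates that group's features, and a residual connection keeps the block easy to drop into an existing 3D-CNN.

```python
# Illustrative sketch only; names, kernel sizes, and group count are assumptions,
# not the paper's exact design. See https://github.com/zhenglab/mgma for the
# authors' code.
import torch
import torch.nn as nn


class MGMABlock(nn.Module):
    """Multi-group, spatiotemporally separable attention (hypothetical sketch)."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups = groups
        c = channels // groups
        # One pair of lightweight attention branches per channel group.
        self.temporal_att = nn.ModuleList([
            nn.Sequential(  # 3x1x1 conv: attends over time only
                nn.Conv3d(c, 1, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
                nn.Sigmoid())
            for _ in range(groups)])
        self.spatial_att = nn.ModuleList([
            nn.Sequential(  # 1x3x3 conv: attends over space only
                nn.Conv3d(c, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
                nn.Sigmoid())
            for _ in range(groups)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W); each group is gated by its own separable
        # temporal-times-spatial attention map.
        outs = []
        for g, chunk in enumerate(x.chunk(self.groups, dim=1)):
            att = self.temporal_att[g](chunk) * self.spatial_att[g](chunk)
            outs.append(chunk * att)
        # Residual connection keeps the block easy to embed into a 3D-CNN.
        return torch.cat(outs, dim=1) + x


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 56, 56)        # (batch, channels, frames, H, W)
    out = MGMABlock(channels=64, groups=4)(clip)
    print(out.shape)                            # torch.Size([2, 64, 8, 56, 56])
```

Because the block preserves the input shape and adds only a few single-output-channel 3D convolutions per group, it stays lightweight and can in principle be inserted after any stage of a 3D-CNN backbone, which matches the flexibility claimed in the abstract.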

Supplementary Material

MP4 File (3394171.3413646.mp4)
Learning spatiotemporal features is effective but challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" the action happens, for learning discriminative spatiotemporal representations in videos. By devising a new spatiotemporal separable attention mechanism, MGMA learns temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Meanwhile, through a novel multi-group structure, MGMA better captures multi-attention rendered spatiotemporal features. Ablation studies and experimental comparisons demonstrate the strength of MGMA, which achieves superior performance compared to state-of-the-art methods.


Cited By

  • (2024) Spatiotemporal self-supervised predictive learning for atmospheric variable prediction via multi-group multi-attention. Knowledge-Based Systems, Vol. 300, 112090. DOI: 10.1016/j.knosys.2024.112090. Online publication date: Sep 2024.



    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Author Tags

    1. action recognition
    2. multi-attention
    3. spatiotemporal representation

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

