Research Article
DOI: 10.1145/3394171.3413646

Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation

Published: 12 October 2020

Abstract

Learning spatiotemporal features is effective but challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" the action happens, for learning discriminative spatiotemporal representations in videos. The contribution of MGMA is three-fold. First, by devising a new spatiotemporal separable attention mechanism, it learns temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Second, through a novel multi-group structure, it better captures multi-attention rendered spatiotemporal features. Finally, the MGMA module is lightweight, flexible, and effective, so it can be easily embedded into any 3D Convolutional Neural Network (3D-CNN) architecture. We embed multiple MGMA modules into a 3D-CNN to train an end-to-end, RGB-only model and evaluate it on four popular benchmarks: UCF101, HMDB51, and Something-Something V1 and V2. Ablation studies and experimental comparisons demonstrate the strength of MGMA, which achieves superior performance compared to state-of-the-art methods. Our code is available at https://github.com/zhenglab/mgma.
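To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of a multi-group, spatiotemporally separable attention block. It is not the authors' implementation (see the linked repository for that): the module name MGMABlock, the group count, and the 3x1x1 / 1x3x3 attention kernels are illustrative assumptions. The sketch only shows the pattern described above: channels are split into groups, each group learns a temporal attention map and a spatial attention map separately, their product gates that group's features, and a residual connection keeps the block easy to drop into an existing 3D-CNN.

```python
# Illustrative sketch only; names, kernel sizes, and group count are assumptions,
# not the paper's exact design. See https://github.com/zhenglab/mgma for the
# authors' code.
import torch
import torch.nn as nn


class MGMABlock(nn.Module):
    """Multi-group, spatiotemporally separable attention (hypothetical sketch)."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups = groups
        c = channels // groups
        # One pair of lightweight attention branches per channel group.
        self.temporal_att = nn.ModuleList([
            nn.Sequential(  # 3x1x1 conv: attends over time only
                nn.Conv3d(c, 1, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
                nn.Sigmoid())
            for _ in range(groups)])
        self.spatial_att = nn.ModuleList([
            nn.Sequential(  # 1x3x3 conv: attends over space only
                nn.Conv3d(c, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
                nn.Sigmoid())
            for _ in range(groups)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W); each group is gated by its own separable
        # temporal-times-spatial attention map.
        outs = []
        for g, chunk in enumerate(x.chunk(self.groups, dim=1)):
            att = self.temporal_att[g](chunk) * self.spatial_att[g](chunk)
            outs.append(chunk * att)
        # Residual connection keeps the block easy to embed into a 3D-CNN.
        return torch.cat(outs, dim=1) + x


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 56, 56)        # (batch, channels, frames, H, W)
    out = MGMABlock(channels=64, groups=4)(clip)
    print(out.shape)                            # torch.Size([2, 64, 8, 56, 56])
```

Because the block preserves the input shape and adds only a few single-output-channel 3D convolutions per group, it stays lightweight and can in principle be inserted after any stage of a 3D-CNN backbone, which matches the flexibility claimed in the abstract.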

Supplementary Material

MP4 File (3394171.3413646.mp4)
Learning spatiotemporal features is effective but challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" the action happens, for learning discriminative spatiotemporal representations in videos. By devising a new spatiotemporal separable attention mechanism, MGMA learns temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Meanwhile, through a novel multi-group structure, MGMA better captures multi-attention rendered spatiotemporal features. Ablation studies and experimental comparisons demonstrate the strength of MGMA, which achieves superior performance compared to state-of-the-art methods.


Cited By

  • (2024) Spatiotemporal self-supervised predictive learning for atmospheric variable prediction via multi-group multi-attention. Knowledge-Based Systems, Vol. 300, 112090. DOI: 10.1016/j.knosys.2024.112090. Online publication date: Sep 2024.



    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2020


    Author Tags

    1. action recognition
    2. multi-attention
    3. spatiotemporal representation

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

