
Multimodal Fusion with Cross-Modal Attention for Action Recognition in Still Images

Published: 13 December 2022

Abstract

We propose a cross-modal attention module that combines information from different cues and different modalities for action recognition in still images. Feature maps are extracted from the entire image, the detected human bounding box, and the detected human skeleton, respectively. Inspired by the transformer architecture, we compute attention between the query vector from one cue/modality and the key vector from another. Feature maps from different cues/modalities are thus cross-referenced, yielding richer representations and better performance. We show that the proposed framework outperforms state-of-the-art systems without requiring an extra training dataset, and we conduct ablation studies to investigate how different settings affect the final results.
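The paper itself specifies the exact fusion design; as a rough, hypothetical sketch of the scaled dot-product cross-attention the abstract describes (queries drawn from one cue/modality, keys and values from another), assuming generic learned projection matrices `w_q`, `w_k`, `w_v` and illustrative token counts:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(feat_a, feat_b, w_q, w_k, w_v):
    """Attend from modality A (queries) to modality B (keys/values).

    feat_a: (n_a, d_in) feature map tokens from cue/modality A
    feat_b: (n_b, d_in) feature map tokens from cue/modality B
    Returns (n_a, d) A-features enriched with context from B.
    """
    q = feat_a @ w_q                            # (n_a, d)
    k = feat_b @ w_k                            # (n_b, d)
    v = feat_b @ w_v                            # (n_b, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_a, n_b) similarity
    attn = softmax(scores, axis=-1)             # rows sum to 1
    return attn @ v                             # weighted sum of B values

# Toy example: 4 whole-image tokens attend to 3 skeleton tokens
rng = np.random.default_rng(0)
img_feats = rng.standard_normal((4, 8))         # cue/modality A
skel_feats = rng.standard_normal((3, 8))        # cue/modality B
w_q, w_k, w_v = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out = cross_modal_attention(img_feats, skel_feats, w_q, w_k, w_v)
print(out.shape)  # prints (4, 8)
```

This is only a sketch under the assumptions above, not the authors' implementation; the paper's module presumably applies such attention pairwise across the image, bounding-box, and skeleton streams before fusion.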


Cited By

  • (2023) Integrating Gaze and Mouse Via Joint Cross-Attention Fusion Net for Students' Activity Recognition in E-learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, 3 (2023), 1--35. https://doi.org/10.1145/3610876. Online publication date: 27 September 2023.


Published In

MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia
December 2022
296 pages
ISBN:9781450394789
DOI:10.1145/3551626


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. action recognition
  2. cross-modal attention
  3. feature fusion

Qualifiers

  • Short-paper


Conference

MMAsia '22: ACM Multimedia Asia
December 13--16, 2022
Tokyo, Japan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

