Abstract
Human-object interaction (HOI) detection is an important computer vision task that recognizes the interactions between humans and surrounding objects in an image or video. HOI datasets suffer from a severe long-tailed distribution, because it is impractical to collect a dataset that covers every possible interaction. Many HOI detectors address this issue by exploiting vision-language models. However, because Transformer attention aggregates information globally, vision-language models are weak at extracting local features from input samples. We therefore propose a novel local-feature-enhanced Transformer that encourages the encoders to extract more informative multi-modal features. Moreover, the application of prompt learning to HOI detection is still at a preliminary stage. Consequently, we propose a multi-modal adaptive prompt module, which uses an adaptive learning strategy to facilitate interaction between language and visual prompts. On the HICO-DET and SWIG-HOI datasets, the proposed model achieves 24.21% mAP and 14.29% mAP on the full set of interactions, respectively. Our code is available at https://github.com/small-code-cat/AMP-HOI.
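The abstract does not detail how the multi-modal adaptive prompt module couples the two branches. The sketch below illustrates one plausible reading of the idea: learnable language prompts are projected into the visual branch and mixed with independent visual prompts through an input-conditioned gate. All class, parameter, and dimension names are illustrative assumptions, not the authors' implementation; the linked repository contains the actual code.

```python
import torch
import torch.nn as nn

class AdaptiveMultiModalPrompt(nn.Module):
    """Hypothetical sketch of a multi-modal adaptive prompt module.

    Learnable language prompts are projected into the visual embedding
    space, and an adaptive (image-conditioned) gate decides how strongly
    the projected prompts influence the visual prompts. This is an
    illustration of the general idea only, not the paper's implementation.
    """

    def __init__(self, n_prompts: int = 8, text_dim: int = 512, vis_dim: int = 768):
        super().__init__()
        # Learnable prompt tokens for the text encoder.
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        # Projection coupling language prompts to the visual branch.
        self.text_to_vis = nn.Linear(text_dim, vis_dim)
        # Adaptive gate conditioned on a global image feature.
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Sigmoid())
        # Independent learnable visual prompts.
        self.vis_prompts = nn.Parameter(torch.randn(n_prompts, vis_dim) * 0.02)

    def forward(self, image_feat: torch.Tensor):
        """image_feat: (batch, vis_dim) global image feature, e.g. a CLIP [CLS] token."""
        projected = self.text_to_vis(self.text_prompts)      # (n_prompts, vis_dim)
        alpha = self.gate(image_feat).unsqueeze(1)           # (batch, 1, vis_dim)
        # Adaptive mixture of projected language prompts and visual prompts.
        visual = alpha * projected + (1 - alpha) * self.vis_prompts
        return self.text_prompts, visual


# Usage: prompts for a batch of 2 images with 768-dim global features.
module = AdaptiveMultiModalPrompt()
text_p, vis_p = module(torch.randn(2, 768))
print(text_p.shape, vis_p.shape)  # torch.Size([8, 512]) torch.Size([2, 8, 768])
```

In a full model, the returned prompt tokens would typically be prepended to the token sequences of the text and image encoders before the Transformer layers, in the spirit of multi-modal prompt tuning.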





Acknowledgements
This work was supported in part by the Shanghai Local Capacity Enhancement project (No. 21010501500).
Author information
Contributions
Kejun Xue performed the methodology and conceptualization; Yongbin Gao and Zhijun Fang performed the review and supervision; Xiaoyan Jiang performed the data curation; Wenjun Yu performed the validation; Mingxuan Chen performed the formal analysis; Chenmou Wu performed the investigation.
Ethics declarations
Ethical and Informed Consent for Data Used
Informed consent for the publication of this article has been obtained from Shanghai University of Engineering Science and from all authors.
Competing Interests
The corresponding author of this paper holds the role of associate editor at Applied Intelligence.
About this article
Cite this article
Xue, K., Gao, Y., Fang, Z. et al. Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer. Appl Intell 54, 12492–12504 (2024). https://doi.org/10.1007/s10489-024-05774-7