Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Published in Applied Intelligence

Abstract

Human-object interaction (HOI) detection is an important computer vision task that recognizes interactions between humans and surrounding objects in an image or video. HOI datasets suffer from a severe long-tailed distribution, because it is impractical to collect a dataset that covers all potential interactions. Many HOI detectors address this issue by exploiting vision-language models. However, due to the computation mechanism of the Transformer, vision-language models are not good at extracting the local features of input samples. We therefore propose a novel local feature enhanced Transformer that motivates the encoders to extract more informative multi-modal features. Moreover, the application of prompt learning to HOI detection is still at a preliminary stage, so we propose a multi-modal adaptive prompt module that uses an adaptive learning strategy to facilitate the interaction between language and visual prompts. On the HICO-DET and SWIG-HOI datasets, the proposed model achieves 24.21% mAP and 14.29% mAP on the full interaction sets, respectively. Our code is available at https://github.com/small-code-cat/AMP-HOI.
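
The full module design is described in the paper itself; as a rough, non-authoritative illustration of the multi-modal adaptive prompt idea summarized above, the PyTorch sketch below projects learnable language-side prompt tokens into the visual embedding space and blends them with independent visual-side prompts through a learnable gate. The class name, dimensions, and gating scheme are assumptions made for illustration only, not the authors' implementation.

import torch
import torch.nn as nn

class AdaptiveMultimodalPromptSketch(nn.Module):
    # Hypothetical sketch: learnable language prompts are coupled to the
    # visual branch through a projection, and a per-token gate decides how
    # much projected language information is mixed into the visual prompts.
    def __init__(self, n_prompts=8, text_dim=512, visual_dim=768):
        super().__init__()
        self.text_prompts = nn.Parameter(0.02 * torch.randn(n_prompts, text_dim))
        self.visual_prompts = nn.Parameter(0.02 * torch.randn(n_prompts, visual_dim))
        self.coupler = nn.Linear(text_dim, visual_dim)   # language -> visual space
        self.gate = nn.Sequential(nn.Linear(visual_dim, 1), nn.Sigmoid())

    def forward(self):
        projected = self.coupler(self.text_prompts)      # (n_prompts, visual_dim)
        alpha = self.gate(projected)                     # adaptive mixing weight per token
        visual = alpha * projected + (1.0 - alpha) * self.visual_prompts
        return self.text_prompts, visual

# Usage: prepend the returned tokens to the text and image token sequences
# before they enter the CLIP-style encoders.
prompts = AdaptiveMultimodalPromptSketch()
text_p, visual_p = prompts()
print(text_p.shape, visual_p.shape)   # torch.Size([8, 512]) torch.Size([8, 768])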

Data Availability and Access

The data that support the findings of this study are openly available in the HICO-DET and SWIG-HOI datasets.

Acknowledgements

This work was supported in part by Shanghai Local Capacity Enhancement project (No. 21010501500).

Author information

Contributions

Kejun Xue performed the methodology and conceptualization; Yongbin Gao and Zhijun Fang performed the review and supervision; Xiaoyan Jiang performed the data curation; Wenjun Yu performed the validation; Mingxuan Chen performed the formal analysis; Chenmou Wu performed the investigation.

Corresponding authors

Correspondence to Yongbin Gao or Zhijun Fang.

Ethics declarations

Ethical and Informed Consent for Data Used

Informed consent has been obtained from Shanghai University of Engineering Science for the publication of this article, as well as from all authors.

Competing Interests

The corresponding author of this paper holds the role of associate editor at Applied Intelligence.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xue, K., Gao, Y., Fang, Z. et al. Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer. Appl Intell 54, 12492–12504 (2024). https://doi.org/10.1007/s10489-024-05774-7
