Abstract
Human-Object Interaction (HOI) detection is important in human-centric scene understanding. Existing works typically assume that the same verb in different HOI categories has similar visual characteristics, while ignoring the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net), which decodes the visual polysemy of verbs for HOI detection in three ways. First, PD-Net augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce the intra-class variation of the same verb. Second, we introduce a novel Polysemy Attention Module (PAM) that guides PD-Net to make decisions based on more important feature types according to the language priors. Finally, the above two strategies are applied to two types of classifiers for verb recognition, i.e., object-shared and object-specific verb classifiers, whose combination further relieves the verb polysemy problem. By deciphering the visual polysemy of verbs, we achieve the best performance on both HICO-DET and V-COCO datasets. In particular, PD-Net outperforms state-of-the-art approaches by 3.81% mAP in the Known-Object evaluation mode of HICO-DET. Code of PD-Net is available at https://github.com/MuchHair/PD-Net.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Ashual, O., Wolf, L.: Specifying object attributes and relations in interactive scene generation. In: ICCV (2019)
Bansal, A., Rambhatla, S.S., Shrivastava, A., Chellappa, R.: Detecting human-object interactions via functional generalization. arXiv preprint arXiv:1904.03181 (2019)
Cadene, R., Ben-Younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering. In: CVPR (2019)
Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: WACV (2018)
Chen, T., Yu, W., Chen, R., Lin, L.: Knowledge-embedded routing network for scene graph generation. In: CVPR (2019)
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: CVPR (2018)
Fang, H.-S., Cao, J., Tai, Y.-W., Lu, C.: Pairwise body-part attention for recognizing human-object interactions. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 52–68. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_4
Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: ICCV (2017)
Gao, C., Zou, Y., Huang, J.B.: iCAN: instance-centric attention network for human-object interaction detection. In: BMVC (2018)
Gao, P., et al.: Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In: CVPR (2019)
Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. In: CVPR (2018)
Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., Ling, M.: Scene graph generation with external knowledge and image reconstruction. In: CVPR (2019)
Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)
Gupta, T., Schwing, A., Hoiem, D.: No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In: ICCV (2019)
He, S., Tavakoli, H.R., Borji, A., Pugeault, N.: Human attention in image captioning: dataset and analysis. In: ICCV (2019)
Li, Y.L., et al.: Transferable interactiveness knowledge for human-object interaction detection. In: CVPR (2019)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lin, X., Ding, C., Zeng, J., Tao, D.: GPS-Net: graph property sensing network for scene graph generation. In: CVPR (2020)
Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_51
Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: a visual question answering benchmark requiring external knowledge. In: CVPR (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Detecting unseen visual relations using analogies. In: ICCV (2019)
Qi, S., Wang, W., Jia, B., Shen, J., Zhu, S.-C.: Learning human-object interactions by graph parsing neural networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 407–423. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_25
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Shen, L., Yeung, S., Hoffman, J., Mori, G., Li, F.F.: Scaling human-object interaction recognition through zero-shot learning. In: WACV (2018)
Shrestha, R., Kafle, K., Kanan, C.: Answer them all! toward universal visual question answering models. In: CVPR (2019)
Wan, B., Zhou, D., Liu, Y., Li, R., He, X.: Pose-aware multi-level feature network for human object interaction detection. In: ICCV (2019)
Wan, H., Luo, Y., Peng, B., Zheng, W.S.: Representation learning for scene graph completion via jointly structural and visual embedding. In: IJCAI (2018)
Wang, T., et al.: Deep contextual attention for human-object interaction detection. In: ICCV (2019)
Wang, W., Wang, R., Shan, S., Chen, X.: Exploring context and visual pattern of relationship for scene graph generation. In: CVPR (2019)
Xu, B., Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.S.: Interact as you intend: intention-driven human-object interaction detection. IEEE Trans. Multimed., 1 (2019). https://doi.org/10.1109/TMM.2019.2943753
Xu, B., Wong, Y., Li, J., Zhao, Q., Kankanhalli, M.S.: Learning to detect human-object interactions with knowledge. In: CVPR (2019)
Yang, X., Tang, K., Zhang, H., Cai, J.: Auto-encoding scene graphs for image captioning. In: CVPR (2019)
Yao, T., Pan, Y., Li, Y., Mei, T.: Hierarchy parsing for image captioning. arXiv preprint arXiv:1909.03918 (2019)
Zhou, P., Chi, M.: Relation parsing neural network for human-object interaction detection. In: ICCV (2019)
Acknowledgement
Changxing Ding is the corresponding author. This work was supported by the NSF of China under Grant 61702193, the Science and Technology Program of Guangzhou under Grant 201804010272, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant 2017ZT07X183, the Fundamental Research Funds for the Central Universities of China under Grant 2019JQ01, and ARC FL-170100117.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhong, X., Ding, C., Qu, X., Tao, D. (2020). Polysemy Deciphering Network for Human-Object Interaction Detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12365. Springer, Cham. https://doi.org/10.1007/978-3-030-58565-5_5
Download citation
DOI: https://doi.org/10.1007/978-3-030-58565-5_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58564-8
Online ISBN: 978-3-030-58565-5
eBook Packages: Computer ScienceComputer Science (R0)