Abstract
In this work, we study the task of detecting human-object interactions (HOI) from images, defined as detecting triplets of (human, predicate, object). A common practice in the literature is to first localize human and object instances and then infer the triplets, or the predicates alone, as a classification task over the detected human-object pairs. A data sparsity issue arises when inferring triplets because of the severe data imbalance among HOI classes, while a data variance issue arises when inferring predicates alone, since a predicate can carry different semantic meanings when applied to different objects. To resolve these problems, we propose to decompose HOI classes that share the same predicate into several semantic groups based on the appearance, semantic information, and function of the objects. In this way, semantically related HOI classes are grouped together to mitigate the data sparsity issue, while visually and functionally less related HOI classes are separated to relieve the data variance issue. We further show that multiple levels of decomposition at different granularities provide richer auxiliary information that boosts performance. We implement this idea with a multi-branch graph network, in which the branches make classifications based on different levels of decomposition. We evaluate our method on the popular HICO-Det dataset. Experimental results show that our method achieves state-of-the-art performance.
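The core idea above, decomposing HOI classes that share a predicate into semantic groups of objects, can be illustrated with a small sketch (not the authors' code). Here, toy object embeddings are clustered into groups, and each fine-grained HOI class (predicate, object) is mapped to a coarser class (predicate, group); the vectors, the group count, and the `coarse_class` helper are all hypothetical stand-ins for the paper's embedding-based grouping.

```python
# Illustrative sketch: group objects by clustering toy embedding vectors, then
# map each (predicate, object) HOI class onto a coarser (predicate, group) class.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 4-d "word vectors" for object categories; a real system would use
# pretrained embeddings (e.g. word2vec) capturing appearance/semantics/function.
object_vecs = {
    "horse":      np.array([0.9, 0.1, 0.0, 0.1]),
    "elephant":   np.array([0.8, 0.2, 0.1, 0.1]),
    "bicycle":    np.array([0.1, 0.9, 0.1, 0.0]),
    "motorcycle": np.array([0.2, 0.8, 0.2, 0.1]),
}

names = list(object_vecs)
X = np.stack([object_vecs[n] for n in names])

# Cluster objects into semantic groups (here k=2: roughly animals vs. vehicles).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
groups = {n: int(g) for n, g in zip(names, labels)}

def coarse_class(predicate, obj):
    """Map a fine HOI class (predicate, object) to a coarser (predicate, group)."""
    return (predicate, groups[obj])

# "ride horse" and "ride elephant" now share one coarse class, pooling their
# scarce training data, while "ride bicycle" stays in a separate group, so the
# two distinct senses of "ride" are not conflated.
assert coarse_class("ride", "horse") == coarse_class("ride", "elephant")
assert coarse_class("ride", "horse") != coarse_class("ride", "bicycle")
```

Running the grouping at several values of k would yield the multiple decomposition granularities the paper feeds to different branches of its graph network.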
Acknowledgement
This work is supported by the National Key Research and Development Project under Grant No. 2018AAA0100802.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Wu, T., Zhang, X., Duan, F., Chang, L. (2021). Multi-branch Graph Network for Learning Human-Object Interaction. In: Ma, H., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol. 13022. Springer, Cham. https://doi.org/10.1007/978-3-030-88013-2_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88012-5
Online ISBN: 978-3-030-88013-2