Full length article
Language-guided graph parsing attention network for human-object interaction recognition

https://doi.org/10.1016/j.jvcir.2022.103640

Highlights

  • This paper proposes a language-guided graph parsing attention network (LG-GPAN).

  • We leverage the relationship between visual and language feature distributions for HOI recognition.

  • LG-GPAN outperforms previous human-object interaction recognition methods on both the CAD-120 and V-COCO datasets.

Abstract

This paper focuses on the task of human-object interaction (HOI) recognition, which aims to classify the interactions between humans and objects. It is a challenging task, partially due to the extremely imbalanced data among classes. To address this problem, we propose a language-guided graph parsing attention network (LG-GPAN) that makes use of the word distribution in language to guide the classification in vision. We first associate each HOI class name with a word embedding vector, so that all the vectors together construct a language space specified for HOI recognition. Simultaneously, the visual feature is extracted from the inputs via the proposed graph parsing attention network (GPAN) for better visual representation. The visual feature is then transformed into a linguistic one in the language space. Finally, the output score is obtained by measuring the distance between the linguistic feature and the word embeddings of the classes in the language space. Experimental results on the popular CAD-120 and V-COCO datasets validate our design choices and demonstrate superior performance in comparison to the state of the art.

Introduction

As an important part of human-centric scene understanding, HOI understanding reflects how humans interact with objects in a scene (e.g., opening a microwave). The goal of HOI recognition is to identify the objects interacting with a human and to predict the classes of sub-activities (the sequence of activities being performed by the human, e.g., drinking) and object affordances (a situation where an object’s sensory characteristics intuitively imply its functionality and use [1], e.g., drinkable). This requires detecting not only the human and objects but also their relationships. In recent years, HOI recognition has become an emerging and attractive task in the computer vision community, given its wide range of potential applications, e.g., video indexing and surveillance [2], video understanding [3], [4], [5], and even brain science [6].

There have been many studies of HOI detection in still images [7], [8], [9], [10], [11], [12], [13]. Limited to the static modality, a still image cannot directly provide temporal information for HOI learning. Videos supply this temporal information, which brings both useful cues and new challenges [14], [15]. First, the imbalanced data distribution is more severe because of the difficulty of sample collection and annotation. Second, learning HOI in videos involves higher-dimensional data due to the diverse motion of humans and objects.

Thanks to deep learning, researchers have made great progress in video-based HOI research in recent years [16], [17], [18], [19]. For example, Jain et al. [16] proposed Structural-RNN, which represents videos as a graph built from a mixture of recurrent neural networks (RNNs). Qi et al. [17] analyzed the HOI graph and proposed a graph parsing neural network (GPNN) that works with long short-term memory (LSTM) and gated recurrent units (GRU) [18]. LIGHTEN [19] is a hierarchical approach that uses deep visual features of humans and objects to capture spatio-temporal cues at different granularities in videos. However, little work has addressed data imbalance, i.e., the fact that HOI data naturally has a long-tailed distribution in both sub-activity and affordance classes. A few HOI classes (head classes) occupy most of the data, while other HOI classes (tail classes) have only a few samples. Fig. 1 shows the long-tail distribution among classes in the CAD-120 dataset [15]: head classes (e.g., moving, stationary) are shown in red and tail classes (e.g., cleaning, cleaner) in green. Besides, some different sub-activity or affordance classes have similar visual features in HOI videos. For example, the location of certain objects (e.g., a microwave) is not easily changed by interaction, which causes some affordance classes to look like each other (e.g., openable, stationary).
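
As a concrete illustration of the long-tail statistics discussed above, the following sketch (not taken from the paper) counts per-class samples from a hypothetical label list and reports a head-to-tail imbalance ratio; the actual CAD-120 statistics are those shown in Fig. 1.

```python
# Illustrative sketch (not from the paper): inspecting the long-tailed
# distribution of sub-activity labels. `labels` stands in for the per-segment
# annotations of a dataset such as CAD-120; the toy values below are made up.
from collections import Counter

labels = ["moving", "moving", "moving", "reaching", "placing", "cleaning"]

counts = Counter(labels)
ranked = counts.most_common()                 # head classes first, tail classes last
(head_cls, head_n), (tail_cls, tail_n) = ranked[0], ranked[-1]

print(ranked)
print(f"imbalance ratio {head_cls}/{tail_cls}: {head_n / tail_n:.1f}")
```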

To clearly analyze the data distribution in the visual feature space, we visualize the distribution of sub-activity features extracted by GPNN [17], as shown in Fig. 2(a). For convenience, we map these features into a two-dimensional space via t-SNE visualization [21]. In Fig. 2(a), some samples of tail classes lie close to the features of head classes (e.g., placing and cleaning), and tail classes often have samples that fall inside the distribution of head classes (e.g., closing within the moving sample distribution). Such a feature distribution makes the tail classes easily misrecognized as head classes in HOI classification.
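
For readers who want to reproduce this kind of visualization, below is a minimal sketch of the Fig. 2(a)-style analysis using scikit-learn's t-SNE [21]; the features and labels are random placeholders standing in for the GPNN sub-activity features.

```python
# A minimal sketch of the Fig. 2(a) style analysis: project high-dimensional
# class features to 2D with t-SNE [21] and color points by class. Random
# placeholders stand in for the sub-activity features extracted by GPNN [17].
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 256))       # placeholder visual features
class_ids = rng.integers(0, 10, size=200)    # placeholder sub-activity labels

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=class_ids, cmap="tab10", s=8)
plt.title("t-SNE of sub-activity features (illustrative)")
plt.show()
```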

In our opinion, understanding HOI in videos requires not only visual data but also human knowledge. Linguistic knowledge is easy to acquire and plays a significant role in how we understand the environment. We wonder whether such language knowledge can be used to guide HOI recognition and alleviate the burden of the long-tail problem. In recent years there have already been works that use language in visual tasks (e.g., visual question answering [24], relationship detection [25], image retrieval [26], referring expression comprehension [27], and video summarization [28]). Such methods leverage descriptions or queries as additional modalities to improve results. Although they target distinct tasks, they all rely on word embeddings to obtain language features. Different from them, we argue that the distributions of vision and language features have an inherent correlation, which can be exploited to improve HOI performance.
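
As an illustration of the word-embedding step mentioned above, the sketch below looks up a pretrained word2vec vector for each HOI class name via gensim; the embedding model and class vocabulary actually used in the paper may differ, so treat this purely as an assumption-laden example.

```python
# Hedged sketch: looking up a pretrained word vector for each HOI class name.
# The reference list cites word2vec [Mikolov et al.]; the embedding model and
# class vocabulary actually used in the paper may differ from this example.
import gensim.downloader as api

word2vec = api.load("word2vec-google-news-300")   # pretrained 300-d vectors

class_names = ["drinking", "opening", "placing", "cleaning"]   # example classes
class_embeddings = {name: word2vec[name] for name in class_names}

print(class_embeddings["drinking"].shape)   # (300,)
```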

Interestingly, we notice that the word embeddings of the HOI classes exhibit a pattern related to the long tail in the CAD-120 dataset. Fig. 2(b) shows that the intra-cluster distances are significantly smaller than the inter-cluster distances in the word embedding space (hereafter we treat head classes and tail classes as different clusters). This means that we can exploit the language representation to increase the distance between these two kinds of classes, i.e., head classes and tail classes, so that the problem caused by the long tail in the data may be alleviated in HOI recognition.
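
The comparison behind Fig. 2(b) can be sketched as follows: average cosine distance within the head-class cluster versus between head- and tail-class embeddings. The head/tail split and the random placeholder vectors below are illustrative assumptions, not the dataset's actual statistics.

```python
# Sketch of the intra- vs. inter-cluster comparison behind Fig. 2(b): mean
# cosine distance within head-class embeddings vs. between head- and tail-class
# embeddings. The head/tail split and the random vectors are illustrative only.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
emb = {c: rng.normal(size=300)               # placeholder word embeddings
       for c in ["moving", "placing", "cleaning", "closing"]}

head = np.stack([emb[c] for c in ["moving", "placing"]])     # head classes
tail = np.stack([emb[c] for c in ["cleaning", "closing"]])   # tail classes

intra = cdist(head, head, metric="cosine")
intra_mean = intra[np.triu_indices_from(intra, k=1)].mean()  # drop self-pairs
inter_mean = cdist(head, tail, metric="cosine").mean()

print(f"intra-head: {intra_mean:.3f}  head-vs-tail: {inter_mean:.3f}")
```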

With the insights above, we propose a language-guided graph parsing attention network that leverages linguistic knowledge to guide HOI recognition in videos. The proposed language-guided module distills information from language data to guide the visual module toward learning a better HOI embedding. In this paper, we attempt to teach the network linguistic knowledge about HOI in addition to learning the classification itself. Unlike the previously mentioned language-guided approaches, we are the first to analyze the relationship between the visual and language feature distributions and to leverage it for HOI recognition.

More specifically, we first associate each HOI class name with a word embedding vector, so that all the vectors together construct a language space specified for HOI recognition. Simultaneously, the visual feature is extracted from the inputs via the proposed graph parsing attention network (GPAN) for better visual representation. The visual feature is then transformed into a linguistic one in the language space. Finally, the output score is obtained by measuring the distance between the linguistic feature and the word embeddings of the classes in the language space. With the guidance of linguistic knowledge, the margins between the features of head classes and tail classes are increased. The network can also distinguish some visually similar features by making use of the margins in the language space.
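
The following PyTorch sketch captures this pipeline as described above: a projection of the visual feature into the language space followed by a distance-based score against the class word embeddings. The projection layer, the specific distance, and all dimensions are illustrative assumptions; the exact LG-GPAN formulation is given in Section 3.

```python
# A minimal PyTorch sketch of the language-guided classification idea: project
# the visual feature into the language space and score each class by the
# (negative) distance to its word embedding. The projection layer, distance
# measure, and dimensions are assumptions, not the exact LG-GPAN design.
import torch
import torch.nn as nn

class LanguageGuidedHead(nn.Module):
    def __init__(self, visual_dim, class_embeddings):
        super().__init__()
        # class_embeddings: (num_classes, embed_dim) word vectors, kept fixed
        self.register_buffer("class_emb", class_embeddings)
        self.project = nn.Linear(visual_dim, class_embeddings.shape[1])

    def forward(self, visual_feat):
        # Transform the visual feature into the language space
        linguistic_feat = self.project(visual_feat)          # (B, embed_dim)
        # Negative Euclidean distance to each class embedding as the class score
        dist = torch.cdist(linguistic_feat, self.class_emb)  # (B, num_classes)
        return -dist

# Toy usage: 10 sub-activity classes with 300-d word vectors, 512-d visual features
head = LanguageGuidedHead(512, torch.randn(10, 300))
scores = head(torch.randn(4, 512))
print(scores.argmax(dim=1))   # predicted class per sample
```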

The rest of this paper is organized as follows. Section 2 reviews the related work. The details of the proposed network are introduced in Section 3. Extensive experiments are then presented in Section 4 to evaluate the proposed approach. Finally, Section 5 concludes the paper.

Section snippets

Related work

Human-Object Interaction. Research on human-object interaction can be divided into two branches: image-based HOI detection and video-based HOI recognition. Image-based HOI detection attempts to predict the interaction of each human-object pair in an image. Gkioxari et al. [8] proposed to predict the approximate position of the target object of an interaction from the appearance of the human. Gao et al. [29] proposed an instance-centric attention module that learns to dynamically

Proposed approach: LG-GPAN

The problem we consider is to recognize the sequence of sub-activities being performed by a human and, relatedly, the associated affordances of the objects involved in the interaction [20]. An overview of the proposed LG-GPAN is shown in Fig. 3. It mainly includes two parts: the language-guided module (LGM) and the visual-based graph parsing attention module (VGPAM). We adopt GPNN as our baseline. For clear notation, we first revisit GPNN. Then the proposed language-guided module and visual-based
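
To give a flavor of the graph-based visual module, the sketch below implements one generic message-passing step with a GRU node update in the spirit of the GPNN baseline [17]; it is not the exact VGPAM attention formulation, whose details appear in the full text of Section 3.

```python
# Hedged sketch of one message-passing step with a GRU node update, in the
# spirit of the GPNN baseline [17] that LG-GPAN builds on. The soft graph
# weights, feature sizes, and update rule are illustrative assumptions.
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    def __init__(self, node_dim):
        super().__init__()
        self.message_fn = nn.Linear(2 * node_dim, node_dim)  # message from a node pair
        self.update_fn = nn.GRUCell(node_dim, node_dim)      # GRU node update

    def forward(self, node_feats, adj):
        # node_feats: (N, D) human/object node features; adj: (N, N) soft graph weights
        N = node_feats.shape[0]
        senders = node_feats.unsqueeze(0).expand(N, N, -1)    # neighbor features
        receivers = node_feats.unsqueeze(1).expand(N, N, -1)  # each node's own features
        messages = self.message_fn(torch.cat([receivers, senders], dim=-1))
        aggregated = (adj.unsqueeze(-1) * messages).sum(dim=1)  # weighted message sum
        return self.update_fn(aggregated, node_feats)           # updated node states

# Toy usage: 3 nodes (1 human + 2 objects) with 64-d features
step = MessagePassingStep(64)
feats = torch.randn(3, 64)
adj = torch.softmax(torch.randn(3, 3), dim=1)  # stand-in for learned graph weights
print(step(feats, adj).shape)                  # torch.Size([3, 64])
```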

Experiments

In the following sections, we evaluate our approach on both the CAD-120 [15] dataset for HOI in videos and the V-COCO [47] dataset for HOI in images. To verify the effectiveness of our approach, a comprehensive comparison is established for quantitative assessment. We then take HOI in videos as an example for qualitative analysis. Finally, an ablation study discusses the contribution of each component of the proposed approach to the final performance.
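
For reference, the sketch below shows the kind of per-class metric commonly reported for CAD-120-style recognition, a macro-averaged F1 score that weights head and tail classes equally; the exact evaluation protocol follows the compared works, and the labels here are toy data.

```python
# Illustrative sketch of a per-class evaluation metric: macro-averaged F1 over
# sub-activity (or affordance) classes, which treats head and tail classes
# equally. Predictions and labels below are toy data, not real results.
from sklearn.metrics import f1_score

y_true = ["moving", "placing", "cleaning", "moving", "closing"]
y_pred = ["moving", "placing", "moving",   "moving", "closing"]

print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
```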

Conclusion

In this paper, we propose a language-guided graph parsing attention network that makes use of the word distribution in language to guide HOI recognition in vision. We first analyze the relationship between the distributions of vision and language features in HOI. Then the language-guided module (LGM) is proposed to measure the distance between the visual feature and the word embeddings of the classes, alleviating the burden of the long-tail problem. Moreover, the visual feature is extracted

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2020AAA0109301), the National Natural Science Foundation of China (No. 61836008), the Guangdong Provincial Key Field Research and Development Plan Project (No. 2021B0101410002), and the Guangdong Provincial Key Field Research and Development Plan Project (No. 2021B0101400002).

References (49)

  • Ng, S., Principal component analysis to reduce dimension on digital image, Procedia Comput. Sci. (2017).
  • Chen, Y., et al., Learning joint visual semantic matching embeddings for language-guided retrieval.
  • Grabner, H., et al., What makes a chair a chair?
  • Prest, A., et al., Explicit modeling of human-object interactions in realistic videos, IEEE Trans. Pattern Anal. Mach. Intell. (2012).
  • Liu, C., Jin, Y., Xu, K., Gong, G., Mu, Y., Beyond short-term snippet: Video relation detection with spatio-temporal global...
  • Qian, X., Zhuang, Y., Li, Y., Xiao, S., Pu, S., Xiao, J., Video relation detection with spatio-temporal graph, in:...
  • Shang, X., et al., Annotating objects and relations in user-generated videos.
  • Baldassano, C., et al., Human–object interactions are more than the sum of their parts, Cerebral Cortex (2017).
  • Chao, Y.-W., et al., Learning to detect human-object interactions.
  • Gkioxari, G., Girshick, R., Dollár, P., He, K., Detecting and recognizing human-object interactions, in: Proceedings of the...
  • Xu, B., Wong, Y., Li, J., Zhao, Q., Kankanhalli, M.S., Learning to detect human-object interactions with knowledge, in:...
  • Li, Y.-L., Zhou, S., Huang, X., Xu, L., Ma, Z., Fang, H.-S., Wang, Y., Lu, C., Transferable interactiveness knowledge for...
  • Xu, B., et al., Interact as you intend: Intention-driven human-object interaction detection, IEEE Trans. Multimed. (2019).
  • Wan, B., Zhou, D., Liu, Y., Li, R., He, X., Pose-aware multi-level feature network for human object interaction detection,...
  • Kim, D.-J., et al., Detecting human-object interactions with action co-occurrence priors.
  • Gupta, A., et al., Observing human-object interactions: Using spatial and functional compatibility for recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2009).
  • Koppula, H.S., et al., Anticipating human activities using object affordances for reactive robotic response, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • Jain, A., Zamir, A.R., Savarese, S., Saxena, A., Structural-RNN: Deep learning on spatio-temporal graphs, in: Proceedings...
  • Qi, S., Wang, W., Jia, B., Shen, J., Zhu, S.-C., Learning human-object interactions by graph parsing neural networks, in:...
  • Chung, J., et al., Empirical evaluation of gated recurrent neural networks on sequence modeling (2014).
  • Sunkesula, S.P.R., Dabral, R., Ramakrishnan, G., LIGHTEN: Learning interactions with graph and hierarchical temporal...
  • Koppula, H.S., et al., Learning human activities and object affordances from RGB-D videos, Int. J. Robot. Res. (2013).
  • Van der Maaten, L., et al., Visualizing data using t-SNE, J. Mach. Learn. Res. (2008).
  • Mikolov, T., et al., Efficient estimation of word representations in vector space (2013).
This paper has been recommended for acceptance by Zicheng Liu.