1 Introduction

Fig. 1

Different types of errors in Human-object Interaction Detectors. In the image, a human-object pair exhibits two interactions. We present seven sample HOI triplet detections, each corresponding to a different type of error. For clarity, we display only one triplet detection per image (solid line), while the dashed line indicates that a detection with a higher confidence score already exists

Fig. 2

Error categorization flow. Given a triplet prediction from the model output, it is passed into this logical flow and is either a true positive or classified as one type of error. Note that the different error types are mutually exclusive: each prediction is assigned to exactly one error type

Human-object interaction (HOI) detection aims to jointly localize humans and objects that have interactions in static images, for example, the person and snowboard in Fig. 3. It provides structured interpretations of the semantics of visual scenes, going beyond mere object recognition or detection. A successful HOI detection system is an essential building block for many downstream applications, such as visual question answering (Antol et al., 2015; Anderson et al., 2018; Shih et al., 2016; Lu et al., 2016; Wang et al., 2017), image captioning (Vinyals et al., 2016; Aneja et al., 2018; Feng et al., 2019; Li et al., 2019), and retrieval (Chao et al., 2015; Brown et al., 2020; Ng et al., 2022; Teichmann et al., 2019; Radenović et al., 2018).

Recent advances in HOI detection have been marked by increasing mean Average Precision (mAP) scores on standard benchmarks (Gupta and Malik, 2015; Chao et al., 2018; Gao et al., 2020; Ulutan et al., 2020; Gupta et al., 2019; Zhou and Chi, 2019; Zhang et al., 2021, 2022; Liao et al., 2022; Yuan et al., 2022; Yu et al., 2023; Ma et al., 2023; Li et al., 2022; Wu et al., 2022; Zhong et al., 2022; Jiang et al., 2022; Liu et al., 2022; Kim et al., 2023; Yuan et al., 2023; Zhu et al., 2024), marking remarkable progress. Nonetheless, relying on mAP as a summary metric does not provide sufficient insight into the nuances of model performance, such as which factors make one method perform better or where the bottlenecks for further improvement lie. This lack of detailed understanding may impede future advancements in the field. The same issue exists in object detection, a sub-task of HOI detection, where mAP is also the dominant evaluation metric. To address this, diagnostic toolboxes have been designed to provide more insightful quantitative breakdown analyses (Hoiem et al., 2012; Bolya et al., 2020), which have significantly boosted the development of object detection.

In this paper, we aim to build on the success of these works by introducing a diagnosis toolbox designed for HOI detection, fostering future research. Generally speaking, the HOI detection problem consists of two sub-tasks: 1) localizing pairs of interacting humans and objects (human-object pair localization) and 2) classifying their interactions, as illustrated in Fig. 1. These two tasks are not independent but stand in a cascaded relationship, as shown in Fig. 3. Specifically, in our toolbox, we first perform a holistic analysis of the overall HOI detection accuracy. Inspired by the object detection diagnosis toolbox (Bolya et al., 2020), we define a set of error types, as well as oracles to fix them, in the HOI detection pipeline across the human-object pair localization and interaction classification tasks. The mAP improvement achieved by applying each oracle is used to measure the significance of different errors: the larger the mAP improvement obtained by fixing a particular type of error, the more that error contributes to the failure of an HOI detector.

We then examine the tasks of human-object pair localization and interaction classification in detail. For human-object pair localization, we focus primarily on Recall to assess whether the model can capture all ground-truth pairs, which is crucial for the subsequent interaction classification stage. For interaction classification, the model must determine whether a detected human-object pair involves an actual interaction. To evaluate this binary classification task, we report the Average Precision (AP) score instead of accuracy, as this eliminates the need to select a specific threshold. For human-object pairs with actual interactions, we calculate mAP scores to address the multi-label classification aspect of the task. This approach allows us to analyze the two sub-tasks independently.

Our diagnosis toolbox is applicable to various methods across different datasets. Through both holistic and detailed investigations of human-object pair localization and interaction classification, the toolbox provides a comprehensive diagnosis report for eight state-of-the-art HOI detection models. With the detailed quantitative breakdown results, obtained by applying the error categorization flow in Fig. 2, we can now address key questions such as: “Are one-stage HOI detection models superior to two-stage models, or vice versa?” (there is no clear accuracy advantage between the two paradigms), “What is the main bottleneck in HOI detection?” (incorrect object localization in human-object pairs and misclassification of interactions), and “Why does the state-of-the-art method RLIPv2 (Yuan et al., 2023) perform better?” (it significantly improves interaction classification accuracy). For more detailed discussions of existing HOI detection models, please refer to Sect. 6.

To the best of our knowledge, this is the first toolbox specifically dedicated to diagnosing HOI detection in static images. By releasing our toolbox, we believe it will promote the future development of HOI detection models.

1.1 Related Work

There are several analysis tools for object detection (Lin et al., 2014; Hoiem et al., 2012; Bolya et al., 2020). The seminal work (Hoiem et al., 2012) shows how to analyze the influences of object characteristics on detection performance and the impact of different types of false positives. However, it requires extra annotations to help analyze the impact of object characteristics, which is unlikely to be scalable in large-scale benchmark datasets. TIDE (Bolya et al., 2020) improves the default evaluation tool provided by the COCO dataset (Lin et al., 2014). It provides a more general framework for quantifying performance improvements for different false positive and false negative errors in object detection and instance segmentation algorithms. Our quantitative analysis of different errors and tasks in HOI detection is motivated by TIDE (Bolya et al., 2020). Extending existing toolboxes like TIDE (Bolya et al., 2020) to HOI detection is not trivial due to the intertwined nature of the human-object pair localization and interaction classification tasks. TIDE (Bolya et al., 2020) focuses on single-box detection, whereas HOI detection involves both box pair localization and the subsequent cascaded interaction classification. In TIDE, the various error types are easily distinguished and mutually exclusive. However, in our case, the errors are naturally entangled, requiring carefully designed criteria to categorize them in alignment with the model structure. The definitions and calculations of error significance in our work differ significantly from TIDE. Additionally, we provide an in-depth analysis of each error type to better understand the model’s performance and identify bottlenecks.

A similar error diagnosis work (Chen et al., 2021) is proposed for the video relation detection task, adopting a holistic approach inspired by TIDE (Bolya et al., 2020). In our diagnosis toolbox, we go beyond holistic error analysis and also conduct detailed investigations into the two distinct sub-tasks of HOI detection, considering the cascaded nature of the HOI detection pipeline. In Gupta and Malik (2015), the authors define several types of false positive errors. However, the definition is specifically tailored to the annotation format of the V-COCO dataset, making it less generalizable to others. In contrast, our analysis is applicable to various benchmark datasets (Chao et al., 2018; Gupta and Malik, 2015). In Kilickaya and Smeulders (2020), the authors analyze a specific issue in HOI detection, namely the long-tail problem of HOI categories, and highlight limiting factors. Liu et al. (2022) proposes a new metric to improve HOI generalization by preventing the model from learning spurious object-verb correlations. Both Kilickaya and Smeulders (2020) and Liu et al. (2022) are complementary to our diagnosis tool and analysis results.

Fig. 3

Illustration of the two sub-tasks in HOI detection. (a) Localize all human-object pairs that have actual interactions (person and snowboard). (b) Classify the interactions between them (hold, jump, ride, stand on, and wear)

2 Preliminaries

2.1 Definition of HOI Detection

Given an input image I, the output of a human-object interaction (HOI) detector is a set of triplets \(\mathcal {S} = \left\{ \left( \textbf{b}_i^h, \textbf{b}_i^o, a_i \right) \right\} _{i=1}^K\), where \(\textbf{b}_i^h\), \(\textbf{b}_i^o\), and \(a_i\) represent the bounding box of the i-th human, the bounding box of the i-th object, and their interaction class, respectively. Each bounding box \(\textbf{b}_i^h\) and \(\textbf{b}_i^o\) contains both the spatial coordinates of the bounding box and the associated category label. Specifically, for the i-th human: \(\textbf{b}_i^h = \left( x_i^h, y_i^h, w_i^h, h_i^h, c_i^h \right) \), where \((x_i^h, y_i^h)\) are the coordinates of the top-left corner of the bounding box, \(w_i^h\) is its width, \(h_i^h\) is its height, and \(c_i^h\) is the category label associated with the human (usually fixed as ‘person’). In general relationship detection, the subject can be any object, e.g., chair in \(\texttt {<}\)chair on floor\(\texttt {>}\) or car in \(\texttt {<}\)car near fire_hydrant\(\texttt {>}\). HOI detection, however, studies only human-centric relationships, so the subject category is always restricted to ‘person’.

Similarly, for the i-th object: \( \textbf{b}_i^o = \left( x_i^o, y_i^o, w_i^o, h_i^o, c_i^o \right) \), where \((x_i^o, y_i^o)\) are the coordinates of the top-left corner of the object bounding box, \( w_i^o\) is its width, \(h_i^o\) is its height, and \(c_i^o\) is the category label of the object. The interaction class \(a_i\) represents the action or relationship between the human and the object (e.g., ‘holding’, ‘riding’, ‘looking at’). The number of such triplets is denoted by K, indicating the total number of detected interactions in the image. Note that the object category \(c_i^o\) can also be ‘person’, making human-human interactions valid instances in the HOI detection framework, as seen in datasets like HICO-DET; such human-human pairs are treated the same as human-object pairs. In essence, the HOI detection problem consists of two sub-tasks, as shown in Fig. 3. First, every human-object pair that has an actual interaction must be correctly localized. Unlike object detection, where individual object bounding boxes are predicted, the localization task here involves associating a pair of human and object boxes. This introduces additional complexity because the model needs to both detect the objects and establish the correct pairings between them. Once the human-object pairs are localized, the second sub-task is to recognize their interaction labels. Multiple interactions can occur for the same pair, making this a multi-label classification problem. For instance, the interaction between the person and the snowboard in Fig. 3 could be classified as hold, ride, or other relevant actions. Identifying all applicable interactions for each human-object pair is crucial for successful HOI detection.
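The triplet formalism above can be sketched as a minimal data structure. This is an illustrative sketch only; the class and field names are our own and not taken from any particular codebase:

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float          # top-left corner, x coordinate
    y: float          # top-left corner, y coordinate
    w: float          # width
    h: float          # height
    category: str     # associated category label, e.g. 'person'


@dataclass
class HOITriplet:
    human: Box        # subject box; its category is always 'person'
    obj: Box          # object box; may itself be 'person' (human-human pairs)
    action: str       # interaction class a_i, e.g. 'ride'


# one detected triplet for the person/snowboard example of Fig. 3
triplet = HOITriplet(
    human=Box(10, 20, 50, 120, 'person'),
    obj=Box(30, 90, 80, 40, 'snowboard'),
    action='ride',
)
```

An HOI detector outputs a set of K such triplets per image, typically alongside per-component confidence scores.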

2.2 Benchmark Datasets

HICO-DET (Chao et al., 2018) and V-COCO (Gupta and Malik, 2015) are two widely used benchmark datasets, both of which share the same 80 object categories as the COCO dataset (Lin et al., 2014). In HICO-DET, there are 117 interaction categories, resulting in 600 HOI categories, as certain combinations of objects and interactions are not applicable. To clarify, each HOI category represents a specific pairing of an interaction category (e.g., “hold”, “sit on”) with an object category (e.g., “cup”, “chair”); an HOI category is thus defined by the unique combination of an interaction and an object. HICO-DET comprises 47,776 Creative Commons images sourced from Flickr, including 38,118 for training and 9,658 for testing, featuring over 150,000 human-object pairs. We omit no_interaction annotations from HICO-DET due to incomplete annotation, leaving 44,329 images (35,801 training, 8,528 testing) with 520 HOI categories from 80 objects and 116 interactions.

V-COCO is based on the MS-COCO (Lin et al., 2014) dataset, which contains 5,400 images in the trainval subset and 4,946 images in the test subset. In V-COCO, there are 26 interaction categories. For each interaction, objects are annotated in three different roles: the agent, the instrument, or the object. The task is to detect the agent (human) and the objects in various roles for the interaction (e.g., \(\texttt {<}\)person cut_instrument knife\(\texttt {>}\), \(\texttt {<}\)person read_object book\(\texttt {>}\)).

2.3 Computing mAP

For an output triplet \(\left( b_{i}^{h}, b_{i}^{o}, a_{i}\right) \) from a model, it is compared with the ground-truth annotations, and considered to be a true positive (TP) for an HOI class if all the following conditions are satisfied:

  • The category labels of the human and object bounding boxes are both correct.

  • The intersection-over-union (IoU) w.r.t. the ground-truth annotations for the human \(\textrm{IoU}^{h}\) and object \(\textrm{IoU}^{o}\) both exceed 0.5, i.e., \(\min \bigl ( \textrm{IoU}^{h}, \textrm{IoU}^{o} \bigr ) > 0.5\).

  • The output interaction category \(a_i\) is correct.

If any of these conditions is not satisfied, it is considered as a false positive (FP).
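The matching conditions above can be sketched as follows, with boxes represented as \((x, y, w, h, c)\) tuples following the definition in Sect. 2.1 (the function names are ours, for illustration only):

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, iou_thresh=0.5):
    """Apply the three TP conditions to (human_box, obj_box, action) triplets,
    where each box is an (x, y, w, h, category) tuple."""
    (ph, po, pa), (gh, go, ga) = pred, gt
    labels_ok = ph[4] == gh[4] and po[4] == go[4]                          # condition 1
    ious_ok = min(iou(ph[:4], gh[:4]), iou(po[:4], go[:4])) > iou_thresh   # condition 2
    action_ok = pa == ga                                                   # condition 3
    return labels_ok and ious_ok and action_ok
```

For example, two identical boxes have IoU 1.0, and shifting a 10x10 box right by 5 yields IoU 1/3 (overlap 50, union 150).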

If multiple HOI predictions are matched to the same ground-truth HOI triplet, only the one with the highest confidence score is counted as a true positive (TP), while all others are considered false positives (FPs). Let the confidence scores for \(\textbf{b}_i^h\), \(\textbf{b}_i^o\), and \(a_i\) be denoted by \(s_i^h\), \(s_i^o\), and \(s_i^a\), respectively. The overall confidence score for the triplet \(( \textbf{b}_i^h, \textbf{b}_i^o, a_i)\) is computed as the product \(S_i = s_i^h \cdot s_i^o \cdot s_i^a\).
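The tie-breaking rule among predictions matched to the same ground-truth triplet can be sketched as below (a simplified stand-in for the benchmark evaluation code; the score keys are hypothetical):

```python
def select_tp(matched_preds):
    """Among predictions matched to the same GT triplet, keep the one with the
    highest product score S = s_h * s_o * s_a as the TP; the rest become FPs.
    Each prediction is a dict with confidence scores 's_h', 's_o', 's_a'."""
    scored = [(p['s_h'] * p['s_o'] * p['s_a'], i)
              for i, p in enumerate(matched_preds)]
    best = max(scored)[1]                      # index of the highest-scoring triplet
    fps = [i for i in range(len(matched_preds)) if i != best]
    return best, fps
```

Note that a prediction with high box scores but a low action score can still lose to one with balanced scores, since the product is what matters.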

Fig. 4

Illustration of two- and one-stage HOI detectors. A two-stage HOI detector separates human-object pair localization from interaction classification, whereas a one-stage model does not have this clear separation and directly outputs detected triplets

All output triplets are collected from all images in a benchmark for each HOI category. The detected triplets are then sorted in descending order based on their confidence scores. For each HOI category, given a triplet confidence threshold \(\tau _t\), the cumulative precision and recall are defined as:

$$\begin{aligned} P=\frac{N_{TP}}{N_{TP}+N_{FP}}, \quad R=\frac{N_{TP}}{{N_{GT}}}, \end{aligned}$$
(1)

for those triplets with confidence scores greater than \(\tau _t\). Here, P denotes precision and R represents recall. \(N_{TP}\), \(N_{FP}\), and \(N_{GT}\) are the number of true positives (TPs), false positives (FPs), and ground-truth triplets, respectively, for a particular HOI category. By varying the confidence threshold \(\tau _t\), P is interpolated to decrease monotonically, and the AP (Average Precision) is computed as the integral under the precision-recall curve. Finally, mAP is defined as the average AP across all HOI categories.
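The per-category AP computation described above can be sketched as follows. This is a simplified version under our own naming; the official benchmark code additionally performs per-image matching and handles score ties:

```python
import numpy as np


def average_precision(scores, is_tp, n_gt):
    """AP for one HOI category: sort detections by confidence, accumulate
    precision/recall as in Eq. 1, make precision monotonically decreasing,
    and integrate under the precision-recall curve."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(n_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)
    # interpolation: precision at recall r is the max precision at any recall >= r
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # integrate under the (step-wise) precision-recall curve
    r = np.concatenate(([0.0], recall))
    return float(np.sum((r[1:] - r[:-1]) * precision))

# mAP is then the mean of the per-category APs.
```

As a sanity check, a single correct detection covering the only ground-truth triplet yields AP = 1.0, while one TP out of two ground truths yields AP = 0.5.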

2.4 Removing no_interaction Category

In our diagnosis, we exclude the 80 no_interaction HOI categories on HICO-DET due to their incomplete annotations. The no_interaction label indicates that a localized human-object pair has no actual interaction. Each of the 80 object categories in HICO-DET is paired with a no_interaction action label, creating 80 distinct no_interaction HOI categories. In the detection setting, which includes both localization and classification, HICO-DET incorporates these 80 categories. However, this introduces inconsistencies between the ground-truth annotations and the evaluation protocol, particularly in the localization sub-task. To accurately compute the mAP for the no_interaction categories, every human-object pair without an interaction would need to be exhaustively annotated, which is not the case in practice. As shown in Fig. 5, the HICO-DET dataset has incomplete annotations for the no_interaction HOI categories, where both bounding box annotations and interaction labels are often missing. This poses a significant issue: even if a model correctly identifies human-object pairs and assigns the no_interaction label, those predictions may still be marked as false positives due to the lack of corresponding ground-truth annotations. Consequently, the model's mAP score is unfairly penalized for detections that are not fully accounted for in the dataset. The inconsistency arises because the evaluation protocol assumes exhaustive annotation of both interaction and non-interaction cases, which does not hold for the no_interaction categories. This misalignment between the model's performance and the reported metrics leads to an inaccurate assessment of its true ability to handle no_interaction cases. Note that this is not an issue for HOI classification in HICO (Chao et al., 2015), as no localization is required.

Fig. 5

Examples of the missing annotations of the no_interaction HOI categories. On the right, we show missing no_interaction labels and missing bounding boxes using dashed lines and dashed boxes, respectively

How can we solve this issue? Clearly, exhaustively annotating no_interaction human-object pairs is neither feasible nor scalable. In fact, the no_interaction HOI category is not needed. If there are no annotations stating that two objects have any actual interaction (e.g., catch or ride), it means they have no interaction, as the current annotations in Fig. 5 indicate. This setting is adopted in the V-COCO (Gupta and Malik, 2015) benchmark. Therefore, in our diagnosis, we remove all 80 no_interaction HOI categories and only consider the remaining 520 ones for the HICO-DET benchmark (Chao et al., 2018).

We would like to emphasize that excluding the mAP calculation for the no_interaction HOI categories does not cause an HOI detector to ignore human-object pairs with no interactions, nor does it underestimate the detector’s accuracy in our diagnosis. First, we do not remove the no_interaction label from the model’s output, so there is no need to retrain the model. An HOI model can still identify when a human-object pair has no interactions, which we discuss in detail in Sect. 4.2. Second, if the model incorrectly classifies a human-object pair with no interaction as having an actual interaction (e.g., ride bicycle), this incorrect output is counted as a false positive. Similarly, misclassifying a ride bicycle pair as no_interaction reduces the number of true positives. Such errors result in a lower mAP, which accurately reflects the model’s performance.

2.5 Two-Stage Versus One-Stage HOI Detectors

Existing HOI detectors can be broadly classified into two categories: two-stage and one-stage, as illustrated in Fig. 4. Two-stage HOI detectors first detect individual object instances, resulting in a set of human and object bounding boxes whose confidence scores exceed a fixed threshold \(\tau _d\). Every possible human-object box pair is then formed exhaustively and passed to the second stage, which classifies interactions using the object detector’s feature representations. Depending on the object detector used, NMS (non-maximum suppression) may be applied to eliminate duplicate object detections, ensuring no duplicates in the human-object pairs and final triplet outputs.
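The first stage of a two-stage detector, thresholding detections at \(\tau _d\) and exhaustively pairing humans with candidate objects, can be sketched as below (the detection representation is illustrative, not from any specific implementation):

```python
def enumerate_pairs(detections, tau_d=0.5):
    """Keep detections whose score reaches tau_d, then exhaustively pair every
    human with every other kept instance. `detections` is a list of
    (box, category, score) tuples; each pair is a candidate for the
    second-stage interaction classifier."""
    kept = [d for d in detections if d[2] >= tau_d]
    pairs = []
    for i, h in enumerate(kept):
        if h[1] != 'person':          # subjects are always humans
            continue
        for j, o in enumerate(kept):
            if i != j:                # a human may also pair with another person
                pairs.append((h, o))
    return pairs
```

With n kept humans and m kept instances overall, this produces on the order of n * m candidate pairs, which is why noisy first-stage output burdens the interaction classifier.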

In contrast, one-stage methods perform human-object pair localization and interaction classification together without a clear separation. A one-stage detector directly localizes human-object pairs that may have interactions and classifies the interactions between them, with feature representations shared between both tasks. NMS is typically used to remove duplicates in the final output of detected triplets.

Both paradigms are actively researched. One-stage detectors generally run faster than two-stage counterparts as they bypass individual object detection by coupling human-object pair localization and interaction classification. However, in terms of accuracy (i.e., mAP), neither paradigm has a clear advantage over the other.

3 Holistic Error Analysis

One way to diagnose a method is by analyzing the error patterns in its output. As discussed in Sect. 2.3, an HOI detection model can make errors at various stages of human-object pair localization and interaction classification, leading to more false positives (FPs), false negatives (FNs), fewer true positives (TPs), and ultimately, a lower mAP. Inspired by the diagnosis approach used in object detection (Bolya et al., 2020), we conduct a holistic analysis by first defining a set of error types specific to HOI detection, as illustrated in Fig. 1. We then quantify the significance of each error by determining the mAP improvement that could be achieved if the error were perfectly addressed using predefined oracles.

3.1 Error Categories

3.1.1 Human-Object Pair Localization Errors

We define the following set of errors in the human-object pair localization task.

  • Human box error: The detected object bounding box is correct, but the human bounding box is incorrect (either incorrect localization where \(\mathrm {IoU^h}<0.5\), or incorrect classification of the human category, or both).

  • Object box error: The detected human bounding box is correct, but the object bounding box is incorrect (either incorrect localization where \(\mathrm {IoU^o}<0.5\), or incorrect classification of the object category, or both).

  • Both boxes error: Neither the detected human nor object bounding box is correct.

  • Association error: Both the human and object bounding boxes are correct, but they have no actual interaction.

3.1.2 Interaction Classification Errors

Here, we focus only on human-object pairs with actual interactions that have already been correctly localized (otherwise, such pairs would fall under errors in the human-object pair localization task). We define the following interaction classification errors.

  • Duplicate error: The output action category is correct, but there is another detected triplet with a higher confidence score that has already matched the ground truth.

  • Interaction error: The output interaction is different from the ground-truth label.

3.1.3 Missed GT Error

After all predicted triplets have been matched with ground truth triplets and classified into one of the above error categories, any remaining unmatched ground truth triplets are considered missed GT. These unmatched triplets may result from either missed pair localizations or missed interaction classifications.

We demonstrate the process of categorizing a given triplet prediction into either a true positive or a specific error category in Fig. 2. The flow chart shows that each prediction is assigned to exactly one error category based on a priority system. The decision process begins by determining whether the predicted human-object pair matches a ground-truth triplet. If a match is found, we next check whether the predicted action is correct. If the action is correctly classified, we further assess whether the prediction is a duplicate: if not, the prediction is a true positive; otherwise, it is categorized as a duplicate error. If the interaction is incorrect, the prediction is labeled as an interaction error. If the human-object pair does not match any ground truth, the process examines whether the human and object bounding boxes are individually correct. If the object box matches one of the ground-truth boxes but the human box does not, the prediction is classified as a human box error; conversely, if the human box is correct but the object box is not, it is classified as an object box error. When both boxes are incorrect, the prediction is a both boxes error. If both boxes are correct but only the association between the human and object is wrong, it is categorized as an association error. After all predictions are processed, any unmatched ground-truth triplets are classified as missed GT errors. This design ensures that each prediction is assigned to one and only one error category, making the error categorization mutually exclusive and avoiding overlap.
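The decision flow can be summarized in a short Python sketch; the boolean arguments stand in for the matching computations described above and are assumed to be precomputed:

```python
def categorize(matched_gt_pair, correct_action, is_duplicate,
               human_box_ok, object_box_ok):
    """Assign a prediction to exactly one outcome, following the flow chart.
    The boolean inputs summarize the matching step:
      matched_gt_pair -- the predicted pair matches a GT pair with interactions
      correct_action  -- the predicted action matches the GT label
      is_duplicate    -- a higher-scoring TP already matched this GT triplet
      human_box_ok / object_box_ok -- the individual boxes match some GT box."""
    if matched_gt_pair:
        if correct_action:
            return 'duplicate_error' if is_duplicate else 'true_positive'
        return 'interaction_error'
    # the pair does not match any GT pair: inspect the individual boxes
    if human_box_ok and object_box_ok:
        return 'association_error'    # both boxes fine, only the pairing is wrong
    if object_box_ok:
        return 'human_box_error'
    if human_box_ok:
        return 'object_box_error'
    return 'both_boxes_error'
```

Because the branches are exhaustive and ordered by priority, every prediction maps to exactly one category, mirroring the mutual exclusivity of the flow chart.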

3.2 Error Significance

Among all such errors, one may wonder which is most critical for improving the mAP of HOI detection. To this end, we compute the improvement of mAP obtained by fixing each error type with an “oracle”:

$$\begin{aligned} \varDelta mAP_o = mAP_o - mAP, \end{aligned}$$
(2)

where \(mAP_o\) denotes the mAP score after applying the oracle o to fix an error type. We call it an oracle as it is assumed to solve the error perfectly, so \(\varDelta mAP_o\) measures the largest mAP improvement attainable by fixing that error. The larger the \(\varDelta mAP_o\) for an error, the more significant it is as a bottleneck for an HOI detection model.
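Equation 2 amounts to a per-oracle subtraction against the base mAP. A minimal sketch, with purely hypothetical numbers for illustration:

```python
def error_significance(base_map, oracle_maps):
    """Delta-mAP of Eq. 2 for each oracle. `oracle_maps` maps an error name to
    the mAP obtained after its oracle fixes that error type in isolation."""
    return {name: m - base_map for name, m in oracle_maps.items()}


# hypothetical mAP values, for illustration only
deltas = error_significance(34.4, {'object_box': 41.2, 'interaction': 45.0})
bottleneck = max(deltas, key=deltas.get)   # the most significant error type
```

Note that, as discussed in Sect. 3.4, these per-oracle deltas are computed independently and do not sum to \(100-mAP\).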

We now provide details of how each oracle fixes its corresponding error type. Visual examples of fixing the different error types are shown in Fig. 6.

Fig. 6

Examples of different error types in real images and how we fix them using oracles. There are three human-object pairs in the ground truth, with three interactions between each pair. There are six predicted triplets (with confidence scores) in the second row, corresponding to six different error types. We fix each of them into true positives and remove duplicates afterward

3.2.1 Oracles for Fixing the Human-Object Pair Localization Errors

Fixing these errors involves four oracles.

  • Human box oracle: Fix the human box detection and action category, making it a true positive. If duplicates are made, suppress the lower-scoring prediction. Specifically, we first fix the corresponding box. Then, we search the ground truth triplets to find a matching human-object pair and assign the prediction the correct action category to make it a true positive. If multiple action categories are possible, we randomly select one.

  • Object box oracle: Fix such false positives analogously to the human box oracle.

  • Both boxes oracle: Since both boxes are incorrect, we cannot decide which ground truth triplet the detection attempts to match. We just remove this kind of false positive prediction.

  • Association oracle: Correct the pair association and action label, making it a true positive. If duplicates are made in this way, suppress the lower-scoring prediction. Specifically, we attempt to fix either the human or object box to match the prediction with a ground truth human-object pair, followed by assigning the correct action category. There may be multiple ways to resolve this error, so our rule is to make the minimal changes necessary. For instance, if the object box and action category are already correct, we only adjust the human box. Additionally, we ensure each ground truth pair is fixed only once to maintain consistency when addressing different error types.

3.2.2 Oracles for Fixing Interaction Classification Errors

In this case, we fix false positives caused by duplicate error or interaction error.

  • Duplicate oracle: We directly remove all duplicate predictions.

  • Interaction classification oracle: Fix the action label to make it a true positive. If duplicates are made in this way, suppress the lower-scoring prediction.

3.2.3 Oracles for Fixing Missed GT Errors

Note that even after fixing all the above errors, some ground-truth triplets still do not match any prediction; their count is the number of missed ground-truth triplets. We reduce the number of ground-truth HOI triplets in the mAP calculation by this number.

3.3 Grouping Errors into Two Categories

In some cases, we may need a more concise summary of the error patterns. To this end, we group the errors introduced earlier into FPs and FNs, regardless of where they stem from in the HOI detection pipeline, and measure the mAP improvement for each separately.

Oracles for grouped errors.

  • False positive oracle: Remove all false positive predictions.

  • False negative oracle: Set the number of ground truth triplets to the number of true positive predictions.

Table 1 Details of HOI detection models used in our analysis, including four two-stage and four one-stage detectors. They cover a wide range of design choices (e.g., backbone, object/pair detector, and interaction classifier)

3.4 Oddities of mAP Improvement

Similar to Bolya et al. (2020), the mAP improvement in our case has the issue that the \(\varDelta mAP\) values of different error types do not sum to \(100-mAP\). For example, adding the mAP of CDN (Zhang et al., 2021) to \(\varDelta mAP_{FP}\) and \(\varDelta mAP_{FN}\) (34.4 + 33.1 + 12.94) yields 80.44, not 100. As pointed out in Bolya et al. (2020), the reason is that fixing different errors at once gives a larger mAP improvement than fixing each error on its own. We discuss the details below; for a more thorough analysis, we direct readers to Bolya et al. (2020).

Summing the \(\varDelta mAP_{o_i}\) over all error types does not result in \(100-mAP\). Specifically, for the set of oracles \(\mathcal {O}=\left\{ o_1, o_2,..., o_n \right\} \), we generally have:

$$\begin{aligned} mAP + \varDelta mAP_{o_1} + ... + \varDelta mAP_{o_n} \ne 100. \end{aligned}$$
(3)

This occurs because we do not compute errors progressively. In contrast, fixing the errors progressively would give:

$$\begin{aligned} mAP + \varDelta mAP_{o_1, o_2, ..., o_n} = 100. \end{aligned}$$
(4)

The progressive error \(\varDelta mAP_{a|b}\) represents the change in mAP after applying oracle ‘a’ with oracle ‘b’ already applied:

$$\begin{aligned} \varDelta mAP_{a|b} = mAP_{a,b} - mAP_b. \end{aligned}$$
(5)

While progressive error computation assigns importance to error i as \(\varDelta mAP_{o_i|o_1,...,o_{i-1}}\), it inflates precision after reducing false positives and lacks intuitive appeal. Errors are rarely addressed in isolation; instead, multiple error types remain during improvement. Thus, observing \(\varDelta mAP_{a|b}\) is not practical since there is no state in which only error ‘b’ has been corrected. Relating \(\varDelta mAP_a + \varDelta mAP_b\) to \(\varDelta mAP_{a,b}\) shows they differ by \(\varDelta mAP_a - \varDelta mAP_{a|b}\). Expanding the terms:

$$\begin{aligned} \varDelta mAP_{a,b} = mAP_{a,b} - mAP, \end{aligned}$$
(6)
$$\begin{aligned} \varDelta mAP_a + \varDelta mAP_b = mAP_a + mAP_b - 2mAP. \end{aligned}$$
(7)

Rearranging the terms:

$$\begin{aligned} mAP = mAP_{a,b} - \varDelta mAP_{a,b}. \end{aligned}$$
(8)

Substituting into Eqs. 6 and 7:

$$\begin{aligned} \varDelta mAP_a + \varDelta mAP_b&= mAP_a + mAP_b - mAP \nonumber \\&\quad - mAP_{a,b} + \varDelta mAP_{a,b}. \end{aligned}$$
(9)

Grouping the terms gives:

$$\begin{aligned} \varDelta mAP_a + \varDelta mAP_b = \varDelta mAP_{a,b} + (\varDelta mAP_a - \varDelta mAP_{a|b}). \end{aligned}$$
(10)

Since \(\varDelta mAP_{a|b}\) is typically larger than \(\varDelta mAP_a\), \(\varDelta mAP_{a,b}\) is generally greater than \(\varDelta mAP_a + \varDelta mAP_b\), confirming Eq. 3. These nuances call for careful interpretation of mAP-based error analysis, given the metric's non-intuitive additivity properties.
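To make the non-additivity concrete, the following sketch verifies Eq. 10 numerically. The mAP values are purely hypothetical, chosen only to illustrate the algebra; they are not measurements from any model in our analysis.

```python
# Hypothetical mAP values (illustrative, not measured):
# mAP     - original score
# mAP_a   - after fixing error type a alone
# mAP_b   - after fixing error type b alone
# mAP_ab  - after fixing both error types at once
mAP, mAP_a, mAP_b, mAP_ab = 34.4, 40.1, 38.7, 47.9

d_a = mAP_a - mAP              # delta mAP_a
d_b = mAP_b - mAP              # delta mAP_b
d_ab = mAP_ab - mAP            # delta mAP_{a,b}  (Eq. 6)
d_a_given_b = mAP_ab - mAP_b   # delta mAP_{a|b}  (Eq. 5)

# Eq. 10: dmAP_a + dmAP_b = dmAP_{a,b} + (dmAP_a - dmAP_{a|b})
assert abs((d_a + d_b) - (d_ab + (d_a - d_a_given_b))) < 1e-9

# Here dmAP_{a|b} > dmAP_a, so the sum of individual gains
# under-counts the joint gain, which is why mAP plus the per-error
# deltas falls short of 100 (Eq. 3):
print(round(d_a + d_b, 2), "<", round(d_ab, 2))  # 10.0 < 13.5
```

Plugging in the per-oracle gains of any detector in Fig. 5 the same way shows why their sum plus the base mAP lands below 100.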

4 Diagnosis of Two Sub-Tasks

4.1 Diagnosis of Human-Object Pair Localization

For an HOI detector, whether one- or two-stage, interaction classification relies on the accuracy of the human-object pair localization results. To disentangle the two sub-tasks, it is therefore crucial to evaluate whether the pair localization results alone are sufficient, without considering the interaction labels.

Two factors impact the quality of pair localization: the coverage of ground-truth pairs and the noisiness level of the detection results. For coverage, if a ground-truth human-object pair is missing from the detection results, the interaction classification module cannot recognize the interaction labels, leading to a false negative (FN). In terms of noisiness, if the pair localization results include too many human-object pairs without actual interactions, it creates a significant burden for the interaction module, leading to many false positives (FPs) when classifying their interaction labels.

Specifically, we calculate Pair Recall as the percentage of ground-truth human-object pairs that are present in the detection results. Due to the multi-label nature, multiple ground-truth pairs can match the same detected pair. In such cases, only one of them is counted towards recall computation, and the other duplicates are suppressed. We then report the average recall across the entire dataset.

To assess the noisiness of the detection results, we compute Pair Precision as the percentage of detected human-object pairs considered to be true positives (TPs). Similarly, we report the precision score at the dataset level.
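The matching procedure behind these two metrics can be sketched as follows. This is a simplified illustration with hypothetical helper names, assuming a detection matches a ground-truth pair when both the human and the object boxes overlap with IoU at least 0.5; the toolbox's actual implementation may differ in its matching details.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pair_recall_precision(gt_pairs, det_pairs, thr=0.5):
    """Each pair is (human_box, object_box).  A detection matches a GT
    pair if BOTH boxes overlap with IoU >= thr.  Each GT pair and each
    detection is credited at most once (duplicates are suppressed)."""
    matched_gt, matched_det = set(), set()
    for d, (dh, do) in enumerate(det_pairs):
        for g, (gh, go) in enumerate(gt_pairs):
            if g in matched_gt:
                continue
            if iou(dh, gh) >= thr and iou(do, go) >= thr:
                matched_gt.add(g)
                matched_det.add(d)
                break
    recall = 100.0 * len(matched_gt) / max(len(gt_pairs), 1)
    precision = 100.0 * len(matched_det) / max(len(det_pairs), 1)
    return recall, precision

# One GT pair; two detections, of which only the first is correct:
gt = [((0, 0, 10, 20), (12, 0, 20, 8))]
det = [((1, 0, 10, 20), (12, 0, 20, 8)),
       ((50, 50, 60, 60), (0, 0, 5, 5))]
r, p = pair_recall_precision(gt, det)  # r = 100.0, p = 50.0
```

Averaging these two scores over all test images gives the dataset-level Pair Recall and Pair Precision reported in Table 2.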

Fig. 7
figure 7

mAP improvement by fixing different types of errors on HICO-DET and VCOCO

Fig. 8
figure 8

mAP improvement by fixing different types of errors for the rare and non-rare HOI categories on HICO-DET

4.2 Diagnosis of Interaction Classification

Based on the human-object pair localization results, the interaction classification module needs to handle two cases.

4.2.1 Recognizing Incorrect Human-Object Pair Localizations

Incorrectly localized human-object pairs, which have no actual interactions, should not appear in the final output. Unlike multi-label interaction classification, recognizing an incorrect human-object pair localization is a binary classification problem. Intuitively, if the classification scores for the actual interaction categories are all very low, it suggests that the human and object do not have any real interaction. Thus, we compute the classification score for the negative class as \(1 - \max _i (p_i)\), where \(p_i\) represents the classification score for the i-th actual interaction category. To avoid selecting a specific threshold for this binary classification, we report the AP (average precision) score.
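A minimal sketch of this computation is given below. The function names are hypothetical, and the AP shown is a simplified, non-interpolated variant (precision averaged at each positive, ranked by descending score) rather than the exact evaluation code of the toolbox.

```python
def negative_score(p):
    """Score for the 'no real interaction' class: 1 - max_i p_i,
    high only when every interaction score in p is low."""
    return 1.0 - max(p)

def average_precision(scores, labels):
    """Threshold-free binary AP: average of the precision measured at
    each positive, with predictions ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, n_pos = 0.0, sum(labels)
    for i in order:
        if labels[i]:
            tp += 1
            ap += tp / (tp + fp)
        else:
            fp += 1
    return ap / n_pos if n_pos else 0.0

# One pair with a confident interaction (a correct localization) and
# one whose interaction scores are all low (a negative pair):
pair_scores = [[0.9, 0.1], [0.05, 0.10]]
neg = [negative_score(p) for p in pair_scores]   # approx. [0.1, 0.9]
assert average_precision(neg, [False, True]) == 1.0
```

Since the negative pair receives the highest negative-class score, it is ranked first and the AP is perfect in this toy case.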

4.2.2 Correct Human-Object Pair Localizations

For correctly localized pairs, multiple interaction labels may be associated with them, as shown in Fig. 3. Therefore, we compute the mAP score for classification across all possible interaction categories. Similar to the analysis in the previous section, we disregard the detection scores for correctly detected human-object pairs, disentangling the human-object pair localization and interaction classification tasks.

5 Diagnosis Results

5.1 Setup

In our analysis, we diagnose eight popular HOI detection models, four two-stage and four one-stage, covering a wide range of design choices (e.g., backbone, object/pair detector, and interaction classifier). We use the code and model weights provided by the authors.Footnote 2 A summary of these models can be found in Table 1. QPIC (Tamura et al., 2021), HOITR (Zou et al., 2021), and QAHOI (Chen and Yanai, 2021) share similar model designs based on the DETR object detection model (Carion et al., 2020); we therefore only investigate QAHOI here, as it reports a higher mAP.

Models used in our diagnosis. We give a brief summary of each model we used.

SCG (Zhang et al., 2021) solves HOI detection with graph neural networks in a two-stage design, conditioning the messages passed between pairs of nodes on their spatial relationships.

UPT (Zhang et al., 2022) proposes a two-stage Unary-Pairwise Transformer architecture that exploits both unary and pairwise representations for HOIs.

STIP (Zhang et al., 2022) is a two-stage method that uses a Transformer-based detector to generate interaction proposals first and then transforms the nonparametric interaction proposals into HOI predictions via a structure-aware Transformer.

RLIPv2 (Yuan et al., 2023) is a two-stage, fast-converging model that enables large-scale relational pre-training with pseudo-labeled data. After pre-training, it performs well on both HOI detection and scene graph generation.

CDN (Zhang et al., 2021) proposes a one-stage method that disentangles human-object detection and interaction classification in a cascade manner: it first uses a human-object pair generator and then an isolated interaction classifier to classify each human-object pair.

QAHOI (Chen and Yanai, 2021) proposes a transformer-based one-stage method that leverages a multi-scale architecture to extract features at different scales and uses query-based anchors to predict human-object pairs and their interactions as triplets.

GEN-VLKT (Liao et al., 2022) follows the one-stage cascaded design of CDN and uses guided embeddings, including instance-guided embeddings, to generate HOI instances. It further proposes a Visual-Linguistic Knowledge Transfer training strategy that improves interaction understanding by transferring knowledge from the pre-trained CLIP model (Radford et al., 2021).

MUREN (Kim et al., 2023) follows the one-stage paradigm and designs three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens for discovering HOI instances.

5.2 Holistic Error Analysis

The mAP improvement for the seven types of errors as well as FPs and FNs on both HICO-DET and V-COCO are shown in Fig. 7.

Table 2 Diagnosis results of human-object pair localization and interaction classification

On HICO-DET, among the seven error types across human-object pair localization and interaction classification, two stand out as significant across all HOI detectors, regardless of whether they use a one-stage or two-stage approach: object box errors and incorrect interaction classification errors. These errors stem from two main factors. First, HICO-DET has no overlap with datasets like COCO, which are typically used to pre-train object detectors or backbones, leading models to struggle with correctly localizing objects in human-object pairs. Second, HICO-DET features a large number of interactions, many of which are multi-labeled, making it difficult for models to distinguish them accurately.

On V-COCO, errors are primarily concentrated in object box errors and association errors between human and object bounding boxes. Notably, two-stage HOI detectors such as SCG and UPT show a higher frequency of association errors compared to others. Overall, the mAP improvement on V-COCO is less pronounced than on HICO-DET. This is partly because detectors and backbones are often pre-trained on the COCO dataset, from which V-COCO is derived, making correct association of detections more critical than detecting them. Additionally, V-COCO has a smaller and simpler set of interactions compared to the more abstract ones in HICO-DET (e.g., inspect, wield). It is worth noting that SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022) do not exhibit interaction errors on V-COCO due to their pre-processing and suppression techniques.

In terms of FPs and FNs, the last two panels of Fig. 7 show that for most HOI detectors on both HICO-DET and V-COCO, suppressing FPs brings a significantly higher mAP improvement than fixing FNs, except for SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022) on V-COCO. This suggests that incorrect triplets in the HOI detection results hold existing models back more than missed ground-truth triplets do.

5.3 Human-Object Pair Localization

We report the average Pair Recall and Pair Precision for the human-object pair localization task on both HICO-DET and V-COCO in Table 2.

Perhaps a little surprisingly, two-stage models produce fewer human-object pairs than their one-stage counterparts, even though they exhaustively pair all detected human and object bounding boxes. On both HICO-DET and V-COCO, two-stage models tend to have higher Pair Precision scores, indicating less noise (fewer incorrect human-object pairs) in the detection results. This is partially because, in two-stage models, NMS is usually applied before the pairing of human and object bounding boxes, as introduced in Sec. 2.5, which removes duplicates from the human-object pair localization results.

To examine the impact of this factor, we apply NMS to remove duplicate human-object pair localizations for the one-stage methods. After doing so, we indeed observe fewer detected human-object pairs, decreased Pair Recall, and increased Pair Precision (Tables 3 and 4). We also examine increasing the number of human-object pairs for two-stage models by lowering the object detection threshold \(\tau _d\). Note that the two-stage RLIPv2 (Yuan et al., 2023) does not apply NMS in its original model, so we use NMS to reduce its number of pairs for comparison. As expected, the effect is the reverse of that for one-stage models: we see more human-object pairs, increased Pair Recall, and decreased Pair Precision, as shown in Tables 3 and 4. However, these changes in human-object pair localization do not translate into significant changes in the final HOI detection mAP, indicating that the bottleneck lies in the subsequent interaction classification stage, which we analyze in the following section.
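The pair-level NMS used in this experiment can be sketched as follows. This is an illustrative implementation with an assumed IoU threshold of 0.7; the exact suppression rules and thresholds vary across the models we diagnose.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pair_nms(pairs, scores, iou_thr=0.7):
    """Greedy NMS over (human_box, object_box) pairs: keep pairs in
    descending score order, dropping any pair whose human AND object
    boxes both overlap an already-kept pair with IoU >= iou_thr."""
    order = sorted(range(len(pairs)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        h, o = pairs[i]
        duplicate = any(
            iou(h, pairs[k][0]) >= iou_thr
            and iou(o, pairs[k][1]) >= iou_thr
            for k in keep
        )
        if not duplicate:
            keep.append(i)
    return keep

# Two near-identical detections of the same pair plus one distinct pair:
pairs = [((0, 0, 10, 10), (20, 0, 30, 10)),
         ((0, 0, 10, 11), (20, 0, 30, 10)),   # duplicate of the first
         ((50, 50, 60, 60), (70, 70, 80, 80))]
assert pair_nms(pairs, [0.9, 0.8, 0.7]) == [0, 2]
```

Suppressing such near-duplicate pairs lowers Pair Recall slightly but raises Pair Precision, matching the trend in Tables 3 and 4.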

It is worth noting that the Pair Recall values for both one-stage and two-stage methods are significantly lower than 100, suggesting that many human-object pairs are discarded at this stage, with no chance to appear in the final output.

Table 3 Average recall and precision of one-stage and two-stage models on HICO-DET dataset
Table 4 Average recall and precision of one-stage and two-stage models on V-COCO dataset

5.4 Interaction Classification

In Table 2, we report the AP for classifying negative human-object pairs (Neg. AP) to evaluate whether the model can effectively suppress incorrectly detected human-object pairs by assigning them low confidence scores. As we can see, two-stage methods perform better on this task across both HICO-DET and V-COCO.

From the results of interaction mAP (Inter. mAP) in Table 2, we observe that although two-stage models achieve relatively higher Pair Precision, their interaction classification heads struggle to correctly classify all interactions (except for RLIPv2). In contrast, one-stage models provide more confident scores for correct interaction predictions, leading to higher Inter. mAP.

The advantages of two-stage versus one-stage models in human-object pair localization and interaction classification tend to cancel each other out. As a result, the overall HOI detection mAP for both two-stage and one-stage models is roughly the same (except for RLIPv2).

RLIPv2 achieves significantly higher HOI detection mAP than the other methods, both two-stage and one-stage. From Table 2, we can see that its main advantage lies in its substantially higher Inter. mAP, mainly owing to its use of large-scale relational data (VG (Krishna et al., 2017), COCO (Lin et al., 2014), and Objects365 (Shao et al., 2019)) for pre-training.

Fig. 9
figure 9

Visualization of errors of SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022) on the V-COCO dataset. We show the ground truth triplet (in cyan color) and the error type of the prediction (red characters) for each image. The red boxes represent predictions

Fig. 10
figure 10

Visualization of errors of CDN (Zhang et al., 2021) on the HICO-DET dataset. We show the ground truth triplet (in cyan color) and the error type of the prediction (red characters) for each image. The red boxes represent predictions

6 Discussions

6.1 Two-Stage Versus One-Stage HOI Detection Models

In terms of overall HOI detection mAP, there is no clear advantage of one paradigm over the other. However, with our diagnosis toolbox, we can gain more insights into the strengths and weaknesses of these two types of HOI detectors. On the one hand, according to our holistic error analysis, the mAP improvement shown in Fig. 7 demonstrates that both two-stage and one-stage detectors share the same bottleneck in the pipeline. They both struggle with significant errors in detecting the object in a human-object pair and in achieving accurate interaction classification when the pair localization is correct. Additionally, false positives (FPs) are more prevalent than false negatives (FNs) among the errors. On the other hand, after closely examining the human-object detection and interaction classification tasks, we observe that the two paradigms each have their own advantages. As shown in Table 2, two-stage models generally detect human-object pairs with similar Pair Recall but higher Pair Precision than their one-stage counterparts, indicating that the detection results are less noisy. Moreover, while two-stage models excel at recognizing negative human-object pairs without actual interactions (Neg. AP), one-stage methods are better at identifying actual interactions for correctly localized human-object pairs (Inter. mAP). As a result, the final HOI detection mAP scores are generally comparable between the two types of models (with the exception of RLIPv2, which will be discussed later).

Table 5 Diagnosis results for rare and non-rare HOI categories on HICO-DET

6.2 Different Backbones

A stronger backbone can help improve the mAP score of HOI detection. But where does the improvement come from? To answer this question, we study three methods: UPT (Zhang et al., 2022) (ResNet50 vs. ResNet101), RLIPv2 (Yuan et al., 2023) (SwinT vs. SwinL), and GEN-VLKT (Liao et al., 2022) (ResNet50 vs. ResNet101). According to Table 2, a better backbone for UPT mainly leads to slightly improved Pair Recall in terms of human-object pair localization, which in turn improves the final HOI mAP on both HICO-DET and V-COCO. For RLIPv2 and GEN-VLKT, however, the improvements span the entire HOI detection pipeline, enhancing both Pair Recall and interaction classification (Inter. mAP). Additionally, the stronger backbone enhances GEN-VLKT’s ability to discard incorrectly detected human-object pairs (better Neg. AP).

Moreover, as shown in Fig. 11, our holistic error analysis reveals that better backbones reduce the error significance (i.e., the mAP improvement obtainable from the oracle) of incorrect object localization in human-object pairs for UPT, RLIPv2, and GEN-VLKT on both HICO-DET and V-COCO. Specifically, for the state-of-the-art RLIPv2 model, switching to a stronger backbone also reduces the error significance of incorrect interaction classification on both HICO-DET and V-COCO, as well as of association errors on V-COCO. However, stronger backbones do not always reduce all errors: slightly increased error significance is observed for incorrect human detection in UPT and for interaction classification in GEN-VLKT on HICO-DET. This explains why the overall HOI detection mAP improvement from better backbones is less significant for these two methods (0.9 and 0.7, respectively) than the 7.5 improvement for RLIPv2 on HICO-DET.

6.3 Rare Versus Non-Rare HOI Categories

The HOI categories follow a long-tail distribution, where some interaction and object categories (e.g., ride horse) are more frequent than others (e.g., chase cat). On HICO-DET, the HOI categories are divided into rare and non-rare ones, where a category is considered rare if it has fewer than ten training instances. How does the abundance of training instances affect model performance? According to our holistic error analysis in Fig. 8, the overall distribution of error significance is similar for both rare and non-rare HOI categories. For instance, incorrect object detection in human-object pairs and incorrect interaction classification remain the main bottlenecks, with false positives (FPs) being more prominent than false negatives (FNs). However, models tend to fail more often in rare HOI categories due to fewer training instances, and fixing these errors results in a larger mAP improvement.

Furthermore, as shown in Table 5, the limited training data available for rare HOI categories leads to consistently lower performance in both human-object pair localization (Pair Recall and Pair Precision) and interaction classification (Inter. mAP). Even for the state-of-the-art model RLIPv2 (Yuan et al., 2023) with a strong SwinL backbone, interaction classification accuracy (Inter. mAP) drops significantly from 54.4 to 21.7 for rare categories.

Both object categories and interaction categories follow long-tail distributions, and their combination makes the long-tail issue even more pronounced. Improving accuracy in rare HOI categories (i.e., tail classes) remains an open problem (Kilickaya and Smeulders, 2020; Liao et al., 2022; Hou et al., 2021).

6.4 Performance on HICO-DET Versus V-COCO

Our holistic error analysis in Fig. 7 shows that the major sources of error significance are similar on the two datasets. Our diagnosis in Table 2 further reveals that existing methods tend to generate more human-object pairs to cover the ground truths (higher Pair Recall) on V-COCO than on HICO-DET, although the noisiness level is roughly the same (Pair Precision).

At the same time, existing methods perform better on the interaction classification task on V-COCO than on HICO-DET, in terms of both discarding incorrectly detected human-object pairs (Neg. AP) and recognizing the actual interactions of correctly detected pairs (Inter. mAP). As discussed earlier, part of this improvement can be attributed to the overlap between V-COCO and COCO, which is commonly used for detector/backbone pre-training.

Additionally, we randomly selected test images from the V-COCO dataset to examine the predictions of SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022), as shown in Fig. 9. Since both methods use post-processing techniques, no action errors are present in the predictions. We also provide more visualization results for different categories of errors in Fig. 10.

Fig. 11
figure 11

mAP improvement of different backbones on HICO-DET and V-COCO

6.5 Human-Object Pair Localization Versus Interaction Classification

Both our holistic error analysis via mAP improvement and the detailed breakdown of human-object pair localization and interaction classification reveal significant bottlenecks in both sub-tasks. For instance, in Table 2, we observe that the Pair Recall on HICO-DET is still much lower than 100, indicating that many ground-truth pairs are not being detected. Similarly, the low Inter. mAP suggests that accurately recognizing actual interactions remains a challenging task.

In practice, human-object pair localization largely depends on advances in generic single-object detection, which has been the focus of extensive research. Performance on standard benchmarks (e.g., COCO) has almost reached saturation. In contrast, interaction classification, a multi-label classification problem, has not yet been as thoroughly studied. The success of the state-of-the-art RLIPv2 model (Yuan et al., 2023) demonstrates the potential of addressing this challenge by leveraging large-scale relational data for pre-training.

6.6 Impact of Removing no_interaction on Association Errors

The presence of the no_interaction category affects the association errors of both one-stage and two-stage HOI detectors. To analyze its impact, we compute the mAP improvement after fixing association errors, both with and without the no_interaction category, on the HICO-DET and V-COCO datasets. As shown in Fig. 12, incorporating the no_interaction category can lead to higher association errors in both one-stage and two-stage models: the increased number of no_interaction human-object pairs in the ground truth makes it more challenging for the model to localize all pairs correctly. Interestingly, for some two-stage models (e.g., UPT, STIP, and RLIPv2 on V-COCO), including more human-object pairs in the ground truth results in fewer association errors. This occurs because two-stage models rely on pairing all detected objects, and the additional pairs in the ground truth may help reduce ambiguity in the association process.

6.7 Human Detection

The previous analysis of mAP improvement shows that human detection errors are not as significant as those for generic objects. Here, we examine the recall of human detection. On the HICO-DET dataset, the average recall of human detection is 91.5, while the average recall over all object categories is 88.0.Footnote 3 Human detection is thus easier, but its performance is still far from satisfactory.

Fig. 12
figure 12

mAP improvement of fixing association errors on HICO-DET and V-COCO, w/wo no_interaction category

7 Implications on Model Development

Our diagnosis toolbox makes a valuable contribution to both the current use of HOI detection models and future model development. By offering a comprehensive analysis across various datasets, it allows researchers and developers to pinpoint the strengths and weaknesses of their models from multiple perspectives. The detailed breakdown enables users to make informed choices between one-stage and two-stage models, showing that while there is no clear accuracy advantage, one-stage models typically require fewer training steps to achieve comparable performance to two-stage models. Additionally, the toolbox identifies key bottlenecks in HOI detection, such as object localization errors and misclassification of interactions, which highlight specific areas for model improvement. The insights also clarify why state-of-the-art models like RLIPv2 excel, particularly due to enhanced interaction classification accuracy. Overall, these findings support both the refinement of existing models and the development of future ones by underscoring the need for improvements in object localization and interaction classification.

8 Conclusion

In this paper, we introduce the first diagnosis toolbox for HOI detectors. We first conduct a holistic error analysis by defining a set of errors across the HOI detection pipeline and report the mAP improvement obtained by fixing each of them with an oracle. We then delve into the human-object pair localization and interaction classification tasks separately and provide a detailed breakdown inspection of each.

Detailed analyses are reported on both HICO-DET and V-COCO over eight state-of-the-art HOI detectors. We believe our diagnosis toolbox and analysis results will be helpful for fostering future research in this direction.