1 Introduction

Fig. 1

Different types of errors in Human-object Interaction Detectors. In the image, a human-object pair exhibits two interactions. We present seven sample HOI triplet detections, each corresponding to a different type of error. For clarity, we display only one triplet detection per image (solid line), while the dashed line indicates that a detection with a higher confidence score already exists

Fig. 2

Error categorization flow. Given a triplet prediction from the model output, it is passed into this logical flow and is either a true positive or classified as one type of error. Note that the different error types are mutually exclusive: each prediction is assigned to exactly one error type

Human-object interaction (HOI) detection aims to jointly localize humans and objects that have interactions in static images, for example, the person and snowboard in Fig. 3. It provides structured interpretations of the semantics of visual scenes, going beyond mere object recognition or detection. A successful HOI detection system is an essential building block for many downstream applications, such as visual question answering (Antol et al., 2015; Anderson et al., 2018; Shih et al., 2016; Lu et al., 2016; Wang et al., 2017), image captioning (Vinyals et al., 2016; Aneja et al., 2018; Feng et al., 2019; Li et al., 2019), and retrieval (Chao et al., 2015; Brown et al., 2020; Ng et al., 2022; Teichmann et al., 2019; Radenović et al., 2018).

Recent advances in HOI detection have been marked by increasing mean Average Precision (mAP) scores on standard benchmarks (Gupta and Malik, 2015; Chao et al., 2018; Gao et al., 2020; Ulutan et al., 2020; Gupta et al., 2019; Zhou and Chi, 2019; Zhang et al., 2021, 2022; Liao et al., 2022; Yuan et al., 2022; Yu et al., 2023; Ma et al., 2023; Li et al., 2022; Wu et al., 2022; Zhong et al., 2022; Jiang et al., 2022; Liu et al., 2022; Kim et al., 2023; Yuan et al., 2023; Zhu et al., 2024), marking remarkable progress. Nonetheless, relying on mAP as a summary metric does not provide sufficient insight into the nuances of model performance, such as which factors make one method perform better or where the bottlenecks for further improvement lie. This lack of detailed understanding may impede future advancements in the field. The same issue exists in object detection, a sub-task of HOI detection, where mAP is also the dominant evaluation metric. To address this, diagnostic toolboxes have been designed to provide more insightful quantitative breakdown analyses (Hoiem et al., 2012; Bolya et al., 2020), which have significantly boosted the development of object detection.

In this paper, we aim to build on the success of these works by introducing a diagnosis toolbox designed for HOI detection, fostering future research. Generally speaking, the HOI detection problem consists of two sub-tasks: 1) localizing pairs of interacting humans and objects (human-object pair localization) and 2) classifying their interactions, as illustrated in Fig. 1. These two tasks are not independent but stand in a cascaded relationship, as shown in Fig. 3. Specifically, in our toolbox, we first perform a holistic analysis of the overall HOI detection accuracy. Inspired by the object detection diagnosis toolbox (Bolya et al., 2020), we define a set of error types, as well as oracles to fix them, in the HOI detection pipeline across the human-object pair localization and interaction classification tasks. The mAP improvement achieved by applying each oracle is used to measure the significance of different errors: the larger the mAP improvement obtained by fixing a particular type of error, the more that error contributes to the failure of an HOI detector.

We then examine the tasks of human-object pair localization and interaction classification in detail. For human-object pair localization, we focus primarily on Recall to assess whether the model can capture all ground-truth pairs, which is crucial for the subsequent interaction classification stage. For interaction classification, the model must determine whether a detected human-object pair involves an actual interaction. To evaluate this binary classification task, we report the Average Precision (AP) score instead of accuracy, as this eliminates the need to select a specific threshold. For human-object pairs with actual interactions, we calculate mAP scores to address the multi-label classification aspect of the task. This approach allows us to analyze the two sub-tasks independently.

Our diagnosis toolbox is applicable to various methods across different datasets. Through both holistic and detailed investigations of human-object pair localization and interaction classification, the toolbox provides a comprehensive diagnosis report for eight state-of-the-art HOI detection models. With the detailed quantitative breakdown results, obtained by applying the error categorization flow in Fig. 2, we can now address key questions such as: “Are one-stage HOI detection models superior to two-stage models, or vice versa?” (there is no clear accuracy advantage between the two paradigms), “What is the main bottleneck in HOI detection?” (incorrect object localization in human-object pairs and misclassification of interactions), and “Why does the state-of-the-art method RLIPv2 (Yuan et al., 2023) perform better?” (it significantly improves interaction classification accuracy). For more detailed discussions of existing HOI detection models, please refer to Sect. 6.

To the best of our knowledge, this is the first toolbox specifically dedicated to diagnosing HOI detection in static images. By releasing our toolbox, we believe it will promote the future development of HOI detection models.

1.1 Related Work

There are several analysis tools for object detection (Lin et al., 2014; Hoiem et al., 2012; Bolya et al., 2020). The seminal work (Hoiem et al., 2012) shows how to analyze the influences of object characteristics on detection performance and the impact of different types of false positives. However, it requires extra annotations to help analyze the impact of object characteristics, which is unlikely to be scalable in large-scale benchmark datasets. TIDE (Bolya et al., 2020) improves the default evaluation tool provided by the COCO dataset (Lin et al., 2014). It provides a more general framework for quantifying performance improvements for different false positive and false negative errors in object detection and instance segmentation algorithms. Our quantitative analysis of different errors and tasks in HOI detection is motivated by TIDE (Bolya et al., 2020). Extending existing toolboxes like TIDE (Bolya et al., 2020) to HOI detection is not trivial due to the intertwined nature of the human-object pair localization and interaction classification tasks. TIDE (Bolya et al., 2020) focuses on single-box detection, whereas HOI detection involves both box pair localization and the subsequent cascaded interaction classification. In TIDE, the various error types are easily distinguished and mutually exclusive. However, in our case, the errors are naturally entangled, requiring carefully designed criteria to categorize them in alignment with the model structure. The definitions and calculations of error significance in our work differ significantly from TIDE. Additionally, we provide an in-depth analysis of each error type to better understand the model’s performance and identify bottlenecks.

A similar error diagnosis work (Chen et al., 2021) is proposed for the video relation detection task, adopting a holistic approach inspired by TIDE (Bolya et al., 2020). In our diagnosis toolbox, we go beyond holistic error analysis and also conduct detailed investigations into the two distinct sub-tasks of HOI detection, considering the cascaded nature of the HOI detection pipeline. In Gupta and Malik (2015), the authors define several types of false positive errors. However, the definition is specifically tailored to the annotation format of the V-COCO dataset, making it less generalizable to others. In contrast, our analysis is applicable to various benchmark datasets (Chao et al., 2018; Gupta and Malik, 2015). In Kilickaya and Smeulders (2020), the authors analyze a specific issue in HOI detection, namely the long-tail problem of HOI categories, and highlight limiting factors. Liu et al. (2022) proposes a new metric to improve HOI generalization by preventing the model from learning spurious object-verb correlations. Both Kilickaya and Smeulders (2020) and Liu et al. (2022) are complementary to our diagnosis tool and analysis results.

Fig. 3

Illustration of the two sub-tasks in HOI detection. (a) Localize all human-object pairs that have actual interactions (person and snowboard). (b) Classify the interactions between them (hold, jump, ride, stand on, and wear)

2 Preliminaries

2.1 Definition of HOI Detection

Given an input image I, the output of a human-object interaction (HOI) detector is a set of triplets \(\mathcal {S} = \left\{ \left( \textbf{b}_i^h, \textbf{b}_i^o, a_i \right) \right\} _{i=1}^K\), where \(\textbf{b}_i^h\), \(\textbf{b}_i^o\), and \(a_i\) represent the bounding box of the i-th human, the bounding box of the i-th object, and their interaction class, respectively. Each bounding box \(\textbf{b}_i^h\) and \(\textbf{b}_i^o\) contains both the spatial coordinates of the bounding box and the associated category label. Specifically, for the i-th human: \(\textbf{b}_i^h = \left( x_i^h, y_i^h, w_i^h, h_i^h, c_i^h \right) \), where \((x_i^h, y_i^h)\) are the coordinates of the top-left corner of the bounding box, \(w_i^h\) is its width, \(h_i^h\) is its height, and \(c_i^h\) is the category label associated with the human (usually fixed as ‘person’). In general relationship detection, the subject can be any object, e.g., chair in \(\texttt {<}\)chair on floor\(\texttt {>}\) or car in \(\texttt {<}\)car near fire_hydrant\(\texttt {>}\). HOI detection, however, studies only human-centric relationships, so the subject category is always restricted to ‘person’.

Similarly, for the i-th object: \( \textbf{b}_i^o = \left( x_i^o, y_i^o, w_i^o, h_i^o, c_i^o \right) \), where \((x_i^o, y_i^o)\) are the coordinates of the top-left corner of the object bounding box, \( w_i^o\) is its width, \(h_i^o\) is its height, and \(c_i^o\) is the category label of the object. The interaction class \(a_i\) represents the action or relationship between the human and the object (e.g., ‘holding’, ‘riding’, ‘looking at’). The number of such triplets is denoted by K, indicating the total number of detected interactions in the image. Note that the object category \(c_i^o\) can also be ‘person’, making human-human interactions valid instances in the HOI detection framework, as seen in datasets like HICO-DET; such human-human pairs are treated the same as human-object pairs. In essence, the HOI detection problem consists of two sub-tasks, as shown in Fig. 3. First, every human-object pair that has an actual interaction must be correctly localized. Unlike object detection, where individual object bounding boxes are predicted, the localization task here involves associating a pair of human and object boxes. This introduces additional complexity because the model needs to both detect the objects and establish the correct pairings between them. Once the human-object pairs are localized, the second sub-task is to recognize their interaction labels. Multiple interactions can occur for the same pair, making this a multi-label classification problem. For instance, the interaction between the person and the snowboard in Fig. 3 could be classified as hold, ride, or other relevant actions. Identifying all applicable interactions for each human-object pair is crucial for successful HOI detection.
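The triplet formalism above can be sketched as a minimal data structure. This is an illustrative sketch only; the class and field names are our own and not taken from any particular codebase:

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float          # top-left corner, x coordinate
    y: float          # top-left corner, y coordinate
    w: float          # width
    h: float          # height
    category: str     # associated category label, e.g. 'person'


@dataclass
class HOITriplet:
    human: Box        # subject box; its category is always 'person'
    obj: Box          # object box; may itself be 'person' (human-human pairs)
    action: str       # interaction class a_i, e.g. 'ride'


# one detected triplet for the person/snowboard example of Fig. 3
triplet = HOITriplet(
    human=Box(10, 20, 50, 120, 'person'),
    obj=Box(30, 90, 80, 40, 'snowboard'),
    action='ride',
)
```

An HOI detector outputs a set of K such triplets per image, typically alongside per-component confidence scores.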

2.2 Benchmark Datasets

HICO-DET (Chao et al., 2018) and V-COCO (Gupta and Malik, 2015) are two widely used benchmark datasets, both of which share the same 80 object categories as the COCO dataset (Lin et al., 2014). In HICO-DET, there are 117 interaction categories, resulting in 600 HOI categories, as certain combinations of objects and interactions are not applicable. To clarify, each HOI category represents a specific pairing of an interaction category (e.g., “hold”, “sit on”) with an object category (e.g., “cup”, “chair”); an HOI category is thus defined by the unique combination of an interaction and an object. HICO-DET comprises 47,776 Creative Commons images sourced from Flickr, including 38,118 for training and 9,658 for testing, featuring over 150,000 human-object pairs. We omit no_interaction annotations from HICO-DET due to incomplete annotation, leaving 44,329 images (35,801 training, 8,528 testing) with 520 HOI categories from 80 objects and 116 interactions.

V-COCO is based on the MS-COCO (Lin et al., 2014) dataset, which contains 5,400 images in the trainval subset and 4,946 images in the test subset. In V-COCO, there are 26 interaction categories. For each interaction, objects are annotated in three different roles: the agent, the instrument, or the object. The task is to detect the agent (human) and the objects in various roles for the interaction (e.g., \(\texttt {<}\)person cut_instrument knife\(\texttt {>}\), \(\texttt {<}\)person read_object book\(\texttt {>}\)).

2.3 Computing mAP

For an output triplet \(\left( b_{i}^{h}, b_{i}^{o}, a_{i}\right) \) from a model, it is compared with the ground-truth annotations, and considered to be a true positive (TP) for an HOI class if all the following conditions are satisfied:

  • The category labels of the human and object bounding boxes are both correct.

  • The intersection-over-union (IoU) w.r.t. the ground-truth annotations for the human \(\textrm{IoU}^{h}\) and object \(\textrm{IoU}^{o}\) both exceed 0.5, i.e., \(\min \bigl ( \textrm{IoU}^{h}, \textrm{IoU}^{o} \bigr ) > 0.5\).

  • The output interaction category \(a_i\) is correct.

If any of these conditions is not satisfied, it is considered as a false positive (FP).
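The matching conditions above can be sketched as follows, with boxes represented as \((x, y, w, h, c)\) tuples following the definition in Sect. 2.1 (the function names are ours, for illustration only):

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, iou_thresh=0.5):
    """Apply the three TP conditions to (human_box, obj_box, action) triplets,
    where each box is an (x, y, w, h, category) tuple."""
    (ph, po, pa), (gh, go, ga) = pred, gt
    labels_ok = ph[4] == gh[4] and po[4] == go[4]                          # condition 1
    ious_ok = min(iou(ph[:4], gh[:4]), iou(po[:4], go[:4])) > iou_thresh   # condition 2
    action_ok = pa == ga                                                   # condition 3
    return labels_ok and ious_ok and action_ok
```

For example, two identical boxes have IoU 1.0, and shifting a 10x10 box right by 5 yields IoU 1/3 (overlap 50, union 150).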

If multiple HOI predictions are matched to the same ground-truth HOI triplet, only the one with the highest confidence score is counted as a true positive (TP), while all others are considered false positives (FPs). Let the confidence scores for \(\textbf{b}_i^h\), \(\textbf{b}_i^o\), and \(a_i\) be denoted by \(s_i^h\), \(s_i^o\), and \(s_i^a\), respectively. The overall confidence score for the triplet \(( \textbf{b}_i^h, \textbf{b}_i^o, a_i)\) is computed as the product \(S_i = s_i^h \cdot s_i^o \cdot s_i^a\).
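The tie-breaking rule among predictions matched to the same ground-truth triplet can be sketched as below (a simplified stand-in for the benchmark evaluation code; the score keys are hypothetical):

```python
def select_tp(matched_preds):
    """Among predictions matched to the same GT triplet, keep the one with the
    highest product score S = s_h * s_o * s_a as the TP; the rest become FPs.
    Each prediction is a dict with confidence scores 's_h', 's_o', 's_a'."""
    scored = [(p['s_h'] * p['s_o'] * p['s_a'], i)
              for i, p in enumerate(matched_preds)]
    best = max(scored)[1]                      # index of the highest-scoring triplet
    fps = [i for i in range(len(matched_preds)) if i != best]
    return best, fps
```

Note that a prediction with high box scores but a low action score can still lose to one with balanced scores, since the product is what matters.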

Fig. 4

Illustration of two- and one-stage HOI detectors. A two-stage HOI detector separates human-object pair localization from interaction classification, whereas a one-stage model does not have this clear separation and directly outputs detected triplets

All output triplets are collected from all images in a benchmark for each HOI category. The detected triplets are then sorted in descending order based on their confidence scores. For each HOI category, given a triplet confidence threshold \(\tau _t\), the cumulative precision and recall are defined as:

$$\begin{aligned} P=\frac{N_{TP}}{N_{TP}+N_{FP}}, \quad R=\frac{N_{TP}}{{N_{GT}}}, \end{aligned}$$
(1)

for those triplets with confidence scores greater than \(\tau _t\). Here, P denotes precision and R represents recall. \(N_{TP}\), \(N_{FP}\), and \(N_{GT}\) are the number of true positives (TPs), false positives (FPs), and ground-truth triplets, respectively, for a particular HOI category. By varying the confidence threshold \(\tau _t\), P is interpolated to decrease monotonically, and the AP (Average Precision) is computed as the integral under the precision-recall curve. Finally, mAP is defined as the average AP across all HOI categories.
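The per-category AP computation described above can be sketched as follows. This is a simplified version under our own naming; the official benchmark code additionally performs per-image matching and handles score ties:

```python
import numpy as np


def average_precision(scores, is_tp, n_gt):
    """AP for one HOI category: sort detections by confidence, accumulate
    precision/recall as in Eq. 1, make precision monotonically decreasing,
    and integrate under the precision-recall curve."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(n_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)
    # interpolation: precision at recall r is the max precision at any recall >= r
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # integrate under the (step-wise) precision-recall curve
    r = np.concatenate(([0.0], recall))
    return float(np.sum((r[1:] - r[:-1]) * precision))

# mAP is then the mean of the per-category APs.
```

As a sanity check, a single correct detection covering the only ground-truth triplet yields AP = 1.0, while one TP out of two ground truths yields AP = 0.5.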

2.4 Removing no_interaction Category

In our diagnosis, we exclude the 80 no_interaction HOI categories on HICO-DET due to their incomplete annotations. The no_interaction label indicates that a localized human-object pair has no actual interaction. Each of the 80 object categories in HICO-DET is paired with a no_interaction action label, creating 80 distinct no_interaction HOI categories. In the detection setting, which includes both localization and classification, HICO-DET incorporates these 80 categories. However, this introduces inconsistencies between the ground-truth annotations and the evaluation protocol, particularly in the localization sub-task. To accurately compute the mAP for the no_interaction categories, every human-object pair without an interaction would need to be exhaustively annotated, which is not the case in practice. As shown in Fig. 5, the HICO-DET dataset has incomplete annotations for the no_interaction HOI categories, where both bounding box annotations and interaction labels are often missing. This poses a significant issue: even if a model correctly identifies human-object pairs and assigns the no_interaction label, those predictions may still be marked as false positives due to the lack of corresponding ground-truth annotations. Consequently, the model's mAP score is unfairly penalized for detections that are not fully accounted for in the dataset. The inconsistency arises because the evaluation protocol assumes exhaustive annotation of both interaction and non-interaction cases, which does not hold for the no_interaction categories. This misalignment between the model's performance and the reported metrics leads to an inaccurate assessment of its true ability to handle no_interaction cases. Note that this is not an issue for HOI classification in HICO (Chao et al., 2015), as no localization is required.

Fig. 5

Examples of the missing annotations of the no_interaction HOI categories. On the right, we show missing no_interaction labels and missing bounding boxes using dashed lines and dashed boxes, respectively

How can we solve this issue? Clearly, exhaustively annotating no_interaction human-object pairs is neither feasible nor scalable. In fact, the no_interaction HOI category is not needed. If there are no annotations stating that two objects have any actual interaction (e.g., catch or ride), it means they have no interaction, as the current annotations in Fig. 5 indicate. This setting is adopted in the V-COCO (Gupta and Malik, 2015) benchmark. Therefore, in our diagnosis, we remove all 80 no_interaction HOI categories and only consider the remaining 520 ones for the HICO-DET benchmark (Chao et al., 2018).

We would like to emphasize that excluding the mAP calculation for the no_interaction HOI categories does not cause an HOI detector to ignore human-object pairs with no interactions, nor does it underestimate the detector’s accuracy in our diagnosis. First, we do not remove the no_interaction label from the model’s output, so there is no need to retrain the model. An HOI model can still identify when a human-object pair has no interactions, which we discuss in detail in Sect. 4.2. Second, if the model incorrectly classifies a human-object pair with no interaction as having an actual interaction (e.g., ride bicycle), this incorrect output is counted as a false positive. Similarly, misclassifying a ride bicycle pair as no_interaction reduces the number of true positives. Such errors result in a lower mAP, which accurately reflects the model’s performance.

2.5 Two-Stage Versus One-Stage HOI Detectors

Existing HOI detectors can be broadly classified into two categories: two-stage and one-stage, as illustrated in Fig. 4. Two-stage HOI detectors first detect individual object instances, resulting in a set of human and object bounding boxes whose confidence scores exceed a fixed threshold \(\tau _d\). Every possible human-object box pair is then formed exhaustively and passed to the second stage, which classifies interactions using the object detector’s feature representations. Depending on the object detector used, NMS (non-maximum suppression) may be applied to eliminate duplicate object detections, ensuring no duplicates in the human-object pairs and final triplet outputs.
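The first stage of a two-stage detector, thresholding detections at \(\tau _d\) and exhaustively pairing humans with candidate objects, can be sketched as below (the detection representation is illustrative, not from any specific implementation):

```python
def enumerate_pairs(detections, tau_d=0.5):
    """Keep detections whose score reaches tau_d, then exhaustively pair every
    human with every other kept instance. `detections` is a list of
    (box, category, score) tuples; each pair is a candidate for the
    second-stage interaction classifier."""
    kept = [d for d in detections if d[2] >= tau_d]
    pairs = []
    for i, h in enumerate(kept):
        if h[1] != 'person':          # subjects are always humans
            continue
        for j, o in enumerate(kept):
            if i != j:                # a human may also pair with another person
                pairs.append((h, o))
    return pairs
```

With n kept humans and m kept instances overall, this produces on the order of n * m candidate pairs, which is why noisy first-stage output burdens the interaction classifier.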

In contrast, one-stage methods perform human-object pair localization and interaction classification together without a clear separation. A one-stage detector directly localizes human-object pairs that may have interactions and classifies the interactions between them, with feature representations shared between both tasks. NMS is typically used to remove duplicates in the final output of detected triplets.

Both paradigms are actively researched. One-stage detectors generally run faster than two-stage counterparts as they bypass individual object detection by coupling human-object pair localization and interaction classification. However, in terms of accuracy (i.e., mAP), neither paradigm has a clear advantage over the other.

3 Holistic Error Analysis

One way to diagnose a method is by analyzing the error patterns in its output. As discussed in Sect. 2.3, an HOI detection model can make errors at various stages of human-object pair localization and interaction classification, leading to more false positives (FPs), false negatives (FNs), fewer true positives (TPs), and ultimately, a lower mAP. Inspired by the diagnosis approach used in object detection (Bolya et al., 2020), we conduct a holistic analysis by first defining a set of error types specific to HOI detection, as illustrated in Fig. 1. We then quantify the significance of each error by determining the mAP improvement that could be achieved if the error were perfectly addressed using predefined oracles.

3.1 Error Categories

3.1.1 Human-Object Pair Localization Errors

We define the following set of errors in the human-object pair localization task.

  • Human box error: The detected object bounding box is correct, but the human bounding box is incorrect (either incorrect localization where \(\mathrm {IoU^h}<0.5\), or incorrect classification of the human category, or both).

  • Object box error: The detected human bounding box is correct, but the object bounding box is incorrect (either incorrect localization where \(\mathrm {IoU^o}<0.5\), or incorrect classification of the object category, or both).

  • Both boxes error: Neither the detected human nor object bounding box is correct.

  • Association error: Both the human and object bounding boxes are correct, but they have no actual interaction.

3.1.2 Interaction Classification Errors

Here, we focus only on human-object pairs with actual interactions that have already been correctly localized (otherwise, such pairs would fall under errors in the human-object pair localization task). We define the following interaction classification errors.

  • Duplicate error: The output action category is correct, but there is another detected triplet with a higher confidence score that has already matched the ground truth.

  • Interaction error: The output interaction is different from the ground-truth label.

3.1.3 Missed GT Error

After all predicted triplets have been matched with ground truth triplets and classified into one of the above error categories, any remaining unmatched ground truth triplets are considered missed GT. These unmatched triplets may result from either missed pair localizations or missed interaction classifications.

We demonstrate the process of categorizing a given triplet prediction into either a true positive or a specific error category in Fig. 2. The flow chart shows that each prediction is assigned to exactly one error category based on a priority system. The decision process begins by determining whether the predicted human-object pair matches a ground-truth triplet. If a match is found, we next check whether the predicted action is correct. If the action is correctly classified, we further assess whether the prediction is a duplicate: if not, the prediction is a true positive; otherwise, it is categorized as a duplicate error. If the interaction is incorrect, the prediction is labeled as an interaction error. If the human-object pair does not match any ground truth, the process examines whether the human and object bounding boxes are individually correct. If the object box matches one of the ground-truth boxes but the human box does not, the prediction is classified as a human box error; conversely, if the human box is correct but the object box is not, it is classified as an object box error. When both boxes are incorrect, the prediction is a both boxes error. If both boxes are correct but only the association between the human and object is wrong, it is categorized as an association error. After all predictions are processed, any unmatched ground-truth triplets are classified as missed GT errors. This design ensures that each prediction is assigned to one and only one error category, making the error categorization mutually exclusive and avoiding overlap.
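The decision flow can be summarized in a short Python sketch; the boolean arguments stand in for the matching computations described above and are assumed to be precomputed:

```python
def categorize(matched_gt_pair, correct_action, is_duplicate,
               human_box_ok, object_box_ok):
    """Assign a prediction to exactly one outcome, following the flow chart.
    The boolean inputs summarize the matching step:
      matched_gt_pair -- the predicted pair matches a GT pair with interactions
      correct_action  -- the predicted action matches the GT label
      is_duplicate    -- a higher-scoring TP already matched this GT triplet
      human_box_ok / object_box_ok -- the individual boxes match some GT box."""
    if matched_gt_pair:
        if correct_action:
            return 'duplicate_error' if is_duplicate else 'true_positive'
        return 'interaction_error'
    # the pair does not match any GT pair: inspect the individual boxes
    if human_box_ok and object_box_ok:
        return 'association_error'    # both boxes fine, only the pairing is wrong
    if object_box_ok:
        return 'human_box_error'
    if human_box_ok:
        return 'object_box_error'
    return 'both_boxes_error'
```

Because the branches are exhaustive and ordered by priority, every prediction maps to exactly one category, mirroring the mutual exclusivity of the flow chart.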

3.2 Error Significance

Among all such errors, one may wonder which is most critical for improving the mAP of HOI detection. To this end, we compute the improvement of mAP obtained by fixing each error type with an “oracle”:

$$\begin{aligned} \varDelta mAP_o = mAP_o - mAP, \end{aligned}$$
(2)

where \(mAP_o\) denotes the mAP score after applying the oracle o to fix an error type. We call it an oracle as it is assumed to solve the error perfectly, so \(\varDelta mAP_o\) measures the largest mAP improvement attainable by fixing that error. The larger the \(\varDelta mAP_o\) for an error, the more significant it is as a bottleneck for an HOI detection model.
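Equation 2 amounts to a per-oracle subtraction against the base mAP. A minimal sketch, with purely hypothetical numbers for illustration:

```python
def error_significance(base_map, oracle_maps):
    """Delta-mAP of Eq. 2 for each oracle. `oracle_maps` maps an error name to
    the mAP obtained after its oracle fixes that error type in isolation."""
    return {name: m - base_map for name, m in oracle_maps.items()}


# hypothetical mAP values, for illustration only
deltas = error_significance(34.4, {'object_box': 41.2, 'interaction': 45.0})
bottleneck = max(deltas, key=deltas.get)   # the most significant error type
```

Note that, as discussed in Sect. 3.4, these per-oracle deltas are computed independently and do not sum to \(100-mAP\).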

We now provide details of how each oracle fixes its corresponding error type. Visual examples of fixing the different error types are shown in Fig. 6.

Fig. 6

Examples of different error types in real images and how we fix them using oracles. There are three human-object pairs in the ground truth, with three interactions between each pair. There are six predicted triplets (with confidence scores) in the second row, corresponding to six different error types. We fix each of them into true positives and remove duplicates afterward

3.2.1 Oracles for Fixing the Human-Object Pair Localization Errors

Fixing these errors involves four oracles.

  • Human box oracle: Fix the human box detection and action category, making it a true positive. If duplicates are made, suppress the lower-scoring prediction. Specifically, we first fix the corresponding box. Then, we search the ground truth triplets to find a matching human-object pair and assign the prediction the correct action category to make it a true positive. If multiple action categories are possible, we randomly select one.

  • Object box oracle: Fix such false positives analogously to the human box oracle.

  • Both boxes oracle: Since both boxes are incorrect, we cannot decide which ground truth triplet the detection attempts to match. We just remove this kind of false positive prediction.

  • Association oracle: Correct the pair association and action label, making it a true positive. If duplicates are made in this way, suppress the lower-scoring prediction. Specifically, we attempt to fix either the human or object box to match the prediction with a ground truth human-object pair, followed by assigning the correct action category. There may be multiple ways to resolve this error, so our rule is to make the minimal changes necessary. For instance, if the object box and action category are already correct, we only adjust the human box. Additionally, we ensure each ground truth pair is fixed only once to maintain consistency when addressing different error types.

3.2.2 Oracles for Fixing Interaction Classification Errors

In this case, we fix false positives caused by duplicate error or interaction error.

  • Duplicate oracle: We directly remove all duplicate predictions.

  • Interaction classification oracle: Fix the action label to make it a true positive. If duplicates are made in this way, suppress the lower-scoring prediction.

3.2.3 Oracles for Fixing Missed GT Errors

Note that even after fixing all the above errors, some ground-truth triplets still do not match any prediction; their count is the number of missed ground-truth triplets. We reduce the number of ground-truth HOI triplets in the mAP calculation by this number.

3.3 Grouping Errors into Two Categories

In some cases, we may need a more concise summary of the error patterns. To this end, we group the errors introduced earlier into FPs and FNs, regardless of where they stem from in the HOI detection pipeline, and measure the mAP improvement for each separately.

Oracles for grouped errors.

  • False positive oracle: Remove all false positive predictions.

  • False negative oracle: Set the number of ground truth triplets to the number of true positive predictions.

Table 1 Details of HOI detection models used in our analysis, including four two-stage and four one-stage detectors. They cover a wide range of design choices (e.g., backbone, object/pair detector, and interaction classifier)

3.4 Oddities of mAP Improvement

Similar to Bolya et al. (2020), the mAP improvement in our case has the issue that the \(\varDelta mAP\) values of different error types do not sum to \(100-mAP\). For example, adding the mAP of CDN (Zhang et al., 2021) to \(\varDelta mAP_{FP}\) and \(\varDelta mAP_{FN}\) (34.4 + 33.1 + 12.94) yields 80.44, not 100. As pointed out in Bolya et al. (2020), the reason is that fixing different errors at once gives a larger mAP improvement than fixing each error on its own. We discuss the details below; for a more thorough analysis, we direct readers to Bolya et al. (2020).

Summing the \(\varDelta mAP_{o_i}\) over all error types does not result in \(100-mAP\). Specifically, for the set of oracles \(\mathcal {O}=\left\{ o_1, o_2,..., o_n \right\} \), we generally have:

$$\begin{aligned} mAP + \varDelta mAP_{o_1} + ... + \varDelta mAP_{o_n} \ne 100. \end{aligned}$$
(3)

This occurs because we do not compute errors progressively. In contrast, fixing the errors progressively would give:

$$\begin{aligned} mAP + \varDelta mAP_{o_1, o_2, ..., o_n} = 100. \end{aligned}$$
(4)

The progressive error \(\varDelta mAP_{a|b}\) represents the change in mAP after applying oracle ‘a’ with oracle ‘b’ already applied:

$$\begin{aligned} \varDelta mAP_{a|b} = mAP_{a,b} - mAP_b. \end{aligned}$$
(5)

While progressive error computation assigns importance to error i as \(\varDelta mAP_{o_i|o_1,...,o_{i-1}}\), it inflates precision after reducing false positives and lacks intuitive appeal. Errors are rarely addressed in isolation; instead, multiple error types remain during improvement. Thus, observing \(\varDelta mAP_{a|b}\) is not practical since there is no state in which only error ‘b’ has been corrected. Relating \(\varDelta mAP_a + \varDelta mAP_b\) to \(\varDelta mAP_{a,b}\) shows they differ by \(\varDelta mAP_a - \varDelta mAP_{a|b}\). Expanding the terms:

$$\begin{aligned} \varDelta mAP_{a,b} = mAP_{a,b} - mAP, \end{aligned}$$
(6)
$$\begin{aligned} \varDelta mAP_a + \varDelta mAP_b = mAP_a + mAP_b - 2mAP. \end{aligned}$$
(7)

Rearranging the terms:

$$\begin{aligned} mAP = mAP_{a,b} - \varDelta mAP_{a,b}. \end{aligned}$$
(8)

Substituting into Eqs. 6 and 7:

$$\begin{aligned} \varDelta mAP_a + \varDelta mAP_b&= mAP_a + mAP_b - mAP \nonumber \\&\quad - mAP_{a,b} + \varDelta mAP_{a,b}. \end{aligned}$$
(9)

Grouping the terms gives:

$$\begin{aligned} \varDelta mAP_a + \varDelta mAP_b = \varDelta mAP_{a,b} + (\varDelta mAP_a - \varDelta mAP_{a|b}). \end{aligned}$$
(10)

Since \(\varDelta mAP_{a|b}\) is typically larger than \(\varDelta mAP_a\), \(\varDelta mAP_{a,b}\) is generally greater than \(\varDelta mAP_a + \varDelta mAP_b\), confirming Eq. 3. These nuances call for careful interpretation of mAP-based error analysis, given the metric's non-intuitive additivity properties.
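To make the non-additivity concrete, the following sketch verifies Eq. 10 numerically. The mAP values are purely hypothetical, chosen only to illustrate the algebra; they are not measurements from any model in our analysis.

```python
# Hypothetical mAP values (illustrative, not measured):
# mAP     - original score
# mAP_a   - after fixing error type a alone
# mAP_b   - after fixing error type b alone
# mAP_ab  - after fixing both error types at once
mAP, mAP_a, mAP_b, mAP_ab = 34.4, 40.1, 38.7, 47.9

d_a = mAP_a - mAP              # delta mAP_a
d_b = mAP_b - mAP              # delta mAP_b
d_ab = mAP_ab - mAP            # delta mAP_{a,b}  (Eq. 6)
d_a_given_b = mAP_ab - mAP_b   # delta mAP_{a|b}  (Eq. 5)

# Eq. 10: dmAP_a + dmAP_b = dmAP_{a,b} + (dmAP_a - dmAP_{a|b})
assert abs((d_a + d_b) - (d_ab + (d_a - d_a_given_b))) < 1e-9

# Here dmAP_{a|b} > dmAP_a, so the sum of individual gains
# under-counts the joint gain, which is why mAP plus the per-error
# deltas falls short of 100 (Eq. 3):
print(round(d_a + d_b, 2), "<", round(d_ab, 2))  # 10.0 < 13.5
```

Plugging in the per-oracle gains of any detector in Fig. 5 the same way shows why their sum plus the base mAP lands below 100.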

4 Diagnosis of Two Sub-Tasks

4.1 Diagnosis of Human-Object Pair Localization

For an HOI detector, whether one- or two-stage, interaction classification relies on the accuracy of the human-object pair localization results. To disentangle the two sub-tasks, it is therefore crucial to evaluate whether the pair localization results alone are sufficient, without considering the interaction labels.

Two factors impact the quality of pair localization: the coverage of ground-truth pairs and the noisiness level of the detection results. For coverage, if a ground-truth human-object pair is missing from the detection results, the interaction classification module cannot recognize the interaction labels, leading to a false negative (FN). In terms of noisiness, if the pair localization results include too many human-object pairs without actual interactions, it creates a significant burden for the interaction module, leading to many false positives (FPs) when classifying their interaction labels.

Specifically, we calculate Pair Recall as the percentage of ground-truth human-object pairs that are present in the detection results. Due to the multi-label nature, multiple ground-truth pairs can match the same detected pair. In such cases, only one of them is counted towards recall computation, and the other duplicates are suppressed. We then report the average recall across the entire dataset.

To assess the noisiness of the detection results, we compute Pair Precision as the percentage of detected human-object pairs considered to be true positives (TPs). Similarly, we report the precision score at the dataset level.
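The matching procedure behind these two metrics can be sketched as follows. This is a simplified illustration with hypothetical helper names, assuming a detection matches a ground-truth pair when both the human and the object boxes overlap with IoU at least 0.5; the toolbox's actual implementation may differ in its matching details.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pair_recall_precision(gt_pairs, det_pairs, thr=0.5):
    """Each pair is (human_box, object_box).  A detection matches a GT
    pair if BOTH boxes overlap with IoU >= thr.  Each GT pair and each
    detection is credited at most once (duplicates are suppressed)."""
    matched_gt, matched_det = set(), set()
    for d, (dh, do) in enumerate(det_pairs):
        for g, (gh, go) in enumerate(gt_pairs):
            if g in matched_gt:
                continue
            if iou(dh, gh) >= thr and iou(do, go) >= thr:
                matched_gt.add(g)
                matched_det.add(d)
                break
    recall = 100.0 * len(matched_gt) / max(len(gt_pairs), 1)
    precision = 100.0 * len(matched_det) / max(len(det_pairs), 1)
    return recall, precision

# One GT pair; two detections, of which only the first is correct:
gt = [((0, 0, 10, 20), (12, 0, 20, 8))]
det = [((1, 0, 10, 20), (12, 0, 20, 8)),
       ((50, 50, 60, 60), (0, 0, 5, 5))]
r, p = pair_recall_precision(gt, det)  # r = 100.0, p = 50.0
```

Averaging these two scores over all test images gives the dataset-level Pair Recall and Pair Precision reported in Table 2.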

Fig. 7
figure 7

mAP improvement by fixing different types of errors on HICO-DET and VCOCO

Fig. 8
figure 8

mAP improvement by fixing different types of errors for the rare and non-rare HOI categories on HICO-DET

4.2 Diagnosis of Interaction Classification

Based on the human-object pair localization results, the interaction classification module needs to handle two cases.

4.2.1 Recognizing Incorrect Human-Object Pair Localizations

Incorrectly localized human-object pairs, which have no actual interactions, should not appear in the final output. Unlike multi-label interaction classification, recognizing an incorrect human-object pair localization is a binary classification problem. Intuitively, if the classification scores for the actual interaction categories are all very low, it suggests that the human and object do not have any real interaction. Thus, we compute the classification score for the negative class as \(1 - \max _i (p_i)\), where \(p_i\) represents the classification score for the i-th actual interaction category. To avoid selecting a specific threshold for this binary classification, we report the AP (average precision) score.
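A minimal sketch of this computation is given below. The function names are hypothetical, and the AP shown is a simplified, non-interpolated variant (precision averaged at each positive, ranked by descending score) rather than the exact evaluation code of the toolbox.

```python
def negative_score(p):
    """Score for the 'no real interaction' class: 1 - max_i p_i,
    high only when every interaction score in p is low."""
    return 1.0 - max(p)

def average_precision(scores, labels):
    """Threshold-free binary AP: average of the precision measured at
    each positive, with predictions ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, n_pos = 0.0, sum(labels)
    for i in order:
        if labels[i]:
            tp += 1
            ap += tp / (tp + fp)
        else:
            fp += 1
    return ap / n_pos if n_pos else 0.0

# One pair with a confident interaction (a correct localization) and
# one whose interaction scores are all low (a negative pair):
pair_scores = [[0.9, 0.1], [0.05, 0.10]]
neg = [negative_score(p) for p in pair_scores]   # approx. [0.1, 0.9]
assert average_precision(neg, [False, True]) == 1.0
```

Since the negative pair receives the highest negative-class score, it is ranked first and the AP is perfect in this toy case.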

4.2.2 Correct Human-Object Pair Localizations

For correctly localized pairs, multiple interaction labels may be associated with them, as shown in Fig. 3. Therefore, we compute the mAP score for classification across all possible interaction categories. Similar to the analysis in the previous section, we disregard the detection scores for correctly detected human-object pairs, disentangling the human-object pair localization and interaction classification tasks.

5 Diagnosis Results

5.1 Setup

In our analysis, we diagnose eight popular HOI detection models, four two-stage and four one-stage, covering a wide range of design choices (e.g., backbone, object/pair detector, and interaction classifier). We use the code and model weights provided by the authors.Footnote 2 A summary of these models can be found in Table 1. QPIC (Tamura et al., 2021), HOITR (Zou et al., 2021), and QAHOI (Chen and Yanai, 2021) share similar model designs based on the DETR object detection model (Carion et al., 2020); we therefore only investigate QAHOI here, as it reports a higher mAP.

Models used in our diagnosis. We give a brief summary of each model we used.

SCG (Zhang et al., 2021) solves HOI detection with graph neural networks in a two-stage design, conditioning the messages passed between pairs of nodes on their spatial relationships.

UPT (Zhang et al., 2022) proposes a two-stage Unary-Pairwise Transformer architecture that exploits both unary and pairwise representations for HOIs.

STIP (Zhang et al., 2022) is a two-stage method that uses a Transformer-based detector to generate interaction proposals first and then transforms the nonparametric interaction proposals into HOI predictions via a structure-aware Transformer.

RLIPv2 (Yuan et al., 2023) is a two-stage, fast-converging model that enables large-scale relational pre-training with pseudo-labeled data. After pre-training, it performs well on both HOI detection and scene graph generation.

CDN (Zhang et al., 2021) proposes a one-stage method that disentangles human-object detection and interaction classification in a cascade manner: it first uses a human-object pair generator and then an isolated interaction classifier to classify each human-object pair.

QAHOI (Chen and Yanai, 2021) proposes a transformer-based one-stage method that leverages a multi-scale architecture to extract features at different scales and uses query-based anchors to predict human-object pairs and their interactions as triplets.

GEN-VLKT (Liao et al., 2022) follows the one-stage cascaded design of CDN and uses guided embeddings, including instance-guided embeddings, to generate HOI instances. It further proposes a Visual-Linguistic Knowledge Transfer training strategy that improves interaction understanding by transferring knowledge from the pre-trained CLIP model (Radford et al., 2021).

MUREN (Kim et al., 2023) follows the one-stage paradigm and designs three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens for discovering HOI instances.

5.2 Holistic Error Analysis

The mAP improvement for the seven types of errors as well as FPs and FNs on both HICO-DET and V-COCO are shown in Fig. 7.

Table 2 Diagnosis results of human-object pair localization and interaction classification

On HICO-DET, among the seven error types across human-object pair localization and interaction classification, two stand out as significant across all HOI detectors, regardless of whether they use a one-stage or two-stage approach: object box errors and incorrect interaction classification errors. These errors stem from two main factors. First, HICO-DET has no overlap with datasets like COCO, which are typically used to pre-train object detectors or backbones, leading models to struggle with correctly localizing objects in human-object pairs. Second, HICO-DET features a large number of interactions, many of which are multi-labeled, making it difficult for models to distinguish them accurately.

On V-COCO, errors are primarily concentrated in object box errors and association errors between human and object bounding boxes. Notably, two-stage HOI detectors such as SCG and UPT show a higher frequency of association errors compared to others. Overall, the mAP improvement on V-COCO is less pronounced than on HICO-DET. This is partly because detectors and backbones are often pre-trained on the COCO dataset, from which V-COCO is derived, making correct association of detections more critical than detecting them. Additionally, V-COCO has a smaller and simpler set of interactions compared to the more abstract ones in HICO-DET (e.g., inspect, wield). It is worth noting that SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022) do not exhibit interaction errors on V-COCO due to their pre-processing and suppression techniques.

In terms of FPs and FNs, the last two panels of Fig. 7 show that for most HOI detectors on both HICO-DET and V-COCO, suppressing FPs brings a significantly higher mAP improvement than fixing FNs, except for SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022) on V-COCO. This suggests that incorrect triplets in the HOI detection results hold existing models back more than missed ground-truth triplets do.

5.3 Human-Object Pair Localization

We report the average Pair Recall and Pair Precision for the human-object pair localization task on both HICO-DET and V-COCO in Table 2.

Perhaps a little surprisingly, two-stage models produce fewer human-object pairs than their one-stage counterparts, even though they exhaustively pair all detected human and object bounding boxes. On both HICO-DET and V-COCO, two-stage models tend to have higher Pair Precision scores, indicating less noise (fewer incorrect human-object pairs) in the detection results. This is partially because, in two-stage models, NMS is usually applied before the pairing of human and object bounding boxes, as introduced in Sec. 2.5, which removes duplicates from the human-object pair localization results.

To examine the impact of this factor, we apply NMS to remove duplicate human-object pair localizations for the one-stage methods. After doing so, we indeed observe fewer detected human-object pairs, decreased Pair Recall, and increased Pair Precision (Tables 3 and 4). We also examine increasing the number of human-object pairs for two-stage models by lowering the object detection threshold \(\tau _d\). Note that the two-stage RLIPv2 (Yuan et al., 2023) does not apply NMS in its original model, so we use NMS to reduce its number of pairs for comparison. As expected, the effect is the reverse of that for one-stage models: we see more human-object pairs, increased Pair Recall, and decreased Pair Precision, as shown in Tables 3 and 4. However, these changes in human-object pair localization do not translate into significant changes in the final HOI detection mAP, indicating that the bottleneck lies in the subsequent interaction classification stage, which we analyze in the following section.
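The pair-level NMS used in this experiment can be sketched as follows. This is an illustrative implementation with an assumed IoU threshold of 0.7; the exact suppression rules and thresholds vary across the models we diagnose.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pair_nms(pairs, scores, iou_thr=0.7):
    """Greedy NMS over (human_box, object_box) pairs: keep pairs in
    descending score order, dropping any pair whose human AND object
    boxes both overlap an already-kept pair with IoU >= iou_thr."""
    order = sorted(range(len(pairs)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        h, o = pairs[i]
        duplicate = any(
            iou(h, pairs[k][0]) >= iou_thr
            and iou(o, pairs[k][1]) >= iou_thr
            for k in keep
        )
        if not duplicate:
            keep.append(i)
    return keep

# Two near-identical detections of the same pair plus one distinct pair:
pairs = [((0, 0, 10, 10), (20, 0, 30, 10)),
         ((0, 0, 10, 11), (20, 0, 30, 10)),   # duplicate of the first
         ((50, 50, 60, 60), (70, 70, 80, 80))]
assert pair_nms(pairs, [0.9, 0.8, 0.7]) == [0, 2]
```

Suppressing such near-duplicate pairs lowers Pair Recall slightly but raises Pair Precision, matching the trend in Tables 3 and 4.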

It is worth noting that the Pair Recall values for both one-stage and two-stage methods are significantly lower than 100, suggesting that many human-object pairs are discarded at this stage, with no chance to appear in the final output.

Table 3 Average recall and precision of one-stage and two-stage models on HICO-DET dataset
Table 4 Average recall and precision of one-stage and two-stage models on V-COCO dataset

5.4 Interaction Classification

In Table 2, we report the AP for classifying negative human-object pairs (Neg. AP) to evaluate whether the model can effectively suppress incorrectly detected human-object pairs by assigning them low confidence scores. As we can see, two-stage methods perform better on this task across both HICO-DET and V-COCO.

From the results of interaction mAP (Inter. mAP) in Table 2, we observe that although two-stage models achieve relatively higher Pair Precision, their interaction classification heads struggle to correctly classify all interactions (except for RLIPv2). In contrast, one-stage models provide more confident scores for correct interaction predictions, leading to higher Inter. mAP.

The advantages of two-stage versus one-stage models in human-object pair localization and interaction classification tend to cancel each other out. As a result, the overall HOI detection mAP for both two-stage and one-stage models is roughly the same (except for RLIPv2).

RLIPv2 achieves significantly higher HOI detection mAP than the other methods, both two-stage and one-stage. From Table 2, we can see that its main advantage lies in its substantially higher Inter. mAP, mainly owing to its use of large-scale relational data (VG (Krishna et al., 2017), COCO (Lin et al., 2014), and Objects365 (Shao et al., 2019)) for pre-training.

Fig. 9
figure 9

Visualization of errors of SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022) on the V-COCO dataset. We show the ground truth triplet (in cyan color) and the error type of the prediction (red characters) for each image. The red boxes represent predictions

Fig. 10
figure 10

Visualization of errors of CDN (Zhang et al., 2021) on the HICO-DET dataset. We show the ground truth triplet (in cyan color) and the error type of the prediction (red characters) for each image. The red boxes represent predictions

6 Discussions

6.1 Two-Stage Versus One-Stage HOI Detection Models

In terms of overall HOI detection mAP, there is no clear advantage of one paradigm over the other. However, with our diagnosis toolbox, we can gain more insights into the strengths and weaknesses of these two types of HOI detectors. On the one hand, according to our holistic error analysis, the mAP improvement shown in Fig. 7 demonstrates that both two-stage and one-stage detectors share the same bottleneck in the pipeline. They both struggle with significant errors in detecting the object in a human-object pair and in achieving accurate interaction classification when the pair localization is correct. Additionally, false positives (FPs) are more prevalent than false negatives (FNs) among the errors. On the other hand, after closely examining the human-object detection and interaction classification tasks, we observe that the two paradigms each have their own advantages. As shown in Table 2, two-stage models generally detect human-object pairs with similar Pair Recall but higher Pair Precision than their one-stage counterparts, indicating that the detection results are less noisy. Moreover, while two-stage models excel at recognizing negative human-object pairs without actual interactions (Neg. AP), one-stage methods are better at identifying actual interactions for correctly localized human-object pairs (Inter. mAP). As a result, the final HOI detection mAP scores are generally comparable between the two types of models (with the exception of RLIPv2, which will be discussed later).

Table 5 Diagnosis results for rare and non-rare HOI categories on HICO-DET

6.2 Different Backbones

A stronger backbone can help improve the mAP score of HOI detection. But where does the improvement come from? To answer this question, we study three methods: UPT (Zhang et al., 2022) (ResNet50 vs. ResNet101), RLIPv2 (Yuan et al., 2023) (SwinT vs. SwinL), and GEN-VLKT (Liao et al., 2022) (ResNet50 vs. ResNet101). According to Table 2, a better backbone for UPT mainly leads to slightly improved Pair Recall in terms of human-object pair localization, which in turn improves the final HOI mAP on both HICO-DET and V-COCO. For RLIPv2 and GEN-VLKT, however, the improvements span the entire HOI detection pipeline, enhancing both Pair Recall and interaction classification (Inter. mAP). Additionally, the stronger backbone enhances GEN-VLKT’s ability to discard incorrectly detected human-object pairs (better Neg. AP).

Moreover, as shown in Fig. 11, our holistic error analysis reveals that better backbones reduce the error significance (i.e., the mAP improvement obtainable from the oracle) of incorrect object localization in human-object pairs for UPT, RLIPv2, and GEN-VLKT on both HICO-DET and V-COCO. Specifically, for the state-of-the-art RLIPv2 model, switching to a stronger backbone also reduces the error significance of incorrect interaction classification on both HICO-DET and V-COCO, as well as of association errors on V-COCO. However, stronger backbones do not always reduce all errors: slightly increased error significance is observed for incorrect human detection in UPT and for interaction classification in GEN-VLKT on HICO-DET. This explains why the overall HOI detection mAP improvement from better backbones is less significant for these two methods (0.9 and 0.7, respectively) than the 7.5 improvement for RLIPv2 on HICO-DET.

6.3 Rare Versus Non-Rare HOI Categories

The HOI categories follow a long-tail distribution, where some interaction and object categories (e.g., ride horse) are more frequent than others (e.g., chase cat). On HICO-DET, the HOI categories are divided into rare and non-rare ones, where a category is considered rare if it has fewer than ten training instances. How does the abundance of training instances affect model performance? According to our holistic error analysis in Fig. 8, the overall distribution of error significance is similar for both rare and non-rare HOI categories. For instance, incorrect object detection in human-object pairs and incorrect interaction classification remain the main bottlenecks, with false positives (FPs) being more prominent than false negatives (FNs). However, models tend to fail more often in rare HOI categories due to fewer training instances, and fixing these errors results in a larger mAP improvement.

Furthermore, as shown in Table 5, the limited training data available for rare HOI categories leads to consistently lower performance in both human-object pair localization (Pair Recall and Pair Precision) and interaction classification (Inter. mAP). Even for the state-of-the-art model RLIPv2 (Yuan et al., 2023) with a strong SwinL backbone, interaction classification accuracy (Inter. mAP) drops significantly from 54.4 to 21.7 for rare categories.

Both object categories and interaction categories follow long-tail distributions, and their combination makes the long-tail issue even more pronounced. Improving accuracy in rare HOI categories (i.e., tail classes) remains an open problem (Kilickaya and Smeulders, 2020; Liao et al., 2022; Hou et al., 2021).

6.4 Performance on HICO-DET Versus V-COCO

Our holistic error analysis in Fig. 7 shows that the major sources of error significance are similar on the two datasets. Our diagnosis in Table 2 further reveals that existing methods tend to generate more human-object pairs to cover the ground truths (higher Pair Recall) on V-COCO than on HICO-DET, although the noisiness level is roughly the same (Pair Precision).

At the same time, existing methods perform better on the interaction classification task on V-COCO than on HICO-DET, in terms of both discarding incorrectly detected human-object pairs (Neg. AP) and recognizing the actual interactions of correctly detected pairs (Inter. mAP). As discussed earlier, part of this improvement can be attributed to the overlap between V-COCO and COCO, which is commonly used for detector/backbone pre-training.

Additionally, we randomly selected test images from the V-COCO dataset to examine the predictions of SCG (Zhang et al., 2021) and UPT (Zhang et al., 2022), as shown in Fig. 9. Since both methods use post-processing techniques, no action errors are present in the predictions. We also provide more visualization results for different categories of errors in Fig. 10.

Fig. 11
figure 11

mAP improvement of different backbones on HICO-DET and V-COCO

6.5 Human-Object Pair Localization Versus Interaction Classification

Both our holistic error analysis via mAP improvement and the detailed breakdown of human-object pair localization and interaction classification reveal significant bottlenecks in both sub-tasks. For instance, in Table 2, we observe that the Pair Recall on HICO-DET is still much lower than 100, indicating that many ground-truth pairs are not being detected. Similarly, the low Inter. mAP suggests that accurately recognizing actual interactions remains a challenging task.

In practice, human-object pair localization largely depends on advances in generic single-object detection, which has been the focus of extensive research. Performance on standard benchmarks (e.g., COCO) has almost reached saturation. In contrast, interaction classification, a multi-label classification problem, has not yet been as thoroughly studied. The success of the state-of-the-art RLIPv2 model (Yuan et al., 2023) demonstrates the potential of addressing this challenge by leveraging large-scale relational data for pre-training.

6.6 Impact of Removing no_interaction on Association Errors

The presence of the no_interaction category affects the association errors of both one-stage and two-stage HOI detectors. To analyze its impact, we compute the mAP improvement after fixing association errors, both with and without the no_interaction category, on the HICO-DET and V-COCO datasets. As shown in Fig. 12, incorporating the no_interaction category can lead to higher association errors in both one-stage and two-stage models: the increased number of no_interaction human-object pairs in the ground truth makes it more challenging for the model to localize all pairs correctly. Interestingly, for some two-stage models (e.g., UPT, STIP, and RLIPv2 on V-COCO), including more human-object pairs in the ground truth results in fewer association errors. This occurs because two-stage models rely on pairing all detected objects, and the additional pairs in the ground truth may help reduce ambiguity in the association process.

6.7 Human Detection

The previous analysis of mAP improvement shows that human detection errors are not as significant as those for generic objects. Here, we examine the recall of human detection. On the HICO-DET dataset, the average recall of human detection is 91.5, while the average recall over all object categories is 88.0.Footnote 3 Human detection is thus easier, but its performance is still far from satisfactory.

Fig. 12
figure 12

mAP improvement of fixing association errors on HICO-DET and V-COCO, w/wo no_interaction category

7 Implications on Model Development

Our diagnosis toolbox makes a valuable contribution to both the current use of HOI detection models and future model development. By offering a comprehensive analysis across various datasets, it allows researchers and developers to pinpoint the strengths and weaknesses of their models from multiple perspectives. The detailed breakdown enables users to make informed choices between one-stage and two-stage models, showing that while there is no clear accuracy advantage, one-stage models typically require fewer training steps to achieve comparable performance to two-stage models. Additionally, the toolbox identifies key bottlenecks in HOI detection, such as object localization errors and misclassification of interactions, which highlight specific areas for model improvement. The insights also clarify why state-of-the-art models like RLIPv2 excel, particularly due to enhanced interaction classification accuracy. Overall, these findings support both the refinement of existing models and the development of future ones by underscoring the need for improvements in object localization and interaction classification.

8 Conclusion

In this paper, we introduce the first diagnosis toolbox for HOI detectors. We first conduct a holistic error analysis by defining a set of errors across the HOI detection pipeline and report the mAP improvement obtained by fixing each of them with an oracle. We then delve into the human-object pair localization and interaction classification tasks separately and provide a detailed breakdown inspection of each.

Detailed analyses are reported on both HICO-DET and V-COCO over eight state-of-the-art HOI detectors. We believe our diagnosis toolbox and analysis results will be helpful for fostering future research in this direction.