Abstract:
Human-Object Interaction (HOI) detection is an active research area in computer vision that merits further investigation, and it plays an important role in understanding the high-level semantics of images. To achieve superior object detection performance, existing HOI models concentrate predominantly on the bounding-box information of humans and objects and ignore the surrounding context. This leads to imprecise inference of instance interactions, a problem that is especially severe for images with indirect-contact interactions (Intersection-over-Union = 0). To address this, we propose a novel Triple stream Enhanced encoder-decoder Dispersal Network (TED-Net), equipped with human, object, and instance interaction decoding streams, to decouple instances’ relationships. We also design a dispersal attention mechanism to capture indirect-contact interaction information and an auxiliary discrimination mechanism to improve the action category recognition ability of the instance interaction decoding stream. Experimental results show that TED-Net achieves the best performance among HOI models using the ResNet-50 backbone on the (large) HICO-Det dataset and ranks third on the leaderboard for the (small) V-COCO dataset. Additionally, two indirect-contact interaction datasets, HICO-Det-IC and V-COCO-IC, are constructed to demonstrate the effectiveness of TED-Net in detecting interactions between instances that are not in direct contact, with average improvements of +3.80 mAP on HICO-Det-IC and +5.46 mAP on V-COCO-IC. Code is available at https://drliuqi.github.io/.
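The Intersection-over-Union = 0 condition mentioned in the abstract can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); the names `iou` and `is_zero_overlap` are hypothetical and are not taken from the TED-Net code, and the paper's exact procedure for building HICO-Det-IC and V-COCO-IC may differ.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), axis-aligned.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle, clamped at zero
    # when the boxes do not overlap along that axis.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes are treated as having no overlap.
    return inter / union if union > 0 else 0.0


def is_zero_overlap(human: Box, obj: Box) -> bool:
    """Hypothetical helper: True when the human and object boxes have IoU = 0."""
    return iou(human, obj) == 0.0


if __name__ == "__main__":
    person = (0.0, 0.0, 2.0, 2.0)
    frisbee = (5.0, 5.0, 6.0, 6.0)   # far from the person, no overlap
    backpack = (1.0, 1.0, 3.0, 3.0)  # overlaps the person box
    print(is_zero_overlap(person, frisbee))
    print(is_zero_overlap(person, backpack))
```

Under this assumption, a human-object pair such as a person throwing a frisbee that has already left the hand would fall into the zero-overlap case, which is where box-only features carry the least information about the interaction.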
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 34, Issue: 7, July 2024)