Journals & Magazines >IEEE Transactions on Image Pr... >Volume: 30

Action Anticipation Using Pairwise Human-Object Interactions and Transformers

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applicat...Show More

Metadata

Abstract:

The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that cross-correlation based frame representation is more suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. So, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation.

Published in: IEEE Transactions on Image Processing ( Volume: 30)

Page(s): 8116 - 8129

Date of Publication: 22 September 2021

ISSN Information:

PubMed ID: 34550884

DOI: 10.1109/TIP.2021.3113114

Funding Agency:

Contents

References is not available for this document.

Action Anticipation Using Pairwise Human-Object Interactions and Transformers

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Action Anticipation Using Pairwise Human-Object Interactions and Transformers

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

References

IEEE Account

Purchase Details

Profile Information

Need Help?