Action Prediction During Human-Object Interaction Based on DTW and Early Fusion of Human and Object Representations

  • Conference paper
  • In: Computer Vision Systems (ICVS 2021)

Abstract

Action prediction is the inference of an action label while the action is still ongoing. Such a capability is extremely useful for early response and further action planning. In this paper, we consider the problem of action prediction in scenarios involving humans interacting with objects. We formulate an approach that builds time series representations of the motion of the humans and the objects involved in an action. Such a representation of an ongoing action is then compared to a set of prototype actions. This is achieved by a Dynamic Time Warping (DTW)-based time series alignment framework that identifies the best match between the ongoing action and the prototypes. Our approach is evaluated quantitatively on three standard benchmark datasets. The experimental results reveal the importance of fusing human- and object-centered action representations for accurate action prediction. Moreover, we demonstrate that the proposed approach achieves significantly higher action prediction accuracy than competitive methods.
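The abstract describes, at a high level, an algorithmic pipeline: per-frame human and object features are fused early into a single multivariate time series, and the observed prefix of an ongoing action is aligned against complete prototype actions with DTW; the prototype yielding the cheapest alignment gives the predicted label. The Python sketch below illustrates that idea only and is not the authors' implementation: the names fuse_features, open_end_dtw, and predict_action, the Euclidean per-frame cost, the open-end (prefix) alignment in the spirit of Tormene et al. [39], and the length normalisation are all illustrative assumptions. DTW toolkits such as dtwalign [1] or tslearn [36], both cited in the references, could replace the hand-rolled recursion.

```python
import numpy as np

def fuse_features(human_seq, object_seq):
    """Early fusion: concatenate per-frame human features (e.g., skeleton
    joints) and object features into a single multivariate time series of
    shape (T, D_human + D_object)."""
    return np.concatenate([human_seq, object_seq], axis=1)

def open_end_dtw(partial, prototype):
    """Open-end DTW: align the whole partial observation against *any*
    prefix of the prototype and return the cheapest accumulated cost.
    Classic O(T1 * T2) dynamic program with Euclidean frame distances."""
    t1, t2 = len(partial), len(prototype)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(partial[i - 1] - prototype[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    # Open end: the ongoing action may stop at any prototype frame j.
    j_star = int(np.argmin(acc[t1, 1:])) + 1
    return acc[t1, j_star] / (t1 + j_star)  # length-normalised cost

def predict_action(partial, prototypes):
    """Label the ongoing action with the prototype (a dict mapping
    label -> full time series) whose prefix it matches best."""
    return min(prototypes, key=lambda lbl: open_end_dtw(partial, prototypes[lbl]))

# Hypothetical usage with random stand-in features (34-D fused vectors).
rng = np.random.default_rng(0)
partial = fuse_features(rng.normal(size=(20, 30)), rng.normal(size=(20, 4)))
prototypes = {"drink": rng.normal(size=(60, 34)),
              "pour": rng.normal(size=(70, 34))}
print(predict_action(partial, prototypes))
```

The open-end variant matters here because the ongoing action is, by definition, incomplete: a plain DTW distance against a full prototype would penalise the unobserved remainder of the action, biasing prediction toward short prototypes.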

The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1592).

Notes

  1. In our work the terms “video recordings” and “skeletal data” are used interchangeably.

  2. MHAD is not included in this investigation as the vast majority of its actions do not involve human-object interactions.

References

  1. https://github.com/statefb/dtwalign

  2. Afrasiabi, M., Mansoorizadeh, M., et al.: DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis. Comput. 36, 1127–1139 (2019)

  3. Alfaifi, R., Artoli, A.: Human action prediction with 3D-CNN. SN Comput. Sci. 1, 1–15 (2020)

  4. Arzani, M.M., Fathy, M., Azirani, A.A., Adeli, E.: Skeleton-based structured early activity prediction. Multimedia Tools Appl. 80(15), 23023–23049 (2020). https://doi.org/10.1007/s11042-020-08875-w

  5. Bao, W., Yu, Q., Kong, Y.: Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In: ACM International Conference on Multimedia (2020)

  6. Bochkovskiy, A., Wang, C., Liao, H.: YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)

  7. Cuturi, M.: Fast global alignment kernels. In: ICML 2011 (2011)

  8. Cuturi, M., Blondel, M.: Soft-DTW: a differentiable loss function for time-series. arXiv:1703.01541 (2017)

  9. Dutta, V., Zielinska, T.: Predicting human actions taking into account object affordances. J. Intell. Robot. Syst. 93, 745–761 (2019)

  10. Dutta, V., Zielińska, T.: An adversarial explainable artificial intelligence (XAI) based approach for action forecasting. J. Autom. Mob. Robot. Intell. Syst. (2021)

  11. Farha, A., Richard, A., Gall, J.: When will you do what? - Anticipating temporal occurrences of activities. In: IEEE CVPR (2018)

  12. Gammulle, H., Denman, S., Sridharan, S., Fookes, C.: Predicting the future: a jointly learnt model for action anticipation. In: IEEE ICCV (2019)

  13. Ghoddoosian, R., Sayed, S., Athitsos, V.: Action duration prediction for segment-level alignment of weakly-labeled videos. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2053–2062 (2021)

  14. Hadji, I., Derpanis, K.G., Jepson, A.D.: Representation learning via global temporal alignment and cycle-consistency. arXiv preprint arXiv:2105.05217 (2021)

  15. Haresh, S., et al.: Learning by aligning videos in time. arXiv preprint arXiv:2103.17260 (2021)

  16. Ke, Q., Bennamoun, M., Rahmani, H., An, S., Sohel, F., Boussaid, F.: Learning latent global network for skeleton-based action prediction. IEEE Trans. Image Process. (2019)

  17. Ke, Q., Fritz, M., Schiele, B.: Time-conditioned action anticipation in one shot. In: IEEE CVPR (2019)

  18. Koppula, H., Gupta, R., Saxena, A.: Learning human activities and object affordances from RGB-D videos. Int. J. Robot. Res. 32, 951–970 (2013)

  19. Li, T., Liu, J., Zhang, W., Duan, L.: HARD-Net: hardness-AwaRe discrimination network for 3D early activity prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 420–436. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_25

  20. Liu, J., Shahroudy, A., Wang, G., Duan, L., Kot, A.: Skeleton-based online action prediction using scale selection network. IEEE PAMI 42, 1453–1467 (2019)

  21. Manousaki, V., Papoutsakis, K., Argyros, A.: Evaluating method design options for action classification based on bags of visual words. In: VISAPP (2018)

  22. Mao, W., Liu, M., Salzmann, M.: History repeats itself: human motion prediction via motion attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 474–489. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_28

  23. Mavrogiannis, A., Chandra, R., Manocha, D.: B-GAP: behavior-guided action prediction for autonomous navigation. arXiv:2011.03748 (2020)

  24. Mavroudi, E., Haro, B.B., Vidal, R.: Representation learning on visual-symbolic graphs for video understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374, pp. 71–90. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58526-6_5

  25. Miech, A., Laptev, I., Sivic, J., Wang, H., Torresani, L., Tran, D.: Leveraging the present to anticipate the future in videos. In: IEEE CVPR Workshops (2019)

  26. Ng, Y., Fernando, B.: Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. IEEE Trans. Image Process. 29, 8880–8891 (2020)

  27. Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Berkeley MHAD: a comprehensive multimodal human action database. In: IEEE Workshop on Applications of Computer Vision (WACV) (2013)

  28. Oprea, S., et al.: A review on deep learning techniques for video prediction. IEEE PAMI (2020)

  29. Papoutsakis, K., Panagiotakis, C., Argyros, A.: Temporal action co-segmentation in 3D motion capture data and videos. In: CVPR (2017)

  30. Qammaz, A., Argyros, A.: Occlusion-tolerant and personalized 3D human pose estimation in RGB images. In: 2020 ICPR. IEEE (2021)

  31. Qin, Y., Mo, L., Li, C., Luo, J.: Skeleton-based action recognition by part-aware graph convolutional networks. Vis. Comput. 36, 621–631 (2020)

  32. Rasouli, A.: Deep learning for vision-based prediction: a survey. arXiv:2007.00095 (2020)

  33. Rasouli, A., Yau, T., Rohani, M., Luo, J.: Multi-modal hybrid architecture for pedestrian action prediction. arXiv:2012.00514 (2020)

  34. Reily, B., Han, F., Parker, L., Zhang, H.: Skeleton-based bio-inspired human activity prediction for real-time human-robot interaction. Auton. Robots 42, 1281–1298 (2018)

  35. Rius, I., Gonzàlez, J., Varona, J., Roca, F.: Action-specific motion prior for efficient Bayesian 3D human body tracking. Pattern Recogn. 42, 2907–2921 (2009)

  36. Tavenard, R., et al.: Tslearn, a machine learning toolkit for time series data. J. Mach. Learn. Res. 21, 1–6 (2020)

  37. Ryoo, M.: Human activity prediction: early recognition of ongoing activities from streaming videos. In: IEEE ICCV (2011)

  38. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26, 43–49 (1978)

  39. Tormene, P., Giorgino, T., Quaglini, S., Stefanelli, M.: Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation. Artif. Intell. Med. 45, 11–34 (2009)

  40. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: IEEE CVPR (2012)

  41. Wang, X., Hu, J., Lai, J., Zhang, J., Zheng, W.: Progressive teacher-student learning for early action prediction. In: IEEE CVPR (2019)

  42. Wu, M., et al.: Gaze-based intention anticipation over driving manoeuvres in semi-autonomous vehicles (2020)

  43. Xia, L., Aggarwal, J.: Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: IEEE CVPR (2013)

Author information

Corresponding author

Correspondence to Victoria Manousaki.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Manousaki, V., Papoutsakis, K., Argyros, A. (2021). Action Prediction During Human-Object Interaction Based on DTW and Early Fusion of Human and Object Representations. In: Vincze, M., Patten, T., Christensen, H.I., Nalpantidis, L., Liu, M. (eds) Computer Vision Systems. ICVS 2021. Lecture Notes in Computer Science, vol. 12899. Springer, Cham. https://doi.org/10.1007/978-3-030-87156-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-87156-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87155-0

  • Online ISBN: 978-3-030-87156-7

  • eBook Packages: Computer Science, Computer Science (R0)
