
Deep-Learning-Based Human Intention Prediction Using RGB Images and Optical Flow

Published in: Journal of Intelligent & Robotic Systems

Abstract

A key technical issue in predicting human intention from observed human actions is how to discover and exploit the spatio-temporal patterns behind those actions. Inspired by the well-known two-stream architecture for action recognition, this paper proposes an approach to human intention prediction based on a two-stream architecture that uses RGB images and optical flow. First, the action-start frame of each trial of a human action is determined by computing the L2 distance between the human joint positions in consecutive frames of the skeleton data. Second, a spatial network and a temporal network are trained separately to predict human intentions from RGB images and optical flow, respectively. Both the early concatenation fusion method in the spatial network and the sampling method in the temporal network are optimized experimentally. Finally, average fusion combines the prediction results of the spatial and temporal networks. To verify the effectiveness of the proposed approach, a new dataset of human intentions behind ball-pitching actions is introduced; it contains RGB images, RGB-D images, and skeleton data. Experiments show that the proposed approach predicts the intention behind an action with 74% accuracy on this dataset. The approach is further evaluated on the Intention from Motion (IfM) dataset, which captures human intentions behind bottle-grasping actions, where it achieves a prediction accuracy of 77%. These results show that the proposed approach is effective at predicting the intentions behind human actions in different applications.
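The abstract spells out two simple computational steps: detecting the action-start frame from the inter-frame L2 displacement of skeleton joints, and averaging the class scores of the two streams. The minimal NumPy sketch below illustrates both steps under assumptions not stated in the abstract: the skeleton is taken to be a (num_frames, num_joints, 3) array, and the motion threshold, function names, and example values are hypothetical, chosen only for illustration rather than taken from the paper.

```python
import numpy as np

def action_start_frame(skeleton, threshold=0.05):
    # skeleton: assumed (num_frames, num_joints, 3) array of 3-D joint positions.
    # Per-joint L2 displacement between consecutive frames, summed over joints.
    displacement = np.linalg.norm(skeleton[1:] - skeleton[:-1], axis=2).sum(axis=1)
    moving = np.flatnonzero(displacement > threshold)   # frames with noticeable motion
    return int(moving[0]) + 1 if moving.size else None  # first frame after motion onset

def average_fusion(spatial_scores, temporal_scores):
    # Average the per-class scores of the spatial (RGB) and temporal (optical-flow) streams.
    return (np.asarray(spatial_scores) + np.asarray(temporal_scores)) / 2.0

# Illustrative usage with synthetic skeleton data and made-up class scores.
skeleton = np.zeros((10, 25, 3))
skeleton[5:] += 0.1  # joints start moving at frame 5
print(action_start_frame(skeleton))                                  # -> 5
print(np.argmax(average_fusion([0.2, 0.5, 0.3], [0.1, 0.7, 0.2])))   # -> 1
```

The paper's exact onset criterion, threshold, and fusion weights are not given in the abstract; the sketch only shows the general mechanism.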



Author information

Corresponding author

Correspondence to Xiumin Diao.

Additional information

Publisher’s Note 

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, S., Zhang, L. & Diao, X. Deep-Learning-Based Human Intention Prediction Using RGB Images and Optical Flow. J Intell Robot Syst 97, 95–107 (2020). https://doi.org/10.1007/s10846-019-01049-3

