
Deep-Learning-Based Human Intention Prediction Using RGB Images and Optical Flow

Published in: Journal of Intelligent & Robotic Systems

Abstract

A key technical issue in predicting human intention from observed human actions is how to discover and exploit the spatio-temporal patterns behind those actions. Inspired by the well-known two-stream architecture for action recognition, this paper proposes an approach to human intention prediction based on a two-stream architecture that uses RGB images and optical flow. First, the action-start frame of each trial of a human action is determined by computing the L2 distance between the human joint positions in consecutive frames of the skeleton data. Second, a spatial network and a temporal network are trained separately to predict human intentions from RGB images and optical flow, respectively. Both the early concatenation fusion method in the spatial network and the sampling method in the temporal network are optimized experimentally. Finally, average fusion combines the prediction results of the spatial and temporal networks. To verify the effectiveness of the proposed approach, a new dataset of human intentions behind ball-pitching actions is introduced; it contains RGB images, RGB-D images, and skeleton data. Experiments show that the proposed approach predicts the intention behind an action with 74% accuracy on this dataset. The approach is further evaluated on the Intention from Motion (IfM) dataset, which captures human intentions behind bottle-grasping actions, where it achieves a prediction accuracy of 77%. These results show that the proposed approach is effective at predicting the intentions behind human actions in different applications.
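The abstract spells out two simple computational steps: detecting the action-start frame from the inter-frame L2 displacement of skeleton joints, and averaging the class scores of the two streams. The minimal NumPy sketch below illustrates both steps under assumptions not stated in the abstract: the skeleton is taken to be a (num_frames, num_joints, 3) array, and the motion threshold, function names, and example values are hypothetical, chosen only for illustration rather than taken from the paper.

```python
import numpy as np

def action_start_frame(skeleton, threshold=0.05):
    # skeleton: assumed (num_frames, num_joints, 3) array of 3-D joint positions.
    # Per-joint L2 displacement between consecutive frames, summed over joints.
    displacement = np.linalg.norm(skeleton[1:] - skeleton[:-1], axis=2).sum(axis=1)
    moving = np.flatnonzero(displacement > threshold)   # frames with noticeable motion
    return int(moving[0]) + 1 if moving.size else None  # first frame after motion onset

def average_fusion(spatial_scores, temporal_scores):
    # Average the per-class scores of the spatial (RGB) and temporal (optical-flow) streams.
    return (np.asarray(spatial_scores) + np.asarray(temporal_scores)) / 2.0

# Illustrative usage with synthetic skeleton data and made-up class scores.
skeleton = np.zeros((10, 25, 3))
skeleton[5:] += 0.1  # joints start moving at frame 5
print(action_start_frame(skeleton))                                  # -> 5
print(np.argmax(average_fusion([0.2, 0.5, 0.3], [0.1, 0.7, 0.2])))   # -> 1
```

The paper's exact onset criterion, threshold, and fusion weights are not given in the abstract; the sketch only shows the general mechanism.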



Author information

Corresponding author

Correspondence to Xiumin Diao.

Additional information

Publisher’s Note 

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, S., Zhang, L. & Diao, X. Deep-Learning-Based Human Intention Prediction Using RGB Images and Optical Flow. J Intell Robot Syst 97, 95–107 (2020). https://doi.org/10.1007/s10846-019-01049-3

