
DTR-HAR: deep temporal residual representation for human activity recognition

Original article · The Visual Computer

Abstract

Human activity recognition (HAR) is a widely studied problem in pattern recognition and computer vision. Deep neural networks have recently attracted considerable attention in these fields and have produced significant results. In this paper, we propose a deep temporal residual system for daily-living activity recognition that strengthens spatiotemporal feature representation and thereby improves HAR performance. To this end, we adopt a deep residual convolutional neural network (RCN) to extract discriminative appearance features and a long short-term memory (LSTM) network to capture the long-term temporal evolution of actions. The LSTM models the temporal dependencies that arise while an activity is carried out, enriching the features extracted by the RCN with time information and casting dynamic activity recognition as a sequence labeling task. The proposed deep temporal residual HAR system is evaluated on two publicly available benchmark datasets, MSRDailyActivity3D and CAD-60, and achieves highly competitive results compared with the state of the art.
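
The abstract describes a two-stage pipeline: a residual CNN extracts per-frame appearance features, and an LSTM models their temporal evolution for sequence-level activity classification. The sketch below (PyTorch) illustrates that general pattern only; the ResNet-50 backbone, hidden size, and number of classes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ResidualCNNLSTM(nn.Module):
    """Sketch of a residual-CNN + LSTM pipeline for activity recognition."""

    def __init__(self, num_classes=16, hidden_size=512):
        super().__init__()
        # Residual CNN backbone (ResNet-50 is an assumption, not the paper's exact network).
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features          # 2048 for ResNet-50
        backbone.fc = nn.Identity()                 # keep pooled per-frame features only
        self.cnn = backbone
        # LSTM captures the long-term temporal evolution of the per-frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        frames = clips.reshape(b * t, c, h, w)
        feats = self.cnn(frames).reshape(b, t, -1)  # (batch, time, feat_dim)
        seq_out, _ = self.lstm(feats)               # (batch, time, hidden_size)
        return self.classifier(seq_out[:, -1])      # classify from the last time step

# Example: a batch of 2 clips, 8 frames each, 224x224 RGB.
model = ResidualCNNLSTM(num_classes=16)
logits = model(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 16])
```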





Author information


Corresponding author

Correspondence to Hend Basly.

Ethics declarations

Conflict of interest

Hend Basly, Wael Ouarda, Fatma Ezahra Sayadi, Bouraoui Ouni and Adel M. Alimi declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Basly, H., Ouarda, W., Sayadi, F.E. et al. DTR-HAR: deep temporal residual representation for human activity recognition. Vis Comput 38, 993–1013 (2022). https://doi.org/10.1007/s00371-021-02064-y

