
Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition

  • Original Paper
  • Published:
Machine Vision and Applications

Abstract

Multimodal action recognition techniques combine several image modalities (RGB, Depth, Skeleton, and InfraRed) to achieve more robust recognition. Depending on where fusion occurs in the action recognition pipeline, we can distinguish three families of approaches: early fusion, where the raw modalities are combined before feature extraction; intermediate fusion, where the features extracted from each modality are concatenated before classification; and late fusion, where the modality-wise classification results are combined. After reviewing the literature, we identified the principal defects of each category and address them as follows. First, we investigate more deeply early-stage fusion, which has been poorly explored in the literature. Second, because existing intermediate fusion protocols operate on the feature maps irrespective of the particularities of human actions, we propose a new scheme that optimally combines modality-wise features. Third, as most late fusion solutions rely on handcrafted rules, which are prone to human bias and removed from real-world peculiarities, we adopt a neural learning strategy that extracts significant features from the data rather than assuming that artificial rules are correct. We validated our findings on two challenging datasets, where our results were as good as or better than their literature counterparts.
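To make the three fusion levels concrete, below is a minimal PyTorch sketch. Everything in it is an illustrative assumption for exposition rather than the authors' architecture: it uses single RGB and depth frames (the paper works on video and more modalities), a toy convolutional encoder, and invented module names.

    # Sketch of early, intermediate, and late fusion on an RGB-D pair.
    # All names, channel sizes, and the learned late-fusion layer are
    # illustrative assumptions, not the paper's actual networks.
    import torch
    import torch.nn as nn


    def _branch(in_channels: int, out_features: int) -> nn.Sequential:
        """Tiny convolutional encoder shared by all three sketches."""
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (B, 32, H, W) -> (B, 32, 1, 1)
            nn.Flatten(),             # -> (B, 32)
            nn.Linear(32, out_features),
        )


    class EarlyFusion(nn.Module):
        """Raw modalities are stacked along the channel axis before any
        feature extraction, so a single encoder sees the combined input."""

        def __init__(self, num_classes: int):
            super().__init__()
            self.net = _branch(3 + 1, num_classes)  # RGB (3) + depth (1)

        def forward(self, rgb, depth):
            return self.net(torch.cat([rgb, depth], dim=1))


    class IntermediateFusion(nn.Module):
        """Each modality gets its own encoder; the feature vectors are
        concatenated before a shared classification head."""

        def __init__(self, num_classes: int, feat_dim: int = 64):
            super().__init__()
            self.rgb_enc = _branch(3, feat_dim)
            self.depth_enc = _branch(1, feat_dim)
            self.classifier = nn.Linear(2 * feat_dim, num_classes)

        def forward(self, rgb, depth):
            feats = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
            return self.classifier(feats)


    class LateFusion(nn.Module):
        """Each modality gets a full classifier; per-modality scores are
        merged by a small learned layer rather than a handcrafted rule
        such as fixed averaging or max voting."""

        def __init__(self, num_classes: int):
            super().__init__()
            self.rgb_net = _branch(3, num_classes)
            self.depth_net = _branch(1, num_classes)
            self.combine = nn.Linear(2 * num_classes, num_classes)

        def forward(self, rgb, depth):
            scores = torch.cat([self.rgb_net(rgb), self.depth_net(depth)], dim=1)
            return self.combine(scores)


    # Example: a batch of two 64x64 RGB-D pairs.
    rgb, depth = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)
    for model in (EarlyFusion(10), IntermediateFusion(10), LateFusion(10)):
        print(type(model).__name__, model(rgb, depth).shape)  # (2, 10) logits

In the last class, replacing the learned combiner with a fixed average would recover the handcrafted rules the abstract argues against; learning the combination end to end lets the network weight each modality's scores from data.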




Author information


Corresponding author

Correspondence to Said Yacine Boulahia.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Boulahia, S.Y., Amamra, A., Madi, M.R. et al. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications 32, 121 (2021). https://doi.org/10.1007/s00138-021-01249-8



  • DOI: https://doi.org/10.1007/s00138-021-01249-8
