
Multi modal human action recognition for video content matching


Abstract

Human action recognition (HAR) in videos is a challenging task in computer vision. Conventional methods typically explore spatiotemporal or optical representations of video actions. However, optical representations can be unreliable in real-life situations such as object occlusion and dim lighting. To address this issue, this paper presents a novel approach to human action recognition that jointly exploits video and Wi-Fi cues. We leverage the fact that Wi-Fi signals carry discriminative information about human actions and are robust to optical limitations. To validate this idea, we design a practical HAR framework and build a dataset containing both video clips and Wi-Fi Channel State Information (CSI) of human actions. A 3D convolutional neural network extracts the video features, and statistical algorithms extract the radio features. After the video and radio features are fused, a classical linear support vector machine serves as the classifier. Comprehensive experiments on this dataset yield encouraging results, with accuracy improvements of up to 10%. This supports our key finding: with the aid of Wi-Fi Channel State Information, the performance of video action recognition methods can be improved significantly, even under optical limitations.
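To make the pipeline concrete, below is a minimal, hypothetical sketch of the fusion scheme the abstract describes: video features (assumed to be precomputed by a 3D CNN) are concatenated with simple per-subcarrier statistics of the CSI amplitude, and a linear SVM is trained on the fused vectors. The specific statistics, feature dimensions, and class count are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch (not the authors' exact pipeline): early fusion of
# precomputed 3D-CNN video features with statistical Wi-Fi CSI features,
# classified by a linear SVM, as outlined in the abstract.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def csi_statistical_features(csi_amplitude):
    """Per-subcarrier statistics over time for one CSI recording.

    csi_amplitude: array of shape (time_steps, subcarriers).
    Returns a flat vector of mean/std/max/min per subcarrier
    (an assumed, illustrative choice of statistics).
    """
    return np.concatenate([
        csi_amplitude.mean(axis=0),
        csi_amplitude.std(axis=0),
        csi_amplitude.max(axis=0),
        csi_amplitude.min(axis=0),
    ])


def fuse(video_feat, csi_feat):
    # Feature-level (early) fusion: simple concatenation.
    return np.concatenate([video_feat, csi_feat])


# Toy stand-ins for real extracted features and labels.
rng = np.random.default_rng(0)
n_clips, video_dim, t_steps, subcarriers = 40, 512, 100, 30
video_feats = rng.normal(size=(n_clips, video_dim))   # e.g. 3D-CNN embeddings
csi_feats = np.stack([
    csi_statistical_features(rng.normal(size=(t_steps, subcarriers)))
    for _ in range(n_clips)
])
labels = rng.integers(0, 4, size=n_clips)             # 4 hypothetical classes

X = np.stack([fuse(v, c) for v, c in zip(video_feats, csi_feats)])
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```

Concatenation at the feature level (early fusion) is one simple way to combine the two modalities; a late-fusion variant would instead train a classifier per modality and merge their scores.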



Acknowledgements

This work is financially supported in part by the National Key Research and Development Program of China under Grant No. 2017YFB1400301 and the National Science Foundation of China under Grants No. 61973250, 61702415, 61902318, and 61973249. We are also grateful to all volunteers, the Shaanxi Science and Technology Innovation Team Support Project (grant agreement 2018TD-026), the China University of Labor Relations (20XYJ007), and Wanfang Data Co. for their contributions to our dataset.

Author information


Correspondence to Zhanyong Tang or Pengfei Xu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Guo, J., Bai, H., Tang, Z. et al. Multi modal human action recognition for video content matching. Multimed Tools Appl 79, 34665–34683 (2020). https://doi.org/10.1007/s11042-020-08998-0

