Abstract
Human action recognition (HAR) in videos is a challenging task in computer vision. Conventional methods typically explore spatiotemporal or optical representations of video actions. However, optical representations can be unreliable in some real-life situations, such as object occlusion and dim light. To address this issue, this paper presents a novel approach to human action recognition that jointly exploits video and Wi-Fi clues. We leverage the fact that Wi-Fi signals carry discriminative information about human actions and are robust to optical limitations. To validate this idea, we devise a practical framework for HAR and build a dataset containing both video clips and Wi-Fi Channel State Information of human actions. A 3D convolutional neural network extracts the video features, and statistical algorithms extract the radio features. After the video and radio features are fused, a classical linear support vector machine is employed as the classifier. Comprehensive experiments on this dataset achieved desirable results, with a maximum accuracy improvement of 10%. This demonstrates our promising finding: with the aid of Wi-Fi Channel State Information, the performance of video action recognition methods can be improved significantly, even under optical limitations.
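The fusion pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature dimensions, the random stand-in features, and the number of action classes are all hypothetical, but the structure (concatenating 3D-CNN video features with statistical Wi-Fi CSI features, then classifying with a linear SVM) follows the abstract.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real features: a 3D-CNN embedding
# per video clip and per-subcarrier CSI statistics (e.g. mean/variance).
n_samples, d_video, d_csi = 200, 512, 64
video_feats = rng.normal(size=(n_samples, d_video))
csi_feats = rng.normal(size=(n_samples, d_csi))
labels = rng.integers(0, 5, size=n_samples)  # 5 hypothetical action classes

# Feature-level fusion: concatenate the two modalities per sample,
# then train a classical linear SVM on the fused vectors.
fused = np.concatenate([video_feats, csi_feats], axis=1)
clf = LinearSVC(dual=False).fit(fused, labels)
pred = clf.predict(fused)
print(fused.shape, pred.shape)
```

In practice the CSI statistics would be computed from the captured Wi-Fi channel measurements and the video features taken from a pretrained 3D CNN; the SVM then sees one fused vector per action sample.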
Acknowledgements
This work is financially supported in part by the National Key Research and Development Program of China under Grant No. 2017YFB1400301, the National Science Foundation of China under Grants No. 61973250, 61702415, 61902318, and 61973249, the Shaanxi Science and Technology Innovation Team Support Project under Grant Agreement 2018TD-026, and the China University of Labor Relations (20XYJ007). We are also grateful to all volunteers and to Wanfang Data Co. for their contributions to our dataset.
Cite this article
Guo, J., Bai, H., Tang, Z. et al. Multi modal human action recognition for video content matching. Multimed Tools Appl 79, 34665–34683 (2020). https://doi.org/10.1007/s11042-020-08998-0