Abstract
Human action recognition (HAR) in videos is a challenging task in computer vision. Conventional methods typically explore spatiotemporal or optical representations of video actions. However, optical representations can be unreliable in some real-life situations, such as object occlusion and dim light. To address this issue, this paper presents a novel approach to human action recognition that jointly exploits video and Wi-Fi clues. We leverage the fact that Wi-Fi signals carry discriminative information about human actions and are robust to optical limitations. To validate this idea, we devise a practical framework for HAR and build a dataset containing both video clips and Wi-Fi Channel State Information of human actions. A 3D convolutional neural network extracts the video features, and statistical algorithms extract the radio features. After the video and radio features are fused, a classical linear support vector machine is employed as the classifier. Comprehensive experiments on this dataset achieved desirable results, with a maximum accuracy improvement of 10%. This demonstrates our promising finding: with the aid of Wi-Fi Channel State Information, the performance of video action recognition methods can be improved significantly, even under optical limitations.
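The fusion pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature dimensions, the random stand-in features, and the number of action classes are all hypothetical, but the structure (concatenating 3D-CNN video features with statistical Wi-Fi CSI features, then classifying with a linear SVM) follows the abstract.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real features: a 3D-CNN embedding
# per video clip and per-subcarrier CSI statistics (e.g. mean/variance).
n_samples, d_video, d_csi = 200, 512, 64
video_feats = rng.normal(size=(n_samples, d_video))
csi_feats = rng.normal(size=(n_samples, d_csi))
labels = rng.integers(0, 5, size=n_samples)  # 5 hypothetical action classes

# Feature-level fusion: concatenate the two modalities per sample,
# then train a classical linear SVM on the fused vectors.
fused = np.concatenate([video_feats, csi_feats], axis=1)
clf = LinearSVC(dual=False).fit(fused, labels)
pred = clf.predict(fused)
print(fused.shape, pred.shape)
```

In practice the CSI statistics would be computed from the captured Wi-Fi channel measurements and the video features taken from a pretrained 3D CNN; the SVM then sees one fused vector per action sample.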
Acknowledgements
This work is financially supported in part by the National Key Research and Development Program of China under Grant No. 2017YFB1400301, the National Science Foundation of China under Grants No. 61973250, 61702415, 61902318, and 61973249, the Shaanxi Science and Technology Innovation Team Support Project under Grant Agreement 2018TD-026, and the China University of Labor Relations (20XYJ007). We are also grateful to all volunteers and to Wanfang Data Co. for their contributions to our dataset.
Cite this article
Guo, J., Bai, H., Tang, Z. et al. Multi modal human action recognition for video content matching. Multimed Tools Appl 79, 34665–34683 (2020). https://doi.org/10.1007/s11042-020-08998-0