Abstract
Human pose estimation with RGB cameras often degrades in challenging scenarios such as motion blur or poor lighting. In contrast, event cameras, with their wide dynamic range, microsecond-scale temporal resolution, low latency, and low power consumption, remain robust in extreme visual environments. Nevertheless, current research on event-based pose estimation has not yet fully exploited event-driven data, and improving model efficiency remains an open problem. This work focuses on devising an efficient, compact pose estimation algorithm, with particular attention to fusing multi-view event streams for more accurate pose prediction. We propose EV-TIFNet, a compact dual-view interactive network that combines event frames with our custom-designed Global Spatio-Temporal Feature Maps (GTF Maps). To strengthen the network's understanding of motion characteristics and its keypoint localization, we tailor a dedicated Auxiliary Information Extraction Module (AIE Module) for the GTF Maps. Experimental results show that our model, with a compact parameter count of 0.55 million, achieves notable gains on the DHP19 dataset, reducing the MPJPE\(_{3D}\) to 61.45 mm. Exploiting the sparsity of event data, sparse convolution operators replace a significant portion of the conventional convolutional layers, cutting the computational cost by 28.3% to 8.71 GFLOPs. These design choices make the model well suited to scenarios where computational resources are limited.
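For readers unfamiliar with the reported metric: MPJPE\(_{3D}\) (Mean Per-Joint Position Error) is the mean Euclidean distance, in millimetres, between predicted and ground-truth 3D joint positions. The paper itself does not supply code; the following is a minimal illustrative sketch, with joint ordering and units assumed to match between the two inputs:

```python
import math

def mpjpe_3d(pred, gt):
    """Mean Per-Joint Position Error in 3D.

    pred, gt: lists of (x, y, z) joint coordinates in millimetres,
    with the same joint ordering. Returns the mean Euclidean
    distance over all joints.
    """
    assert len(pred) == len(gt) and len(pred) > 0
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(pred, gt):
        total += math.sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(pred)

# Example: two joints, each displaced by 3 mm along one axis.
pred = [(0.0, 0.0, 3.0), (10.0, 0.0, 0.0)]
gt   = [(0.0, 0.0, 0.0), (13.0, 0.0, 0.0)]
print(mpjpe_3d(pred, gt))  # 3.0
```

A lower MPJPE\(_{3D}\) indicates more accurate pose estimation; the 61.45 mm reported above is averaged over all joints and test samples of DHP19.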





Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work is supported by the Liaoning Provincial Natural Science Foundation (2023-MS-330).
Author information
Contributions
Z. X.: conceptualization, methodology, investigation, software, data curation, validation, writing—original draft, visualization. Y. L.: formal analysis, writing—review and editing, supervision, and funding acquisition. H. W.: conceptualization, investigation, and validation. W. Q.: investigation and data curation. W. X.: validation. Yantao Lou: validation.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, X., Yang, L., Huang, W. et al. EV-TIFNet: lightweight binocular fusion network assisted by event camera time information for 3D human pose estimation. J Real-Time Image Proc 21, 150 (2024). https://doi.org/10.1007/s11554-024-01528-3