Abstract
Human pose estimation with RGB cameras often degrades in challenging scenarios such as motion blur or poor lighting. In contrast, event cameras, with their wide dynamic range, microsecond-scale temporal resolution, low latency, and low power consumption, remain robust in extreme visual environments. Nevertheless, current research on event-based pose estimation has not yet fully exploited event-driven data, and improving model efficiency remains an open problem. This work focuses on devising an efficient, compact pose estimation algorithm, with particular attention to fusing multi-view event streams for more accurate pose prediction. We propose EV-TIFNet, a compact dual-view interactive network that combines event frames with our custom-designed Global Spatio-Temporal Feature Maps (GTF Maps). To strengthen the network's understanding of motion characteristics and its keypoint localization, we tailor a dedicated Auxiliary Information Extraction Module (AIE Module) for the GTF Maps. Experimental results show that our model, with a compact parameter count of 0.55 million, achieves notable gains on the DHP19 dataset, reducing the MPJPE\(_{3D}\) to 61.45 mm. Exploiting the sparsity of event data, sparse convolution operators replace a significant portion of the conventional convolutional layers, cutting the computational cost by 28.3% to 8.71 GFLOPs. These design choices make the model well suited to scenarios where computational resources are limited.
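For readers unfamiliar with the reported metric: MPJPE\(_{3D}\) (Mean Per-Joint Position Error) is the mean Euclidean distance, in millimetres, between predicted and ground-truth 3D joint positions. The paper itself does not supply code; the following is a minimal illustrative sketch, with joint ordering and units assumed to match between the two inputs:

```python
import math

def mpjpe_3d(pred, gt):
    """Mean Per-Joint Position Error in 3D.

    pred, gt: lists of (x, y, z) joint coordinates in millimetres,
    with the same joint ordering. Returns the mean Euclidean
    distance over all joints.
    """
    assert len(pred) == len(gt) and len(pred) > 0
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(pred, gt):
        total += math.sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(pred)

# Example: two joints, each displaced by 3 mm along one axis.
pred = [(0.0, 0.0, 3.0), (10.0, 0.0, 0.0)]
gt   = [(0.0, 0.0, 0.0), (13.0, 0.0, 0.0)]
print(mpjpe_3d(pred, gt))  # 3.0
```

A lower MPJPE\(_{3D}\) indicates more accurate pose estimation; the 61.45 mm reported above is averaged over all joints and test samples of DHP19.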





Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work is supported by the Liaoning Provincial Natural Science Foundation (2023-MS-330).
Author information
Contributions
Z. X.: conceptualization, methodology, investigation, software, data curation, validation, writing—original draft, visualization. Y. L.: formal analysis, writing—review and editing, supervision, and funding acquisition. H. W.: conceptualization, investigation, and validation. W. Q.: investigation and data curation. W. X.: validation. Yantao Lou: validation.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, X., Yang, L., Huang, W. et al. EV-TIFNet: lightweight binocular fusion network assisted by event camera time information for 3D human pose estimation. J Real-Time Image Proc 21, 150 (2024). https://doi.org/10.1007/s11554-024-01528-3