Abstract
Estimating and tracking human poses in images and videos is a challenging problem for robotic interaction and augmented reality. Recent approaches based on deep neural networks have shown encouraging results, whereas conventional pose estimation methods suffer from two major drawbacks: sensitivity to the environment and high computational complexity. To address these issues, this paper proposes a novel approach that uses DenseNet and CNN-based transfer learning to learn by explicitly exploiting skeletal data. For comparison, other ImageNet pre-trained models are evaluated with both probabilistic and regression losses. FLIC (Frames Labelled in Cinema), a widely accepted benchmark dataset for pose estimation, serves as the basis for our evaluation and comparison. Our experiments yield an \(R^2\) score of 0.948, and on that basis we recommend probabilistic loss over regression loss as the new baseline for future downstream tasks and fine-tuning-based transfer learning techniques for pose estimation.
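For readers unfamiliar with the metric, the \(R^2\) score reported above is the coefficient of determination, \(1 - SS_{res}/SS_{tot}\). The following is a minimal sketch of how it could be computed for regressed keypoint coordinates. It is not the authors' evaluation code: the function name `r2_score`, the sample coordinates, and the choice to score a single joint coordinate are all assumptions made for illustration.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the targets around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical normalized x-coordinates of one joint across five images
# (illustrative values only, not taken from FLIC or from the paper).
true_x = [0.10, 0.30, 0.50, 0.70, 0.90]
pred_x = [0.12, 0.28, 0.53, 0.66, 0.91]

score = r2_score(true_x, pred_x)
print(f"R^2 = {score:.4f}")
```

A score of 1 means the predictions match the targets exactly, and a score of 0 means the model does no better than always predicting the mean target. How the paper aggregates the score across joints and coordinates is not stated in the abstract, so any such aggregation would be a further assumption.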
Availability of data and materials
Not applicable.
Acknowledgements
Not applicable.
Funding
Not applicable.
Ethics declarations
Conflict of interest
No conflict of interest.
About this article
Cite this article
Kumar, P., Chauhan, S. Towards improvement of baseline performance for regression based human pose estimation. Evolving Systems 15, 659–667 (2024). https://doi.org/10.1007/s12530-023-09508-x
DOI: https://doi.org/10.1007/s12530-023-09508-x