
DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation

Authors:

Jian Zhao

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 1798 - 1808

https://doi.org/10.1145/3581783.3611989

Published: 27 October 2023

Abstract

Multi-person pose estimation in crowded scenes remains a challenging task. We find that most previous methods fail in crowded scenes because they cannot estimate or group visible keypoints, rather than because they cannot infer invisible ones. We therefore divide crowded scenes into two categories, entanglement and occlusion, based on the visibility of human parts, and observe that entanglement is a significant problem in crowded scenes. Based on this observation, we propose DecenterNet, an end-to-end deep architecture for robust and efficient pose estimation in crowded scenes. Within DecenterNet, we introduce a decentralized pose representation that uses all visible keypoints as root points to represent human poses, which is more robust in entangled areas. We also propose a decoupled pose assessment mechanism, which introduces a location map to adaptively select optimal poses in the offset map. In addition, we construct a new dataset named SkatingPose, which contains more entangled scenes. DecenterNet surpasses the best competing method on SkatingPose by 1.8 AP. Furthermore, DecenterNet obtains 71.2 AP and 71.4 AP on the COCO and CrowdPose datasets, respectively, demonstrating the superiority of our method. We will release our source code, trained models, and dataset to facilitate further research in this direction. Our code and dataset are available at https://github.com/InvertedForest/DecenterNet.
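
The decentralized representation described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `decentralized_offsets` and `select_pose`, the per-root score dictionary, and the choice to keep offsets to every keypoint are all assumptions made for illustration. It only shows the general idea that each visible keypoint can serve as a root carrying offsets to the rest of the pose, and that a separate location score can decide which root's pose to keep.

```python
import numpy as np

def decentralized_offsets(keypoints, visible):
    """Build one offset set per visible keypoint (hypothetical helper).

    keypoints: (K, 2) array of (x, y) coordinates.
    visible:   (K,) boolean mask; only visible keypoints act as roots.
    Returns {root_index: (K, 2) offsets from that root to every keypoint}.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    return {int(r): keypoints - keypoints[r] for r in np.flatnonzero(visible)}

def select_pose(keypoints, offsets_by_root, root_scores):
    """Keep the pose decoded from the root with the highest score.

    root_scores stands in for values read from a location map; here it is
    just a dict {root_index: score} supplied by hand (an assumption).
    """
    best_root = max(offsets_by_root, key=lambda r: root_scores[r])
    pose = keypoints[best_root] + offsets_by_root[best_root]
    return best_root, pose

# Toy person with three keypoints; the second one is not visible.
keypoints = np.array([[10.0, 20.0], [14.0, 22.0], [30.0, 40.0]])
visible = np.array([True, False, True])

offsets = decentralized_offsets(keypoints, visible)
root_scores = {0: 0.4, 2: 0.9}  # made-up scores for the two visible roots
best_root, pose = select_pose(keypoints, offsets, root_scores)
```

In this toy setup every visible keypoint can reconstruct the pose on its own, so a root lying in an entangled region can be ignored in favor of a higher-scoring one.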

Supplemental Material

MP4 File

Video for "DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation".


Index Terms

DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the Fundamental Research Funds for the Central Universities
Young Elite Scientist Sponsorship Program of China Association for Science and Technology
National Nature Fund
Natural Science Foundation of China
Young Elite Scientist Sponsorship Program of Beijing Association for Science and Technology

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

Chen J, Zhou Z, Sun M, Zhao R, Wu L, Bao T, He Z (2025). ZeroPose: CAD-Prompted Zero-Shot Object 6D Pose Estimation in Cluttered Scenes. IEEE Transactions on Circuits and Systems for Video Technology 35:2 (1251-1264). Online publication date: Feb-2025.
https://doi.org/10.1109/TCSVT.2024.3482439

Wen H, Qiu H, Wang L, Cheng H, Li H (2025). Class Incremental Learning With Less Forgetting Direction and Equilibrium Point. IEEE Transactions on Circuits and Systems for Video Technology 35:2 (1150-1164). Online publication date: Feb-2025.
https://doi.org/10.1109/TCSVT.2024.3477951

Shan W, Zhang Y, Zhang X, Wang S, Zhou X, Ma S, Gao W (2024). Diffusion-Based Hypotheses Generation and Joint-Level Hypotheses Aggregation for 3D Human Pose Estimation. IEEE Transactions on Circuits and Systems for Video Technology 34:11 (10678-10691). Online publication date: Nov-2024.
https://doi.org/10.1109/TCSVT.2024.3415348

Xue Y, Po L, Yu W, Wu H, Xu X, Li K, Liu Y (2024). Self-Calibration Flow Guided Denoising Diffusion Model for Human Pose Transfer. IEEE Transactions on Circuits and Systems for Video Technology 34:9 (7896-7911). Online publication date: 1-Sep-2024.
https://dl.acm.org/doi/10.1109/TCSVT.2024.3382948

Wang T, Jin L, Wang Z, Li J, Li L, Zhao F, Cheng Y, Yuan L, Zhou L, Xing J, Zhao J (2024). SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (1824-1833). Online publication date: 16-Jun-2024.
https://doi.org/10.1109/CVPR52733.2024.00179

Yuan X, Cheng P, Han S (2024). Multi-supervision transformer combining bounding box and mask for data-limited pose estimation. Neurocomputing 571:C. Online publication date: 12-Apr-2024.
https://dl.acm.org/doi/10.1016/j.neucom.2023.127209

Dang Y, Yin J, Liu L, Ding P, Sun Y, Hu Y (2024). DHRNet: A Dual-path Hierarchical Relation Network for multi-person pose estimation. Knowledge-Based Systems 300 (112263). Online publication date: Sep-2024.
https://doi.org/10.1016/j.knosys.2024.112263
