DOI: 10.1145/3475723.3484252
research-article
Open access

Learning Positional Priors for Pretraining 2D Pose Estimators

Published: 25 November 2021

Abstract

The goal of 2D human pose estimation is to locate the keypoints of body parts in 2D images. State-of-the-art methods usually construct pixel-wise heatmaps from keypoints as labels for training neural networks, whose backbones are usually initialized either randomly or from classification models trained on large datasets such as ImageNet. According to statistical data, human keypoints have strong positional priors, which depend heavily on the relationships between image patches. To learn positional priors for pretraining pose estimators, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as a self-supervised pretext task, whose goal is to predict the location of each patch in an image composed of shuffled patches. During pretraining, we use only the person images in MS-COCO, rather than introducing an extra large dataset such as ImageNet. We design a heatmap-style label for patch location, and the learning process is non-contrastive. The weights learned on the HSJP pretext task are used as the backbones of 2D human pose estimators, which are then fine-tuned on the MS-COCO human keypoints dataset. With two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate the mAP score on both the MS-COCO validation and test-dev sets. Our experiments show that downstream pose estimators with our self-supervised pretraining perform much better than those trained from scratch and are comparable to those that use ImageNet classification models as their initial backbones.
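
To make the pretext task concrete, the following is a minimal sketch of how one training sample could be built: the image is cut into a grid of patches, the patches are shuffled, and each shuffled slot receives a Gaussian heatmap that peaks at the grid cell its patch originally came from. The abstract does not specify the grid size, the heatmap resolution, or the Gaussian width, so the function name `make_hsjp_sample`, the 3x3 grid, the grid-resolution heatmaps, and `sigma` are all illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def make_hsjp_sample(image, grid=(3, 3), sigma=0.5, rng=None):
    """Shuffle the patches of `image` and build one location heatmap per slot.

    Hypothetical helper: grid size, sigma, and heatmap resolution are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = grid
    h, w = image.shape[:2]
    ph, pw = h // rows, w // cols
    image = image[: ph * rows, : pw * cols]  # crop so the grid tiles exactly

    # Row-major patches: patches[k] comes from cell (k // cols, k % cols).
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(rows)
        for c in range(cols)
    ]

    # perm[slot] = original index of the patch placed in that slot.
    perm = rng.permutation(rows * cols)

    shuffled = np.zeros_like(image)
    heatmaps = np.zeros((rows * cols, rows, cols), dtype=np.float32)
    ys, xs = np.mgrid[0:rows, 0:cols]
    for slot, src in enumerate(perm):
        r, c = divmod(slot, cols)
        shuffled[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = patches[src]
        # Gaussian centred on the patch's original grid cell (peak value 1).
        src_r, src_c = divmod(int(src), cols)
        dist2 = (ys - src_r) ** 2 + (xs - src_c) ** 2
        heatmaps[slot] = np.exp(-dist2 / (2 * sigma ** 2))
    return shuffled, heatmaps, perm
```

In this sketch a network would take `shuffled` as input and regress `heatmaps`, which mirrors the heatmap labels used for keypoints in the downstream pose estimators; the loss and the network head are left out because the abstract does not describe them.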

Supplementary Material

MP4 File (HUMA21-fp3505.mp4)
Presentation of the paper "Learning Positional Priors for Pretraining 2D Pose Estimators"

Cited By

  • (2022) Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution. IEEE Transactions on Neural Networks and Learning Systems (1-15). DOI: 10.1109/TNNLS.2022.3186807. Online publication date: 2022

    Published In

    HUMA'21: Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis
    November 2021
    50 pages
    ISBN:9781450386715
    DOI:10.1145/3475723
    General Chairs: Wu Liu, Junbo Guo, John Smith
    Program Chairs: Xinchen Liu, Dingwen Zhang, Wenbing Huang
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. human pose estimation
    2. self-supervised pretraining

    Qualifiers

    • Research-article

    Funding Sources

    • Strategic Priority Research Program of the Chinese Academy of Sciences
    • Equipment PreResearch Fund

    Conference

    MM '21
    MM '21: ACM Multimedia Conference
    October 20, 2021
    Virtual Event, China
