Improving Multiperson Pose Estimation by Mask-aware Deep Reinforcement Learning

Published: 05 July 2020

Abstract

Research on single-person pose estimation based on deep neural networks has recently progressed in both accuracy and execution efficiency. Multiperson pose estimation, however, remains challenging, partly because object regions are selected greedily from proposals via class-agnostic nonmaximum suppression (NMS), and misalignment in the redundant detections yields inaccurate human poses. We therefore consider how to obtain the optimal input for human pose estimation when intermediate label information is not available. Because supervised learning–based alignment does not generalize well to unseen samples in the human pose space, in this article we present a mask-aware deep reinforcement learning approach that modifies the detection result. We use mask information to remove the adverse effects of cluttered background and to select the optimal action according to the revised reward function. We also propose a new regularization term that penalizes joints lying outside the silhouette region in the human pose estimation stage. We evaluate our approach on the MPII Multiperson dataset and the MS-COCO Keypoints Challenge. The results show that our approach yields competitive inference results compared with other state-of-the-art approaches.
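As a rough illustration of two ideas named in the abstract, the sketch below shows (a) standard greedy, class-agnostic NMS over scored boxes and (b) one possible way to penalize joints that fall outside a binary silhouette mask. The abstract does not give the form of the regularization term, so `outside_mask_penalty` (the fraction of joints outside the mask) is a hypothetical stand-in, not the authors' formulation. The function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 IoU threshold are likewise assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already-kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def outside_mask_penalty(joints: Sequence[Tuple[int, int]],
                         mask: Sequence[Sequence[int]]) -> float:
    """Hypothetical penalty: fraction of (x, y) joints that land outside the
    silhouette, where mask[y][x] == 1 marks a silhouette pixel."""
    if not joints:
        return 0.0
    height = len(mask)
    width = len(mask[0]) if height else 0
    outside = 0
    for x, y in joints:
        inside = 0 <= y < height and 0 <= x < width and mask[y][x] == 1
        if not inside:
            outside += 1
    return outside / len(joints)
```

Because the greedy step keeps whichever overlapping box scores highest, a slightly misaligned detection can win over a better-aligned one, which is the failure mode the abstract attributes to NMS. A penalty of this kind would typically be added to the pose loss with a weighting coefficient; that weighting is not specified in the abstract.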

Cited By

  • (2022) GHOSM: Graph-based Hybrid Outline and Skeleton Modelling for Shape Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3554922. Online publication date: 4-Aug-2022
  • (2022) Attention and Masking embedded Ensemble Reinforcement Learning for Smart Energy Optimization and Risk Evaluation under Uncertainties. Journal of Renewable and Sustainable Energy. DOI: 10.1063/5.0097344. Online publication date: 20-Jun-2022
  • (2020) Real-time adversarial GAN-based abnormal crowd behavior detection. Journal of Real-Time Image Processing. DOI: 10.1007/s11554-020-01029-z. Online publication date: 31-Oct-2020

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 3
August 2020
364 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3409646
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 July 2020
Online AM: 07 May 2020
Accepted: 01 April 2020
Revised: 01 March 2020
Received: 01 July 2019
Published in TOMM Volume 16, Issue 3

Author Tags

  1. Computer vision
  2. deep learning
  3. regularization
  4. reinforcement learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Zhejiang Province
  • Key R&D Program of Zhejiang Province
