research-article

Improving Multiperson Pose Estimation by Mask-aware Deep Reinforcement Learning

Authors:

Judith Gelernter,

Wei Hu

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 16, Issue 3

Article No.: 84, Pages 1 - 18

https://doi.org/10.1145/3397340

Published: 05 July 2020

Abstract

Research on single-person pose estimation based on deep neural networks has recently progressed in both accuracy and execution efficiency. However, multiperson pose estimation remains challenging, partly because the object regions are selected greedily from proposals via class-agnostic nonmaximum suppression (NMS), and misalignment among the redundant detections yields inaccurate human poses. We therefore consider how to obtain the optimal input for human pose estimation when intermediate label information is not available. Because supervised learning–based alignment does not generalize well to unseen samples in the human pose space, in this article we present a mask-aware deep reinforcement learning approach to modify the detection result. We use mask information to remove the adverse effects of the cluttered background and to select the optimal action according to the revised reward function. We also propose a new regularization term that penalizes joints lying outside the silhouette region in the human pose estimation stage. We evaluate our approach on the MPII Multiperson dataset and the MS-COCO Keypoints Challenge. The results show that our approach yields inference results competitive with other state-of-the-art approaches.
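
As a rough illustration of two ideas the abstract mentions, the sketch below shows (1) standard greedy NMS over scored boxes and (2) one possible way to score joints that fall outside a silhouette mask. It is not the authors' implementation: the abstract does not give the form of the regularization term, so `outside_mask_penalty`, the box format `(x1, y1, x2, y2)`, the binary-mask layout `mask[y][x]`, and the 0.5 IoU threshold are all assumptions made here for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap any already-kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def outside_mask_penalty(
    joints: Sequence[Tuple[int, int]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Hypothetical penalty: the fraction of (x, y) joints that land outside
    the binary silhouette mask (1 = person, 0 = background)."""
    if not joints:
        return 0.0
    height = len(mask)
    width = len(mask[0]) if height else 0
    outside = 0
    for x, y in joints:
        inside = 0 <= y < height and 0 <= x < width and mask[y][x] == 1
        if not inside:
            outside += 1
    return outside / len(joints)


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))

    mask = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
    joints = [(1, 0), (1, 2), (0, 1), (5, 5)]
    print(outside_mask_penalty(joints, mask))
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so greedy NMS drops it even though it might be the better-aligned detection. That is the kind of greedy selection the abstract identifies as a source of inaccurate poses.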


Cited By

Alwaely B, Abhayaratne C (2022). GHOSM: Graph-based Hybrid Outline and Skeleton Modelling for Shape Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications. Online publication date: 4-Aug-2022.
https://doi.org/10.1145/3554922
Sogabe T, Chen C, Malla D, Sakamoto K (2022). Attention and Masking embedded Ensemble Reinforcement Learning for Smart Energy Optimization and Risk Evaluation under Uncertainties. Journal of Renewable and Sustainable Energy. Online publication date: 20-Jun-2022.
https://doi.org/10.1063/5.0097344
Han Q, Wang H, Yang L, Wu M, Kou J, Du Q, Li N (2020). Real-time adversarial GAN-based abnormal crowd behavior detection. Journal of Real-Time Image Processing. Online publication date: 31-Oct-2020.
https://doi.org/10.1007/s11554-020-01029-z

Index Terms

Improving Multiperson Pose Estimation by Mask-aware Deep Reinforcement Learning
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 16, Issue 3

August 2020

364 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3409646

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 July 2020

Online AM: 07 May 2020

Accepted: 01 April 2020

Revised: 01 March 2020

Received: 01 July 2019

Published in TOMM Volume 16, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Natural Science Foundation of China
Natural Science Foundation of Zhejiang Province
Key R&D Program of Zhejiang Province

Bibliometrics

Total Citations: 3
Total Downloads: 230
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 2

Reflects downloads up to 07 Mar 2025