Abstract
Purpose
For many years, deep convolutional neural networks have achieved state-of-the-art results on a wide variety of computer vision tasks. 3D human pose estimation is no exception, and results on public benchmarks are impressive. However, specialized domains such as operating rooms pose additional challenges: clinical settings involve severe occlusions, clutter and difficult lighting conditions, and privacy concerns of patients and staff make it necessary to use unidentifiable data. In this work, we aim to bring robust human pose estimation to the clinical domain.
Methods
We propose a 2D–3D information fusion framework that makes use of a network of multiple depth cameras and strong pose priors. In a first step, probabilities of 2D joint positions are predicted from single depth images. This information is then fused in a shared voxel space, yielding a rough estimate of the 3D pose. Final joint positions are obtained by regressing into the latent pose space of a pre-trained convolutional autoencoder.
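The fusion step described above can be sketched as follows: per-view 2D joint probability maps are back-projected into a shared voxel grid using the depth values, and the per-view evidence is accumulated so that views agreeing on a 3D location reinforce each other. This is a minimal illustrative sketch assuming a pinhole camera model; all function names, camera parameters and the toy two-camera setup are assumptions, not the paper's actual implementation.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into camera space."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def fuse_views(prob_maps, depth_maps, intrinsics, poses,
               grid_shape, grid_min, voxel_size):
    """Accumulate per-view 2D joint probabilities in a shared voxel grid.

    prob_maps : list of (H, W) joint probability maps, one per camera
    depth_maps: list of (H, W) depth images (metres)
    intrinsics: list of (fx, fy, cx, cy) tuples
    poses     : list of (4, 4) camera-to-world transforms
    """
    voxels = np.zeros(grid_shape)
    for probs, depth, (fx, fy, cx, cy), T in zip(prob_maps, depth_maps,
                                                 intrinsics, poses):
        H, W = probs.shape
        for v in range(H):
            for u in range(W):
                z = depth[v, u]
                if z <= 0:  # invalid depth measurement
                    continue
                p_cam = backproject(u, v, z, fx, fy, cx, cy)
                p_world = (T @ np.append(p_cam, 1.0))[:3]
                idx = np.floor((p_world - grid_min) / voxel_size).astype(int)
                if np.all(idx >= 0) and np.all(idx < np.array(grid_shape)):
                    voxels[tuple(idx)] += probs[v, u]
    return voxels

# Toy example: two cameras at the same pose, each detecting a joint at the
# image centre; their evidence sums in the corresponding voxel.
H = W = 5
intr = [(10.0, 10.0, 2.0, 2.0)] * 2
poses = [np.eye(4), np.eye(4)]
depth = np.full((H, W), 1.0)
probs = np.zeros((H, W))
probs[2, 2] = 1.0  # joint detected at image centre
voxels = fuse_views([probs, probs], [depth, depth], intr, poses,
                    grid_shape=(4, 4, 4),
                    grid_min=np.array([-1.0, -1.0, 0.0]),
                    voxel_size=0.5)
peak = np.unravel_index(np.argmax(voxels), voxels.shape)
print(peak)  # -> (2, 2, 2): rough 3D joint location in the voxel grid
```

In the paper the fused voxel volume then serves only as a rough estimate; the final, anatomically plausible pose is obtained by regressing it into the latent space of the pre-trained convolutional autoencoder.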
Results
We evaluate our approach against several baselines on the challenging MVOR dataset. Best results are obtained when fusing 2D information from multiple views and constraining the predictions with learned pose priors.
Conclusions
We present a robust 3D human pose estimation framework based on a multi-depth-camera network in the operating room. Using depth images as the only input modality makes our approach especially interesting for clinical applications, as the anonymity of patients and staff is preserved.
Acknowledgements
We would like to thank the reviewers for their many insightful comments and suggestions, which helped to improve our paper. We gratefully acknowledge the support of NVIDIA Corporation through their GPU donations for this research.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no relevant conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Statement of informed consent was not applicable since the manuscript does not contain any participants’ data.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hansen, L., Siebert, M., Diesel, J. et al. Fusing information from multiple 2D depth cameras for 3D human pose estimation in the operating room. Int J CARS 14, 1871–1879 (2019). https://doi.org/10.1007/s11548-019-02044-7