Pose Tracking vs. Pose Estimation of AR Glasses with Convolutional, Recurrent, and Non-local Neural Networks: A Comparison

Firintepe, Ahmet; Habib, Sarfaraz; Pagani, Alain; Stricker, Didier

doi:10.1007/978-3-030-90739-6_6

Ahmet Firintepe¹⁴,¹⁵, Sarfaraz Habib¹⁴, Alain Pagani¹⁶ & Didier Stricker¹⁵,¹⁶

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13105)

Included in the following conference series:

International Conference on Virtual Reality and Mixed Reality

Abstract

In this paper, we analyze various outside-in approaches for pose tracking and pose estimation of AR glasses. We first provide two frame-by-frame pose estimation approaches. The first is a VGG-based CNN, while the second is the state-of-the-art, ResNet-based AR glasses pose estimation method named GlassPoseRN. We then introduce LSTMs into these approaches to achieve AR glasses pose tracking. We compare methods with and without non-local blocks, which are theoretically promising for pose tracking as they consider non-local neighbor features within one image and among multiple images. We further include separable convolutions in some networks for comparison, which focus on maintaining the individual channels and therefore the triple images. We train and evaluate seven different algorithms on the HMDPose dataset. We observe a significant boost on the dataset from pose estimation to tracking approaches. Non-local blocks do not improve our performance further. The introduction of separable convolutions in our recurrent networks results in the best performance, with an estimation error of 0.81° in orientation and 4.46 mm in position. We reduce the error compared to the state of the art by 76%. Our results suggest a promising approach for more immersive AR content for AR glasses in the car context, as a high 6-DoF pose accuracy improves the superimposition of the real world with virtual elements.
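
To illustrate why separable convolutions can keep the images of a triple apart, the following is a minimal pure-Python sketch of a generic depthwise separable convolution. It is not the authors' network: it assumes the three views are stacked as three input channels, and the function names, toy 3x3 "views", and kernel values are hypothetical. The depthwise step filters each channel with its own kernel, so no view influences another until the 1x1 pointwise step mixes them.

```python
def depthwise_conv2d(image, kernels):
    """Valid cross-correlation applied to each channel independently.

    image:   list of C channels, each an H x W list of lists
    kernels: list of C kernels (k x k lists), one per channel
    """
    out = []
    for channel, kernel in zip(image, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [
                sum(channel[i + di][j + dj] * kernel[di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(w - k + 1)
            ]
            for i in range(h - k + 1)
        ])
    return out


def pointwise_conv2d(features, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels.

    weights: list of C_out rows, each holding C_in scalars
    """
    h, w = len(features[0]), len(features[0][0])
    return [
        [
            [sum(wt * ch[i][j] for wt, ch in zip(row, features))
             for j in range(w)]
            for i in range(h)
        ]
        for row in weights
    ]


def separable_conv2d(image, depthwise_kernels, pointwise_weights):
    """Depthwise step (per-channel filtering) followed by pointwise mixing."""
    return pointwise_conv2d(depthwise_conv2d(image, depthwise_kernels),
                            pointwise_weights)


# Hypothetical toy input: three 3x3 "views" stacked as channels.
views = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
]
kernels = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
    [[0, 0], [0, 1]],
]
# First output channel mixes all views; the second passes view 0 through alone.
mix = [[1, 1, 1], [1, 0, 0]]

result = separable_conv2d(views, kernels, mix)
print(result)  # [[[12, 14], [18, 20]], [[6, 8], [12, 14]]]
```

In a deep learning framework the same idea is usually expressed as a grouped convolution followed by a 1x1 convolution; the list-based version above only makes the per-channel independence explicit.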

Author information

Authors and Affiliations

BMW Group Research, New Technologies, Innovations, Garching (Munich), Germany
Ahmet Firintepe & Sarfaraz Habib
TU Kaiserslautern, Kaiserslautern, Germany
Ahmet Firintepe & Didier Stricker
German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
Alain Pagani & Didier Stricker

Corresponding author

Correspondence to Ahmet Firintepe.

Editor information

Editors and Affiliations

University Paris-Saclay, Orsay, France
Patrick Bourdot
Universitat Politècnica de València, Valencia, Valencia, Spain
Mariano Alcañiz Raya
Los Andes University, Bogota, Colombia
Pablo Figueroa
University of Minnesota, Minneapolis, MN, USA
Victoria Interrante
RWTH Aachen University, Aachen, Nordrhein-Westfalen, Germany
Torsten W. Kuhlen
University of Central Florida, Orlando, FL, USA
Dirk Reiners

About this paper

Cite this paper

Firintepe, A., Habib, S., Pagani, A., Stricker, D. (2021). Pose Tracking vs. Pose Estimation of AR Glasses with Convolutional, Recurrent, and Non-local Neural Networks: A Comparison. In: Bourdot, P., Alcañiz Raya, M., Figueroa, P., Interrante, V., Kuhlen, T.W., Reiners, D. (eds) Virtual Reality and Mixed Reality. EuroXR 2021. Lecture Notes in Computer Science, vol 13105. Springer, Cham. https://doi.org/10.1007/978-3-030-90739-6_6

DOI: https://doi.org/10.1007/978-3-030-90739-6_6
Published: 17 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90738-9
Online ISBN: 978-3-030-90739-6
eBook Packages: Computer Science, Computer Science (R0)
