Abstract
In this paper, we analyze various outside-in approaches for pose tracking and pose estimation of AR glasses. We first provide two frame-by-frame pose estimation approaches. The first one is a VGG-based CNN, while the second method is the state-of-the-art, ResNet-based AR glasses pose estimation method named GlassPoseRN. We then introduce LSTMs in the mentioned approaches to achieve AR glasses pose tracking. We compare methods with and without non-local blocks, which are theoretically promising for Pose Tracking as they consider non-local neighbor features in one image and among multiple images. We further include separable convolutions in some networks for comparison, which focus on maintaining the individual channels and therefore the triple images. We train and evaluate seven different algorithms on the HMDPose dataset. We observe a significant boost on the dataset from pose estimation to tracking approaches. Non-local blocks do not improve our performance further. The introduction of separable convolutions in our recurrent networks results in the best performance with an estimation error of 0.81\(^{\circ }\) in orientation and 4.46 mm in position. We reduce the error compared to the state-of-the-art by 76%. Our results suggest a promising approach for more immersive AR content for AR glasses in the car context, as high a 6-DoF pose accuracy improves the superimposition of the real world with virtual elements.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Berg, A., Oskarsson, M., O’Connor, M.: Deep ordinal regression with label diversity. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2740–2747 (2021)
Borghi, G., Fabbri, M., Vezzani, R., Calderara, S., Cucchiara, R.: Face-from-depth for head pose estimation on depth images. IEEE Trans. Pattern Anal. Mach. Intell. 42(3), 596–609 (2018)
Borghi, G., Gasparini, R., Vezzani, R., Cucchiara, R.: Embedded recurrent network for head pose estimation in car. In: 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 1503–1508. IEEE (2017)
Borghi, G., Venturelli, M., Vezzani, R., Cucchiara, R.: Poseidon: face-from-depth for driver pose estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
Capellen, C., Schwarz, M., Behnke, S.: ConvPoseCNN: dense convolutional 6D object pose estimation, pp. 162–172 (2020)
Chen, B., Parra, A., Cao, J., Li, N., Chin, T.J.: End-to-end learnable geometric vision by backpropagating PnP optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8100–8109 (2020)
Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807 (2017)
Costante, G., Mancini, M.: Uncertainty estimation for data-driven visual odometry. IEEE Trans. Rob. 36(6), 1738–1757 (2020)
Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766 (2015)
Fanelli, G., Dantone, M., Gall, J., Fossati, A., Gool, L.: Random forests for real time 3D face analysis. Int. J. Comput. Vision 101(3), 437–458 (2013)
Firintepe, A., Mohamed, S., Pagani, A., Stricker, D.: The more, the merrier? A study on in-car IR-based head pose estimation. In: 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE (2020)
Firintepe, A., Pagani, A., Stricker, D.: HMDPose: a large-scale trinocular IR augmented reality glasses pose dataset. In: 26th ACM Symposium on Virtual Reality Software and Technology. ACM (2020)
Firintepe, A., Pagani, A., Stricker, D.: A comparison of single and multi-view IR image-based AR glasses pose estimation approaches. In: 2021 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pp. 571–572 (2021)
Firintepe, A., Vey, C., Asteriadis, S., Pagani, A., Stricker, D.: From IR images to point clouds to pose: point cloud-based AR glasses pose estimation. J. Imag. 7(5) (2021). https://www.mdpi.com/2313-433X/7/5/80
Gao, G., Lauri, M., Wang, Y., Hu, X., Zhang, J., Frintrop, S.: 6D object pose regression via supervised learning on point clouds. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3643–3649 (2020)
Gu, J., Yang, X., De Mello, S., Kautz, J.: Dynamic facial analysis: from Bayesian filtering to recurrent neural network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1531–1540 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, June 2016
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946 (2015)
Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization, pp. 2938–2946, December 2015
Li, Y., Wang, G., Ji, X., Xiang, Yu., Fox, D.: DeepIM: deep iterative matching for 6D pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 695–711. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_42
Li, Z., Wang, G., Ji, X.: CDPN: coordinates-based disentangled pose network for real-time RGB-Based 6-DoF object pose estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7677–7686 (2019)
Murphy-Chutorian, E., Trivedi, M.M.: Head pose estimation and augmented reality tracking: an integrated system and evaluation for monitoring driver awareness. IEEE Trans. Intell. Transp. Syst. 11(2), 300–311 (2010)
Ning, G., et al.: Spatially supervised recurrent convolutional neural networks for visual object tracking, pp. 1–4 (2017)
Park, K., Patten, T., Vincze, M.: Pix2Pose: pixel-wise coordinate regression of objects for 6d pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7668–7677 (2019)
Peng, S., Liu, Y., Huang, Q., Zhou, X., Bao, H.: PVNet: pixel-wise voting network for 6DoF pose estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
Peng, X., Feris, R.S., Wang, X., Metaxas, D.N.: A recurrent encoder-decoder network for sequential face alignment. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 38–56. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_3
Rad, M., Lepetit, V.: BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3848–3856, October 2017
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
Schwarz, A., Haurilet, M., Martinez, M., Stiefelhagen, R.: DriveAHead-a large-scale driver head pose dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–10, July 2017
Selim, M., Firintepe, A., Pagani, A., Stricker, D.: AutoPOSE: large-scale automotive driver head pose and gaze dataset with deep head pose baseline. In: International Conference on Computer Vision Theory and Applications (VISAPP). SCITEPRESS Digital Library (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015)
Song, C., Song, J., Huang, Q.: HybridPose: 6D object pose estimation under hybrid representations. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 428–437 (2020)
Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6D object pose prediction. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 292–301, June 2018
Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., Birchfield, S.: Deep object pose estimation for semantic robotic grasping of household objects. In: Proceedings of the 2nd Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 87, pp. 306–316. PMLR, 29–31 October 2018
Wang, S., Clark, R., Wen, H., Trigoni, N.: DeepVO: towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043–2050 (2017)
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. In: Kress-Gazit, H., Srinivasa, S.S., Howard, T., Atanasov, N. (eds.) Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, 26–30 June 2018 (2018)
Xu, Z., Chen, K., Jia, K.: W-PoseNet: dense correspondence regularized pixel pair pose regression. arXiv preprint arXiv:1912.11888 (2019)
Zakharov, S., Shugurov, I., Ilic, S.: DPOD: 6D pose object detector and refiner. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1941–1950 (2019)
Zhang, Y., Ming, Y., Zhang, R.: Object detection and tracking based on recurrent neural networks. In: 2018 14th IEEE International Conference on Signal Processing (ICSP), pp. 338–343. IEEE (2018)
Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., Chandraker, M.: Learning monocular visual odometry via self-supervised long-term modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 710–727. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_42
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Firintepe, A., Habib, S., Pagani, A., Stricker, D. (2021). Pose Tracking vs. Pose Estimation of AR Glasses with Convolutional, Recurrent, and Non-local Neural Networks: A Comparison. In: Bourdot, P., Alcañiz Raya, M., Figueroa, P., Interrante, V., Kuhlen, T.W., Reiners, D. (eds) Virtual Reality and Mixed Reality. EuroXR 2021. Lecture Notes in Computer Science(), vol 13105. Springer, Cham. https://doi.org/10.1007/978-3-030-90739-6_6
Download citation
DOI: https://doi.org/10.1007/978-3-030-90739-6_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90738-9
Online ISBN: 978-3-030-90739-6
eBook Packages: Computer ScienceComputer Science (R0)