
Explorations on visual localization from active to passive

Multimedia Tools and Applications

Abstract

In this paper, we consider visual localization in two complementary forms: active localization, in which a device estimates the location of an object of interest, and passive localization, in which a device estimates its own location in the environment. To offer insights into visual localization, we carried out two explorations of active localization and, more importantly, upgraded each of them to passive localization using additional geometric information. First, to obtain unconstrained and accurate 2D location estimates of an object of interest, we built an active localization system that fuses detection, tracking, and recognition; on top of recognition, we proposed a collaborative strategy that allows detection and tracking to enhance each other, improving 2D location estimation. Second, to actively estimate the semantic location of a visual region of interest, we adopted state-of-the-art lightweight CNN models designed for efficiency and trained two of them on a large place dataset for scene recognition. Third, using the depth information provided by an RGB-D camera, we upgraded the active 2D object-localization system into a passive system that estimates the 3D location of the device relative to the object of interest: the object's 3D location is first estimated in the device's coordinate system, and the device's relative location in the world coordinate system is then deduced under an appropriate assumption. Evaluations on an RGB-D sequence captured in a lab environment and on a robotic platform in an office environment indicate that this system is suitable for an autonomous following robot. Fourth, given a 3D map describing the visited environment, we promoted the active semantic-location system into a passive system for fine localization of the device: one of the previously trained efficient CNN models generates features both for retrieving candidate loops in the map and for checking the geometric consistency of the retrieved loops, and the verified loops are then used to deduce the fine location of the device in the environment. Comparisons with state-of-the-art results show that this system is adequate for long-term robotic autonomy. Taken together, the favorable performance of the four explorations supports the insights they offer into visual localization.
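
The following is a minimal sketch of how such a recognition-arbitrated collaboration between a detector and a tracker can be organised: the tracker runs every frame, the detector runs periodically, and a recognition score decides whether a detection should re-anchor the tracker. The `detector`, `tracker`, and `recognizer` callables, the update interval, and the score threshold are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def fuse_boxes(det_box, trk_box, det_conf, trk_conf):
    """Confidence-weighted average of two [x, y, w, h] boxes."""
    w = det_conf / (det_conf + trk_conf)
    return w * np.asarray(det_box, dtype=float) + (1.0 - w) * np.asarray(trk_box, dtype=float)

def collaborative_step(frame_idx, detector, tracker, recognizer,
                       detect_every=10, min_recog_score=0.5):
    """One frame of a hypothetical detection/tracking collaboration loop."""
    trk_box, trk_conf = tracker.update()          # tracking runs every frame
    if frame_idx % detect_every == 0:             # detection runs periodically
        det_box, det_conf = detector.detect()
        # Recognition arbitrates: only a detection the recognizer accepts
        # is allowed to correct the tracker's drift.
        if det_box is not None and recognizer.score(det_box) >= min_recog_score:
            fused = fuse_boxes(det_box, trk_box, det_conf, trk_conf)
            tracker.reinit(fused)                 # detection corrects tracking
            return fused
    return trk_box                                # tracking bridges detector gaps
```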
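
The upgrade from a 2D object location to the device's relative 3D location rests on standard pinhole back-projection of a pixel with known depth. Below is a minimal sketch; the intrinsics, the bounding-box centre, and the device orientation `R` are assumed example values rather than the paper's calibration.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    z = depth_m
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Illustrative intrinsics for a VGA RGB-D sensor (assumed values).
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5

# Centre of the tracked object's bounding box and its depth in metres.
u, v, depth = 400.0, 260.0, 1.8

p_object_cam = backproject(u, v, depth, fx, fy, cx, cy)

# With the device orientation R in the world frame known (or assumed,
# e.g. a level-mounted camera), the device's location relative to the
# object is the negated offset rotated into the world frame.
R = np.eye(3)                      # placeholder orientation
p_device_rel = -(R @ p_object_cam)
print(p_object_cam, p_device_rel)
```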
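
Likewise, the passive fine-localization step can be sketched as nearest-neighbour retrieval over L2-normalised CNN features followed by a geometric consistency gate. The feature dimensionality, the cosine-similarity ranking, and the inlier threshold below are illustrative assumptions standing in for the CNN features and the consistency check used in the paper.

```python
import numpy as np

def retrieve_candidate_loops(query_feat, map_feats, top_k=5):
    """Rank map keyframes by cosine similarity of L2-normalised features."""
    q = query_feat / np.linalg.norm(query_feat)
    m = map_feats / np.linalg.norm(map_feats, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

def geometric_consistency(inlier_count, min_inliers=20):
    """Stand-in gate: accept a candidate loop only if enough feature
    matches survive geometric verification (e.g. a RANSAC stage)."""
    return inlier_count >= min_inliers

# Toy map of 100 keyframes with 512-D features (random stand-ins); the
# query is a perturbed copy of keyframe 42, which should rank first.
rng = np.random.default_rng(0)
map_feats = rng.normal(size=(100, 512))
query = map_feats[42] + 0.05 * rng.normal(size=512)

candidates, scores = retrieve_candidate_loops(query, map_feats)
print(candidates, scores)
```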



Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 15K16024. We gratefully acknowledge Intel China Lab and Beijing Qfeel Technology Co., Ltd., China for equipment support.

Author information

Corresponding author

Correspondence to Yongquan Yang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1

(MP4 9,095 kb)

ESM 2

(MP4 10,062 kb)


About this article


Cite this article

Yang, Y., Wu, Y. & Chen, N. Explorations on visual localization from active to passive. Multimed Tools Appl 78, 2269–2309 (2019). https://doi.org/10.1007/s11042-018-6347-0


Keywords

Navigation