Abstract
In this paper, we consider visual localization from two complementary viewpoints, active and passive, with a simple distinction: active localization helps a device estimate the location of an object of interest, whereas passive localization helps a device estimate its own location in the environment. Aiming to offer insights into visual localization, we carried out two explorations on active localization and, more importantly, upgraded each of them to passive localization by exploiting additional geometric information. To produce unconstrained and accurate 2D location estimates of an object of interest, we built an active localization system that fuses detection, tracking, and recognition; building on recognition, we proposed a collaborative strategy that lets detection and tracking enhance each other and thereby improves 2D location estimation. To actively estimate the semantic location of a visual region of interest, we adopted recent lightweight CNN models designed for efficiency and trained two of them on a large place dataset for scene recognition. Furthermore, using the depth information available from an RGB-D camera, we upgraded the active system for 2D object location into a passive system that estimates the device's 3D location relative to the object of interest: the 3D location of the object is first estimated in the device's coordinate system, and the relative location of the device in the world coordinate system is then deduced under appropriate assumptions. Evaluations, both qualitative on an RGB-D sequence recorded in a lab environment and practical on a robotic platform in an office environment, indicate that the upgraded system is suitable for an autonomous following robot. Likewise, the active system for coarse semantic location estimation was promoted to a passive system for fine localization of the device itself, given a 3D map of the previously visited environment: from the perspective of place recognition, one of the efficient CNN models trained earlier for semantic location estimation serves as a base to generate CNN features, which are used both to retrieve candidate loops in the map and to check their geometric consistency, and the verified loops are then used to deduce the fine location of the device in the environment. Comparison with state-of-the-art results shows that the promoted system is adequate for long-term robotic autonomy. The favorable performance of these four explorations supports the insights into visual localization presented in this paper.
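As a concrete illustration of the upgrade from 2D to 3D object location described above, the following minimal Python sketch back-projects a tracked object's 2D image position and its RGB-D depth reading into a 3D point in the device's camera frame using the standard pinhole model; the intrinsics and pixel values are illustrative assumptions and are not taken from the paper.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth (metres) to a 3D point in the camera frame."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Example: centre of a detected/tracked bounding box with a 2.4 m depth reading;
# intrinsics below are typical for a VGA RGB-D sensor (assumed, not from the paper).
if __name__ == "__main__":
    X, Y, Z = backproject(u=412.0, v=238.0, depth=2.4,
                          fx=525.0, fy=525.0, cx=319.5, cy=239.5)
    # The device's location relative to the object is the negated vector (-X, -Y, -Z),
    # optionally rotated into a world frame under the paper's stated assumptions.
    print(X, Y, Z)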
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number 15K16024. We gratefully acknowledge Intel China Lab and Beijing Qfeel Technology Co., Ltd., China for equipment support.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Yang, Y., Wu, Y. & Chen, N. Explorations on visual localization from active to passive. Multimed Tools Appl 78, 2269–2309 (2019). https://doi.org/10.1007/s11042-018-6347-0