Abstract
We propose a new representation of distance information that is independent of any specific acquisition device, based on the size of the portrayed subjects. In this alternative description, each pixel of an image is associated with the real-life size of what it depicts. With the proposed representation, datasets acquired with different devices can be effortlessly combined to build more powerful models, and monocular distance estimation can be performed on images acquired with devices never seen during training. To assess the advantages of the proposed representation, we used it to train a fully convolutional neural network that predicts, with pixel precision, the size of the subjects depicted in the image as a proxy for their distance. Experimental results show that our representation, by allowing the combination of heterogeneous training datasets, enables the trained network to achieve better results at test time.
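To make the per-pixel size representation concrete: under a pinhole camera model, a pixel at depth Z metres, seen through a lens with focal length f expressed in pixels, spans roughly Z / f metres in the scene. The minimal sketch below shows how a metric depth map and the camera focal length could be folded into such a size map; the pinhole assumption, function name, and numeric values are illustrative only and are not taken from the paper.

import numpy as np

def depth_to_size_map(depth_m: np.ndarray, focal_px: float) -> np.ndarray:
    """Turn a metric depth map into a per-pixel real-world size map.

    Pinhole-camera sketch (an assumption, not the paper's exact recipe):
    a pixel at depth Z metres spans about Z / f metres in the scene,
    where f is the focal length in pixels. Folding f in removes the
    dependence on the specific acquisition device, so size maps from
    different cameras can be mixed in a single training set.
    """
    return depth_m / focal_px

# Illustrative values: a 480x640 depth map with everything 2 m away,
# captured with a hypothetical 525-pixel focal length.
depth = np.full((480, 640), 2.0)
size_map = depth_to_size_map(depth, 525.0)
print(size_map[0, 0])  # ~0.0038 m: real-world extent covered by one pixel

Because the focal length is absorbed into the map, two images of the same scene taken with different cameras yield the same per-pixel sizes, which is what allows heterogeneous datasets to be combined.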









Notes
All sizes refer to linear size, not surface size, so they can be either heights or widths. In the experiments of this paper we always use width measures.
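As a worked example of this width convention: under the same pinhole assumption used above, the linear width of a subject follows from its width in pixels, its distance, and the focal length. The helper and the numeric values below are hypothetical, for illustration only.

def object_width_m(width_px: float, depth_m: float, focal_px: float) -> float:
    """Linear (not surface) width of a subject, in metres.

    Pinhole relation: W = w_px * Z / f, with focal length f in pixels.
    Illustrative sketch; values are not taken from the paper.
    """
    return width_px * depth_m / focal_px

# A subject 90 px wide, 3 m from a camera with a 540-px focal length:
print(object_width_m(90, 3.0, 540.0))  # 0.5 m of real-world width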
Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
Cite this article
Bianco, S., Buzzelli, M. & Schettini, R. A unifying representation for pixel-precise distance estimation. Multimed Tools Appl 78, 13767–13786 (2019). https://doi.org/10.1007/s11042-018-6568-2