Abstract
High-level structure (HLS) extraction recovers 3D elements of human-made surfaces (objects, buildings, ground, etc.). Most existing approaches to HLS extraction process two or more images captured from different camera views, or process 3D data in the form of point clouds extracted from camera images. In general, point-cloud and multi-view approaches perform well on scenes captured as video or image sequences, but they require sufficient parallax to guarantee accuracy. To address this problem, an alternative is to process a single RGB image, interpreting the areas of the image where human-made structure may be observed; this removes the parallax dependency but adds the challenge of resolving image ambiguities correctly. Motivated by the latter, we propose a methodology for 3D volumetric structure extraction from a single image. Our strategy is to divide and simplify the 3D structure extraction process into three steps. First, a structure recognition step provides the segmentation, location, and delimitation of the urbanized structures in the scene. Second, a graph analysis classifies and locates the boundaries between the different urbanized structures. Third, a proposed CNN and the pinhole camera model extract the 3D volumetric structure. Finally, we evaluate this methodology on synthetic and public datasets.
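The third step relies on the standard pinhole camera model. As a minimal sketch of the back-projection this model implies (standard geometry, not the authors' implementation; the function name and intrinsic values below are illustrative assumptions), a pixel with an estimated depth maps to a camera-frame 3D point as follows:

import numpy as np

def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    # Standard pinhole back-projection: pixel (u, v) with depth Z maps to
    # X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  and Z itself.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Illustrative intrinsics for a 640x480 camera (not values from the paper).
fx = fy = 525.0
cx, cy = 320.0, 240.0
point_3d = backproject_pixel(400, 300, depth=5.0, fx=fx, fy=fy, cx=cx, cy=cy)

Applied to every pixel of a segmented structure for which a depth estimate is available, this back-projection yields the volumetric elements the methodology targets.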
Notes
Parallax is defined as the angle subtended by an object's apparent displacement between the images of a sequence captured from different camera positions.
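As a standard geometric illustration (textbook material, not taken from the paper): a camera translating by a baseline $b$ while observing a point at depth $Z$ produces a parallax angle of approximately

$$\theta \approx \frac{b}{Z}, \qquad d = \frac{f\,b}{Z},$$

where $d$ is the induced pixel displacement for a focal length $f$ expressed in pixels. When $b$ is small relative to $Z$, $d$ vanishes, which is why multi-view reconstruction needs sufficient parallax to remain accurate.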
Cite this article
de Jesús Osuna-Coutiño, J.A., Martinez-Carranza, J. Volumetric structure extraction in a single image. Vis Comput 38, 2899–2921 (2022). https://doi.org/10.1007/s00371-021-02163-w