Abstract
High-Level Structure (HLS) recognition locates elements on human-made surfaces (objects, buildings, ground, etc.). Several approaches to HLS recognition exist; however, most of them process 3D data in the form of point clouds extracted from camera images. In general, 3D point cloud approaches perform well for scenes captured as video or image sequences, but they require sufficient parallax to guarantee accuracy. To address this problem, an alternative is to process a single RGB image, seeking to interpret the image regions where human-made structure may be observed. This removes the parallax dependency but adds the challenge of interpreting image ambiguities correctly. Motivated by the latter, this work presents a novel methodology for HLS recognition from a single image using a CNN-superpixel approach. Our approach has three steps. First, the superpixel and centroid analysis obtains the superpixel to analyze and its corresponding RGB section, a portion of the input image that our CNN uses to assign a label. Second, the structure recognition step segments, locates, and delimits the urbanized structures in the scene. For that, we propose a CNN-superpixel configuration that combines the abstraction power of deep learning with the fast computational processing of superpixel segmentation. Third, the connectivity analysis refines each superpixel label by considering the connections between neighboring superpixels. Experimental results are encouraging: our approach performs well under real-world scenarios, and the proposed methodology is 6.53 to 12.18 times faster than previous work.
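The three steps above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it uses a toy square-grid segmentation as a stand-in for a real superpixel algorithm (e.g. SLIC), a dictionary of per-superpixel class labels as a stand-in for the CNN's output, and a majority vote over adjacent superpixels as one plausible form of the connectivity analysis. All function names and parameters are illustrative assumptions.

```python
import numpy as np

def grid_superpixels(h, w, cell=16):
    # Toy stand-in for a superpixel algorithm such as SLIC:
    # label the image in square cells of side `cell`.
    rows = np.arange(h) // cell
    cols = np.arange(w) // cell
    n_cols = (w + cell - 1) // cell
    return rows[:, None] * n_cols + cols[None, :]

def centroid_patch(image, labels, sp_id, size=32):
    # Step 1 (sketch): crop the RGB section centred on a superpixel's
    # centroid; this patch is what a CNN would classify in step 2.
    ys, xs = np.nonzero(labels == sp_id)
    cy, cx = int(ys.mean()), int(xs.mean())
    half = size // 2
    y0 = int(np.clip(cy - half, 0, image.shape[0] - size))
    x0 = int(np.clip(cx - half, 0, image.shape[1] - size))
    return image[y0:y0 + size, x0:x0 + size]

def relabel_by_neighbors(labels, sp_class):
    # Step 3 (sketch): replace each superpixel's class by the majority
    # class among its 4-connected neighbouring superpixels plus itself.
    neighbors = {s: set() for s in np.unique(labels)}
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b:
            neighbors[a].add(b); neighbors[b].add(a)
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b:
            neighbors[a].add(b); neighbors[b].add(a)
    out = {}
    for s, nbrs in neighbors.items():
        votes = [sp_class[s]] + [sp_class[n] for n in nbrs]
        out[s] = max(set(votes), key=votes.count)
    return out
```

For example, on a 64x64 image split into four 32x32 superpixels, a single superpixel labeled differently from all of its neighbors is overruled by the majority vote, which is the intuition behind replacing a superpixel label through the connectivity of its neighborhood.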
Notes
A ‡ symbol expresses a significant difference between our approach (CNN-SP+RGB) and the semantic segmentation approaches (GFL, ID3-Depth-1, ID3-Depth-2, CNN-SP+D, CNN-SP+DGT, and HLS-GNet)
References
Saxena A, Sun M, Ng AY (2009) Make3D: learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp 824–840. https://doi.org/10.1109/TPAMI.2008.132
Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Susstrunk S (2010) SLIC superpixels. EPFL
Aguilar-González A, Arias-Estrada M, Berry F (2019) Depth from a motion algorithm and a hardware architecture for smart cameras. Sensors MDPI. https://doi.org/10.3390/s19010053
Alhashim I, Wonka P (2019) High quality monocular depth estimation via transfer learning. arXiv:1812.11941
Mičušík B, Wildenauer H, Košecká J (2008) Detection and matching of rectilinear structures. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1–7
Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp 2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615
Chen T, Liu X, Feng R, Wang W, Yuan C, Lu W, He H, Gao H, Ying H, Chen DZ, Wu J (2021) Discriminative cervical lesion detection in Colposcopic images with global class activation and local bin excitation. IEEE Journal of Biomedical and Health Informatics (JBHI). https://doi.org/10.1109/JBHI.2021.3100367
Chen J, Ying H, Liu X, Gu J, Feng R, Chen T, Gao H, Wu J (2021) A transfer learning based super-resolution microscopy for biopsy slice images: the joint methods perspective. IEEE/ACM transactions on computational biology and Bioinformatics (TCBB) 18(1):103–113. https://doi.org/10.1109/TCBB.2020.2991173
Hoiem D, Efros AA, Hebert M (2007) Recovering surface layout from an image. International Journal of Computer Vision, pp 151–172. https://doi.org/10.1007/s11263-006-0031-y
Hoiem D, Efros AA, Hebert M (2005) Geometric context from a single image. IEEE International Conference on Computer Vision (ICCV), pp 654–661
E M, Y C, J M (2011) Single image augmented reality using planar structures in urban environments. In: Machine vision and image processing conference (IMVIP), pp 1–6
Everingham M, Eslami SMA, Gool LV, Williams CKI, Winn J, Zisserman A (2015) The pascal visual object classes challenge a retrospective. Int J Comput Vis 111:98–136. https://doi.org/10.1007/s11263-014-0733-5
Feng R, Liu X, Chen J, Chen DZ, Gao H, Wu J (2021) A deep learning approach for colonoscopy pathology WSI analysis: accurate segmentation and classification. IEEE J Biomed Health Inform 25(10):3700–3708. https://doi.org/10.1109/JBHI.2020.3040269
Gao H, Xu K, Cao M, Xiao J, Xu Q, Yin Y (2021) The deep features and attention mechanism based method to dish Healthcare under social IoT systems: an empirical study with a hand-deep local-global net. IEEE Transactions on Computational Social Systems (TCSS). https://doi.org/10.1109/TCSS.2021.3102591
Hashim HA (2021) A geometric nonlinear stochastic filter for simultaneous localization and mapping. Aerospace Science and Technology. https://doi.org/10.1016/j.ast.2021.106569
Hu MK (1962) Visual pattern recognition by moment invariants. IRE Transaction Information Theory, pp 179–187
Huang J, Zhou Y, Funkhouser T, Guibas L (2019) FrameNet: learning local canonical frames of 3D surfaces from a single rgb image. IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2019.00873
Joo K, Oh TH, Kim J, Kweon IS (2019) Robust and globally optimal manhattan frame estimation in near real time. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp 682–696. https://doi.org/10.1109/TPAMI.2018.2799944
Kang Z, Yang J, Yang Z, Cheng S (2020) A review of techniques for 3d reconstruction of indoor environments. International Journal of Geo-Information (ISPRS), pp 1–31. https://doi.org/10.3390/ijgi9050330
Kim P, Coltin B, Kim HJ (2018) Linear RGB-D SLAM for planar environments. European Conference on Computer Vision (ECCV), pp 333–348
Košecká J, Zhang W (2005) Extraction, matching, and pose recovery based on dominant rectangular structures. Comput Vis Image Underst 100(3):274–293
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. European Conference on Computer Vision, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Liu F, Shen C, Lin G, Reid I (2016) Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans Pattern Anal Mach Intell 38:2024–2039. https://doi.org/10.1109/TPAMI.2015.2505283
Liu S, Zhou Y, Zhao Y (2021) VaPiD: a rapid vanishing point detector via learned optimizers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 12859–12868
Luo S, Wei H (2021) Diffusion probabilistic models for 3D point cloud generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 2837–2845
Mahmoud MH, Alamery S, Fouad H, Altinawi A, Youssef AE (2021) An automatic detection system of diabetic retinopathy using a hybrid inductive machine learning algorithm. Personal and Ubiquitous Computing. https://doi.org/10.1007/s00779-020-01519-8
Haines O, Calway A (2012) Estimating planar structure in single images by learning from examples. International Conference on Pattern Recognition Applications and Methods (ICPRAM), pp 289–294
Haines O, Calway A (2015) Recognising planes in a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp 1849–1861. https://doi.org/10.1109/TPAMI.2014.2382097
Osuna-Coutiño JAdJ, Cruz-Martínez C, Martinez-Carranza J, Arias-Estrada M, Mayol-Cuevas W (2016) I want to change my floor: dominant plane recognition from a single image to augment the scene. IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp 135–140. https://doi.org/10.1109/ISMAR-Adjunct.2016.0060
Osuna-Coutiño JAdJ, Martinez-Carranza J (2019) High level 3D structure extraction from a single image using a CNN-based approach. Sensors. https://doi.org/10.3390/s19030563
Osuna-Coutiño JAdJ, Martinez-Carranza J (2019) A binary descriptor invariant to rotation and robust to noise (BIRRN) for floor recognition. Springer Mexican Conference on Pattern Recognition (MCPR) 11524:271–281. https://doi.org/10.1007/978-3-030-21077-9_25
Osuna-Coutiño JAdJ, Martinez-Carranza J (2019) Binary-patterns based floor recognition suitable for urban scenes. IEEE International Conference on Control, Decision and Information Technologies (CoDIT). https://doi.org/10.1109/CoDIT.2019.8820296
Osuna-Coutiño JAdJ, Martinez-Carranza J (2020) Structure extraction in urbanized aerial images from a single view using a CNN-based approach. International Journal of Remote Sensing, pp 1–25. https://doi.org/10.1080/01431161.2020.1767821
Osuna-Coutiño JAdJ, Martinez-Carranza J (2021) Volumetric structure extraction in a single image. The Visual Computer. Springer, https://doi.org/10.1007/s00371-021-02163-w
Osuna-Coutiño JAdJ, Martinez-Carranza J, Arias-Estrada M, Mayol-Cuevas W (2016) Dominant plane recognition in interior scenes from a single image. International Conference on Pattern Recognition (ICPR), pp 1923–1928
Peng X, Zhu X, Wang T, Ma Y (2022) SIDE: center-based stereo 3D detector with structure-aware instance depth estimation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp 119–128
Ren Z, Lee YJ (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 762–771
Rosen DM, Doherty KJ, Terán Espinoza A, Leonard JJ (2021) Advances in inference and representation for simultaneous localization and mapping. Annual Review of Control, Robotics, and Autonomous Systems, pp 215–242. arXiv:2103.05041
Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv:1609.04747
Saxena A, Chung SH, Ng AY (2005) Learning depth from single monocular images. Advances in Neural Information Processing Systems NIPS
Shen X, Cohen S, Wang P, Russell B, Price B, Eisenmann J (2019) Planar region guided 3D geometry estimation from a single image. Patent
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations. arXiv:1409.1556
Uhrig J, Schneider N, Schneider L, Franke U, Brox T, Geiger A (2017) Sparsity invariant CNNs. International Conference on 3D Vision (3DV). KITTI Dataset. https://doi.org/10.1109/3DV.2017.00012
Wang C, Cheng M, Sohel F, Bennamoun M, Li J (2019) NormalNet: a voxel-based CNN for 3D object classification and retrieval. Neurocomputing, pp 139–147. https://doi.org/10.1016/j.neucom.2018.09.075
Wang X, Fouhey D, Gupta A (2015) Designing deep networks for surface normal estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 539–547
Xiang Y, Schmidt T, Narayanan V, Fox D (2018) PoseCNN: a convolutional neural network for 6D Object pose estimation in cluttered scenes conference: robotics: science and systems. https://doi.org/10.15607/RSS.2018.XIV.019
Xiao J, Xu H, Gao H, Bian M, Li Y (2021) A weakly supervised semantic segmentation network by aggregating seed cues: the multi-object proposal generation perspective. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17, 1s, Article 15, 19 pages. https://doi.org/10.1145/3419842
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE, pp 2278–2324
Yu X, Tang L, Rao Y, Huang T, Zhou J, Lu J (2022) Point-bert: pre-training 3d point cloud transformers with masked point modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19313-19322. https://doi.org/10.48550/arXiv.2111.14819
Zhao R, Pang M, Liu C, Zhang Y (2019) Robust normal estimation for 3D LiDAR point clouds in urban environments. Sensors MDPI. https://doi.org/10.3390/s19051248
Zhu Y, Zhang W, Chen Y, Gao H (2019) A novel approach to workload prediction using attention-based LSTM encoder-decoder network in cloud environment. EURASIP Journal on Wireless Communications and Networking, 2019(247). https://doi.org/10.1186/s13638-019-1605-z
Zhu H, Zuo X, Yang H, Wang S, Cao X, Yang R (2021) Detailed avatar recovery from single image. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2021.3102128
Funding
The first author is grateful for the support of the internal call for collaboration scholarships INAOE 2021-2002. The second author is grateful for the support received through his Royal Society-Newton Advanced Fellowship, reference NA140454.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
We confirm that this work is original and has not been published elsewhere nor is it currently under consideration for publication elsewhere.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Osuna-Coutiño, J.d.J., Martinez-Carranza, J. High level structure recognition in single urban images using a CNN and SuperPixels. Multimed Tools Appl 82, 25175–25196 (2023). https://doi.org/10.1007/s11042-023-14422-0
DOI: https://doi.org/10.1007/s11042-023-14422-0