Context–content collaborative network for building extraction from high-resolution imagery
Introduction
Massive volumes of very-high-resolution (VHR) imagery are captured continuously by aerial and satellite sensors and used for semantic segmentation, change detection, classification, and object recognition [1], [2]. These images open opportunities to observe small ground targets (e.g., buildings). Building extraction, the task of acquiring pixel-level building annotations from VHR images, has been widely applied in computer vision and remote sensing, for example in smart city development and planning [3], [4] and urban disaster prevention and mitigation [5], [6], [7].
In the past decade, with the vigorous development of sensor technology, enormous volumes of multisource images have been collected continuously, such as satellite images [8], [9], aerial images [10], [11], and hyperspectral images [12], [13], [14]. In these multisource images with high or very high spatial resolution, buildings present the following characteristics [15], [16], [17]. (1) Changeable spectrum: As shown in Fig. 1(a), the roof materials of different buildings are diverse, with significant spectral differences. (2) Distinctive texture features: The texture features of buildings are relatively uniform and significant, and usually have a certain directionality, as presented in Fig. 1(a)–(e). (3) Diverse building structures: As given in Fig. 1(a)–(e), the geometric structures of buildings have various shapes and scales in multisource remote sensing images. (4) Clear contextual information: Buildings often have a close spatial relationship with other ground targets (shadows, roads, etc.), and this contextual information can be helpful in building identification. Under these conditions, it is hard for conventional methods to accurately extract building footprints from aerial and satellite remote sensing images.
Recent deep learning-based methods have greatly advanced building extraction [18], [19]. Most recent methods utilize spatial pyramid pooling (SPP) or atrous spatial pyramid pooling (ASPP) to capture multi-scale features of buildings with various structures in multisource remote sensing images, thus improving model stability [20], [21]. However, two limitations remain to be resolved. For one thing, the precision and completeness of building footprints segmented with these methods are still insufficient, and it is hard to achieve a good trade-off between the two. For another, the robustness of a single model across aerial and satellite images is inadequate. These drawbacks stem from two causes. First, although SPP or ASPP enlarges the receptive field of the network to capture multi-scale features of various buildings, these operations may lose the localization information of the buildings, which can cause some buildings to be completely or partially missed. Second, in dense urban scenes, multisource remote sensing images inevitably cover various buildings together with other land cover targets. Recognizing and segmenting buildings with different materials, shapes, and sizes depends heavily on the robust, representative features extracted by the model, as shown in Fig. 1(a) and (d). Moreover, some buildings are susceptible to interference from other ground objects in the background, such as vehicles and parking lots, or are covered by shadows and trees, as shown in Fig. 1(b) and (e).
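The receptive-field effect described above can be made concrete with a small sketch. The pure-Python function below (an illustration, not part of the paper's implementation) computes the receptive field of a stack of stride-1 dilated (atrous) convolutions; the kernel sizes and dilation rates are hypothetical but typical of ASPP-style branches.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 dilated (atrous) convolutions.

    A layer with kernel size k and dilation d spans d * (k - 1) + 1 inputs,
    so each layer grows the receptive field by d * (k - 1).
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += d * (k - 1)
    return rf

# Four plain 3x3 layers vs. a hypothetical atrous stack with rates 1, 6, 12, 18:
plain = receptive_field([3, 3, 3, 3], [1, 1, 1, 1])     # -> 9
atrous = receptive_field([3, 3, 3, 3], [1, 6, 12, 18])  # -> 75
```

The dilated stack sees an 8x wider neighborhood at the same cost, which is exactly why it captures multi-scale context, but each output then mixes features from far-away pixels, diluting the precise localization of small buildings.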
To alleviate these limitations, the motivations of this paper lie in two aspects. On the one hand, although SPP or ASPP can capture multi-scale features to adapt to various buildings, an oversized receptive field may not only lose local localization information but also introduce interference from other land cover objects in the background. It is therefore helpful to mitigate the limitations of using the spatial pyramid alone by providing local localization information and establishing long-range dependencies between the locations of each building. On the other hand, the trade-off between segmentation precision and completeness is always difficult for general network models. Hence, introducing a refinement procedure and a deep supervision strategy is beneficial for building extraction.
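As a minimal sketch of the second motivation, the snippet below supervises both the coarse decoder output and a refined output, in the spirit of deep supervision around a refinement step. The function names, weights, and the plain binary cross-entropy are illustrative assumptions, not the paper's actual loss.

```python
import math

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over a flat list of pixel probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)

def supervised_loss(coarse, refined, target, w_coarse=0.5, w_refined=1.0):
    """Supervise both the pre-refinement (coarse) and post-refinement maps.

    Tuning w_coarse against w_refined is one way to tilt training toward
    precision or completeness (the weights here are illustrative).
    """
    return w_coarse * bce(coarse, target) + w_refined * bce(refined, target)
```

Because the refinement stage receives its own gradient signal, sharpening boundaries (precision) does not have to come at the cost of the coarse map's coverage (completeness).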
Based on the above motivations, we propose a novel context–content collaborative network (CNet) for building footprint extraction from VHR aerial and satellite remote sensing images. Inspired by the attention-guided context feature network for object detection [22], the proposed CNet employs a context–content aware module (CAM) with a modified self-attention mechanism, which supplements the localization information of buildings and captures the long-range dependencies between the locations of each building by deploying a context-aware block (CxAB) and a content-aware block (CnAB) cooperatively. In addition, an edge residual refinement module (ERM) refines the features of the decoder output by means of a separated deep supervision strategy, thereby deliberately inclining the proposed model toward either the precision or the completeness of building extraction. The contributions of this article are threefold:
- (1)
In the proposed CNet, the CAM is constructed from a CxAB and a CnAB, which not only maintains the multi-scale features extracted by the widely used ASPP but also complements building localization information and effectively provides long-range dependencies between the locations of each building. Moreover, we illustrate the interpretability of the CAM through visualization, showing that it elevates the discriminative feature representation of buildings and thereby reduces missed detections.
- (2)
An ERM is employed to refine the features of the decoder output in the proposed CNet. Simultaneously, we deploy a separated deep supervision strategy before and after the ERM, which deliberately guides our CNet toward either the precision or the completeness of building extraction, making the two complement each other. In consequence, building extraction performance is improved globally.
- (3)
The proposed CNet is exhaustively investigated on three open and challenging datasets (Inria [17], East Asia [15], and Massachusetts [16]). Without any pre-training or data augmentation, experimental results on the three datasets demonstrate that our proposed CNet provides competitive performance compared with other classical and state-of-the-art (SOTA) methods. Moreover, the equilibrium comparison experiments show that the proposed method achieves a better trade-off between the precision and completeness of building extraction.
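To make the long-range-dependency idea behind the CAM concrete, here is a minimal, dependency-free sketch of scaled dot-product self-attention over a handful of feature vectors. It omits the learned query/key/value projections and everything specific to the CxAB/CnAB design, so it should be read as an illustration of the mechanism, not as the CAM itself.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(feats):
    """Scaled dot-product self-attention over a list of d-dim feature vectors.

    Every output position is a weighted mixture of *all* positions, so two
    distant building pixels can influence each other directly, unlike a
    local convolutional receptive field.
    """
    d = len(feats[0])
    out = []
    for q in feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in feats]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, feats)) for i in range(d)])
    return out
```

Each attention weight depends only on feature similarity, not on spatial distance, which is what lets such a module relate all locations of a building (and of similar buildings) in one step.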
The rest of this paper is arranged as follows. Section 2 presents some related works. In Section 3, the proposed CNet is described in detail. Section 4 performs and discusses the experiments. Finally, this paper is concluded in Section 5.
Traditional building extraction methods
Over the past few decades, many approaches have been proposed for automatic building footprint extraction and segmentation from VHR remote sensing images [23], [24]. These methods can be roughly divided into three categories: handcrafted feature-based methods, object-based methods, and auxiliary information-based methods [25]. In the early stages, handcrafted feature-based methods accomplished automatic building extraction through handcrafted features, which can quantitatively describe the
Proposed context–content collaborative network
In this section, the proposed context–content collaborative network (CNet) is described in detail. First, an overview of the proposed CNet is provided in Section 3.1. Then we present the details of the proposed context–content aware module (CAM) in Section 3.2, which complements building localization information and effectively provides long-range dependencies among the locations of each building. Finally, the edge residual refinement module (ERM) and the corresponding separated deep supervision strategy are described.
Data descriptions
In the experiments, three open building extraction datasets, including Inria [17], East Asia [15], and Massachusetts [16], are employed to test the effectiveness of the proposed CNet. The details of these datasets are described as follows.
Inria Dataset: The Inria dataset was released in [17], and some examples are shown in Fig. 7(a) and (b). The dataset covers 405 km² and contains 180 aerial tiles. Specifically, the Inria dataset consists of five sub-datasets, including Austin, Chicago,
Conclusion and future research
In this paper, a context–content collaborative network (CNet) has been proposed for building footprint extraction from aerial and satellite imagery. In the proposed CNet, a context–content aware module (CAM) and an edge residual refinement module (ERM) are devised to enhance the feature representation ability of our model. Specifically, the CAM captures long-range contextual information and supplements localization information by combining a context-aware block (CxAB) and a content-aware block (CnAB).
CRediT authorship contribution statement
Maoguo Gong: Conceptualization, Validation, Resources, Writing – review & editing. Tongfei Liu: Methodology, Software, Writing – original draft. Mingyang Zhang: Investigation, Validation, Supervision. Qingfu Zhang: Investigation, Writing – review & editing. Di Lu: Data curation, Visualization, Writing – review & editing. Hanhong Zheng: Investigation, Writing – review & editing. Fenlong Jiang: Validation, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 62036006) and the Natural Science Foundation of China (Grant no. 61906147).
References (73)
- Combining deep learning and ontology reasoning for remote sensing image semantic segmentation, Knowl.-Based Syst. (2022)
- Supervised remote sensing image segmentation using boosted convolutional neural networks, Knowl.-Based Syst. (2016)
- Multi-view distance metric learning via independent and shared feature subspace with applications to face and forest fire recognition, and remote sensing classification, Knowl.-Based Syst. (2022)
- On the versatility of popular and recently proposed supervised evaluation metrics for segmentation quality of remotely sensed images: An experimental case study of building extraction, ISPRS J. Photogramm. Remote Sens. (2020)
- An end-to-end shape modeling framework for vectorized building outline generation from aerial images, ISPRS J. Photogramm. Remote Sens. (2020)
- Spectral feature perception evolving network for hyperspectral image classification, Knowl.-Based Syst. (2022)
- Lightweight multi-scale residual networks with attention for image super-resolution, Knowl.-Based Syst. (2020)
- Data fusion of high-resolution satellite imagery and LiDAR data for automatic building extraction, ISPRS J. Photogramm. Remote Sens. (2007)
- CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images, ISPRS J. Photogramm. Remote Sens. (2022)
- Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts, ISPRS J. Photogramm. Remote Sens. (2013)
- Deep building footprint update network: A semi-supervised method for updating existing building footprint from bi-temporal remote sensing images, Remote Sens. Environ.
- A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery, ISPRS J. Photogramm. Remote Sens.
- Self-attention neural architecture search for semantic image segmentation, Knowl.-Based Syst.
- A review on building extraction and reconstruction from SAR image, Remote Sens. Technol. Appl.
- Land cover change detection techniques: Very-high-resolution optical images: A review, IEEE Geosci. Remote Sens. Mag.
- Iterative training sample expansion to increase and balance the accuracy of land classification from VHR imagery, IEEE Trans. Geosci. Remote Sens.
- Landslide inventory mapping method based on adaptive histogram-mean distance with bitemporal VHR aerial images, IEEE Geosci. Remote Sens. Lett.
- A spectral and spatial attention network for change detection in hyperspectral images, IEEE Trans. Geosci. Remote Sens.
- Unsupervised feature extraction in hyperspectral images based on Wasserstein generative adversarial network, IEEE Trans. Geosci. Remote Sens.
- Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set, IEEE Trans. Geosci. Remote Sens.
- Machine Learning for Aerial Image Labeling
- Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images, IEEE Trans. Geosci. Remote Sens.
- Deep learning-based building extraction from remote sensing images: A comprehensive review, Energies
- BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images, Remote Sens.
- Attention-guided context feature pyramid network for object detection
- Building extraction from high-resolution satellite images in urban areas: Recent methods and strategies against significant challenges, Int. J. Remote Sens.
- A survey of building extraction methods from optical high resolution remote sensing imagery, Remote Sens. Technol. Appl.
- An efficient approach for automatic rectangular building extraction from very high resolution optical satellite imagery, IEEE Geosci. Remote Sens. Lett.
- Complex building description and extraction based on Hough transformation and cycle detection, Remote Sens. Lett.
- Morphological building/shadow index for building extraction from high-resolution imagery over urban areas, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
- Digital surface models and building extraction: A comparison of IFSAR and LIDAR data, IEEE Trans. Geosci. Remote Sens.