Knowledge-Based Systems

Volume 263, 5 March 2023, 110283

Context–content collaborative network for building extraction from high-resolution imagery

https://doi.org/10.1016/j.knosys.2023.110283

Abstract

In practical applications, different fields impose different requirements on the precision and completeness of building extraction, and insufficient precision or completeness may limit the application and promotion of building extraction. Obtaining a good trade-off between precision and completeness remains a challenging issue. To address this issue, this paper proposes a context–content collaborative network (C3Net) with an encoder–decoder structure. It consists of a context–content aware module (C2AM) and an edge residual refinement module (ER2M). In the C2AM, a context-aware block and a content-aware block complement each other, capturing the localization information of buildings and the long-range dependencies between the locations of each building, respectively. Thanks to the capability of the conventional filter, the ER2M can refine the features of the decoder output by deploying a residual atrous spatial pyramid pooling with feature edges at the scale of the original image. To explicitly guide the function of the ER2M, we introduce a separated deep supervision strategy before and after the ER2M, which can consciously steer our C3Net towards the precision or completeness of building extraction to a certain extent and improve overall detection performance. Extensive experiments on three open and challenging datasets demonstrate that, compared with several classical and state-of-the-art methods, the proposed C3Net not only achieves competitive performance but also attains a better trade-off between the precision and completeness of building extraction. The source code is released at https://github.com/TongfeiLiu/C3Net-for-building-extraction.

Introduction

Massive very-high-resolution (VHR) imagery is captured constantly by aerial and satellite sensors and is used for semantic segmentation, change detection, classification, and object recognition [1], [2]. This brings opportunities to observe small ground targets (e.g., buildings) in these images. Building extraction, which aims to acquire pixel-level building annotations from VHR images, has been widely applied in computer vision and remote sensing, for example in smart city development and planning [3], [4] and urban disaster prevention and mitigation [5], [6], [7].

In the past decade, with the vigorous development of sensor technology, enormous numbers of multisource images have been collected continuously, such as satellite images [8], [9], aerial images [10], [11], and hyperspectral images [12], [13], [14]. In these images with high or very high spatial resolution, buildings present the following characteristics [15], [16], [17]. (1) Changeable spectrum: As shown in Fig. 1(a), the roof materials of different buildings are diverse, with significant spectral differences. (2) Distinctive texture features: The texture features of buildings are relatively uniform and significant, and usually have a certain directionality, as presented in Fig. 1(a)–(e). (3) Diverse building structures: As given in Fig. 1(a)–(e), the geometric structures of buildings have various shapes and scales in multisource remote sensing images. (4) Clear contextual information: Buildings often have a close spatial relationship with other ground targets (shadows, roads, etc.), and this contextual information can help in building identification. Under these conditions, it is hard for conventional methods to accurately extract building footprints from aerial and satellite remote sensing images.

Recent deep learning-based methods have vastly facilitated the development of building extraction [18], [19]. Most recent methods utilize spatial pyramid pooling (SPP) or atrous spatial pyramid pooling (ASPP) to capture multi-scale features of buildings with various structures in multisource remote sensing images, thereby improving the stability of the model [20], [21]. However, two limitations remain to be resolved. For one thing, the precision and completeness of building footprints segmented with these methods are still insufficient, and it is hard to achieve a good trade-off between the two. For another, the robustness of a single model across aerial and satellite images is inadequate. These drawbacks arise for two reasons. First, although SPP or ASPP can increase the receptive field of the network to consider multi-scale features of various buildings, these operations may lose the localization information of the buildings, which can cause some buildings to be completely or partially missed. Second, in dense urban scenes, multisource remote sensing images inevitably cover various buildings and other land cover targets. Recognizing and segmenting buildings with different materials, shapes, and sizes depends heavily on the model extracting robust, representative features, as shown in Fig. 1(a) and (d). Moreover, some buildings are susceptible to interference from other ground objects in the background, such as vehicles and parking lots, or are covered by shadows and trees, as shown in Fig. 1(b) and (e).
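To make the receptive-field effect of SPP/ASPP concrete, the following is a minimal 1-D sketch (not the authors' implementation) of how an atrous (dilated) convolution enlarges the region of input each output sees; the function names and dilation rates are illustrative, with the rates borrowed from common ASPP configurations.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D 'valid' convolution with a dilation factor: a toy stand-in for
    the atrous convolutions used in ASPP-style multi-scale modules."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # effective receptive field
    out_len = len(x) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        # sample the input with gaps of `dilation` between kernel taps
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

def receptive_field(kernel_size, dilation):
    """Effective receptive field of a single dilated convolution."""
    return (kernel_size - 1) * dilation + 1

# With typical ASPP rates, one 3-tap kernel sees 3, 13, 25, or 37 inputs,
# which is the multi-scale coverage the text describes.
for rate in (1, 6, 12, 18):
    print(rate, receptive_field(3, rate))
```

The growing span also illustrates the drawback noted above: the larger the dilation, the more background the kernel mixes into each output position, diluting local localization cues.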

To alleviate these limitations, the motivations of this paper lie in two aspects. On the one hand, although SPP or ASPP can capture multi-scale features to adapt to various buildings, an oversized receptive field may not only lose local localization information but also introduce interference from other land cover objects in the background. Therefore, it is helpful to mitigate the limitations of using the spatial pyramid alone by providing local localization information and establishing long-range dependencies between the locations of each building. On the other hand, trading off segmentation precision against completeness is always difficult for general network models; hence, introducing a refinement procedure and a deep supervision strategy is beneficial for building extraction.
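The "long-range dependencies" above are typically modeled with a non-local self-attention operation, where every spatial position attends to every other. The sketch below is a generic scaled dot-product version in NumPy, not the paper's C2AM; learned query/key/value projections are deliberately omitted for brevity.

```python
import numpy as np

def spatial_self_attention(feat):
    """Toy non-local self-attention over flattened spatial positions.

    `feat` has shape (N, C): N spatial positions with C channels. Each
    position is re-expressed as a softmax-weighted sum over all positions,
    which is how long-range dependencies between building locations can be
    captured regardless of pixel distance.
    """
    scores = feat @ feat.T / np.sqrt(feat.shape[1])   # (N, N) affinities
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)           # each row sums to 1
    return attn @ feat                                # (N, C) aggregated

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8))   # e.g. a 4x4 feature map, 8 channels
out = spatial_self_attention(feat)
print(out.shape)
```

Because the affinity matrix is N×N, every location can reinforce features at any other location, which complements the purely local view of dilated convolutions.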

Based on the above motivations, we propose a novel context–content collaborative network (C3Net) for building footprint extraction from VHR aerial and satellite remote sensing images. Inspired by the attention-guided context feature network for object detection [22], the proposed C3Net employs a context–content aware module (C2AM) with a modified self-attention mechanism, which supplements the localization information of buildings and captures the long-range dependencies between the locations of each building by deploying a context-aware block (CxAB) and a content-aware block (CnAB) cooperatively. In addition, an edge residual refinement module (ER2M) refines the features of the decoder output by means of a separated deep supervision strategy, thereby consciously inclining the model towards the precision or completeness of building extraction to a certain extent. The main contributions of this article are threefold:

  • (1)

In the proposed C3Net, the C2AM is constructed from a CxAB and a CnAB, which not only maintains the multi-scale features extracted through the widely used ASPP but also complements building localization information and effectively provides long-range dependencies between the locations of each building. Moreover, we illustrate the interpretability of the C2AM by visualizing this module, which enhances the discriminative feature representation of buildings and thereby reduces missed detections.

  • (2)

An ER2M is employed to refine the features of the decoder output in the proposed C3Net. Simultaneously, we deploy a separated deep supervision strategy before and after the ER2M, which can consciously guide our C3Net's preference for the precision or completeness of building extraction to a certain extent, making the two complement each other. As a consequence, building extraction performance is improved globally.

  • (3)

The proposed C3Net is exhaustively investigated on three open and challenging datasets (Inria [17], East Asia [15], and Massachusetts [16]). Without any pre-training or data augmentation, experimental results on the three datasets demonstrate that our C3Net provides competitive performance compared with other classical and state-of-the-art (SOTA) methods. Moreover, equilibrium comparison experiments show that the proposed method achieves a better trade-off between the precision and completeness of building extraction.
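The separated deep supervision described in contribution (2) amounts to attaching a loss both before and after the refinement module. A minimal sketch of such a two-term objective is given below; the binary cross-entropy loss and the weights `w_before`/`w_after` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over the map."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def separated_deep_supervision_loss(pred_before, pred_after, target,
                                    w_before=0.5, w_after=1.0):
    """Supervise predictions both before and after a refinement stage.

    Weighting the two loss terms separately is what lets training lean
    towards one stage's behavior (e.g. completeness before refinement,
    precision after it). The weights here are hypothetical.
    """
    return w_before * bce(pred_before, target) + w_after * bce(pred_after, target)

target = np.array([[1.0, 0.0], [0.0, 1.0]])
coarse = np.array([[0.7, 0.4], [0.3, 0.6]])   # pre-refinement probabilities
fine   = np.array([[0.9, 0.1], [0.1, 0.9]])   # post-refinement probabilities
loss = separated_deep_supervision_loss(coarse, fine, target)
print(round(loss, 4))
```

Because both stages receive gradients directly, the refinement module cannot silently undo what the decoder learned; each stage is held to the ground truth on its own terms.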

The rest of this paper is arranged as follows. Section 2 presents some related works. In Section 3, the proposed C3Net is described in detail. Section 4 performs and discusses the experiments. Finally, this paper is concluded in Section 5.

Section snippets

Traditional building extraction methods

Over the past few decades, many approaches have been proposed for automatic building footprint extraction and segmentation from VHR remote sensing images [23], [24]. These methods can be roughly divided into three categories: handcrafted feature-based methods, object-based methods, and auxiliary information-based methods [25]. In the early stages, handcrafted feature-based methods accomplished automatic building extraction through handcrafted features, which can quantitatively describe the

Proposed context–content collaborative network

In this section, the proposed context–content collaborative network (C3Net) is described in detail. First, an overview of the proposed C3Net is provided in Section 3.1. Then we present the details of the proposed context–content aware module (C2AM) in Section 3.2, which complements building localization information and effectively provides long-range dependencies among the locations of each building. Finally, the edge residual refinement module (ER2M) and the corresponding separated deep

Data descriptions

In the experiments, three open building extraction datasets, including Inria [17], East Asia [15], and Massachusetts [16], are employed to test the effectiveness of the proposed C3Net. The details of these datasets are described as follows.

Inria Dataset: The Inria dataset was released in [17], and some examples are shown in Fig. 7(a) and (b). The dataset covers 405 km² and contains 180 aerial tiles. Specifically, the Inria dataset consists of five sub-datasets, including Austin, Chicago,

Conclusion and future research

In this paper, a context–content collaborative network (C3Net) has been proposed for building footprint extraction from aerial and satellite imagery. In the proposed C3Net, a context–content aware module (C2AM) and an edge residual refinement module (ER2M) are devised to enhance the feature representation ability of our model. Among them, C2AM can capture the long-range contextual information and supplement the localization information by combining a context-aware block (CxAB) and a

CRediT authorship contribution statement

Maoguo Gong: Conceptualization, Validation, Resources, Writing – review & editing. Tongfei Liu: Methodology, Software, Writing – original draft. Mingyang Zhang: Investigation, Validation, Supervision. Qingfu Zhang: Investigation, Writing – review & editing. Di Lu: Data curation, Visualization, Writing – review & editing. Hanhong Zheng: Investigation, Writing – review & editing. Fenlong Jiang: Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 62036006) and the Natural Science Foundation of China (Grant no. 61906147).

References (73)

  • H. Guo et al., Deep building footprint update network: A semi-supervised method for updating existing building footprint from bi-temporal remote sensing images, Remote Sens. Environ. (2021).
  • H. Guo et al., A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery, ISPRS J. Photogramm. Remote Sens. (2022).
  • Z. Fan et al., Self-attention neural architecture search for semantic image segmentation, Knowl.-Based Syst. (2022).
  • Z. Bo et al., A review on building extraction and reconstruction from SAR image, Remote Sens. Technol. Appl. (2012).
  • T. Feng, J. Zhao, Review and comparison: Building extraction methods using high-resolution images, in: 2009 Second Int. ...
  • A. Mishra, A. Pandey, A.S. Baghel, Building detection and extraction techniques: A review, in: 3rd Int. Conf. Comput. ...
  • L. ZhiYong et al., Land cover change detection techniques: Very-high-resolution optical images: A review, IEEE Geosci. Remote Sens. Mag. (2021).
  • Z. Lv et al., Iterative training sample expansion to increase and balance the accuracy of land classification from VHR imagery, IEEE Trans. Geosci. Remote Sens. (2020).
  • T. Liu et al., Landslide inventory mapping method based on adaptive histogram-mean distance with bitemporal VHR aerial images, IEEE Geosci. Remote Sens. Lett. (2021).
  • M. Gong et al., A spectral and spatial attention network for change detection in hyperspectral images, IEEE Trans. Geosci. Remote Sens. (2021).
  • M. Zhang et al., Unsupervised feature extraction in hyperspectral images based on Wasserstein generative adversarial network, IEEE Trans. Geosci. Remote Sens. (2018).
  • S. Ji et al., Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set, IEEE Trans. Geosci. Remote Sens. (2019).
  • V. Mnih, Machine Learning for Aerial Image Labeling (2013).
  • E. Maggiori, Y. Tarabalka, G. Charpiat, P. Alliez, Can semantic labeling methods generalize to any city? The Inria ...
  • H. Guo et al., Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images, IEEE Trans. Geosci. Remote Sens. (2020).
  • L. Luo et al., Deep learning-based building extraction from remote sensing images: A comprehensive review, Energies (2021).
  • Z. Shao et al., BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images, Remote Sens. (2020).
  • J. Cao et al., Attention-guided context feature pyramid network for object detection (2020).
  • M. Ghanea et al., Building extraction from high-resolution satellite images in urban areas: Recent methods and strategies against significant challenges, Int. J. Remote Sens. (2016).
  • W. Jun et al., A survey of building extraction methods from optical high resolution remote sensing imagery, Remote Sens. Technol. Appl. (2016).
  • J. Wang et al., An efficient approach for automatic rectangular building extraction from very high resolution optical satellite imagery, IEEE Geosci. Remote Sens. Lett. (2015).
  • S. Cui et al., Complex building description and extraction based on Hough transformation and cycle detection, Remote Sens. Lett. (2012).
  • X. Huang et al., Morphological building/shadow index for building extraction from high-resolution imagery over urban areas, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (2011).
  • X. Sun, K. Fu, H. Long, Y. Hu, L. Cai, H. Wang, Contextual models for automatic building extraction in high resolution ...
  • X. Wang, P. Li, Extraction of urban building damage using spectral, height and corner information from VHR satellite ...
  • P. Gamba et al., Digital surface models and building extraction: A comparison of IFSAR and LIDAR data, IEEE Trans. Geosci. Remote Sens. (2000).