SCFNet: Semantic correction and focus network for remote sensing image object detection

https://doi.org/10.1016/j.eswa.2023.119980

Abstract

High-resolution remote sensing images persistently exhibit large scale variation, high intra-class variance of the background, small inter-class variability, and irregular target arrangements, which make modeling the relationships between targets and background difficult and complicate the detection task. General object detection methods mainly use convolutional layers of different scales to enlarge the receptive field and fuse multi-scale features to address scale variation, without considering the other two problems prevalent in Earth-observation scenes. To address these two problems, this paper proposes a Semantic Correction and Focus Network (SCFNet) that models both the background-to-target and target-to-target relationships. The network consists of two core modules: the Local Correction Module (LCM) uses the global features of the image to compute the similarity of local features, correcting the local features and excluding irrelevant semantics; the Non-local Focus Module (NLFM) enhances the recognition of target features by combining non-local dependencies with the corrected local features from the LCM. To demonstrate the effectiveness and robustness of the proposed method, we conducted extensive experiments on two popular, publicly available, large-scale remote sensing multi-object detection datasets, DIOR and DOTA. The experimental results show that SCFNet achieves state-of-the-art performance and significant accuracy improvements on both datasets.

Introduction

With the rapid development of satellite and remote sensing technology, the quantity and resolution of remote sensing data are growing rapidly. Deep analysis and understanding of remote sensing data therefore allow remote sensing images to play an important role in many practical applications, such as agriculture (Huang et al., 2018), urban planning (Shi et al., 2015), environmental monitoring (Yuan et al., 2020), navigation (Chen et al., 2021), and military applications (Cui et al., 2020). Target detection can locate important targets and scenes in remote sensing images and has become one of the significant tasks of intelligent remote sensing image interpretation. However, the large scale variations and large number of targets in remote sensing images make remote sensing target detection a challenging task.

Early target detection work was mainly based on manually constructed features, designing complex descriptors for detection (Lowe, 2004). For example, the Histogram of Oriented Gradients (HOG) descriptor (Dalal and Triggs, 2005), proposed in 2005, is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization to improve accuracy. The pinnacle of traditional target detection, the Deformable Part Model (DPM) (Forsyth, 2014), proposed in 2008, follows a "divide and conquer" detection idea: training can be seen as learning a correct way to decompose objects, and inference can be regarded as a collection of detections over the different object components. With the development of remote sensing detection datasets, such as UCAS-AOD (Zhu et al., 2015), DIOR (Li et al., 2020), and DOTA (Xia et al., 2018), deep learning has made great contributions to remote sensing target detection in recent years. Unlike natural images, remote sensing images are characterized by a wide field of view, high background complexity, target aggregation, and small targets, which makes natural-image detection methods difficult to transfer directly to the remote sensing domain.

Many excellent methods have been put forward in the field of remote sensing detection to solve the above problems. For example, the Full-scale Object Detection Network (FSoD-Net) (Wang et al., 2021) addresses large scale differences by combining a Laplacian kernel with parallel multi-scale convolutional layers to provide multi-scale enhancement. Feature-merged Single-shot Detection (FMSSD) (Wang et al., 2019) proposed a new area-weighted loss function so that tiny objects in remote sensing images carry a greater proportion of the loss. In addition, some methods learn the global and local context of objects by capturing the correlations between local adjacent objects and global objects or features, and perform modeling operations on them (Zheng et al., 2020, Liu et al., 2021, Zhang et al., 2019). Although all of the above methods fuse the local and global features of the target during extraction, in remote sensing images with complex backgrounds the targets often blend with the background and are difficult to distinguish. The extraction of local features is usually interfered with by erroneous background information, and it is likewise hard to obtain valuable semantic information from global features.

The non-local network (Wang et al., 2018), influenced by traditional non-local methods (Buades et al., 2005), brought these ideas into Convolutional Neural Networks (CNNs): it observed that pixels distant in space and time within a video can interact, so capturing these long-range dependencies can improve video classification performance. Point2Node (Han et al., 2020) designed the Dynamic Node Correlation (DNC) module to mine the local, non-local, and self-correlation of each point, following the same non-local idea (Buades et al., 2005), and effectively integrated the three correlations through Adaptive Feature Aggregation (AFA).
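As a minimal illustration (not the authors' implementation), the core non-local operation can be sketched as follows: each position's output is a similarity-weighted aggregation over all positions. The embedded-Gaussian variant of Wang et al. (2018) additionally applies learned 1×1-convolution embeddings before the similarity, which are omitted here for brevity.

```python
import numpy as np

def non_local(x):
    """Plain non-local aggregation over a flattened feature map.

    x: array of shape (N, C), where N = H * W positions and C channels.
    Returns an (N, C) array in which each position is a softmax-weighted
    sum over all positions, weighted by pairwise dot-product similarity.
    """
    sim = x @ x.T                          # (N, N) pairwise similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(sim)
    w /= w.sum(axis=1, keepdims=True)      # normalize: rows sum to 1
    return w @ x                           # aggregate features by weight

feat = np.random.default_rng(0).normal(size=(16, 8))  # 4x4 map, 8 channels
out = non_local(feat)
print(out.shape)  # (16, 8)
```

Because the weights are normalized per position, the operation preserves feature scale; with learned embeddings and a residual connection it becomes the non-local block used in video classification.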

Inspired by the above work, we propose a Semantic Correction and Focus Network (SCFNet), which mainly consists of two core modules, the Local Correction Module (LCM) and the Non-local Focus Module (NLFM). As shown in Fig. 1, the LCM corrects the acquired local features to remove irrelevant semantic features, while the NLFM captures distant dependencies, i.e., the more distant relevant semantic features. In the LCM, to acquire local features with different receptive fields, we use multiple dilated (atrous) convolutions with different dilation rates in parallel to obtain local feature regions of the target. To explore the relationships between channels, we apply an attention mechanism to learn inter-channel dependencies, connect these features through a residual structure, and then integrate the local features. The global features of the image are obtained by an attention pooling operation, and the local features are corrected by computing the inverse correlation coefficient between the two. In the NLFM, we obtain each pixel's weight by calculating the similarity between the current pixel and all other pixels and normalizing it; the non-local features are then obtained by multiplying each weight with the feature value of the corresponding pixel. Finally, we compute the similarity between the corrected local features from the LCM and the current non-local features, concentrating the non-locally relevant semantic features. SCFNet adopts the network architecture of the Feature Pyramid Network (FPN) (Lin et al., 2017) and the Path Aggregation Network (PANet) (Liu et al., 2018), and we verify the effectiveness of our method on the DIOR (Li et al., 2020) and DOTA (Xia et al., 2018) datasets. The main contributions of this paper are as follows:
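One possible reading of the LCM's local branch can be sketched as below. This is our own illustrative sketch, not the released implementation: parallel 3×3 convolutions with different dilation rates gather local context at several receptive fields, a squeeze-and-excitation-style gate stands in for the channel-attention mechanism (the paper does not specify its exact form), and a residual connection merges the result. All concrete choices (dilation rates 1/2/4, the SE gate) are assumptions.

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Hypothetical sketch of the LCM's local-feature branch."""

    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel dilated 3x3 convolutions: same spatial size, growing
        # receptive field (padding = dilation keeps H x W unchanged).
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        # SE-style channel attention (an assumption; the paper only says
        # an attention mechanism learns inter-channel dependencies).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        local = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return x + local * self.gate(local)  # residual merge

x = torch.randn(1, 16, 32, 32)
y = LocalBranch(16)(x)
print(tuple(y.shape))  # (1, 16, 32, 32)
```

The output keeps the input's shape, so the branch can be dropped into an FPN/PANet-style neck without resizing.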

  • We propose a general remote sensing image target detection network that can effectively detect multiple types of targets in large-scale complex scenes. Compared with other methods, ours better detects targets that are similar to the background.

  • In order to better acquire target-related semantic information, we designed the LCM and NLFM. The former is designed to obtain local semantic features and eliminate the interference of error information, while the latter concentrates on the acquisition of long-range semantic features.

  • The performance of our network and the effectiveness of our modules are validated on publicly available large remote sensing datasets.

Section snippets

Generic object detection

In the past decade, deep learning has developed rapidly in the field of object detection and has greatly improved its performance. Deep-learning-based object detection is mainly divided into two categories: one-stage detectors and two-stage detectors. The two-stage detector first generates a series of candidate boxes that may contain objects, then filters these candidates, and finally performs classification and regression on the remaining boxes. Region-CNN

SCFNet

The SCFNet network structure is shown in Fig. 2. Our approach consists of two core modules, the LCM and the NLFM, which are designed to correct local semantic features and obtain long-range semantic features, respectively. The network architecture is built on the FPN (Lin et al., 2017) and PANet (Liu et al., 2018) model structures; further details of the network and module design are discussed in this section.

Training configurations and datasets

Our experiments were performed on a GPU workstation configured with Ubuntu 21.0, CUDA 10.0, cuDNN 7.0, and an NVIDIA RTX 3090 with 24 GB of video memory. The model was implemented in Python 3.7 and PyTorch 1.1.0. Since mean Average Precision (mAP) is clearly defined and widely used in the performance evaluation of multi-target detection, we use the standard mAP with an IoU threshold of 0.5 for model evaluation. In model training, we use some data augmentation methods,
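For reference, the IoU criterion underlying mAP@0.5 can be computed as follows (a generic sketch, not the paper's evaluation code): a detection counts as a true positive only when its IoU with a same-class ground-truth box is at least 0.5.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half their width: IoU = 50 / 150
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```

At an IoU threshold of 0.5, the example detection above would be rejected as a false positive even though it covers half of the ground-truth box, which is why localization quality matters as much as classification under this metric.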

Conclusion

In this paper, we find that the effect of complex backgrounds on targets, and the small differences and tight arrangements between different targets, are key difficulties of target detection in high-resolution remote sensing images that are often ignored by general detection methods. To solve these two problems, we introduce two effective network modules, the LCM and the NLFM. The LCM corrects and refines local features to distinguish foreground features from the background. The NLFM can

CRediT authorship contribution statement

Chenke Yue: Methodology, Writing – review & editing. Junhua Yan: Conceptualization, Methodology. Yin Zhang: Software, Investigation. Zhaolong Luo: Data curation. Yong Liu: Supervision. Pengyu Guo: Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by National Defense Science and technology foundation strengthening program (Grant No. 2021-JCJQ-JJ-0834), the Fundamental Research Funds for the Central Universities (Grant No. NJ2022025), and the National Natural Science Foundation of China (Grant No. 61901504).

References (54)

  • Azimi, S. M., et al. Towards multi-class object detection in unconstrained remote sensing imagery. In Computer Vision – ACCV 2018.
  • Buades, A., Coll, B., & Morel, J. M. (2005, June). A non-local algorithm for image denoising. In 2005 IEEE computer...
  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection...
  • Chen, Z., et al. (2021). Reconstruction bias U-Net for road extraction from optical remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
  • Cui, Z., et al. (2020). Ship detection in large-scale SAR images via spatial shuffle-group enhance attention. IEEE Transactions on Geoscience and Remote Sensing.
  • Dalal, N., & Triggs, B. (2005, June). Histograms of oriented gradients for human detection. In 2005 IEEE computer...
  • Ding, J., Xue, N., Long, Y., Xia, G. S., & Lu, Q. (2019). Learning RoI transformer for oriented object detection in...
  • Forsyth, D. (2014). Object detection with discriminatively trained part-based models. Computer.
  • Ghiasi, G., et al. NAS-FPN: Learning scalable feature pyramid architecture for object detection.
  • Girshick, R. Fast R-CNN.
  • Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation.
  • Han, W., Wen, C., Wang, C., Li, X., & Li, Q. (2020, April). Point2Node: Correlation learning of dynamic-node for point...
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • He, K., et al. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Li, F., Zhang, H., Liu, S., Guo, J., Ni, L. M., & Zhang, L. (2022). DN-DETR: Accelerate DETR training by introducing...
  • Lin, T. Y., et al. Feature pyramid networks for object detection.
  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017b). Focal loss for dense object detection. In...