SCFNet: Semantic correction and focus network for remote sensing image object detection
Introduction
With the rapid development of satellite and remote sensing technology, the quantity and resolution of remote sensing data are growing rapidly. Deep analysis and understanding of these data allow remote sensing images to play an important role in many practical applications, such as agriculture (Huang et al., 2018), urban planning (Shi et al., 2015), environmental monitoring (Yuan et al., 2020), navigation (Chen et al., 2021), and military uses (Cui et al., 2020). Target detection locates important targets and scenes in remote sensing images and has become one of the significant tasks of intelligent remote sensing image interpretation. However, large scale variations and the large number of targets in remote sensing images make remote sensing target detection a challenging task.
Early target detection work relied mainly on hand-crafted features (Lowe, 2004). For example, the Histogram of Oriented Gradients (HOG) descriptor (Dalal and Triggs, 2005), proposed in 2005, computes gradient histograms on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization to improve accuracy. The Deformable Part Model (DPM) (Forsyth, 2014), proposed in 2008 and often regarded as the pinnacle of traditional target detection, follows a "divide and conquer" idea: training can be seen as learning how to decompose an object into parts, and inference can be regarded as aggregating detections of the different object components. With the development of remote sensing detection datasets such as UCAS-AOD (Zhu et al., 2015), DIOR (Li et al., 2020), and DOTA (Xia et al., 2018), deep learning has made great contributions to remote sensing target detection in recent years. Unlike natural images, remote sensing images are characterized by a wide field of view, high background complexity, target aggregation, and small targets, which makes detection methods designed for natural images difficult to transfer directly to the remote sensing domain.
Many excellent methods have been proposed in the field of remote sensing detection to solve the above problems. For example, the Full-scale Object Detection Network (FSoD-Net) (Wang et al., 2021) addresses large scale differences by combining the Laplace kernel with parallel multi-scale convolutional layers to provide multiscale enhancement. Feature-merged Single-shot Detection (FMSSD) (Wang et al., 2019) proposed a new area-weighted loss function so that tiny objects in remote sensing images carry a greater proportion of the loss. In addition, some methods learn the global and local context of objects by capturing the correlations between local adjacent objects and global objects or features, and perform various modeling operations on them (Zheng et al., 2020, Liu et al., 2021, Zhang et al., 2019). Although all of the above methods fuse local and global target features during extraction, in remote sensing images with complex backgrounds, targets often blend into the background and are difficult to distinguish. The extraction of local features is usually disturbed by erroneous background information, and it is also hard to obtain valuable semantic information from global features.
Non-local networks (Wang et al., 2018), influenced by traditional non-local methods (Buades et al., 2005), brought these ideas into Convolutional Neural Networks (CNNs) as a non-local network module. The authors observed that distant pixels in the space and time of a video can interact, so video classification performance can be improved by capturing these long-range dependencies. Point2Node (Han et al., 2020) designed the Dynamic Node Correlation (DNC) module to mine the local, non-local, and auto-correlation of each point, drawing on the non-local idea (Buades et al., 2005), and effectively integrated the three correlations through Adaptive Feature Aggregation (AFA).
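The core of a non-local operation is to respond at each position with a similarity-weighted sum over all positions. A minimal NumPy sketch of this idea (using raw dot-product similarity with a softmax, and omitting the learned embedding transforms that an actual non-local block would include) might look like:

```python
import numpy as np

def non_local_block(x):
    """Simplified non-local operation on a (C, H, W) feature map.

    For each position i, the weights over all positions j are the
    softmax-normalized dot-product similarities between feature vectors;
    the output at i is the weighted sum of features at all j, plus a
    residual connection.
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w)                    # (C, N), N = H*W
    sim = feats.T @ feats                          # (N, N) pairwise similarities
    sim -= sim.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions j
    out = feats @ weights.T                        # weighted aggregation, (C, N)
    return out.reshape(c, h, w) + x                # residual connection
```

On a constant feature map the softmax weights become uniform, so the aggregation reproduces the input and the residual doubles it; on real features, positions with similar feature vectors contribute most to each other.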
Inspired by the above work, we propose a Semantic Correction and Focus Network (SCFNet), which consists of two core modules: the Local Correction Module (LCM) and the Non-local Focus Module (NLFM). As shown in Fig. 1, the LCM corrects the acquired local features to remove irrelevant semantic features, while the NLFM captures long-range dependencies, that is, relevant semantic features at greater distances. In the LCM, to acquire local features with different receptive fields, we apply multiple dilated convolutions with different dilation rates in parallel to extract local feature regions of the target. To explore the relationships between channels, we use an attention mechanism to learn inter-channel dependencies, connect these features through a residual structure, and then integrate the local features. The global features of the image are obtained by an attentive pooling operation, and the local features are corrected by computing the inverse correlation coefficient between the two. In the NLFM, we obtain each pixel's weight by calculating the similarity between the current pixel and all other pixels and normalizing it; the non-local features are then obtained by multiplying these weights with the feature values of the corresponding pixels. Finally, we use the corrected local features from the LCM and the current non-local features to compute similarities, so that the relevant non-local semantic features are gathered. SCFNet adopts the network architecture of Feature Pyramid Network (FPN) (Lin et al., 2017) and Path Aggregation Network (PANet) (Liu et al., 2018), and we verify the effectiveness of our method on the DIOR (Li et al., 2020) and DOTA (Xia et al., 2018) datasets. The main contributions of this paper are as follows:
- We propose a general remote sensing image target detection network that can effectively detect multiple types of targets in large-scale complex scenes. Compared with other methods, ours better detects targets that resemble their backgrounds.
- To better acquire target-related semantic information, we design the LCM and the NLFM. The former obtains local semantic features and eliminates the interference of erroneous information, while the latter concentrates on acquiring long-range semantic features.
- The performance of our network and the effectiveness of our modules are validated on publicly available large-scale remote sensing datasets.
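The local-correction idea described above, parallel dilated branches with different receptive fields, channel attention, and a residual connection, can be sketched in NumPy. This is an illustrative stand-in, not the paper's implementation: a fixed 3x3 dilated average replaces a learned dilated convolution, and a softmax over per-channel statistics replaces learned channel attention.

```python
import numpy as np

def dilated_branch(x, rate):
    """3x3 dilated average over a (C, H, W) map with zero padding.

    Stands in for one dilated-convolution branch: a larger `rate`
    means a larger receptive field at the same kernel size.
    """
    c, h, w = x.shape
    pad = rate
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for dy in (-rate, 0, rate):
        for dx in (-rate, 0, rate):
            out += xp[:, pad + dy:pad + dy + h, pad + dx:pad + dx + w]
    return out / 9.0

def channel_attention(x):
    """Reweight channels by a softmax over their global-average responses."""
    scores = x.mean(axis=(1, 2))               # (C,) per-channel statistic
    scores -= scores.max()                     # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()  # softmax over channels
    return x * w[:, None, None]

def local_correction(x, rates=(1, 2, 4)):
    """Fuse parallel dilated branches with channel attention and a residual."""
    fused = sum(channel_attention(dilated_branch(x, r)) for r in rates)
    return fused / len(rates) + x              # residual connection
```

The residual connection keeps the original features available even when a branch's output is suppressed, mirroring the role of the residual structure described for the LCM.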
Generic object detection
In the past decade, deep learning has developed rapidly in the field of object detection and has greatly improved detection performance. Object detection based on deep learning is mainly divided into two categories: one-stage detectors and two-stage detectors. The two-stage detector generates a series of candidate boxes that may contain objects, filters these candidates, and finally performs classification and regression on them. Region-CNN
SCFNet
The SCFNet network structure is shown in Fig. 2. Our approach consists of two core modules, the LCM and the NLFM, which are designed to correct local semantic features and to obtain long-range semantic features, respectively. The network architecture is built on the FPN (Lin et al., 2017) and PANet (Liu et al., 2018) model structures; further details of the network and module design are discussed in this section.
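The FPN backbone that SCFNet builds on fuses a coarse, semantically strong level into each finer level via upsampling and lateral addition. A minimal NumPy sketch of this top-down pathway (assuming lateral 1x1 convolutions have already matched the channel counts, and using nearest-neighbour upsampling) is:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """FPN-style top-down fusion.

    `features` is ordered shallow-to-deep, each (C, H, W) with H and W
    halving at every level. The coarsest map is upsampled and added to
    the next lateral feature, repeatedly, down to the finest level.
    """
    outs = [features[-1]]                      # start from the deepest level
    for lateral in reversed(features[:-1]):
        outs.append(lateral + upsample2x(outs[-1]))
    return outs[::-1]                          # shallow-to-deep again
```

PANet then adds a second, bottom-up aggregation path over these fused maps; the sketch above covers only the FPN half.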
Training configurations and datasets
Our experiments were performed on a GPU workstation configured with Ubuntu 21.0, CUDA 10.0, cuDNN 7.0, and an NVIDIA RTX 3090 with 24 GB of video memory. The model was implemented in Python 3.7 and PyTorch 1.1.0. Since mean Average Precision (mAP) is clearly defined and widely used for evaluating multi-class object detection, we report standard mAP with an IoU threshold of 0.5. During model training, we use several data enhancement methods,
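Under the mAP@0.5 protocol mentioned above, a detection counts as a true positive only if its Intersection over Union (IoU) with an unmatched ground-truth box is at least 0.5. The IoU computation itself, for axis-aligned boxes, is:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)      # intersection / union
```

Per class, precision-recall pairs are accumulated over confidence-ranked detections and the area under the precision-recall curve gives AP; mAP is the mean of AP over all classes.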
Conclusion
In this paper, we find that the effect of complex backgrounds on targets, and the difficulty posed by small, densely arranged targets with subtle differences between classes, are key challenges of target detection in high-resolution remote sensing images that are often ignored by general detection methods. To solve these two problems, we introduce two effective network modules, the LCM and the NLFM. The LCM can correct and refine local features to distinguish foreground features from the background. The NLFM can
CRediT authorship contribution statement
Chenke Yue: Methodology, Writing – review & editing. Junhua Yan: Conceptualization, Methodology. Yin Zhang: Software, Investigation. Zhaolong Luo: Data curation. Yong Liu: Supervision. Pengyu Guo: Visualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by National Defense Science and technology foundation strengthening program (Grant No. 2021-JCJQ-JJ-0834), the Fundamental Research Funds for the Central Universities (Grant No. NJ2022025), and the National Natural Science Foundation of China (Grant No. 61901504).
References (54)
- et al., Rotation-aware and multi-scale convolutional neural network for object detection in remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing (2020)
- et al., A CenterNet++ model for ship detection in SAR images, Pattern Recognition (2021)
- et al., Agricultural remote sensing big data: Management and applications, Journal of Integrative Agriculture (2018)
- et al., Object detection in optical remote sensing images: A survey and a new benchmark, ISPRS Journal of Photogrammetry and Remote Sensing (2020)
- et al., ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing (2021)
- et al., Aircraft detection in remote sensing image based on corner clustering and deep learning, Engineering Applications of Artificial Intelligence (2020)
- et al., Fast and accurate multi-class geospatial object detection with large-size remote sensing imagery using CNN and Truncated NMS, ISPRS Journal of Photogrammetry and Remote Sensing (2022)
- et al., Deep learning in environmental remote sensing: Achievements and challenges, Remote Sensing of Environment (2020)
- et al., HyNet: Hyper-scale object detection network framework for multiple spatial resolution remote sensing imagery, ISPRS Journal of Photogrammetry and Remote Sensing (2020)
- et al., Object-based cloud and cloud shadow detection in Landsat imagery, Remote Sensing of Environment (2012)