NLFFTNet: A non-local feature fusion transformer network for multi-scale object detection
Introduction
With its rapid development, deep learning has been widely applied to many advanced tasks [1], such as click prediction [2], [3], [4], image captioning [5], [6], and semantic segmentation [7]. Object detection has become a research hotspot in computer vision and is a key component of many applications, including autonomous driving, robotics, and intelligent video surveillance. By simulating how the human visual system captures regions and objects of interest, computer vision research has made considerable progress in recent years. However, some challenges remain unaddressed, such as scale variation and partial occlusion. This paper examines recent solutions to these challenges and improves upon them.
Currently, the most effective countermeasure is the feature pyramid structure shown in Fig. 1. Shallow feature maps have high resolution and rich spatial information, preserving the geometric and positional details needed to detect small-scale objects, whereas the semantic information of large-scale objects is embedded in the low-resolution, high-level feature maps [7], [8].
In Fig. 1(a), the single-scale structure is used in some two-stage detectors such as Faster R-CNN [9] and R-FCN [10]. It uses only a single-scale feature map to detect objects and therefore does not supply enough geometric information for small object detection [13]. The feature pyramid network (FPN) shown in Fig. 1(b) obtains strong global contextual semantic information by combining bottom-up and top-down paths to fuse features from different layers. The pyramid structure shown in Fig. 1(c) is adopted by SSD, which makes predictions directly from the bottom-up feature pyramid. Fig. 1(d) shows FSSD, which combines SSD with FPN: the features are first fused from bottom to top, and the feature pyramid is then generated from the fused features to make predictions. However, the feature fusion strategy in FSSD causes a loss of low-level information. Recent works (e.g., RSSD [11], DSSD [12]) also try to combine low-resolution, semantically strong features with high-resolution, semantically weak features, using lateral connections in the top-down path to strengthen the shallow-layer features. These results suggest that aggregating global semantic information can effectively complement the detailed features of small objects [13]. However, these works still suffer from the following limitations:
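For reference, the following PyTorch sketch illustrates the FPN-style top-down fusion with lateral connections outlined above; the channel widths, number of levels, and nearest-neighbor upsampling are illustrative assumptions, not the exact configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Minimal FPN-style top-down fusion with lateral connections (illustrative sketch)."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs align each backbone stage to a common channel width.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # 3x3 convs smooth each fused map before it is passed to the detection head.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, feats):
        # feats are ordered shallow (high resolution) -> deep (low resolution).
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down path: upsample the deeper map and add it to the shallower lateral.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(f) for s, f in zip(self.smooth, laterals)]


# Example with feature maps shaped like a typical backbone's C3-C5 stages.
if __name__ == "__main__":
    c3 = torch.randn(1, 256, 64, 64)
    c4 = torch.randn(1, 512, 32, 32)
    c5 = torch.randn(1, 1024, 16, 16)
    pyramid = TopDownFusion()([c3, c4, c5])
    print([p.shape for p in pyramid])
```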
(1) Lack of long-distance dependency analysis: Existing feature fusion methods only model the interaction between corresponding local positions and fail to describe the long-distance dependencies among features. In fact, the semantic information at one location is correlated with that at other spatial locations, which is significant for an object detector [14], [15], [16]. As shown in Fig. 2(a), a typical autonomous driving scene contains persons, cars, and traffic lights. Small objects, such as distant traffic lights or occluded pedestrians, are difficult to recognize from the camera's viewpoint. At the same time, traffic rules impose semantic dependencies among these traffic elements, e.g., the car is waiting for the green light while the pedestrians on the zebra crossing are allowed to pass. Such hard-to-recognize objects can therefore be inferred from the long-distance dependencies captured by non-local operations across multi-scale feature maps (a minimal sketch of a non-local operation is given after this list of limitations).
(2) Redundant information in feature upsampling: For multi-scale fusion, the feature maps must be brought to a consistent shape by upsampling. Previous work has demonstrated that common upsampling methods, such as nearest-neighbor and bilinear interpolation, introduce redundant information and degrade detection accuracy [17].
(3) Data imbalance between different classes: The existing public datasets consist of large, medium, and small objects. The number of small objects is markedly smaller than that of the others, so existing algorithms lack sufficient prior information to learn from [18], [19], [20]. This leads to a degradation of the overall performance of the model.
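To make the notion of a non-local operation in limitation (1) concrete, the sketch below implements a standard embedded-Gaussian non-local block over a single feature map. It is a generic illustration of long-distance dependency modeling, not the NLFFT module proposed in this paper, and the channel-reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: every position attends to every other position."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)  # query projection
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)    # key projection
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)      # value projection
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)    # restore channel width

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        # Pairwise affinity between all spatial positions models long-distance dependency.
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual keeps the original detail
```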
To overcome the aforementioned limitations, this paper proposes a non-local feature fusion transformer convolutional network. The implementation of the network consists of five steps. First, the training dataset is dynamically augmented by copying and stitching the images containing small objects, so that the proportion of small objects in the whole training set is effectively increased. Then, the backbone module generates high-dimensional features from the input images using deep convolutional neural networks. Next, the novel non-local feature fusion transformer module captures the long-distance dependencies between different feature layers. Subsequently, the dilated convolution module, consisting of a series of dilated convolutional kernels, enlarges the receptive field over the backbone features. Finally, the detection module makes a prediction for each object. The contributions of the paper are summarized as follows:
(1) A novel non-local feature fusion transformer convolutional network (NLFFTNet) is proposed for object detection. It focuses on non-local semantic information by analyzing long-distance dependencies, and the captured long-distance semantic dependencies are used to assist detection. In addition, content-aware reassembly of features (CARAFE) is integrated to upsample the high-level feature maps with an enlarged receptive field, which effectively alleviates the redundant pixels introduced by conventional upsampling (a simplified content-aware upsampling sketch follows this list of contributions).
(2) A configurable mix-splicing (CMS) module for data augmentation is designed to address the data imbalance between different classes. The augmentation is configured according to the proportion of small objects, improving the recognition of objects at different scales, especially small objects (an illustrative mix-splicing sketch is given at the end of this section).
(3) Extensive experiments are conducted on two public datasets, KITTI and Pascal VOC, to verify the performance and generalization ability of our method. The experimental results show that our method outperforms previous models. The code of the NLFFT network and the CMS module is available at https://github.com/vivian13maker/NLFFT-network.
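To illustrate the content-aware upsampling referred to in contribution (1), the following sketch follows the general CARAFE recipe: a light kernel-prediction branch generates a normalized reassembly kernel for every upsampled position, which is then applied to the unfolded input neighborhood. The kernel size, compression width, and implementation details are assumptions for the example, not the exact operator configuration used in NLFFTNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAwareUpsample(nn.Module):
    """Simplified CARAFE-style upsampling: reassembly kernels are predicted from content."""

    def __init__(self, channels, scale=2, k_up=5, compressed=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, compressed, kernel_size=1)
        # Predict one k_up*k_up reassembly kernel per upsampled position.
        self.kernel_pred = nn.Conv2d(compressed, scale * scale * k_up * k_up,
                                     kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Kernel prediction branch: one softmax-normalized kernel per output location.
        kernels = F.pixel_shuffle(self.kernel_pred(self.compress(x)), self.scale)
        kernels = F.softmax(kernels, dim=1)                        # (b, k*k, sh, sw)
        # Reassembly branch: gather each k_up x k_up neighborhood of the input.
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)   # (b, c*k*k, h*w)
        patches = patches.view(b, c * self.k_up * self.k_up, h, w)
        patches = F.interpolate(                                   # map neighborhoods to output grid
            patches, scale_factor=self.scale, mode="nearest"
        ).view(b, c, self.k_up * self.k_up, h * self.scale, w * self.scale)
        # Weighted sum over the neighborhood with the content-aware kernels.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)
```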
The rest of the paper is organized as follows: Section 2 presents a brief review of related work; Section 3 describes the proposed method in detail; Section 4 reports the experimental details, results, and analysis; Section 5 concludes the paper and outlines future work.
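As a rough illustration of the configurable mix-splicing idea in contribution (2), the sketch below stitches four images containing small objects into one training sample and remaps their bounding boxes. The 2x2 grid layout, output size, per-tile resizing, and use of OpenCV are assumptions made for the example and are not the exact CMS procedure described later in the paper.

```python
import random
import cv2
import numpy as np

def mix_splice(samples, out_size=512):
    """Illustrative mix-splicing: tile four small-object images into one sample.

    samples: list of at least four (image, boxes) pairs; image is an HxWx3 uint8 array
    and boxes is an (N, 4) array of [x1, y1, x2, y2] in pixels. 2x2 layout is assumed.
    """
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    stitched = []
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    for (img, boxes), (ox, oy) in zip(random.sample(samples, 4), offsets):
        h, w = img.shape[:2]
        sx, sy = half / w, half / h                        # per-tile scale factors
        canvas[oy:oy + half, ox:ox + half] = cv2.resize(img, (half, half))
        for x1, y1, x2, y2 in boxes:                       # remap boxes into canvas coords
            stitched.append([x1 * sx + ox, y1 * sy + oy, x2 * sx + ox, y2 * sy + oy])
    return canvas, np.asarray(stitched)
```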
Related works
In this section, upsampling operators, data augmentation methods, multi-scale fusion for object detection, and transformers are reviewed in turn.
Methods
In this section, an overview of the proposed NLFFTNet is presented. In the following subsections, the configurable data augmentation method CMS, the backbone module, the non-local feature fusion transformer (NLFFT) module, the dilated convolution (DC) module, and the detection module are introduced in detail.
Experiments
In this section, the experimental setup is introduced first. Then, experiments are designed to demonstrate the effectiveness of the proposed NLFFT module and the data augmentation method CMS.
Conclusion and future work
In this research, inspired by the long-distance dependency between different pixel information, which is a piece of important feature information to implement multi-scale feature fusion, we propose a novel non-local feature fusion transformer module (NLFFT) that can capture long-distance dependency to characterize global semantic information. We further integrate the content-aware reassembly of features (CARAFE) to upsample feature maps, which can improve the detection performance by capturing
CRediT authorship contribution statement
Kai Zeng: Data curation, Writing – original draft, Writing – review & editing. Qian Ma: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Writing – review & editing. Jiawen Wu: Software, Data curation. Sijia Xiang: Software, Data curation. Tao Shen: Conceptualization, Project administration, Funding acquisition, Writing – review & editing. Lei Zhang: Writing – review & editing, Conceptualization, Methodology.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors gratefully acknowledge support from the National Natural Science Foundation of China (No. 61971208), the Yunnan Reserve Talents of Young and Middle-aged Academic and Technical Leaders (2019HB005), the Yunnan Young Top Talents of Ten Thousands Plan (Shen Tao, Zhu Yan, Yunren Social Development No. 2018 73), and the Major Science and Technology Projects in Yunnan Province (202002AB080001-8).
References
- et al., Stimulus-driven and concept-driven analysis for image caption generation, Neurocomputing, 2020.
- et al., Deep learning for ultrasound image caption generation based on object detection, Neurocomputing, 2020.
- et al., Realize your surroundings: Exploiting context information for small object detection, Neurocomputing, 2021.
- et al., Addressing scale imbalance for small object detection with dense detector, Neurocomputing, 2022.
- et al., Enhancing object detection for autonomous driving by optimizing anchor generation and addressing class imbalance, Neurocomputing, 2021.
- et al., Vector of locally and adaptively aggregated descriptors for image feature representation, Pattern Recognit., 2021.
- et al., Deep embedding of concept ontology for hierarchical fashion recognition, Neurocomputing, 2021.
- et al., MDFN: Multi-scale deep feature learning network for object detection, Pattern Recognit., 2020.
- et al., Local deep-feature alignment for unsupervised dimension reduction, IEEE Trans. Image Process., 2018.
- et al., Click prediction for web image reranking using multimodal sparse coding, IEEE Trans. Image Process., 2014.
- Learning to rank using user clicks and visual features for image retrieval, IEEE Trans. Cybern.
- Hierarchical deep click feature prediction for fine-grained image recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- SPRNet: Single-pixel reconstruction for one-stage instance segmentation, IEEE Trans. Cybern.
- Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis, IEEE Trans. Circuits Syst. Video Technol.
Kai Zeng was born in 1985. He received the Ph.D. degree from the University of Electronic Science and Technology of China in 2015. He is now an associate professor and master supervisor at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology. His research interests include granular computing, distributed computation, etc.

Qian Ma was born in 1997. She is currently working toward the M.S. degree in circuits and systems at Kunming University of Science and Technology. Her current research interests are in the areas of FPGA and deep learning.

Jiawen Wu was born in 1996. He received the bachelor's degree from Nanjing University of Aeronautics and Astronautics in 2015. He is currently pursuing the master's degree at Kunming University of Science and Technology. His research interest is the application of deep learning on FPGAs.

Sijia Xiang was born in 1997. She is currently working toward the M.S. degree in circuits and systems at Kunming University of Science and Technology. Her current research interests are in multi-sensor information fusion for autonomous driving.

Tao Shen was born in 1984. He received the Ph.D. degree from the Illinois Institute of Technology in 2013. He is now a professor and Ph.D. supervisor at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology. His research interests include intelligent perception and computation, artificial intelligence, terahertz technology, etc.

Lei Zhang received his Ph.D. in 2008 from the Graduate University of the Chinese Academy of Sciences, and spent another two years as a postdoctoral researcher at Tsinghua University. He is currently a professor at Tongji University. His research interests include AI, wireless communications, and multi-dimensional information processing.