ReinforceNet: A reinforcement learning embedded object detection framework with region selection network

doi:10.1016/j.neucom.2021.02.073

Neurocomputing

Volume 443, 5 July 2021, Pages 369-379

https://doi.org/10.1016/j.neucom.2021.02.073

Abstract

In recent years, researchers have explored reinforcement learning based object detection methods. However, the performance of existing methods remains barely satisfactory. The main reasons are that current reinforcement learning based methods generate a sequence of inaccurate regions without a reasonable reward function and, lacking an effective region selection and refinement strategy, take the non-optimal region at the final step as the detection result. To tackle these problems, we propose a novel reinforcement learning based object detection framework, namely ReinforceNet, which is capable of region selection and refinement by integrating the reinforcement learning agents’ action space with a Convolutional Neural Network based feature space. In ReinforceNet, we redevelop the reward function so that the agent can be trained effectively and provide more accurate region proposals. To further optimize these proposals, we design a Convolutional Neural Network based region selection network (RS-net) and a bounding box refinement network (BBR-net). The former consists of two sub-networks, an Intersection-over-Union network (IoU-net) and a Completeness network (CPL-net), which are employed jointly to select the optimal region proposal. The latter refines the selected proposal into the final result. Extensive experimental results on two standard datasets, PASCAL VOC 2007 and VOC 2012, demonstrate that ReinforceNet improves region selection and learns better agent action representations for reinforcement learning, resulting in state-of-the-art performance.

Introduction

Recent years have witnessed several reinforcement learning (RL) based object detection methods [1], [2], [5], [6], [7], [8], [12]. These RL based methods typically formulate object detection as a Markov decision process (MDP), in which an RL agent follows an action-decision strategy to sequentially select actions that adjust the aspect ratios of input images over several steps until it triggers the terminal action. An obvious advantage of these RL methods is that only a few region proposals (usually no more than 10 candidates) are required for object detection, whereas Convolutional Neural Network (CNN) based approaches [3], [4], [9], [10], [11], [13] typically demand tens of thousands of pre-computed proposals, which makes optimal region selection difficult for them. However, the performance of existing RL based object detection methods remains barely satisfactory. The main reasons are that the RL agent (i) directly generates a sequence of inaccurate regions without a reasonable reward function, (ii) takes the non-optimal region at the final step as the detection result for lack of an effective region selection strategy, and (iii) relies only on the action space for bounding box regression in the RL process.
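The sequential search described above can be illustrated with a minimal sketch. The action names, the 10% step size, and the `policy` callback below are illustrative assumptions, not the action space used by ReinforceNet or by the cited methods; the sketch only shows how an agent emits a short sequence of regions until it chooses the terminal action.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


# Hypothetical action names (assumption for illustration only).
TRIGGER = "trigger"


def apply_action(box: Box, action: str, step: float = 0.1) -> Box:
    """Return the box after one action; sizes change by `step` of width/height."""
    dx = step * (box.x2 - box.x1)
    dy = step * (box.y2 - box.y1)
    if action == "left":
        return Box(box.x1 - dx, box.y1, box.x2 - dx, box.y2)
    if action == "right":
        return Box(box.x1 + dx, box.y1, box.x2 + dx, box.y2)
    if action == "up":
        return Box(box.x1, box.y1 - dy, box.x2, box.y2 - dy)
    if action == "down":
        return Box(box.x1, box.y1 + dy, box.x2, box.y2 + dy)
    if action == "shrink":
        return Box(box.x1 + dx, box.y1 + dy, box.x2 - dx, box.y2 - dy)
    raise ValueError(f"unknown action: {action}")


def run_episode(
    policy: Callable[[Box, List[Box]], str],
    start: Box,
    max_steps: int = 10,
) -> List[Box]:
    """Roll out one search episode and return every visited region."""
    history = [start]
    for _ in range(max_steps):
        # The policy sees the current region and the search history.
        action = policy(history[-1], history)
        if action == TRIGGER:
            break
        history.append(apply_action(history[-1], action))
    return history


if __name__ == "__main__":
    # Toy policy: shrink twice, then stop.
    toy_policy = lambda box, hist: "shrink" if len(hist) < 3 else TRIGGER
    for region in run_episode(toy_policy, Box(0, 0, 100, 100)):
        print(region)
```

Every region in the returned history is a candidate proposal, which is why an episode yields only a handful of candidates rather than thousands.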

In response to the aforementioned issues, we propose a reinforcement learning embedded object detection framework with region selection and refinement networks, a more accurate model that integrates the RL agents’ action space with a CNN-based feature space for object detection. The whole network consists of three main components: (1) RL optimization: a novel reward function for RL agent optimization, (2) RS-net: a region selection network for searching for the optimal region proposal, and (3) BBR-net: a bounding box refinement network for further regression.

(1) RL optimization: A reasonable reward function is key to RL optimization. However, in previous works [5], [6], [8], the IoU based reward function considers only the sign (positive or negative) of the IoU difference between adjacent regions and neglects its magnitude, which leaves the RL agent insensitive to small changes. To handle this problem, we incorporate the magnitude of the change in both IoU and Completeness between adjacent regions into the reward function to train the RL agent effectively. Completeness is a newly defined evaluation metric that measures how complete the target object is in the image. It is also reasonable to assume that the learning capacity of the RL agent gradually improves as the number of training epochs increases. Therefore, we introduce multiple agents rather than a single agent to cover and find the optimal region proposal among the detection results.
(2) Region selection network (RS-net): In the MDP, the RL agent [17] sequentially searches for objects by utilizing both the current observation of the region image and its historical search path. When the RL agent stops searching, the final-step region proposal is treated as the detection result. However, Fig. 1 shows that, across the results of different RL detection methods, most final-step region proposals are not optimal. To handle this problem, we design a novel network, namely RS-net, to select the optimal region proposal. RS-net consists of two sub-networks, IoU-net and CPL-net, which estimate the IoU and Completeness values of each region proposal, respectively. The predicted IoU and Completeness values are used jointly to assess the region proposals and select the optimal one, as shown in Fig. 1-(b) and Fig. 4.
(3) Bounding box refinement network (BBR-net): In contrast to CNN based methods [9], [13], standard RL based object detection methods employ the action space instead of the feature space for bounding box regression. For example, Bueno et al. [8] employ five pre-defined actions to refine the candidate bounding boxes. Nevertheless, the refinement result is severely limited by the parameters of the pre-defined action space, since the pre-defined actions cannot cover the space of target sizes. Motivated by the bounding box fine-tuning strategy adopted in two-stage object detection architectures [9], [13], we design a bounding box refinement network (BBR-net) that integrates both the action space and the feature space for further regression. Specifically, we recurrently use the CNN backbone to extract local feature maps of the proposals selected by RS-net and feed them into the RL framework to train BBR-net. This strategy provides a complementary mechanism for addressing inaccurate object localization.
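The difference between a sign-only reward and a magnitude-aware reward can be made concrete with a small sketch. The exact definitions are not given in this excerpt, so two things below are assumptions: Completeness is taken to be the fraction of the ground-truth box covered by the region, and the magnitude-aware reward is taken to be a weighted sum of the IoU change and the Completeness change (the weights `w_iou` and `w_cpl` are hypothetical). The paper's actual formulas may differ.

```python
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(b: BoxT) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: BoxT, b: BoxT) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(box: BoxT, gt: BoxT) -> float:
    inter = intersection(box, gt)
    union = area(box) + area(gt) - inter
    return inter / union if union > 0 else 0.0


def completeness(box: BoxT, gt: BoxT) -> float:
    # ASSUMPTION: share of the ground-truth object that lies inside the region.
    gt_area = area(gt)
    return intersection(box, gt) / gt_area if gt_area > 0 else 0.0


def sign_reward(prev: BoxT, curr: BoxT, gt: BoxT) -> float:
    # Sign-only reward: identical for a tiny and a large IoU improvement.
    return 1.0 if iou(curr, gt) > iou(prev, gt) else -1.0


def magnitude_reward(prev: BoxT, curr: BoxT, gt: BoxT,
                     w_iou: float = 1.0, w_cpl: float = 1.0) -> float:
    # ASSUMPTION: weighted sum of the change magnitudes of both metrics.
    d_iou = iou(curr, gt) - iou(prev, gt)
    d_cpl = completeness(curr, gt) - completeness(prev, gt)
    return w_iou * d_iou + w_cpl * d_cpl


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    prev = (0.0, 0.0, 20.0, 20.0)
    small_step = (0.0, 0.0, 19.0, 19.0)
    large_step = (0.0, 0.0, 10.0, 20.0)
    for name, curr in (("small", small_step), ("large", large_step)):
        print(name, sign_reward(prev, curr, gt), magnitude_reward(prev, curr, gt))
```

In the demo both steps receive the same sign-only reward, whereas the magnitude-aware variant distinguishes the large improvement from the small one.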

The major contributions of this paper are as follows:

(1) A novel reinforcement learning based object detection network, namely ReinforceNet, is proposed. ReinforceNet is capable of region selection and refinement by integrating the RL agents’ action space with a CNN-based feature space.
(2) We redevelop a reward function jointly guided by IoU and Completeness, which makes the RL agent sensitive to small change magnitudes between adjacent regions. In addition, we replace the single agent with multiple agents, which enriches the expressiveness of our object detection framework.
(3) Extensive experiments on the PASCAL VOC 2007 and VOC 2012 object detection benchmarks demonstrate the superior performance of ReinforceNet compared with state-of-the-art methods.


Related work

CNN based detector. The leading approaches in object detection are currently CNN-based deep detectors, which can be divided into two-stage [9], [23], [24] and single-stage detectors [10], [11]. Among two-stage detectors, the pioneering work R-CNN [13] formulates object detection by combining an external region proposal module with a region-wise classifier. Although this method shows promising robustness for object detection, its extensive computational cost to obtain region proposals

ReinforceNet model

In this section, we present ReinforceNet, a novel reinforcement learning embedded object detection framework. The complete technical pipeline is depicted in Fig. 2 and is composed of three main parts: (i) multiple RL agents for jointly generating more accurate region proposals, (ii) RS-net for selecting the optimal region proposal, and (iii) BBR-net for refining the optimal one into the detection result. These parts are described in the following sub-sections.
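The selection step in part (ii) can be sketched as follows. The sketch assumes that the predicted IoU and Completeness values are already available for each proposal (in ReinforceNet they come from IoU-net and CPL-net) and that they are combined by a convex weighting with a hypothetical parameter `alpha`; how the paper actually combines the two scores is not stated in this excerpt.

```python
from typing import NamedTuple, Sequence, Tuple


class ScoredProposal(NamedTuple):
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    pred_iou: float  # stand-in for an IoU-net output
    pred_cpl: float  # stand-in for a CPL-net output


def joint_score(p: ScoredProposal, alpha: float = 0.5) -> float:
    # ASSUMPTION: convex combination of the two predicted values.
    return alpha * p.pred_iou + (1.0 - alpha) * p.pred_cpl


def select_optimal(proposals: Sequence[ScoredProposal],
                   alpha: float = 0.5) -> ScoredProposal:
    """Pick the proposal with the highest joint score among all search steps."""
    if not proposals:
        raise ValueError("at least one proposal is required")
    return max(proposals, key=lambda p: joint_score(p, alpha))


if __name__ == "__main__":
    # Proposals in search order; the last one is the final-step region.
    steps = [
        ScoredProposal((0, 0, 100, 100), 0.4, 0.6),
        ScoredProposal((10, 10, 90, 90), 0.8, 0.9),
        ScoredProposal((20, 20, 80, 80), 0.7, 0.7),
    ]
    best = select_optimal(steps)
    print("selected:", best.box, "final step:", steps[-1].box)
```

Here the selected region is an intermediate step rather than the final one, which is the situation the region selection network is meant to handle; the selected box would then be passed on for refinement.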

Experiment setting

All the experiments in this section are conducted on two widely used object detection benchmark datasets, i.e., PASCAL VOC 2007 [29] and 2012 [30], both of which consist of 20 categories. Because the ground truth annotations of the VOC 2012 testing set have not been released publicly, we design two sets of experiments. 1) ReinforceNet is trained on the union of the 2007 and 2012 training-validation sets, and tested on the VOC 2007 testing set; 2) ReinforceNet is trained with VOC 2012 training

Conclusion

In this work, we have presented a general RL based object detection framework with RS-net and BBR-net. RS-net allows us to select the optimal region of the target by combining IoU-net and CPL-net. This framework is suited to finding more appropriate region proposals, since in practice it can effectively compute the IoU and Completeness values of each region proposal. To further refine the optimal region output by RS-net, we introduce the BBR-net to converge the

CRediT authorship contribution statement

Man Zhou: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Rujing Wang: Supervision. Chengjun Xie: Supervision, Writing - review & editing. Liu Liu: Supervision, Writing - review & editing. Rui Li: Software, Validation. Fangyuan Wang: Software, Validation. Dengshan Li: Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under grants 31401293, 31671586, and 61773360.

Man Zhou received the B.E. degree in automation from Wuhan University of Science and Technology, Wuhan, China, in 2013. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system at the University of Science and Technology of China, Hefei, China. His current research interests include deep reinforcement learning and computer vision.

References (42)

Y. Li et al., An aircraft detection framework based on reinforcement learning and convolutional neural networks in remote sensing images, Remote Sens. (2018)
D. Li et al., A2-RL: aesthetics aware reinforcement learning for image cropping
X. Chang et al., RCAA: relational context-aware agents for person search
J. Pang et al., Libra R-CNN: towards balanced learning for object detection
Z. Cai et al., Cascade R-CNN: delving into high quality object detection
X. Han et al., Active object detection with multistep action prediction using deep Q-network, IEEE Trans. Ind. Inf. (2019)
J.C. Caicedo et al., Active object localization with deep reinforcement learning
Z. Jie et al., Tree-structured reinforcement learning for sequential object localization, Neural Inf. Process. Syst. (2016)
M. Bellver et al., Hierarchical object detection with deep reinforcement learning, Neural Inf. Process. Syst. (2016)
S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Neural Inf. Process. Syst. (2015)
J. Redmon et al., You only look once: unified, real-time object detection
W. Liu et al., SSD: single shot multibox detector
P. Abbeel et al., Apprenticeship learning via inverse reinforcement learning
R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation
J.R.R. Uijlings et al., Selective search for object recognition, Int. J. Comput. Vision (2013)
V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015)
A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Neural Inf. Process. Syst. (2012)
V. Mnih et al., Playing Atari with deep reinforcement learning, Neural Inf. Process. Syst. (2013)
C.L. Zitnick et al., Edge boxes: locating object proposals from edges
N. Srivastava et al., Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. (2014)
K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (2015)

Rujing Wang received the B.E. degree in computer science from Huazhong University of Science and Technology, Wuhan, China, in 1987, the M.S. degree in electronic engineering from Dalian University of Technology, Dalian, China, in 1990, and the Ph.D. degree in pattern recognition and intelligent system from University of Science and Technology of China, Hefei, China, in 2005. He is currently a Professor and Researcher with the Institute of Intelligent Machinery of the Chinese Academy of Sciences. His main research interests include intelligent agriculture, the agricultural Internet of Things, and agricultural knowledge engineering.

Chengjun Xie received the M.S. degree in software engineering from the Hefei University of Technology, Hefei, China, in 2008, and the Ph.D. degree in image processing from the Hefei University of Technology, Anhui, China, in 2011. He is currently an Associate Researcher with the Institute of Intelligent Machinery of the Chinese Academy of Sciences. His research interests include crop disease and pest image recognition, agricultural big data, and the agricultural Internet of Things.

Liu Liu received the B.E. degree in information engineering aerospace information applications from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2015, and the M.S. degree in advanced computer science from University of Manchester, Manchester, United Kingdom, in 2016. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system with University of Science and Technology of China, Hefei. His current research interests are deep learning and computer vision.

Fangyuan Wang received the B.E. and M.S. degrees in electrical engineering and automation from Hefei University of Technology, Hefei, China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system with University of Science and Technology of China, Hefei. His current research interests are deep learning and computer vision.

Rui Li received the B.E. degree from Hebei University of Architecture, China, in 2010, and the M.S. degree in computer applied technology from Hefei University of Technology, Hefei, China, in 2013. He is currently pursuing the Ph.D. degree in electronic information with University of Science and Technology of China, Hefei. His current research interests are deep learning and computer vision.

Dengshan Li received the B.E. degree in electronic science and technology from Hefei University of Technology, Hefei, China, in 2008, and the M.S. degree in computer science and technology from Anhui University of Technology, Ma'anshan, China, in 2017. He is currently pursuing the Ph.D. degree in computer application technology with University of Science and Technology of China, Hefei. His current research interests are deep learning and computer vision.
