ReinforceNet: A reinforcement learning embedded object detection framework with region selection network
Introduction
Recent years have witnessed several reinforcement learning (RL) based object detection methods [1], [2], [5], [6], [7], [8], [12]. These RL-based methods typically formulate object detection as a Markov decision process (MDP), in which an RL agent sequentially selects actions to adjust the aspect ratios of input images over several steps, following an action-decision strategy, until it triggers the terminal action. An obvious advantage of these RL methods is that only a few region proposals (usually no more than 10 candidates) are required for object detection, whereas Convolutional Neural Network (CNN) based approaches [3], [4], [9], [10], [11], [13] typically demand tens of thousands of pre-computed proposals, which makes optimal region selection difficult for them. However, existing RL-based object detection methods suffer from barely satisfactory performance. The main reasons are that the RL agent: (i) directly generates a sequence of inaccurate regions without a reasonable reward function, (ii) regards the non-optimal region at the final step as the detection result without an effective region selection strategy, and (iii) only adopts the action space for bounding box regression in the RL process.
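To make the MDP formulation concrete, the following is a minimal sketch of one search episode, not the implementation used in any of the cited methods. The action names, the relative step size of 0.1, and the `Policy` signature are assumptions introduced here for illustration; real action spaces differ from method to method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


# Hypothetical terminal action name (an assumption for this sketch).
STOP = "stop"


def apply_action(box: Box, action: str, step: float = 0.1) -> Box:
    """Return the box after one transformation; the step is relative to box size."""
    dx = step * (box.x2 - box.x1)
    dy = step * (box.y2 - box.y1)
    if action == "left":
        return Box(box.x1 - dx, box.y1, box.x2 - dx, box.y2)
    if action == "right":
        return Box(box.x1 + dx, box.y1, box.x2 + dx, box.y2)
    if action == "up":
        return Box(box.x1, box.y1 - dy, box.x2, box.y2 - dy)
    if action == "down":
        return Box(box.x1, box.y1 + dy, box.x2, box.y2 + dy)
    if action == "shrink":
        return Box(box.x1 + dx / 2, box.y1 + dy / 2, box.x2 - dx / 2, box.y2 - dy / 2)
    if action == "grow":
        return Box(box.x1 - dx / 2, box.y1 - dy / 2, box.x2 + dx / 2, box.y2 + dy / 2)
    raise ValueError(f"unknown action: {action}")


# A policy maps the current region and the action history to the next action.
Policy = Callable[[Box, List[str]], str]


def run_episode(policy: Policy, start: Box, max_steps: int = 10) -> List[Box]:
    """Roll out one episode and return every visited region, start included."""
    trajectory = [start]
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(trajectory[-1], history)
        if action == STOP:  # terminal action ends the search
            break
        trajectory.append(apply_action(trajectory[-1], action))
        history.append(action)
    return trajectory
```

Returning the whole trajectory rather than only the last box keeps every intermediate region available, which matters for issue (ii): the final-step region is not necessarily the best one visited.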
In this paper, we propose a reinforcement learning embedded object detection framework with a region selection and refinement network, a more accurate model that integrates the RL agents’ action space with the CNN-based feature space for object detection, in response to the aforementioned issues. The whole network consists of three main components: (1) RL optimization: a novel reward function for RL agent optimization, (2) RS-net: a region selection network for searching for the optimal region proposal, and (3) BBR-net: a bounding box refinement network for further regression.
- (1) RL optimization: A reasonable reward function is key for RL optimization. However, in previous works [5], [6], [8], the IoU-based reward function considers only the sign (positive/negative) of the IoU difference between adjacent regions and neglects the magnitude of the change, so the RL agent is not sensitive to small changes. To handle this problem, we incorporate the magnitude of the change in both IoU and Completeness between adjacent regions into the reward function to train the RL agent effectively. In particular, Completeness is a newly defined evaluation metric that measures how completely the target object is contained in the image region. Generally speaking, it is reasonable to assume that the learning capacity of the RL agent gradually improves as the number of training epochs increases. Therefore, we introduce multiple agents rather than a single agent to cover and find the optimal region proposal among the detection results.
- (2) Region selection network (RS-net): In an MDP, the RL agent [17] sequentially searches for objects by utilizing both the current observation of the region image and its historical search path. When the RL agent stops searching, the final-step region proposal is treated as the detection result. However, as the results of different RL detection methods in Fig. 1 indicate, most final-step region proposals are not optimal. To handle this problem, we design a novel network, namely RS-net, to select the optimal region proposal. RS-net consists of two sub-networks, IoU-net and CPL-net, which compute the IoU and Completeness values of each region proposal, respectively. The predicted IoU and Completeness values are used jointly to assess region proposals and select the optimal one, as shown in Fig. 1-(b) and Fig. 4.
- (3) Bounding box refinement network (BBR-net): Compared with CNN-based methods [9], [13], standard RL-based object detection methods employ the action space instead of the feature space for bounding box regression. For example, Bueno et al. [8] employ five pre-defined actions to refine the candidate bounding boxes. Nevertheless, the result of region refinement is severely limited by the parameters of the pre-defined action space, since the pre-defined actions cannot cover the space of target sizes. Motivated by the bounding box fine-tuning strategy adopted in two-stage object detection architectures [9], [13], we design a bounding box refinement network (BBR-net) that integrates both the action space and the feature space for further regression. Specifically, we recurrently use the CNN backbone to extract local feature maps of the proposals selected by RS-net and simultaneously feed them into the RL framework to train BBR-net. This strategy provides a complementary mechanism for addressing the problem of inaccurate object localization.
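The difference between a sign-only reward and a magnitude-aware reward, as described in component (1), can be sketched as follows. This is an illustration under stated assumptions, not the reward function of this paper: the definition of Completeness is not given in this section, so `completeness` below uses a hypothetical stand-in (the fraction of the ground-truth box covered by the region), and the weights `w_iou` and `w_cpl` are likewise assumed.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def completeness(box: Box, gt: Box) -> float:
    # ASSUMPTION: stand-in definition, the fraction of the ground truth inside the box.
    gt_area = area(gt)
    return intersection(box, gt) / gt_area if gt_area > 0 else 0.0


def sign_reward(prev: Box, cur: Box, gt: Box) -> float:
    """Sign-only reward: +1 if IoU increased, otherwise -1."""
    return 1.0 if iou(cur, gt) > iou(prev, gt) else -1.0


def magnitude_reward(prev: Box, cur: Box, gt: Box,
                     w_iou: float = 1.0, w_cpl: float = 1.0) -> float:
    """Magnitude-aware reward: weighted sum of the IoU and Completeness changes."""
    d_iou = iou(cur, gt) - iou(prev, gt)
    d_cpl = completeness(cur, gt) - completeness(prev, gt)
    return w_iou * d_iou + w_cpl * d_cpl
```

With a ground truth of `(0, 0, 10, 10)` and a previous region of `(5, 0, 15, 10)`, moving to `(4, 0, 14, 10)` and moving to `(1, 0, 11, 10)` both receive the same sign-only reward, while the magnitude-aware reward is larger for the second, much better move.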
The major contributions of this paper are as follows:
- (1) A novel reinforcement learning based object detection network, namely ReinforceNet, is proposed. ReinforceNet is capable of region selection and refinement by integrating the RL agents’ action space with the CNN-based feature space.
- (2) We redevelop a reward function jointly guided by IoU and Completeness, which makes the RL agent sensitive to small change magnitudes between adjacent regions. In addition, we replace the single agent with multiple agents, which enriches the expressiveness of our object detection framework.
- (3) Extensive experiments on the PASCAL VOC 2007 and VOC 2012 object detection benchmarks demonstrate the superior performance of ReinforceNet compared to state-of-the-art methods.
Related work
CNN-based detectors. The leading approaches in object detection are currently CNN-based deep detectors, which can be divided into two-stage [9], [23], [24] and single-stage detectors [10], [11]. Among two-stage detectors, the pioneering work R-CNN [13] formulates object detection by combining an external region proposal module with a region-wise classifier. Although this method shows promising robustness for object detection, its extensive computational cost to obtain region proposals
ReinforceNet model
In this section, we present ReinforceNet, a novel reinforcement learning embedded object detection framework. The complete technical pipeline is depicted in Fig. 2 and is composed of three main parts: (i) multiple RL agents for jointly generating more accurate region proposals, (ii) RS-net for selecting the optimal region proposal, and (iii) BBR-net for refining the optimal one as the detection result. These parts are described in the following sub-sections.
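Part (ii) of the pipeline can be illustrated with a short sketch of the selection step. It is a hedged illustration, not the RS-net implementation: `predict_iou` and `predict_cpl` are hypothetical callables standing in for IoU-net and CPL-net, and the weighted sum with `alpha` is an assumed way of using the two predicted values jointly, since the exact combination rule is not stated in this section.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Scorer = Callable[[Box], float]


def select_region(proposals: Sequence[Box],
                  predict_iou: Scorer,
                  predict_cpl: Scorer,
                  alpha: float = 0.5) -> Box:
    """Pick the proposal with the highest joint score over all visited regions.

    ASSUMPTION: the joint score is a weighted sum of the two predictions.
    """
    if not proposals:
        raise ValueError("at least one region proposal is required")

    def joint_score(box: Box) -> float:
        return alpha * predict_iou(box) + (1.0 - alpha) * predict_cpl(box)

    return max(proposals, key=joint_score)


if __name__ == "__main__":
    # Proposals pooled from the search paths of several agents (toy values).
    pooled = [(5.0, 0.0, 15.0, 10.0), (1.0, 0.0, 11.0, 10.0), (3.0, 0.0, 13.0, 10.0)]
    # Stub predictors for demonstration only; real scores would come from trained networks.
    fake_iou = {pooled[0]: 0.3, pooled[1]: 0.8, pooled[2]: 0.6}
    fake_cpl = {pooled[0]: 0.5, pooled[1]: 0.9, pooled[2]: 0.7}
    print(select_region(pooled, fake_iou.__getitem__, fake_cpl.__getitem__))
```

Scoring every pooled proposal, instead of returning the last one in a search path, is what allows an intermediate region to be chosen when the final-step region is not the best.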
Experiment setting
All the experiments in this section are conducted on two widely used object detection benchmark datasets, i.e., PASCAL VOC 2007 [29] and 2012 [30], each of which consists of 20 categories. Because the ground truth annotations of the VOC 2012 testing set have not been released publicly, we design two sets of experiments. 1) ReinforceNet is trained on the union of the 2007 and 2012 training-validation sets, and tested on the VOC 2007 testing set; 2) ReinforceNet is trained with VOC 2012 training
Conclusion
In this work, we have presented a general RL-based object detection framework with RS-net and BBR-net. The RS-net allows us to select the optimal region of the target by combining IoU-net and CPL-net. This framework is applicable to finding more appropriate region proposals since it can effectively compute the IoU and Completeness values of each region proposal in practice. To further refine the optimal region output by RS-net, we introduce the BBR-net to converge the
CRediT authorship contribution statement
Man Zhou: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Rujing Wang: Supervision. Chengjun Xie: Supervision, Writing - review & editing. Liu Liu: Supervision, Writing - review & editing. Rui Li: Software, Validation. Fangyuan Wang: Software, Validation. Dengshan Li: Software, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China under grants 31401293, 31671586, and 61773360.
References (42)
- et al. An aircraft detection framework based on reinforcement learning and convolutional neural networks in remote sensing images. Remote Sens. (2018)
- et al. A2-RL: aesthetics aware reinforcement learning for image cropping
- et al. RCAA: relational context-aware agents for person search
- et al. Libra R-CNN: towards balanced learning for object detection
- et al. Cascade R-CNN: delving into high quality object detection
- et al. Active object detection with multistep action prediction using deep Q-network. IEEE Trans. Ind. Inf. (2019)
- et al. Active object localization with deep reinforcement learning
- et al. Tree-structured reinforcement learning for sequential object localization. Neural Inf. Process. Syst. (2016)
- et al. Hierarchical object detection with deep reinforcement learning. Neural Inf. Process. Syst. (2016)
- et al. Faster R-CNN: towards real-time object detection with region proposal networks. Neural Inf. Process. Syst. (2015)
- You only look once: unified, real-time object detection
- SSD: single shot multibox detector
- Apprenticeship learning via inverse reinforcement learning
- Rich feature hierarchies for accurate object detection and semantic segmentation
- Selective search for object recognition. Int. J. Comput. Vision
- Human-level control through deep reinforcement learning. Nature
- ImageNet classification with deep convolutional neural networks. Neural Inf. Process. Syst.
- Playing Atari with deep reinforcement learning. Neural Inf. Process. Syst.
- Edge boxes: locating object proposals from edges
- Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.
- Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations
Man Zhou received the B.E. degree in automation from Wuhan University of Science and Technology, Wuhan, China, in 2013. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system at the University of Science and Technology of China, Hefei, China. His current research interests include deep reinforcement learning and computer vision.
Rujing Wang received the B.E. degree in computer science from Huazhong University of Science and Technology, Wuhan, China, in 1987, the M.S. degree in electronic engineering from Dalian University of Technology, Dalian, China, in 1990, and the Ph.D. degree in pattern recognition and intelligent system from the University of Science and Technology of China, Hefei, China, in 2005. He is currently a Professor and Researcher with the Institute of Intelligent Machinery of the Chinese Academy of Sciences. His main research interests include intelligent agriculture, the agricultural Internet of Things, and agricultural knowledge engineering.
Chengjun Xie received the M.S. degree in software engineering from the Hefei University of Technology, Hefei, China, in 2008, and the Ph.D. degree in image processing from the Hefei University of Technology, Anhui, China, in 2011. He is currently an Associate Researcher with the Institute of Intelligent Machinery of the Chinese Academy of Sciences. His research interests include crop disease and pest image recognition, agricultural big data, and the agricultural Internet of Things.
Liu Liu received the B.E. degree in information engineering aerospace information applications from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2015, and the M.S. degree in advanced computer science from the University of Manchester, Manchester, United Kingdom, in 2016. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system with the University of Science and Technology of China, Hefei. His current interests are deep learning and computer vision.
Fangyuan Wang received the B.E. and M.S. degrees in electrical engineering and automation from Hefei University of Technology, Hefei, China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system with the University of Science and Technology of China, Hefei. His current interests are deep learning and computer vision.
Rui Li received the B.E. degree from Hebei University of Architecture, China, in 2010, and the M.S. degree in computer applied technology from Hefei University of Technology, Hefei, China, in 2013. He is currently pursuing the Ph.D. degree in electronic information with the University of Science and Technology of China, Hefei. His current interests are deep learning and computer vision.
Dengshan Li received the B.E. degree in electronic science and technology from Hefei University of Technology, Hefei, China, in 2008, and the M.S. degree in computer science and technology from Anhui University of Technology, Ma'anshan, China, in 2017. He is currently pursuing the Ph.D. degree in computer application technology with the University of Science and Technology of China, Hefei. His current interests are deep learning and computer vision.