Abstract
Object detection is a classic problem in computer vision, and its main bottleneck lies in the fusion of multi-scale features. In this paper, we systematically study the design choices of neural network architectures for real-time object detection and propose Align-Yolact to improve instance segmentation accuracy. First, we propose a weighted bounding box, which improves the localization accuracy of predicted boxes. Second, we add a bi-directional feature pyramid network for feature fusion, which improves mask quality and accuracy on small targets. Owing to these optimizations and stronger backbones, we achieve state-of-the-art results in both detection efficiency and accuracy.
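To make the feature-fusion idea concrete, the sketch below shows a bi-directional feature pyramid (BiFPN)-style fusion node in PyTorch: features from different pyramid levels are merged with learnable, normalized weights rather than a plain sum. This is a minimal illustration of the general technique, not the authors' Align-Yolact code; the module name `WeightedFusion`, the channel count, and the tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of same-shape feature maps.
    Illustrative sketch only; not the Align-Yolact implementation."""
    def __init__(self, num_inputs: int = 2, eps: float = 1e-4):
        super().__init__()
        # one learnable, non-negative weight per input feature map
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.weights)          # keep weights >= 0
        w = w / (w.sum() + self.eps)      # normalize so weights sum to ~1
        return sum(wi * f for wi, f in zip(w, feats))

# Example: merge an upsampled coarse level with a finer lateral level.
fuse = WeightedFusion(num_inputs=2)
p5 = torch.randn(1, 256, 16, 16)                       # coarse pyramid level
p4 = torch.randn(1, 256, 32, 32)                       # finer pyramid level
p5_up = F.interpolate(p5, scale_factor=2, mode="nearest")
p4_td = fuse([p4, p5_up])                              # fused map, (1, 256, 32, 32)
```

In a full bi-directional feature pyramid, several such fusion nodes are stacked along top-down and bottom-up paths, with a convolution applied after each fusion, so that information flows repeatedly between coarse and fine scales.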
Acknowledgements
The authors would like to thank all the participants who took part in the experiments. This work was supported in part by the National Science Foundation of China (Grant No. 61841701), the Fujian Vocational College Intelligent Equipment Application Technology Collaborative Innovation Center Construction Project (Grant No. 2016-7), and the Science and Technology Project of the Transportation Department of Fujian Province (Grant No. 201934).
Cite this article
Lin, S., Zhu, K., Feng, C. et al. Align-Yolact: a one-stage semantic segmentation network for real-time object detection. J Ambient Intell Human Comput 14, 863–870 (2023). https://doi.org/10.1007/s12652-021-03340-4