An enhanced SSD with feature fusion and visual reasoning for object detection

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

The Single Shot MultiBox Detector (SSD) is one of the top-performing object detection algorithms in terms of both accuracy and speed. SSD achieves impressive performance on various datasets by using different output layers for object detection. However, each layer in the feature pyramid is used independently, so SSD considers only the fine-grained details of objects and ignores the context surrounding them. In this paper, we propose an enhanced SSD, called ESSD, that improves on the conventional SSD by fusing the feature maps of different output layers rather than adding layers close to the input data. Our method uses two-way transfer of feature information together with feature fusion to enhance the network. To further assist object detection, we propose a visual reasoning method that fully exploits the relationships between objects instead of relying only on the features of the objects themselves. This visual reasoning proves especially effective for detecting objects that are small or have small features. To evaluate the proposed ESSD, we trained the model on the VOC2007 and VOC2012 training sets and evaluated performance on the Pascal VOC2007 test set. For \(300 \times 300\) input, ESSD achieves 79.2% mean average precision (mAP) at 52.0 frames per second (FPS); for \(512 \times 512\) input, it achieves 82.4% mAP at 18.6 FPS. These results demonstrate that our method achieves state-of-the-art mAP, outperforming the conventional SSD and other advanced detectors.
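
The abstract describes fusing feature maps from different SSD output layers so that finer levels of the pyramid also see context from coarser ones. As a rough illustration, the sketch below shows one way such a fusion block could be written in PyTorch. The layer names (conv4_3, fc7), channel widths, and the project-upsample-add-smooth fusion rule are illustrative assumptions, not the authors' exact ESSD design, which additionally uses two-way (bidirectional) information transfer and a visual reasoning module not shown here.

```python
# Hypothetical sketch of cross-layer feature fusion in an SSD-style pyramid.
# Layer names, channel counts, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Fuses a deeper (coarser) feature map into a shallower (finer) one."""

    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        # 1x1 convolutions project both inputs to a common channel width.
        self.lateral = nn.Conv2d(shallow_ch, out_ch, kernel_size=1)
        self.top_down = nn.Conv2d(deep_ch, out_ch, kernel_size=1)
        # 3x3 convolution smooths the fused map before a detection head uses it.
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        # Upsample the deep map to the shallow map's spatial size, then add.
        deep_up = F.interpolate(self.top_down(deep), size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = self.lateral(shallow) + deep_up
        return self.smooth(F.relu(fused))


if __name__ == "__main__":
    # Example sizes loosely modelled on SSD300's conv4_3 (38x38) and fc7 (19x19) maps.
    conv4_3 = torch.randn(1, 512, 38, 38)
    fc7 = torch.randn(1, 1024, 19, 19)
    fuse = FusionBlock(shallow_ch=512, deep_ch=1024, out_ch=256)
    print(fuse(conv4_3, fc7).shape)  # torch.Size([1, 256, 38, 38])
```

Projecting both maps to a common channel width before adding keeps the fused map's size independent of the pyramid level, which is one common way top-down feature fusion is implemented in FPN- and DSSD-style detectors.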

Acknowledgements

This project was partially supported by Grants from the Natural Science Foundation of China (71671178, 9154620, and 61202321) and by the open project of the Key Lab of Big Data Mining and Knowledge Management. It was also supported by the Hainan Provincial Department of Science and Technology under Grant No. ZDKJ2016021 and by Guangdong Provincial Science and Technology Project 2016B010127004.

Author information

Corresponding author

Correspondence to Jiaxu Leng.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

About this article

Cite this article

Leng, J., Liu, Y. An enhanced SSD with feature fusion and visual reasoning for object detection. Neural Comput & Applic 31, 6549–6558 (2019). https://doi.org/10.1007/s00521-018-3486-1
