Abstract
In an image, both the category and the location of an object are related to its global, spatial, and contextual visual information, all of which are important for accurate and efficient object detection. In this paper, we propose a region-based detector named Sequential Feature Fusion Network (SFFN), which simultaneously exploits the global, spatial, and multi-scale contextual Region-of-Interest (RoI) features of an object and fuses them with a novel method. Specifically, we design a Feature Fusion Block (FFB) to fuse the global and multi-scale contextual RoI features, which are extracted by an RoI pooling layer. We then concatenate the fused feature with the spatial RoI feature extracted by a Position-Sensitive RoI (PSRoI) pooling layer. Experimental results show that SFFN achieves significant improvements on both the PASCAL VOC 2007 and VOC 2012 datasets.
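To make the described pipeline concrete, below is a minimal sketch of the fusion in PyTorch. It is not the authors' implementation: it assumes torchvision's roi_pool and ps_roi_pool operators, uses a 1x1 convolution plus element-wise sum as a stand-in for the FFB (whose internal design is not given here), and uses hypothetical 1.5x/2x enlarged boxes for the multi-scale context; all channel sizes are illustrative.

```python
# Minimal sketch of an SFFN-style RoI feature fusion (assumptions noted above).
import torch
import torchvision.ops as ops

feat = torch.randn(1, 490, 32, 32)                    # backbone feature map; 490 = 10 * 7 * 7 so PSRoI pooling divides evenly
rois = torch.tensor([[0.0, 8.0, 8.0, 120.0, 120.0]])  # one RoI: (batch_idx, x1, y1, x2, y2) in image coordinates

def enlarge(boxes, factor):
    """Return context boxes scaled by `factor` around each RoI center (assumed scheme)."""
    b = boxes.clone()
    cx, cy = (b[:, 1] + b[:, 3]) / 2, (b[:, 2] + b[:, 4]) / 2
    hw, hh = (b[:, 3] - b[:, 1]) * factor / 2, (b[:, 4] - b[:, 2]) * factor / 2
    b[:, 1], b[:, 2], b[:, 3], b[:, 4] = cx - hw, cy - hh, cx + hw, cy + hh
    return b

# Global RoI feature from a plain RoI pooling layer.
global_feat = ops.roi_pool(feat, rois, output_size=(7, 7), spatial_scale=0.25)

# Multi-scale contextual RoI features: pool enlarged boxes (1.5x and 2x are assumptions).
ctx_feats = [ops.roi_pool(feat, enlarge(rois, f), output_size=(7, 7), spatial_scale=0.25)
             for f in (1.5, 2.0)]

# Stand-in for the Feature Fusion Block: shared 1x1 conv, then element-wise sum.
ffb = torch.nn.Conv2d(490, 490, kernel_size=1)
fused = ffb(global_feat)
for c in ctx_feats:
    fused = fused + ffb(c)

# Spatial RoI feature from Position-Sensitive RoI pooling (490 channels -> 10 per 7x7 bin).
spatial_feat = ops.ps_roi_pool(feat, rois, output_size=(7, 7), spatial_scale=0.25)

# Final step from the abstract: concatenate the fused feature with the spatial feature.
roi_descriptor = torch.cat([fused, spatial_feat], dim=1)
print(roi_descriptor.shape)  # torch.Size([1, 500, 7, 7])
```

The concatenated per-RoI descriptor would then feed the downstream classification and localization heads, as in other region-based detectors.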
Acknowledgments
This work is supported by the NSFC under Grants U1509206 and 61472276, and by the Tianjin Natural Science Foundation (No. 15JCYBJC15400).
Cite this paper
Wang, Q., Han, Y.: Sequential feature fusion for object detection. In: Hong, R., Cheng, W.-H., Yamasaki, T., Wang, M., Ngo, C.-W. (eds.) Advances in Multimedia Information Processing – PCM 2018. LNCS, vol. 11164. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00776-8_63