Abstract
Instance retrieval is a fundamental problem in the multimedia field because of its many applications. Since relevancy is defined at the instance level, it is more challenging than traditional image retrieval. Recent advances show that Convolutional Neural Networks (CNNs) offer an attractive method for image feature representation. However, a standard CNN extracts features from the whole image, so the extracted features contain a large amount of background noise, leading to poor retrieval performance. To solve this problem, this paper proposes a deep region CNN method with object detection for instance-level object retrieval, which has two phases: offline Faster R-CNN training and online instance retrieval. First, we train a Faster R-CNN model to better locate the object regions. Second, we extract CNN features from the detected object region and then retrieve relevant images based on the visual similarity of these features. Furthermore, we use three different feature fusion strategies based on the detected object region candidates from Faster R-CNN. We conduct experiments on a large dataset, INSTRE, with 23,070 object images and an additional one million distractor images. Qualitative and quantitative evaluation results demonstrate the advantage of the proposed method. In addition, extensive experiments on the Oxford dataset further validate its effectiveness.
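The online phase described above (crop the detected object region, describe it, rank database images by visual similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the bounding boxes are assumed to come from an already-trained detector, and the descriptor is a simple colour-statistics placeholder standing in for CNN features. Only the ranking by cosine similarity of L2-normalized vectors is meant to carry over.

```python
import numpy as np

def crop_region(image, box):
    """Crop an H x W x C image to box = (x1, y1, x2, y2) in pixel coordinates.
    The box is assumed to come from an object detector."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def describe(region):
    """Placeholder descriptor (assumption): per-channel mean and standard
    deviation, L2-normalized. A real system would use CNN activations here."""
    flat = region.reshape(-1, region.shape[-1]).astype(np.float64)
    vec = np.concatenate([flat.mean(axis=0), flat.std(axis=0)])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def rank_by_similarity(query_vec, db_vecs):
    """Return database indices sorted by descending cosine similarity.
    Inputs are assumed L2-normalized, so the dot product is the cosine."""
    scores = db_vecs @ query_vec
    order = np.argsort(-scores, kind="stable")
    return order, scores[order]
```

In use, each database image would be cropped to its detected box and described once offline; a query is cropped and described the same way, then passed to `rank_by_similarity` together with the stacked database descriptors. Because the descriptor is computed on the cropped region only, background pixels outside the box do not affect the ranking.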
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (61532018, 61322212, 61602437, 61672497, 61472229 and 61202152), in part by the Beijing Municipal Commission of Science and Technology (D161100001816001), in part by the Beijing Natural Science Foundation (4174106), in part by the Lenovo Outstanding Young Scientists Program, in part by the National Program for Special Support of Eminent Professionals and the National Program for Support of Top-notch Young Professionals, and in part by the China Postdoctoral Science Foundation (2016M590135, 2017T100110). This work was also supported in part by the Science and Technology Development Fund of Shandong Province of China (2016ZDJS02A11 and ZR2017MF027), the Taishan Scholar Climbing Program of Shandong Province, and the SDUST Research Fund (2015TDJH102).
Cite this article
Mei, S., Min, W., Duan, H. et al. Instance-level object retrieval via deep region CNN. Multimed Tools Appl 78, 13247–13261 (2019). https://doi.org/10.1007/s11042-018-6427-1
DOI: https://doi.org/10.1007/s11042-018-6427-1