Content-based image retrieval with compact deep convolutional features
Introduction
In content-based image retrieval (CBIR), retrieval accuracy essentially depends on the discriminative quality of the visual features extracted from images or small patches. Image contents (objects or scenes) may undergo various deformations and variations, e.g. illumination, scaling, noise, and viewpoint changes, which makes retrieving similar images one of the most challenging vision tasks. Typical CBIR approaches consist of three essential steps applied to images: detection of interest points, formulation of an image vector, and similarity/dissimilarity matching.
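The final matching step of this pipeline can be illustrated with a minimal sketch: given an image vector for the query and vectors for the indexed images, rank the database by cosine similarity. The function name `cosine_rank` and the assumption of L2-comparable real-valued vectors are illustrative, not part of the paper.

```python
import numpy as np

def cosine_rank(query_vec, index_vecs):
    """Rank database images by cosine similarity to the query vector.

    query_vec:  1-D array (d,) for the query image.
    index_vecs: 2-D array (n, d), one row per indexed image.
    Returns (ranking, scores): indices sorted most-similar first,
    and the similarity score at each rank.
    """
    q = query_vec / np.linalg.norm(query_vec)
    X = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = X @ q                      # cosine similarity per database image
    order = np.argsort(-sims)         # most similar first
    return order, sims[order]
```

Any vector representation (hand-crafted or CNN-based) can be plugged into this step; only the quality of the vectors changes the ranking.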
To extract representative image features, most existing CBIR approaches use hand-crafted low-level features, e.g. the scale-invariant feature transform (SIFT) [1] and speeded-up robust features (SURF) [2] descriptors. Such features are usually encoded by general orderless quantization methods such as the vector of locally aggregated descriptors (VLAD) [3]. The resulting image representations have shown a high capability of preserving the local patterns of image contents by capturing local characteristics of image objects, e.g. edges and corners. They are therefore well suited to the image retrieval task and widely used for matching local patterns of objects. However, convolutional neural networks (CNNs) have recently demonstrated superior performance over hand-crafted features on image classification [4], [5], [6]. Adopting a deep learning procedure over multiple layers of convolutional filters enables CNNs to learn even complex representations for many vision and recognition tasks.
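The VLAD encoding mentioned above can be sketched in a few lines: assign each local descriptor to its nearest codebook centroid, accumulate the residuals per centroid, and normalize. This is a simplified illustration (function name `vlad_encode` and the signed-square-root normalization variant are assumptions), not the exact encoder used in [3].

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Encode local descriptors (n x d) against a codebook (k x d)
    by accumulating residuals to each descriptor's nearest centroid."""
    k, d = codebook.shape
    # nearest-centroid assignment for every descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    vlad = np.zeros((k, d))
    for i, c in enumerate(nearest):
        vlad[c] += descriptors[i] - codebook[c]   # residual accumulation
    vlad = vlad.ravel()                           # k*d image vector
    # signed square-root ("power") then L2 normalization, a common variant
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

Note the resulting dimensionality is k*d, which already hints at the high-dimensionality problem discussed later for CNN-based encodings as well.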
Many recent works [5], [7], [8] demonstrate that generic CNN features, adequately trained on sufficiently large and diverse image datasets, e.g. ImageNet [9], can be successfully transferred to other visual recognition tasks. Additionally, properly fine-tuning CNNs on domain-specific training data can achieve noticeable performance in common vision tasks [5], [10], including object localization and instance image retrieval. Despite the promising results achieved by CNNs so far, there is no exact understanding or common agreement on how these deep learning architectures work, especially at the intermediate hidden layers. Several successful approaches [11], [12], [13], [14] have applied CNNs to extract generic features for image retrieval tasks and obtained promising results. They mainly exploit the power of local features to generate a generic image representation based on pre-trained CNNs. Nevertheless, many open questions and challenges need further investigation. First, the effectiveness of fine-tuning CNN models pre-trained for a specific task, e.g. image classification, when transferring them to the CBIR task. Second, the discriminative quality of image features extracted directly from the convolutional layers compared to features quantized using traditional generic approaches such as VLAD. Third, the possibility of reducing the unfavorably high-dimensional image representations generated by most existing CNN-based architectures. Finally, a proper investigation is required into how efficient connections can be made between several CBIR aspects, including query handling, similarity/dissimilarity matching, and retrieval performance in terms of search time and memory usage.
All of these challenges motivated us to develop and utilize a different deep CNN architecture in order to address the problems associated with feature quantization, model fine-tuning, high dimensionality, and system performance as affected by the training procedure and feature lengths.
Accordingly, the main aim of this paper is to propose a new CNN-based learning model in the context of CBIR. The proposed architecture is inspired by the bilinear models proposed by Tenenbaum and Freeman [15], which model the separation between the “content” and “style” factors of perceptual systems, and by the promising results obtained with bilinear CNNs applied to fine-grained categorization [16]. Specifically, two parallel CNNs are adopted to extract image features directly from the activations of convolutional layers using only the visual contents and without prior knowledge of the semantic meta-data of images, i.e. no tags, annotations, or captions are used. Image representations are generated by accumulating the extracted features over image locations and scales in order to model local feature correlations. The proposed architecture is initialized with pre-trained deep CNN models that are adequately fine-tuned in an unsupervised manner to learn the parameters for CBIR tasks using several standard retrieval datasets. Moreover, an efficient compact root pooling layer is proposed based on the compact bilinear pooling of Gao et al. [19], which yields a noticeable improvement in retrieval accuracy. Most importantly, the resulting final image vectors are very compact, which reduces both the time needed to extract them and the memory required to index the images and to store the architecture with its parameters. Finally, the discriminative capability of the image descriptors obtained by the proposed model is examined on different CBIR tasks, e.g. general, object-focused, landmark, and large-scale image retrieval.
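The core bilinear idea, combining two parallel feature streams by a location-wise outer product that is sum-pooled over the image, can be sketched as follows. This is an illustrative toy on raw arrays (the function name `bilinear_pool` and the feature-map shapes are assumptions), not the paper's exact layer.

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Pool two conv feature maps (h x w x c1 and h x w x c2) from
    parallel CNN streams into one image vector: at every spatial
    location take the outer product of the two feature vectors,
    then sum over all locations."""
    h, w, c1 = fa.shape
    _, _, c2 = fb.shape
    A = fa.reshape(-1, c1)          # (h*w) x c1
    B = fb.reshape(-1, c2)          # (h*w) x c2
    phi = (A.T @ B).ravel()         # sum of outer products -> c1*c2 vector
    # signed square root (the "root" step) followed by L2 normalization
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    n = np.linalg.norm(phi)
    return phi / n if n > 0 else phi
```

The c1*c2 output length (e.g. 512 x 512 = 262,144 for two VGG-style streams) is exactly the high-dimensionality problem that the compact root pooling layer is designed to address.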
The remainder of this paper is organized as follows: Section 2 reviews the related work in the literature; Section 3 presents the proposed compact bilinear architecture along with the complete retrieval framework; Section 4 demonstrates and discusses the experiments carried out on several standard image retrieval datasets; and Section 5 concludes this work.
Related work
The CNN architectures most commonly used in CBIR are initially trained for classification tasks, where the representations extracted from the higher layers of CNN networks are usually used to capture semantic features for category-level classification. Transferring generic CNN features, trained on very large classification-based image datasets, to image retrieval has shown noticeable performance in several works. Wan et al. [11] applied many existing deep learning
The framework of retrieval and deep learning
Our approach consists of three main steps: 1) initialize the architecture with deep CNN networks pre-trained on millions of images; 2) fine-tune the bilinear CNN architecture on image retrieval datasets, i.e. transfer learning; and 3) extract features of query and dataset images. As shown in Fig. 1, the CNN architecture is based on two variants of recent neural networks [20]: imagenet-vgg-m (VGG-m) and imagenet-vgg-verydeep-16 (VGG-16), both pre-trained on ImageNet [9]. These CNNs
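To illustrate how the final feature-extraction step can avoid materializing the full bilinear vector, the following sketch uses the Random Maclaurin projection, one of the two estimators proposed by Gao et al. for compact bilinear pooling. The function name `compact_bilinear`, the sketch dimension `d`, and the fixed random seed are assumptions for this illustration, not the paper's exact configuration.

```python
import numpy as np

def compact_bilinear(fa, fb, d=512, seed=0):
    """Approximate the c1*c2 bilinear vector of two conv feature maps
    with a d-dimensional sketch (Random Maclaurin estimator).
    W1, W2 are fixed random +/-1 projection matrices."""
    rng = np.random.default_rng(seed)
    c1, c2 = fa.shape[-1], fb.shape[-1]
    W1 = rng.choice([-1.0, 1.0], size=(d, c1))
    W2 = rng.choice([-1.0, 1.0], size=(d, c2))
    A = fa.reshape(-1, c1)          # (h*w) x c1
    B = fb.reshape(-1, c2)          # (h*w) x c2
    # sketch each location, then sum-pool over all locations
    phi = ((A @ W1.T) * (B @ W2.T)).sum(axis=0) / d
    # root normalization followed by L2, as in the full bilinear case
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    n = np.linalg.norm(phi)
    return phi / n if n > 0 else phi
```

With d in the hundreds instead of c1*c2 in the hundreds of thousands, both the per-image index size and the matching time shrink by orders of magnitude, which is the motivation for a compact pooling layer.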
Image dataset and evaluation
Holidays dataset [22]: This is one of the standard benchmarking datasets commonly used in CBIR to measure robustness against image rotations, viewpoint and illumination changes, blurring, etc. The dataset consists of 1491 high-resolution images with a large variety of scene types, e.g. natural, man-made, water and fire effects, etc., as shown in Fig. 2 (top row). The dataset contains 500 image groups that represent distinct scenes. The first image of each image group is the query image
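Retrieval quality on such group-structured datasets is conventionally reported as mean average precision (mAP): for each query, precision is recorded at every rank where a ground-truth image of the query's group appears, averaged into an AP score, and APs are averaged over all queries. A minimal sketch (function names are illustrative):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: ranked_ids is the retrieval order returned by
    the system, relevant_ids the query's ground-truth group (with the
    query itself excluded, as in the Holidays protocol)."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, img in enumerate(ranked_ids, start=1):
        if img in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this hit
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in results) / len(results)
```

For example, a query whose two relevant images land at ranks 1 and 3 scores AP = (1/1 + 2/3) / 2 ≈ 0.833.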
Conclusion
This paper introduces compact bilinear CNN-based architectures for several CBIR tasks using two parallel feature extractors without prior knowledge of the semantic meta-data of image contents. Image features are extracted directly from the activations of convolutional layers and then substantially reduced to very low-dimensional representations using the root bilinear compact pooling. The very deep architecture CRB-CNN-(16) and medium architecture CRB-CNN-(M) are fine-tuned for three CBIR tasks:
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
References (33)
- et al., SERVE: soft and equalized residual vectors for image retrieval, Neurocomputing (2016)
- et al., Uniforming residual vector distribution for distinctive image representation, IEEE Trans. Circuits Syst. Video Technol. (2015)
- et al., Fine-residual VLAD for image retrieval, Neurocomputing (2016)
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
- et al., SURF: speeded up robust features
- et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- et al., ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012)
- et al., CNN features off-the-shelf: an astounding baseline for recognition
- et al., Very deep convolutional networks for large-scale image recognition
- R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic...
- Learning and transferring midlevel image representations using convolutional neural networks
- ImageNet: a large-scale hierarchical image database
- Deep learning of binary hash codes for fast image retrieval
- Deep learning for content-based image retrieval: a comprehensive study
- Multi-scale orderless pooling of deep convolutional activation features
- Exploiting local features from deep networks for image retrieval
Ahmad Alzu’bi received his Ph.D. in Computer Science in 2016 from University of the West of Scotland (UWS), United Kingdom. He also received his MSc in Computer Science in 2009 from Jordan University of Science and Technology (J.U.S.T), Jordan. Dr. Alzu’bi has worked in several academic and professional positions such as lecturer and IT trainer/supervisor in several universities and training institutes. He has joined the AVCN/UWS research group in 2014 and has many publications in reputed journals and conferences in the area of CBIR, deep learning, and software engineering. He is a regular reviewer for several international journals and conferences including Elsevier, IET, and IEEE. His research interests include: Multimedia Retrieval, Image Processing, Deep Learning, and Computer Vision.
Abbes Amira received his Ph.D. in Computer Engineering in 2001 from Queen's University Belfast, United Kingdom. Since then, he has taken many academic and consultancy positions in the United Kingdom, Asia and the Middleast. During his career to date, Prof. Amira has been successful in securing substantial funding from government agencies and industry; he has supervised more than 20 Ph.D. students and has over 250 publications in top journals and conferences in the area of embedded computing, image and signal processing. He has been invited to give keynote talks, short courses and tutorials at many universities and international conferences and has been chair and program committee for a number of IEEE conferences including; tutorial presenter at the prestigious IEEE ICIP 2009, Chair of ECVW 2011, Program Chair of ECVW2010, Program Co-Chair of ICM12, DELTA 2008, IMVIP 2005 and General Co-Chair of ICM 2014. He is also a member of the IEEE Technical Committee for Biomedical Circuits and systems. He obtained many international awards, including the 2008 VARIAN prize offered by the Swiss Society of Radiobiology and Medical Physics. Prof. Amira has been a Ph.D. external examiner and member of advisory boards for many Universities worldwide and has participated as guest editor and member of the editorial board in many international journals. He has also been a regular referee for many national and international funding bodies, including (EPSRC-UK and QNRF-Qatar). He has taken visiting professor positions at the University of Tun Hussein Onn, Malaysia and the University of Nancy, Henri Poincare, France. Prof. Amira is a Fellow of IET, Fellow of the Higher Education Academy, Senior member of the IEEE, and Senior member of ACM. His research interests include: Embedded systems, high performance computing, Big Data and IoT, Connected Health, Image and Vision Systems, Biometric and Security.
Naeem Ramzan received the M.Sc. degree in telecommunication from University of Brest, France, in 2004 and the Ph.D. degree in electronics engineering from Queen Mary University of London, London, U.K, in 2008. Currently he is a full Professor at the School of Engineering and Computing, University of the West of Scotland. Prof. Ramzan has authored or co-authored over 110 research publications, including journals, book chapters, and standardization contributions. He co-edited a book entitled Social Media Retrieval (Springer, 2013). He is a fellow of the Higher Education Academy and a senior member of IEEE. He served as a Guest Editor for a number of special issues in technical journals. He has organized and co-chaired three ACM Multimedia Workshops, and served as the Session Chair/Co-Chair for a number of conferences. He is the Co-Chair of the Ultra HD Group of the Video Quality Experts Group (VQEG) and the Co-Editor-in-Chief of VQEG E-Letter. He has participated in more than 20 projects funded by European and U.K. research councils.