
Neurocomputing

Volume 249, 2 August 2017, Pages 95-105

Content-based image retrieval with compact deep convolutional features

https://doi.org/10.1016/j.neucom.2017.03.072

Abstract

Convolutional neural networks (CNNs) with deep learning have recently achieved remarkable success, with superior performance in computer vision applications. Most CNN-based methods extract image features at the last layer of a single CNN architecture using orderless quantization approaches, which limits the utilization of intermediate convolutional layers for identifying local image patterns. As one of the first works in the context of content-based image retrieval (CBIR), this paper proposes a new bilinear CNN-based architecture that uses two parallel CNNs as feature extractors. The activations of convolutional layers are used directly to extract image features at various locations and scales. The network is initialized with deep CNNs sufficiently pre-trained on a large generic image dataset and then fine-tuned for the CBIR task. Additionally, an efficient bilinear root pooling is proposed and applied to a low-dimensional pooling layer to reduce image features to compact yet highly discriminative descriptors. Finally, end-to-end training with backpropagation is performed to fine-tune the final architecture and learn its parameters for the image retrieval task. Experimental results on three standard benchmark image datasets demonstrate the outstanding performance of the proposed architecture at extracting and learning complex features for the CBIR task without prior knowledge about the semantic meta-data of images. For instance, using a very compact image vector of length 16, we achieve a retrieval accuracy (mAP) of 95.7% on Oxford 5K and 88.6% on Oxford 105K, outperforming the best results reported by state-of-the-art approaches. A noticeable reduction is also attained in the time required to extract image features and the memory required to store them.

Introduction

In the domain of content-based image retrieval (CBIR), retrieval accuracy depends essentially on the discriminative quality of the visual features extracted from images or small patches. Image contents (objects or scenes) may exhibit various deformations and variations, e.g. illumination, scaling, noise, and viewpoint changes, which makes retrieving similar images one of the most challenging vision tasks. Typical CBIR approaches consist of three essential steps applied to images: detecting interest points, formulating the image vector, and similarity/dissimilarity matching.
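For the final matching step, images are commonly ranked by the cosine similarity between L2-normalized image vectors. The following is a minimal illustrative sketch of that step (not the paper's implementation), using random stand-ins for real image descriptors:

```python
import numpy as np

def rank_by_cosine(query_vec, index_vecs):
    """Rank database images by cosine similarity to the query descriptor."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    X = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = X @ q                      # similarity of every database image
    return np.argsort(-sims), sims    # best matches first

# Usage with random stand-ins for real descriptors:
rng = np.random.default_rng(0)
index_vecs = rng.standard_normal((1000, 16))   # 1000 database images, 16-D vectors
query_vec = rng.standard_normal(16)
order, sims = rank_by_cosine(query_vec, index_vecs)
print(order[:5])  # indices of the five most similar images
```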

To extract representative image features, most existing CBIR approaches use hand-crafted low-level features, e.g. the scale-invariant feature transform (SIFT) [1] and speeded-up robust features (SURF) [2] descriptors. Such features are usually encoded by generic orderless quantization methods such as the vector of locally aggregated descriptors (VLAD) [3]. The resulting image representations have shown a high capability of preserving the local patterns of image contents by capturing local characteristics of image objects, e.g. edges and corners. They are therefore suitable for the image retrieval task and widely used for matching local patterns of objects. However, convolutional neural networks (CNNs) have recently demonstrated performance superior to hand-crafted features on image classification [4], [5], [6]. Adopting a deep learning procedure over multiple layers of convolutional filters enables CNNs to learn even complex representations for many vision and recognition tasks.
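To make the VLAD baseline concrete, the sketch below (an illustration of the general idea, not the exact encoder of [3]) accumulates the residuals of local descriptors around their nearest visual words, then applies the usual signed square-root and L2 normalizations:

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Encode local descriptors (n, d) into one VLAD vector given K centers (K, d)."""
    # Assign each local descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    K, d = centers.shape
    v = np.zeros((K, d))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)  # accumulate residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))   # power (signed square-root) normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```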

Many recent works [5], [7], [8] demonstrate that generic CNN features, adequately trained on sufficiently large and diverse image datasets, e.g. ImageNet [9], can be successfully applied to other visual recognition tasks. Additionally, properly fine-tuning CNNs on domain-specific training data achieves noticeable performance in common vision tasks [5], [10], including object localization and instance image retrieval. Despite the promising results achieved by CNNs so far, there is no exact understanding or common agreement on how these deep learning architectures work, especially at the intermediate hidden layers. Several successful approaches [11], [12], [13], [14] have applied CNNs to extract generic features for image retrieval tasks and obtained promising results. They mainly utilize the power of local features to generate a generic image representation based on pre-trained CNNs. Nevertheless, many open questions and challenges need further investigation. Foremost, the effectiveness of fine-tuning CNN models pre-trained for a specific task, e.g. image classification, when transferring them to the CBIR task. Secondly, the discriminative quality of image features extracted directly from the convolutional layers compared to features quantized by traditional generic approaches such as VLAD. Thirdly, the feasibility of reducing the unfavorably high-dimensional image representations generated by most existing CNN-based architectures. Finally, a proper investigation is required into how efficient connections can be made between several CBIR aspects, including query handling, similarity/dissimilarity matching, and retrieval performance in terms of search time and memory usage. All of these challenges motivated us to develop a different deep CNN architecture that addresses the problems associated with feature quantization, model fine-tuning, high dimensionality, and the effects of the training procedure and feature lengths on system performance.

Accordingly, the main aim of this paper is to propose a new CNN-based learning model in the context of CBIR. The proposed architecture is inspired by the bilinear models proposed by Tenenbaum and Freeman [15], which separate the "content" and "style" factors of perceptual systems, and by the promising results obtained with bilinear CNNs applied to fine-grained categorization [16]. Specifically, two parallel CNNs are adopted to extract image features directly from the activations of convolutional layers using only the visual contents and without prior knowledge about the semantic meta-data of images, i.e. no tags, annotations, or captions are used. Image representations are generated by accumulating the extracted features over image locations and scales in order to model local feature correlations. The proposed architecture is initialized with pre-trained deep CNN models that are adequately fine-tuned in an unsupervised manner to learn the parameters for CBIR tasks using several standard retrieval datasets. Moreover, an efficient compact root pooling layer is proposed, based on the compact bilinear pooling recommended by Gao et al. [19], which yields a noticeable improvement in retrieval accuracy. Most critically, the resulting final image vectors are very compact, reducing both the time needed to extract them and the memory required to index the images and to store the architecture with its parameters. Finally, the discriminative capability of the image descriptors obtained by the proposed model is examined on different CBIR tasks, e.g. general, object-focused, landmark, and large-scale image retrieval.
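As a rough illustration of the underlying operation, the sketch below bilinear-pools the conv activations of two parallel networks by summing their outer products over spatial locations, then applies a signed square-root ("root") and L2 normalization. This is a simplified assumption-based sketch of generic bilinear root pooling; the authors' low-dimensional compact projection step is not reproduced here:

```python
import numpy as np

def bilinear_root_pool(feat_a, feat_b):
    """Bilinear-pool two conv feature maps and apply root normalization.

    feat_a, feat_b: activations of shape (channels, height, width), one from
    each of the two parallel CNN feature extractors.
    """
    ca, h, w = feat_a.shape
    cb = feat_b.shape[0]
    A = feat_a.reshape(ca, h * w)
    B = feat_b.reshape(cb, h * w)
    phi = A @ B.T / (h * w)                     # sum of outer products over locations
    phi = phi.ravel()
    phi = np.sign(phi) * np.sqrt(np.abs(phi))   # element-wise signed square root
    return phi / (np.linalg.norm(phi) + 1e-12)  # L2 normalization
```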

The remainder of this paper is organized as follows: Section 2 reviews the related work in the literature; Section 3 presents the proposed compact bilinear architecture along with the complete retrieval framework; Section 4 demonstrates and discusses the experiments carried out on several standard image retrieval datasets; and Section 5 concludes this work.

Section snippets

Related work

The CNN architectures most commonly used in CBIR are initially trained for classification tasks, where the representations extracted from the higher layers of the network are usually used to capture semantic features for category-level classification. Transfer learning of generic CNN features, trained on very large classification-oriented image datasets, to image retrieval has shown noticeable performance in several works. Wan et al. [11] applied many existing deep learning

The framework of retrieval and deep learning

Our approach consists of three main steps: 1) initialize the architecture with deep CNN networks pre-trained on millions of images; 2) fine-tune the bilinear CNN architecture on image retrieval datasets, i.e. transfer learning; and 3) extract the features of query and dataset images. As shown in Fig. 1, the CNN architecture is based on two variants of recent neural networks [20]: imagenet-vgg-m (VGG-m) and imagenet-vgg-verydeep-16 (VGG-16), both pre-trained on ImageNet [9]. These CNNs
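For step 3, conv-layer activations can be obtained by truncating a pre-trained network before its classifier layers. The sketch below does this for VGG-16 using torchvision (an assumption of this sketch: VGG-m is not distributed with torchvision, so only one of the two backbones is shown, and a random tensor stands in for a preprocessed image):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Truncate a pre-trained VGG-16 after its last conv block so the forward
# pass returns convolutional activations rather than classifier logits.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
conv_extractor = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = torch.rand(1, 3, 224, 224)   # stand-in for a preprocessed image
    feats = conv_extractor(img)        # conv activations, e.g. (1, 512, 14, 14)
print(feats.shape)
```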

Image dataset and evaluation

Holidays dataset [22]: one of the standard benchmark datasets commonly used in CBIR to measure robustness against image rotations, viewpoint and illumination changes, blurring, etc. The dataset consists of 1491 high-resolution images with a large variety of scene types, e.g. natural, man-made, water and fire effects, etc., as shown in Fig. 2 (top row). The dataset contains 500 image groups that represent distinct scenes. The first image of each image group is the query image
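Retrieval quality on such datasets is typically reported as mean average precision (mAP). A minimal sketch of that standard metric (not code from the paper) follows:

```python
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: ranked_ids is the retrieval order, relevant_ids the ground truth."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(all_rankings, all_relevant):
    """mAP over all queries; inputs are parallel lists of rankings and ground truths."""
    return float(np.mean([average_precision(r, g)
                          for r, g in zip(all_rankings, all_relevant)]))
```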

Conclusion

This paper introduces compact bilinear CNN-based architectures for several CBIR tasks, using two parallel feature extractors and no prior knowledge about the semantic meta-data of image contents. Image features are extracted directly from the activations of convolutional layers and then greatly reduced to very low-dimensional representations using root bilinear compact pooling. The very deep architecture CRB-CNN-(16) and the medium architecture CRB-CNN-(M) are fine-tuned for three CBIR tasks:

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.


References (33)

  • J. Li et al., SERVE: soft and equalized residual vectors for image retrieval, Neurocomputing (2016)
  • Z. Liu et al., Uniforming residual vector distribution for distinctive image representation, IEEE Trans. Circuits Syst. Video Technol. (2015)
  • Z. Liu et al., Fine-residual VLAD for image retrieval, Neurocomputing (2016)
  • D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • H. Bay et al., SURF: speeded up robust features
  • H. Jégou et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012)
  • S. Razavian et al., CNN features off-the-shelf: an astounding baseline for recognition
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition
  • R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic...
  • M. Oquab et al., Learning and transferring mid-level image representations using convolutional neural networks
  • J. Deng et al., ImageNet: a large-scale hierarchical image database
  • K. Lin et al., Deep learning of binary hash codes for fast image retrieval
  • J. Wan et al., Deep learning for content-based image retrieval: a comprehensive study
  • Y. Gong et al., Multi-scale orderless pooling of deep convolutional activation features
  • J. Ng et al., Exploiting local features from deep networks for image retrieval

    Ahmad Alzu’bi received his Ph.D. in Computer Science in 2016 from University of the West of Scotland (UWS), United Kingdom. He also received his MSc in Computer Science in 2009 from Jordan University of Science and Technology (J.U.S.T), Jordan. Dr. Alzu’bi has worked in several academic and professional positions such as lecturer and IT trainer/supervisor in several universities and training institutes. He has joined the AVCN/UWS research group in 2014 and has many publications in reputed journals and conferences in the area of CBIR, deep learning, and software engineering. He is a regular reviewer for several international journals and conferences including Elsevier, IET, and IEEE. His research interests include: Multimedia Retrieval, Image Processing, Deep Learning, and Computer Vision.

    Abbes Amira received his Ph.D. in Computer Engineering in 2001 from Queen's University Belfast, United Kingdom. Since then, he has taken many academic and consultancy positions in the United Kingdom, Asia and the Middleast. During his career to date, Prof. Amira has been successful in securing substantial funding from government agencies and industry; he has supervised more than 20 Ph.D. students and has over 250 publications in top journals and conferences in the area of embedded computing, image and signal processing. He has been invited to give keynote talks, short courses and tutorials at many universities and international conferences and has been chair and program committee for a number of IEEE conferences including; tutorial presenter at the prestigious IEEE ICIP 2009, Chair of ECVW 2011, Program Chair of ECVW2010, Program Co-Chair of ICM12, DELTA 2008, IMVIP 2005 and General Co-Chair of ICM 2014. He is also a member of the IEEE Technical Committee for Biomedical Circuits and systems. He obtained many international awards, including the 2008 VARIAN prize offered by the Swiss Society of Radiobiology and Medical Physics. Prof. Amira has been a Ph.D. external examiner and member of advisory boards for many Universities worldwide and has participated as guest editor and member of the editorial board in many international journals. He has also been a regular referee for many national and international funding bodies, including (EPSRC-UK and QNRF-Qatar). He has taken visiting professor positions at the University of Tun Hussein Onn, Malaysia and the University of Nancy, Henri Poincare, France. Prof. Amira is a Fellow of IET, Fellow of the Higher Education Academy, Senior member of the IEEE, and Senior member of ACM. His research interests include: Embedded systems, high performance computing, Big Data and IoT, Connected Health, Image and Vision Systems, Biometric and Security.

    Naeem Ramzan received the M.Sc. degree in telecommunication from University of Brest, France, in 2004 and the Ph.D. degree in electronics engineering from Queen Mary University of London, London, U.K, in 2008. Currently he is a full Professor at the School of Engineering and Computing, University of the West of Scotland. Prof. Ramzan has authored or co-authored over 110 research publications, including journals, book chapters, and standardization contributions. He co-edited a book entitled Social Media Retrieval (Springer, 2013). He is a fellow of the Higher Education Academy and a senior member of IEEE. He served as a Guest Editor for a number of special issues in technical journals. He has organized and co-chaired three ACM Multimedia Workshops, and served as the Session Chair/Co-Chair for a number of conferences. He is the Co-Chair of the Ultra HD Group of the Video Quality Experts Group (VQEG) and the Co-Editor-in-Chief of VQEG E-Letter. He has participated in more than 20 projects funded by European and U.K. research councils.
