Multi-negative samples with Generative Adversarial Networks for image retrieval
Introduction
Finding images similar to one at hand is of wide practical interest. Imagine, for instance, a fashion-conscious user who uploads a just-taken street photo from her mobile phone in order to find visually similar clothing items. The visual search engine behind such an application plays the central role. Its fundamental problem is to identify and rank similar instances according to a user-provided query image. This task is known as instance-level image retrieval, and learning retrieval-specific representations effectively is the key to obtaining satisfactory performance [1], [2], [3].
In the literature there are two main categories of visual representations: hand-crafted features and neural-learning-based ones. The former generally consist of Scale-Invariant Feature Transform (SIFT) local features aggregated with the text-analysis-inspired Bag of Visual Words (BoVW) model [4], its extension the Fisher Vector (FV) representation [5], or the Vector of Locally Aggregated Descriptors (VLAD) [6], which combines ideas from both. The idea behind these methods is to aggregate local patches into a global image representation. However, all three types of feature, including BoVW, FV and VLAD, rely on an intermediate representation, and their visual vocabularies are built in a low-level feature space.
The latter category learns image features with deep neural architectures [7], [8], [9], [10]. Compared with hand-crafted features, these learned features have been widely used and have achieved better performance. Some studies leverage the transferability of image representations pre-trained on the large-scale ImageNet dataset. However, retrieval performance still lags behind the success achieved in image classification. This gap is largely attributable to the learned representations: they are semantically discriminative but not well suited to image retrieval. In other words, purely semantic discriminativeness is much more relevant to image classification than to instance-level retrieval.
To make visual representations better suited to image retrieval, structural information within samples can be leveraged. Metric learning, which attempts to learn a distance function over objects, has a natural connection to image retrieval. Recently, several methods [11], [12], [13], [14], [15], [16], [17] based on metric learning with neural networks have been proposed and have achieved superior performance. The idea behind these deep metric learning approaches is to capture the structural information hidden within the data: a nonlinear mapping from the high-dimensional image space onto a lower-dimensional representation space is learned by discriminatively training deep networks.
Specifically, the siamese network [11], [17] takes a query with either a positive or a negative sample at a time, while the triplet network [14], [16] takes a query together with one positive and one negative sample simultaneously. To train these deep metric networks, different loss functions have been employed. Their aim is twofold: to narrow the distance between a query and its positive sample, and to enlarge the distance between the query and its negative sample beyond a fixed margin. Convolutional neural networks (CNNs) with shared weights are then learned, so that similar images are mapped to nearby points in the representation space while dissimilar images are mapped apart from each other.
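As a concrete illustration of the triplet objective described above, the following is a minimal NumPy sketch. The embeddings are illustrative values, not outputs of the paper's CNN, and the hinge formulation is one common variant of the triplet loss.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the positive within `margin`
    of the anchor while pushing the negative further away."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Embeddings as the shared-weight CNN might produce them (toy values).
q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])   # close to the query
neg = np.array([-1.0, 0.0])  # far from the query
print(triplet_loss(q, pos, neg))  # → 0.0, the margin is already satisfied
```

When the loss is zero the triplet contributes no gradient, which is why sampling informative (hard) negatives matters in practice.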
However, the aforementioned metric-learning-based methods have two problems. The first is that only one negative sample is utilized at a time, implicitly treating all negative samples as equally important and drawing them with equal probability. In fact, negative samples differ in their retrieval difficulty. This observation is illustrated in Fig. 1, which shows a query image with its positive sample and three groups of negative images (easy, medium and hard). The easy group lies furthest from the query among the three groups and is unambiguously distinguished from the positive. By contrast, the hard group lies at almost the same distance from the query as the positive, so the two cannot easily be told apart. The second problem is that these methods are fundamentally designed for settings where supervised label information is plentiful. In practice, we often have only a few labeled training samples, so how to exploit the few labeled data together with the large amount of unlabeled data must be considered for image retrieval. Together, these two problems lead to insufficient mining of the structure within the available data.
In this article, we propose to utilize both virtual and real images to alleviate the above two problems. In our method, the virtual images are generated with Generative Adversarial Networks (GANs), while the real images, consisting of one positive sample and multiple negative samples, are drawn with a random sampling algorithm. Both types of data are fed into CNNs with shared weights, and the adversarial loss is combined with a triplet loss and a multi-negative loss to train the CNNs. In this way we capture a more complete neighborhood structure and the discriminative information among negative samples of different difficulty levels. Meanwhile, the generated samples enlarge the training set, improving overall performance.
The main contributions of this article are highlighted as follows.
- 1)
We develop a framework using both virtual and real images for image retrieval. The virtual images are generated with a semi-supervised GAN, and the real images are organized as tuples containing multiple negative samples. Three types of losses are combined to train deep CNNs with shared weights.
- 2)
We propose a multi-negative loss and an effective sampling strategy. The multi-negative loss captures the neighborhood structure and the discriminative information among negative samples of different retrieval difficulties, and the sampling strategy draws hard negatives with a higher probability.
- 3)
We demonstrate the advantages of our proposed framework for the task of image retrieval with extensive experiments on three publicly available datasets.
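Contribution 2) can be illustrated with a small sketch. Both the loss and the biased sampler below are hypothetical reconstructions from the description above (one hinge term per negative, and sampling weights that decay with distance to the query); the paper's exact formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_negative_loss(anchor, positive, negatives, margin=0.2):
    """One hinge term per negative: hard negatives (close to the
    anchor) violate the margin and dominate the sum."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_negs = np.sum((negatives - anchor) ** 2, axis=1)
    return np.maximum(0.0, d_pos - d_negs + margin).sum()

def sample_negatives(anchor, pool, k, temperature=0.5):
    """Draw k negatives with probability biased toward hard ones:
    a smaller distance to the anchor yields a larger weight."""
    d = np.sum((pool - anchor) ** 2, axis=1)
    w = np.exp(-d / temperature)       # closer -> larger weight
    idx = rng.choice(len(pool), size=k, replace=False, p=w / w.sum())
    return pool[idx]
```

Under this scheme easy negatives, which already satisfy the margin, are rarely drawn, so most gradient signal comes from the hard group in Fig. 1.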
Related work
We briefly group and describe the related work in two directions: learning representations for image retrieval, and virtual image generation with GANs.
Due to the widespread use of social mobile applications, image retrieval has attracted considerable interest from the visual AI community. Traditionally, the task is to find images similar to a user-provided query image. Note that diversity has recently been introduced to obtain a better overview of a query object [18].
Method
In this section, we describe the image retrieval method in detail. The overall architecture of representation learning for image retrieval is shown in Fig. 2. In this figure, the deep CNN with shared weights is the main function to be learned. Once learned, the representation function is utilized to perform the retrieval procedure. We first obtain the corresponding representations for all images. Then, in the representation space, distances of the query image to
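The retrieval step described above, embedding all images once and then ranking the gallery by distance to the query, could be sketched as follows. The representations here are toy vectors standing in for the learned CNN features.

```python
import numpy as np

def rank_by_distance(query_repr, gallery_reprs):
    """Rank gallery images by Euclidean distance to the query in the
    learned representation space (smallest distance first)."""
    d = np.linalg.norm(gallery_reprs - query_repr, axis=1)
    return np.argsort(d)

# Toy representations: index 2 is closest to the query.
gallery = np.array([[0.0, 1.0], [1.0, 1.0], [0.1, 0.0], [2.0, 2.0]])
query = np.array([0.0, 0.0])
print(rank_by_distance(query, gallery))  # → [2 0 1 3]
```

Since the gallery representations can be precomputed offline, query time reduces to one forward pass plus a nearest-neighbor search.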
Experimental settings
We evaluate our method by performing image retrieval on publicly available datasets. In the following sections, we first describe the three adopted datasets, and then introduce in detail the evaluation protocol and the baseline methods used in our experiments.
Experimental results
In this section, we first report the performance of all methods on these datasets. We then analyze our model by investigating important parameters and by visualizing the generated virtual images. Finally, we give a concise discussion of computational complexity.
Computational complexity
According to Algorithm 1, the computational complexity of our image retrieval model depends on three main factors: (1) clustering with the affinity propagation (AP) procedure, (2) generating virtual images with GANs, and (3) optimizing network parameters with Adam back-propagation. The complexity of AP clustering is O(Q²) [48], where Q denotes the number of samples used during clustering. Because the latter two factors both involve the convolutional neural networks,
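The O(Q²) term comes from the dense pairwise similarity matrix that AP passes messages over. A minimal sketch of that bottleneck, using the negative squared Euclidean distance as the similarity (a common choice for AP, assumed here):

```python
import numpy as np

def ap_similarity_matrix(X):
    """The Q x Q similarity matrix affinity propagation operates on:
    both its construction and the message passing scale as O(Q^2)."""
    sq = np.sum(X ** 2, axis=1)
    # -(||x_i||^2 + ||x_j||^2 - 2 x_i . x_j) = -||x_i - x_j||^2
    return -(sq[:, None] + sq[None, :] - 2.0 * X @ X.T)

X = np.random.default_rng(0).normal(size=(100, 8))
S = ap_similarity_matrix(X)
print(S.shape)  # (100, 100): memory grows quadratically in Q
```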
Conclusion
In this paper, we explore combining virtual images generated with GANs and real images obtained with multi-negative sampling for the task of image retrieval. Specifically, the virtual images are generated by GANs together with a triplet loss in a semi-supervised scenario, while the real images, including multiple negative samples, are drawn by a random sampling algorithm matched to our multi-negative loss. Furthermore, these losses are incorporated together for jointly learning the weights
Declarations of interest
None.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China Nos. 61273365, 61472046 and 61472048, in part by the Major Projects of National Social Science Foundation of China No. 2016ZDA055, and in part by the Discipline Building Plan in 111 Base No. B08004. The authors would like to thank Prof. Xuyan Tu at University of Science and Technology Beijing for giving helpful suggestions. The authors would also like to thank the editor and anonymous reviewers for their
References (51)
- et al., A survey of content-based image retrieval with high-level semantics, Pattern Recognit. (2007)
- et al., Quantization-based hashing: a general framework for scalable image and video retrieval, Pattern Recognit. (2018)
- et al., Images don't lie: transferring deep visual semantic features to large-scale multimodal learning to rank, Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD) (2016)
- et al., Image retrieval in multimedia databases: a survey, Proceedings of the Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (2009)
- et al., A survey on learning to hash, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- Visual categorization with bags of keypoints, Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV (2004)
- et al., Large-scale image retrieval with compressed Fisher vectors, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)
- et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- et al., CNN features off-the-shelf: an astounding baseline for recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2014)
- et al., DeepFashion: powering robust clothes recognition and retrieval with rich annotations, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
- Retrieving real world clothing images via multi-weight deep convolutional neural networks, Cluster Comput.
- End-to-end learning of deep visual representations for image retrieval, Int. J. Comput. Vis.
- Learning a similarity metric discriminatively, with application to face verification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Dimensionality reduction by learning an invariant mapping, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res.
- Learning fine-grained image similarity with deep ranking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Discriminative learning of deep convolutional feature point descriptors, Proceedings of the IEEE International Conference on Computer Vision (ICCV)
- FaceNet: a unified embedding for face recognition and clustering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Deep image retrieval: learning global representations for image search, Proceedings of the European Conference on Computer Vision (ECCV)
- Discovering latent aspects for diversity-induced image retrieval, IEEE MultiMed.
- Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell.
- Deep learning, Nature
- Backpropagation applied to handwritten zip code recognition, Neural Comput.
- ImageNet classification with deep convolutional neural networks, Proceedings of the Advances in Neural Information Processing Systems (NIPS)
- Very deep convolutional networks for large-scale image recognition, Proceedings of the International Conference on Learning Representations (ICLR)
Ruifan Li received the B.S. and M.S. degrees in control systems, and in circuits and systems from Huazhong University of Science and Technology, Wuhan, China, in 1998 and 2001, respectively. He received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2006. Since 2006, he joined the School of Computer Science at BUPT. From February 2011, he spent one year as a visiting scholar at Information Sciences Institute, University of Southern California, CA. Currently, he is an associate professor of School of Computer Science at BUPT and affiliated with Engineering Research Center of Information Networks, Ministry of Education. His current research activities include multimedia information processing, neural information processing, and statistical machine learning. Email: [email protected]
Xuesen Zhang received the M.E. degree in Computer Technology from Beijing University of Posts and Telecommunications, China, 2018. He is currently working as an Engineer at SenseTime Group Limited. Email: [email protected]
Guang Chen is currently pursuing his M.S. degree at the School of Computer Science, Beijing University of Posts and Telecommunications. He received the B.S. degree in Software Engineering from the School of Computer and Communication, Lanzhou University of Technology. Email: [email protected]
Yuzhao Mao received the B.E. degree from Nanchang University, China. He is currently pursuing a doctorate at Beijing University of Posts and Telecommunications. His research interests include image caption generation and multi-modal representation learning. Email: [email protected]
Xiaojie Wang received his Ph.D. degree from Beihang University in 1996. He is a professor and director of the Centre for Intelligence Science and Technology at Beijing University of Posts and Telecommunications. His research interests include natural language processing and multi-modal cognitive computing. He is an executive member of the Council of Chinese Association of Artificial Intelligence, director of Natural Language Processing Committee. He is a member of Council of Chinese Information Processing Society and Chinese Processing Committee of China Computer Federation. Email: [email protected]