Object-oriented convolutional features for fine-grained image retrieval in large surveillance datasets

https://doi.org/10.1016/j.future.2017.11.002

Highlights

  • Proposed to represent vehicle images with appropriate convolutional features.

  • Our method reduces number of feature maps without performance degradation.

  • Selected features yield better retrieval performance than the full feature set.

Abstract

Large-scale visual surveillance generates huge volumes of data at a rapid pace, giving rise to massive image repositories. Efficient and reliable access to relevant data in these ever-growing databases is highly challenging due to the complex nature of surveillance objects. Furthermore, the high inter-class visual similarity between vehicles requires the extraction of fine-grained, highly discriminative features. In recent years, features from deep convolutional neural networks (CNNs) have exhibited state-of-the-art performance in image retrieval. However, these features have been used without regard to their sensitivity to objects of a particular class. In this paper, we propose an object-oriented feature selection mechanism for deep convolutional features from a pre-trained CNN. Convolutional feature maps from a deep layer are selected based on an analysis of their responses to surveillance objects. The selected features represent the semantic features of surveillance objects and their parts with minimal influence from the background, effectively eliminating the need for a background removal procedure prior to feature extraction. Layer-wise mean activations of the selected feature maps form the discriminative descriptor for each object. These object-oriented convolutional features (OOCF) are then projected onto a low-dimensional Hamming space using locality-sensitive hashing approaches. The resulting compact binary hash codes allow efficient retrieval within large-scale datasets. Results on five challenging datasets reveal that OOCF achieves better precision and recall than the full feature set for objects with varying backgrounds.
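The hashing stage described in the abstract can be illustrated with one standard locality-sensitive hashing scheme, sign-of-random-projection hashing. The sketch below is a minimal illustration, not the paper's implementation: the descriptor dimensionality, code length, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_hash(descriptors, planes):
    # One bit per hyperplane: the sign of the projection onto its normal.
    return (descriptors @ planes.T > 0).astype(np.uint8)

def hamming(a, b):
    # Number of differing bits between two binary codes.
    return int(np.count_nonzero(a != b))

# 512-D real-valued descriptors (e.g., layer-wise mean activations) -> 64-bit codes
planes = rng.standard_normal((64, 512))
db = rng.standard_normal((1000, 512))
codes = lsh_hash(db, planes)

# A query that is a slightly perturbed copy of database item 42
query = db[42] + 0.05 * rng.standard_normal(512)
qcode = lsh_hash(query[None, :], planes)[0]
dists = np.array([hamming(qcode, c) for c in codes])
print(dists.argmin())
```

Codes produced this way preserve angular similarity in expectation, so near-duplicate descriptors map to nearby binary codes under the Hamming distance, which is what makes retrieval in large repositories efficient.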

Introduction

In recent years, we have seen a tremendous increase in the production and consumption of multimedia data, partly due to the advent of the social web and partly because of progress in surveillance, medical, industrial, mobile, and embedded computing technologies [1]. Consequently, multimedia data, including images and videos, are produced and stored in huge amounts. These multimedia repositories contain a wealth of highly useful information for administrators and decision makers, provided that efficient and reliable access to relevant data is ensured [2]. Content-based image retrieval (CBIR) systems attempt to locate images containing objects similar to those in a query image by analyzing their contents. CBIR has several applications in information retrieval, surveillance, medicine, e-commerce, industry, and the social web. Recently, it has attracted a lot of attention due to the rising interest in making the best use of available multimedia data [3]. The exponential increase in the volume of image data, and the inherent complexity of visual content (projecting the 3D world onto a 2D canvas), has made image retrieval increasingly difficult. The difficulty increases even further in fine-grained image retrieval due to the high degree of inter-class visual similarity [4]. One such problem arises when retrieving images from traffic surveillance datasets, where the main objects of interest are vehicles [[5], [6]]. Vehicles exhibit a high degree of visual similarity even when they belong to different categories.

Visual surveillance has become an undeniable necessity of the day, producing huge amounts of multimedia data that are stored for future analysis [[7], [8]]. Indexing and retrieval of such huge volumes of data require efficient representation methods [[9], [10]]. Though numerous ways exist to represent visual content in large datasets, the complex nature of visual data in surveillance limits the use of traditional image representation schemes. Earlier image retrieval methods used local features like the scale-invariant feature transform (SIFT) [11] and feature aggregation schemes like vectors of locally aggregated descriptors (VLAD) [12] and Fisher vectors (FV) [13]. In recent years, CNN-based features have prevailed as the state of the art for image retrieval and classification. Earlier works by Babenko and Lempitsky [14] and Razavian et al. [15] showed that features from a pre-trained CNN can be used to represent images, yielding state-of-the-art performance on large datasets. However, these approaches directly used activations from various layers without considering the suitability of those features for particular object classes.

In this paper, we investigate the convolutional feature maps of a pre-trained deep CNN to identify a set of optimal features for representing surveillance objects such as vehicles in image retrieval applications. A feature selection procedure is presented for vehicles, allowing us to select appropriate features for fine-grained image search. The main contributions of our work are as follows:

  • a.

    Convolutional activation features have been investigated for vehicles in order to select appropriate features for their effective representation.

  • b.

    An efficient feature selection procedure is presented through which it is shown that the number of feature maps can be considerably reduced without any degradation in performance. The selected features exhibit greater attention to the object of interest than the background.

  • c.

    It has also been shown through experiments that the selected features yield better retrieval performance at higher ranks than the full set of features.

The rest of the paper is organized as follows: Section 2 introduces relevant literature in the field of image retrieval. Section 3 presents the proposed approach. Experimental results are discussed in Section 4, and the paper is concluded with future research directions in Section 5.

Section snippets

Related work

Content-based image retrieval has been extensively investigated by the multimedia research community for more than two decades [[16], [17]]. CBIR systems attempt to retrieve images based on visual content similarity, which requires image representation as an essential ingredient [[18], [19]]. Traditionally, hand-engineered methods, including bag-of-words histograms based on SIFT descriptors [[20], [21]], VLAD [12], GIST [22], and CENTRIST [23], were used to represent images in retrieval

Proposed method

In this section, we present the object-oriented convolutional features (OOCF) approach for fine-grained image search in large-scale datasets. The proposed method consists of a feature selection process that attempts to identify appropriate features for representing objects of a particular class, based on feature attention. The selected features are then globally pooled to index and retrieve images. The method can be effectively applied to any type of image. Details of the feature selection,
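The selection-and-pooling idea above can be sketched as follows. This is a simplified stand-in rather than the paper's implementation: the activation tensor and object mask are synthetic, the selection criterion (mean activation on the object exceeding that on the background) is one plausible reading of "feature attention", and all names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_object_maps(fmaps, mask):
    # fmaps: (C, H, W) activations; mask: (H, W) boolean object region.
    # Keep channels whose mean response on the object exceeds that on the background.
    obj = fmaps[:, mask].mean(axis=1)
    bg = fmaps[:, ~mask].mean(axis=1)
    return np.flatnonzero(obj > bg)

def oocf_descriptor(fmaps, selected):
    # Layer-wise mean activation of the selected feature maps.
    return fmaps[selected].mean(axis=(1, 2))

# Synthetic example: 8 channels of 7x7 maps with a 3x3 object region
H = W = 7
mask = np.zeros((H, W), dtype=bool)
mask[2:5, 2:5] = True
fmaps = rng.random((8, H, W))
fmaps[:4][:, mask] += 2.0   # channels 0-3 fire on the object
fmaps[4:][:, ~mask] += 2.0  # channels 4-7 fire on the background

selected = select_object_maps(fmaps, mask)
desc = oocf_descriptor(fmaps, selected)
print(selected.tolist(), desc.shape)  # [0, 1, 2, 3] (4,)
```

Because only object-responsive channels survive the selection, the pooled descriptor is dominated by the object even though no background removal is performed on the image itself.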

Experiments and results

The aim of this study was to develop a procedure for selecting appropriate features to represent objects of interest in fine-grained image search. We chose vehicle images captured by surveillance cameras to evaluate the proposed scheme. We also experimented with recent hashing methods to determine the appropriateness of the proposed features for transformation into compact binary codes. The results of the various experiments are thoroughly discussed in this section.

Conclusions and future work

In this paper, we presented an efficient method for selecting convolutional feature maps for a particular object category. The selected features focus on the objects of interest in the presence of background, eliminating the need to remove the background prior to feature extraction. We experimented on large surveillance datasets containing vehicle images captured by surveillance cameras, and two other datasets. Analysis of the convolutional feature maps on segmented vehicle images revealed

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 2016R1A2B4011712).


References (58)

  • J. Wang, et al., Learning to hash for indexing big data—a survey, Proc. IEEE (2016)
  • N. Alharthi, et al., Data visualization to explore improving decision-making within hajj services, Sci. Modell. Res. (2017)
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • H. Jegou, et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • M. Douze, A. Ramisa, C. Schmid, Combining attributes and fisher vectors for efficient image retrieval, in: Proceedings...
  • A. Babenko, et al., Neural codes for image retrieval
  • A.S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: An astounding baseline for...
  • A.W. Smeulders, et al., Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • J. Ahmad, et al., Saliency-weighted graphs for efficient visual content description and their applications in real-time image retrieval systems, J. Real-Time Image Process. (2017)
  • J. Ahmad, et al., Multi-scale local structure patterns histogram for describing visual contents in social image retrieval systems, Multimedia Tools Appl. (2016)
  • J. Yang, Y.-G. Jiang, A.G. Hauptmann, C.-W. Ngo, Evaluating bag-of-visual-words representations in scene...
  • T. Li, et al., Contextual bag-of-words for visual categorization, IEEE Trans. Circuits Syst. Video Technol. (2011)
  • A. Oliva, et al., Modeling the shape of the scene: A holistic representation of the spatial envelope, Int. J. Comput. Vis. (2001)
  • J. Wu, et al., CENTRIST: A visual descriptor for scene categorization, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • A. Krizhevsky, et al., ImageNet classification with deep convolutional neural networks
  • A. Babenko, V. Lempitsky, Aggregating local deep features for image retrieval, in: Proceedings of the IEEE...
  • K. Lin, H.-F. Yang, J.-H. Hsiao, C.-S. Chen, Deep learning of binary hash codes for fast image retrieval, in:...
  • L. Liu, C. Shen, A. van den Hengel, The treasure beneath convolutional layers: Cross-convolutional-layer pooling for...
  • Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in:...

Jamil Ahmad received his BCS degree in Computer Science from the University of Peshawar, Pakistan in 2008 with distinction. He received his Master's degree in 2014, specializing in Image Processing, from Islamia College, Peshawar, Pakistan. He is also a regular faculty member in the Department of Computer Science, Islamia College Peshawar. Currently, he is pursuing a Ph.D. degree at Sejong University, Seoul, Korea. His research interests include deep learning, medical image analysis, content-based multimedia retrieval, and computer vision. He has published several journal articles in these areas in reputed journals including Journal of Real-Time Image Processing, Multimedia Tools and Applications, Journal of Visual Communication and Image Representation, PLOS One, Journal of Medical Systems, Computers and Electrical Engineering, SpringerPlus, Journal of Sensors, and KSII Transactions on Internet and Information Systems. He is also an active reviewer for IET Image Processing, Engineering Applications of Artificial Intelligence, KSII Transactions on Internet and Information Systems, Multimedia Tools and Applications, IEEE Transactions on Image Processing, and IEEE Transactions on Cybernetics. He is a student member of the IEEE.

Khan Muhammad (S’16) received the bachelor's degree in computer science from Islamia College Peshawar, Pakistan, in 2014, with a focus on information security. He is currently pursuing the M.S. leading to Ph.D. degree in digital contents at Sejong University, Seoul, South Korea. He has been a Research Associate with the Intelligent Media Laboratory since 2015. He has authored over 24 papers in peer-reviewed international journals and conferences, such as Future Generation Computer Systems, the IEEE ACCESS, the Journal of Medical Systems, Biomedical Signal Processing and Control, Multimedia Tools and Applications, Pervasive and Mobile Computing, SpringerPlus, the KSII Transactions on Internet and Information Systems, the Journal of Korean Institute of Next Generation Computing, the NED University Journal of Research, the Technical Journal, the Sindh University Research Journal, the Middle-East Journal of Scientific Research, MITA 2015, PlatCon 2016, and FIT 2016. His research interests include image and video processing, information security, image and video steganography, video summarization, diagnostic hysteroscopy, wireless capsule endoscopy, computer vision, deep learning, and video surveillance.

Sambit Bakshi is currently with the Centre for Computer Vision and Pattern Recognition of the National Institute of Technology Rourkela, India. He also serves as Assistant Professor in the Department of Computer Science & Engineering of the institute. He earned his Ph.D. degree in Computer Science & Engineering in 2015. He serves as associate editor of International Journal of Biometrics, IEEE Access, and Plos One. He is a technical committee member of the IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence. He received the prestigious Innovative Student Projects Award 2011 from the Indian National Academy of Engineering (INAE) for his master's thesis. He has more than 30 publications in journals, reports, and conferences.

    Sung Wook Baik received the B.S. degree in computer science from Seoul National University, Seoul, Korea, in 1987, the M.S. degree in computer science from Northern Illinois University, Dekalb, in 1992, and the Ph.D. degree in information technology engineering from George Mason University, Fairfax, VA, in 1999. He worked at Datamat Systems Research Inc. as a senior scientist of the Intelligent Systems Group from 1997 to 2002. In 2002, he joined the faculty of the College of Electronics and Information Engineering, Sejong University, Seoul, Korea, where he is currently a Full Professor and Dean of Digital Contents. He is also the head of Intelligent Media Laboratory (IM Lab) at Sejong University. He served as professional reviewer for several well-reputed journals such as IEEE Communication Magazine, Sensors, Information Fusion, Information Sciences, IEEE TIP, MBEC, MTAP, SIVP and JVCI. His research interests include computer vision, multimedia, pattern recognition, machine learning, data mining, virtual reality, and computer games. He is a professional member of the IEEE.
