Object-oriented convolutional features for fine-grained image retrieval in large surveillance datasets

https://doi.org/10.1016/j.future.2017.11.002

Highlights

  • Proposed to represent vehicle images with appropriate convolutional features.

  • Our method reduces number of feature maps without performance degradation.

  • Selected features yield better retrieval performance than the full feature set.

Abstract

Large-scale visual surveillance generates huge volumes of data at a rapid pace, giving rise to massive image repositories. Efficient and reliable access to relevant data in these ever-growing databases is highly challenging due to the complex nature of surveillance objects. Furthermore, the high inter-class visual similarity between vehicles requires the extraction of fine-grained, highly discriminative features. In recent years, features from deep convolutional neural networks (CNNs) have exhibited state-of-the-art performance in image retrieval. However, these features have been used without regard to their sensitivity to objects of a particular class. In this paper, we propose an object-oriented feature selection mechanism for deep convolutional features from a pre-trained CNN. Convolutional feature maps from a deep layer are selected based on an analysis of their responses to surveillance objects. The selected features represent the semantic features of surveillance objects and their parts with minimal influence from the background, effectively eliminating the need for a background removal procedure prior to feature extraction. Layer-wise mean activations of the selected feature maps form the discriminative descriptor for each object. These object-oriented convolutional features (OOCF) are then projected onto a low-dimensional Hamming space using locality-sensitive hashing approaches. The resulting compact binary hash codes allow efficient retrieval within large-scale datasets. Results on five challenging datasets reveal that OOCF achieves better precision and recall than the full feature set for objects with varying backgrounds.
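The hashing stage described in the abstract can be illustrated with one standard locality-sensitive hashing scheme, sign-of-random-projection hashing. The sketch below is a minimal illustration, not the paper's implementation: the descriptor dimensionality, code length, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_hash(descriptors, planes):
    # One bit per hyperplane: the sign of the projection onto its normal.
    return (descriptors @ planes.T > 0).astype(np.uint8)

def hamming(a, b):
    # Number of differing bits between two binary codes.
    return int(np.count_nonzero(a != b))

# 512-D real-valued descriptors (e.g., layer-wise mean activations) -> 64-bit codes
planes = rng.standard_normal((64, 512))
db = rng.standard_normal((1000, 512))
codes = lsh_hash(db, planes)

# A query that is a slightly perturbed copy of database item 42
query = db[42] + 0.05 * rng.standard_normal(512)
qcode = lsh_hash(query[None, :], planes)[0]
dists = np.array([hamming(qcode, c) for c in codes])
print(dists.argmin())
```

Codes produced this way preserve angular similarity in expectation, so near-duplicate descriptors map to nearby binary codes under the Hamming distance, which is what makes retrieval in large repositories efficient.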

Introduction

In recent years, we have seen a tremendous increase in the production and consumption of multimedia data, partly due to the advent of the social web and partly because of progress in surveillance, medical, industrial, mobile, and embedded computing technologies [1]. Consequently, multimedia data, including images and videos, are produced and stored in huge amounts. These multimedia repositories contain a wealth of highly useful information for administrators and decision makers, provided that efficient and reliable access to relevant data is ensured [2]. Content-based image retrieval (CBIR) systems attempt to locate images containing objects similar to those in a query image by analyzing their contents. CBIR has several applications in information retrieval, surveillance, medicine, e-commerce, industry, and the social web. Recently, it has attracted a lot of attention due to the rising interest in making the best use of available multimedia data [3]. The exponential increase in the volume of image data, and the inherent complexity of visual content (projecting the 3D world onto a 2D canvas), has made image retrieval increasingly difficult. The difficulty increases even further in fine-grained image retrieval due to the high degree of inter-class visual similarity [4]. One such problem arises when retrieving images from traffic surveillance datasets, where the main objects of interest are vehicles [[5], [6]]. Vehicles exhibit a high degree of visual similarity even when they belong to different categories.

Visual surveillance has become an undeniable necessity of the day, producing huge amounts of multimedia data that are stored for future analysis [[7], [8]]. Indexing and retrieval of such huge volumes of data require efficient representation methods [[9], [10]]. Though numerous ways exist to represent visual content in large datasets, the complex nature of visual data in surveillance limits the use of traditional image representation schemes. Earlier image retrieval methods used local features like the scale-invariant feature transform (SIFT) [11] and feature aggregation schemes like vectors of locally aggregated descriptors (VLAD) [12] and Fisher vectors (FV) [13]. In recent years, CNN-based features have prevailed as the state of the art for image retrieval and classification. Earlier works by Babenko and Lempitsky [14] and Razavian et al. [15] showed that features from a pre-trained CNN can be used to represent images, yielding state-of-the-art performance on large datasets. However, these approaches directly used activations from various layers without considering the suitability of those features for particular object classes.

In this paper, we investigate the convolutional feature maps of a pre-trained deep CNN to identify a set of optimal features for representing surveillance objects such as vehicles in image retrieval applications. A feature selection procedure is presented for vehicles, allowing us to select appropriate features for fine-grained image search. The main contributions of our work are as follows:

  • a.

    Convolutional activation features have been investigated for vehicles in order to select appropriate features for their effective representation.

  • b.

    An efficient feature selection procedure is presented through which it is shown that the number of feature maps can be considerably reduced without any degradation in performance. The selected features exhibit greater attention to the object of interest than the background.

  • c.

    It has also been shown through experiments that the selected features yield better retrieval performance at higher ranks than the full set of features.

The rest of the paper is organized as follows: Section 2 introduces relevant literature in the field of image retrieval. Section 3 presents the proposed approach. Experimental results are discussed in Section 4, and the paper is concluded with future research directions in Section 5.

Section snippets

Related work

Content-based image retrieval has been extensively investigated by the multimedia research community for more than two decades [[16], [17]]. CBIR systems attempt to retrieve images based on visual content similarity, which requires image representation as an essential ingredient [[18], [19]]. Traditionally, hand-engineered methods, including bag-of-words histograms based on SIFT descriptors [[20], [21]], VLAD [12], GIST [22], and CENTRIST [23], were used to represent images in retrieval

Proposed method

In this section, we present the object-oriented convolutional features (OOCF) approach for fine-grained image search in large-scale datasets. The proposed method consists of a feature selection process that attempts to identify appropriate features for representing objects of a particular class, based on feature attention. The selected features are then globally pooled to index and retrieve images. The method can be effectively applied to any type of image. Details of the feature selection,
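The selection-and-pooling idea above can be sketched as follows. This is a simplified stand-in rather than the paper's implementation: the activation tensor and object mask are synthetic, the selection criterion (mean activation on the object exceeding that on the background) is one plausible reading of "feature attention", and all names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_object_maps(fmaps, mask):
    # fmaps: (C, H, W) activations; mask: (H, W) boolean object region.
    # Keep channels whose mean response on the object exceeds that on the background.
    obj = fmaps[:, mask].mean(axis=1)
    bg = fmaps[:, ~mask].mean(axis=1)
    return np.flatnonzero(obj > bg)

def oocf_descriptor(fmaps, selected):
    # Layer-wise mean activation of the selected feature maps.
    return fmaps[selected].mean(axis=(1, 2))

# Synthetic example: 8 channels of 7x7 maps with a 3x3 object region
H = W = 7
mask = np.zeros((H, W), dtype=bool)
mask[2:5, 2:5] = True
fmaps = rng.random((8, H, W))
fmaps[:4][:, mask] += 2.0   # channels 0-3 fire on the object
fmaps[4:][:, ~mask] += 2.0  # channels 4-7 fire on the background

selected = select_object_maps(fmaps, mask)
desc = oocf_descriptor(fmaps, selected)
print(selected.tolist(), desc.shape)  # [0, 1, 2, 3] (4,)
```

Because only object-responsive channels survive the selection, the pooled descriptor is dominated by the object even though no background removal is performed on the image itself.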

Experiments and results

The aim of this study was to develop a procedure for selecting appropriate features to represent objects of interest in fine-grained image search. We chose vehicle images captured by surveillance cameras to evaluate the proposed scheme. We also experimented with recent hashing methods to determine the appropriateness of the proposed features for transformation into compact binary codes. The results of the various experiments are thoroughly discussed in this section.

Conclusions and future work

In this paper, we presented an efficient method for selecting convolutional feature maps for a particular object category. The selected features focus on the objects of interest in the presence of background, eliminating the need to remove the background prior to feature extraction. We experimented on large surveillance datasets containing vehicle images captured by surveillance cameras, and two other datasets. Analysis of the convolutional feature maps on segmented vehicle images revealed

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 2016R1A2B4011712).


References (58)

  • J. Wang, et al., Learning to hash for indexing big data—a survey, Proc. IEEE (2016)
  • N. Alharthi, et al., Data visualization to explore improving decision-making within hajj services, Sci. Modell. Res. (2017)
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • H. Jegou, et al., Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • M. Douze, A. Ramisa, C. Schmid, Combining attributes and fisher vectors for efficient image retrieval, in: Proceedings...
  • A. Babenko, et al., Neural codes for image retrieval
  • A.S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: An astounding baseline for...
  • A.W. Smeulders, et al., Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • J. Ahmad, et al., Saliency-weighted graphs for efficient visual content description and their applications in real-time image retrieval systems, J. Real-Time Image Process. (2017)
  • J. Ahmad, et al., Multi-scale local structure patterns histogram for describing visual contents in social image retrieval systems, Multimedia Tools Appl. (2016)
  • J. Yang, Y.-G. Jiang, A.G. Hauptmann, C.-W. Ngo, Evaluating bag-of-visual-words representations in scene...
  • T. Li, et al., Contextual bag-of-words for visual categorization, IEEE Trans. Circuits Syst. Video Technol. (2011)
  • A. Oliva, et al., Modeling the shape of the scene: A holistic representation of the spatial envelope, Int. J. Comput. Vis. (2001)
  • J. Wu, et al., CENTRIST: A visual descriptor for scene categorization, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • A. Krizhevsky, et al., ImageNet classification with deep convolutional neural networks
  • A. Babenko, V. Lempitsky, Aggregating local deep features for image retrieval, in: Proceedings of the IEEE...
  • K. Lin, H.-F. Yang, J.-H. Hsiao, C.-S. Chen, Deep learning of binary hash codes for fast image retrieval, in:...
  • L. Liu, C. Shen, A. van den Hengel, The treasure beneath convolutional layers: Cross-convolutional-layer pooling for...
  • Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in:...

Jamil Ahmad received his BCS degree in Computer Science from the University of Peshawar, Pakistan in 2008 with distinction. He received his Master's degree in 2014, specializing in Image Processing, from Islamia College, Peshawar, Pakistan. He is also a regular faculty member in the Department of Computer Science, Islamia College Peshawar. Currently, he is pursuing a Ph.D. degree at Sejong University, Seoul, Korea. His research interests include deep learning, medical image analysis, content-based multimedia retrieval, and computer vision. He has published several journal articles in these areas in reputed journals including Journal of Real-Time Image Processing, Multimedia Tools and Applications, Journal of Visual Communication and Image Representation, PLOS One, Journal of Medical Systems, Computers and Electrical Engineering, SpringerPlus, Journal of Sensors, and KSII Transactions on Internet and Information Systems. He is also an active reviewer for IET Image Processing, Engineering Applications of Artificial Intelligence, KSII Transactions on Internet and Information Systems, Multimedia Tools and Applications, IEEE Transactions on Image Processing, and IEEE Transactions on Cybernetics. He is a student member of the IEEE.

Khan Muhammad (S’16) received the bachelor's degree in computer science from Islamia College Peshawar, Pakistan, in 2014, with a focus on information security. He is currently pursuing the M.S. leading to Ph.D. degree in digital contents at Sejong University, Seoul, South Korea. He has been a Research Associate with the Intelligent Media Laboratory since 2015. He has authored over 24 papers in peer-reviewed international journals and conferences, such as Future Generation Computer Systems, the IEEE ACCESS, the Journal of Medical Systems, Biomedical Signal Processing and Control, Multimedia Tools and Applications, Pervasive and Mobile Computing, SpringerPlus, the KSII Transactions on Internet and Information Systems, the Journal of Korean Institute of Next Generation Computing, the NED University Journal of Research, the Technical Journal, the Sindh University Research Journal, the Middle-East Journal of Scientific Research, MITA 2015, PlatCon 2016, and FIT 2016. His research interests include image and video processing, information security, image and video steganography, video summarization, diagnostic hysteroscopy, wireless capsule endoscopy, computer vision, deep learning, and video surveillance.

Sambit Bakshi is currently with the Centre for Computer Vision and Pattern Recognition of the National Institute of Technology Rourkela, India. He also serves as Assistant Professor in the Department of Computer Science & Engineering of the institute. He earned his Ph.D. degree in Computer Science & Engineering in 2015. He serves as associate editor of International Journal of Biometrics, IEEE Access, and Plos One. He is a technical committee member of the IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence. He received the prestigious Innovative Student Projects Award 2011 from the Indian National Academy of Engineering (INAE) for his master's thesis. He has more than 30 publications in journals, reports, and conferences.

    Sung Wook Baik received the B.S. degree in computer science from Seoul National University, Seoul, Korea, in 1987, the M.S. degree in computer science from Northern Illinois University, Dekalb, in 1992, and the Ph.D. degree in information technology engineering from George Mason University, Fairfax, VA, in 1999. He worked at Datamat Systems Research Inc. as a senior scientist of the Intelligent Systems Group from 1997 to 2002. In 2002, he joined the faculty of the College of Electronics and Information Engineering, Sejong University, Seoul, Korea, where he is currently a Full Professor and Dean of Digital Contents. He is also the head of Intelligent Media Laboratory (IM Lab) at Sejong University. He served as professional reviewer for several well-reputed journals such as IEEE Communication Magazine, Sensors, Information Fusion, Information Sciences, IEEE TIP, MBEC, MTAP, SIVP and JVCI. His research interests include computer vision, multimedia, pattern recognition, machine learning, data mining, virtual reality, and computer games. He is a professional member of the IEEE.
