A data-driven study of image feature extraction and fusion
Introduction
Extracting useful features from a scene is an essential subroutine in many multimedia data analysis tasks such as classification and retrieval. Remarkable progress has been made in multimedia computing, computer vision and signal processing in recent decades. Despite this progress, it is still notably difficult for computers to accurately recognize an object or analyze the semantics of a scene. For example, suppose that we want to recognize a piece of white paper in an image. A naive feature we might use is “a white two-dimensional rectangle”. However, such a feature will fail in most cases for the following reasons:
- 1. The paper may be folded.
- 2. The viewing angle of the piece of paper may not be perpendicular, and hence the paper does not appear rectangular.
- 3. Environmental factors such as occlusion and lighting can cause changes in its shape and color.
The above challenges are all related to feature invariance issues. A second challenge is called feature aliasing or feature selectivity: how well a feature can differentiate one object from the others. For example, the feature “white two-dimensional rectangle” can be used to describe many other objects: a piece of white cloth, a white table, and a white wall, among others. The goal of feature extraction is to find features that are both invariant and selective.
Traditional feature extraction approaches each focus on specific information in the image. For example, the color-texture codebook (CT) focuses on the statistics of colors and textures in small regions of an image. SIFT focuses on local invariant shapes. Recently, neuro-based approaches such as HMAX and convolutional networks (ConvNet) have been proposed to model features according to how the human visual system extracts features. HMAX [45] builds computing models based on the pioneering neuroscience work of Hubel [22]. Hubel’s work indicates that visual information is transmitted from the primary visual cortex (V1) through extrastriate visual areas (V2 and V4) to the inferotemporal cortex (IT). The IT, in turn, is a major source of input to the prefrontal cortex (PFC), which is involved in linking perception to memory and action [33]. The pathway from the V1 to the IT (called the visual frontend) consists of a number of simple (lower) and complex (higher) layers. The lower layers extract simple features that are invariant to scale, position and orientation at the pixel level. Higher layers combine simple features to recognize more complex features at the object-part level. Pattern recognition at the lower layers is unsupervised, whereas recognition at the higher layers involves supervised learning. This neuroscience-motivated model appears to enjoy at least two advantages: (1) it balances feature selectivity (at lower layers) and invariance (at higher layers) and (2) it models edges of an object and then combines edges to recognize parts of an object, placing these features in a hierarchical context. Similar to HMAX, ConvNet is also a neuro-based approach. It differs from HMAX primarily in that it iterates more over the data to learn a model with a deep architecture [39]. This allows for the capture of both the structure and detail of an object.
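To make the simple/complex-layer alternation concrete, the sketch below implements one S1-like simple layer (convolution with a small bank of oriented, Gabor-like filters) followed by one C1-like complex layer (local max pooling for position invariance). This is a minimal illustration of the HMAX-style idea, not the implementation evaluated in this paper; the filter parameters and pooling size are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import convolve, maximum_filter

def gabor_bank(size=7, orientations=(0, 45, 90, 135)):
    """Small bank of oriented Gabor-like filters, one per orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    filters = []
    for theta in np.deg2rad(orientations):
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        g = np.exp(-(xr**2 + 0.25 * yr**2) / (2 * 2.0**2)) \
            * np.cos(2 * np.pi * xr / 4.0)
        filters.append(g - g.mean())  # zero-mean so flat regions respond weakly
    return filters

def s1_layer(image, filters):
    """Simple layer: template matching via convolution with oriented filters."""
    return np.stack([convolve(image, f, mode="constant") for f in filters])

def c1_layer(responses, pool=8):
    """Complex layer: local max pooling, yielding tolerance to small shifts."""
    pooled = maximum_filter(responses, size=(1, pool, pool))
    return pooled[:, ::pool, ::pool]  # subsample the pooled map

image = np.random.rand(64, 64)            # stand-in for a grayscale image
c1 = c1_layer(s1_layer(image, gabor_bank()))
print(c1.shape)                           # (4, 8, 8): 4 orientations, pooled grid
```

Stacking further pairs of such layers on top of the C1 output is what gives models like HMAX and ConvNet their hierarchy: higher layers match templates over pooled lower-layer responses rather than raw pixels.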
Herein, we perform comparative evaluation that demonstrates that different feature extraction algorithms have their own set of advantages and excel in different image categories. We provide key observations about why certain algorithms perform better with different image categories. Based on these observations, we establish feature extraction principles and identify several pitfalls for researchers and practitioners to avoid:
- 1. When training data are insufficient, no scheme performs well. However, because simple algorithms such as CT and SIFT do not require much data to learn model parameters, they may be a better choice when training data are scarce.
- 2. Increases in the amount of training data correlate with a jump in the accuracy of complex models, such as HMAX and ConvNet.
- 3. When training data are abundant, all four algorithms, simple or complex, converge to the same level of accuracy.
The major contributions of this paper are summarized as follows:
- 1. Through our comparative analysis, we identify pitfalls of past studies: either they did not use enough training data, or their testbed composition already favored a particular feature extraction algorithm.
- 2. Through our large-scale comparative study, we demonstrate the benefit of training on large datasets, which can make both simple and complex algorithms converge to the same level of accuracy.
- 3. We devise a fusion algorithm based on clues learned from each algorithm’s confusion matrix. Our algorithm harvests synergies among these four algorithms and further improves class-prediction accuracy.
- 4. We establish a large testbed for the research community, namely an annotated dataset of six million PicasaWeb images, which will be released publicly with this paper.1
The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 briefly introduces the four feature extraction algorithms evaluated in this paper. Section 4 details an algorithm that fuses multiple feature extraction methods that we demonstrate can perform better than any individual feature extraction scheme alone. Section 5 explains the setup of our experiments and presents their results. Finally, we offer concluding remarks in Section 6.
Related work
The multimedia community has been striving to bridge the semantic gap [20], [46], [62] between low-level features and high-level semantics for decades. (Comprehensive surveys are given in [5], [20].) With high-quality image features, applications can significantly improve a user’s experience [9], [11], [30]. One key problem is how to extract powerful features. Numerous feature extraction algorithms have been developed for image annotation [46], as well as machine learning algorithms [14], …
Feature extraction algorithms
In this section, we present four representative algorithms for image feature extraction: color-texture codebook (CT), SIFT codebook, HMAX, and convolutional networks (ConvNet). Fig. 1 depicts a framework that consists of these four algorithms. The input to the framework is a set of images. After extracting descriptors such as color-texture histograms, SIFT descriptors, HMAX edges, and ConvNet’s encoding results, the framework conducts an unsupervised learning stage to learn codebooks or patch …
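As a concrete illustration of the unsupervised codebook stage shared by CT and SIFT, the sketch below clusters local descriptors with k-means and then represents each image as a normalized histogram of codeword occurrences (a bag-of-words feature). The random stand-in descriptors, codebook size, and use of scikit-learn are assumptions for illustration, not this paper’s exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# One (n_i x d) array of local descriptors per training image
# (e.g., SIFT or color-texture statistics); random stand-ins here.
rng = np.random.default_rng(0)
descriptors = [rng.normal(size=(200, 128)) for _ in range(10)]

# Unsupervised stage: cluster all local descriptors into a k-word codebook.
k = 64
codebook = KMeans(n_clusters=k, n_init=5, random_state=0)
codebook.fit(np.vstack(descriptors))

def bag_of_words(desc, codebook, k):
    """Quantize an image's descriptors against the codebook and
    return a normalized codeword histogram (the image feature)."""
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

features = np.stack([bag_of_words(d, codebook, k) for d in descriptors])
print(features.shape)  # (10, 64): one k-dimensional histogram per image
```

The resulting fixed-length histograms can then be fed to any standard supervised classifier, which is what makes the codebook representation convenient regardless of how many local descriptors each image produces.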
Fusion
The four feature-extraction algorithms produce features with different advantages and drawbacks. In this section, we outline a fusion algorithm that harvests synergies among these four algorithms to further improve class-prediction accuracy. This fusion algorithm not only takes advantage of multiple image features but also avoids the curse of dimensionality. More precisely, with each feature set, a classifier can be constructed to perform class prediction. By constructing a confusion matrix …
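As one hypothetical instantiation of confusion-matrix-based fusion, the sketch below weights each base classifier’s class-probability vector by a per-class reliability score (here, precision estimated from that classifier’s validation confusion matrix) and sums the weighted votes. The weighting scheme and toy numbers are assumptions for illustration; the exact fusion rule of this paper is the subject of Section 4.

```python
import numpy as np

def class_reliability(confusion):
    """Per-class precision from a validation confusion matrix
    (rows = true class, columns = predicted class)."""
    col_sums = confusion.sum(axis=0)
    return np.divide(np.diag(confusion), col_sums,
                     out=np.zeros_like(col_sums, dtype=float),
                     where=col_sums > 0)

def fuse(prob_list, confusions):
    """Weight each classifier's class probabilities by how trustworthy
    it is on each class, sum the votes, and predict the best class."""
    fused = np.zeros_like(prob_list[0])
    for probs, conf in zip(prob_list, confusions):
        fused += probs * class_reliability(conf)
    return fused.argmax(axis=-1)

# Toy example: two classifiers, three classes, one test image.
c1 = np.array([[0.6, 0.3, 0.1]])   # classifier 1's class probabilities
c2 = np.array([[0.2, 0.7, 0.1]])
conf1 = np.array([[8, 1, 1], [2, 7, 1], [1, 1, 8]])  # validation confusions
conf2 = np.array([[5, 4, 1], [1, 9, 0], [0, 2, 8]])
print(fuse([c1, c2], [conf1, conf2]))  # fused class prediction: [1]
```

Because each classifier contributes only a per-class scalar weight times its own prediction, the fused feature space never grows beyond the number of classes, which is one way such a scheme sidesteps the curse of dimensionality mentioned above.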
Experiments
Our experiments were designed to address the following questions:
- How do feature extraction algorithms compare with one another?
- Given an image category, which feature extraction algorithm performs the best, and why?
- What is the effect of the size of the training dataset?
- How does fusion perform compared with individual classifiers, and how do the results of fusion change with the number of training instances?
To answer the above questions, we conducted experiments on two datasets: an ImageNet …
Conclusion
In this paper, we investigated four representative feature extraction algorithms: color-texture codebook (CT), SIFT codebook, HMAX, and convolutional networks (ConvNet). Comprehensive experiments were conducted that revealed differences between these algorithms. We discussed our results from two different views. The first view is the image-category view. We provided an extensive analysis of different categories, and found that different algorithms each have their own advantages that can give …
Acknowledgement
This work is supported by the National Natural Science Foundation of China (Nos. 61370022, 61003097, 60933013, and 61210008), International Science and Technology Cooperation Program of China (No. 2013DFG12870), and the National Program on Key Basic Research Project (No. 2011CB302206).
References (63)
- et al., Strengthening learning algorithms by feature discovery, Inf. Sci. (2012)
- et al., Color based object recognition, Pattern Recognit. (1999)
- Computing a shape’s moments from its boundary, Pattern Recognit. (1991)
- et al., Coarse blobs or fine edges? Evidence that information diagnosticity changes the perception of complex visual stimuli, Cognit. Psychol. (1997)
- et al., Exploiting pairwise recommendation and clustering strategies for image re-ranking, Inf. Sci. (2012)
- et al., A comparison of texture feature extraction using adaptive Gabor filtering, pyramidal and tree structured wavelet transforms, Pattern Recognit. (1996)
- et al., Are cortical models really bound by the binding problem?, Neuron (1999)
- et al., Learning fuzzy classification rules from labeled data, Inf. Sci. (2003)
- et al., A new method of feature fusion and its application in image recognition, Pattern Recognit. (2005)
- A framework for multi-source data fusion, Inf. Sci. (2004)
- Feature fusion: parallel strategy vs. serial strategy, Pattern Recognit.
- Graph-based semi-supervised learning with multiple labels, J. Visual Commun. Image Represent.
- Fusion of supervised and unsupervised learning for improved classification of hyperspectral images, Inf. Sci.
- Learning Deep Architectures for AI
- Learning mid-level features for recognition
- Foundations of Large-Scale Multimedia Information Management and Retrieval: Mathematics of Perception
- Toward perception-based image retrieval
- PSVM: parallelizing support vector machines on distributed computers, Adv. Neural Inf. Process. Syst.
- Texture analysis and classification with tree-structured wavelet transform, Image Process.
- Hierarchical visual event pattern mining and its applications, Data Min. Knowl. Discovery
- A sequential Monte Carlo approach to anomaly detection in tracking visual events
- A matrix-based approach to unsupervised human action categorization, IEEE Trans. Multimedia
- ImageNet: a large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Features for image retrieval: an experimental comparison, Inf. Retrieval
- Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern.
- Responses of primate visual cortical V4 neurons to simultaneously presented stimuli, J. Neurophysiol.
- Color invariance, IEEE Trans. Pattern Anal. Mach. Intell.
- Robust histogram construction from color invariants for object recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Image information retrieval: an overview of current research, Inf. Sci.
- A fast learning algorithm for deep belief nets, Neural Comput.