
Neurocomputing

Volume 77, Issue 1, 1 February 2012, Pages 281-288

Letters
A unified supervised codebook learning framework for classification

https://doi.org/10.1016/j.neucom.2011.09.010

Abstract

In this paper, we investigate a discriminative visual dictionary learning method for boosting classification performance. Popular visual dictionary learning algorithms, tied to the K-means clustering philosophy, cannot guarantee that the normalized visual word frequency vectors of samples from distinct classes, or with large label distances, are well separated. The rationale of this work is to harness sample label information to learn the visual dictionary in a supervised manner. This target is formulated as an objective function in which each sample element, e.g., a SIFT descriptor, is expected to be close to its assigned visual word, while at the same time the normalized aggregative visual word frequency vectors are expected to possess the property that kindred samples lie close to each other and inhomogeneous samples lie far apart. By relaxing the hard binary constraints to soft nonnegative ones, a multiplicative nonnegative update procedure is proposed to optimize the objective function, along with a theoretical convergence proof. Extensive experiments on classification tasks (natural scene and sports event classification) demonstrate the superiority of the proposed framework over conventional clustering-based visual dictionary learning.

Introduction

Bag-of-words (BoW) is a state-of-the-art approach for modeling the global statistics of local visual features for image [3] or video [13] representation. Typically, a set of visual words, the so-called visual dictionary, is learned from the training samples; each image or video is then expressed as a bag of these words, and finally the normalized occurrence frequencies of the visual words are used for data representation. Conventionally, these visual words are learned by unsupervised clustering approaches, e.g., K-means [5], driven by the philosophy that the visual words should be the centers of data clusters. The resulting histogram representations are extensively used in visual classification tasks [3], [4], [15], [16], [17] as well as visual regression tasks [19].
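The conventional pipeline above can be sketched in a few lines: quantize each local descriptor to its nearest visual word and accumulate a normalized frequency histogram. The function name and the toy data are illustrative, not from the paper.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors (n x D) against a codebook (K x D) and
    return the L1-normalized visual word frequency vector (length K)."""
    # squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assignments = d2.argmin(axis=1)  # hard assignment to the nearest word
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()         # normalized occurrence frequencies

# toy example: 2-D "descriptors" drawn around two of the three words
rng = np.random.default_rng(0)
desc = np.vstack([rng.normal(c, 0.1, (10, 2)) for c in ([0, 0], [5, 5])])
codebook = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
h = bow_histogram(desc, codebook)
```

In practice the codebook would come from K-means over all training descriptors; here it is fixed by hand to keep the sketch self-contained.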

However, when label information of the training samples is given, the above philosophy becomes less appealing because it cannot guarantee discriminating power for the normalized visual word frequency vectors. This leads to a non-optimal codebook, since for classification the histogram representation should preserve the property that samples from different classes lie far apart in the feature space while samples from the same class lie close to each other. A natural problem to study is how to harness the label information to boost the visual classification capability of the final normalized aggregative visual word frequency representation. Fig. 1 shows a toy problem in which a supervised visual dictionary improves the separability of samples from different classes compared with a K-means based visual dictionary.

Several approaches have been proposed to fully or partially address the above problem. Winn et al. [17] propose a method based on pair-wise word merging to reduce a large vocabulary structure. Moosmann et al. [12] propose to use random forests to construct the visual word vocabulary in a supervised manner. Ning et al. [14] develop a supervised visual dictionary learning method, where a postprocessing step improves the discriminative power of the visual dictionary derived from the K-means approach. These methods typically separate the process of visual codebook generation from the process of classifier training. However, the capability of such postprocessing is limited because some information useful for classification may already have been lost in computing the visual word frequencies and is not restorable. Yang et al. [20] propose a discriminative visual codebook generation method based on a two-phase training procedure. By introducing a set of additional representations, i.e., visual bits, as well as a classifier for each category, their learning algorithm improves discriminative performance by iterating the optimization over both aspects. More recently, Lazebnik and Raginsky [7] present a method for supervised learning of quantizer codebooks based on information loss minimization. They develop an alternating minimization procedure that simultaneously quantizes the continuous input feature vectors and approximates the quantizer index of the training samples according to the posterior class label distributions. The limitation of the above methods is that they deal only with the classification problem and cannot be extended to regression problems based on histogram representations.

To alleviate this problem, in this paper we present a more general discriminative visual dictionary (DVD) learning framework that strengthens the visual dictionary by harnessing the label information of the training samples and can be applied to classification problems. A novel objective function is proposed by unifying two targets: each sample element, e.g., a SIFT descriptor, is expected to be close to its assigned visual word, and the final normalized visual word frequency vectors are expected to possess the property that kindred samples lie close to each other while inhomogeneous samples lie far apart. By relaxing the hard binary constraints to soft nonnegative ones, the proposed optimization problem can be effectively solved by nonnegative multiplicative update rules with theoretically provable convergence. Finally, the normalized visual word frequency vector for a new sample is derived with a kernel regression approach. Extensive experiments on natural scene and sports event classification demonstrate the encouraging improvements in visual classification performance gained from the proposed framework.
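The out-of-sample step mentioned above, deriving the soft code of an unseen descriptor by kernel regression, could be instantiated as a Nadaraya-Watson estimator over the training descriptors and their learned assignments. The exact kernel form used in the paper is not given in this excerpt; the Gaussian kernel, bandwidth, and function name below are assumptions for illustration.

```python
import numpy as np

def kernel_regression_code(x_new, X_train, V_train, sigma=1.0):
    """Nadaraya-Watson estimate of the soft assignment vector for an
    unseen descriptor x_new, given training descriptors X_train (N x D)
    and their learned soft assignments V_train (N x K).
    NOTE: one plausible instantiation (Gaussian kernel), not the
    paper's exact formula."""
    d2 = ((X_train - x_new) ** 2).sum(axis=1)   # squared distances
    w = np.exp(-d2 / (2.0 * sigma ** 2))        # kernel weights
    w /= w.sum()                                # normalize to sum to 1
    return w @ V_train                          # weighted average of codes

# toy usage: a query at the first training point inherits its code
v = kernel_regression_code(np.array([0.0, 0.0]),
                           np.array([[0.0, 0.0], [10.0, 10.0]]),
                           np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Because the weights are convex, the estimated code stays nonnegative and sums to one whenever the training codes do, which keeps the subsequent histogram aggregation well defined.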

This paper is organized as follows. Sections 2 and 3 give a detailed description of our unified discriminative visual dictionary learning framework. Experimental results on the classification tasks are presented in Section 4, and Section 5 concludes the paper.


Problem formulation

Before formally introducing the mathematical formulation of the unified discriminative visual dictionary learning framework, we define the terminology used in the sequel. Let X = [x_1, …, x_N] ∈ R^{D×N} denote the extracted sample elements, e.g., SIFT descriptors [10], from all training samples. Each sample element x_i is thus represented as a D-dimensional vector, and N is the total number of extracted sample elements. The set of labeled training samples (e.g., images and videos) is denoted as S = {s_1, …, s_{N_s}}, where N_s is the number of training samples.
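Only the two design goals of the objective are stated in this excerpt (quantization fidelity plus histogram separability). A plausible form combining them, with the trade-off weight λ, pairwise weights W_pq, and the L1 histogram normalization all being assumptions rather than the paper's exact formulation, is:

```latex
\min_{U \ge 0,\; V \ge 0} \;
\|X - UV\|_F^2
\;+\; \lambda \sum_{p,q} W_{pq} \, \|h_p - h_q\|^2,
\qquad
h_p = \frac{\sum_{i \in s_p} v_i}{\bigl\|\sum_{i \in s_p} v_i\bigr\|_1},
```

where U ∈ R^{D×K} stacks the K visual words as columns, V = [v_1, …, v_N] holds the relaxed (soft, nonnegative) assignment vectors, h_p is the normalized visual word frequency vector of sample s_p, and W_pq would be chosen positive for same-class pairs and negative or zero for different-class pairs so that kindred samples are pulled together and inhomogeneous samples pushed apart.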

Convergent iterative procedure

Most iterative procedures for solving high-order optimization problems transform the original intractable problem into a set of tractable sub-problems and finally converge to a local optimum. Our proposed iterative procedure follows this philosophy and optimizes U and V alternately.
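The alternating scheme is in the spirit of the multiplicative nonnegative updates of Lee and Seung [ref. 9 in the list below]. The sketch here implements only the plain factorization term X ≈ UV; the paper's actual rules additionally carry the label-separation term, which is omitted, and the function name is illustrative.

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-9, seed=0):
    """Plain Lee-Seung multiplicative updates for X ~ U V with
    U >= 0 (D x K dictionary) and V >= 0 (K x N soft assignments).
    The paper's full rules also include a label-separation term,
    which is omitted in this sketch."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    U = rng.random((D, K)) + eps
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # each update multiplies by a nonnegative ratio, so U, V stay >= 0
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # fix U, update assignments
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # fix V, update dictionary
    return U, V

# toy usage on a random nonnegative data matrix
X = np.random.default_rng(1).random((8, 30))
U, V = nmf_multiplicative(X, K=4)
```

Each multiplicative step is equivalent to a gradient step with an adaptive positive step size, which is what makes the monotone-decrease (convergence) argument of the paper's Section 3 possible without any projection back onto the nonnegative orthant.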

Experiments on visual classification

In this section, we systematically evaluate the effectiveness of the unified discriminative visual dictionary (DVD) learning framework for visual classification by comparing it with the classical K-means based counterpart. Algorithmic convergence, quantitative classification accuracy, and parameter sensitivity are extensively studied on two visual classification tasks: natural scene classification and sports event classification.

Conclusions

Going beyond K-means-style clustering approaches to visual dictionary learning, we proposed a supervised way to harness sample label information for boosting the discriminative power of the derived normalized aggregative visual word frequency vectors. This goal was embodied as a specific nonnegative matrix factorization problem with an extra constraint targeting sample separability, and a multiplicative nonnegative update rule was presented for iterative optimization with theoretically provable convergence.

Acknowledgment

This work was partially supported by the National Natural Science Foundation of China (Nos. 60803072, 61100142, 61033013) and the Beijing Jiaotong University Science Foundation (Nos. 2011JBM219, 2011JBM218).


References (20)

  • D. Meyer et al., The support vector machine under test, Neurocomputing (2003)
  • C. Chang, C. Lin, LIBSVM: a library for support vector machines, 〈http://www.csie.ntu.edu.tw/∼cjlin/libsvm〉, ...
  • T. Cover et al., Nearest neighbor pattern classification, IEEE Trans. Inf. Theory (1967)
  • G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop..., 2004
  • L. Fei-Fei et al., A Bayesian hierarchical model for learning natural scene categories
  • J. Hartigan et al., A K-means clustering algorithm, Appl. Stat. (1979)
  • H. Kuhn et al., Nonlinear programming
  • S. Lazebnik et al., Supervised learning of quantizer codebooks by information loss minimization, Pattern Anal. Mach. Intell. (2009)
  • D. Lee et al., Learning the parts of objects by nonnegative matrix factorization, Nature (1999)
  • L. Li et al., What, where and who? Classifying events by scene and object recognition


Congyan Lang is currently an Associate Professor in the School of Computer and Information Technology, Beijing Jiaotong University, Beijing. She received her Ph.D. from Beijing Jiaotong University in 2006. Her research interests include multimedia information retrieval and analysis, machine learning, and computer vision.

Songhe Feng is currently an Assistant Professor in the School of Computer and Information Technology, Beijing Jiaotong University, Beijing. He received his Ph.D. from Beijing Jiaotong University in 2009. His current research interests include image annotation and retrieval.

Bing Cheng received the B.E. degree from the Department of Electronics Engineering and Information Science, University of Science and Technology of China (USTC), China, in 2007. He is currently pursuing the Ph.D. degree at the National University of Singapore (NUS). Since Fall 2008, he has been with the Department of Electrical and Computer Engineering, NUS. He is currently working with Prof. S. Yan on his Ph.D. degree. His research interests include image processing, computer vision, and machine learning.

Bingbing Ni is currently a Research Fellow in Advanced Digital Sciences Center, Singapore. He received his B.Eng. degree in Electrical Engineering from Shanghai Jiao Tong University (SJTU), China, in 2005 and obtained his Ph.D. from National University of Singapore (NUS), Singapore, in 2011. His research interests are in the areas of computer vision and machine learning.

Shuicheng Yan is currently an Assistant Professor in the Department of Electrical and Computer Engineering at National University of Singapore, and the founding lead of the Learning and Vision Research Group (http://www.lv-nus.org). Dr. Yan's research areas include computer vision, multimedia and machine learning, and he has authored or co-authored over 200 technical papers over a wide range of research topics. He is an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology, and has been serving as the guest editor of the special issues for TMM and CVIU. He received the Best Paper Awards from ACM MM'10, ICME'10 and ICIMCS'09, the winner prize of the classification task in PASCAL VOC'10, the honorable mention prize of the detection task in PASCAL VOC'10, 2010 TCSVT Best Associate Editor (BAE) Award, and the co-author of the best student paper awards of PREMIA'09 and PREMIA'11.
