Decomposition-based tensor learning regression for improved classification of multimedia

https://doi.org/10.1016/j.jvcir.2016.10.006

Highlights

  • We introduce logistic Tucker regression based on tensor Tucker decomposition.

  • We reveal the effect of the core tensor dimension for multimedia classification.

  • Our method can effectively utilize the space–time information of multimedia data.

Abstract

Existing vector-based multimedia classification often loses space-time information and requires the generation of high-dimensional vectors. To address this problem, we propose a novel tensor-based logistic regression algorithm that uses Tucker decomposition for multimedia classification. To strengthen the classification process, the Frobenius norm (F-norm) is used as the regularization term. A logistic Tucker regression model is established to extract the principal components of the tensors effectively, and hence to reduce the dimension of the inputs and improve the efficiency of multimedia classification. To evaluate the proposed algorithm, we carried out extensive experiments on several datasets, including two second-order grayscale image datasets and one third-order video sequence dataset. The results indicate that our proposed algorithm outperforms the existing state-of-the-art methods in relevant areas.

Introduction

As a higher-order extension of vectors, the tensor is considered a natural representation and description of multimedia data [1]. With the rapid development of digital devices, many tensor representations of multimedia have emerged over the past decade. For example, grayscale images are usually expressed as second-order tensor datasets [2], [3], and color images shared over the network can be described by third-order tensor datasets. Popular image features, such as the histogram of oriented gradients (HOG) [4] and log-Gabor [5], are third-order tensor features. Grayscale and color video sequences can likewise be viewed as third-order and fourth-order tensor datasets, respectively [6], [7], [8].

Many vector-based multimedia analysis and classification methods have been reported in the literature. Representative techniques include K-nearest neighbor (KNN) [9], [10], support vector regression (SVR) [11], [12], and logistic regression (LR) [13]. Because they are simple and effective, these vector-based algorithms are widely used for multimedia analysis problems [14], [15]. Among them, LR stands out for two reasons: (1) compared with support vector regression and ridge regression, LR produces an appropriate estimate of the class distribution and offers great advantages in model training time, and (2) LR is much easier to extend into a multi-class classifier. LR is therefore widely applied in multimedia applications. One representative example is the modified LR combined with a Convolutional Neural Network (CNN) [16] for discriminative feature discovery in video quality assessment [17]. These vector-based methods, however, need to arrange the tensor data into long vectors, which causes several problems, including: (1) Vectorization of tensor data often produces a high-dimensional vector, and therefore leads to the “curse of dimensionality” problem. (2) Vectorization may destroy the spatial structure among the image pixels and cause the loss of time-related information in video sequences. As a result, the existing vector-based classification methods share an essential weakness: they fail to take full advantage of the spatial structure and temporal relevance of multimedia data, which reduces classification accuracy.
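The two vectorization problems can be made concrete with a small NumPy sketch. The clip size (20 frames of 32 x 32 pixels) and the helper `flat_index` are hypothetical choices for illustration only; they do not come from the paper's datasets.

```python
import numpy as np

# Hypothetical sizes: a grayscale clip of 20 frames, each 32 x 32 pixels,
# stored as a third-order tensor (height x width x frames).
height, width, frames = 32, 32, 20
video = np.arange(height * width * frames, dtype=np.float64)
video = video.reshape(height, width, frames)

# A vector-based classifier first flattens the tensor into one long vector.
vector = video.reshape(-1)

# A linear model on that vector needs one weight per entry.
vector_weights = vector.size


def flat_index(i, j, t):
    """Position of entry (i, j, t) after row-major (C-order) flattening."""
    return (i * width + j) * frames + t


# Entries that are neighbours in the tensor can end up far apart in the vector.
gap_vertical = flat_index(1, 0, 0) - flat_index(0, 0, 0)  # adjacent rows
gap_temporal = flat_index(0, 0, 1) - flat_index(0, 0, 0)  # adjacent frames

print(vector_weights)  # 20480
print(gap_vertical)    # 640
print(gap_temporal)    # 1
```

Under these assumed sizes, a single short clip already requires 20480 weights, and two vertically adjacent pixels are separated by 640 positions in the vector, so the classifier has no direct way to see that they are neighbours.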

A number of attempts have been made to overcome this weakness with tensor-based approaches. Vasilescu and Terzopoulos [18] considered a range of factors for facial image classification, including different people, perspectives, expressions, lighting conditions and croppings. They constructed a fifth-order tensor in which each factor of the face image is assigned to one order, and then applied a tensor singular value decomposition algorithm to this fifth-order tensor to obtain the face tensor subspace. Compared with traditional vector singular value decomposition, the reported algorithm achieved a significant improvement by considering the mutual relations among the different factors of each face image. Many vector space learning methods have since been extended to the tensor space, such as tensor principal component analysis (Tensor PCA) [2], tensor linear discriminant analysis (Tensor LDA) [19], and tensor locality preserving projections (Tensor LPP) [20]. While these works have significantly advanced multimedia classification, they share a common principle: learning is carried out inside the tensor space, while classification is conducted with vector-based classifiers. Tensor-based space learning methods therefore cannot classify the tensor data directly, and hence fail to take full advantage of the spatial structure and related information embedded in the input multimedia data.

To classify tensor data directly, Ma [21] proposed a semi-supervised second-order tensor image classification model, in which second-order grayscale images are used directly to learn two groups of vector parameters for classification. In principle, the second-order tensor is the natural expression of a grayscale image, and this representation can effectively use the spatial information and avoid generating high-dimensional vectors. However, the method does not handle higher-order tensor data, such as third-order video data or third-order HOG image features. Tao [22] proposed a supervised tensor learning (STL) framework, in which higher-order tensor data is classified directly. The STL framework decomposes the higher-order tensor according to a rank-one tensor. Compared with vector learning methods, the STL framework uses the spatial information along each order of the input tensor. The problem, however, is that decomposing a higher-order tensor according to a rank-one tensor can lose potentially discriminative information.
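The rank-one restriction can be illustrated with a short NumPy sketch. It is not the training procedure of [22]; it only shows, for a hypothetical 4 x 5 x 6 input and hypothetical weight vectors `w1`, `w2`, `w3`, how a rank-one weight tensor scores an input and how few free parameters it has compared with an unconstrained weight tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical third-order input and one weight vector per order.
X = rng.standard_normal((4, 5, 6))
w1 = rng.standard_normal(4)
w2 = rng.standard_normal(5)
w3 = rng.standard_normal(6)

# Rank-one weight tensor: the outer product of the three vectors.
W = np.einsum("i,j,k->ijk", w1, w2, w3)

# The score <X, W> can be computed from the full weight tensor ...
score_full = np.sum(X * W)

# ... or by contracting X with each weight vector along its own order.
score_modes = np.einsum("ijk,i,j,k->", X, w1, w2, w3)

# Parameter counts: one vector per order versus one weight per entry.
rank_one_params = w1.size + w2.size + w3.size
full_params = W.size

print(np.isclose(score_full, score_modes))  # True
print(rank_one_params, full_params)         # 15 120
```

The rank-one form keeps the information along each order separate and is very compact, but 15 parameters can only represent a small subset of the 120-dimensional weight space, which is one way to read the loss of discriminative information noted above.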

Because tensor decomposition can effectively extract the principal components of tensor data, many tensor decomposition algorithms have been studied and reported, of which the most popular and representative are CP (CANDECOMP/PARAFAC) decomposition [23] and Tucker decomposition [24]. Fig. 1 illustrates the CP decomposition and Tucker decomposition of a second-order tensor. As shown, CP decomposition expresses a tensor as the sum of R rank-1 tensors. It obtains an effective expression of the original tensor by pre-estimating the number of rank-1 tensors. In [25], the group-sparsity norm [26] is used to optimize the number of rank-1 tensors after CP decomposition. Tucker decomposition expresses a tensor as a core tensor multiplied by a factor matrix along each order. Dimensionality reduction can be achieved by adjusting the dimensions of the core tensor, and the factor matrix at each order can be selected to extract the principal components. Tucker decomposition is therefore also regarded as a higher-order principal component analysis. Compared with traditional matrix decomposition, Tucker decomposition not only takes full advantage of the structural information of the tensor data, but also extracts its principal components.
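A Tucker decomposition can be sketched in a few lines of NumPy. The version below uses the truncated higher-order SVD (HOSVD), which is one common way to compute a Tucker factorization and is an assumption here, not necessarily the procedure used in the paper. The helper names (`unfold`, `mode_product`, `hosvd`, `reconstruct`) and the tensor sizes are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """n-mode product: multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def hosvd(tensor, core_shape):
    """Truncated HOSVD: returns a core tensor and one factor matrix per order."""
    factors = []
    for mode, rank in enumerate(core_shape):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])  # leading left singular vectors
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto each subspace
    return core, factors


def reconstruct(core, factors):
    """Core tensor multiplied by the factor matrix along each order."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


# Build a hypothetical 8 x 9 x 10 tensor with multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
true_core = rng.standard_normal((2, 3, 2))
true_factors = [rng.standard_normal((n, r)) for n, r in zip((8, 9, 10), (2, 3, 2))]
X = reconstruct(true_core, true_factors)

core, factors = hosvd(X, (2, 3, 2))
X_hat = reconstruct(core, factors)

tucker_params = core.size + sum(u.size for u in factors)
print(np.allclose(X, X_hat))  # True
print(tucker_params, X.size)  # 75 720
```

In this constructed example the 2 x 3 x 2 core and three factor matrices (75 numbers) reproduce all 720 entries, which shows how the core dimensions control the amount of dimensionality reduction. For real data the reconstruction is only approximate, and the core shape becomes a tuning choice.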

From the above analysis, we find that existing tensor learning algorithms fail to avoid the loss of spatial information and relevance in tensor data, and that the non-uniqueness of the tensor rank in CP decomposition reduces classification accuracy. To overcome these weaknesses, we introduce a new formulation of logistic regression and establish its objective function via Tucker decomposition. Extensive experiments show that the proposed algorithm achieves significant improvement over representative benchmarks.

The rest of the paper is organized into four more sections. Section 2 briefly introduces the notation used to describe our proposed algorithm, together with some preliminary theory. Section 3 describes the proposed algorithm in detail through the establishment of the new logistic regression model. Section 4 reports extensive experiments that evaluate the proposed algorithm against seven benchmark techniques selected from the existing state-of-the-art methods. Finally, Section 5 draws the conclusions.

Notations and preliminaries

For the convenience of describing our ideas and presenting the proposed design of the new tensor learning algorithms for multimedia classification, we provide, in this section, a brief introduction to the symbol definitions as well as the related operations involved in our proposed algorithms.

Firstly, in this paper, scalars are denoted by lowercase letters (a, b, c, …), vectors are denoted by bold lowercase letters (a, b, c, …), matrices are denoted by uppercase letters (A, B, C, …), and

Tensor decomposition based logistic regression

As analyzed above, vector-based logistic regression classifiers need to vectorize the multimedia data, leading to the damage or loss of the spatial structure and time related information. Further, the vectorization often incurs generation of high-dimensional vector features, not only increasing the training time required but also causing over-fitting problems.

To solve these problems, we reconsider the logistic regression model and project it onto the corresponding tensor space based on Tucker
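The excerpt does not reproduce the paper's objective function, so the following is only a minimal sketch of what a logistic model with a Tucker-structured weight tensor and Frobenius-norm regularization might look like. The exact form of the penalty, the weight `lam`, the bias term, and all function names and sizes are assumptions made for illustration.

```python
import numpy as np


def mode_product(tensor, matrix, mode):
    """n-mode product: multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def project(x, factors):
    """Shrink a sample to the core size by applying each U_n^T along order n."""
    z = x
    for mode, u in enumerate(factors):
        z = mode_product(z, u.T, mode)
    return z


def predict_proba(x, core, factors, bias):
    """Sigmoid of <W, x> + bias, where W is the core times the factor matrices."""
    # <core x_n U_n, x> equals <core, x x_n U_n^T>, so W is never formed.
    score = np.sum(project(x, factors) * core) + bias
    return 1.0 / (1.0 + np.exp(-score))


def objective(samples, labels, core, factors, bias, lam):
    """Mean logistic loss plus an assumed squared Frobenius-norm penalty."""
    probs = np.array([predict_proba(x, core, factors, bias) for x in samples])
    eps = 1e-12  # guards against log(0)
    nll = -np.mean(labels * np.log(probs + eps)
                   + (1 - labels) * np.log(1 - probs + eps))
    penalty = np.sum(core ** 2) + sum(np.sum(u ** 2) for u in factors)
    return nll + lam * penalty


# Hypothetical data: six 8 x 9 x 10 samples with binary labels.
rng = np.random.default_rng(0)
samples = rng.standard_normal((6, 8, 9, 10))
labels = np.array([0, 1, 0, 1, 1, 0])
core = 0.1 * rng.standard_normal((2, 3, 2))
factors = [0.1 * rng.standard_normal((n, r)) for n, r in zip((8, 9, 10), (2, 3, 2))]

print(project(samples[0], factors).shape)  # (2, 3, 2)
print(objective(samples, labels, core, factors, bias=0.0, lam=0.01))
```

The sketch highlights one design consequence of the Tucker structure: each input is projected down to the size of the core tensor before it is scored, so the number of trainable parameters is governed by the core dimensions rather than by the size of the raw input.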

Experimental results and analysis

In this section, we evaluate the proposed tensor-based logistic model, Logistic Tucker Regression (LTuR), through extensive experiments benchmarked against seven existing algorithms. The seven state-of-the-art methods are as follows:

  • Support vector regression (SVR) [11].

  • Logistic regression (LR) [13].

  • Higher rank support tensor regression (hrSTR) [30].

  • Higher rank tensor ridge regression (hrTRR) [30].

  • Higher rank tensor logistic regression (LTR) [31].

  • Support Tucker tensor machine

Conclusions

While existing techniques for multimedia classification are primarily based on vectors, tensor-based description may provide a better alternative as evidenced in the experimental results reported in this paper. Comparative studies and testing of the proposed algorithms reveal that the tensor-based multimedia classification achieves a number of advantages, which can be highlighted as: (1) tensor decomposition reduces the number of estimated parameters, and hence reduces the time complexity of

Acknowledgment

This work was supported by the Hebei Provincial Natural Science Foundation, China (under Grant F2016111005). This work was also supported by the Chinese Natural Science Foundation (CNSF) (under Grant 61620106008 and Grant 61373103).

References (41)

  • Y. Yan, E. Ricci, R. Subramanian, G. Liu, O. Lanz, N. Sebe, A multi-task learning framework for head pose estimation...
  • D. Xu et al., Reconstruction and recognition of tensor-based objects with concurrent subspaces analysis, IEEE Trans. Circ. Syst. Video Technol. (2008)
  • K. Li et al., Non-rigid structure from motion via sparse representation, IEEE Trans. Cybernet. (2015)
  • H. Yahong et al., Semisupervised feature selection via spline regression for video semantic recognition, IEEE Trans. Neural Netw. Learn. Syst. (2015)
  • Y. Yang et al., Hybrid sampling-based clustering ensemble with global and local constitutions, IEEE Trans. Neural Netw. Learn. Syst. (2016)
  • A. Genkin et al., Large-scale Bayesian logistic regression for text categorization, Technometrics (2007)
  • Y. Yan et al., Multitask linear discriminant analysis for view invariant action recognition, IEEE Trans. Image Process. (2014)
  • H. Yahong et al., Compact and discriminative descriptor inference using multi-cues, IEEE Trans. Image Process. (2015)
  • L. Kang et al., Convolutional neural networks for no-reference image quality assessment

This paper has been recommended for acceptance by Zicheng Liu.