A video semantic detection method based on locality-sensitive discriminant sparse representation and weighted KNN

https://doi.org/10.1016/j.jvcir.2016.09.006

Highlights

  • Propose a locality-sensitive discriminant sparse representation method (LSDSR).

  • Design a novel discriminant loss function with sparse coefficients.

  • Propose a new weighted KNN based on LSDSR.

  • Develop a video semantic detection model based on both LSDSR and weighted KNN.

Abstract

Video semantic detection has been a research hotspot in the field of human-computer interaction. In video-feature-oriented sparse representation, features from videos of the same category may fail to achieve similar coding results. To address this, Locality-Sensitive Discriminant Sparse Representation (LSDSR) is developed so that video samples belonging to the same category are encoded as similar sparse codes, which gives them better category discrimination. In LSDSR, a discriminative loss function based on the sparse coefficients is imposed on locality-sensitive sparse representation, which makes the dictionary optimized for sparse representation discriminative. The LSDSR of video features thus enhances the power of semantic discrimination, optimizing the dictionary and building a better discriminant sparse model. Moreover, to further improve the accuracy of video semantic detection after sparse representation, a weighted K-Nearest Neighbor (KNN) classification method whose loss function integrates reconstruction error and discrimination is adopted to detect video semantic concepts. The proposed methods are evaluated on related video databases in comparison with existing sparse representation methods. The experimental results show that the proposed methods significantly enhance the discriminative power of video features and consequently improve the accuracy of video semantic concept detection.

Introduction

Many scholars have committed to the automatic detection of high-level semantic concepts from multimedia databases, and to the establishment of effective mappings between such concepts and low-level features.

Video semantic analysis has made great progress in signal processing. Fu et al. presented an effective sports video semantic analysis algorithm based on the fusion and interaction of multiple models and multiple features [1]. Using the semantic color ratio, each video shot was classified as a global shot, an in-field shot or an out-of-field shot, and Hidden Markov Models were adopted to associate every video shot with a particular semantic class. An ontology model of semantic video objects was built, and the basic relations among these concepts were described, with the aim of bridging the gap between low-level video features and the high-level semantics of the video [2]. You et al. proposed a semantic framework for weakly supervised video genre classification and event analysis [3]. They implemented an analysis algorithm based on a Hidden Markov Model and a naïve Bayesian classifier for video classification, and built a Gaussian mixture model to detect the contained events.

Sparse representation has recently been applied to signal reconstruction [4], [5], [6], [7], and experiments have shown that it performs well in video analysis [8], [9], [10], [11]. Action recognition and semantic analysis are important parts of video analysis. In action recognition, discriminative dictionary learning for sparse representation was applied to represent effective video action features and performed classification well [8], [9], [10]. In video semantic analysis, a kernel discriminative sparse representation method was proposed for representing video semantic features [11]. In theory, sparse representation can tolerate signal interference and reconstruct the signal with high fidelity. The quality of the dictionary is crucial for efficient sparse representation. Dictionary learning (DL) aims to learn a good dictionary from the training samples so that a given signal can be well represented or coded. The dictionary is usually determined in one of the following two ways.

The first category uses all the training samples as the dictionary. For example, Wright et al. directly used the training samples of all classes as the dictionary to code a testing sample, and then classified the testing sample by evaluating which class leads to the minimal reconstruction error [12]. Xu et al. proposed a two-phase test sample representation method [13]. The first phase represents the test sample as a linear combination of all the training samples and seeks the M nearest neighbors of the test sample. The second phase represents the test sample as a linear combination of the determined M nearest neighbors and uses the representation result to perform classification. Wang et al. [15] proposed the Locality-constrained Linear Coding (LLC) scheme based on the locally linear embedding algorithm for nonlinear dimensionality reduction [14]. In LLC, a distance regularization function is introduced into the sparse representation method to further improve coding performance. All the methods mentioned above use the training samples as the dictionary. Although they show good classification performance, the dictionary may not be expressive enough to represent a sample because of the uncertain and noisy information in the original training samples. Furthermore, such a dictionary cannot fully exploit the discriminative information hidden in the training samples.
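The two-phase scheme of Xu et al. [13] can be sketched as follows. This is an illustrative reconstruction under stated assumptions (ridge-regularized least squares in both phases, and the deviation of each sample's individual contribution as the nearness measure), not the authors' exact formulation:

```python
import numpy as np

def two_phase_classify(X, labels, y, M=5, reg=1e-3):
    """Sketch of the two-phase test-sample representation of Xu et al. [13].

    X: (m, N) training samples as columns; labels: (N,); y: (m,) test sample.
    """
    # Phase 1: represent y over ALL training samples (ridge-regularized least squares).
    a = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    # Deviation of each sample's contribution a_i * x_i from y; small deviation = "near".
    dev = np.linalg.norm(y[:, None] - X * a, axis=0)
    near = np.argsort(dev)[:M]                  # the M nearest training samples
    # Phase 2: represent y over only the M selected samples.
    Xm, lm = X[:, near], labels[near]
    b = np.linalg.solve(Xm.T @ Xm + reg * np.eye(M), Xm.T @ y)
    # Classify by which class's partial reconstruction is closest to y.
    errs = {c: np.linalg.norm(y - Xm[:, lm == c] @ b[lm == c]) for c in np.unique(lm)}
    return min(errs, key=errs.get)
```

With two well-separated classes, the M selected neighbors come from the test sample's own class and the per-class reconstruction error decides the label.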

The second category adopts a learned dictionary for sparse representation; dictionary optimization helps improve the performance of sparse representation. The K-means Singular Value Decomposition (K-SVD) algorithm proposed by Aharon et al. [16] is a representative DL method. Given a set of training samples, K-SVD seeks the dictionary that yields the best representation for each sample in the set. However, this method only requires that the dictionary express the training samples well under strict sparsity constraints, so it is not well suited to recognition. Yang et al. proposed Fisher Discrimination Dictionary Learning (FDDL) [17], in which the Fisher discrimination criterion is imposed on the coding coefficients to produce small within-class scatter and large between-class scatter simultaneously. Ma et al. proposed discriminative low-rank dictionary learning for sparse representation [18]. Wei et al. [19] proposed Locality-Sensitive Dictionary Learning (LSDL) for sparse representation by adding a locality constraint to the objective function of dictionary learning, which ensures that the learned over-complete dictionary is more representative.
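The K-SVD atom update admits a compact sketch. The rank-1 SVD refit below follows the published algorithm in spirit; dictionary initialization, the sparse-coding step and stopping criteria are omitted:

```python
import numpy as np

def ksvd_atom_update(X, D, A):
    """One sweep of K-SVD atom updates (after Aharon et al. [16]) - illustrative.

    X: (m, N) data, D: (m, K) dictionary, A: (K, N) sparse codes.
    Each atom d_k and its coefficient row are refit by a rank-1 SVD of the
    residual restricted to the samples whose codes currently use atom k,
    so the support pattern of A is preserved.
    """
    for k in range(D.shape[1]):
        users = np.nonzero(A[k])[0]          # samples that use atom k
        if users.size == 0:
            continue
        # Residual on the supporting samples, with atom k's contribution removed.
        E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                    # new atom: leading left singular vector
        A[k, users] = s[0] * Vt[0]           # matching coefficients
    return D, A
```

Because each rank-1 refit is optimal given the other atoms, a sweep never increases the reconstruction error ||X − DA||_F.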

Video classification methods can be divided into two types: rule-based methods and statistics-based methods. Rule-based methods are suitable for specific fields and detect video semantic concepts according to domain-specific knowledge. Statistics-based methods learn models and classifiers from annotated video clips and then classify other videos. Common classification schemes used with sparse representation include the Bayesian classifier, reconstruction-error classification, K-Nearest Neighbor (KNN) and so on. Among these, the KNN algorithm [20] and sparse representation can be integrated in the classification decision, which yields better classification performance [11].

Traditional sparse representation methods cannot produce similar coding results when the input video features come from the same category. This paper requires that video samples of the same category be encoded as similar sparse codes, so as to enhance the discriminative power of sparse representation features. Traditional KNN classifies a sample according to the limited number of adjacent samples around the testing sample; it does not consider the discrimination between samples well, which affects the accuracy of video semantic classification. To address these problems, we propose a video semantic detection method based on locality-sensitive discriminant sparse representation and weighted K-Nearest Neighbor (LSDSR-WKNN). First, locality-sensitive discriminant sparse representation (LSDSR) is designed by imposing a new discriminant loss function based on the sparse coefficients onto the LSDL method; through LSDSR, a well-optimized dictionary is obtained, and the discriminative power of sparse representation features is significantly enhanced. Based on LSDSR, a new weighted KNN algorithm whose loss function integrates reconstruction error and discrimination is then developed for video semantic classification. To demonstrate the performance of the proposed LSDSR-WKNN, our method is compared with existing state-of-the-art approaches in terms of recognition rate. Comprehensive video semantic analysis experiments are conducted on three video data sets: a traffic video data set from the Youku website, TRECVID 2007, and a video data set from the Open Video Project (OV). The experimental results show that LSDSR-WKNN is an effective video semantic detection method.

The contributions of this paper can be summarized as follows:

  • (1)

    The discriminant loss function is introduced into locality-sensitive sparse representation. An optimized dictionary is obtained, and the power of discrimination of sparse coding for video samples is further enhanced.

  • (2)

    The weighted KNN with a discriminant loss function that integrates reconstruction error and discrimination is proposed, and a voting mechanism is adopted for video semantic detection.

  • (3)

    A video semantic detection model based on locality-sensitive discriminant sparse representation and weighted KNN is developed. Original video features are encoded sparsely, and video semantics are detected with the model to improve the accuracy of video concept detection.

The rest of this paper is organized as follows. Section 2 reviews related work on sparse representation. Section 3 designs the locality-sensitive discriminant sparse representation of video features. Section 4 proposes the weighted KNN video semantic classification method based on locality-sensitive discriminant sparse representation. Experimental results on related video databases and their analysis are presented in Section 5. Section 6 concludes the paper.


Sparse representation

Let $X \in \mathbb{R}^{m \times N}$ be the training set and $D \in \mathbb{R}^{m \times K}$ the sparse learning dictionary; $A \in \mathbb{R}^{K \times N}$ is the sparse representation matrix of $X$, and $y \in \mathbb{R}^{m \times 1}$ is one test sample. In sparse representation, the original signal is represented as a sparse linear combination of the atoms of the over-complete dictionary. Because the K-SVD algorithm enforces a strict sparsity constraint, the dictionary can be learned to obtain a better sparse representation of the data. The dictionary learning problem is formulated as

$\min_{D,A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|\alpha_i\|_0 \leq T_0,$
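The sparse-coding step under this ℓ0 constraint is commonly solved greedily with Orthogonal Matching Pursuit; a minimal sketch follows, assuming unit-norm atoms, and not necessarily the solver used in the paper:

```python
import numpy as np

def omp(D, y, T0):
    """Orthogonal Matching Pursuit - greedy solver for
    min_alpha ||y - D alpha||_2  s.t.  ||alpha||_0 <= T0.
    D: (m, K) dictionary with unit-norm atoms; y: (m,) signal."""
    residual, support = y.copy(), []
    alpha = np.zeros(D.shape[1])
    for _ in range(T0):
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # Refit the coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        alpha[:] = 0
        alpha[support] = coef
        residual = y - D[:, support] @ coef
    return alpha
```

For example, with an orthonormal dictionary a T0-sparse signal is recovered exactly in T0 iterations.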

Locality-sensitive discriminant sparse representation

In video-feature-oriented sparse representation, video features from the same category might not achieve similar coding results. We assume that, in sparse-representation-based video semantic detection, video samples from the same category should be encoded with similar sparse coefficients, with the goal of enhancing the discriminative power of sparse-representation-based classification. In this paper, locality-sensitive discriminant sparse representation (LSDSR) is
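The locality-sensitive idea can be illustrated with a closed-form coding step in the spirit of LLC [15] and LSDL [19]: atoms far from the input are penalized, so nearby inputs receive similar codes. The exponential locality adaptor and the knobs `lam` and `sigma` below are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def locality_coding(D, y, lam=0.1, sigma=1.0):
    """Locality-constrained coding sketch (LLC/LSDL style).

    Solves min_a ||y - D a||^2 + lam * ||diag(d) a||^2 in closed form,
    where d_j grows with the distance between y and atom d_j, so distant
    atoms receive small coefficients. D: (m, K) dictionary, y: (m,)."""
    # Locality adaptor: larger penalty for atoms distant from y.
    d = np.exp(np.linalg.norm(y[:, None] - D, axis=0) / sigma)
    # Ridge-style closed-form solution of the penalized least squares.
    return np.linalg.solve(D.T @ D + lam * np.diag(d ** 2), D.T @ y)
```

As expected, the atom closest to the input carries most of the code's weight.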

Weighted KNN video semantic concept detection based on LSDSR

The traditional KNN method classifies a sample according to the limited number of adjacent samples around the testing sample, and does not consider well the differences among the training samples. To improve the accuracy of KNN for video semantic classification, a weighted KNN classification method based on locality-sensitive discriminant sparse representation (LSDSR-WKNN) is proposed. We first calculate the loss function values between the test sample and the training samples of each class and then
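The decision step can be sketched as a weighted KNN vote. In the paper the weights derive from the LSDSR loss that integrates reconstruction error and discrimination; the plain inverse-score weights below are a simplifying assumption:

```python
import numpy as np

def weighted_knn(scores, labels, k=3, eps=1e-8):
    """Score-weighted KNN vote - a generic sketch of the decision step.

    scores: (N,) loss value between the test sample and each training
    sample (in the paper, an LSDSR-based loss; here any distance works).
    Each of the k lowest-loss neighbors votes with weight 1/(loss + eps)."""
    near = np.argsort(scores)[:k]
    votes = {}
    for i in near:
        votes[labels[i]] = votes.get(labels[i], 0.0) + 1.0 / (scores[i] + eps)
    return max(votes, key=votes.get)
```

Unlike an unweighted majority vote, a single very close neighbor can outvote two distant ones, which reflects the differences among training samples that plain KNN ignores.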

Experiments and result analysis

To demonstrate the effectiveness of the proposed methods, we conduct experiments on three video data sets: the traffic video data set from the Youku website, the news video data set from TRECVID 2007, and the video data set from OV. Fig. 2 shows example video key-frame pictures from the three data sets. The proposed LSDSR and LSDSR-WKNN methods were compared with LLC, SRC, FDDL and LSDL. The experiments are carried out with 20-fold cross validation. Training samples are

Conclusion

This paper requires that video samples from the same category be encoded as similar sparse coefficients in sparse-representation-based video semantic detection, so as to enhance the discriminative power of sparse representation features. The video semantic detection methods LSDSR and LSDSR-WKNN are proposed. A discriminant loss function based on the sparse coefficients is first introduced into LSDL in the proposed LSDSR. The LSDSR aims to generate

Acknowledgements

This work was supported by National Natural Science Foundation of China (Grant Nos. 61672268, 61502208), Primary Research & Development Plan of Jiangsu Province of China (Grant No. BE2015137), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant Nos. 14KJB520007, 14KJB520008), China Postdoctoral Science Foundation (Grant No. 2015M570411), Natural Science Foundation of Jiangsu Province of China (Grant No. BK20150522) and Research Foundation for Talented

References (24)

  • Y. Yuan et al., Action recognition by joint learning, Image Vis. Comput. (2015)

  • T. Guha et al., Action recognition by learnt class-specific overcomplete dictionaries

    This paper has been recommended for acceptance by Zicheng Liu.

    1 Junqi Liu, Minchao Wang and Jianping Gou contributed equally to this work.
