
Neurocomputing

Volume 100, 16 January 2013, Pages 117-126

Video analysis based on Multi-Kernel Representation with automatic parameter choice

https://doi.org/10.1016/j.neucom.2011.10.034

Abstract

In this work, we analyze video data by learning both the spatial and temporal relationships among frames. For this purpose, the nonlinear dimensionality reduction algorithm Laplacian Eigenmaps is improved using a multiple kernel learning framework, and it is assumed that the data can be modeled by means of two different graphs: one considering the spatial information (i.e., the pixel intensity similarities) and the other based on the frame temporal order. In addition, a formulation for automatic tuning of the required free parameters is stated, which is based on a tradeoff between the contribution of each information source (spatial and temporal). Moreover, we propose a scheme to compute a common representation in a low-dimensional space for data lying on several manifolds, such as multiple videos of similar behaviors. The proposed algorithm is tested on real-world datasets, and the obtained results allow us to visually confirm the quality of the attained embedding. Accordingly, the discussed approach is suitable for representing data from cyclic movements.

Introduction

Essentially, most of the video-based methods employed to learn and discover real object motion and/or human behavior rely on two main attributes: the spatial disposition of the pixels and their variation over time. The common strategy of nonlinear dimensionality reduction (NLDR) techniques is to assume that the high-dimensional input data (video) lie on a manifold defined by a smaller set of characteristics [4], [9]. Then, by computing the pixel intensity similarities among a sequence of images, NLDR aims to reveal the underlying data structure in a low-dimensional space, which makes it possible to infer the latent variables that govern the studied phenomenon.

There are several techniques for implementing NLDR based on manifold learning, such as Isometric Feature Mapping (ISOMAP) [16], Locally Linear Embedding (LLE) [14], Maximum Variance Unfolding (MVU) [20], and Laplacian Eigenmaps (LEM) [1]. Unlike iterative NLDR techniques (such as MVU and ISOMAP), LEM has an analytic solution. Thereby, LEM requires a lower computational load when dealing with a reasonable sample size, and it does not require a regularization process as LLE does. Nevertheless, all the above mentioned techniques determine the assumed manifold by modeling the data topology through local interconnections among samples based solely on spatial similarities, that is, by a pixel intensity comparison. Accordingly, they discard valuable information related to the smooth transitions between adjacent frames, which can be considered as a time constraint. Furthermore, the temporal information captured in a video should be taken into account to differentiate the cycles of a periodic movement (activities that return to their beginning and repeat themselves in the same sequence, e.g., rotating, walking, running, etc.), since traditional NLDR techniques are not suitable for identifying each repetition in the low-dimensional space.

On the other hand, the possibility of incorporating prior knowledge into the embedding topology has been suggested, which allows obtaining enhanced low-dimensional representations of the phenomenon at hand [18], [19]. However, those techniques are based on complex probabilistic models comprising free heuristic parameters that are far from easy to tune for an inexpert user (not to mention their huge computational load). Recently, an approach for incorporating temporal information into the embedding process was discussed in [8], which considers adjacent temporal neighbors to find out the structure of repetitive activities. Nonetheless, since the time variable is not reflected in the mapping process, it is not possible to identify different repetitions of a movement, and thus the cycles overlap in the embedding space. To cope with this, a general model for multiview learning called Distributed Spectral Embedding (DSE), which aims to unfold the underlying data structure from different feature spaces, is presented in [10]. DSE calculates a common low-dimensional space that is as close to each representation as possible. Although DSE allows handling different space representations, the original multiview data are invisible to the final learning process, making it inappropriate for exploring the complementary nature of different views, and its computational load is too high [21].

Several approaches that deal with multiple kernels within the machine learning context (classification and regression) are also presented in [7], [12], [3]. Their main goal is to employ different sources of information to identify the similarities among samples; then, a combination of these similarities is calculated by means of statistical kernel learning. In this regard, a convenient approach is to consider that the calculated multiple kernel is actually a convex combination of basis kernels [3], [12]. In [21], a similar approach is used based on the mathematical framework of LEM to obtain a Multiview Spectral Embedding (MSE) of the input data. The MSE approach takes advantage of different views (feature spaces) to find a low-dimensional space wherein each view is sufficiently smooth. Particularly, MSE is tested on image retrieval, video annotation, and document clustering problems, mainly combining low-level visual features. Since there is no closed-form solution for MSE, an alternating optimization is carried out to obtain the embeddings.
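
To make the kernel combination concrete, the following sketch (an illustration, not the exact formulation of [3], [12], or [21]) builds an affinity matrix as a convex combination of basis kernels; the random matrices stand in for the spatial and temporal kernels discussed below.

```python
# Convex combination of basis kernels: a minimal sketch, not the authors' code.
import numpy as np

def convex_kernel_combination(kernels, weights):
    """Combine basis kernel (affinity) matrices with convex weights.

    kernels : list of (n, n) ndarrays
    weights : nonnegative floats summing to one
    """
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return sum(w * K for w, K in zip(weights, kernels))

# Example with two random symmetric affinities (placeholders only).
rng = np.random.default_rng(0)
A = rng.random((5, 5)); Ws = (A + A.T) / 2
B = rng.random((5, 5)); Wt = (B + B.T) / 2
W = convex_kernel_combination([Ws, Wt], [0.7, 0.3])
```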

In this work we propose a methodology for analyzing videos based on a Multi-Kernel Representation (MKR) of the input data, improving the LEM technique to compute and learn both spatial and temporal relationships among frames. The spatial relationships refer to the change of the pixel intensities among samples. The temporal information is related to the sequence order of the data, more precisely, the order of appearance of the frames. When both the spatial and temporal information are considered, the low-dimensional representation reveals the real motion of the objects. In addition, a formulation for automatic tuning of the required free parameters is presented. This formulation is based on a tradeoff between the contributions of the spatial and temporal information, minimizing, as much as possible, each representation error in the low-dimensional space based on the L-curve criterion [6]. Our work is inspired by the multiple kernel learning framework [12], [17], which is adapted into an NLDR scheme for cyclic motion analysis from video data. The presented approach is tested for revealing the spatial and temporal dynamics of several real-world videos related to cyclic motions. The main goal is to understand the true motion behavior of different scenes, making their interpretation easier. Particularly, the experiments are conducted on video sequences of rotating objects, walking humans, handwaving, and head movements. The obtained results exhibit a better performance of our method for visualizing the videos in a low-dimensional representation than traditional NLDR embeddings.
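
As an illustration of the two information sources used by MKR, the sketch below derives a spatial affinity from pixel-intensity distances and a temporal affinity from the frame order. The Gaussian bandwidth heuristic and the adjacency width are assumptions made for demonstration, not necessarily the paper's exact kernel definitions.

```python
# Spatial and temporal affinities for a video: an illustrative sketch.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def spatial_affinity(X, sigma=None):
    """Gaussian affinity on pixel-intensity distances; X is (n_frames, n_pixels)."""
    D = squareform(pdist(X, metric="euclidean"))
    if sigma is None:
        sigma = np.median(D[D > 0])  # heuristic bandwidth (assumption)
    return np.exp(-(D ** 2) / (2 * sigma ** 2))

def temporal_affinity(n_frames, width=1):
    """Connect frames whose temporal order differs by at most `width`."""
    idx = np.arange(n_frames)
    A = (np.abs(idx[:, None] - idx[None, :]) <= width).astype(float)
    return A - np.eye(n_frames)  # no self-affinity
```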

On the other hand, most NLDR algorithms are constrained to deal with a single manifold, attaining inappropriate low-dimensional representations when the input data lie on multiple manifolds, moving each manifold apart from the others regardless of whether the behavior among them is similar. To deal with multiple manifolds, a methodology was proposed in [17] that learns a joint representation from data lying on multiple manifolds. It seeks to preserve the local structure of each manifold and, at the same time, collapse the different manifolds into a single manifold in the embedding space, preserving the implicit correspondences between the points across different datasets. Still, this technique performs a pixel comparison, that is, the analysis is carried out in the high-dimensional space, which implies that the methodology is limited to analyzing video frames sharing a similar appearance. Therefore, the approach presented in [17] is not suitable for visually comparing similar processes when the appearance of the objects/subjects is different.

In this sense, we also propose a Multiple Manifold Learning (MML) scheme to compute a common representation in a low-dimensional space for multiple videos of similar behaviors, without making a pixel comparison between data points from different datasets. Moreover, we combine MKR with MML to reveal the spatial and temporal dynamics of a given cyclic motion, learning it from several videos. The attained results show that our proposal outperforms the method presented in [17] when the mapping to a low-dimensional space deals with multiple videos at the same time.

This work is organized as follows. In Section 2, the proposed MKR for NLDR based on the LEM algorithm is presented. Section 3 introduces the proposed method for automatic parameter selection in MKR. Section 4 shows how our approach computes a common representation in a low-dimensional space for multiple videos of similar behaviors. In Section 5 the experimental conditions and results are described. Finally, in Sections 6 (Discussion) and 7 (Conclusions), we discuss and draw conclusions about the attained results.

Section snippets

Multi-Kernel Representation based on LEM

Laplacian Eigenmaps (LEM) is a nonlinear dimensionality reduction technique based on preserving the intrinsic geometric structure of the manifold. Let $X \in \mathbb{R}^{n \times p}$ be the input data matrix with sample objects $x_i$ ($i=1,\ldots,n$). The goal is to provide a mapping to a low-dimensional Euclidean space $Y \in \mathbb{R}^{n \times m}$, with sample vectors $y_i$, being $m \ll p$. The LEM algorithm has three main steps. First, an undirected weighted graph $G(V,E)$ is built, where $V$ are the vertices and $E$ are the edges. In this case, there are $n$
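
A minimal sketch of the standard LEM mapping for a given affinity matrix W (which, in MKR, would be the combination of the spatial and temporal kernels) is shown below; the paper's specific graph-construction choices (neighborhood size, heat-kernel parameter) are not reproduced here.

```python
# Standard Laplacian Eigenmaps step [1]: a hedged sketch, not the authors' code.
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(W, m=2):
    """Embed into m dimensions via the generalized eigenproblem L y = lambda D y."""
    D = np.diag(W.sum(axis=1))   # degree matrix (assumes no isolated vertices)
    L = D - W                    # unnormalized graph Laplacian
    lam, vec = eigh(L, D)        # eigenvalues in ascending order
    # Discard the trivial constant eigenvector (eigenvalue ~ 0).
    return vec[:, 1:m + 1]
```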

Automatical parameter selection in MKR

The $\xi_s$ and $\xi_t$ parameters in (6) give a tradeoff between the spatial and temporal information retained in the low-dimensional space $Y$. If $\xi_t=0$ ($\xi_s=1$), we have the original LEM mapping; as $\xi_t$ increases, $\xi_s$ decreases due to the constraint $\xi_s+\xi_t=1$. Consequently, for a given pair $(\xi_s,\xi_t)$, we can analyze the spatial and temporal representation errors in LEM with MKR as $\varepsilon_s(\xi_s,\xi_t)=\sum_{ij}\lVert y_i-y_j\rVert^2 W_{s_{ij}}$ and $\varepsilon_t(\xi_s,\xi_t)=\sum_{ij}\lVert y_i-y_j\rVert^2 W_{t_{ij}}$, where $y_i$ is a row vector of $Y$, which is calculated using (6). We
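
The sketch below illustrates how the spatial and temporal representation errors defined above could be evaluated over a grid of tradeoff values; the corner of the resulting error curve would then be selected with the L-curve criterion of [6], which is not reimplemented here. The `embed` argument is a placeholder for the MKR embedding (e.g., the LEM sketch above).

```python
# Grid evaluation of the spatial/temporal representation errors: a sketch.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def representation_error(Y, W):
    """epsilon = sum_ij ||y_i - y_j||^2 * W_ij for an embedding Y of shape (n, m)."""
    D2 = squareform(pdist(Y, metric="sqeuclidean"))
    return float((D2 * W).sum())

def sweep_tradeoff(Ws, Wt, embed, grid=np.linspace(0.05, 0.95, 19)):
    """Return (xi_t, eps_s, eps_t) triples for a range of temporal weights xi_t."""
    curve = []
    for xi_t in grid:
        Y = embed((1.0 - xi_t) * Ws + xi_t * Wt)  # xi_s = 1 - xi_t
        curve.append((xi_t,
                      representation_error(Y, Ws),
                      representation_error(Y, Wt)))
    return curve
```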

Multiple Manifold Learning

NLDR techniques are limited to working with a single manifold, and they fail to find a common low-dimensional representation for data lying on multiple manifolds, making it necessary to develop a methodology that deals with this issue. Each input sample $x_i$ can be related to one of $C$ different manifolds sharing a similar underlying structure. Let $\Psi=\{X_c\}_{c=1}^{C}$ be an input set, where $X_c \in \mathbb{R}^{n_c \times p}$. Our goal is to find a mapping from $\Psi$ to a low-dimensional space $Y \in \mathbb{R}^{n \times m}$ (with $m \ll p$, and $n=\sum_{c=1}^{C} n_c$), which
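
As a hedged sketch of one way to assemble a joint affinity for C videos: within-video blocks use each video's own (MKR) affinity, and cross-video links are placed between frames at similar normalized temporal positions, so no pixel comparison is made across datasets. The cross-video linking rule is an illustrative assumption, not necessarily the paper's MML formulation.

```python
# Joint affinity over multiple videos: an illustrative sketch under assumptions.
import numpy as np

def joint_affinity(within_blocks, link_strength=1.0, tol=0.05):
    """within_blocks: list of (n_c, n_c) affinity matrices, one per video."""
    sizes = [B.shape[0] for B in within_blocks]
    offsets = np.cumsum([0] + sizes)
    W = np.zeros((offsets[-1], offsets[-1]))
    # Each video's own affinity goes on the diagonal block.
    for B, o in zip(within_blocks, offsets):
        W[o:o + B.shape[0], o:o + B.shape[0]] = B
    # Link frames of different videos whose normalized time stamps are close.
    t = [np.linspace(0, 1, s) for s in sizes]
    for a in range(len(sizes)):
        for b in range(a + 1, len(sizes)):
            close = np.abs(t[a][:, None] - t[b][None, :]) <= tol
            W[offsets[a]:offsets[a + 1], offsets[b]:offsets[b + 1]] = link_strength * close
            W[offsets[b]:offsets[b + 1], offsets[a]:offsets[a + 1]] = (link_strength * close).T
    return W
```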

Single video analysis

We test the MKR methodology (Section 2) for finding a low-dimensional space that allows visually identifying the spatial and temporal dynamics of a single real-world video. In this sense, four real-world datasets are studied: COIL-100 [11], CMU MoBo [13], Action [15], and Head. The first database is the Columbia Object Image Library, which contains 72 RGB color images per object in PNG format. Pictures are taken while each object is rotated every 5 degrees from 0 to 360. We create a
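
For reference, a sketch of the assumed preprocessing that turns a sequence of frames into the data matrix $X \in \mathbb{R}^{n \times p}$, with each frame flattened into one row; the directory layout, image size, and grayscale conversion are placeholders rather than the datasets' actual format.

```python
# Stacking video frames into a data matrix: assumed preprocessing, not the paper's pipeline.
import numpy as np
from pathlib import Path
from PIL import Image

def video_to_matrix(frame_dir, size=(64, 64)):
    """Return an (n_frames, size[0] * size[1]) matrix of flattened grayscale frames."""
    frames = sorted(Path(frame_dir).glob("*.png"))  # placeholder layout
    rows = [np.asarray(Image.open(f).convert("L").resize(size), dtype=float).ravel()
            for f in frames]
    return np.vstack(rows)
```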

Discussion

According to the single video analysis results for the original LEM (spatial representation), which are shown in Figs. 2(a), 3(a), 4(a), and 5(a), it is possible to notice that traditional NLDR formulations that only consider the spatial relationships among observations lead to low-dimensional representations highlighting the underlying motion structure of the studied video. However, this does not allow revealing the temporal dynamics of the data. Thus, it is not possible to separate each cycle of

Conclusions

In this work we learn both the spatial and temporal relationships among frames of cyclic motions from video. To this end, we presented a nonlinear dimensionality reduction methodology based on Laplacian Eigenmaps and multiple kernel learning. Particularly, we showed that considering both spatial and temporal relationships among frames in video analysis enhances the data representability, revealing the underlying structure behind the samples in a low-dimensional space. Moreover, we proposed a

Acknowledgements

This research was carried out under grants provided by the Research Center of Excellence in TIC – ARTICA, the projects 20201006599 and 20201006574, and an M.Sc. scholarship funded by Universidad Nacional de Colombia, and the project 20201006594 funded by Universidad de Caldas and Universidad Nacional de Colombia.


References (21)

  • B. Li et al., Locally linear discriminant embedding: an efficient method for face recognition, Pattern Recognition (2008).
  • M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural...
  • M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. (2003).
  • M. Gonen, E. Alpaydin, Localized multiple kernel regression, in: Proceedings of the 20th International Conference on...
  • A. Gupta, F. Chen, D. Kimber, L.S. Davis, Context and observation driven latent variable model for human pose...
  • P.C. Hansen, Analysis of discrete ill-posed problems by means of the L-curve, SIAM J. (1992).
  • P.C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion (2000).
  • G.R.G. Lanckriet et al., A statistical framework for genomic data fusion, Bioinformatics (2004).
  • M. Lewandowski, J. Martinez-del Rincon, D. Makris, J.-C. Nebel, Temporal extension of Laplacian eigenmaps for...
  • B. Long et al., A general model for multiple view unsupervised learning, in: Proceedings of the 8th SIAM International...


Andrés Álvarez-Meza received his undergraduate degree in electronic engineering from the Universidad Nacional de Colombia in 2009. Currently, he is pursuing an M.Sc. degree at the same university. His research interests are nonlinear dimensionality reduction and kernel methods for motion analysis and signal processing.

Juliana Valencia-Aguirre received her undergraduate degree in electronic engineering from the Universidad Nacional de Colombia in 2009. Currently, she is pursuing an M.Sc. degree at the same university. Her research interests are feature extraction for training pattern recognition systems, nonlinear dimensionality reduction for motion analysis, and image processing.

Genaro Daza-Santacoloma received the B.S. degree in electronic engineering (2005), the M.Sc. degree in engineering-industrial automation with honors (2007), and the Ph.D. degree in engineering-automatics with honors (2010) from the Universidad Nacional de Colombia. Currently, he is an Assistant Professor at Universidad Antonio Nariño, Bogotá, where he researches human motion and subspace learning in collaboration with the Signal Processing and Recognition Group of the Universidad Nacional de Colombia. His research interests are feature extraction/selection for training pattern recognition systems, artificial vision, computer animation, and machine learning.

Carlos Daniel Acosta-Medina received a B.S. degree in Mathematics from Universidad de Sucre in Colombia in 1996. In 2000 he received an M.Sc. degree in Mathematics and in 2008 a Ph.D. in Mathematics, both from Universidad Nacional de Colombia – Sede Medellín. Currently he is an Associate Professor at Universidad Nacional de Colombia – Sede Manizales. His research interests are regularization, conservation laws, and discrete mollification.

Germán Castellanos-Domínguez received his undergraduate degree in radiotechnical systems and his Ph.D. in processing devices and systems from the Moscow Technical University of Communications and Informatics, in 1985 and 1990, respectively. Currently, he is a professor in the Department of Electrical, Electronic and Computer Engineering at the Universidad Nacional de Colombia at Manizales. In addition, he is Chairman of the GCPDS at the same university. His teaching and research interests include information and signal theory, digital signal processing and bioengineering.
