A tensor-based deep learning framework☆
Introduction
A prerequisite of human–robot interaction is the robot's ability to comprehend the inputs provided by humans. As with humans, the most information-rich sensory input of a robot is the visual one. It follows that, if robots are to collaborate with humans, they ought to be endowed with subsystems suitable for action, gesture, or even emotion recognition. A point of interest in such recognition tasks is that the corresponding data encapsulate both spatial and temporal information. The temporal dimension of these problems must be handled appropriately and, consequently, mainstream algorithms designed for object recognition are considered improper for the aforementioned tasks.
A well-known approach to dealing with video sequences is to utilize motion descriptors [1], [2], [3], [4], [5]; however, the formation of Bag-of-Words models does not preserve temporal coherence. A solution to the latter disadvantage can be provided by deep learning architectures, where the acquisition of spatio-temporal contingencies and, thus, the derivation of the respective features are an indispensable goal. Deep learning is an emerging area in the field of machine learning, and its stepping stone is a set of contemporary neuroscience findings concerning the representation of information and the derivation of features. More precisely, the findings in Ref. [6] suggest that the neocortex in mammals does not process sensory input in a shallow manner; rather, the data traverse a composite multi-layered structure of computational units which, through repetition, learn to express the signals according to the consistencies they display. Additionally, the works in Refs. [7], [8] provide evidence for the existence of a common cortical architecture, not only between different modalities such as vision, audition and somatosensation, but also across species. Deep learning models have already been applied to a diversity of recognition tasks, namely object recognition [9], [10], [11], [12], human tracking [13], segmentation [14], brain–computer interaction (BCI) [15] and action recognition [16], [17], [18], exhibiting remarkable performance.
Tensor or multilinear algebra provides a complete mathematical framework for analyzing the multifactor formation of image and video sets. Moreover, it offers methodologies for decomposing such sets and, thus, unfolding their factors or modes. Just as matrices are linear operators over a vector space, tensors constitute multilinear operators over a set of vector spaces. Consequently, linear analysis is a special case of tensor analysis, which offers a generalized mathematical framework appropriate for confronting a plethora of machine learning problems. However, the vectorization of a tensor neglects proximity information, i.e. when a video sample is processed, a patch of pixels is separated from its neighboring ones and, therefore, spatio-temporal coherence is lost. Additionally, the vectorization procedure, besides stacking the rows or columns of the original tensorial samples in an inconsistent manner, also leads to the formation of high-dimensional vectors. Combining this high-dimensional data representation with a small number of data samples leads to small-sample-size problems, which tensors can handle competently [19]. Tensor-based frameworks maintain most of the natural constraints of the data and, thus, are in a position to improve their characterization or classification. Especially when the number of training samples is restricted, these constraints aid the derivation of a reasonable solution. Second- and third-order tensors suffice to describe image and video samples, respectively.
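As a brief illustration of the point above, the following NumPy sketch contrasts full vectorization of a video tensor with a mode-n unfolding that keeps one mode's fibres intact; the clip dimensions and the `unfold` helper are hypothetical, chosen only for illustration.

```python
import numpy as np

# A short grayscale video clip as a third-order tensor:
# (frames, height, width). Sizes are illustrative.
clip = np.random.rand(16, 64, 64)

# Vectorization stacks rows end to end: pixels that are vertical
# neighbours within a frame end up 64 positions apart in the vector,
# and the dimensionality grows to 16 * 64 * 64 = 65536.
vec = clip.reshape(-1)
assert vec.shape == (65536,)

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Unfolding along the temporal mode yields one row per frame, so each
# row is still a coherent spatial slice of the original tensor.
T0 = unfold(clip, 0)
assert T0.shape == (16, 64 * 64)
```

The unfolding is what makes mode-wise decompositions tractable: each mode can be analyzed as an ordinary matrix without first destroying the other modes' structure.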
The motivation of the proposed model rests on three main facts. Firstly, tensor-based approaches avoid over-training, especially in small-sample-size problems [20]; over-training occurs in most deep architectures, and further measures attempting to confine such phenomena must be considered [21]. Secondly, multilinear algebra reduces the computational complexity and the memory requirements [22], [23]; such considerations are frequently encountered in deep learning models, where expensive solutions, namely computing clusters with thousands of machines, are deployed [24]. Last, multilinear units require fewer parameters to be estimated and are thus more efficient than vectorized non-linear ones.
This paper presents an unsupervised deep learning methodology inspired by the HTM notion [25], [26], i.e. a network of computational nodes placed in a tree-shaped hierarchy and divided into distinct levels. Each of these nodes incorporates notions from tensor algebra to infer spatio-temporal features. To the best of our knowledge, this is the first attempt to graft tensor-based approaches onto such architectures. The insight behind this approach is the avoidance of the high-dimensional feature vectors appearing in previous HTM architectures [25], [11], [26], [12], resulting in a method that is more efficient in terms of data representation and classification accuracy; the experimental evaluation justifies this insight. The proposed methodology is an unsupervised multi-level architecture capable of extracting spatio-temporal features from video samples. Every node – similarly to the HTM notion [25], [26] – incorporates two distinct procedures, a spatial and a temporal one, each of which operates in a training and a testing (inference) mode. When the spatial procedure is in training mode, the node treats every frame patch as a second-order tensor and attempts to learn and quantize the input space. This quantization is accomplished by deriving representatives (i.e. quantization centers) through a competitive learning scheme. Then, the node switches to testing mode and expresses each frame patch in terms of similarity to the corresponding representatives. Afterwards, the node initiates the temporal procedure in training mode and, for each frame patch, forms a matrix that contains the temporal changes among the frame patches of the same video sample. These matrices are utilized to assemble temporal clusters through a tensorized extension of a vector-based clustering technique.
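A minimal sketch of such a spatial procedure is given below, assuming a simple winner-take-all competitive update and an exponential similarity; the function names, learning rate and exact update rule are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def train_spatial_node(patches, n_centers=4, lr=0.1, epochs=20, seed=0):
    """Competitive-learning quantization of 2-D frame patches.

    `patches` has shape (n, h, w); each patch is kept as a second-order
    tensor and compared to the centers via the Frobenius norm.
    """
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), n_centers, replace=False)].copy()
    for _ in range(epochs):
        for p in patches:
            # Winner-take-all: move only the closest center towards the patch.
            dists = np.linalg.norm(centers - p, axis=(1, 2))
            w = np.argmin(dists)
            centers[w] += lr * (p - centers[w])
    return centers

def spatial_inference(patch, centers):
    """Express a patch as normalized similarities to the representatives."""
    d = np.linalg.norm(centers - patch, axis=(1, 2))
    sim = np.exp(-d)           # closeness in (0, 1]
    return sim / sim.sum()     # membership vector summing to one

patches = np.random.rand(200, 8, 8)
centers = train_spatial_node(patches)
belief = spatial_inference(patches[0], centers)
```

During inference each patch is thus replaced by a short membership vector over the learned representatives, rather than by the high-dimensional vectorized patch itself.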
Last, the node switches to testing mode, in which every frame patch acquires – through a membership function – a degree of closeness with respect to the derived temporal clusters. Regarding the proposed deep learning model, the levels are trained successively: a level starts its training only after the nodes of the previous level have completed theirs. In addition, an upper-level node uses the concatenated outputs of its children nodes as input and treats those concatenations as a dictionary. The proposed technique surpasses, in terms of classification accuracy, not only previous HTM-based models but also other state-of-the-art methodologies (see Section 4 for details).
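The level-by-level composition can be sketched as follows; the array shapes, the number of children and the helper name are hypothetical, serving only to show how children's outputs are concatenated into a parent's input.

```python
import numpy as np

def parent_input_from_children(children_beliefs):
    """Concatenate the belief vectors of the child nodes.

    `children_beliefs` is a list of (n_samples, k_i) arrays, one per
    child; the concatenation acts as the parent node's dictionary.
    """
    return np.concatenate(children_beliefs, axis=1)

# Hypothetical two-level example: four children, each emitting a
# 4-dimensional belief per sample, feeding one parent node.
children = [np.random.rand(10, 4) for _ in range(4)]
parent_input = parent_input_from_children(children)
```

The parent is trained only on these concatenations, which is consistent with the successive, level-by-level training described above.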
The remainder of this paper is organized as follows: Section 2 presents the most significant work in the deep learning field together with methodologies that exploit multilinear algebra; Section 3 describes the architecture of the proposed deep learning model, including the operation of a single node; the experimental validation of the presented methodology is given in Section 4 and, last, conclusions are drawn in Section 5.
Related work
This section presents, in turn, the most significant methodologies that utilize either deep learning or multilinear algebra.
Proposed methodology
This section describes the proposed deep learning algorithm. It is divided into two subsections, the first presenting the architecture of the entire methodology and the second the functionality of a single node.
Experimental results
This section presents a series of experiments conducted to evaluate the effectiveness of the proposed deep learning scheme. The performance of the presented model is compared, in terms of classification accuracy, with other state-of-the-art methodologies. The comparison was carried out on three publicly available action recognition datasets, viz. Gesture3D [54], KTH [55] and UCF Sports [56].
The proposed deep learning scheme was combined with both the linear SVM classifier
Conclusion
This paper presents an unsupervised deep learning framework capable of deriving spatio-temporal features. This generic framework may be applied to a variety of recognition problems, while the respective experimental validation shows the efficacy of the proposed methodology in action recognition tasks. The presented technique exploits multilinear algebra, maintaining the coherence between successive frames. Each computational node forms second-order representatives according to a deviation
References (63)
- Spatiotemporal bag-of-features for early wildfire smoke detection, Image Vision Comput. (2013)
- Histogram of oriented rectangles: a new pose descriptor for human action recognition, Image Vision Comput. (2009)
- The role of the primary visual cortex in higher level vision, Vis. Res. (1998)
- On the optimization of hierarchical temporal memory, Pattern Recogn. Lett. (2012)
- Higher rank support tensor machines for visual recognition, Pattern Recogn. (2012)
- Sparsity preserving projections with applications to face recognition, Pattern Recogn. (2010)
- Behavior recognition via sparse spatio-temporal features
- On space-time interest points, Int. J. Comput. Vis. (2005)
- Recognizing action at a distance
- An Organizing Principle for Cerebral Function: The Unit Module and the Distributed System (1979)
- Visual behaviour mediated by retinal projections directed to the auditory pathway, Nature
- Unsupervised learning of invariant feature hierarchies with applications to object recognition
- Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning
- Sparse deep-learning algorithm for recognition and categorisation, Electron. Lett.
- Human tracking using convolutional neural networks, Trans. Neural Netw.
- Convolutional networks can learn to generate affinity graphs for image segmentation, Neural Comput.
- Convolutional neural networks for P300 detection with application to brain–computer interfaces, Trans. Pattern Anal. Mach. Intell.
- Convolutional learning of spatio-temporal features
- 3D convolutional neural networks for human action recognition, Trans. Pattern Anal. Mach. Intell.
- A biologically inspired system for action recognition
- Supervised tensor learning, Knowl. Inf. Syst.
- Tensor learning for regression, Trans. Image Process.
- Tensor subspace analysis
- Large scale distributed deep networks
- How the brain might work: a hierarchical and temporal model for learning and recognition
- Towards a mathematical theory of cortical micro-circuits, PLoS Comput. Biol.
- A fast learning algorithm for deep belief nets, Neural Comput.
- Large-scale learning with SVM and convolutional nets for generic object categorization
- Exponential family harmoniums with an application to information retrieval
- Greedy layer-wise training of deep networks
☆ This paper has been recommended for acceptance by Stefanos Zafeiriou.