A tensor-based deep learning framework☆
Introduction
A prerequisite of human–robot interaction is the robot's ability to comprehend the inputs provided by humans. As with humans, the most information-rich sensory input of a robot is the visual one. It follows that, if robots are to collaborate with humans, they ought to be endowed with subsystems suitable for action, gesture, or even emotion recognition. A point of interest in such recognition tasks is that the corresponding data encapsulate both spatial and temporal information. The temporal dimension of these problems must be handled appropriately and, consequently, mainstream algorithms designed for object recognition are considered improper for the aforementioned tasks.
A well-known approach to dealing with video sequences is to utilize motion descriptors [1], [2], [3], [4], [5]; however, the formation of Bag-of-Words models does not preserve temporal coherence. A solution to the latter disadvantage can be provided by deep learning architectures, where the acquisition of spatio-temporal contingencies and, thus, the derivation of the respective features are an indispensable goal. Deep learning is an emerging area in the field of machine learning, and its stepping stone is a set of contemporary neuroscience findings concerning the representation of information and the derivation of features. More precisely, the findings in Ref. [6] suggest that the neocortex in mammals does not process sensory input in a shallow manner; rather, the data traverse a composite multi-layered structure of computational units which, through repetition, learn to express the signals according to the consistencies they display. Additionally, the works in Refs. [7], [8] provide evidence for the existence of a common cortical architecture, not only between different modalities such as vision, audition and somatosensation, but also across species. Deep learning models have already been applied to a diversity of recognition tasks, namely object recognition [9], [10], [11], [12], human tracking [13], segmentation [14], brain–computer interaction (BCI) [15] and action recognition [16], [17], [18], exhibiting remarkable performance.
Tensor or multilinear algebra provides a complete mathematical framework for analyzing the multifactor formation of image and video sets. Moreover, it offers methodologies for decomposing such sets and, thus, unfolding their factors or modes. Just as matrices are linear operators over a vector space, tensors constitute multilinear operators over a set of vector spaces. Consequently, linear analysis is a special case of tensor analysis, which offers a generalized mathematical framework appropriate for confronting a plethora of machine learning problems. However, the vectorization of a tensor neglects proximity information, i.e. when a video sample is processed, a patch of pixels is separated from its neighboring ones and, therefore, spatio-temporal coherence is lost. Additionally, the vectorization procedure, besides stacking the rows or columns of the original tensorial samples in an inconsistent manner, also leads to the formation of high-dimensional vectors. Combining this high-dimensional data representation with a small number of data samples leads to small-sample-size problems, which tensors can handle competently [19]. Tensor-based frameworks maintain most of the natural constraints of the data and, thus, are in a position to improve their characterization or classification. Especially when the number of training samples is restricted, these constraints aid the derivation of a reasonable solution. Second- and third-order tensors suffice to describe image and video samples, respectively.
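As a brief illustration of the point above, the following NumPy sketch contrasts full vectorization of a video tensor with a mode-n unfolding that keeps one mode's fibres intact; the clip dimensions and the `unfold` helper are hypothetical, chosen only for illustration.

```python
import numpy as np

# A short grayscale video clip as a third-order tensor:
# (frames, height, width). Sizes are illustrative.
clip = np.random.rand(16, 64, 64)

# Vectorization stacks rows end to end: pixels that are vertical
# neighbours within a frame end up 64 positions apart in the vector,
# and the dimensionality grows to 16 * 64 * 64 = 65536.
vec = clip.reshape(-1)
assert vec.shape == (65536,)

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Unfolding along the temporal mode yields one row per frame, so each
# row is still a coherent spatial slice of the original tensor.
T0 = unfold(clip, 0)
assert T0.shape == (16, 64 * 64)
```

The unfolding is what makes mode-wise decompositions tractable: each mode can be analyzed as an ordinary matrix without first destroying the other modes' structure.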
The motivation of the proposed model rests on three main facts. Firstly, tensor-based approaches avoid over-training, especially in small-sample-size problems [20]; over-training occurs in most deep architectures, and further measures attempting to confine such phenomena must be considered [21]. Secondly, multilinear algebra reduces the computational complexity and the memory requirements [22], [23]; such considerations are frequently encountered in deep learning models, where expensive solutions, namely computing clusters with thousands of machines, are deployed [24]. Last, multilinear units require fewer parameters to be estimated and are thus more efficient than vectorized non-linear ones.
This paper presents an unsupervised deep learning methodology inspired by the HTM notion [25], [26], i.e. a network of computational nodes placed in a tree-shaped hierarchy and divided into distinct levels. Each of these nodes incorporates notions from tensor algebra to infer spatio-temporal features. To the best of our knowledge, this is the first attempt to graft tensor-based approaches onto such architectures. The insight behind this approach is the avoidance of the high-dimensional feature vectors appearing in previous HTM architectures [25], [11], [26], [12], resulting in a method that is more efficient in terms of data representation and classification accuracy; the experimental evaluation justifies this insight. The proposed methodology is an unsupervised multi-level architecture capable of extracting spatio-temporal features from video samples. Every node – similarly to the HTM notion [25], [26] – incorporates two distinct procedures, a spatial and a temporal one, each of which operates in a training and a testing (inference) mode. When the spatial procedure is in training mode, the node treats every frame patch as a second-order tensor and attempts to learn and quantize the input space. This quantization is accomplished by deriving representatives (i.e. quantization centers) through a competitive learning scheme. Then, the node switches to testing mode and expresses each frame patch in terms of similarity to the corresponding representatives. Afterwards, the node initiates the temporal procedure in training mode and, for each frame patch, forms a matrix that contains the temporal changes among the frame patches of the same video sample. These matrices are utilized to assemble temporal clusters through a tensorized extension of a vector-based clustering technique.
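A minimal sketch of such a spatial procedure is given below, assuming a simple winner-take-all competitive update and an exponential similarity; the function names, learning rate and exact update rule are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def train_spatial_node(patches, n_centers=4, lr=0.1, epochs=20, seed=0):
    """Competitive-learning quantization of 2-D frame patches.

    `patches` has shape (n, h, w); each patch is kept as a second-order
    tensor and compared to the centers via the Frobenius norm.
    """
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), n_centers, replace=False)].copy()
    for _ in range(epochs):
        for p in patches:
            # Winner-take-all: move only the closest center towards the patch.
            dists = np.linalg.norm(centers - p, axis=(1, 2))
            w = np.argmin(dists)
            centers[w] += lr * (p - centers[w])
    return centers

def spatial_inference(patch, centers):
    """Express a patch as normalized similarities to the representatives."""
    d = np.linalg.norm(centers - patch, axis=(1, 2))
    sim = np.exp(-d)           # closeness in (0, 1]
    return sim / sim.sum()     # membership vector summing to one

patches = np.random.rand(200, 8, 8)
centers = train_spatial_node(patches)
belief = spatial_inference(patches[0], centers)
```

During inference each patch is thus replaced by a short membership vector over the learned representatives, rather than by the high-dimensional vectorized patch itself.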
Last, the node switches to testing mode, in which every frame patch acquires – through a membership function – a degree of closeness with respect to the derived temporal clusters. Regarding the proposed deep learning model, the levels are trained successively: a level starts its training only after the nodes of the previous level have completed theirs. In addition, an upper-level node uses the concatenated outputs of its children nodes as input and treats those concatenations as a dictionary. The proposed technique surpasses, in terms of classification accuracy, not only previous HTM-based models but also other state-of-the-art methodologies (see Section 4 for details).
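The level-by-level composition can be sketched as follows; the array shapes, the number of children and the helper name are hypothetical, serving only to show how children's outputs are concatenated into a parent's input.

```python
import numpy as np

def parent_input_from_children(children_beliefs):
    """Concatenate the belief vectors of the child nodes.

    `children_beliefs` is a list of (n_samples, k_i) arrays, one per
    child; the concatenation acts as the parent node's dictionary.
    """
    return np.concatenate(children_beliefs, axis=1)

# Hypothetical two-level example: four children, each emitting a
# 4-dimensional belief per sample, feeding one parent node.
children = [np.random.rand(10, 4) for _ in range(4)]
parent_input = parent_input_from_children(children)
```

The parent is trained only on these concatenations, which is consistent with the successive, level-by-level training described above.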
The remainder of this paper is organized as follows: Section 2 presents the most significant work in the deep learning field together with methodologies that exploit multilinear algebra; Section 3 describes the architecture of the proposed deep learning model, including the operation of a single node; the experimental validation of the presented methodology is given in Section 4 and, last, conclusions are drawn in Section 5.
Related work
This section presents, in turn, the most significant methodologies that utilize either deep learning or multilinear algebra.
Proposed methodology
This section describes the proposed deep learning algorithm. It is divided into two subsections, the first presenting the architecture of the entire methodology and the second the functionality of a single node.
Experimental results
This section presents a series of experiments conducted to evaluate the effectiveness of the proposed deep learning scheme. The performance of the presented model is compared, in terms of classification accuracy, with other state-of-the-art methodologies. The comparison was carried out on three publicly available action recognition datasets, viz. Gesture3D [54], KTH [55] and UCF Sports [56].
The proposed deep learning scheme was combined with both the linear SVM classifier
Conclusion
This paper presents an unsupervised deep learning framework capable of deriving spatio-temporal features. This generic framework may be applied to a variety of recognition problems, while the respective experimental validation shows the efficacy of the proposed methodology in action recognition tasks. The presented technique exploits multilinear algebra, maintaining the coherence between successive frames. Each computational node forms second-order representatives according to a deviation
References (63)
- Spatiotemporal bag-of-features for early wildfire smoke detection, Image Vision Comput. (2013)
- Histogram of oriented rectangles: a new pose descriptor for human action recognition, Image Vision Comput. (2009)
- The role of the primary visual cortex in higher level vision, Vis. Res. (1998)
- On the optimization of hierarchical temporal memory, Pattern Recogn. Lett. (2012)
- Higher rank support tensor machines for visual recognition, Pattern Recogn. (2012)
- Sparsity preserving projections with applications to face recognition, Pattern Recogn. (2010)
- Behavior recognition via sparse spatio-temporal features
- On space-time interest points, Int. J. Comput. Vis. (2005)
- Recognizing action at a distance
- An Organizing Principle for Cerebral Function: The Unit Module and the Distributed System (1979)
- Visual behaviour mediated by retinal projections directed to the auditory pathway, Nature
- Unsupervised learning of invariant feature hierarchies with applications to object recognition
- Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning
- Sparse deep-learning algorithm for recognition and categorisation, Electron. Lett.
- Human tracking using convolutional neural networks, Trans. Neural Netw.
- Convolutional networks can learn to generate affinity graphs for image segmentation, Neural Comput.
- Convolutional neural networks for P300 detection with application to brain–computer interfaces, Trans. Pattern Anal. Mach. Intell.
- Convolutional learning of spatio-temporal features
- 3D convolutional neural networks for human action recognition, Trans. Pattern Anal. Mach. Intell.
- A biologically inspired system for action recognition
- Supervised tensor learning, Knowl. Inf. Syst.
- Tensor learning for regression, Trans. Image Process.
- Tensor subspace analysis
- Large scale distributed deep networks
- How the brain might work: a hierarchical and temporal model for learning and recognition
- Towards a mathematical theory of cortical micro-circuits, PLoS Comput. Biol.
- A fast learning algorithm for deep belief nets, Neural Comput.
- Large-scale learning with SVM and convolutional nets for generic object categorization
- Exponential family harmoniums with an application to information retrieval
- Greedy layer-wise training of deep networks
☆ This paper has been recommended for acceptance by Stefanos Zafeiriou.