Applied Soft Computing

Volume 126, September 2022, 109245

Human action prediction in collaborative environments based on shared-weight LSTMs with feature dimensionality reduction

https://doi.org/10.1016/j.asoc.2022.109245

Highlights

  • Human action prediction using LSTM networks.

  • Dimensionality reduction using correlation and autoencoder-inspired MLP.

  • Gaze estimation might improve human action prediction.

  • Experiments verified the approach using motion capture input.

Abstract

As robots progress towards being ubiquitous and an indispensable part of our everyday environments, such as homes, offices, healthcare, education, and manufacturing shop floors, efficient and safe collaboration and cohabitation become imperative. Such environments could therefore benefit greatly from accurate human action prediction. In addition to being accurate, human action prediction should be computationally efficient, in order to ensure a timely reaction, and capable of dealing with changing environments, since unstructured interaction and collaboration with humans usually do not assume static conditions. In this paper, we propose a model for human action prediction based on motion cues and gaze using shared-weight Long Short-Term Memory networks (LSTMs) and feature dimensionality reduction. LSTMs have proven to be a powerful tool in processing time series data, especially when dealing with long-term dependencies; however, to maximize their performance, LSTM networks should be fed with informative and quality inputs. Therefore, we furthermore conducted an extensive input feature analysis based on (i) signal correlation and the strength of signals to act as stand-alone predictors, and (ii) a multilayer perceptron inspired by the autoencoder architecture. We validated the proposed model on the publicly available MoGaze dataset for human action prediction, as well as on a smaller dataset recorded in our laboratory. Our model outperformed alternatives, such as recurrent neural networks, a fully connected LSTM network, and the strongest stand-alone signals (baselines), and can run in real time on a standard laptop CPU. Since eye gaze might not always be available in a real-world scenario, we implemented and tested a multilayer perceptron for gaze estimation from more easily obtainable motion cues, such as head orientation and hand position. The estimated gaze signal can be utilized during inference of our LSTM-based model, thus making our action prediction pipeline suitable for real-time practical applications.

Introduction

With robots becoming more capable and sophisticated, we are witnessing a growth in their presence and integration in private and professional human environments. Nowadays, such environments, besides cohabitation, often include close human–robot collaboration and interaction, yielding novel challenges concerning system efficiency and human safety. While robots are fully controllable, human behavior, although nearly optimal with respect to the task, is inherently stochastic. For example, imagine a healthcare worker treating a patient or a manufacturing shop floor worker assembling products in an agile production system. Their goals are well defined, but the execution and sometimes the environment are not completely controlled. While carrying out the task, the healthcare worker needs to adapt to the responses of the patient, while the worker on a manufacturing shop floor might change the order of task execution for justified reasons. We argue that robots in human proximity should be aware of such changes and react accordingly. With that in mind, one of the main challenges in collaborative environments is to capture the uncertainty and nuances of human behavior. Supervisory systems try to overcome these challenges by taking advantage of the plethora of methods that revolve around human trajectory prediction, safety region assertion and action/goal prediction [1], [2], [3], [4], [5].

The problems of human action prediction and intention recognition have come under the spotlight of the research community in recent years. They serve as independent modules or are integrated into human motion prediction either explicitly [6], [7] or implicitly [8]. The advantage of embedding human intentions implicitly in the model lies in the fact that such models can be trained jointly with the higher-level system and are validated straightforwardly through its performance. The higher-level system could be a fleet management system [9] that tries to reroute robots out of a human’s path and is evaluated by warehouse deliveries, the number of reroutings, and the number of collisions, or a human trajectory prediction model [10] evaluated with the root mean square error of the predicted trajectory. On the other hand, explicitly estimating human actions enables the model to be crafted or trained independently of the higher-level system. In practice, this means that training the action prediction module can be done without the robots operating, thus cutting costs. These models can also be interpreted more easily [11], allowing the higher-level system to reason about the semantic meaning of the performed actions.

In recent years, human action prediction applications have ranged from robotized warehouses [9], [12] to the sedentary object-picking domain [13], [14], [15] and full-body motions [16], [17], [18]. State-of-the-art human action prediction frameworks are based on Markov models [19], inverse optimal control [11] or conditional random fields [20], which try to learn motion patterns with respect to the pertaining goals, usually assuming nearly-optimal human behavior in the observed sequences. In [5] the authors propose a hybrid deep neural network model for human action recognition using action bank features and leveraging the fusion of homogeneous convolutional neural network (CNN) classifiers. Input features are diversified and the authors propose varying the initialization of the neural network weights to ensure classifier diversity. Another approach, based on Long Short-Term Memory networks (LSTMs), is proposed in [21], where the authors craft an end-to-end two-stream attention-based architecture for action recognition in videos that selectively focuses on the effective features of the original input image. They suggest that such an approach resolves the problem of ignoring visual attention by using a correlation network layer that can identify the information loss at each timestamp for the entire video. Furthermore, in [22] the authors leverage a bidirectional LSTM to learn long-term dependencies and use an attention mechanism to boost performance and extract additional high-level selective action-related patterns and cues. Convolutional LSTMs are used in [23] to handle long-duration sequential features with different temporal context information and are compared to a fully connected LSTM. The concept of utilizing shared weights in neural networks was introduced by de Ridder et al. in [24] with a focus on the feature extraction problem, and has since gained traction in transfer learning [25] and physics simulation applications [26]. Regarding collaborative environments, state-of-the-art models infer human actions by measuring different cues captured by wearable (eye gaze [14], [27], [28] or even heart rate and electroencephalography [29]) or non-wearable sensors. The use of non-wearable sensors, such as motion capture systems or RGB cameras, enables the model to capture crucial cues such as gestures [30], emotion [31] and skeletal movement [32], or to estimate eye gaze [33]. In [14], [15], [28], [34], [35] the authors have indicated that eye gaze is a powerful predictor of human action. A good overview of human action prediction methods and their categorization by the type of problem formulation can be found in [36]. Several works embed the eye gaze feature into human action prediction models using machine learning models such as support vector machines [14] or recurrent neural networks (RNNs) [34]. In a human collaborative scenario, the authors of [14] tested their algorithm relying on verbal instructions as additional features for their model, with the actions forming a sequence. In [15] the authors calculate the similarity between hypothetical gaze points on the objects and the actual gaze points and use the nearest neighbor algorithm to classify the intended object. To the best of our knowledge, there does not exist a method that couples the human action prediction model with directly measured eye gaze and human joint positions in a dynamic, changing environment.
For example, in [14] the authors rely on gaze and add verbal commands to the feature space. In [15] the scenario is static and the subject sits while picking objects that are always visible to them. Furthermore, in [36] a multiple-model estimator is leveraged for intention prediction, but the inputs to this model are extracted from a camera using convolutional networks and prior values that are not applicable in the dynamic collaborative domain.

In the last few years, multiple datasets concerning motion and action prediction have become publicly available, but, to the best of our knowledge, none of them couple these two problems. Examples of purely motion prediction datasets are ETH [37], KITTI [38] and UCY [39]. We encourage the reader to examine Table 2 in [40] for a detailed listing of the datasets and their descriptions. These datasets, alongside methods trained and evaluated on them [41], offer enough diverse data to train and test human motion prediction models focused on answering the question “Where is a human going to be during the next N steps?”, but they are not adequately labeled with the context that would help to answer “What is (the goal of) the observed human motion?”. On the other hand, datasets tailored for models focused on the second question, like CMU’s motion capture database [42], HumanEva [43] and G3D [44], excel in action diversity, but they focus on distinguishing between different actions (jumping, catching, throwing), do not incorporate complicated motion patterns, and usually are not long enough for a long- or mid-term human motion prediction problem. The MoGaze [34] dataset positions itself as an excellent blend of the aforementioned datasets because all the recorded motions have a labeled purpose (object picking). A subset of it has already been used by the authors for human motion prediction problems based on RNN networks and trajectory optimization [17], [45]. Therein, they used the Euclidean distance of the right hand to each object as an action prediction signal, improving their original motion prediction result. They also introduced the problem of graspability, which focuses on the exact wrist position at the moment of grasping, and placeability, defined as a probability distribution over possible place locations on a surface the carried object could be placed on. The mentioned models are not evaluated explicitly; instead, the authors compared a higher-level human motion prediction model’s error for different graspability and placeability models, thus validating them implicitly.

In this paper, we propose a novel human action prediction model based on shared-weight LSTM networks [46], a part of which was published in our preliminary work [47]. The novelty of the current paper with respect to [47] lies in (i) an expanded feature dimensionality reduction method, (ii) a new gaze estimation algorithm, (iii) an exhaustive evaluation with an additional quality measure, and (iv) the creation of a novel dataset that validated our approach as a general method for human action recognition. Similarly to related work, our model relies on the positions and orientations of human joints, recorded by a motion capture system, and on eye gaze captured using a wearable device, but with the following contributions: (i) to reduce model complexity, we perform feature extraction through correlation and a multilayer perceptron inspired by the autoencoder architecture, (ii) we propose an architecture based on shared-weight LSTM networks that enables dynamically adding and removing human action goals, which is typical for collaborative environments, and (iii) since eye gaze might not always be available in a real-world scenario, we introduce a neural network-based gaze estimation that serves as an additional input to the proposed method and shows promising results. We have tested our approach on the publicly available MoGaze [34] dataset and published the code with a sample pretrained network. Additionally, we present SubMotion, a simpler dataset that includes six subjects, two female and four male, in object-reaching scenarios similar to those in MoGaze. Our dataset records only the head orientation and hand position, a setup that could be easily applied in a real-world application without adding to workers’ discomfort or costs. We compared the accuracy of the proposed model with alternatives such as an RNN network, a fully connected LSTM network, and the strongest individual signal predictors (baselines), based on the area under the curve (AUC) score of the predicted goal accuracy and the mean squared error (MSE) of the predicted goal location. Our model outperformed all of the baselines and alternative methods in MSE distance on both datasets and had better accuracy on the MoGaze dataset.
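
To make the shared-weight idea concrete, the sketch below is a minimal illustration, not the published implementation: the class name, layer sizes, and per-goal feature layout are our assumptions. It shows how a single LSTM whose weights are shared across all candidate goals can score an arbitrary number of goals, which is what allows goals to be added or removed at run time without retraining.

```python
# Minimal sketch (assumed sizes and names) of a shared-weight LSTM goal scorer.
import torch
import torch.nn as nn


class SharedWeightLSTMPredictor(nn.Module):
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        # One LSTM and one scoring head are reused for every candidate goal.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, goal_sequences: torch.Tensor) -> torch.Tensor:
        # goal_sequences: (num_goals, seq_len, feature_dim), e.g. per-goal cues
        # such as hand-to-goal distance or gaze angle towards the goal.
        _, (h_n, _) = self.lstm(goal_sequences)          # (1, num_goals, hidden_dim)
        scores = self.score_head(h_n.squeeze(0))         # (num_goals, 1)
        return torch.softmax(scores.squeeze(-1), dim=0)  # probability per goal


# Example: 7 candidate objects, 20 past time steps, 5 features per goal.
model = SharedWeightLSTMPredictor(feature_dim=5)
probs = model(torch.randn(7, 20, 5))  # probs.shape == (7,)
```

Because every goal is processed by the same weights, the parameter count is independent of the number of objects in the scene.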

Section snippets

The proposed human action prediction method

Our methodology follows the one published in our preliminary work [47] and is based on shared-weight LSTM networks, feature selection using correlation, and feature extraction based on the autoencoder architecture. The goal of the proposed model is to ascertain which object in the environment the human will pick next. As we mentioned in the introduction, the creation of the MoGaze dataset, with 1435 picking segments including eye gaze, enabled us to craft a data-driven model for
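
As a rough illustration of the two dimensionality-reduction steps named above, the following sketch first keeps features that correlate with the goal label and then compresses them with an autoencoder-inspired MLP whose encoder output feeds the LSTM-based predictor. The threshold, layer sizes, and data shapes are illustrative assumptions, not the exact published procedure.

```python
# Sketch (assumed shapes and thresholds) of correlation-based feature selection
# followed by an autoencoder-inspired MLP for dimensionality reduction.
import numpy as np
import torch
import torch.nn as nn


def select_by_correlation(features: np.ndarray, labels: np.ndarray,
                          threshold: float = 0.2) -> np.ndarray:
    """Keep feature columns whose absolute Pearson correlation with the goal
    label exceeds a threshold, i.e. features that already act as reasonable
    stand-alone predictors."""
    corrs = np.array([np.corrcoef(features[:, i], labels)[0, 1]
                      for i in range(features.shape[1])])
    return np.where(np.abs(corrs) > threshold)[0]


class BottleneckMLP(nn.Module):
    """Autoencoder-inspired MLP: the encoder compresses the selected features
    to a low-dimensional code that is then fed to the LSTM-based predictor."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(),
                                     nn.Linear(32, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Trained with a reconstruction loss; only the encoder is kept afterwards.
        return self.decoder(self.encoder(x))
```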

The novel SubMotion dataset

In order to demonstrate the general applicability of the proposed algorithm, we have recorded our own dataset, which aims to complement the much more comprehensive MoGaze dataset. Unlike the MoGaze dataset, which uses a specialized recording suit and proprietary software to obtain the configuration of the entire human body, our dataset records the positions and orientations of only two joints: the head and the (right) hand. Since it includes only a small subset of human motion features, we dubbed it
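
Since SubMotion records exactly the cues used by the gaze-estimation network mentioned in the introduction (head orientation and hand position), a minimal sketch of such an estimator is given below. The input layout (head quaternion plus 3D hand position) and layer sizes are illustrative assumptions rather than the published architecture.

```python
# Sketch (assumed architecture) of an MLP that estimates a 3D gaze direction
# from easily obtainable motion cues: head orientation and hand position.
import torch
import torch.nn as nn


class GazeEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(7, 64), nn.ReLU(),   # 4 (head quaternion) + 3 (hand position)
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),              # unnormalized gaze direction vector
        )

    def forward(self, head_quat: torch.Tensor, hand_pos: torch.Tensor) -> torch.Tensor:
        x = torch.cat([head_quat, hand_pos], dim=-1)
        gaze = self.net(x)
        return gaze / gaze.norm(dim=-1, keepdim=True)  # unit gaze direction
```

During inference, the estimated gaze can stand in for the measured eye-gaze input of the action prediction model when an eye tracker is not available.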

Experimental results

In this section we present and discuss the results of the proposed method on two datasets: MoGaze and SubMotion. Unlike our previous work in [47], where we trained a model on data belonging to one half of the subjects and performed testing on the other half, in this paper we decided to train a unique model for each subject. Such a decision was motivated by observing different motion patterns and data capture quality between subjects, which manifested mostly in the eye gaze. Also, as body
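
For reference, the evaluation measures used here can be computed along the following lines. This is one possible reading (prediction accuracy tracked over the normalized duration of a picking segment, and squared Euclidean error of the predicted goal position); function and variable names are illustrative.

```python
# Sketch (one possible reading) of the AUC and MSE evaluation measures.
import numpy as np
from sklearn.metrics import auc


def accuracy_auc(accuracy_per_time: np.ndarray) -> float:
    """Area under the accuracy-vs-normalized-time curve of a picking segment."""
    t = np.linspace(0.0, 1.0, len(accuracy_per_time))
    return auc(t, accuracy_per_time)


def goal_location_mse(pred_locations: np.ndarray, true_location: np.ndarray) -> float:
    """Mean squared Euclidean distance between predicted and true goal positions."""
    return float(np.mean(np.sum((pred_locations - true_location) ** 2, axis=-1)))
```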

Conclusion

In this paper we have introduced a human action prediction framework based on shared-weight LSTM networks and feature dimensionality reduction. The idea behind our framework was to enable a supervisory system or a robot to have a timely and efficient reaction to accurately inferred human actions. For this paper, we decided to focus on the object picking problem, where we strived to predict which object in the scene the human is going to pick next since this represents a strong proxy of typical

CRediT authorship contribution statement

Tomislav Petković: Methodology, Software, Data Curation. Luka Petrović: Formal analysis, Validation. Ivan Marković: Conceptualization, Supervision. Ivan Petrović: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research has been supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS).

References (64)

  • Trombetta, D., et al., Variable structure human intention estimator with mobility and vision constraints as model selection criteria, Mechatronics (2021).

  • Chen, J., et al., Driver identification based on hidden feature extraction by using adaptive nonnegativity-constrained autoencoder, Appl. Soft Comput. (2019).

  • da Silva, M.V., et al., Human action recognition in videos based on spatiotemporal features and bag-of-poses, Appl. Soft Comput. (2020).

  • Luo, R., et al., A framework for unsupervised online human reaching motion recognition and early prediction.

  • Ding, H., et al., Human arm motion modeling and long-term prediction for safe and efficient human-robot-interaction.

  • Li, Q., et al., Data driven models for human motion prediction in human-robot collaboration, IEEE Access (2020).

  • Petković, T., et al., Human motion prediction framework for safe flexible robotized warehouses.

  • Mainprice, J., et al., Predicting human reaching motion in collaborative tasks using inverse optimal control and iterative re-planning.

  • Petković, T., et al., Human intention recognition for human aware planning in integrated warehouse systems (2020).

  • Schydlo, P., et al., Anticipation in human-robot cooperation: A recurrent neural network approach for multiple action sequences prediction.

  • Huang, C.-M., et al., Using gaze patterns to predict task intent in collaboration, Front. Psychol. (2015).

  • Shi, L., et al., What are you looking at? Detecting human intention in gaze based human-robot interaction (2019).

  • Li, Y., et al., Online human action detection using joint classification-regression recurrent neural networks.

  • Kratzer, P., et al., Anticipating human intention for full-body motion prediction in object grasping and placing tasks.

  • Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N., View adaptive recurrent neural networks for high performance...

  • Kelley, R., et al., Understanding human intentions via hidden Markov models in autonomous mobile robots.

  • Wang, S.B., et al., Hidden conditional random fields for gesture recognition.

  • de Ridder, D., et al., Feature extraction in shared weights neural networks.

  • Rozantsev, A., et al., Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. (2018).

  • Liu, M., et al., SingleNN: Modified Behler–Parrinello neural network with shared weights for atomistic simulations with transferability, J. Phys. Chem. C (2020).

  • Shi, L., et al., GazeEMD: Detecting visual intention in gaze-based human-robot interaction, Robotics (2021).

  • Bader, T., et al., Multimodal integration of natural gaze behavior for intention recognition during object manipulation.