Applied Soft Computing

Volume 126, September 2022, 109245

Human action prediction in collaborative environments based on shared-weight LSTMs with feature dimensionality reduction

https://doi.org/10.1016/j.asoc.2022.109245

Highlights

  • Human action prediction using LSTM networks.

  • Dimensionality reduction using correlation and autoencoder-inspired MLP.

  • Gaze estimation might improve human action prediction.

  • Experiments verified the approach using motion capture input.

Abstract

As robots progress towards being ubiquitous and an indispensable part of our everyday environments, such as homes, offices, healthcare, education, and manufacturing shop floors, efficient and safe collaboration and cohabitation become imperative. Such environments could therefore benefit greatly from accurate human action prediction. In addition to being accurate, human action prediction should be computationally efficient, in order to ensure a timely reaction, and capable of dealing with changing environments, since unstructured interaction and collaboration with humans usually do not assume static conditions. In this paper, we propose a model for human action prediction based on motion cues and gaze using shared-weight Long Short-Term Memory networks (LSTMs) and feature dimensionality reduction. LSTMs have proven to be a powerful tool in processing time series data, especially when dealing with long-term dependencies; however, to maximize their performance, LSTM networks should be fed with informative and quality inputs. Therefore, we furthermore conducted an extensive input feature analysis based on (i) signal correlation and the strength of signals to act as stand-alone predictors, and (ii) a multilayer perceptron inspired by the autoencoder architecture. We validated the proposed model on the publicly available MoGaze dataset for human action prediction, as well as on a smaller dataset recorded in our laboratory. Our model outperformed alternatives, such as recurrent neural networks, a fully connected LSTM network, and the strongest stand-alone signals (baselines), and can run in real time on a standard laptop CPU. Since eye gaze might not always be available in a real-world scenario, we implemented and tested a multilayer perceptron for gaze estimation from more easily obtainable motion cues, such as head orientation and hand position. The estimated gaze signal can be utilized during inference of our LSTM-based model, thus making our action prediction pipeline suitable for real-time practical applications.

Introduction

With robots becoming more capable and sophisticated, we are witnessing a growth in their presence and integration in private and professional human environments. Nowadays, such environments, besides cohabitation, often include close human–robot collaboration and interaction, yielding novel challenges concerning system efficiency and human safety. While robots are fully controllable, human behavior, although nearly optimal with respect to the task, is inherently stochastic. For example, imagine a healthcare worker treating a patient or a manufacturing shop floor worker assembling products in an agile production system. Their goals are well defined, but the execution and sometimes the environment are not completely controlled. While carrying out the task, the healthcare worker needs to adapt to the responses of the patient, while the worker on a manufacturing shop floor might change the order of task execution for justified reasons. We argue that robots in human proximity should be aware of such changes and react accordingly. With that in mind, one of the main challenges in collaborative environments is to capture the uncertainty and nuances of human behavior. Supervisory systems try to overcome these challenges by taking advantage of the plethora of methods that revolve around human trajectory prediction, safety region assertion and action/goal prediction [1], [2], [3], [4], [5].

The problems of human action prediction and intention recognition have come under the spotlight of the research community in recent years. They serve as independent modules or are integrated into human motion prediction either explicitly [6], [7] or implicitly [8]. The advantage of embedding human intentions implicitly in the model lies in the fact that such models can be trained jointly with the higher-level system and are validated straightforwardly through its performance. The higher-level system could be a fleet management system [9] that tries to reroute robots out of a human’s path and is evaluated by warehouse deliveries, the number of reroutings, and the number of collisions, or a human trajectory prediction model [10] evaluated with the root mean square error of the predicted trajectory. On the other hand, explicitly estimating human actions enables the model to be crafted or trained independently of the higher-level system. In practice, this means that training the action prediction module can be done without the robots operating, thus cutting costs. These models can also be interpreted more easily [11], allowing the higher-level system to reason about the semantic meaning of the performed actions.

In recent years, human action prediction applications have ranged from robotized warehouses [9], [12] to the sedentary object-picking domain [13], [14], [15] and full-body motions [16], [17], [18]. State-of-the-art human action prediction frameworks are based on Markov models [19], inverse optimal control [11] or conditional random fields [20], which try to learn motion patterns with respect to the pertaining goals, usually assuming nearly-optimal human behavior in the observed sequences. In [5] the authors propose a hybrid deep neural network model for human action recognition using action bank features and leveraging the fusion of homogeneous convolutional neural network (CNN) classifiers. Input features are diversified and the authors propose varying the initialization of the neural network weights to ensure classifier diversity. Another approach, based on Long Short-Term Memory networks (LSTMs), is proposed in [21], where the authors craft an end-to-end two-stream attention-based architecture for action recognition in videos that selectively focuses on the effective features of the original input image. They suggest that such an approach resolves the problem of ignoring visual attention by using a correlation network layer that can identify the information loss at each timestamp for the entire video. Furthermore, in [22] the authors leverage a bidirectional LSTM to learn long-term dependencies and use an attention mechanism to boost performance and extract additional high-level selective action-related patterns and cues. Convolutional LSTMs are used in [23] to handle long-duration sequential features with different temporal context information and are compared to a fully connected LSTM. The concept of utilizing shared weights in neural networks was introduced by de Ridder et al. in [24] with a focus on the feature extraction problem, and has since gained traction in transfer learning [25] and physics simulation applications [26]. Regarding collaborative environments, state-of-the-art models infer human actions by measuring different cues captured by wearable (eye gaze [14], [27], [28] or even heart rate and electroencephalography [29]) or non-wearable sensors. The use of non-wearable sensors, such as motion capture systems or RGB cameras, enables the model to capture crucial cues such as gestures [30], emotion [31] and skeletal movement [32], or to estimate eye gaze [33]. In [14], [15], [28], [34], [35] the authors have indicated that eye gaze is a powerful predictor of human action. A good overview of human action prediction methods and their categorization by the type of problem formulation can be found in [36]. Several works embed the eye gaze feature into human action prediction models using machine learning models such as support vector machines [14] or recurrent neural networks (RNNs) [34]. In a human collaborative scenario, the authors of [14] tested their algorithm relying on verbal instructions as additional features for their model, with the actions forming a sequence. In [15] the authors calculate the similarity between hypothetical gaze points on the objects and the actual gaze points and use the nearest neighbor algorithm to classify the intended object. To the best of our knowledge, there does not exist a method that couples the human action prediction model with directly measured eye gaze and human joint positions in a dynamic, changing environment.
For example, in [14] the authors rely on gaze and add verbal commands to the feature space. In [15] the scenario is static and the subject sits while picking objects that are always visible to them. Furthermore, in [36] a multiple-model estimator is leveraged for intention prediction, but the inputs to this model are extracted from a camera using convolutional networks and prior values that are not applicable in the dynamic collaborative domain.

In the last few years, multiple datasets concerning motion and action prediction have become publicly available, but, to the best of our knowledge, none of them couple these two problems. Examples of purely motion prediction datasets are ETH [37], KITTI [38] and UCY [39]. We encourage the reader to examine Table 2 in [40] for a detailed listing of the datasets and their descriptions. These datasets, alongside methods trained and evaluated on them [41], offer enough diverse data to train and test human motion prediction models focused on answering the question “Where is a human going to be during the next N steps?”, but they are not adequately labeled with the context that would help to answer “What is (the goal of) the observed human motion?”. On the other hand, datasets tailored for models focused on the second question, like CMU’s motion capture database [42], HumanEva [43] and G3D [44], excel in action diversity, but they focus on distinguishing between different actions (jumping, catching, throwing), do not incorporate complicated motion patterns, and usually are not long enough for a long- or mid-term human motion prediction problem. The MoGaze [34] dataset positions itself as an excellent blend of the aforementioned datasets because all the recorded motions have a labeled purpose (object picking). A subset of it has already been used by the authors for human motion prediction problems based on RNN networks and trajectory optimization [17], [45]. Therein, they used the Euclidean distance of the right hand to each object as an action prediction signal, improving their original motion prediction result. They also introduced the problem of graspability, which focuses on the exact wrist position at the moment of grasping, and placeability, defined as a probability distribution over possible place locations on a surface the carried object could be placed on. The mentioned models are not evaluated explicitly; instead, the authors compared a higher-level human motion prediction model’s error for different graspability and placeability models, thus validating them implicitly.

In this paper, we propose a novel human action prediction model based on shared-weight LSTM networks [46], a part of which was published in our preliminary work [47]. The novelty of the current paper with respect to [47] lies in (i) an expanded feature dimensionality reduction method, (ii) a new gaze estimation algorithm, (iii) an exhaustive evaluation with an additional quality measure, and (iv) the creation of a novel dataset that validated our approach as a general method for human action recognition. Similarly to related work, our model relies on the positions and orientations of human joints, recorded by a motion capture system, and on eye gaze captured using a wearable device, but with the following contributions: (i) to reduce model complexity, we perform feature extraction through correlation and a multilayer perceptron inspired by the autoencoder architecture, (ii) we propose an architecture based on shared-weight LSTM networks that enables dynamically adding and removing human action goals, which is typical for collaborative environments, and (iii) since eye gaze might not always be available in a real-world scenario, we introduce a neural network-based gaze estimation that serves as an additional input to the proposed method and shows promising results. We have tested our approach on the publicly available MoGaze [34] dataset and published the code with a sample pretrained network. Additionally, we present SubMotion, a simpler dataset that includes six subjects, two female and four male, in object-reaching scenarios similar to those in MoGaze. Our dataset records only the head orientation and hand position, a setup that could be easily applied in a real-world application without adding to workers’ discomfort or costs. We compared the accuracy of the proposed model with alternatives such as an RNN network, a fully connected LSTM network, and the strongest individual signal predictors (baselines), based on the area under the curve (AUC) score of the predicted goal accuracy and the mean squared error (MSE) of the predicted goal location. Our model outperformed all of the baselines and alternative methods in MSE distance on both datasets and had better accuracy on the MoGaze dataset.
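
To make the shared-weight idea concrete, the sketch below is a minimal illustration, not the published implementation: the class name, layer sizes, and per-goal feature layout are our assumptions. It shows how a single LSTM whose weights are shared across all candidate goals can score an arbitrary number of goals, which is what allows goals to be added or removed at run time without retraining.

```python
# Minimal sketch (assumed sizes and names) of a shared-weight LSTM goal scorer.
import torch
import torch.nn as nn


class SharedWeightLSTMPredictor(nn.Module):
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        # One LSTM and one scoring head are reused for every candidate goal.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, goal_sequences: torch.Tensor) -> torch.Tensor:
        # goal_sequences: (num_goals, seq_len, feature_dim), e.g. per-goal cues
        # such as hand-to-goal distance or gaze angle towards the goal.
        _, (h_n, _) = self.lstm(goal_sequences)          # (1, num_goals, hidden_dim)
        scores = self.score_head(h_n.squeeze(0))         # (num_goals, 1)
        return torch.softmax(scores.squeeze(-1), dim=0)  # probability per goal


# Example: 7 candidate objects, 20 past time steps, 5 features per goal.
model = SharedWeightLSTMPredictor(feature_dim=5)
probs = model(torch.randn(7, 20, 5))  # probs.shape == (7,)
```

Because every goal is processed by the same weights, the parameter count is independent of the number of objects in the scene.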

Section snippets

The proposed human action prediction method

Our methodology follows the one published in our preliminary work [47] and is based on shared-weight LSTM networks, feature selection using correlation, and feature extraction based on the autoencoder architecture. The goal of the proposed model is to ascertain which object in the environment the human will pick next. As we mentioned in the introduction, the creation of the MoGaze dataset, with 1435 picking segments including eye gaze, enabled us to craft a data-driven model for
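
As a rough illustration of the two dimensionality-reduction steps named above, the following sketch first keeps features that correlate with the goal label and then compresses them with an autoencoder-inspired MLP whose encoder output feeds the LSTM-based predictor. The threshold, layer sizes, and data shapes are illustrative assumptions, not the exact published procedure.

```python
# Sketch (assumed shapes and thresholds) of correlation-based feature selection
# followed by an autoencoder-inspired MLP for dimensionality reduction.
import numpy as np
import torch
import torch.nn as nn


def select_by_correlation(features: np.ndarray, labels: np.ndarray,
                          threshold: float = 0.2) -> np.ndarray:
    """Keep feature columns whose absolute Pearson correlation with the goal
    label exceeds a threshold, i.e. features that already act as reasonable
    stand-alone predictors."""
    corrs = np.array([np.corrcoef(features[:, i], labels)[0, 1]
                      for i in range(features.shape[1])])
    return np.where(np.abs(corrs) > threshold)[0]


class BottleneckMLP(nn.Module):
    """Autoencoder-inspired MLP: the encoder compresses the selected features
    to a low-dimensional code that is then fed to the LSTM-based predictor."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(),
                                     nn.Linear(32, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Trained with a reconstruction loss; only the encoder is kept afterwards.
        return self.decoder(self.encoder(x))
```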

The novel SubMotion dataset

In order to demonstrate the general applicability of the proposed algorithm, we have recorded our own dataset, which aims to complement the much more comprehensive MoGaze dataset. Unlike the MoGaze dataset, which uses a specialized recording suit and proprietary software to obtain the configuration of the entire human body, our dataset records the positions and orientations of only two joints: the head and the (right) hand. Since it includes only a small subset of human motion features, we dubbed it
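
Since SubMotion records exactly the cues used by the gaze-estimation network mentioned in the introduction (head orientation and hand position), a minimal sketch of such an estimator is given below. The input layout (head quaternion plus 3D hand position) and layer sizes are illustrative assumptions rather than the published architecture.

```python
# Sketch (assumed architecture) of an MLP that estimates a 3D gaze direction
# from easily obtainable motion cues: head orientation and hand position.
import torch
import torch.nn as nn


class GazeEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(7, 64), nn.ReLU(),   # 4 (head quaternion) + 3 (hand position)
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),              # unnormalized gaze direction vector
        )

    def forward(self, head_quat: torch.Tensor, hand_pos: torch.Tensor) -> torch.Tensor:
        x = torch.cat([head_quat, hand_pos], dim=-1)
        gaze = self.net(x)
        return gaze / gaze.norm(dim=-1, keepdim=True)  # unit gaze direction
```

During inference, the estimated gaze can stand in for the measured eye-gaze input of the action prediction model when an eye tracker is not available.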

Experimental results

In this section we present and discuss the results of the proposed method on two datasets: MoGaze and SubMotion. Unlike our previous work in [47], where we trained a model on data belonging to one half of the subjects and performed testing on the other half, in this paper we decided to train a unique model for each subject. Such a decision was motivated by observing different motion patterns and data capture quality between subjects, which manifested mostly in the eye gaze. Also, as body
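
For reference, the evaluation measures used here can be computed along the following lines. This is one possible reading (prediction accuracy tracked over the normalized duration of a picking segment, and squared Euclidean error of the predicted goal position); function and variable names are illustrative.

```python
# Sketch (one possible reading) of the AUC and MSE evaluation measures.
import numpy as np
from sklearn.metrics import auc


def accuracy_auc(accuracy_per_time: np.ndarray) -> float:
    """Area under the accuracy-vs-normalized-time curve of a picking segment."""
    t = np.linspace(0.0, 1.0, len(accuracy_per_time))
    return auc(t, accuracy_per_time)


def goal_location_mse(pred_locations: np.ndarray, true_location: np.ndarray) -> float:
    """Mean squared Euclidean distance between predicted and true goal positions."""
    return float(np.mean(np.sum((pred_locations - true_location) ** 2, axis=-1)))
```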

Conclusion

In this paper we have introduced a human action prediction framework based on shared-weight LSTM networks and feature dimensionality reduction. The idea behind our framework was to enable a supervisory system or a robot to have a timely and efficient reaction to accurately inferred human actions. For this paper, we decided to focus on the object picking problem, where we strived to predict which object in the scene the human is going to pick next since this represents a strong proxy of typical

CRediT authorship contribution statement

Tomislav Petković: Methodology, Software, Data Curation. Luka Petrović: Formal analysis, Validation. Ivan Marković: Conceptualization, Supervision. Ivan Petrović: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research has been supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS).

References (64)

  • Trombetta, D., et al., Variable structure human intention estimator with mobility and vision constraints as model selection criteria, Mechatronics (2021).

  • Chen, J., et al., Driver identification based on hidden feature extraction by using adaptive nonnegativity-constrained autoencoder, Appl. Soft Comput. (2019).

  • da Silva, M.V., et al., Human action recognition in videos based on spatiotemporal features and bag-of-poses, Appl. Soft Comput. (2020).

  • Luo, R., et al., A framework for unsupervised online human reaching motion recognition and early prediction.

  • Ding, H., et al., Human arm motion modeling and long-term prediction for safe and efficient human-robot-interaction.

  • Li, Q., et al., Data driven models for human motion prediction in human-robot collaboration, IEEE Access (2020).

  • Petković, T., et al., Human motion prediction framework for safe flexible robotized warehouses.

  • Mainprice, J., et al., Predicting human reaching motion in collaborative tasks using inverse optimal control and iterative re-planning.

  • Petković, T., et al., Human intention recognition for human aware planning in integrated warehouse systems (2020).

  • Schydlo, P., et al., Anticipation in human-robot cooperation: A recurrent neural network approach for multiple action sequences prediction.

  • Huang, C.-M., et al., Using gaze patterns to predict task intent in collaboration, Front. Psychol. (2015).

  • Shi, L., et al., What are you looking at? Detecting human intention in gaze based human-robot interaction (2019).

  • Li, Y., et al., Online human action detection using joint classification-regression recurrent neural networks.

  • Kratzer, P., et al., Anticipating human intention for full-body motion prediction in object grasping and placing tasks.

  • Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N., View adaptive recurrent neural networks for high performance...

  • Kelley, R., et al., Understanding human intentions via hidden Markov models in autonomous mobile robots.

  • Wang, S.B., et al., Hidden conditional random fields for gesture recognition.

  • de Ridder, D., et al., Feature extraction in shared weights neural networks.

  • Rozantsev, A., et al., Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. (2018).

  • Liu, M., et al., SingleNN: Modified Behler–Parrinello neural network with shared weights for atomistic simulations with transferability, J. Phys. Chem. C (2020).

  • Shi, L., et al., GazeEMD: Detecting visual intention in gaze-based human-robot interaction, Robotics (2021).

  • Bader, T., et al., Multimodal integration of natural gaze behavior for intention recognition during object manipulation.