1 Introduction

The coexistence of humans and automated vehicles in production spaces is expanding. While vehicles previously operated only within designated spaces and corridors, the operating boundaries are becoming less restrictive with the emergence of cobots. In contrast to robots, cobots are developed to collaborate either directly with human operators on a specific task or to physically interact with humans in a shared workspace. A cobot can be designed for tasks with various levels of autonomy and complexity. The specific type of robot–human interaction considered in this paper is cobots with the capability of sharing a dynamic workspace without a common task [9]. More specifically, the cobots in this case are autonomous mobile robots (AMR) tasked to move goods within an industrial environment. Navigation of robots in such spaces considers not only robot perception and navigation but also human–robot interaction and human behaviour analysis and modelling, as a basis for the prediction [17]. For AMR planning and operation, human behaviour analysis and modelling requires the ability to anticipate future human movement and can enhance the efficiency and safety of path planning. This is naturally a challenging task, as humans invariably exhibit unpredictability in their movements.

Assuming that the nature of the movement is determined by internal human motivation towards a goal, methods which seek to account for the intention in the human movement [12, 25] have been developed. Addressing movement in outdoor environments, they involve intentions such as starting, crossing or stopping. However, the set of possible intentions is defined by the context of the application scenario, and the reported outdoor contexts, which focus on driver assistance systems or driverless operation of vehicles, concern a limited set of intentions, for example regarding crossing a street, especially close to intersections. These intentions are not sufficiently representative of intentions in working environments. For example, human indoor workspace movement intentions can be motivated by specific work environment goals. Such goals can be part of a work process and can therefore be somewhat well-defined for a given industrial environment. However, they differ from those in reported works on outdoor spaces and may differ from one environment to another. Furthermore, they may also differ for the same work environment, when operation sequences or business processes change. Therefore, the development of human movement prediction based on intent recognition is not sufficiently served by works focussed on pedestrian movement in outdoor spaces, and needs to be further developed for industrial work environment contexts. While such contexts could be inferred by continuous monitoring and tracking of human activity with the explicit aim to predict the movement of each monitored person, this may easily verge towards privacy breaches. It is therefore of interest to understand if human movement prediction in indoor workspaces could be achieved via methods which do not rely on personalised tracking of individuals and their specific intentions.

This paper addresses the problem of human movement prediction for indoor workspaces in such a way that the specific individual’s movement is not explicitly considered; instead, the workspace is seen as a heatmap occupancy grid where human presence is anonymised. Specifically, the aim is to investigate the extent to which future human movement prediction could be achieved based on a historical record of sequences of human presence in workspaces. The training of such data-driven predictive models could allow them to develop internal representations that capture movement intention patterns for a given workspace without explicit personalised human tracking. In a workspace, many workers are simultaneously working on their own (different) tasks and interacting with each other within a limited physical space. A shared workspace also adds cobots into the mix and thereby further increases the complexity of the setting. To account for the spatiotemporal nature of human movement in such workspaces, the human movement prediction is posed as the spatiotemporal problem of predicting human presence in a grid cell (spatial dimension) at a specific future time point (temporal dimension). Given historical records of occupancy sequences in a heatmap grid format, the development of such human motion prediction can enable socially-aware AMR navigation in shared workspaces. The historical records are records of occupancy in the grid, showing movable (robots, humans) and immovable entities (e.g. walls, obstacles, production machines, etc.). Specifically, given a sequence of occupancy grid instances, the goal is to project the future sequence of human occupancy for each cell in the grid.

While there is substantial research on socially-aware robot navigation for outdoor spaces, there is limited work targeting the problem of human movement prediction in industrial workspaces. Models suitable to capture the complex spatial and temporal dependencies in the data are, for example, convolutional neural networks (CNN) or recurrent neural networks (RNN). Such models can be leveraged to take the spatiotemporal information into account and are frequently used in the literature for human movement predictions [23]. In this paper, however, human movement is predicted using a graph-based approach. By representing the problem using graphs, the spatial dependencies can directly be captured by the topology of a graph. Graph neural networks (GNN) use the concept of message passing, meaning that each node representation is updated with the message received from its connected neighbours [20]. A GNN can be combined with RNNs or CNNs, where the GNN captures the spatial dynamics and the RNNs or CNNs model the temporal dependency. Such spatiotemporal graph neural networks have been successfully applied to traffic speed forecasting, COVID-19 forecasting and trajectory prediction [31]. Graph neural networks have properties that render them suitable for spatiotemporal problems, yet such graph-based models have not been applied to predicting human motion in indoor workspaces in the literature. Given that graph neural networks have performed well in other spatiotemporal problems, e.g. traffic forecasting [13] and pedestrian trajectory prediction [16], the aim of this paper is to explore the extent to which graph-based neural networks can be used to predict human presence in the context of shared human–robot workspaces.

The following section discusses the background of the problem and relevant literature. Thereafter, Sect. 3 describes the methodology, including the necessary data preparation for a graph-based approach, introduces the specific graph neural network approach and elaborates on the training and evaluation process. Section 4 presents the obtained results, employing data derived from simulations emulating real environments, whereas Sect. 5 discusses these results, alongside their implications and limitations. Section 6 is the conclusion.

2 Background and related work

Human movement prediction is a spatiotemporal problem. In order to predict where a human will be at a specific time in the future, it is important to understand what motivates and influences the movement. Movement generally follows complex, nonlinear patterns. Humans are usually driven by their inner motivation towards some goal intent. However, along their path they are influenced by social and physical environment constraints [15]. For example, obstacles such as walls are physical constraints. Social constraints can be social rules, norms and the actions of surrounding agents. These factors are generally not directly observable but could potentially be inferred from data or directly modelled from the context.

Overall, the human movement prediction problem is typically handled through three different modelling approaches: physics-based, planning-based and pattern-based methods [23]. Physics-based approaches are based on Newton’s laws of motion. A set of equations is explicitly defined and employed to model the future human movement. A popular example of this is the social force model by Helbing and Molnar [8], which models attraction and repulsion forces from other agents and obstacles [10, 11]. Based on the physics of movement, likely human trajectories can be predicted and robot path planning can produce sequences of waypoints which jointly meet robot movement targets, while maintaining sufficient distance from humans [10, 11]. Alternatively, and in contrast to physics-based models, planning-based methods explicitly reason on the longer-term goals of an agent’s movement. Planning-based methods assume the rationality of a human and thus consider the impact of current actions on the future [32]. These models compute path hypotheses that would enable an agent to reach its goals by minimising the total cost of a sequence of motions, and thus not the cost of one action in isolation. Both physics- and planning-based approaches have disadvantages that make them less applicable to the human movement prediction problem. For example, physics-based models may generally struggle to capture real-world complexity. Furthermore, with planning-based models, the computational time increases exponentially with the prediction horizon, the size of the environment and the number of agents, which makes it hard to implement a planning-based model in complex situations such as a manufacturing floor.

Pattern-based approaches do not, in general, suffer from the same disadvantages and are therefore frequently reported in the literature. A pattern-based approach learns human movement from data of historical trajectories. This approach can discover behavioural patterns and make predictions about future human movement. If the shared workspace is quantised in the form of an occupancy grid, with each state in the grid being either occupied or not, the human movement problem can be seen as a transition modelling problem, and can then be modelled with hidden Markov models (HMM), which can be appropriate for modelling state transitions [14]. While HMMs can thus capture the temporal relationship in the data, further customisation is needed to achieve the same for the spatial relationships.

In the human trajectory prediction literature, CNNs and RNNs are frequently applied. These models are suitable for spatiotemporal problems. For example, Bartoli et al. [3] designed an LSTM model that is ‘context-aware’ and can learn and predict human motion in crowded spaces such as a sidewalk. Moreover, Zhang et al. [34] use an encoder-decoder architecture where the LSTM model is the fundamental block that captures the temporal structure of a trajectory. Nikhil and Morris [19] show that CNNs are also applicable and even argue that a CNN is superior to recurrent neural networks for trajectory prediction, since the high spatiotemporal correlation can be learned more efficiently due to the computational efficiency of the convolution operation. However, most frequently RNNs and CNNs are combined to take advantage of the specific strengths of both architectures. Generally, a CNN is used to capture the spatial dimension while the temporal dimension is captured by an RNN. For example, Zhao et al. [36] follow this concept by using a CNN to capture scene features and an LSTM to fuse trajectories and features.

As an alternative to CNNs and/or RNNs, a graph-based approach can be taken since many types of data can naturally be represented as a graph. A graph is a set of entities (nodes) and the relations (edges) between them, where each node, edge and/or the entire graph can store information [26]. GNNs are neural networks that can operate on graph data and work very well on data with high spatial dependencies, such as human movement data. The spatial dependencies are directly captured by the topology of the graph. A GNN learns through the message-passing formalism. The edge and node attributes of a graph generate compressed representations (messages). These messages are passed between nodes based on the message-passing rule, after which they are used to update the node and/or edge representations [22]. In this way, a GNN can capture the interactions between multiple entities and can help find relations and complicated patterns embedded in the data. The most prevalent and well-performing architecture in the literature is the graph convolutional network (GCN), which generalises the concept of convolution to graphs.
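To make the message-passing idea concrete, the following is a minimal sketch under stated assumptions: messages are simply the raw feature vectors, aggregation is an unweighted mean over a node and its neighbours, and there are no learned parameters. The function name `message_passing_step` and the toy graph are illustrative and are not taken from the cited works.

```python
from typing import Dict, List


def message_passing_step(
    features: Dict[int, List[float]],
    neighbours: Dict[int, List[int]],
) -> Dict[int, List[float]]:
    """One round of mean-aggregation message passing (illustrative, no learned weights)."""
    updated = {}
    for node, own in features.items():
        # Assumption: each node receives the features of itself and its neighbours.
        senders = [node] + neighbours.get(node, [])
        updated[node] = [
            sum(features[s][k] for s in senders) / len(senders)
            for k in range(len(own))
        ]
    return updated


# Toy path graph 0 - 1 - 2 with one scalar feature per node.
features = {0: [3.0], 1: [0.0], 2: [0.0]}
neighbours = {0: [1], 1: [0, 2], 2: [1]}

after_one = message_passing_step(features, neighbours)
after_two = message_passing_step(after_one, neighbours)
```

In this toy example, node 2 is unaffected by node 0 after one round and only receives its influence after the second round, which illustrates why stacking layers widens the neighbourhood that each node representation depends on.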

The GNN concept can be extended to handle sequential data and consider both spatial and temporal dependencies. A spatiotemporal graph neural network (STGNN) fuses the conceptual ideas of graph representation learning and temporal deep learning [33]. An STGNN considers the spatial and temporal dependencies at the same time [31], by performing message passing at each time point using the GNN and incorporating the new temporal information using a temporal deep learning block. The temporal and spatial layers are jointly trained in a single machine learning model by exploiting the property that the STGNN model is end-to-end differentiable [22]. Generally, with STGNN models graph convolutions are used to capture the spatial dependence, whereas RNNs or CNNs are used to model the temporal dependency [31]. For example, Vemula et al. [28] make socially-aware predictions using a combination of graph neural networks and attention to capture the relative importance of each person in the crowd to predict human movement. Eiffert and Sukkarieh [6] build upon this work by predicting trajectories while accounting for the interaction of an agent with a robot’s path. Similar to Vemula et al. [28], they use a graph-based approach where each node represents an agent and the edge represents the interaction between them. The STGNN model developed by Eiffert and Sukkarieh [6] is successfully applied to a variety of datasets including crowds of pedestrians interacting with vehicles and bicycles, and livestock interacting with a robot. Alternatively, Choi et al. [5] use edge-level predictions on spatiotemporal graphs to make goal-oriented predictions. By leveraging a graph-based approach, their model incorporates the interactions between agents and the interactions with the environment. However, all the above graph-based approaches model the agents as nodes, with the edges capturing the relationship between them. This approach requires that the agents can be accurately tracked over time. Therefore, it is not suitable in the scenario where agents are indistinguishable in the data, making it hard to observe the individual trajectories.

An alternative graph-based approach, used in traffic forecasting, is to model each node as a specific location in the environment and to predict the number of agents at each location. Li et al. [13] introduced this modelling approach and captured the spatial features using random walks on graphs. Thereafter, these spatial features are connected with the temporal features using an encoder-decoder architecture. Following this work, Zhao et al. [35] introduced a spatiotemporal graph neural network that outperformed the state-of-the-art baselines. Zhao et al. [35] use GCNs to capture the spatial dependencies and gated recurrent units (GRU) to capture the temporal dependencies. Building upon this work, Bai et al. [2] improve the predictive capabilities by adding an attention framework to this graph-based architecture. The attention mechanism adjusts for the importance of different time points and incorporates global temporal information. Bai et al. [2] note good short- and long-term prediction capabilities and show that this spatiotemporal graph neural network excels at capturing the spatial and temporal dependencies simultaneously. As the traffic forecasting problem can naturally be extended to a human presence prediction problem, these approaches are suitable candidates for the posed problem.

Overall, the analysed literature at large does not focus on human movement prediction in a workspace but on pedestrians’ movement on public roads [23]. A workspace is a more structured and less dynamic environment and could have restrictions and structures which are not seen in public spaces. Furthermore, the literature about graph-based models in human movement prediction is relatively limited and graph-based models have not been applied in a manufacturing setting. Moreover, the modelling approach used in the graph-based human movement prediction literature is not suitable when accurately tracking agents over time is hard. The alternative of treating a node as a fixed location in the environment has not been applied to the human movement problem in the literature. However, the ability of graph-based models to capture the spatial dynamics makes them highly relevant and applicable for human movement prediction within a manufacturing environment. Given the limited research in this area and the potential applicability of graph-based models, it is of interest to explore human movement prediction in shared workspaces with graph-based neural networks, on the basis of past movement data.

3 Methodology

This paper addresses the problem of human movement prediction in shared human–robot workspaces using a graph-based approach. Before introducing the data-driven model, Sect. 3.1 presents the data collection process. Human presence data are made available in an occupancy grid structure. The workspace and occupancy grid are converted into a graph representation, outlined in Sect. 3.2. The specific graph-based neural network model applied to this architecture is introduced in Sect. 3.3, which explains the workings of the model and the reasoning for using it. This model is based on a similar graph-based spatiotemporal model identified in the literature review. However, some adjustments were made to make it appropriate for the problem formulation. After the model is introduced and input data are collected and processed, Sect. 3.4 explains the steps and methods employed for the training of the model. Moreover, this section also discusses the evaluation metrics used to assess the performance of the model.

3.1 Data collection and preparation

When employing data-driven machine learning solutions, the availability of annotated data with access to ground truth is typically a challenge. This issue is even greater when the data represent human behaviours, because of both regulation concerns (e.g. privacy protection) and the required scale of the training corpus. The use of simulations is a solution to overcome this issue. The presented work employs a simulation platform developed by THALES to provide large-scale pedestrian movements simulation. The simulator was originally developed to support control centres of critical infrastructure, such as railway stations (Fig. 1) and airports (Fig. 2), in support of dealing with the challenge of safe management of people in such spaces [18].

Fig. 1 Warsaw East Station Simulation

Fig. 2 Pisa Airport Simulation

The simulator can be used in a serious game context to train control centre operators to react to rare incidents (Footnote 1). Furthermore, it can be used as a digital twin of such operating spaces (Footnote 2) and create realistic data of human presence and movement [1] which can be used to train machine learning models. The simulator employs various approaches for moving agent trajectories and robot path planning, such as the Field D* algorithm [7] or ORCA [27] for collision avoidance. Human movement behaviour modelling is still an open research area, as it remains a challenge to produce solutions which represent realistic phenomena with sufficient accuracy and are therefore appropriate for actionable data creation. The present work seeks to bridge the distance between simulated and real-life data by concentrating on sequences of workspace occupancy grid instances, which can be optimised to follow patterns or real behaviour as observed by applying visual analytics on real visual scenes. This avoids the privacy concerns which would exist if tracking of individual persons were applied. The occupancy grid approach creates a discretisation of the environment, and it becomes possible to follow blurred sequences of movement from individuals, without identification becoming possible, in a way similar to having a very low-resolution blurred scene image. The flexibility to customise synthetic 3D operating and workspace environments allows the generation of high-level human behaviours that define how humans behave, for instance through following specific work sequence patterns. Therefore, the work presented in this paper capitalises on the opportunities offered by such simulations to produce realistic and rich data, which are used to train models for human movement prediction.

The human movement prediction presented in this paper is part of a solution developed to ensure safe and efficient robot fleet path planning in shared human–robot Industry 4.0-enabled workspaces [21]. The overall mobile robot fleet path planning employs a variation of globally guided reinforcement learning [29]. It is beyond the scope of this paper to present the global guidance learning for path planning. Nonetheless, it can be said that in the path planning version without accounting for human movement prediction, the learned guidance steers the fleet of robots to attractor points. These attractors are the workstations where humans perform tasks. The added value of the human movement prediction is in influencing the planning to avoid areas in the grid with likely human presence and in optimising the fleet path guidance taking likely human presence into account. The considered use case involves specific physical work environments. However, to explore the potential of developing a solution that is applicable to a wider range of environments, a simulation approach is taken to create a range of fictitious industrial work environments and generate a rich set of sequences of human and robot movement data in such spaces. The approach is described next, and the associated data are made available (Footnote 3).

The simulation environment employed to generate the fictitious industrial workspaces and the associated human–robot occupancy data was developed by THALES and is employed in broader studies to assess scenarios relevant not only to workspaces, but also to indoor public spaces in transportation hubs, such as railway or airport terminals. Our experiments are conducted in a simulated industrial workspace layout, presenting different types of workstations and production machinery. Additionally, other static attractor points are defined, in the form of coffee/vending machines, to emulate short worker breaks and deviations from defined work sequences aligned with specific business processes. While multiple such scenarios were examined upon which human movement models were developed, the present work employs a simulated scenario with introduced randomness and approximately 100 workers, who belong to four categories of worker profiles. Each category of worker is associated with a distinct business process described through sequences of visits to workstations, with workers opting for the workstation with the shortest queue within their category. Moreover, randomness is introduced by applying a normal distribution to govern the speed at which workers move, and a uniform distribution over the number of workers of each type. Furthermore, there is a 0.1 probability of workers deviating from their work process to visit another attractor point. An example layout of the manufacturing floor is shown in Fig. 3. The rectangles are workbenches and the small squares at the left border of the manufacturing floor are coffee stations, used as further attractors for human movement behaviour that deviates from work sequences. The proximity of the workbenches and the narrow passages between the walls add complexity to the simulation. Although the workers are largely following work sequence patterns, it is hard to infer an individual worker’s type and work sequence from the data. The workers are unlabelled and indistinguishable in the data, accounting for privacy concerns. The prediction problem requires a model capable of capturing the underlying spatial and temporal dynamics of the data, which are influenced by the process sequences that the workers are part of.

Fig. 3 Simplified shop floor: lines show wall boundaries and rectangles show workstations

Sections 3.2 and 3.3 introduce a spatiotemporal graph neural network to tackle this problem. The manufacturing floor is a 100 × 100 grid where each grid cell contains information about the number of workers and contextual information about the obstacles in that cell. The types of obstacles recorded in the simulation are walls, workbenches and coffee machines. The aim of incorporating this additional contextual information is to improve the predictive capabilities of the graph neural network by enabling it to make context-aware predictions. The obstacle information is extracted from the data and made binary, e.g. if a grid cell contains a coffee machine, a 1 is recorded and 0 otherwise. Z-score normalisation is applied to the data. This helps to suppress the impact of outliers and large values on the prediction and during training of the model. Empirical observations indicated that using normalised data improved the quality of the predictions. The next section introduces the graph architecture, while Sect. 3.3 presents in detail the specific version of the spatiotemporal graph network with an attention mechanism, which is implemented in the architecture.
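The feature preparation described above can be sketched as follows. This is an illustration under assumptions rather than the pipeline used in the paper: the function names, the obstacle labels and the feature order are hypothetical, the normalisation is applied to the occupancy counts only, and the population standard deviation is used.

```python
from statistics import mean, pstdev

# Hypothetical labels for the three obstacle types named in the text.
OBSTACLE_TYPES = ("wall", "workbench", "coffee_machine")


def obstacle_flags(cell_contents):
    """Binary indicators, one per obstacle type: 1 if present in the cell, else 0."""
    return [int(kind in cell_contents) for kind in OBSTACLE_TYPES]


def z_score(values):
    """Z-score normalisation: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Worker counts observed in four cells at one time step (toy values).
counts = [0, 0, 2, 2]
normalised = z_score(counts)
flags = obstacle_flags({"coffee_machine"})
```

A node feature vector for one cell could then be formed by concatenating its normalised occupancy with its three obstacle flags.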

3.2 Graph architecture

The simulation data have a grid structure where each grid cell contains information about the number of humans and the presence and type of obstacle(s) in a grid cell. This grid structure must be converted to a graph to be usable in a graph-based neural network. This paper adopts a node classification approach similar to the speed forecasting problem of Bai et al. [2]. Here each node represents a specific grid cell. The aim is to predict the human presence in that cell. Using the node classification approach, the graph structure of workspace G can be formulated as follows:

$$G\left( V, E, X_{v\left( t \right)} \right) \tag{1}$$

where V is the set of nodes, E is the set of edges and \({X}_{v\left(t\right)}\) is the set of node features at time t. More specifically, each node in V corresponds to a grid cell in the data. For example, cell (0, 0) is node 1, cell (0, 5) is node 6 and cell (1, 0) is node 101. The graph used has \(N=\left|V\right|= \text{10,000}\) total nodes. Furthermore, \({X}_{v\left(t\right)}\) is a set that contains a vector of node features for each node at time t. Each vector of node features contains information about the human occupancy and additional contextual information (if a wall, coffee machine or workbench is present). Furthermore, the adjacency matrix is an \(N \times N\) matrix that stores the connectivity information of all nodes:

$$A_{i,j} = \begin{cases} 1, & \left( i,j \right) \in E \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
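As an illustration of the node numbering and of Eq. (2), the sketch below maps a grid cell to its node number and fills a dense adjacency matrix from an edge set. The helper names `node_id` and `adjacency_matrix` are hypothetical, and the row-major, 1-based numbering is inferred from the three examples given in the text.

```python
def node_id(row, col, width=100):
    """Row-major, 1-based node number of grid cell (row, col)."""
    return row * width + col + 1


def adjacency_matrix(num_nodes, edges):
    """Dense N x N matrix with A[i][j] = 1 if (i, j) is an edge, else 0.

    Edges use 1-based node numbers; the matrix itself is 0-based.
    """
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        matrix[i - 1][j - 1] = 1
    return matrix


# Toy graph with three nodes: a self-loop on node 1 and a two-way link 1 <-> 2.
toy_edges = {(1, 1), (1, 2), (2, 1)}
toy_adjacency = adjacency_matrix(3, toy_edges)
```

For a graph of the size considered here, a sparse edge list would normally be preferred over a dense matrix, since almost all entries are zero.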

To construct the graph from the grid, it is determined that each node is connected to itself and all its first-degree neighbouring nodes including the diagonals. As the data have a time interval of 1 s, connecting only the first-degree neighbours would be sufficient as it is unlikely that workers would travel a larger distance in 1 s. Furthermore, since workers can move in both directions between nodes, the graph is bidirectional. Moreover, each edge has two weights since the actual direction of movement is important. These weights influence the effect of the node features in the graph convolution. With varying edge weights, the node features of different nodes are not contributing equally. The higher the edge weight, the greater the influence of these node features in the graph convolution. The edge weights are introduced as learnable parameters in the model. After each training step, the edge weights are updated. Hereby, the model would be able to capture underlying spatial dynamics in the workspace. By learning the edge weights, the trained model implicitly accounts for these different patterns. The weights are initialised by setting them equal to 1/8, where the value 8 corresponds to the average number of edges per node. To create a sparser graph, the nodes (and their connecting edges) that do not have any human presence during the entire simulation are omitted from the graph. The result is a graph with significantly fewer nodes and edges. Since these nodes were never visited during the entire simulation, it is reasonable to assume that these are unlikely to be visited in the future. The same modelling without this assumption is also applicable, only with added complexity. An example of the process of converting the grid to graph is visualised in Fig. 4, where the dotted lines represent nodes and edges that are omitted.
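The construction rule can be sketched as below, assuming that every directed connection, including the self-loop, is stored as a separate edge. The function names and the toy set of visited cells are illustrative. Under this convention a full 100 × 100 grid yields 88,804 directed edges, which matches the edge count reported for the unpruned graph.

```python
from itertools import product


def build_edges(height, width, visited=None):
    """Directed edges linking each kept cell to itself and its 8 neighbours.

    If `visited` is given, cells outside it (and their edges) are omitted.
    """
    cells = set(product(range(height), range(width)))
    if visited is not None:
        cells &= set(visited)
    # All (dr, dc) in {-1, 0, 1}^2; (0, 0) is the self-loop.
    offsets = list(product((-1, 0, 1), repeat=2))
    edges = []
    for r, c in sorted(cells):
        for dr, dc in offsets:
            neighbour = (r + dr, c + dc)
            if neighbour in cells:
                edges.append(((r, c), neighbour))
    return edges


def initial_weights(edges, value=1.0 / 8.0):
    """One weight per directed edge, all starting at 1/8 as stated in the text."""
    return {edge: value for edge in edges}


full_grid_edges = build_edges(100, 100)
# Toy 3 x 3 grid in which only three cells were ever occupied.
toy_edges = build_edges(3, 3, visited={(0, 0), (0, 1), (1, 1)})
toy_weights = initial_weights(toy_edges)
```

Because each direction of a connection has its own entry, the two weights of a bidirectional link can diverge during training, which is what allows directional movement patterns to be represented.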

Fig. 4 Example visual representation of grid-to-graph conversion of a 3 × 3 grid. The squares are the grid cells; circles are the nodes; connections are the bidirectional edges. The dotted edges and nodes show how non-populated nodes are omitted from the graph

Following the outlined procedure, 6147 nodes are omitted from the graph. This results in a total of 3846 nodes in the final graph with 31,510 edges, meaning that each node has on average 8.17 edges. Without omitting these nodes, the graph would have 10,000 nodes with 88,804 edges. The constructed graph is a static one with a temporal signal, in the sense that the graph structure remains the same while node features change over time. For training, the time series of graphs is sliced into smaller sequences. The model uses a historical time series of length q to predict a time series of length p, i.e. the model predicts the next p graph representations using the q previous observations. To evaluate model performance, predictions are compared with targets, i.e. the graph representations from t + 1 up to and including t + p. Sequences of input data are created and split into training and test sets to control for overfitting. These sequences of node features, combined with the connectivity information, serve as input data for the graph-based model (Fig. 5). The employed spatiotemporal version of the graph neural network with built-in attention mechanism is outlined in the next section.
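The slicing of the time series into input and target sequences can be sketched as follows. The function names are hypothetical, and the chronological 80/20 split is an assumption made for illustration, since the splitting scheme is not specified in the text.

```python
def make_windows(series, q, p):
    """Pair each run of q consecutive snapshots with the p snapshots that follow it."""
    windows = []
    for t in range(q, len(series) - p + 1):
        history = series[t - q:t]   # the q previous observations
        target = series[t:t + p]    # the next p observations
        windows.append((history, target))
    return windows


def chronological_split(windows, train_fraction=0.8):
    """Assumed split: earlier windows for training, later windows for testing."""
    cut = int(len(windows) * train_fraction)
    return windows[:cut], windows[cut:]


# Integers stand in for graph snapshots (one per second).
snapshots = list(range(10))
windows = make_windows(snapshots, q=3, p=2)
train, test = chronological_split(windows)
```

A series of T snapshots yields T − q − p + 1 such pairs; in the toy case above, 10 snapshots with q = 3 and p = 2 give 6 pairs.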

Fig. 5 Visualisation of how sequences of graph inputs are linked with outputs, wherein t denotes the time, q the length of the input time series and p is the dimension (time length) of the predictor

3.3 Graph neural network

A spatiotemporal graph neural network (STGNN) is designed to predict the human presence in the workspace. Most of the human movement prediction literature models the agents as nodes and captures the interactions between them. However, in this case the workers are indistinguishable and hard to follow over time, meaning that such an approach would not be suitable. Therefore, a different spatiotemporal approach is used. Specifically, the graph-based model is based on the work of Bai et al. [2]. The model by Bai et al. [2] is designed for a traffic forecasting problem which shares a similarity with the problem posed in this paper. Both problems are spatiotemporal problems using node prediction on a static graph with temporal signals. Instead of predicting the traffic speed, in the present work the human occupancy at the nodes at a specific time is predicted, given the previous node attributes and topology of the graph. The graph-based model developed by Bai et al. [2] has good long- and short-term prediction capability in the traffic forecasting problem. Therefore, this STGNN architecture can also be leveraged for the human movement prediction task in this paper.

The model captures the spatial dependency by a 2-layered GCN on the graph. The 2-layered GCN takes both the first- and second-order adjacent node attributes into account for the spatial characteristics of the graph. The implication is that information from second-degree neighbours, which are not directly connected, is also considered by each node. Contrary to Bai et al. [2], the edge weights in our model are set as learnable parameters. Bai et al. [2] use a weighted graph, where all nodes have a specified predetermined weight. These weights are manually set and fixed. The learned edge weights in our work are intended to capture the underlying spatial dynamics within the data. They are used as inputs to the GCN after a ReLU operation is applied to them. This ensures the stability of the model by making the edge weights non-negative. Following the GCN layers, the temporal dependencies are captured by GRUs, which learn short-term trends in the characteristics of the graph. The GRUs determine the human presence at a given time by using the hidden states from the previous moment and the information from the GCN at the current moment as input, thus retaining the temporal information through the gated mechanisms. The reset gate controls the degree of irrelevant information to be omitted for the forecast, whereas the update gate controls the quantity of information from the previous moment that should be considered for the current state. Putting both the GCNs and GRUs together yields the T-GCN model as developed by Zhao et al. [35]. Each graph representation goes through the GCN layers and is used as input by the GRU. Besides the input from the GCN, the GRU also receives information from the previous hidden state. Bai et al. [2] build upon this model by feeding the hidden states through an attention model. The attention model learns the importance of the occupancy information at every moment and learns the variation trends of the occupancy states.
Finally, a fully connected layer produces the occupancy prediction in a single shot, i.e. it predicts the next p time steps at once. For all results, p = 40, i.e. human occupancy is predicted 40 s into the future. The architecture of the model is visualised in Fig. 6. The next section outlines how model training is performed and assessed.

Fig. 6 The model architecture: from top to bottom, a time sequence of grid cell input vectors is fed into the graph convolutional network layers, which capture the spatial features, then into the gated recurrent units, which capture the temporal characteristics, then into the attention layer and finally into the prediction layer
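
A toy sketch can illustrate two points from the description above: passing edge weights through a ReLU keeps them non-negative, and stacking two graph convolution steps lets a node receive information from its second-degree neighbours. The sketch is deliberately simplified and is not the model used in the paper: it omits adjacency normalisation, learned feature transformations, the GRU and the attention layer. All names and values are hypothetical.

```python
def relu(value):
    return max(0.0, value)


def propagate(features, edges, raw_weights):
    """One simplified graph-convolution step: weighted sum over incoming edges.

    features: one scalar feature per node.
    edges: list of (source, target) pairs, including self-loops.
    raw_weights: one learnable value per edge, clipped to be non-negative.
    """
    output = [0.0] * len(features)
    for (source, target), weight in zip(edges, raw_weights):
        output[target] += relu(weight) * features[source]
    return output


# Path graph 0 - 1 - 2 with self-loops; node 2 is a second-degree neighbour of node 0.
edges = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 0), (1, 2), (2, 1)]
raw_weights = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, -0.3]  # the last weight is clipped to 0

features = [1.0, 0.0, 0.0]  # only node 0 carries a signal
one_hop = propagate(features, edges, raw_weights)
two_hop = propagate(one_hop, edges, raw_weights)
print(one_hop)  # [0.5, 0.5, 0.0]
print(two_hop)  # [0.5, 0.5, 0.25]
```

After one step the signal from node 0 has reached only node 1, whereas after two steps it has also reached node 2, although nodes 0 and 2 share no edge.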

3.4 Model training and performance metrics

The evaluation assesses how well the model predicts human occupancy in both the short and long term. The model is implemented using the Python library PyTorch Geometric Temporal, created by Rozemberczki et al. [22] to handle spatiotemporal graph neural networks. To update the parameters of the network and measure the performance, a mean squared error (MSE) loss function is used, defined over the occupancy of all nodes and every prediction time step, as follows:

$$MSE = \frac{1}{N}\frac{1}{p}\mathop \sum \limits_{t = 1}^{p} \mathop \sum \limits_{i = 1}^{N} \left( {\hat{y}_{i}^{t} - y_{i}^{t} } \right)^{2}$$
(3)

where N is the total number of nodes, p is the number of seconds predicted, \(\hat{y}_{i}^{t}\) denotes the predicted value of node i at time step t and \(y_{i}^{t}\) denotes the actual value of node i at time step t. During the training of the model, L2 regularisation is added to limit the effects of overfitting.
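
As a plain illustration of Eq. (3), the loss can be computed as below. The nested-list layout (`predictions[t][i]` for time step t and node i) and the toy values are assumptions made for the example. The L2 regularisation term is not included.

```python
def mse_loss(predictions, targets):
    """Mean squared error over p time steps and N nodes, as in Eq. (3)."""
    p = len(targets)
    n = len(targets[0])
    total = 0.0
    for predicted_step, target_step in zip(predictions, targets):
        for y_hat, y in zip(predicted_step, target_step):
            total += (y_hat - y) ** 2
    return total / (n * p)


# Toy example with p = 2 time steps and N = 3 nodes.
predictions = [[0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]
targets = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
loss = mse_loss(predictions, targets)
print(loss)  # (0.25 + 0.25) / 6
```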

The network yields a regression output that can be converted into a binary classification by applying a threshold. The performance of the network is evaluated using both regression and classification metrics. The classification predictions are evaluated using accuracy, precision, recall and F-score. As the data are heavily imbalanced, with the majority of nodes being unoccupied, balanced accuracy is also included to provide a view of the model performance that accounts for the data imbalance. In addition, the F2-score is used to evaluate performance. The general F-score formula is given in Eq. (7), where the F2-score corresponds to \(\beta =2\). Whereas the F1-score weights precision and recall equally, any \(\beta >1\) places more emphasis on recall than on precision.

$$\text{Balanced Accuracy} = \frac{TPR + TNR}{2}$$
(4)
$$\text{Precision} = \frac{TP}{TP + FP}$$
(5)
$$\text{Recall} = TPR = \frac{TP}{TP + FN}$$
(6)
$$F_{\beta} = \left( 1 + \beta^{2} \right)\frac{\text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}$$
(7)

where TP denotes the true positives, TN the true negatives, FP the false positives, FN the false negatives, TPR the true positive rate, TNR = TN/(TN + FP) the true negative rate and \(\beta\) the configuration parameter of the general F-score metric. Performance estimation does not account for the omitted nodes. Including the omitted nodes with a value of 0 would inflate the reported assessment metrics without accurately reflecting the ability of the model to predict human occupancy. The omitted nodes are added back only when plotting the occupancy grids. The next section presents the results of applying the graph-based model to human movement prediction.
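
The classification metrics of Eqs. (4) to (7) can be sketched as follows. The function names, the toy scores and the convention that a score greater than or equal to the threshold counts as occupied are assumptions for illustration. Returning zero when a denominator is empty is likewise a choice made here, not one taken from the paper.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP and FN after thresholding the regression output."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn, beta=2.0):
    """Balanced accuracy, precision, recall and F-beta (F2 by default)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return {
        "balanced_accuracy": (recall + tnr) / 2,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# Toy node scores and ground-truth occupancy (1 = occupied).
scores = [0.02, 0.08, 0.30, 0.01, 0.12, 0.04, 0.00, 0.06]
labels = [0, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(scores, labels, threshold=0.05)
metrics = classification_metrics(*counts)
print(counts)   # (3, 4, 1, 0)
print(metrics)
```

With these toy values the single false positive lowers precision to 0.75 while recall stays at 1.0, so the F2-score (0.9375) remains close to recall, which reflects the emphasis that \(\beta = 2\) places on recall.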

4 Results

The obtained results are presented in this section. Section 4.1 discusses the effect of adding learnable weights to the model. Section 4.2 evaluates the performance and behaviour of the model over time. In Sect. 4.3, the performance of the model with different contextual node information is assessed. The parameters used to train the model are empirically determined using the validation loss for different configurations. To determine the learning rate and the L2 regularisation parameter, a range of values between 0.001 and 0.3 was tested. The best empirical results were found using a learning rate of 0.01 and an L2 regularisation parameter of 0.05. Moreover, the model is trained with a batch size of 32 for 5 epochs. Furthermore, it is empirically determined that the model is best configured for this specific problem setting using 256 hidden units and an input time series of 5 s. Model training was performed using two NVIDIA Tesla T4 GPUs on the Google Cloud platform, with 32 GB of memory in total. Using 256 hidden units and a batch size of 32 in the employed spatiotemporal graph formulation consumed most of the allocated memory. In this setting, model training took roughly 30 min for 5 epochs. However, the response time when passing new data through the trained network is practically instantaneous.

4.1 Learnable edge weights

Making the edge weights a learnable parameter is a departure from the model developed by Bai et al. [2]. This section assesses whether adding the learnable edge weights improves the performance of the model compared to using an unweighted graph. After training, the average edge weight is equal to 0.138. This is close to the edge weight at initialisation. However, that does not imply that no learning occurred. The edge weights have a standard deviation of 0.142, and the maximum weight is equal to 1.182. The distribution of the edge weights is illustrated in Fig. 7, exhibiting significant variability as a result of the learning process. Due to the ReLU operation on the edge weights, all weights are non-negative. Moreover, it can be observed that the majority of the edge weights have a value close to or exactly zero. However, a significant number of edge weights have larger values than at initialisation. Overall, this indicates that during the training process, the model learns which edges are important and which edges are unimportant for the human movement prediction.

Fig. 7 Histogram of the learned edge weights, showing that a small subset of weights has more influence

To provide a more robust performance assessment, the performance metrics presented in this section are averaged over the 40 predicted time periods. This under-reports the performance of shorter-term prediction (e.g. over the first 10 time periods) by adding the larger errors observed at the end of the predictive horizon. However, it offers a broader assessment beyond short-term prediction. The model with learnable weights has an MSE of 0.4095, whereas the model without learnable weights has an MSE of 0.4501, i.e. a reduction of approximately 9%. Table 1 shows the classification performance metrics for different thresholds. The table shows that the model with learnable weights yields a higher balanced accuracy, recall, precision and F2-score for almost all thresholds. The enhanced performance indicates that learnable weights improve the model’s predictive capabilities.

Table 1 Performance comparison showing improved performance with learnable weights over unweighted graph for the most relevant threshold values of 0.05 and 0.10

4.2 Model performance over extended time horizon

The previous sections presented the performance averaged over all 40 predicted time periods. However, averaging performance over a longer time period is less informative if there is a notable difference between the predictive performance over shorter and longer time horizons. Therefore, this section assesses how the model performance changes over a longer time horizon.

Figure 8 plots the balanced accuracy over time and shows that it decreases as the time horizon gets longer, until it reaches a flat level, presumably determined by the average occupancy of the grid, but shifted according to the cut-off determined by the threshold. The closer the performance gets to this level, the less meaningful and therefore less actionable the prediction is. However, in the first approximately 8 time steps the performance is relatively high. Beyond this time horizon, the prediction ceases to be meaningful, and at approximately 10 s it offers little more than a guess based on first-order statistics, i.e. the mean expected occupancy. A similar conclusion can be drawn from Figs. 9 and 10, which visually illustrate occupancy over time as coloured and black density heatmaps, respectively. Figure 9 shows the regression outputs as a heatmap over time. The regression outputs represent the belief of the model about human presence at each cell. While the first two plots show concrete and discrete occupancy patterns, from time horizon 10 onwards the occupancy prediction simply reflects the overall movement patterns associated with the work processes and practically converges to the expected occupancy over time. Therefore, only the first two images show a clear, meaningful and potentially actionable prediction.
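
The difference between reporting a metric per prediction step and averaging it over the whole horizon can be made concrete with a small sketch. The data below are invented toy values for three prediction steps and four nodes, chosen only to mimic a prediction that becomes diffuse at longer horizons. The function names and the threshold of 0.1 are assumptions.

```python
def balanced_accuracy(scores, labels, threshold):
    """Balanced accuracy of thresholded scores for a single prediction step."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2


def balanced_accuracy_per_step(predictions, targets, threshold):
    """One balanced-accuracy value per prediction step in the horizon."""
    return [
        balanced_accuracy(step_scores, step_labels, threshold)
        for step_scores, step_labels in zip(predictions, targets)
    ]


# Toy horizon of 3 steps: sharp at first, then increasingly diffuse.
predictions = [
    [0.90, 0.00, 0.00, 0.00],
    [0.30, 0.20, 0.00, 0.00],
    [0.12, 0.12, 0.12, 0.12],
]
targets = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
per_step = balanced_accuracy_per_step(predictions, targets, threshold=0.1)
horizon_average = sum(per_step) / len(per_step)
print(per_step)
print(horizon_average)
```

In this toy case the first step is predicted perfectly and the last step is no better than chance (0.5), while the horizon average of roughly 0.78 hides both extremes.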

Fig. 8 Balanced accuracy over time for different thresholds

Fig. 9 Predicted heatmaps for different time periods

Fig. 10 Predicted occupancy grids for different thresholds and time periods

Figure 10 shows the case in which the regression output is converted to a classification outcome by applying a classification threshold. Choosing a low classification threshold, such as 0.05 or below, results in more grid cells being predicted as occupied, which would be consistent with a conservative ‘safety first’ interpretation for the AMR planner. In terms of accuracy, this would lead to a higher ‘false positive’ rate. The practical outcome of an excessively conservative threshold choice, such as 0.01, is that the AMR planner will be left with too few options to optimise planning. Conversely, a higher threshold results in a reduced proportion of grid cells predicted as occupied. Therefore, very high threshold values would lead to a higher ‘false negative’ rate. This would allow more flexibility for the AMR planner. However, overly aggressive threshold choices, such as 0.35 or higher in this specific case, would leave any human presence missed by the predictor to be handled during the real-time sensing and navigation of the robots.
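
The threshold trade-off described above can be illustrated with a small sweep. The scores and labels are invented toy values, the thresholds are those mentioned in the text, and the function name `sweep_thresholds` is hypothetical.

```python
def sweep_thresholds(scores, labels, thresholds):
    """For each threshold, count cells predicted occupied, false positives and false negatives."""
    rows = []
    for threshold in thresholds:
        occupied = false_positives = false_negatives = 0
        for score, label in zip(scores, labels):
            predicted = score >= threshold
            if predicted:
                occupied += 1
                if label == 0:
                    false_positives += 1
            elif label == 1:
                false_negatives += 1
        rows.append((threshold, occupied, false_positives, false_negatives))
    return rows


# Toy regression outputs for 8 grid cells and their true occupancy.
scores = [0.00, 0.02, 0.04, 0.07, 0.12, 0.20, 0.40, 0.03]
labels = [0, 0, 0, 1, 0, 1, 1, 0]
rows = sweep_thresholds(scores, labels, thresholds=[0.01, 0.05, 0.10, 0.35])
for threshold, occupied, false_positives, false_negatives in rows:
    print(threshold, occupied, false_positives, false_negatives)
```

In this toy example, raising the threshold from 0.01 to 0.35 reduces the cells marked as occupied from seven to one and removes all false positives, at the cost of missing two of the three occupied cells.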

4.3 Performance assessment with the addition of contextual node features

The results presented in the previous section were based on scenarios which did not explicitly take into account the structure of the workspace. One may assume nonetheless that this information is implicitly present in the data and is therefore learned by the model. Cells with walls define movement boundaries. Cells with workbenches are attractor areas for human movement, visited in a specific order according to job sequences. However, it is of interest to explore whether such contextual information can be made more explicit. To do so, additional contextual features were added to each node. Specifically, each node was set to carry information not only about the human occupancy, but also about the presence of a wall (fixed boundary which does not act as an attractor), a workbench (fixed physical element, which acts as the main job attractor) or a vending machine (fixed element, which acts as a non-job-related attractor). The experiments were performed again after introducing these modifications to the graph nodes of the network. Table 2 reports the MSE for a full experiment with the different combinations of contextual node features.
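
One possible way of attaching such contextual information to the nodes is sketched below. The binary flag encoding, the feature order and all names are assumptions for illustration, as the exact encoding is not specified here. The `enabled` argument mimics the evaluation of different feature combinations.

```python
CONTEXT_TYPES = ("wall", "workbench", "vending_machine")


def node_features(occupancy, cell_type=None, enabled=CONTEXT_TYPES):
    """Feature vector of one node: occupancy followed by one flag per enabled context type."""
    flags = [1.0 if cell_type == name else 0.0 for name in enabled]
    return [float(occupancy)] + flags


def snapshot_features(occupancy_by_cell, type_by_cell, enabled=CONTEXT_TYPES):
    """Features of all nodes at one time step; the context flags are static over time."""
    return {
        cell: node_features(occupancy, type_by_cell.get(cell), enabled)
        for cell, occupancy in occupancy_by_cell.items()
    }


# Toy layout: one wall cell and one workbench cell; (1, 1) is free floor space.
type_by_cell = {(0, 0): "wall", (2, 3): "workbench"}
occupancy_at_t = {(0, 0): 0, (1, 1): 1, (2, 3): 1}

all_context = snapshot_features(occupancy_at_t, type_by_cell)
workbench_only = snapshot_features(occupancy_at_t, type_by_cell, enabled=("workbench",))
print(all_context[(2, 3)])     # [1.0, 0.0, 1.0, 0.0]
print(workbench_only[(2, 3)])  # [1.0, 1.0]
```

Only the occupancy entry changes between time steps, so the context flags can be computed once per layout and reused for every snapshot.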

Table 2 Regression results with different contextual information

Compared to the results without any additional context information, all additional node features yield a small decrease in MSE and therefore an increase in performance, as seen in Table 2. The combination of several contextual node features further increases the performance of the model. The best performing model is obtained when all node features are included. It must be noted that the improvement in performance is modest. This is subject to interpretation though. Specifically, results are presented for the whole duration of 40 time steps. This implies that the performance is actually under-reported. As in the case of the results presented in the previous section, it is the performance over the first approximately 8 steps that offers a meaningful prediction. Therefore, the incorporation of the accumulated MSE from all remaining steps blurs the picture, as the largest part of the reported error comes from the longer time horizon steps. The inclusion of learnable edge weights also appears to have a somewhat similar effect, and it remains unclear whether, in terms of performance, it would be better to encode contextual information into the data or simply rely on learnable weights. However, similar observations can be made regarding the range of time horizon steps and classification threshold values which lead to meaningful, and therefore actionable, predictions. This becomes more evident when focussing on prediction results from the first 10 time horizon steps only. Table 3 summarises such results from experiments employing different combinations of contextual node features, for a time horizon of 10 steps and for different classification thresholds.

Table 3 Classification results with different contextual information

The results provide evidence that a classification threshold of approximately 0.1 and the inclusion of all contextual features result in the best observed performance, as demonstrated by the balanced accuracy. Choosing a lower threshold, such as 0.05, leads to similar performance and may be preferable when opting to err on the side of caution. In this case, the robot path planning would be more conservative when choosing areas to avoid, as more cells are predicted to be occupied by humans.

5 Discussion

The key objective of the study was to assess to what extent graph-based neural networks are applicable to predicting human presence in a shared workspace. The results addressed this research question and offered insights into how graph neural networks, in the form employed in this research, perform in such a problem. The workspace is seen as an occupancy grid. The approach taken falls into the category of pattern-based methods, where the target is to predict future occupancy on the basis of past observations. The past observations can be the outcome of performing video analytics over videos acquired via a network of distributed cameras. In the considered simulation scenario, a richer set of simulated observations is produced, compared to what would have been possible with real video analytics from a dedicated workspace. As in real work environments, the workspace locations (grid cells) are characterised by a relatively low occupancy rate. This makes the available datasets imbalanced, with a much higher number of cells never visited, compared to cells which are visited. Motivated in part by mitigating the low occupancy rate and in part by improving computational efficiency, the unvisited nodes were omitted during data preparation. This improved the predictive capabilities and reduced the size of the model. As a consequence, the model cannot predict movement to nodes that are omitted even though it would be a feasible movement for a worker, which is a limitation. Within the simulation this did not result in any problems. In the real world, with more randomness and variation, this limitation could be avoided by considering all cells, albeit at higher computational cost. Furthermore, it is assumed that connecting all nodes with only their first-degree neighbours is sufficient, since the 2-layered GCN can also consider information from second-degree neighbours without being directly connected to these nodes.
The assumption that workers are unlikely to move a greater distance in 1 s appears to be adequate, since the results show no indication of this being insufficient. However, future work can also look at relaxing such assumptions, generating data which allow more extensive movements that cover the whole workspace at higher movement speeds, modelled by larger network structures, at the expense of higher computational costs and increased imbalance in the employed datasets. Ultimately, the outputs of the model can be used to improve AMR planning in a shared workspace. Anticipating future human movement can help the planning of AMRs to avoid potential collisions, making smart manufacturing operations safer and more efficient. It is worth noting that comparisons with other established solutions for human movement prediction are not directly applicable, due to the specific framing of the problem in the present study via a heatmap occupancy format, as part of a broader integrated solution for AMR planning via occupancy data. Nonetheless, it would be beneficial for future research to establish relevant benchmarks for industrial environment contexts, extending current initiatives proposing benchmarks for outdoor spaces, shopping malls and simple room-based scenarios [24]. Furthermore, it would be appropriate to integrate predictive approaches for human trajectories within broader safety assurance mechanisms for robotic systems [4] and digital twin solutions [30], for safer and more trustworthy human–robot collaboration in manufacturing.

5.1 Learnable edge weights

In contrast to the model developed by Bai et al. [2], the graph neural network in this paper makes the edge weights a learnable parameter. The edge weights determine how strongly connected nodes influence each other. By making these weights learnable, spatial patterns can be captured. These learned edge weights are introduced to implicitly account for different work sequences and other spatial patterns in the data, which otherwise would not have been accounted for. The results show that the model with learnable edge weights offers superior performance when compared with an unweighted graph. The model with learnable edge weights yields a lower mean squared error and a higher balanced accuracy, precision and recall. Therefore, it is concluded that learnable edge weights capture some additional spatial dynamics in the data, compared to what would have been possible without these learnable weights. The learnable weights learn the specific layout and movement dynamics in the factory and thus implicitly account for the different work sequences of the workers.

Learnable edge weights are a feature specific to graph neural networks. Their inclusion has interesting implications for generalisability. Since the specific layout and dynamics of the factory are learned, the model cannot directly be applied to a different factory setting. The model must be retrained using data from that specific setting. At the same time, this could be seen as an advantage for generalisation, to the extent that learnable weights can actually capture such implicit patterns in factory settings, so that such information may not have to be hard-coded. Future research could explore different approaches that account for the work sequences and extend the generalisation capability of the model.

5.2 Occupancy prediction over time

The model makes a single-shot prediction for the next 40 time instances (seconds) based on the input data. The very short-term predictions show a high balanced accuracy. When evaluating the predictions for a longer time interval, the results show that the further the predictions are into the future, the worse the performance of the model becomes. The degradation of performance over longer prediction horizons is to be expected. From the performance metrics and visual inspection, it can be concluded that the model can provide meaningful predictions up to approximately 8 steps ahead. Predictions over longer time horizons are less meaningful and ultimately converge to the apparent average occupancy of cells, determined by the number of cells and the average observed number of workers in the workspace. This implies that this graph-based model is only applicable for short-term predictions. This is a different conclusion compared to Bai et al. [2], who concluded that the model is applicable for both short-term and long-term forecasting tasks. However, the context of the prediction tasks is quite different: in our case, it is about human movement prediction in a shared workspace, whereas in the case of Bai et al. [2] it is about traffic forecasting, with the latter likely to show stronger patterns, including time-dependent and seasonal ones, which are far less observable in our case. To address the limitation related to short-term predictions, future research could explore combining graph-based models with planning-based models. By combining the pattern-based graph network with a planning-based model, the long-term goal intent of a worker might be explicitly accounted for, which could improve the performance of the graph-based model for a longer prediction horizon.

5.3 Contextual node features

By adding contextual information to the node attributes, the graph-based model can produce predictions which could be seen as being context-aware. Assuming that a worker is more likely to move towards a workbench than to a wall, adding this information potentially improves the predictive capabilities of the model. The obtained results confirm this: the MSE gradually improves as further contextual information is added, with the best results being obtained when including all available contextual information. A similar pattern is observed with the classification metrics, where the best results are obtained with all the available contextual information. Although the contextual information improves the predictive capabilities of the model, the reported differences in performance are somewhat small. However, this is partly due to the choice made earlier regarding how to report error-based performance. Specifically, instead of monitoring and reporting performance over the shorter time horizon for which the prediction is meaningful (i.e. below 10 time steps, approximately 8 steps or seconds), the error is reported over 40 time steps. This implies that the largest part of the reported error is actually due to predictions beyond the shorter time horizon of good predictive performance, and therefore the observed differences are actually much more significant. In the future, it is of interest to narrow down the performance reporting to shorter-term predictions.

More research is also needed on how best to embed contextual information, as the findings were inconclusive when also considering the possibilities offered by learnable weights. Future research could study the performance of different types of graph convolutional layers and aggregation methods to improve context-aware prediction. Different approaches could possibly be better at extracting context information, thereby increasing the performance of the model in a shared workspace. Furthermore, the contextual information can also be considered explicitly by combining the graph neural network with a physics-based approach. For example, a variant of the social force model by Helbing and Molnar [8] could be considered, which could model workbenches as attraction forces and walls as repulsion forces. In this way, the context information can be incorporated differently and could enhance the performance of a graph neural network.

6 Conclusion

The main objective of this paper was to evaluate to what extent a graph-based neural network can predict human movement in a shared workspace. The literature analysis showed that graph-based models have not been applied to predict human movement in such spaces, yet they possess characteristics which are likely relevant to such tasks. The analysis found examples of similar spatiotemporal problems for which these graph-based models have been successfully applied, but not for human movement prediction. This paper has selected and implemented a graph neural network for a workspace setting, to evaluate the applicability of a graph-based model to the posed problem. Specifically, the model developed by Bai et al. [2] was selected based on the literature analysis. The model was further adapted to include learnable weights to better capture the spatial patterns and work sequences of the problem. This model was trained using simulation data generated by a simulator previously employed in other contexts to simulate human movement in transportation hubs. The nature of the problem implies that historical data would be imbalanced towards low occupancy, and this was reproduced in the simulated data.

Based on the results, it can be concluded that the implemented graph neural network yields sufficiently accurate predictions in the short term, i.e. for up to 8 time steps (seconds). For longer time horizons, the model predictions gradually shift towards the average occupancy rate reflected in the data, and therefore the predictions miss the specific spatial and temporal intricacies of the human movement. It was additionally shown that, by adding learnable edge weights, the graph-based model can better learn some of the underlying spatial dynamics in the data. The optimised weights improve the model performance but at the cost of generalisability. However, this is to be expected and can be mitigated in the future by training different models for different workspaces and work sequences. The inclusion of contextual information appeared to improve the performance of the model, but more investigation is needed regarding how best to include contextual information.

Overall, this paper contributes to safer and more efficient robot–human coexistence. It is shown that graph-based neural networks are applicable to human movement prediction in shared workspaces, specifically in the short term. Furthermore, graph-based approaches can be used to make context-aware predictions. These predictions can serve as input for AMR planning in a shared workspace. By anticipating the likely future human occupancy, improved planning can reduce the risk of robots encountering workers and being forced to stop. Although graph models are promising, future research could explore additional ways to improve predictive performance, such as how the context information in a workspace can be better incorporated by the model, or how to combine the graph-based model with a planning- and/or physics-based approach.