1 Introduction

Industrial production is moving from mass production to what is called “mass customization”, where batches can be as small as a single unit Bruner and Kisgergely (2019); Matthias (2015). In this new scenario, Human Robot Collaboration (HRC) using collaborative robots (a.k.a. cobots) provides the flexibility that is required, and consequently the presence of collaborative robots in industrial facilities is increasing significantly Cobots for Flexible Automation (2021); Williams (2018).

In this scenario, the most important requirement to meet is safety Lohse (2009); Bolstad et al. (2006); Onal et al. (2013), but the use of standard safety methods in industry, such as curtains or physical barriers, makes it impossible for humans and robots to collaborate by sharing the same workspace.

In compliance with current regulations Robots and robotic devices (2016, 2011a, 2011b), cobots are designed and built to be intrinsically safe, with safety-ensuring capabilities such as anticipating collisions and reacting to them Matsas et al. (2018). However, even though intrinsic techniques have been demonstrated to be sufficiently safe Malm et al. (2019), users might still experience a sense of risk when entering the area of influence of a robot. This concern arises from the possibility of potential hazards, such as malfunctions in safety mechanisms or risks associated with the robot holding objects. Collision detection techniques are thus insufficient on their own to provide the situation awareness that promotes a sense of safety and preserves trust in robots Lasota and Shah (2015). In the long term, the stress caused by working in a state of permanent uncertainty due to poor situation awareness can damage health Oken et al. (2015). Improving situation awareness is thus necessary for workers’ well-being and safety, as well as to obtain a good user experience (UX) from collaborating with robots Lohse (2009); Bolstad et al. (2006); Onal et al. (2013). In recent years, and partly motivated by the need to improve situation awareness, research on mixed reality (MR) techniques applied to human-robot interaction (HRI) has increased Bagchi et al. (2018); Green et al. (2007); Pan et al. (2012); Rowen et al. (2019).

Recent literature on situation awareness shows that previsualizing robot movement trajectories is a commonly used technique that often leads to significant improvements in safety and performance during human-robot collaboration Macciò et al. (2022); Cleaver et al. (2021); Tsamis et al. (2021); Rosen et al. (2019). However, even with previsualized trajectories, users are still not fully aware of the possible dangers of sharing a space with the robot, and such a lack of information could negatively affect their perception of safety Cong Hoang et al. (2021). In this research, we also use previsualization of robot trajectories, to match common practice. Taking this baseline design as a starting point, we propose adding multimodal (auditory and/or visual) MR display presentations of hazard-related situation awareness information, which may improve the perception of safety. According to the classification by Suzuki et al. (2022), our research aims to develop an on-body approach that augments the surroundings to improve the user’s awareness of the robot’s state. With such hazard-related information, the goal is that users are in control of their own safety, reassured that they are aware at all times of the varying levels of potential hazard that result from the proximity between robot and worker, while both move across the shared workspace. The idea of granting control to the human provides the opportunity to evaluate whether the designs developed are sufficient to maintain user safety and to observe differences between modalities. By allowing the human worker to decide how to position themselves in relation to the moving robot, we can better understand the effectiveness of our safety designs and their impact on perceived safety and task efficiency.

Although this might contravene current safety regulations, one of the aims of this work is to assess the perceived sense of control over one’s own safety. In real work scenarios, deployment of hazard awareness displays would be done in conjunction with any other necessary safety mechanisms, as required by risk analysis exercises and mandatory regulations.

This paper makes several contributions. We first identify and discuss the limitations of current HRC safety strategies. In response, we propose an MR application that extends beyond simply informing users of the robot’s upcoming trajectory. Our application provides real-time updates on multiple danger levels through both audio and visual cues, ensuring users remain aware and safe while working alongside robots in a shared workspace. In this way, users not only know whether they are in danger, but also the level of danger they are exposed to, even when they are not looking directly at the robot. The designs of an auditory and a visual hazard awareness display are reported, and their implementations described. We then report an experimental user study (n=24) comparing the performance and subjective experience obtained from using the auditory display, the visual display, and the audio-visual display that results from combining both, while completing a task in a space shared with a moving physical robot.

2 Related work

Safety in robotics is most often discussed in the context of industrial applications, where the development and adoption of robots have been fastest. Manufacturing industries commonly use Computer Aided Design (CAD) and Manufacturing (CAM) systems in order to deal with rapid technological progress and aggressive economic competition on a global scale Makris et al. (2014).

In the last ten years, AR has emerged as a key technology in manufacturing Montuwy et al. (2017); De Pace et al. (2018). This technology assists in the workspace by presenting, in the worker’s field of view, the information they need at each moment. AR devices offer the possibility to display dynamic information and enhance the UX in modern industrial workplaces, including those using cobots. AR devices have been used as aids for workers in a variety of sectors, such as the automotive industry Doshi et al. (2017), assembly Evans et al. (2017), logistics Reif and Günthner (2009), shipyards Blanco-Novoa et al. (2018) and construction Li et al. (2018). AR has also been used for assisting humans working alongside fenceless collaborative robots in a variety of contexts Makris et al. (2016); Matsas and Vosniakos (2017); Wen et al. (2014).

At the same time, the increase in the use of cobots creates a need to improve safety and situation awareness, with the aim of reducing the anxiety created by working with robots. AR has proved its potential in addressing this issue. Brending et al. (2016) suggested that the use of AR in HRC could reduce anxiety by showing contextual information to a human operator working in close proximity with a robot. Vogel et al. (2016) suggested that a projection-based AR system could improve HRC.

With this motivation, Vogel and others Vogel et al. (2011, 2013, 2015, 2021) developed a projection-based sensor system, using cameras and data from robot controllers to monitor the target object’s position and the configuration of the robot. Based on this information, they projected on the work surface safety areas around the robot and the object to be grasped. If any object crossed the boundary of the safety areas, the robot stopped. While safe, this behavior is not compatible in practice with collaborative work requiring the worker to continuously cross that boundary. The worker may need to be made aware of the situation, without the robot slowing down or stopping altogether. At the same time, utilizing 2D AR interfaces requires the user to shift their focus between the display and physical workplace. Improved situation awareness is also needed when the worker has to look away from the robot during collaboration. In such contexts, extended reality (XR) technologies offer the opportunity to present information about the relative position of the robot, thus warning the worker about the potential hazard.

Prior attempts have been made to improve such awareness using XR technologies. In Hietanen et al. (2020), the authors proposed a workspace model for HRC based on three different zones (robot, human and danger zones) Bdiwi et al. (2017), which were continuously monitored and updated. This model was presented to the user through two different AR-based interactive user interfaces: projector-mirror and HoloLens. However, in this study, the user was only allowed to enter the workspace when the robot was stopped, and the boundaries, which conveyed the actual configuration of the robot, were displayed even when the user was in a safe space. In Aivaliotis et al. (2023), the same situation is observed, where only boundaries related to the actual configuration are displayed, and no data concerning the user’s state is shown. In Bolano et al. (2018), the authors made use of visual and acoustic information to warn users about the robot’s intention, making the planned trajectories easier to understand. Nevertheless, the acoustic information only provided notifications about new trajectory re-planning and execution, while the visual data was displayed on a screen, causing users to shift their attention while working. Using Unity and Vuforia, Palmarini et al. (2018) developed an AR interface which virtually displayed the planned trajectory of the robot before it was executed. They concluded that AR systems improve the trust of the operator in collaborative tasks. In Gruenefeld et al. (2020), the authors presented three methods for expressing robot intention using gradient maps based on warm-cold color mapping. However, in this approach, users must focus on the robot or its immediate surroundings to understand the robot’s intentions and their current state. This requirement can cause users to alternate their attention between their task and the robot, potentially affecting task efficiency. In addition, gradient mapping has been shown to require more cognitive resources to interpret, which can increase mental workload and might not be adequate for safety-critical situations where quick interpretation is crucial Ware (2019). Matsas et al. (2018) prototyped proactive and adaptive techniques for HRC in manufacturing using virtual reality (VR). They compared the effectiveness of each technique but did not analyze the effectiveness of the information displays used. In San Martín and Kildal (2019), we analyzed different audio and/or visual hazard information display designs, aimed at a generic static source of hazard emerging from a static vertical axis in space. A user study showed that the auditory and visual displays obtained through iterative design were reasonably equivalent, both in the behaviors they elicited as hazard displays and in the hazard-related information they conveyed. In a second user study San Martin and Kildal (2021) we compared the single-modality displays and the audio-visual display resulting from the combination of both. However, the scenario in San Martin and Kildal (2021) involved hazards of an undetermined nature, emerging from vertical holographic axes fixed in space, and with no resemblance to robots either in shape or in behavior.

While advancements in MR technologies have increased their usage in HRC applications, their potential to improve situation awareness remains insufficiently researched. Proposed MR solutions for HRC often fall short of providing comprehensive insights into the user’s safety status during collaborative work with robots. To address this, we propose an MR application that goes beyond merely offering information about the robot’s next trajectory. Our application provides real-time updates on various levels of danger through both audio and visual cues while users are actively working alongside robots in the shared workspace. This real-time feedback informs users about the degree of risk they are facing and the specific factors contributing to each level of danger, empowering them to navigate potential hazards more effectively. Concurrently, our work aims to evaluate and assess the effectiveness of audio, visual, and audio-visual displays and analyze the strengths of each display in an HRC scenario.

3 Developing an awareness display for HRC

The information displayed to raise awareness of hazards in HRC scenarios is dependent on the characteristics of such scenarios. In the context of this paper, HRC scenarios involve a worker that shares space with a real robot arm, where both robot and human cooperate to complete a task. Robot and human move in relation to each other and, based on real time information from the awareness display, it is the human who modifies their behavior to modulate exposure to the hazard resulting from proximity with the robot.

For the HRC scenario in this paper, the physical cobot arm selected was an LBR Iiwa 14 robot. To preserve user safety during the display design process, a virtual version of the scenario was used, consisting of the digital twin of the selected LBR Iiwa 14 robot, which preserved the dimensions and kinematics of the real robot. A virtual single-legged white round table was added to the otherwise empty collaboration space, with a virtual object placed on the table. Target positions for the object on the table were represented as small transparent boxes. In this scenario, a simple HRC task involving the manipulation of the boxes amid the interfering movement of the virtual robot could be carried out, to test display designs developed in successive iterations.

The virtual scenario was displayed on a Microsoft HoloLens 2 head mounted display (HMD). We selected the HoloLens 2 because it can deliver visual information even when users have their back to the robot, while letting them see the real world through the display, and it can provide directional information through 3D spatial sound. Additionally, the HoloLens generalizes more easily to other scenarios Park et al. (2021) and potentially reduces the setup costs associated with cameras and projectors. For the implementation, we used Unity (v.2020.3.11f1) and the Unity Robotics Hub (v.0.7.0) with the ROS-TCP Connector (v.0.7.0) in order to connect with the Robot Operating System (ROS). A Dell computer with an Intel i7 CPU and 32 GB of RAM was used to run ROS, MoveIt and Gazebo, which provided the capability to simulate the robot. The computer running ROS was integrated directly with the robot; it moved the robot to the positions set from the HoloLens, and the visualization in the HMD was then updated in real time with the simulated robot. The HoloLens exchanged information with the robot’s computer through the ROS-TCP connection over WiFi. Figs. 1 and 2 show representations of the visual and auditory hazard awareness displays that we designed and tested with the interactive setup just described. Details of each display design are described in the following subsections.
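
As a minimal sketch of this Unity-ROS link (not the authors' code), the following MonoBehaviour subscribes to a joint-state topic through the ROS-TCP Connector and applies the received joint angles to the transforms of the digital twin. The topic name, the generated JointStateMsg class and the simple revolute-joint mapping are assumptions for illustration and may differ from the actual setup.

```csharp
// Sketch only: drives the virtual LBR Iiwa digital twin in Unity from ROS joint states
// received through the ROS-TCP Connector. Topic name, message class and the
// revolute-joint mapping below are assumptions, not the authors' implementation.
using UnityEngine;
using Unity.Robotics.ROSTCPConnector;
using RosMessageTypes.Sensor; // assumed generated class for sensor_msgs/JointState

public class DigitalTwinJointDriver : MonoBehaviour
{
    [Tooltip("One transform per robot joint, in the order used by the joint_states message.")]
    public Transform[] jointTransforms;   // assumed: the 7 links of the LBR Iiwa twin
    public Vector3[] jointAxes;           // assumed: local rotation axis of each joint

    private double[] latestPositions;     // radians, written by the ROS callback

    void Start()
    {
        // Subscribe to the joint states published by the ROS/MoveIt/Gazebo side.
        ROSConnection.GetOrCreateInstance()
            .Subscribe<JointStateMsg>("/joint_states", msg => latestPositions = msg.position);
    }

    void Update()
    {
        if (latestPositions == null) return;
        int n = Mathf.Min(jointTransforms.Length, latestPositions.Length);
        for (int i = 0; i < n; i++)
        {
            // Apply each joint angle (radians -> degrees) about its local axis.
            float deg = (float)latestPositions[i] * Mathf.Rad2Deg;
            jointTransforms[i].localRotation = Quaternion.AngleAxis(deg, jointAxes[i]);
        }
    }
}
```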

3.1 Design of the visual display

In order to design the visual display, we first analyzed which color codes are standard for hazard warnings. We based our choice of colors on this analysis, so that users could recognize the level of hazard they were exposed to while collaborating with the robot. Zielinska et al. (2014, 2017) reported that red is perceived as the most hazardous of all colors, followed by orange and yellow in second and third place. Although there was no significant difference in the hazard perceived between these two colors, orange was perceived in both studies as more dangerous than yellow. Hence, red was used for the highest level of hazard, orange for a middle hazard level and yellow for the lowest level of hazard. From these studies it can also be inferred that green is perceived as signaling a space that is safe or with a very low level of danger. Hence, we decided to display green for one second once the participant left the hazardous space, as confirmation of having entered a safe space.

However, we wanted to make users aware not only of the level of hazard they were exposed to, but also of how far they were from the focus of danger. To achieve this, the dynamics and movements of the robot were considered: for each level of hazard, the hazard volume was displayed wrapping around the robot. The resulting shape of the colored hazard zone was informative about the distance to the robot, as it varied in shape and color with the movement of the robot and with the level of hazard of the space occupied.

Fig. 1

Visual Display Designs: a1, b1 and c1 show exact shapes of the display concept for each range of the hazard zone. a2, b2 and c2 show the hazard volumes implemented in each case, using simple geometric shapes that fully contain the exact concept shapes and are easy to interpret by the user

Following these ideas, the visual display was designed as follows:

  • Low level of danger, yellow: This level of danger warned users when they entered a space in which they could be reached by the robot (the region of influence of the robot). This space was represented by the whole reachable volume of the robot, which is static (similar to Fig. 1-a1). We decided to represent it with a simple geometric shape, a sphere (Fig. 1-a2) to make it easier to understand and estimate the extent of that region.

  • Middle level of danger, orange: The orange level (a sub-region of the yellow level) corresponded to a region closer to the source of danger, linked to the specific trajectory that the robot was executing at the time. In each trajectory, the robot moved from a current position to a target position, and the volume created by all the poses along that trajectory was the volume used to represent the orange hazardous space. The exact shape created by all the poses of the target trajectory could be quite complex (similar to what is represented in Fig. 1-b1). To simplify the shape and make it easier to understand at a glance, we approximated it with a cylinder (as shown in Fig. 1-b2) that contained, by excess, the complete exact volume. The cylinder’s radius and height were obtained by computing the maximum distances in x, y and z from the center of danger, considering the robot’s configuration and target point. This volume changed constantly with the movement of the robot (a code sketch of this computation, together with the skin thickness described in the next item, follows this list).

  • High level of danger, red: The volume with the highest level of danger (a sub-region of the orange volume) represented imminent danger. For a robot, this corresponds to the immediate proximity to the position currently occupied by the robot. This hazard space was a skin-like layer around the robot, as shown in Fig. 1-c2. The thickness of the skin above the robot was 0 mm when the robot was static, and it grew linearly with the speed of displacement of the robot. The maximum speed allowed for the displacement of the robot was 1 m/s, which corresponded to a thickness of the red region of 125 mm above the surface of the robot.
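
The following is a minimal geometric sketch, under Unity's y-up convention, of how the orange trajectory cylinder and the red skin thickness described above could be computed; the sampled trajectory poses and the center of danger are assumed inputs, and this is not the authors' implementation.

```csharp
// Sketch only: derives the orange bounding cylinder and the red skin thickness
// from the quantities described in the text. Unity's y axis is taken as vertical
// (an assumption); sampledPoses are points sampled along the current trajectory.
using System.Collections.Generic;
using UnityEngine;

public static class HazardVolumes
{
    // Orange level: bounding cylinder, centred on the centre of danger c, that
    // contains (by excess) every sampled pose of the trajectory being executed.
    public static void TrajectoryCylinder(IEnumerable<Vector3> sampledPoses, Vector3 c,
                                          out float radius, out float height)
    {
        radius = 0f;
        float maxUp = 0f, maxDown = 0f;
        foreach (Vector3 p in sampledPoses)
        {
            Vector3 d = p - c;
            radius  = Mathf.Max(radius, new Vector2(d.x, d.z).magnitude); // horizontal extent
            maxUp   = Mathf.Max(maxUp, d.y);                              // highest point
            maxDown = Mathf.Min(maxDown, d.y);                            // lowest point
        }
        height = maxUp - maxDown;
    }

    // Red level: skin thickness grows linearly with robot speed,
    // 0 mm when the robot is static, 125 mm at the 1 m/s maximum speed.
    public static float RedSkinThicknessMm(float robotSpeedMetersPerSecond)
    {
        return Mathf.Clamp01(robotSpeedMetersPerSecond / 1.0f) * 125f;
    }
}
```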

As shown in Fig. 1, in the visual display design, users could see at all times the boundaries of each danger volume activated by their presence, to help them identify the way out quickly and easily.

3.2 Design of the auditory display

For the auditory display, we used a short sound pulse (440 Hz, 50 ms) that was repeated at a varying rate when the user was inside the region of danger, depending on the distance to the robot. The sound was spatialized, to help the user perceive in which relative location in the surrounding space the robot was. The sound was single pitch, instead of a range of frequencies, in order to convey a more moderate sense of urgency Edworthy et al. (1991). Chords or harmonically complex sounds were also avoided, so that the sound did not interfere with other audition-based communication channels during interaction, such as natural speech Edworthy and Hellier (2006). We used tempo as the parameter to map the distance to the origin of the danger Giang and Burns (2012). Zobel (1998) concluded that users could perceive a strong differentiation in urgency between 1 Hz, 2 Hz, 4 Hz and 6 Hz pulse rates, and could also detect higher urgency at higher rates such as 8 Hz. Following this idea, we used a pulse rate of between 2 and 10 Hz, depending on the distance of the user to the center of danger, located at the centroid of the geometric configuration of the robot (the “c” point in Fig. 2I). The pulse rate increased linearly as the user got closer to the c point.

The auditory signal activated at the same position as the visual display’s yellow zone did, i.e., as soon as the operator entered the region of influence of the robot (Fig. 2I-a). The pulse rate was lowest when the robot reached its furthest position in the opposite direction to the user. In this way, the equation relating pulse rate to user proximity was obtained (see the equation in Fig. 2I-c). Using the head related transfer functions (HRTFs) from the MRTK, the sound was spatialized, so that the user could hear the hazard warning emerging from a specific position in the space around them, and thus locate it easily.
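
A minimal sketch of this mapping is given below, assuming a simple linear interpolation between 2 Hz at the largest attainable distance to c (user at the boundary of the region of influence, robot at its furthest position away from the user) and 10 Hz at c itself; the exact equation used in the study is the one shown in Fig. 2I-c.

```csharp
// Sketch only: maps the user's distance to the centre of danger c onto the pulse
// rate of the 440 Hz / 50 ms warning tone, as an assumed linear interpolation.
using UnityEngine;

public static class AuditoryHazardMapping
{
    public const float MinPulseHz = 2f;   // at the largest attainable distance to c
    public const float MaxPulseHz = 10f;  // at the centre of danger

    // maxDistance: largest user-to-c distance at which the warning can play
    // (user at the boundary of the region of influence, robot furthest away).
    public static float PulseRateHz(float distanceToCentre, float maxDistance)
    {
        float proximity = 1f - Mathf.Clamp01(distanceToCentre / maxDistance); // 0..1
        return Mathf.Lerp(MinPulseHz, MaxPulseHz, proximity);
    }

    // Interval between pulse onsets, e.g. to schedule AudioSource playback of the pulse.
    public static float PulsePeriodSeconds(float pulseRateHz)
    {
        return pulseRateHz > 0f ? 1f / pulseRateHz : float.PositiveInfinity;
    }
}
```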

Fig. 2

I Auditory Display Designs: Frequency (f, in Hz) of the auditory signal depending on the proximity of the user to the center of danger. II Collision Detection of User: Objects that represent the user: a full body; b full body with extended arm. They were used to detect collisions with hazard spaces

3.3 Collision detection of user with hazard spaces

Unity’s built-in collision detection, which reports when two objects collide, was used to determine when the user entered each hazard zone.

The visible shapes described in the visual design and the invisible reachable space described in the auditory design were all objects created in the Unity scene. Some of these, such as the orange and red shapes, were updated during execution to exhibit the behavior just described. Interacting with them required identifying potential collisions of the user with the objects of the different hazard levels. For this, another three objects were created to represent the user’s body.

The first object was a rectangular box that represented the user’s body. This object was placed at the position of the HMD, and so it moved with the user. It could be edited when the application was launched, by inserting the user’s dimensions in x, y and z (Fig. 2II-a). It was assumed that the user would work sitting down or standing up, so the object only rotated about the z axis of the glasses (Fig. 2II-b). A problem arose when the user reached out with a hand, because the hand then came out of the box object containing the body. To handle this, the HoloLens’s articulated hand-tracking was used, which tracks the user’s hands whenever they are inside the HMD’s field of view. Whenever the HoloLens detected a hand, an object with its corresponding label (‘Hand_left’, ‘Hand_right’) was added to the scene. Making use of these objects, a box was attached to each arm, as shown in Fig. 2II-b, so as to represent the worker’s hands when either of them came out of the body’s box.

Once the objects representing the user were created, a script detected when the operator collided with any hazard zone (OnCollisionEnter) and identified the corresponding hazard level. With this information, if the visual modality was used, the corresponding hazard zone was turned on. If the modality was auditory, the audio with its corresponding pulse rate was played. In the same way that collisions were detected, events of the user leaving these objects (OnCollisionExit) were also detected. In the visual display, when the user left the lowest danger zone, a green light was displayed for one second to inform the user that they were safe.
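
A minimal sketch of such a script is shown below; the zone tags, component layout and condition flags are assumptions for illustration, and the body and hand boxes are assumed to carry the collider and Rigidbody components that Unity requires for these callbacks to fire.

```csharp
// Sketch only: collision handling as described above, attached to the box objects
// representing the user's body and hands. Zone tags and components are assumed.
using System.Collections;
using UnityEngine;

public class HazardZoneListener : MonoBehaviour
{
    public bool visualModality;          // set at launch according to the condition
    public bool audioModality;
    public AudioSource pulseSource;      // plays the 440 Hz / 50 ms warning pulse
    public Renderer greenConfirmation;   // green light shown for one second on exit

    void OnCollisionEnter(Collision collision)
    {
        // Hazard-zone objects are assumed to be tagged "YellowZone", "OrangeZone" or "RedZone".
        Debug.Log($"Entered hazard level: {collision.gameObject.tag}");

        if (visualModality)   // show the boundary of the zone the user has just entered
            collision.gameObject.GetComponent<Renderer>().enabled = true;

        if (audioModality && collision.gameObject.CompareTag("YellowZone"))
            pulseSource.Play();   // start the warning sound; its pulse rate is driven
                                  // elsewhere from the distance to the centre of danger
    }

    void OnCollisionExit(Collision collision)
    {
        if (visualModality)
        {
            collision.gameObject.GetComponent<Renderer>().enabled = false;
            if (collision.gameObject.CompareTag("YellowZone"))   // left the lowest zone
                StartCoroutine(ShowGreenForOneSecond());          // confirm the user is safe
        }

        if (audioModality && collision.gameObject.CompareTag("YellowZone"))
            pulseSource.Stop();   // silent once outside the region of influence
    }

    IEnumerator ShowGreenForOneSecond()
    {
        greenConfirmation.enabled = true;
        yield return new WaitForSeconds(1f);
        greenConfirmation.enabled = false;
    }
}
```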

4 Modality comparison user study

Once the design and implementation of the two modality-specific hazard awareness displays were completed, we conducted an experimental user study (n = 24) with the physical robot, to assess each display on its own, as well as the audio-visual display that resulted from the combination of both. To ensure participants’ safety, any sharp or pointed object that could harm participants was removed, and the robot was set to move very slowly during the study, with a mean speed of 0.25 m/s, so that users could react easily to the robot’s movement. The experimental study was also performed in an open space where participants could not get trapped by the robot. In addition, an observer holding a wireless emergency stop button was present at all times and was responsible for stopping the robot immediately, should any unforeseen hazardous situation arise.

One objective of the study was to assess the three display versions in terms of user performance, by observing the extent to which displays enabled users to remain safe during collaboration. Another objective was to assess the subjective experience that each unimodal display, and the multimodal combination of both, elicited in users during collaboration with a physical robot. By conducting the study also for the combined audio-visual display, we aimed to understand which strengths and weaknesses of each modality-specific display remained in the multimodal version, and if the simultaneous presentation of all modality-specific information (which was partly redundant and partly complementary between modalities) resulted in a safer display for collaboration. Further, we wanted to understand how the experience of using the multimodal display compared to using each of its constituent unimodal displays alone.

4.1 Participants

We recruited 24 participants (12 male and 12 female) with a mean age of 28 years (SD = 5.87), \(CI_{95\%}\)[26.152, 30.847], who volunteered to take part in the study. 16 of the participants had some previous experience with augmented reality, 13 of whom had experience with a HoloLens HMD. Only 5 of the 24 had previous experience with collaborative robots. Participants were recruited from a population of researchers and applied engineers with various degrees of seniority in different areas of technical knowledge.

4.2 Experimental task

We designed a three-condition within-groups repeated-measures study to compare the performance and UX obtained when collaborating with a robot using each of the three hazard awareness display versions. An HRC task was designed to be completed in each condition: (i) visual condition (visual display only), (ii) audio condition (auditory display only), (iii) audio-visual condition (simultaneous presentation of both the audio and visual displays). The order in which conditions were presented to the participants was fully counterbalanced, and in each condition the task was repeated 3 times.

Fig. 3

Scenario of the user study. The first step to set up the scenario was matching the virtual robot (dashed green robot) with the real robot. In any condition, while the robot was operating on its own, the participant was kept busy making mental calculations for a number series. Once the robot called the participant, she moved to the robot’s table (III) to perform the assigned task (T1, T2 or T3). While the participant performed the task at the table, the robot moved in a loop between predetermined points: T1 [P11–P18]; T2 [P21–P27]; T3 [P31–P39]. During each condition the experimenters (II) took notes about how the participant performed the task

We set up an experimental HRC pick-and-place activity in the open space of an industrial facility, where noise from machinery operating in the same facility could be heard in the background. This choice of environment was representative of an actual industrial facility in which workers (target users of our solution) and robots typically cooperate in real production scenarios. For the robot used in this study, we chose an LBR Iiwa robot arm, which was securely affixed to a table for its task. An inspection task was selected for the study, representative of tasks commonly found in industrial production Brito et al. (2020); Zheng et al. (2021). More specifically, the selected task required the robot to inspect specific documents (see Fig. 3).

To evaluate the hazard awareness displays, we wanted a task in which the user and the robot had to occupy the same shared space at various times during the execution of the cooperative task.

To that end, we devised a task in which participant and robot had to cooperate. The task consisted of placing a numbered sheet in its corresponding position in the stack of numbered pages that formed an unbound book, so that the robot could inspect the whole book, page by page. This task aligns with relevant industrial activities, such as the need to coordinate with robots and ensure human accuracy in placing items correctly within a limited time frame. The precision required in this task can also be generalized to various industrial applications, including assembling electronics, automotive parts, and packaging. This generalizability makes it a valuable model for studying and improving HRC across multiple sectors.

First, the participant was located at a separate workstation from the robot’s table, executing a distractor task that involved performing mental arithmetic calculations (see Fig. 3I). This approach was used to minimize any mental practice that might lead to learning or the development of strategies for subsequent repetitions Schmidt and Bjork (1992). After a waiting period, the robot requested the participant’s intervention by playing a sound and displaying a yellow circle in the upper left side of the HoloLens display. Such a request took place three times in each condition.

Upon receiving the request, the participant moved to the robot’s table to take a numbered sheet. They then proceeded to the other side of the robot’s table, where a stack of numbered pages was placed, and inserted the sheet into the corresponding location. The user had to complete the insertion of each page with the destination document on the table, which was shared with the robot. The participant was not allowed to remove the document from the table and return it after inserting the loose sheet. Instead, the participant had to carry out the insertion while the robot was operating on the opposite side of the table, and had to pause the insertion task to avoid the robot when it approached their vicinity. Once a loose sheet was inserted in its correct position, the participant had to return to the individual workstation, further away from the robot’s table, press an orange virtual end button, and resume the distractor task of writing down arithmetic calculations, until the robot requested the participant’s help again.

The task was the same for the three conditions (audio, visual and audio-visual). It also remained the same for all the repetitions in each condition: pick a sheet and place it in the corresponding location following the numerical order of the sheets. For each condition, the location of the book on the table was selected from among three options (see positions T1, T2, T3 in Fig. 3 III) randomly, but ensuring partially counterbalanced exposure and without repeating a location for the same participant. Each location of the book on the table also determined where the participant would take (one-by-one) the sheets from. The location of the book also determined the route along which the robot moved in a repeating cycle during the task. Movement routes of the robot were designed so that the robot interfered with the participant in the execution of the task. Thus, and referring to Fig. 3, with the book in T1, the participant picked the sheets from location T2, and the robot followed cyclically the trajectory between points P11 and P18. With the book in T2, the participant picked the sheets from location T1, and the robot followed cyclically the trajectory between points P21 and P27. When the book was in T3, the participant picked the sheets from location T2, and the robot followed cyclically the trajectory between points P31 and P39. When deciding locations and robot trajectories, we analyzed the workspace to create 3 variants of the task that were equivalent in difficulty.

We opted for pre-defined and pre-sampled trajectories across all subjects to ensure a more balanced and controlled analysis. This approach aimed to assess how effectively each modality (audio, visual, or audio-visual) helped to maintain user safety without introducing external factors, such as dynamic changes in robot trajectories, which might interfere with the control of the conditions. By keeping the trajectories consistent across all users, we ensured that any observed differences in user safety could be attributed to the feedback modality rather than to variability in the robot’s movements. It should also be noted that users were not told that the trajectories were pre-defined.

Table 1 Threshold values used for gamification scoring, categorized by speed (task execution time), staying safe (a score that depends on the number of times higher-level hazard zones were entered), and correctness (how well the task was performed)

The task was gamified to encourage participants to execute the task as well as possible according to the instructions. Three aspects of task execution (speed, staying safe and correctness) were measured or evaluated, resulting in a numerical score of up to 15 points (up to 5 points for each aspect). The aim was for users to obtain a mark as close to 15 as possible. Time to complete the task was taken as the measure of speed, and depending on the time spent participants obtained a score from 1 to 5 (see Table 1, Speed column). As a measure of staying safe, each intrusion into the orange hazard region counted as one point of entry scoring, and each intrusion into the red region as two. The smaller the entry scoring, the higher the score obtained in this aspect, with a maximum of 5 points (see Table 1, Staying Safe column). In addition, the observer rated numerically how correctly the instructions had been followed in keeping the book on the robot’s table during the insertion of loose sheets. Here too, the best value was 5 (see Table 1, Correctness column). Each participant was informed about the scores obtained after the three conditions were completed.
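
The sketch below illustrates the structure of this scoring scheme (5 points per aspect, orange intrusions counted once and red intrusions twice); the threshold arrays are placeholders, since the actual cut-off values are those listed in Table 1.

```csharp
// Sketch only: structure of the gamification score. The threshold arrays passed in
// are hypothetical placeholders; the real cut-offs are those given in Table 1.
public static class GamificationScore
{
    // Maps a measured value onto 5..1 using ascending thresholds (lower is better):
    // thresholds[i] is the highest value that still earns a score of 5 - i.
    static int BandScore(float value, float[] thresholds)
    {
        for (int i = 0; i < thresholds.Length; i++)
            if (value <= thresholds[i]) return 5 - i;
        return 1;
    }

    public static int Total(float taskSeconds, int orangeEntries, int redEntries,
                            int correctnessScore /* 1..5, rated by the observer */,
                            float[] speedThresholds, float[] safetyThresholds)
    {
        int speedScore = BandScore(taskSeconds, speedThresholds);
        int entryScoring = orangeEntries * 1 + redEntries * 2;   // weighted intrusion count
        int safetyScore = BandScore(entryScoring, safetyThresholds);
        return speedScore + safetyScore + correctnessScore;      // maximum of 15 points
    }
}
```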

4.3 Experimental procedure

Each participant read and signed a consent form and filled out a demographic questionnaire before the experimental session started. Having read a description of the task, every participant received a training session and practiced executing the task with feedback from each of the display versions. For safety reasons, the training session was conducted by interacting with the virtual representation of the robot’s digital twin, not with the real robot used during the execution of the study task. During the training session, the robot’s digital twin was placed on the table where the real robot would later be placed throughout the experiment.

After the training session, the scenario was prepared to perform the study. To implement the interaction and the responses designed for the displays, the real robot and its digital twin had to be present, with the digital twin perfectly superimposed on the real robot. The presence of the virtual replica of the real robot was only necessary to detect proximity with the user wearing the HoloLens device. Thus, once the position of the virtual replica was calibrated with the real robot, with both versions of the robot moving together, the proximity of the user to the digital robot also indicated the proximity of the user to the physical robot. However, since from the user’s perspective the scene only included one physical robot, the virtual replica of the robot was made invisible to the participants, who only saw the physical robot. In this way, participants could observe the hazard awareness information over the real robot (see Fig. 4). For a live demonstration of the designs (audio, visual and audio-visual) in a real environment, and of the training session, please refer to the video (https://youtu.be/JvLUrciz2hM).

The volume of the HoloLens speakers used for the training and experimental tasks was standardized for all participants to ensure consistency. The volume was set at a level that was sufficient for all users to clearly hear the auditory display without exceeding 80 decibels (dB), in line with the World Health Organization (WHO) recommendations for safe listening levels. This setup ensured that all participants could hear the auditory cues clearly while also protecting their hearing.

Fig. 4

Augmented real scene view and schematic representation of increasing level of hazard (yellow-orange-red) as a person gets closer to a collaborative robot. a1 and a2 whole volume reachable by the robot; b1 and b2 volume likely to be reached in the current configuration of the robot; c1 and c2 immediate proximity of the robot. When the user is outside any danger zone, the hazard zones are not visible

4.4 Metrics

After each condition, the participants responded to two post-task questionnaires: a Single Ease Question (SEQ) using a 7-point Likert scale, where 1 represents the worst value and 7 the best Sauro (2012), and a raw NASA-TLX (RTLX) questionnaire Hart (2006), extended with the category “irritability”. The generalized practice of adding subscale categories is acknowledged and welcomed by the creators of the original scale Hart (2006), as long as the index is calculated with the originally validated set of subscales, as we do. As highlighted by Haas and Edworthy (1996), the design of auditory warnings plays a critical role in conveying urgency and can significantly influence user perception and emotional state. Adding irritability as a factor allows for a more comprehensive assessment of user experience and mental workload, accounting for the potential emotional strain caused by auditory stimuli. The NASA-TLX uses a scale with 21 gradations, where 1 is the most positive value and 21 the most negative, except for the Performance scale, which is framed positively and whose values are hence reversed. At the end of the experimental session, participants were asked to select the best and the worst hazard awareness displays (the auditory, the visual or the audio-visual display) with respect to the following categories: pleasantness, capacity to capture attention, capacity to provide a sense of safety, capacity to convey a sense of distance to the origin of hazard, and preference. The selections made by every participant were used to initiate discussion in the semi-structured interview that followed, with the aim of obtaining further insights from participants regarding their experience during the three conditions. We aimed to analyze aspects such as how they perceived their safety, whether the danger was correctly conveyed, the overall experience they had in each condition, and any design improvements they suggested. These questions provided qualitative insights that enhance and complement the quantitative metrics obtained during the user study, helping us identify issues or positive aspects experienced by users that may not be fully captured by the quantitative data alone. This qualitative feedback allows us to contextualize the metrics, ensuring a more comprehensive evaluation and refinement of the system. The interview was enriched by discussing the participant’s responses to the extended RTLX questionnaire. The interviews were recorded and subsequently transcribed, and the transcriptions were analyzed using the affinity diagramming methodology described in Lucero (2015).
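
As a small sketch of how the RTLX index reported in the results could be computed under these conventions (an assumption on our part, not code from the study): the unweighted mean of the six original subscales on the 21-point scale, with Performance reversed and the added Irritability item excluded.

```csharp
// Sketch only: one consistent way to compute the RTLX index described above.
// Subscale values are on the paper's 1..21 scale; Performance is reversed because
// it is framed positively; the added Irritability item is not part of the index.
public static class RawTlx
{
    public static float Index(int mental, int physical, int temporal,
                              int performance, int effort, int frustration)
    {
        int reversedPerformance = 22 - performance;   // reverse a 1..21 scale (1 <-> 21)
        return (mental + physical + temporal + reversedPerformance + effort + frustration) / 6f;
    }
}
```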

During the execution of the conditions and the post-task questionnaires, two experimenters were involved. One of them observed and interacted with users throughout the entire process, while the other was solely an observer. Both experimenters participated in the development of the data analysis and the creation of the affinity diagram.

5 Results

We analyzed both quantitative and qualitative data collected during the study. Quantitative results included objective observed data and quantified subjective data obtained through questionnaires. Qualitative data were obtained from the semi-structured interviews conducted at the end of each study session.

For the analysis of quantitative data, we used effect size (ES) estimation techniques, specifically Cohen’s d, along with 95% confidence intervals (CI), instead of null-hypothesis significance testing (NHST). Cohen’s d quantifies the size of the difference between means, offering a more informative basis for evidence accumulation. Additionally, the 95% CI provides a range of values within which the true effect size is likely to be contained, offering a clearer picture of the precision and reliability of the estimated effect and allowing for a more informed interpretation of the results. Cohen’s d provides a standardized measure of effect size by quantifying the difference between two group means in terms of standard deviations. The interpretation of Cohen’s d values is typically guided by the following thresholds: a |d| of 0.2 or less reflects a small difference between groups; values from 0.2 to 0.5 indicate a small to medium effect, a more pronounced difference that is not yet medium; values from 0.5 to 0.8 indicate a medium, practically meaningful difference; and a |d| of 0.8 or more represents a large, substantial difference between groups. These thresholds are based on conventions established by Jacob Cohen Cohen (2013, 1992) and are widely used to contextualize research findings, enabling researchers to assess the strength of observed effects. In this way, readers can extract their own critical conclusions, as currently recommended for user studies in human-robot interaction Cumming (2014); Dragicevic (2016).
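
For reference, a minimal sketch of these computations is given below; whether the study used the pooled-SD or the paired-samples variant of Cohen's d is not stated, so the pooled-SD form and a normal-approximation CI are shown as assumptions.

```csharp
// Sketch only: Cohen's d with a pooled standard deviation, the |d| bands used in the
// text, and a normal-approximation 95% CI of a sample mean (mean +/- 1.96 * SE).
using System;
using System.Linq;

public static class EffectSize
{
    public static double CohensD(double[] a, double[] b)
    {
        double meanA = a.Average(), meanB = b.Average();
        double varA = a.Sum(x => (x - meanA) * (x - meanA)) / (a.Length - 1);
        double varB = b.Sum(x => (x - meanB) * (x - meanB)) / (b.Length - 1);
        double pooledSd = Math.Sqrt(((a.Length - 1) * varA + (b.Length - 1) * varB)
                                    / (a.Length + b.Length - 2));
        return (meanA - meanB) / pooledSd;
    }

    public static string Band(double d)
    {
        double ad = Math.Abs(d);
        if (ad < 0.2) return "small";
        if (ad < 0.5) return "small to medium";
        if (ad < 0.8) return "medium";
        return "large";
    }

    public static (double low, double high) Ci95(double[] x)
    {
        double mean = x.Average();
        double sd = Math.Sqrt(x.Sum(v => (v - mean) * (v - mean)) / (x.Length - 1));
        double half = 1.96 * sd / Math.Sqrt(x.Length);   // normal approximation
        return (mean - half, mean + half);
    }
}
```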

Table 2 Means, standard deviations (in parentheses) and 95% confidence intervals of the quantitative data for each condition
Table 3 Cohen’s d values of the quantitative data, compared between conditions

5.1 Observed quantitative results

Fig. 5 shows all the time-related data measured during the experiments. The top row summarizes the time spent inside the different hazard levels. For every level of danger, there was almost no difference between conditions in the total average time spent in them (see the Time in Yellow, Time in Orange and Time in Red columns in Table 2). In every condition, participants spent the bulk of the time in yellow and orange hazard zones, with less time spent in orange zones. Looking at the Cohen’s d values shown in Table 3, there is no difference in ES in the yellow and red areas, since all |d| values are smaller than 0.2. In the orange area, the effect size between audio and visual was also negligible, with a value of 0.0065. However, both audio vs audio-visual (d = \(-\)0.2279) and visual vs audio-visual (d = \(-\)0.2174) showed small negative effects, indicating that the audio and visual conditions involved slightly less time spent in orange zones than the audio-visual condition. The mean time spent in red zones was almost 0 s in every condition. In terms of the time needed to perform the task, and the percentage of task time spent inside hazardous spaces (see the Time Performing a Task and % of Task Time in Danger columns in Table 2), there was also almost no difference between modalities, as confirmed by the Cohen’s d values in Table 3. As shown, the percentage of time spent inside the hazardous area was above 50% in all modalities. As explained before, to perform the task the users had to enter the reachable space of the robot, and therefore the hazardous space. But as observed in Table 2, users spent most of that time inside the yellow area (low level of hazard). Regarding the total time spent inside hazardous areas while performing a task, only small differences were observed between conditions, although the average time in the audio-visual condition (28.76 s (SD = 17.10), \(CI_{95\%}\)[24.81, 32.71]) was slightly larger than in the audio (25.65 s (SD = 12.91), \(CI_{95\%}\)[22.66, 28.63]) and visual (27.0 s (SD = 14.33), \(CI_{95\%}\)[23.69, 30.31]) conditions. The Cohen’s d values further indicate that the time spent in hazardous areas in the audio condition was slightly less than in the audio-visual condition, with a value of \(-\)0.2054, suggesting a small to medium effect size.

Fig. 5

This image presents the plots for the quantitative data, showcasing the mean values, 95% CI, and distributions of each condition using violin plots. The violin plots are constrained to the maximum and minimum values observed in the data. Top row: Time (seconds) spent in zones with different levels of danger. Bottom row: total time spent in danger zones of any level, task completion time and the relative proportion between the two. (A: audio display only, V: visual display only, AV: both displays)

5.2 Questionnaire results

Figure 6 summarizes the answers to the SEQ questionnaire, rating on a 7-point Likert scale how easy it was to perform the task in each condition Sauro (2012). On average, in all the modalities, task execution was rated as easy, with small differences in the mean values and with mostly overlapping confidence intervals. The average ease rating with the audio-visual display (5.79 (SD = 1.32), \(CI_{95\%}\)[5.26,6.31]) was slightly lower than with the audio display (6 (SD = 0.93), \(CI_{95\%}\)[5.62,6.37]) and the visual display (6 (SD = 0.78), \(CI_{95\%}\)[5.68,6.31]). This is also confirmed by the Cohen’s d values, as none of them was over 0.2 (audio vs visual: d = 0; audio vs audio-visual: d = 0.1825; visual vs audio-visual: d = 0.1924). Thus, the type or amount of hazard information provided did not seem to affect how easy the task appeared to be.

Fig. 6

This image presents the plots for the responses to the SEQ questionnaire, showcasing the 95% CI, mean values, and distributions of each condition using violin plots. The violin plots are constrained to the maximum and minimum values observed in the data

Fig. 7

Results obtained from the extended raw NASA-TLX questionnaire, presented by condition, with 95% confidence intervals (error bars), mean values, and violin plots representing the data distribution. The irritability category was added as an extension to the questionnaire, and therefore it was not used to calculate the RTLX index. Numerical values of means and confidence intervals are shown in Table 4

Table 4 Results from the extended raw NASA-TLX questionnaire: average values, with standard deviation and 95% confidence intervals
Table 5 Cohen’s d results from the extended raw NASA-TLX questionnaire

In analyzing the extended raw NASA-TLX questionnaire, irritability (the added category) was not used to calculate the RTLX index. As shown graphically in Fig. 7 and numerically in Table 4, the differences in mean values observed between conditions were very small in the 6 categories that form the questionnaire, and in the RTLX index itself. However, despite the large overlaps between modalities, some small to medium differences in ES were observed in Table 5. There were small to medium differences in performance when comparing audio vs visual (d = \(-\)0.2166), and similar values were observed for visual vs audio-visual (d = 0.2455). In the frustration category, the same range of ES differences was observed in the visual vs audio-visual comparison (d = \(-\)0.2054). In the irritability category, small to medium ES differences are again noted in audio vs audio-visual (d = \(-\)0.2263) and visual vs audio-visual (d = \(-\)0.2462). Although this category was added because auditory displays are often found to be irritating, this was not evident when compared with the visual display condition. Finally, in the RTLX index, a small to medium difference is also observed in the visual vs audio-visual comparison (d = \(-\)0.2310).

Fig. 8

Results from the post-study questionnaire. The graph shows the number of participants that selected each condition as the best (upwards) and the worst (downwards), for the categories shown

Results from the analysis of the post-study questionnaire administered at the end of the experiment are shown in Fig. 8, in which users were asked to select the best and worst condition in the categories pleasantness, capture of attention, sense of safety, perception of distance to danger origin, and preference.

In the pleasantness category, the auditory display gathered most of the negative votes, which was expected from the literature on auditory displays. This was also the reason why we added an Irritability category to the NASA-TLX questionnaire, although no difference between displays was observed there (Fig. 7).

In all the other categories, the audio-visual display received the most positive votes, well ahead of the single-modality displays in every case. For these categories, negative votes were fairly evenly split between the single-modality displays. An exception was the capture-of-attention category, in which the visual-only display was clearly considered the worst. In the sense-of-safety category, however, the auditory display received more worst votes than the visual display.

5.3 Interview results

All the semi-structured interviews conducted at the end of each participant’s session were first transcribed, and user comments were then organized using post-it notes, following the methodology outlined in Lucero (2015).

Each post-it note was formatted as follows:

  • Top left corner: user number.

  • Top right corner: modality, and the trajectory used for that modality.

  • Middle: user’s comment.

  • Bottom right corner: post-it number.

Each post-it thus recorded the user’s comment, the user number, the modality the comment referred to, the task performed in that modality, and the post-it number (see Fig. 9b). Once all the comments were transcribed onto the post-it notes, a team of three people read through the post-its individually. After the initial reading, the team discussed the post-its one by one and started to identify potential clusters. This iterative process of reading and discussion was repeated multiple times, ensuring thorough engagement with the data.

During these discussions, the team worked collaboratively to categorize the post-it notes into broader classes and subclasses. We further refined all classes by indicating whether the comments were positive or negative. In instances where a comment was relevant to more than one class, the corresponding post-it note was duplicated (with the duplicate retaining the original numbering) and placed in the appropriate additional classes. This whole process took 7 days and a total of 21 h.

Fig. 9

Affinity diagram created from the analysis of subjective participant statements collected during semi-structured interviews. a Affinity diagram formed after clustering all the handwritten post-its. b Handwritten post-its used in the affinity diagram, showing information about the user (#01-#24), the modality discussed (A: audio display only, V: visual display only, AV: both displays), the trajectory used for the modality (T1, T2 or T3), the post-it number and the comment

Once all team members agreed with the distribution and clustering of the post-its (see Fig. 9a), they were transcribed digitally using Excel. This digital transcription facilitated further analysis and visualization. When all the data had been entered into Excel, it was transformed into a graph to enhance the readability of the results.

The resulting affinity diagram graph is shown in Fig. 10. We identified 5 main groups in which to classify all the interview data: Description of Danger, Safety Perception, Experience and Perception, Versatility and Design Features. These main groups were divided into the different characteristics shown in Fig. 10.

Fig. 10

Graph obtained from affinity diagram created through the analysis of subjective participant statements collected during the semi-structured interviews. The graphs under each leaf category inform about the number of participants that provided positive (up in the scale) or negative (down in the scale) comments about that category, in reference to a condition in the study

The transcription of all participant comments produced 173 post-it notes, which were most successfully organized in the 5 branch categories just mentioned, which contained 14 leaf categories in total. The following is an outline of each leaf, from left to right:

  • Complete information: Comments on how complete the information received was (“The information of the hazard is more complete” P24).

  • Change of status: Awareness of transitioning between different levels of hazard (“The visual display helps to identify a change in the level of danger I am exposed to” P1).

  • Direction of danger: Awareness of the direction of the source of danger, with respect to the participant (“You can understand the direction and distance to danger” P12).

  • Distance to center of danger: Awareness about how far the center of the danger was (“You can understand the direction and distance to danger” P12).

  • Safety perception: Comments about perceived safety (“I felt more safe with audio modality” P11).

  • Time pressure: Experiencing urgency to complete the task (“It pressures me to make the task fast” P20).

  • Mental demand: Comments about how mentally demanding the task was (“It provides too much information” P3).

  • Irritability: Comments about displays being irritating (“The audio is irritating” P17).

  • Capture attention: Comments about the ability of a display to capture attention (“It is irritating but it captures your attention” P24).

  • Pleasantness: Considerations about how pleasant a display was (“All of them are pleasant” P6).

  • Intuitiveness: Considerations about how intuitive the interaction with a display was (“It is more simple and hence more intuitive” P22).

  • Help in movement: Comments about a display being particularly helpful while the user was moving from one place to another in the shared workspace (“while walking, the shapes are very helpful” P6).

  • Multitasking: Comments about the display providing support for the participant to simultaneously focus on task execution and monitoring of the robot (“It helps with safety, while you are concentrating on performing a task” P18).

  • Shapes: Comments that mention the outside shape of the volumes for different hazard levels (only relevant for visual information) (“The shapes help to understand the direction of the hazard” P10).

More quotes from the participants, representative of the data collected from the interviews, are included in the discussion section.

6 Discussion

The core conclusion from this study is that all three modality conditions were suitable for an HRC scenario, since almost no differences were observed between values and ESs (none of those observed reached a medium difference), although each display presented specific strengths, mostly impacting UX rather than performance.

In all three conditions the difficulty of the experimental task was rated as low (see Fig. 6) and no differences were observed between ESs. In the NASA-TLX categories, some small to medium differences in ESs were noted (see Table 5). Users reported slightly better performance in the visual condition compared to the audio-only and audio-visual conditions. The audio-visual condition was found to be more frustrating than the visual condition and more irritating than both the audio and visual conditions. The worse reported performance and higher frustration appear to negatively impact the RTLX index: the audio-visual condition shows a small to medium negative difference in ES compared to the visual condition, indicating a worse task load index. Regarding time-related metrics (see Fig. 5), the differences observed between conditions were also small. While being cautious not to read too far into these time-related results, the small differences seem to suggest that, with both visual and auditory information simultaneously on display (AV condition), participants dared to venture further into higher-risk regions (working in closer proximity with the robot), and for longer (e.g., time spent in the orange area). This observation is supported by the Cohen’s d values, where a small to medium difference was observed for the audio-visual condition compared to the other conditions in time spent in the orange area, indicating that users tended to spend more time inside the orange area under this condition. Rather than interpreting this as a worse performance for remaining in the safer yellow region during task execution, it might be understood that more complete awareness information (complementary and redundant from both sensory channels) provided the necessary reassurance and confidence for participants to remain closer to the robot, although this did not result in a lower task completion time. As for the highest-risk situations, participants successfully remained away from the no-go red region with every display design, suggesting that all three conditions might be similarly effective in helping users remain at a safe distance from the robot.
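For readers who want to interpret the effect-size language above, the usual conventions place small, medium and large Cohen’s d at roughly 0.2, 0.5 and 0.8. The following is a minimal sketch of how such a value can be computed from two condition samples; the study does not specify which variant of the statistic was used, and all names and numbers in the example are illustrative rather than actual study data.

```python
import numpy as np

def cohens_d(sample_a, sample_b):
    """Cohen's d with a pooled standard deviation for two samples.

    This is the classic formulation; the study does not state which exact
    variant of the statistic was computed, so this is only illustrative.
    """
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1))
                        / (n_a + n_b - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical time-in-orange-area values in seconds (NOT study data)
audio_visual = [12.1, 9.8, 14.3, 11.0, 13.5]
visual_only = [11.6, 9.2, 13.8, 10.5, 12.4]
print(f"d = {cohens_d(audio_visual, visual_only):.2f}")  # d = 0.36, small to medium
```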

The affinity diagram analysis of the interview data allowed for a more nuanced understanding of the UX in the different conditions. Leaf categories such as “complete information” confirmed the previous idea that participants consciously realized that they were better informed in the audio-visual (AV) condition (15/24 participants provided statements to this effect). In the “direction of danger” category, it is remarkable that this aspect of situation awareness seemed to be boosted by the combination of both modalities, and the capacity of the display to capture attention was also perceived to be best in that same case. Determining the distance to the robot also seemed to benefit from the simultaneous perception of both modalities, as supported by statements such as “the audio-visual display helped me understand if the robot was approaching or leaving and understanding the state I was in with respect to it” (participant P23). This was also reflected in the post-study questionnaire, where most users voted positively regarding the safety-related aspects conveyed by the audio-visual display (capture of attention, feeling of safety and distance to the center of danger). The audio-visual display was in fact voted the most preferred display.

As a result, the subjective perception of safety (the “Safety Perception” leaf category) was highest in the bimodal condition, with no negative aspects stated about it in the study.

Reportedly, the visual display on its own helped users move in the shared workspace with respect to the also-moving robot (“Help in movement” leaf category). The singular characteristic of the visual display that provided this assistance was the visible shape of each hazard zone. Summarized in the “Shapes” leaf category, 4/24 participants provided positive statements, such as “while walking, the shapes are very helpful” (P6). This could be because the collaboration took place in a shared space in which the relative distance between robot and human (the source of a potential hazard for the human) varied when either, or both, moved. The visible shapes of the hazard zones were then seen in changing perspective, which made their extent in space easier to understand. While this visual shape feature was also present in the multimodal condition, participants referred to it only in relation to the single-modality visual display, possibly because it filled in a cognitive gap left by the absence of spatial auditory feedback. Although most data seemed to indicate a superiority of the multimodal display, a few participants (3/24) praised the simplicity and intuitiveness of the visual-only display, e.g., “the visual display is more simple and easier to understand” (P18). This simplicity might have also helped reduce mental demand, which appeared to be marginally lower on average in the corresponding NASA-TLX category (Fig. 7). However, the distinct modality displays provided the strength of complementarity in situations such as when the user was static and the robot moving: the auditory information appeared to help with capture of attention (Fig. 8, and also the “Capture attention” leaf category in Fig. 10), and the visual display was of assistance in understanding the level of danger (the “Change of status” leaf category in Fig. 10).

As for the auditory display, it was found to be the most unpleasant of the displays, both in the post-study questionnaire (Fig. 8) and in statements from the interviews (“Pleasantness” leaf category, Fig. 10). 13/24 participants made comments such as “this display is unpleasant” (P20). However, this perception did not translate with any clarity into the scores of the “Irritability” category introduced as an extension of the NASA-TLX questionnaire (Fig. 7). In this category, the audio-visual condition showed a small to medium effect size difference compared to the other two conditions, indicating that participants might have found the audio-visual condition more irritating. The mildly unpleasant quality of the audio is thus thought to contribute to capturing attention efficiently. 16 out of the 24 participants spoke positively about these aspects of the auditory display, of whom 11 also indicated that the display was unpleasant. Of these 16 participants, 13 commented that the audio-visual display caught attention, which might be due to the auditory component: “it catches attention better than the visual display” (P15). While the auditory display seemed to be superior to the visual display at capturing attention (expressed clearly by participants in the post-study questionnaire), a minority of participants (4/24) also attributed this capacity to the visual display in their interview statements, suggesting that personal preference might also play a role: “visual captures attention better than in audio” (P16). Another strength of the auditory information (probably linked to its attention-grabbing capacity) was the assistance it offered for multitasking, meaning that the participant could focus on executing the task while also monitoring the changing proximity to the robot. 8/24 participants provided statements supporting this notion, such as “it helps with safety, while you are concentrating on performing a task” (P18).

7 Design and implementation limitations

One limitation comes from the constraints of current technology in recreating an ecologically valid visual field. Although HoloLens 2 has a wider field of view (54°) than its preceding version (34°), it is still much narrower than the field of view of human vision. For this reason, we used a swiveling-arrow-based visual aid as a compensatory strategy, albeit with limited effect.
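To illustrate the compensatory strategy, the following sketch shows one possible way of orienting such an arrow from the head pose and the hazard position. It is a minimal, yaw-only approximation under our own assumptions (hypothetical function names and axis convention), not the implementation used in the study.

```python
import math

def arrow_yaw_deg(head_pos, head_forward, hazard_pos):
    """Signed yaw angle (degrees) from the head's forward direction to the
    hazard, projected onto the horizontal plane. Positive values mean the
    hazard lies to the right under the assumed axis convention (y up)."""
    dx = hazard_pos[0] - head_pos[0]
    dz = hazard_pos[2] - head_pos[2]
    angle = math.degrees(math.atan2(dx, dz)
                         - math.atan2(head_forward[0], head_forward[2]))
    return (angle + 180.0) % 360.0 - 180.0  # wrap to [-180, 180]

def arrow_needed(yaw_deg, fov_deg=54.0):
    """Show the swiveling arrow only when the hazard falls outside the
    (roughly 54 degree) field of view of HoloLens 2."""
    return abs(yaw_deg) > fov_deg / 2.0

# Example: hazard behind and to the left of the user
yaw = arrow_yaw_deg(head_pos=(0.0, 1.6, 0.0),
                    head_forward=(0.0, 0.0, 1.0),
                    hazard_pos=(-1.0, 0.8, -2.0))
print(round(yaw, 1), arrow_needed(yaw))  # -153.4 True
```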

Regarding the performance capacity of HoloLens 2, a few limitations determined how the shapes’ values were calculated. HoloLens 2 (Footnote 3) has a Qualcomm Snapdragon 850 Compute Platform CPU with 4 GB of LPDDR4x system DRAM. In terms of software, it already has built-in functionalities for human understanding such as hand tracking, eye tracking and voice command recognition. These functions consume HoloLens processing resources. This, together with the rendering of colored shapes for hazard representation in the visual display, results in occasional saturation. For this reason, some calculations had to be simplified, such as using the robot’s actual position and target position to calculate the orange shape, instead of computing the area from all the points of the planned path.
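To make this simplification concrete, the sketch below approximates the orange zone as a capsule (a swept sphere) between the robot’s current position and its target position, instead of sweeping over every waypoint of the planned path. All names, coordinates and the radius are hypothetical; this is not the geometry code used in the study.

```python
import numpy as np

def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (3D numpy arrays)."""
    ab = b - a
    denom = float(ab @ ab)
    t = 0.0 if denom == 0.0 else float(np.clip((p - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))

def inside_orange_zone(user_pos, robot_pos, robot_target, radius=0.8):
    """Treat the orange hazard zone as a capsule between the robot's current
    and target positions. The 0.8 m radius is an arbitrary example value."""
    return point_to_segment_distance(np.asarray(user_pos, dtype=float),
                                     np.asarray(robot_pos, dtype=float),
                                     np.asarray(robot_target, dtype=float)) <= radius

# Example check with hypothetical coordinates in metres
print(inside_orange_zone(user_pos=(0.5, 1.0, 1.0),
                         robot_pos=(0.0, 1.0, 0.0),
                         robot_target=(0.0, 1.0, 2.0)))  # True: 0.5 m from the segment
```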

The resolution of the last two constraints discussed depends on advances in the technology used in future HMD devices.

Additionally, while the use of pre-defined and consistent trajectories was beneficial for maintaining balance and control in the evaluation, it also represents a limitation of this study. Collaborative robotics can also involve dynamic and adaptive trajectories, which were not considered in our experimental setup. Future research should incorporate dynamic trajectories to evaluate the effectiveness of feedback modalities in scenarios where such trajectories are required, providing a more comprehensive assessment of their impact on safety and interaction.

Finally, for detecting collisions with the different levels of hazard, we used a framework that creates 3D bounding boxes around the user’s head, body and hands. However, the body bounding box moved according to the participant’s head rather than their actual body, since HoloLens only supports tracking of the head and hands (depicted in Fig. 2II). Although the system correctly detected all collisions, more detailed body tracking might be needed for more dangerous scenarios.
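For clarity, the following sketch illustrates the kind of bounding-box collision test described, with the body box simply hung below the tracked head position (the limitation noted above). Class names, box sizes and coordinates are hypothetical; the framework used in the study is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class AABB:
    """Axis-aligned bounding box given by its centre and full size (metres)."""
    cx: float
    cy: float
    cz: float
    sx: float
    sy: float
    sz: float

    def intersects(self, other: "AABB") -> bool:
        return (abs(self.cx - other.cx) * 2 <= self.sx + other.sx and
                abs(self.cy - other.cy) * 2 <= self.sy + other.sy and
                abs(self.cz - other.cz) * 2 <= self.sz + other.sz)

def user_boxes(head_pos, hand_positions):
    """Head, body and hand boxes. The body box is anchored below the tracked
    head because HoloLens 2 tracks only head and hands (sizes are examples)."""
    hx, hy, hz = head_pos
    boxes = [AABB(hx, hy, hz, 0.3, 0.3, 0.3),        # head
             AABB(hx, hy - 0.9, hz, 0.5, 1.5, 0.4)]  # body, hung below the head
    boxes += [AABB(px, py, pz, 0.2, 0.2, 0.2) for (px, py, pz) in hand_positions]
    return boxes

def collides_with_zone(boxes, zone: AABB) -> bool:
    """True if any user box overlaps the given hazard-zone box."""
    return any(box.intersects(zone) for box in boxes)

# Example: hazard zone around the robot (hypothetical values)
zone = AABB(0.0, 1.0, 1.0, 1.0, 1.0, 1.0)
user = user_boxes(head_pos=(0.6, 1.7, 1.0), hand_positions=[(0.4, 1.2, 1.1)])
print(collides_with_zone(user, zone))  # True: body and hand boxes overlap the zone
```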

8 Conclusion and future work

Motivated by the increasing use of fenceless robots and cobots, we wanted to develop displays capable of providing awareness about the potential hazards that could arise from working alongside a robot. In collaborative scenarios in which both robot and human worker moved in the shared space, the goal was for the user to be adequately informed to decide confidently about their positioning with respect to the robot, minimizing exposure to hazard without the robot having to reduce the efficiency of its task execution.

The performance of the participants in the study, in terms of remaining in safe proximity to the robot during collaboration, was found to be similar with any of the displays. Strengths of each display design were identified through the user study, and these modulated the UX obtained by the user.

The multimodal version of the display, presenting awareness information simultaneously through the spatial auditory and visual channels, resulted in the most effective hazard awareness display, as well as the preferred one. The triangulation of quantitative and qualitative results from the study supports the notion that more complete awareness information (complementary and redundant from both sensory channels) conveyed the reassurance participants needed to remain closer to the robot during collaboration. Such a combination of information from different modalities helped participants focus on the collaborative task while remaining aware of the level of hazard they were exposed to at each moment. Results suggest that participants in the study were aware of their changing relative position with respect to the robot, in terms of relative direction and distance. Alongside the best perception of safety provided by the multimodal display, saturation of information was not found to be an issue most of the time, although a minority of participants preferred the simplicity of single-modality displays.

Changing contextual and environmental conditions in the collaborative scenario (including lighting and noise), together with the diversity of profiles of human workers (with different preferences and perceptual capacities), may require situation awareness to rely solely on information from one of the sensory channels. In that sense, the study showed that the visual-only display helped users make quick decisions about how to move around the robot or position themselves with respect to it, when either one or both agents were moving in the shared space. The specific feature of the visual display that seemed to help with this was the visible color-cast shape that enveloped the robot and evolved dynamically as the robot executed its trajectories. The visual information was also the most effective at signaling transitions between regions with different levels of hazard on the three-level discrete scale used.

As for the auditory-only display, its main strength was its capacity to capture the attention of the user. The capacity of the auditory signals to be noticed while the user was focusing on the task was probably due to their mildly unpleasant quality. Although results did not indicate that such unpleasantness was excessive, users’ acceptance of listening to the auditory display over the long term is an important safety consideration, and careful tuning of the auditory display parameters should be attempted to strike an optimal balance between effectiveness and acceptance of the auditory hazard awareness display.

A final important consideration must be made regarding the suitability and acceptability (e.g., from a safety regulations point of view) of granting the human end user control over the relative position and movement between worker and robot. We propose that good conscious awareness of the situation relative to a robot and its actions opens up a desirable avenue for human workers to have control over how the collaboration should evolve. However, we are very aware that this most sensitive question of safety at work requires that wrong judgements by the worker can never lead to a situation that might result in harm. Thus, beyond the motivation of reaching the best UX in HRC, external systems should still ultimately ensure safety (Footnote 4, Footnote 5).