
1 Introduction

Automation of technical systems has greatly increased, but it is usually still the human operator who is responsible for the safety and effectiveness of the human-machine system. Hence, adverse user states that impair the operator’s effectiveness can have severe consequences, especially in safety-critical task environments. A notable example is the crash of Air France flight AF447 in 2009. According to the flight accident report [1], a technical failure followed by an autopilot disconnection provoked multiple adverse mental states in the pilots (e.g. confusion, overload, inadequate situation awareness). As a consequence, the pilots were unable to regain control of the aircraft and it crashed (cf. [2] for a more detailed discussion).

To prevent such critical user states in highly automated systems, approaches to adaptive system design have been developed, e.g. Adaptive Aiding [3], Adaptive Automation [4], and Augmented Cognition [5]. In adaptive human-machine systems, technology adapts its behavior to the current state of the human operator in order to mitigate critical user states and performance decrements. However, researchers have faced a number of challenges in transferring these approaches from the laboratory to the real world. One major challenge is that human-machine systems in the real world are subject to a wide range of influencing factors that are often not accounted for in laboratory settings. Researchers have therefore concluded that context, environmental parameters, system state, and goals must be considered in order to apply adaptation strategies successfully (e.g., [6,7,8]).

Additionally, as illustrated by the crash of flight AF447, human error and performance decrements can result from multiple critical user states that are intertwined and interrelated. Hence, a one-dimensional consideration of user state (e.g., focusing on high workload alone) may not be sufficient to select and apply appropriate adaptation strategies in real-world settings. We have therefore proposed to also account for the multidimensional nature of user state in adaptive system design [2]. Our approach of a multidimensional user state assessment focuses on six dimensions of user state that we found can substantially impact human performance, namely workload, fatigue, attention, situation awareness, motivational aspects of engagement, and emotional states characterized by negative valence and high arousal. Evidence of their effect on performance is summarized in Table 1.

Table 1. User states considered in RASMUS and their effect on user performance

This paper introduces our diagnostic engine “RASMUS” (Real-Time Assessment of Multidimensional User State), which enables the technical system to detect performance declines of the user and to infer their potential causes from diagnosed critical user states and related environmental and individual impact factors. The next section outlines the main findings of our analysis phase and summarizes the corresponding design requirements used for the conceptual design of RASMUS. The generic conceptual framework is detailed in Sect. 3. Subsequently, we present a proof-of-concept implementation of RASMUS that provides real-time diagnoses for high workload, passive task-related fatigue, and incorrect attentional focus in an air surveillance task. We also present preliminary results of a recent validation study (Sect. 4). The article concludes with a discussion and an outlook on future work (Sect. 5).

2 Design Requirements

Analysis of related work revealed that previous studies have mainly focused on just one dimension of user state. For example, workload is often used as a trigger for Adaptive Automation [16,17,18]. Many studies also focus on a specific field of application, e.g. command and control, aviation, or driving. In contrast, our multidimensional approach is more holistic and not domain-specific. During the review phase of our work we therefore analyzed studies across different domains focusing on different types of user states. These literature reviews led to the identification of five aspects that appeared particularly relevant to our approach. The following subsections address each of the five aspects and describe the design requirements derived from them for the conceptual development of RASMUS. The findings and design requirements are summarized in Sect. 2.6.

2.1 Indicators of User State

The six user states considered for multidimensional user state assessment are hypothetical constructs that cannot be measured directly. However, various assessment methods have been established to provide indicators of those states: subjective measures, performance-based measures, physiological and behavioral metrics, as well as model-based assessment. Considering their properties, psychophysiological and behavioral metrics appear particularly well-suited for user state assessment in adaptive systems, as they provide continuous indicators of user state in real time. Moreover, many of those metrics can be measured by sensors that are rather unobtrusive (e.g. remote eye trackers). However, previous research has also revealed some shortcomings:

  • Physiological measures are influenced not only by user state but also by factors unrelated to user state, which may cause misleading results [5, 19]; e.g., pupil dilation is also influenced by lighting conditions.

  • Physiological measures are not indicative of a single user state; e.g., heart rate can increase due to high workload, but also as a result of emotional states with high arousal such as anxiety or anger [20].

  • Adaptive systems should address the causes rather than the symptoms of critical user states and performance decrements [7]. Yet even when physiological and behavioral measures do reflect changes in user state, they provide no information about what provoked those changes.

To compensate for these shortcomings, our approach of multidimensional user state assessment combines physiological and behavioral measures (e.g. pupil dilation, heart rate, breathing rate, mouse click frequency) with environmental and individual impact factors on user state. Environmental factors comprise all factors that externally impact user state and performance, including task properties, context factors, ambient conditions, objectives, and events. Individual factors refer to long-term and short-term properties of the human that internally impact his or her state and performance (e.g. level of experience, capabilities, skills, constitution, mood, or well-being). These indicators were identified in previous analyses and integrated into a self-developed generic model of user state (cf. [2]).
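
As an illustration of how these heterogeneous indicators could be represented in software, the following Java sketch shows one possible data model; all type and field names are our own illustrative assumptions, not the actual RASMUS types.

  /** Illustrative sketch of an indicator data model (hypothetical names). */
  public final class UserStateIndicators {

      /** The four categories of measures combined in the assessment. */
      public enum Source {
          PHYSIOLOGICAL,   // e.g. pupil dilation, heart rate, breathing rate
          BEHAVIORAL,      // e.g. mouse click frequency
          ENVIRONMENTAL,   // e.g. number of current tasks, scenario events
          INDIVIDUAL       // e.g. level of experience, mood, well-being
      }

      /** A single time-stamped indicator sample. */
      public record Sample(String name, Source source, double value,
                           long timestampMillis) {}

      private UserStateIndicators() {}
  }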

2.2 Self-Regulation of the Human Operator

When designing an adaptive technical system, it must be considered that humans are adaptive systems themselves. By applying self-regulation strategies, e.g. investing more effort when task demands increase or drinking coffee to combat fatigue, the human operator is also able to mitigate critical user states. For Adaptive Automation, researchers strongly recommend considering the human’s effort-regulation processes [9, 21]. Veltman and Jansen [21] point out that increases in workload can be a sign of successful adaptation of the operator to changing task demands, as he or she is investing more effort to meet those demands. Thus, if these changes in workload were used by the technical system to reduce the task load, counterproductive interaction between the two adaptive systems might result. However, as accidents caused by critical user states indicate, there are also situations (e.g. extreme underload or overload) in which the operator’s state regulation processes fail to maintain the operator’s effectiveness. Veltman and Jansen [21] therefore propose that adaptive technical systems are more likely to work successfully if adaptation is triggered only in those situations in which the operator’s effort regulation mechanisms are unable to react adequately to changing task demands. As our approach does not focus solely on workload as a trigger and is not limited to adjusting the level of automation, we extend this recommendation and suggest that technical adaptation strategies should not counteract any productive self-regulation strategies of the operator. Consequently, we decided to use performance measures as a trigger for adaptation, because a decline in performance is a clear indication that self-regulation has failed and the operator needs support.

2.3 Individual Differences

User state is often examined at the group level, with studies trying to demonstrate significant effects of assessment methods between task conditions (e.g. [19, 22]). However, individuals differ in their physiological reactions. Accordingly, physiological measures that are sensitive at the group level have been shown to lack the sensitivity in single-trial analysis needed for real-time adaptation (e.g. [23]). We were able to replicate these findings in previous experimental studies [24, 25]. The results supported the sensitivity of most physiological measures for indicating changes in user state and performance at the group level. At the individual level, however, we found outcomes to vary strongly between individuals even when using normalized data. This finding may indicate that the sensitivity of a physiological metric is user-specific, meaning that certain measures are sensitive indicators for some subjects but not for others. Veltman and Jansen [23] make two suggestions for dealing with individual differences in Adaptive Automation: (1) increase the sensitivity at the individual level by combining different physiological measures, and (2) use individual sets of baseline data from different sensors to select those measures that are most sensitive for the given individual. Additionally, it might be useful to weight indicators in the assessment according to their user-specific sensitivity. However, individual weighting of indicators would only be appropriate if the indicators’ sensitivity were temporally stable.

2.4 Temporal Stability

To assess the temporal stability of our findings, we conducted a retest experiment one year after the initial experiment, involving the same task conditions and participants (cf. [25]). Related work testing the temporal stability of physiological measures is rather sparse, and findings are reported predominantly at the group level (e.g. [26]). These studies indicate moderate levels of temporal stability for different kinds of physiological measures. Likewise, our retest experiment confirmed the temporal stability of three out of four tested physiological metrics at the group level. However, analysis at the individual level revealed that outcomes differ strongly not only between but also within individuals from test to retest. Some indicators that showed high sensitivity for a participant in the first test showed rather weak sensitivity for the same participant in the retest, and vice versa.

We assume these variations over time resulted from environmental and individual impact factors that we could not control for in test and retest, e.g. learning effects, different degrees of initial fatigue or motivation, differences in mood, or differences in the fit and tracking quality of the sensors. In real-world settings, there are even more uncontrollable impact factors causing some indicators to be more and others less sensitive to variations in user state in a specific situation. Consequently, we propose to refrain from user-specific selection and weighting of user state indicators. Instead, the results highlight the importance of combining different kinds of measures to compensate for potentially biased or invalid measurements.

2.5 Oscillation

Researchers have pointed out that adaptation triggered by a threshold algorithm may evoke undesirable oscillation or “yo-yoing” [5, 27, 28]. Physiological gauges in particular have been observed to pass a predetermined threshold frequently, resulting in adaptations being switched on and off at short intervals. This oscillation of adaptation has been shown to have detrimental effects on operator performance, as it can confuse the user and increase workload [5]. To prevent oscillation effects, it has been suggested to smooth physiological measures, e.g. by filters or moving-mean estimation, and to define a minimum time interval between switches in adaptation (coined “refractory period” [27] or “deadband” [28]). As our approach uses performance decrements to trigger adaptation, these effects are largely avoided. Nevertheless, we follow these suggestions when using physiological measures for user state assessment.

2.6 Summary

Table 2 provides a summary of the main findings of our analysis phase and lists the corresponding design requirements we formulated for the conceptual development of RASMUS. Findings and design requirements refer to each of the five aspects detailed in Sects. 2.1–2.5 (indicated in the left column of Table 2).

Table 2. Summary of literature findings and corresponding design requirements

3 Real-Time Assessment of Multidimensional User State (RASMUS)

The “Real-Time Assessment of Multidimensional User State” (RASMUS) is part of a larger dynamic adaptation framework. A simplified model is depicted in Fig. 1. The “information processing” component represents the basic functionality of traditional technical systems in human-machine interaction. It analyzes and displays data from the environment and processes operator inputs. To enable adaptive behavior of the technical system, we added a state regulation component that is modeled after the four-stage model of human information processing and the corresponding classes of system functions proposed in [29]. It includes four stages of state regulation: data acquisition, user state assessment, action selection, and action implementation. The RASMUS diagnostics introduced in this paper address the first two stages of state regulation: the acquisition of data from the operator and the environment and the subsequent assessment of the user state. The stages of action selection and action implementation refer to the selection and application of appropriate adaptation strategies, accomplished through an Advanced Dynamic Adaptation Management (ADAM) detailed in another paper within this volume [30].

Fig. 1. Simplified model of our dynamic adaptation framework

The diagnostic process of RASMUS is broken down into four consecutive steps, depicted in Fig. 2: “Acquire and synchronize data”, “Evaluate need for support”, “Analyze critical user states and impact factors”, and “Display and forward diagnostic results”. The black arrows in Fig. 2 indicate the sequence of the diagnosis and state regulation process. Information gathered in one step is required not only for the immediately following step but also to accomplish subsequent steps. These dependencies are indicated by the grey arrows in Fig. 2. The four steps are explained in more detail below.

Fig. 2. Steps of the diagnostic process in RASMUS

3.1 Acquire and Synchronize Data (Step 1)

Literature and our empirical findings suggest combining different kinds of measures in order to obtain more robust and valid diagnoses (cf. Sect. 2). RASMUS therefore acquires data about physiological and behavioral reactions as well as environmental and individual impact factors. These measures are derived from different kinds of sources. Information about individual factors (e.g. level of experience) is obtained through questionnaires. Sensors, such as an eye tracker, an EEG headset, or a chest strap, provide physiological metrics. Additionally, the experimental system logs data on environmental parameters and user activity (e.g. number of current tasks, number of mouse clicks). These data streams are merged and synchronized in real time using the iMotions biometric research platform (iMotions, Inc., MA, USA). To normalize physiological measures, RASMUS records a baseline at the beginning of each session and compares subsequent data to the individual baseline state (cf. Sect. 2.3).
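
As a hypothetical illustration of this baseline mechanism (the class and method names are our own assumptions, not the published RASMUS code), an individual baseline could be recorded and applied roughly as follows:

  import java.util.ArrayList;
  import java.util.List;

  /** Hypothetical sketch: records a baseline for one physiological measure
   *  at session start and normalizes later samples against it. */
  public class BaselineNormalizer {
      private final List<Double> baselineSamples = new ArrayList<>();
      private double mean;
      private double sd;
      private boolean frozen = false;

      /** Collect raw samples during the baseline phase at session start. */
      public void addBaselineSample(double value) {
          if (!frozen) baselineSamples.add(value);
      }

      /** Compute baseline mean and standard deviation once the phase ends. */
      public void freeze() {
          mean = baselineSamples.stream().mapToDouble(Double::doubleValue)
                  .average().orElse(0.0);
          double variance = baselineSamples.stream()
                  .mapToDouble(v -> (v - mean) * (v - mean))
                  .average().orElse(0.0);
          sd = Math.sqrt(variance);
          frozen = true;
      }

      /** Express a later sample as its deviation from baseline in SD units. */
      public double normalize(double value) {
          return sd > 0 ? (value - mean) / sd : 0.0;
      }
  }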

3.2 Evaluate Need for Support (Step 2)

This step determines when to trigger adaptation. In RASMUS, this decision is based on an evaluation of the operator’s performance. Using performance declines to detect a need for support further ensures that the adaptive system does not counteract productive self-regulation mechanisms of the human (cf. Sect. 2.2). Declines in performance clearly indicate that self-regulation has failed and the operator needs support. Diagnosis of performance decrements is based on rules stored and processed in a rule engine. The researcher can define and edit these rules in a self-developed software tool. As an example, the researcher may define that a performance decrement should be detected when a certain task is not completed within a specified time frame (e.g. 60 s).
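
A minimal sketch of such a timeout rule is given below; the class and method names are illustrative assumptions, since the actual rule engine and editing tool are not detailed here.

  /** Hypothetical rule: flag a performance decrement when a task is not
   *  completed within its time limit (e.g. 60 s). */
  public class TimeoutRule {
      private final long timeLimitMillis;

      public TimeoutRule(long timeLimitMillis) {
          this.timeLimitMillis = timeLimitMillis;
      }

      /** @param startedAtMillis time the task became active
       *  @param nowMillis       current time
       *  @param completed       whether the task has been completed
       *  @return true if a performance decrement should be reported */
      public boolean decrementDetected(long startedAtMillis, long nowMillis,
                                       boolean completed) {
          return !completed && (nowMillis - startedAtMillis) > timeLimitMillis;
      }
  }

For example, new TimeoutRule(60_000) would report a decrement for any task still uncompleted 60 s after it became active.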

It must be noted that not every detected decrement in performance triggers adaptation. As stated in Sect. 2.5, unfavorable oscillation effects can occur that may produce rapid adaptation changes in short time intervals. To prevent rapid oscillation, we use a deadband to suppress new adaptation if further performance decrements occur within a given time interval after the previous adaptation.
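
The deadband logic could be sketched as follows; this is a hypothetical illustration, and the interval length is a free parameter to be chosen by the researcher.

  /** Hypothetical deadband: suppress new adaptation triggers that occur
   *  within a fixed interval after the previous adaptation (cf. Sect. 2.5). */
  public class Deadband {
      private final long intervalMillis;
      private long lastAdaptationAtMillis;
      private boolean triggeredBefore = false;

      public Deadband(long intervalMillis) {
          this.intervalMillis = intervalMillis;
      }

      /** Returns true if adaptation may be triggered now and records the
       *  trigger; returns false while still inside the deadband. */
      public synchronized boolean tryTrigger(long nowMillis) {
          if (triggeredBefore && nowMillis - lastAdaptationAtMillis < intervalMillis) {
              return false; // suppress: previous adaptation was too recent
          }
          lastAdaptationAtMillis = nowMillis;
          triggeredBefore = true;
          return true;
      }
  }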

3.3 Analyze Critical User States and Impact Factors (Step 3)

As noted in Sect. 2.1, adaptive systems should address the causes rather than the symptoms of performance decrements and critical user states. When a performance decrement, and thus a need for support, is detected, RASMUS determines potential causes of the performance decrement by evaluating user states and associated contextual indicators. This information can later be used to select an appropriate adaptation strategy (cf. [30]).

The assessment of the indicators used for evaluating critical user states is accomplished in a similar way to the detection of performance decrements. The researcher defines rules for each indicator to detect potentially adverse outcomes. Likewise, rules are defined that link indicators (or combinations of indicators) to potentially critical user states. Both high and low thresholds may be selected to indicate a critical state. For example, a low heart rate (compared to that individual’s baseline) may indicate fatigue, while a large positive deviation from baseline can indicate high workload.

As stated in Sect. 2.4, the sensitivity of indicators can vary unpredictably over time. Hence, whenever possible, the detection of critical user states is based not on a single indicator but on a combination of different indicators. A critical state is detected only when the majority of its indicators support the diagnosis.
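
The following Java sketch illustrates this combination logic, with a directional threshold per indicator and a configurable number of required votes; all type names and the rule representation are illustrative assumptions, not the actual RASMUS rule engine.

  import java.util.List;
  import java.util.Map;

  /** Hypothetical combination rule: a critical state is diagnosed when at
   *  least requiredVotes of its indicators show a critical outcome. */
  public class CriticalStateRule {
      public enum Direction { ABOVE, BELOW }

      /** One directional threshold on an indicator value. */
      public record IndicatorRule(String indicator, Direction direction,
                                  double threshold) {
          boolean isCritical(double value) {
              return direction == Direction.ABOVE ? value > threshold
                                                  : value < threshold;
          }
      }

      private final String stateName;
      private final List<IndicatorRule> rules;
      private final int requiredVotes;

      public CriticalStateRule(String stateName, List<IndicatorRule> rules,
                               int requiredVotes) {
          this.stateName = stateName;
          this.rules = rules;
          this.requiredVotes = requiredVotes;
      }

      /** @param values current indicator values by name (baseline-normalized
       *  where applicable) */
      public boolean diagnose(Map<String, Double> values) {
          long votes = rules.stream()
                  .filter(r -> values.containsKey(r.indicator())
                          && r.isCritical(values.get(r.indicator())))
                  .count();
          return votes >= requiredVotes;
      }

      public String stateName() { return stateName; }
  }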

3.4 Display and Forward Diagnostic Results (Step 4)

RASMUS forwards all diagnostic results to the adaptation management component, where they are processed to select, configure, and execute appropriate adaptation strategies. Diagnostic results are also saved for later offline analyses. Additionally, the RASMUS diagnostics are visualized in real time in a “Performance and User State Monitor” application. This software allows researchers or other observers to monitor diagnoses of performance decrements, critical user states, and all indicators used for user state assessment, which is helpful for observing and demonstrating the mechanisms of RASMUS.

4 Proof-of-Concept Implementation

In a first proof-of-concept implementation, the generic diagnostic framework detailed in Sect. 3 was applied and tailored to the specific requirements of a naval Anti-Air Warfare (AAW) task paradigm. For this purpose, it was necessary to determine appropriate indicators for performance and user state assessment and to specify the rule base for critical outcomes.

4.1 Task Environment

RASMUS diagnostics were implemented as a Java-based research testbed and connected to an existing AAW simulation. Figure 3 shows the research testbed with the sensors currently utilized for user state assessment: an SMI REDn eye tracker underneath the monitor, the Zephyr BioHarness3 multisensor chest strap on the left, and a webcam positioned on top of the monitor. The monitor shows the user interface of the AAW simulation. The Tactical Display Area (TDA) located in the center displays virtual contacts in the surroundings of the ownship.

Fig. 3. Research testbed with user monitor and sensors

The simulation includes four simplified AAW tasks (cf. Table 3 for task descriptions). These tasks occur at scripted times throughout the scenario and may also occur simultaneously; in that case, the task with the highest priority must be performed first. Each task is associated with a time limit for task completion. Time limits were assigned based on the outcomes of an earlier study that employed the same tasks and simulation software [31]. If a task is not completed within the time limit or if task completion is incorrect, RASMUS detects a performance decrement. Table 3 provides the priority (with 500 being highest and 100 lowest) and the time limit of each task.

Table 3. Task descriptions and task properties

4.2 User States and Assessment Criteria

The proof-of-concept implementation focused on three of the six user state dimensions introduced in Sect. 1. These three dimensions, namely workload, attention, and fatigue, were regarded by domain experts as particularly relevant for air surveillance tasks. RASMUS was implemented to provide diagnoses for the potentially adverse states of high workload, passive task-related fatigue, and incorrect attentional focus, all detailed below.

High Workload.

The state of high workload can arise if task processing is highly demanding. According to Neerincx [32], task demand or task load can be modulated in experimental conditions by changing the task complexity, task volume, and number of task-set switches. We therefore designed a scenario in which high workload states are provoked by increasing the number of different tasks that have to be performed simultaneously. High workload is assessed through a combination of different indicators: heart rate variability, respiration rate, pupil dilation as physiological measures, the number of mouse clicks as a behavioral measure, and the number of tasks as an environmental factor. These indicators were chosen as they had proven sensitive for the assessment of high workload in previous studies. High workload is detected if at least three of these five indicators show critical outcomes.
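
Using the hypothetical combination rule sketched in Sect. 3.3, the 3-of-5 criterion for high workload could be configured roughly as follows; the directions and threshold values shown are placeholders for illustration, not the values from our rule base (cf. Table 4).

  import java.util.List;

  /** Hypothetical configuration of the high workload rule (3 of 5). */
  public class WorkloadRuleConfig {
      public static CriticalStateRule highWorkload() {
          // Placeholder thresholds: physiological indicators in SD units
          // relative to baseline, behavioral/environmental ones as raw counts.
          return new CriticalStateRule("high workload", List.of(
              new CriticalStateRule.IndicatorRule("heartRateVariability",
                      CriticalStateRule.Direction.BELOW, -1.0),
              new CriticalStateRule.IndicatorRule("respirationRate",
                      CriticalStateRule.Direction.ABOVE, 1.0),
              new CriticalStateRule.IndicatorRule("pupilDilation",
                      CriticalStateRule.Direction.ABOVE, 1.0),
              new CriticalStateRule.IndicatorRule("mouseClickFrequency",
                      CriticalStateRule.Direction.ABOVE, 30.0),
              new CriticalStateRule.IndicatorRule("numberOfTasks",
                      CriticalStateRule.Direction.ABOVE, 3.0)
          ), 3); // at least 3 of the 5 indicators must be critical
      }
  }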

Passive Task-Related Fatigue.

Following May and Baldwin [33], we distinguish between sleep-related (SR) and task-related (TR) fatigue. SR fatigue is influenced by sleep deprivation and the circadian rhythm (time of day), while TR fatigue is induced by task properties and time-on-task. TR fatigue can be further subdivided into a passive and an active form (cf. [33]). We focus on the passive form, which is induced by monotonous tasks with a low level of cognitive demand. Passive TR fatigue is associated with a low level of arousal and thus represents the opposite problem state to high workload. We therefore used the same indicators for the assessment of passive TR fatigue as for high workload, but with opposite criteria (cf. Table 4).

Table 4. Indicators and rule base for problem states

Incorrect Attentional Focus.

This state is closely related to Wickens’ concept of “attentional tunneling”, defined as “allocation of attention to a particular channel of information, diagnostic hypothesis or task goal, for a duration that is longer than optimal, given the expected cost of neglecting events on other channels, failing to consider other hypotheses, or failing to perform other tasks” ([13], p. 1). Our focus is on correctly prioritized task processing. Hence, RASMUS detects incorrect attentional focus if a higher-priority task is neglected because the user is processing a lower-priority task, or if the user misses a task while no alternative task is being processed. We included the latter rule because we observed that monitoring contacts on the TDA also requires attention, even though it is not associated with processing a specific task.
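
A compact sketch of this priority check is given below; the types are illustrative assumptions, and in RASMUS the check would be evaluated in the context of a detected performance decrement.

  import java.util.Comparator;
  import java.util.List;
  import java.util.Optional;

  /** Hypothetical check for incorrect attentional focus (cf. Sect. 4.2). */
  public class AttentionalFocusRule {
      /** Priority ranges from 100 (lowest) to 500 (highest), cf. Table 3. */
      public record Task(String id, int priority) {}

      /** @param currentTask  task the user is processing, if any
       *  @param pendingTasks other tasks currently awaiting processing */
      public boolean incorrectFocus(Optional<Task> currentTask,
                                    List<Task> pendingTasks) {
          Optional<Task> highest = pendingTasks.stream()
                  .max(Comparator.comparingInt(Task::priority));
          if (highest.isEmpty()) {
              return false; // no neglected task, nothing to flag
          }
          if (currentTask.isEmpty()) {
              return true;  // a task is missed although no other task is processed
          }
          return currentTask.get().priority() < highest.get().priority();
      }
  }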

4.3 Rule Base

To account for individual differences in physiological reactions (cf. Sect. 3.1), critical outcomes of physiological measures are detected by analyzing the deviation of current recordings from the baseline. We use the standard deviation (SD) as the criterion for a critical deviation. As physiological measures fluctuate over short time intervals (cf. Sect. 2.5), RASMUS calculates moving averages over a time window of 30 s to smooth the data. A physiological indicator is labeled as critically high or low if its current mean deviates by more than 1 SD from the baseline mean.
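
A minimal sketch of this smoothing and threshold check, assuming time-stamped samples, might look as follows (illustrative code, not the actual implementation):

  import java.util.ArrayDeque;
  import java.util.Deque;

  /** Hypothetical sliding-window smoother: moving average over a 30 s
   *  window, compared against the baseline mean +/- 1 SD. */
  public class MovingAverageCriterion {
      private record Sample(long timestampMillis, double value) {}

      private final Deque<Sample> window = new ArrayDeque<>();
      private final long windowMillis;

      public MovingAverageCriterion(long windowMillis) {
          this.windowMillis = windowMillis; // e.g. 30_000 for a 30 s window
      }

      /** Add a new sample and drop samples older than the window. */
      public void add(long timestampMillis, double value) {
          window.addLast(new Sample(timestampMillis, value));
          while (!window.isEmpty()
                  && timestampMillis - window.peekFirst().timestampMillis()
                     > windowMillis) {
              window.removeFirst();
          }
      }

      /** @return +1 if critically high, -1 if critically low, 0 otherwise */
      public int evaluate(double baselineMean, double baselineSd) {
          double mean = window.stream().mapToDouble(Sample::value)
                  .average().orElse(Double.NaN);
          if (Double.isNaN(mean) || baselineSd <= 0) return 0;
          if (mean > baselineMean + baselineSd) return 1;
          if (mean < baselineMean - baselineSd) return -1;
          return 0;
      }
  }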

For the indicators “number of tasks” and “frequency of mouse clicks”, outcomes are labeled as critically high or low if the number of tasks or the click frequency during the current time interval exceeds or falls below a threshold value derived from previous observations. Incorrect attentional focus is detected if the task currently being processed is not the highest-priority task. Indicators and the corresponding rules for critical outcomes are summarized in Table 4.

4.4 Validation

We recently conducted a validation experiment with 12 participants to examine the validity of the RASMUS diagnostics for the detection of high workload, passive TR fatigue, and incorrect attentional focus. Participants performed the AAW tasks described in Sect. 4.1 in a scenario-based simulation. The scenario was designed to provoke states of high workload, passive TR fatigue, and incorrect attentional focus. Whenever RASMUS detected a performance decrement, the scenario was paused and the participant was asked to rate his or her current user state with respect to the six state dimensions introduced in Sect. 1. We then compared RASMUS’ user state diagnoses at the time of the performance decrements with the outcomes of the user ratings. Analyses revealed that most performance decrements were associated with at least one user state evaluated as potentially critical. For those states with critical outcomes, the corresponding user ratings mostly showed consistent deviations from the baseline. These preliminary results thus support the validity of the RASMUS diagnostics. More detailed results from this experiment will be reported in a future publication.

5 Conclusions and Future Work

With our concept of a real-time assessment of multidimensional user state (RASMUS), we address some major challenges that have been identified for real-world applications of adaptive system design. For example, RASMUS considers the self-adaptation of the human, as it detects a need for support only when performance has declined and self-adaptation has thus failed to maintain the operator’s effectiveness. Also, user state assessment in RASMUS is based on the combination of different kinds of measures to provide more robust and valid diagnoses. By assessing several potentially problematic user states and associated contextual indicators, RASMUS enables dynamic adaptive systems not only to determine when the user needs support but also to infer what kind of support is most appropriate to restore the user’s effectiveness. To that end, the RASMUS diagnostics have already been combined with a dynamic adaptation management component to accomplish near real-time selection and configuration of adaptation strategies (cf. [30]).

With our proof-of-concept implementation of RASMUS we also demonstrated the feasibility of applying our generic concept to the domain of naval Anti-Air Warfare. Initial results of a validation experiment support the validity of the RASMUS diagnostics in the event of a performance decrement within this task environment. As the indicators and rule base are variable entities in our framework, they may be modified to further improve the diagnostic capabilities or to tailor our dynamic adaptation framework to other application areas in which human-machine systems act in safety-critical task environments. Visualization of real-time diagnostics on our “Performance and User State Monitor” may also be beneficial for adaptive training, e.g. to verify and monitor the deliberate induction of adverse user states in order to develop coping strategies.

The current implementation is limited in that it only provides diagnoses for three particularly relevant problem states out of the six dimensions covered by our generic concept of multidimensional user state assessment. We plan to expand the rule base to cover additional user states in the near future. It is also important to note that the purpose of RASMUS’ user state assessment is to identify factors that likely contributed to an observed decline in performance. A critical state diagnosis therefore indicates that the respective state deviates considerably from the optimal condition; by itself, this does not necessarily imply that the user is on the verge of a breakdown and requires assistance. Consequently, the RASMUS user state assessment cannot be used for proactive prevention of critical user states and performance decrements. However, while proactive adaptation may appear superior to our post-hoc approach, intervening too early may provoke conflicts with the user’s self-adaptation mechanisms and foster complacency [34]. We therefore believe that combining reasonable performance thresholds as a trigger for adaptation with user state assessment for root-cause analysis is an effective way to enable dynamic, context-sensitive adaptation.