
1 Introduction

Adaptive Instructional Systems (AIS) gather diagnostic information about learner characteristics and provide an individually tailored response. By adapting to temporally changing, individual learning abilities and learner needs, micro-adaptive instructional systems aim to dynamically support the individual learner in achieving his or her learning goals [1]. On-task measures of learner performance as well as the learner’s current mental state are essential for providing this tailored feedback during the learning process. Several approaches in AIS consider the motivational state of the user (e.g. [2,3,4]), while in the operational context adaptive systems are often based on workload assessments (e.g. [5, 6]). In contrast to these single-state analyses, Schwarz, Fuchs and Flemisch [7] proposed a multidimensional assessment of user state as a more holistic approach to user state analysis in adaptive system design. Building on this approach, Schwarz and Fuchs developed RASMUS (‘Real-Time Assessment of Multidimensional User State’), the diagnostic component of a dynamic adaptation framework [8, 9]. As the conceptual framework is generic, it can be applied to various operational and instructional settings. The proof-of-concept implementation focused on a naval air surveillance task, providing on-task information about three potentially critical user states: high workload, passive task-related fatigue, and incorrect attentional focus [8].

Diagnostic outcomes for these three user states were validated in a prior experimental study [10]. However, as RASMUS diagnostics are currently based on self-determined rules, we assume that modifying these rules might increase diagnostic accuracy. Hence, the aim of this paper is to investigate an approach for evaluating and optimizing the existing diagnostic rules of RASMUS using the data of the prior validation study. To determine optimized rules, receiver operating characteristic (ROC) curve analyses were performed. ROC graphs can be used to assess the diagnostic accuracy of a test and are commonly used in medical research [11]. Using the data set of the prior validation study, we performed ROC curve analyses to optimize the rules for the physiological parameters that serve as indicators of high mental workload. The diagnostic accuracy of the initial and the modified rules was then evaluated in a repetition of the validation study.

The next section summarizes the main aspects of the conceptual framework and the workload assessment within RASMUS. Section 3 details the results of the ROC curve analyses that were performed to define optimized rules. Subsequently, Sects. 4 and 5 describe the methods and results of the newly conducted validation experiment. The paper concludes with a discussion of the results (Sect. 6) and a summary of conclusions and lessons learned (Sect. 7).

2 Conceptual Framework and Workload Assessment

RASMUS is the diagnostic component of an adaptation framework that detects performance decrements of the user and analyzes which user states show critical outcomes that may have caused the performance decrement [8]. The adaptation management component of this framework, named ADAM, then selects an adaptation strategy that is most appropriate to mitigate the detected critical state, and thus, to restore the user’s effectiveness [9].

RASMUS diagnostics are based on a multidimensional view of user state. This approach considers up to six user state dimensions (mental workload, fatigue, attention, situation awareness, motivation, and emotional state) that have been found to have a great influence on human performance [7]. Currently, the proof-of-concept implementation provides assessments of high workload, passive task-related fatigue, and incorrect attentional focus, as these states can be considered particularly relevant for the chosen task domain of naval air surveillance.

As the scope of this paper is to evaluate and optimize the diagnostic rules for mental workload, the following sections detail the workload assessment in RASMUS and the validation method.

2.1 High Mental Workload

RASMUS combines five parameters to assess mental workload: number of tasks, number of mouse clicks, pupil diameter, heart rate variability (HRV), and respiration rate. A state of high mental workload is assessed if at least three of these parameters show outcomes that indicate high workload. The classification of parameter outcomes is based on self-determined rules derived from literature findings. The following paragraphs briefly describe the three physiological parameters chosen for the workload assessment, their assumed relation to mental workload, and the rules that were defined for critical outcomes.

Pupil diameter has been found to increase during high workload tasks (e.g. [12,13,14]). Coyne and Sibley [15] showed that pupil diameter can be reliably measured with eye tracking systems, which are non-invasive and easy to set up. HRV responds to mental stress or workload, with studies suggesting a decrease in HRV during mentally demanding tasks (e.g. [14, 16,17,18,19]). Lastly, there is empirical evidence that breathing rates increase during mentally demanding tasks (e.g. [16, 19, 20]). Both HRV and respiration rate can be measured with rather non-invasive devices, such as chest-worn wearable sensors.

The physiological parameters were recorded continuously and evaluated based on moving mean windows of 30 s that were compared to an individual baseline value. Baseline values were calculated for each participant and each parameter at the beginning of the experiment by taking the average of the data collected in a period of 120 s. During this baseline measurement, mental workload was kept low to moderate. In the initial rule set, RASMUS labels each physiological parameter as critically high or low if the current mean deviates by more than 1 standard deviation (SD) from the baseline mean. For pupil diameter and respiration rate high workload is indicated by positive deviations > 1 SD from the baseline mean, and for HRV it is indicated by negative deviations > 1 SD from the baseline mean.
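The rule-based classification described above can be sketched as follows. This is a minimal illustration under the stated assumptions (30 s moving-mean windows, 120 s baseline, 1 SD deviation rule, three-of-five vote), not the actual RASMUS implementation; all function names are ours, and sensor I/O and window bookkeeping are omitted.

```python
import numpy as np

def baseline_stats(samples):
    """Mean and SD of the 120 s baseline recording for one parameter."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(), samples.std(ddof=1)

def is_critical(window, base_mean, base_sd, direction, threshold_sd=1.0):
    """Compare the mean of a 30 s moving window to the individual baseline.

    direction=+1 flags positive deviations (pupil diameter, respiration
    rate); direction=-1 flags negative deviations (HRV).
    """
    deviation = (np.mean(window) - base_mean) / base_sd
    return direction * deviation > threshold_sd

def high_workload(parameter_flags):
    """High workload is assessed if at least three of the five parameter
    outcomes are critical."""
    return sum(parameter_flags) >= 3
```

With this structure, the rule modifications examined in Sect. 3 amount to changing `threshold_sd` per parameter while leaving the voting scheme untouched.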

2.2 Perceived Mental Workload

Perceived mental workload was used as a comparative measure for validating and optimizing the mental workload diagnosis of RASMUS. There are various questionnaires for assessing subjective mental workload. The National Aeronautics and Space Administration Task Load Index (NASA-TLX [21]) is one of the superior scales with respect to sensitivity and user acceptance [22]. In both the prior and the repeated experimental study, the workload rating was performed using the NASA-TLX subscale of mental effort. Ratings were obtained each time RASMUS detected a performance decrement (the scenario was paused at that time to ensure the rating did not affect the user’s task completion). The rating was performed on a 15-point scale proposed by Heller [23] that is divided into five subsections: very low (1–3), low (4–6), medium (7–9), high (10–12), very high (13–15).

2.3 Task

The generic diagnostic tool RASMUS has been integrated into a naval anti-air warfare (AAW) simulation [24]. In this simulation, operators completed four different simplified subtasks: identifying contacts, creating new contacts, warning contacts, and engaging contacts. Figure 1 shows the tactical display area (TDA) of the simulation. The blue dot in the center of the map represents the own ship. Identified radar contacts are visualized in green (neutral), blue (friendly), or red (hostile). New, unidentified contacts (yellow) have to be identified as neutral, friendly, or hostile according to certain criteria. If hostile contacts enter the blue or the red circle around the own ship (see Fig. 1), they have to be warned or engaged, respectively.

Fig. 1.

Screenshot of the tactical display area of the naval air-surveillance simulation (Color figure online)

The tasks occur at scripted times during the scenario. If tasks have to be performed simultaneously, users are instructed to process them in order of priority. Each task has to be finished within a specified time limit (cf. [10]). If the time limit is exceeded or the task is not completed correctly, RASMUS logs a performance decrement.

3 Definition of New Rules

ROC graphs or curves quantify the accuracy of a binary diagnostic test or classifier and are created by plotting sensitivity (true positive rate) against 1 − specificity (false positive rate). The measure commonly used in this context is the area under the ROC curve (AUC; cf. [11, 25]). Performing a ROC analysis requires information about the true state. However, user states such as mental workload are latent constructs that cannot be measured directly. For this reason, we used the NASA-TLX subjective mental effort rating (cf. Sect. 2.2) as an approximation of the true user state within the ROC curve analysis. Subjective rating outcomes were dichotomized in order to discriminate between critical and noncritical mental workload states. The cut-off value was set based on the subsections of the questionnaire (cf. Sect. 2.2): any rating above 9 on the 15-point scale was considered a high (critical) workload state, whereas a rating of 9 or smaller was not.
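The dichotomization and the AUC computation can be illustrated with a short sketch. The data and function names are hypothetical; the AUC is computed via its rank interpretation, which is mathematically equivalent to the area under the ROC curve.

```python
import numpy as np

def dichotomize(ratings, cutoff=9):
    """Ratings above 9 on the 15-point scale count as critical workload."""
    return np.asarray(ratings) > cutoff

def auc_from_scores(scores, truth):
    """AUC via its rank interpretation: the probability that a randomly
    chosen critical case scores higher than a noncritical one (ties
    counted as 0.5)."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    pos, neg = scores[truth], scores[~truth]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of .5 corresponds to guessing, and 1.0 to perfect discrimination, which is the scale against which the rule sets are compared below.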

First, ROC curves were calculated using the threshold value initially set for each parameter for discriminating between critical and noncritical outcomes with respect to mental workload. We then systematically varied the threshold values in order to determine the value that maximizes the AUC.
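This threshold sweep can be sketched as follows, with invented example data. Note that for a single binary rule the ROC "curve" reduces to one operating point, so its AUC equals the mean of sensitivity and specificity.

```python
import numpy as np

# Hypothetical per-event data: deviation of pupil diameter from the
# individual baseline (in SD units) and the dichotomized subjective rating.
deviations = np.array([0.2, 0.9, 0.4, 1.4, 0.6, 1.8, 0.3, 1.1])
critical = np.array([False, True, False, True, False, True, False, True])

def binary_auc(threshold):
    """AUC of the binary rule 'deviation > threshold'; for a single-point
    ROC this equals (sensitivity + specificity) / 2."""
    flagged = deviations > threshold
    sensitivity = (flagged & critical).sum() / critical.sum()
    specificity = (~flagged & ~critical).sum() / (~critical).sum()
    return (sensitivity + specificity) / 2

# Sweep candidate thresholds (in SD units) and keep the AUC-maximizing one.
candidates = np.arange(0.25, 2.25, 0.25)
best = max(candidates, key=binary_auc)
```

With this toy data the sweep selects a threshold of 0.75 SD; in the study, the analogous procedure yielded the modified thresholds reported below.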

The ROC curve analysis resulted in modified rules for pupil diameter (>.5 SD instead of >1 SD positive deviation from baseline) as well as for HRV (>2 SD instead of >1 SD negative deviation from baseline). For respiration rate, the analysis did not reveal any improvement from changing the rule; therefore, the existing rule (>1 SD positive deviation from baseline) was not modified. Table 1 summarizes the resulting AUC, sensitivity, and specificity for the initial as well as the modified rules for each parameter. AUC values range between .6 and .7 for the modified rules, which can be considered a sufficient outcome [25].

Table 1. Comparison of initial and modified rules for the physiological parameters after performing individual ROC curve analyses

As a next step, we analyzed to what extent this modification of the rules for the physiological parameters affects the accuracy of the overall workload assessment in RASMUS. Figure 2 shows the mean deviation (MD) of the subjective rating from the baseline for critical and noncritical system diagnoses with respect to the modified as well as the initial rule set. For the initial rule set, subjects rated their perceived workload significantly higher when the system diagnosis was critical than when it was noncritical (t(74) = 3.301; p < .01). The same outcome was observed for the modified rule set (t(74) = 3.882; p < .001). However, the results suggest a slightly better distinction between critical and noncritical system diagnoses for the modified rule set based on the subjective rating.

Fig. 2.

Mean perceived workload ratings (with SE as error bars) for critical and noncritical system diagnoses by RASMUS for the initial set of rules (a) and the modified set of rules (b) applied to the data set of the prior validation study [10]

The overall ROC curve for the diagnosis of high mental workload (see Fig. 3) also indicates a slightly higher AUC for the modified set of rules (AUCmodified = .780; p < .001) than for the initial set of rules (AUCinitial = .730; p < .01). Exceeding a value of .7, both diagnostic rule sets can be considered good diagnostic tests [25].

Fig. 3.

Initial and modified rule sets applied to the data set of the prior validation experiment [10] on which the optimization was based.

4 Repetition of Validation Study

A repetition of the initial validation study was conducted to investigate whether the outcomes obtained from the ROC curve analysis can be replicated, and thus whether they are temporally stable.

4.1 Methodological Design

Fifteen subjects (8 male, 7 female) aged between 20 and 51 years (M = 31.26 ± 8.27) participated in the experiment. A multisensory chest strap (Zephyr BioHarness3) was used to collect data on HRV and respiration rate. Pupil diameter was recorded with an eye tracker (Tobii X3-120) placed underneath the monitor. The setup is depicted in Fig. 4.

Fig. 4.

Experimental setup. Multisensory chest strap (front left), eye tracking device attached to the monitor underneath the screen

After reading the instructions, participants completed a ten-minute training scenario, during which the examiner explained the task completion for every subtask (cf. Sect. 2.3). Subsequently, participants performed the tasks in an experimental test scenario with a net duration of 45 min. The scenario was divided into three successive phases, merging into each other without breaks (see Fig. 5). The scenario paused whenever a performance decrement was detected; users then rated their current perceived mental workload. Thus, the actual duration of the experiment depended on the user’s performance. Perceived mental workload was also recorded at the end of the training phase as well as at the end of the experiment in order to obtain an individual baseline of the subjective rating.

Fig. 5.

Sequence of the different phases and their durations (cf. [10])

4.2 Hypotheses

Two hypotheses were tested in this experiment (see below). The first hypothesis addresses the question whether the outcomes of the first validation experiment can be replicated. With the second hypothesis we aim to assess whether the modified rule set shows a higher diagnostic accuracy than the initial rule set.

  • H1: Perceived mental workload is rated higher for performance decrements with critical system diagnoses than for noncritical system diagnoses
    (a) using the initial rule set,
    (b) using the modified rule set.

  • H2: In comparison to the initial rule set, the diagnostic accuracy is increased by the modified rule set.

4.3 Data Analysis

The psychophysiological and behavioral data were logged to text and CSV files for each participant. Data preparation included allocating the subjective ratings to the corresponding diagnostic outcomes of RASMUS. Hypothesis 1 was tested by comparing the mean deviation of the subjective rating from baseline for high and non-high workload outcomes of RASMUS, using the initial rule set to test H1a and the modified rule set to test H1b. Concerning Hypothesis 2, diagnostic accuracy was assessed for the modified and initial rule sets by performing ROC curve analyses with the dichotomized subjective rating as the “true” user state. The data analysis was conducted with SPSS (version 25.0).

5 Results

5.1 Descriptive Analysis

A total of 79 performance decrements occurred across all subjects. As expected, most of the performance decrements occurred in the high workload phase (see Table 2). During the monotony phase, only 12 performance decrements were observed. Two performance decrements were recorded during the second half of the baseline phase. The number of performance decrements in each phase is very similar or identical to the first validation experiment [10] (see numbers in brackets in Table 2). However, only slightly more than 25% of the subjects showed performance decrements during the monotony phase in the second experiment, whereas almost 60% of the subjects were affected in the preceding experiment.

Table 2. Number of performance drops and subjects affected per phase. Numbers in brackets refer to the first validation experiment [10].

5.2 Hypothesis Testing

A non-parametric Mann-Whitney U test was conducted to test H1 due to the violation of the assumption of normality for parts of the data set. Figures 6a and 6b show the subjective ratings for critical and noncritical system diagnoses for the initial and the modified set of rules, respectively. The analysis confirmed that perceived workload was rated significantly higher by the subjects for critical states of workload than for noncritical states of workload diagnosed by RASMUS using the initial set of rules (z = −2.64; p < .01). However, subjective ratings differed less between critical and noncritical states of workload when using the modified set of rules (see Fig. 6b). The statistical analysis revealed the difference to be nonsignificant (z = −1.3; ns). Therefore, H1 can be confirmed for the initial rule set (H1a) but not for the modified one (H1b).
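A Mann-Whitney U test of this kind can be reproduced with SciPy; the study itself used SPSS, and the rating values below are invented for illustration only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical baseline deviations of the perceived-workload rating,
# grouped by the RASMUS diagnosis at each performance decrement.
critical_ratings = np.array([4.0, 3.5, 5.0, 2.5, 4.5, 3.0])
noncritical_ratings = np.array([1.0, 2.0, 0.5, 1.5, 2.5, 1.0])

# One-sided alternative: ratings are expected to be higher when RASMUS
# diagnosed a critical workload state.
stat, p = mannwhitneyu(critical_ratings, noncritical_ratings,
                       alternative="greater")
```

The one-sided `alternative="greater"` matches the directional form of H1; a two-sided test would be the more conservative choice if no direction were predicted.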

Fig. 6.

Mean perceived workload ratings (SE) for critical and noncritical system diagnoses by RASMUS for the initial (a) and the modified set of rules (b) applied to the new data set

With respect to H2, we evaluated whether the modified set of rules led to an improved accuracy of the workload diagnosis. In the first experiment, the overall ROC curve indicated a better discrimination between critical and noncritical states for the modified rule set compared to the initial rule set (see Fig. 3). Figure 7 shows the resulting ROC curves when applying both rule sets to the data set of the second experiment.

Fig. 7.

ROC curves for initial and modified set of rules applied to the new data set

The analysis showed that the modified rule set was less accurate than the initial rule set when applied to the new data set. The diagnostic accuracy of the initial rule set significantly differs from .5 at an AUC of .645 (p < .05; sensitivity = .643; specificity = .647) whereas for the modified rule set it does not (AUC = .588; p = .198; sensitivity = .607; specificity = .569). Consequently, the hypothesis that the diagnostic accuracy is higher with the modified rule set (H2) cannot be accepted.

6 Discussion

The results of the second validation study confirmed the temporal stability of the diagnostic outcomes when using the initial rule set. Surprisingly, the initial rules also showed a better overall diagnostic performance than the modified rules determined by the ROC curve analysis. Hence, the outcomes of the second study indicate that the initial rule set is likely to achieve a more consistent distinction between critical and noncritical subjective workload states than the modified rule set. The results imply that the modified diagnostic rules were overfitted to the data set the optimization was based on, and are thus not transferable to a different data set.

It should be noted that, as part of a post hoc analysis not detailed in this paper, we also performed ROC curve analyses for each individual physiological parameter. The results likewise indicate that the initial rules for the physiological parameters provide better diagnostic accuracy than the individual modified rules when applied to the new data set. However, a surprising result was found for heart rate variability: the ROC curve analysis revealed that the AUC was below .5 for both the modified and the initial rules. This means the diagnosis for this data set is less accurate than guessing (e.g. [25]) and suggests that HRV behaves in the opposite way to what literature findings indicate. This could have various causes, e.g. sensor-related measurement errors or inadequate sensor placement. However, further post hoc analyses revealed that, in line with expectations, HRV correlates negatively with the subjective (non-dichotomized) workload rating, even though the correlation is rather weak (r = −.27). Hence, the unexpected AUC outcome may result from the dichotomization of the subjective rating (critical states: ratings > 9) that was necessary for performing the ROC curve analyses.

This contradictory finding on HRV illustrates a general challenge of validating and optimizing rules for user state classification. In contrast to, e.g., medical diagnoses, it is hard to obtain an appropriate reference measure that reliably differentiates between true and false critical user states. In our analysis, we used the subjective rating as an estimation of true workload. However, the subjective workload rating is also error-prone, e.g. affected by response bias, and it has to be artificially dichotomized for performing ROC curve analyses. This means that the cut-off value chosen for discriminating between true and false high workload states also impacts the analysis outcomes.

Nevertheless, the diagnostic accuracy of the initial rule set could be confirmed by the second validation study, indicating that RASMUS can reliably differentiate between potentially high and non-high workload states. The fact that HRV was not found to be a reliable indicator of workload in the second study emphasizes the necessity to combine several indicators in order to provide a more robust diagnostic result.

7 Conclusion and Lessons Learned

ROC curve analysis is a common method for evaluating diagnostic tests in medical research. In this paper, we investigated whether this approach can be used to evaluate and optimize diagnostic rules for physiological user state assessment in adaptive systems. Considering the results of our study, we suggest that ROC curve analysis may be useful for evaluating and comparing the diagnostic accuracy of different workload indicators. However, the results of this study could not show that this method is appropriate for defining and optimizing rules for single user state indicators. As these outcomes also depend on the validity of the subjective rating and on the dichotomization used to obtain a “true” high workload state, future studies could examine whether cut-off values other than > 9 for the subjective rating are more appropriate for classifying a high workload state. Other methods for optimizing and validating the user state indicators could also be investigated.

Another option for optimizing diagnostic outcomes is to apply methods of artificial intelligence, such as artificial neural networks, as proposed e.g. by Wilson and Russell [26]. However, such systems are often considered “black boxes”, as the algorithm that produces the diagnostic outcomes is often too complex to be understood [27]. The rule-based approach has the advantage of providing more transparency.

Considering RASMUS’ application within an adaptation framework, the results indicate that the current workload assessments of RASMUS are sufficiently accurate and reliable to support a proper selection and configuration of adaptation strategies for this task domain. Nevertheless, we identified further options for improving RASMUS diagnostics: two of the five parameters currently used for the workload assessment (heart rate variability and respiration rate) are retrieved from the same sensor (BioHarness). Hence, whenever there is a problem with this sensor, both indicators become unreliable, and the robustness of the diagnostic outcomes decreases.

Adding one or more independent parameters to the diagnosis could mitigate this problem. Hernandez et al. [28], for example, investigated the possibility of using a pressure-sensitive keyboard and a capacitive mouse for stress detection. They found increased typing pressure in more than 79% of the participants, as well as increased surface contact with the mouse in 75% of the participants, during stressful tasks [28]. Another possible measure for workload and stress detection could be the inclination of the trunk (e.g. [29]), which can actually be retrieved from the BioHarness sensor; alternatively, a separate pressure-sensing mat placed on the seat of the operator could be used.

One last note: The diagnostic framework of RASMUS has been applied to a naval air surveillance task. Hence, the indicators currently used for user state assessment in RASMUS were specifically selected for this task. In the context of AIS these assessments might also prove useful for determining mental states of the learner in order to provide adequate feedback and support. However, this has to be investigated in more detail in future experimental studies.