1 Introduction

1.1 Problem Summary

To support a complex array of current and envisioned missions, the U.S. Air Force trains and educates a large uniformed workforce across diverse Air Force Specialty Codes (AFSCs). Early-stage training in many AFSCs presents Airmen and officer trainees with large corpora of foundational knowledge to master, delivered through a variety of modalities. This content is taught at technical training or flight schools in courses lasting up to several months. Because motivation is a necessary element for learning retention and force readiness, Air Force education and training stakeholders continue to look for ways to engage their constituencies, through gamified interactive learning, simulation, and the use of mixed-reality immersive environments.

To maintain an engaged, motivated cadre of Airmen, OMEGA, via a service-oriented, general-purpose appliance operating in concert with the learning environment, will recommend interventions when end-users are showing signs of a lapse in engagement. These services will be accessible to any AF instructional system, which can automatically or through a human instructor locally effect OMEGA’s recommendations. OMEGA will enable AF and DoD learning environments to identify and adapt to detected lapses in engagement, promoting greater motivation, retention, and readiness.

1.2 Project Context: Pilot Training Next

The test case for OMEGA, selected by the Air Force, is Pilot Training Next (PTN). PTN, now training its third iteration of pilots, aims to reduce the time and cost of undergraduate pilot training (UPT) by leveraging virtual and augmented reality, biometrics, AI, and data analytics. While PTN student pilots fly the same hours and sorties in the T-6 as do their legacy pilot training counterparts, the ground-based training is done using an immersive PC-based flight simulator (Lockheed Martin’s Prepar3D®), VR headset (HTC’s VIVE™ Pro), stick, throttle and rudder pedals, and a syllabus of PTN-specific scenarios (Fig. 1).

Fig. 1. PTN station: VR headset, controls, displays. (U.S. Air Force photo by Sean M. Worrell)

PTN has maintained a heavy emphasis on data collection and analysis. Reliable and predictive metrics are needed not only to assure instructors and higher command that student pilots are achieving the same skills as their legacy training counterparts, but also because PTN students qualify for graduation on the basis of achievements rather than on calendar time. This fundamental change in how students progress requires an abundance of data that supplement instructor evaluations to ensure skill mastery.

During scenarios flown in the simulator, objective data is readily captured for every time interval, such as aircraft state (position, attitude, airspeed) and configuration (aileron, rudder and elevator deflections; flap and gear positions). Instructors can also monitor how the scenario is progressing overall and can provide verbal feedback in real-time.

Some important metrics, though, are less directly observable or measurable. Student engagement is widely accepted as a critical mediating factor in both learning retention [1] and learning outcomes [2]. The importance of engagement is not lost on instructors (though labels like attention and focus are more common in pilot training), and is often the basis for, or at least an element of, scoring situational awareness (SA).

Two additional factors make engagement even more salient for PTN. First, instructors have less visibility into student engagement (due to the VR headset) than in conventional simulators. Second, a vision for PTN is for one instructor to be monitoring multiple students simultaneously. Indirectly-observable measures such as engagement will thus require some level of automation support to cue instructors when lapses are detected.

2 Modelling Engagement

2.1 Conceptual Model of Engagement

In training sorties flown in the aircraft or in conventional simulators, an instructor pilot (IP) monitors the student pilot’s performance. Maneuvers are evaluated by observing air speed, vertical speed, attitude, angle of attack, and so on. Instructors are also interested in situational awareness (SA), which they can assess by observing the student’s ability to “stay ahead of the airplane”: anticipating upcoming changes in heading, airspeed, or altitude; applying smooth control inputs to adjust bank angle, pitch and power; and maintaining proper scan of the flight instruments.

Several theoretical frameworks for characterizing engagement informed our model of engagement. We created an inventory of nine relevant engagement and disengagement models from the literature that emphasize behavioral indicators (e.g., data from log files or from direct queries to the user) [3]. These included Intrinsic vs. Extrinsic [4]; Two Factor Hygiene-Motivator Theory [5]; Motivators from Maslow’s Hierarchy (Ibid); Achievement Goal Theory [6]; D’Mello & Graesser’s Engagement model [7]; and Baker’s indicators of passive vs. active disengagement [8]. From this we synthesized a multi-timescale engagement and motivation model [9].

More recently, we refined the model to reflect the aviation focus of this project, preferring metrics associated with event response tasks (e.g., maneuvering to avoid a new hazard) and monitoring tasks (e.g., maintaining straight and level attitude). We incorporated significant research conducted to identify indicators of distraction and disengagement for accidents attributed to loss of control and airplane state awareness [10, 11]. A subset of these states is relevant to flight tasks performed in simulated environments:

  • Attention vs. Distraction: Situational awareness was particularly reduced by the induction of diverted attention. Channelized attention or attentional tunneling also indicated loss of situational awareness [10];

  • Boredom and Distraction: Distraction can be characterized by any time without interaction with the system; engaged pilots interacted with the system to optimize performance, even when this was not required to meet performance requirements [12];

  • Attentional Tunneling: Attentional tunneling is indicated by lack of interaction with one or more system elements, coupled with strong interaction with another element [11];

  • Vigilance: Lapses in vigilance, such as distraction and daydreaming, lead to failures in practical monitoring tasks [13].

The model resulting from this additional analysis is shown in Fig. 2. This expert model is based on eight input metrics, used to compute three mid-level features: Performance, Efficiency, and Responsiveness. An overall composite measure of current engagement within a given data window is derived from these mid-level features. The relative weights and derivation methods shown in Fig. 2 represent the initial trial conditions. We anticipate that these will be refined through additional testing.
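
The two-stage aggregation described above can be sketched as follows. The metric names, weights, and derivation method (simple weighted averages) are illustrative placeholders, not the values shown in Fig. 2:

```python
# Illustrative sketch of the two-stage aggregation: eight input metrics
# feed three mid-level features, which combine into a composite score.
# All names and weights here are placeholders, not the tuned values.

MID_LEVEL = {
    "performance": {"task_success": 0.6, "error_rate": 0.4},
    "efficiency": {"path_deviation": 0.5, "control_smoothness": 0.3,
                   "completion_time": 0.2},
    "responsiveness": {"event_latency": 0.5, "input_rate": 0.3,
                       "correction_direction": 0.2},
}
COMPOSITE_WEIGHTS = {"performance": 0.4, "efficiency": 0.3,
                     "responsiveness": 0.3}

def mid_level_features(metrics):
    """Weighted average of normalized (0-1) input metrics per feature."""
    return {
        feature: sum(w * metrics[name] for name, w in inputs.items())
        for feature, inputs in MID_LEVEL.items()
    }

def engagement_score(metrics):
    """Composite engagement for one data window, in [0, 1]."""
    feats = mid_level_features(metrics)
    return sum(COMPOSITE_WEIGHTS[f] * v for f, v in feats.items())
```

Because both stages are convex combinations, the composite stays in [0, 1] whenever the normalized inputs do, which keeps the score directly comparable across data windows.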

Fig. 2. Engagement model

2.2 Engagement Metrics and Virtual Reality

A desktop flight simulator generates a rich set of data, including aircraft position, attitude and configuration. From such data, objective performance metrics can be calculated with some reliability. For instance, detecting when a student pilot lowers the gear while the airspeed exceeds the maximum gear-down speed is straightforward. Monitoring engagement, however, requires aggregating observable measures to generate an indirect estimate of engagement. Our model, for instance, specifies eight such indirect measures.
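
A direct check of the kind described above can be stated in a few lines. The speed limit and log format here are hypothetical, not the actual T-6 limitation:

```python
# Minimal sketch of an objective performance check: flag any sample where
# the gear is down while airspeed exceeds the maximum gear-extended speed.
# The 150-knot limit and the (time, airspeed, gear) log layout are
# illustrative placeholders, not actual aircraft or Prepar3D values.

MAX_GEAR_DOWN_KNOTS = 150.0

def gear_overspeed_events(samples):
    """samples: iterable of (time_s, airspeed_knots, gear_down) tuples.
    Returns the timestamps at which the gear-down speed was exceeded."""
    return [t for t, airspeed, gear_down in samples
            if gear_down and airspeed > MAX_GEAR_DOWN_KNOTS]
```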

The addition of a VR head-mounted display (HMD) contributes additional data points that could be incorporated into a suite of metrics for monitoring engagement levels. A typical VR headset and its sensors can capture head position and movement; higher-end devices, such as the VIVE Pro Eye, can capture eye tracking data. The VR environment thus adds to the already rich data stream available from the simulator. This apparent abundance of data, however, does not solve the problem of developing reliable measures of engagement. Several challenges for interpreting the data remain, including, non-exhaustively:

  1. Understanding which data points are relevant to engagement;

  2. Setting proper coefficients representing how each data point should be weighted;

  3. Distinguishing between, and properly applying, a single data point x observed at time t versus a trend of how x behaves over some interval (e.g., from t − 5 s to t + 5 s);

  4. Incorporating the velocity of the change in a data point, for instance, how abrupt an aileron deflection or throttle movement the student applied.
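
The distinction between a point value, a trend, and a velocity (items 3 and 4 above) can be sketched for a single logged variable. The log layout and window width are illustrative assumptions:

```python
# Sketch of three ways one logged variable x can feed the model: its
# instantaneous value at time t, its trend over a window around t, and
# the velocity (rate of change) near t, e.g. how abrupt a control input
# was. The (time_s, value) log layout is a placeholder, not a real
# Prepar3D field format.

def value_at(log, t):
    """Instantaneous value: the sample closest to time t."""
    return min(log, key=lambda s: abs(s[0] - t))[1]

def trend(log, t, half_window=5.0):
    """Mean over [t - half_window, t + half_window] (e.g., t - 5 s to t + 5 s)."""
    window = [x for (ts, x) in log if t - half_window <= ts <= t + half_window]
    return sum(window) / len(window)

def velocity(log, t):
    """Rate of change between the two samples nearest t."""
    (t0, x0), (t1, x1) = sorted(log, key=lambda s: abs(s[0] - t))[:2]
    return (x1 - x0) / (t1 - t0)
```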

A principal emphasis of this work is to explore the role that machine learning models could play in interpreting simulator and VR device data in order to develop measures to drive our conceptual model of engagement. This machine learning approach is summarized in the next section.

3 Machine Learning Approach

3.1 Machine Learning Model

We employ machine learning to allow OMEGA to develop more accurate predictive associations between raw data inputs and higher-level aggregated engagement metrics. This section describes the techniques and architecture of the OMEGA machine learning component. Our design leverages the underlying data streams available from Prepar3D to provide better predictive power in situations where there is limited access to interpreted data (e.g. when interpreted metrics of event occurrence, event success/failure, and efficiency are not available). To achieve this, we employ three methods in combination:

  1. We use standard machine learning techniques to attempt to accurately predict engagement and disengagement in input metric sequences. These approaches are attractive because they enjoy fast estimation methods with low run-time, and therefore can provide near-instant feedback to instructors. Based on the features and data available in Prepar3D, we have selected Support Vector Machine (SVM) and Binomial Regression as the most applicable approaches. These techniques are most powerful in cases where sequence classification is not strongly context-dependent. For OMEGA, however, we expect some context-dependence in the data. For example, rapid adjustments of heading, altitude and airspeed may represent recovery from a period of inattention if these maneuvers occur between waypoints, but may represent an attentive reaction if observed during an event requiring active response (e.g. a heading change when passing a waypoint). To mitigate this risk and to improve the model, we deploy two additional “deep learning” layers of machine learning that are more robust to sequence classification in context-dependent data.

  2. We use a form of deep learning called bidirectional long short-term memory (BD-LSTM), a type of recurrent neural network, to produce improved results in sequence classification problems that are heavily context-dependent. In this case, determining whether a given sequence of composite low-level metric readings represents disengagement is likely to be highly context-dependent, for instance a climb at full power versus a climb at normal cruise speed. We use bidirectional LSTMs, which consider the ‘context’ of both the preceding and following time slice data when predicting disengagement, to reduce the incidence of false-positive detection of disengagement in this environment. The tradeoff for improved disengagement and inattention detection is the high resource and time cost of maintaining bidirectional LSTM in OMEGA. Recurrent Neural Networks (RNNs) like LSTM can be difficult to train due to memory-bandwidth-bound computation limitations. For this reason, we have selected a second deep learning approach in case the performance requirements present too much risk.

  3. We use an alternative deep learning approach called Attention-based Modeling as the third layer in OMEGA’s machine learning stack. Attention-based models are sequence-to-sequence models designed to improve performance of RNN-based approaches. This third layer provides an alternative mechanism in cases where bidirectional LSTM is too resource-intensive to be effective in real-time.
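
As a rough sketch of the second layer, the following shows how a BD-LSTM sequence classifier of the kind described above might be framed in PyTorch. The metric count, hidden size, and mean-pooling readout are assumptions for illustration, not the deployed configuration:

```python
import torch
import torch.nn as nn

class DisengagementBDLSTM(nn.Module):
    """Sketch of a bidirectional LSTM window classifier.
    Input: (batch, time, n_metrics) windows of low-level metrics.
    Output: per-window probability of disengagement.
    Hyperparameters are placeholders, not tuned values."""

    def __init__(self, n_metrics=8, hidden=64):
        super().__init__()
        # bidirectional=True lets the model use both preceding and
        # following time-slice context when classifying a window
        self.lstm = nn.LSTM(n_metrics, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)       # average features over the window
        return torch.sigmoid(self.head(pooled)).squeeze(-1)
```

The bidirectional readout is what distinguishes this from the faster point-wise classifiers in the first layer: the same control input is scored differently depending on what surrounds it in time.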

3.2 Training the Model

Data for training the machine learning components derives from experimental subjects who fly a pre-selected set of PTN scenarios in a data collection station that mirrors most of a PTN simulator, namely, the simulation software, stick and throttle, and VIVE Pro HMD and sensors. The data collection station also includes a dedicated application for the experimenter to monitor each scenario, interact with the subject, and time-stamp relevant events. Figure 3 shows an experimenter and subject during a data collection session.

Fig. 3. Data collection (background), experimenter (foreground) stations. Photo by the authors.

For purposes of creating training data for the machine learning models, experimenters are trained in a protocol to (1) time-stamp lower-intensity and higher-intensity segments of a scenario, to help the models account for workload in processing measures of user activity; and (2) engage the subject in conversation at specific points during a scenario. Conversing with the subject acts as a surrogate for disengagement. We posit that loss of attention, or distraction, will be statistically detectable in the simulation log files. Specifically, we anticipate three possible types of deviations:

  1. Response Time: Most dominantly, we anticipate that subjects’ time to respond to changes in the environment will be slower and/or less precise when engaging in conversation. Specifically, we anticipate a longer duration with no response after an event that requires a maneuver (e.g. heading change), followed in some cases by an initial control input that is more abrupt, more prone to overcorrect, or may even be in the wrong direction.

  2. Performance: We anticipate more likely failure to accomplish scenario goals (e.g., missing required waypoints).

  3. Efficiency: We anticipate that periods of distraction will tend to be less efficient, due to the above issues and due to less precise control over the aircraft (e.g., slower damping of over-correcting heading changes).
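
The Response Time deviation above can be illustrated with a minimal latency check over a control log. The deflection threshold and log format are hypothetical:

```python
# Sketch of the Response Time measure: time from a scripted event (e.g. a
# commanded heading change) to the first meaningful control input, plus
# the direction/magnitude of that input. The 0.05 threshold and the
# (time_s, deflection) log layout are illustrative assumptions.

INPUT_THRESHOLD = 0.05  # minimum stick deflection counted as a response

def response_latency(event_time, controls):
    """controls: time-ordered (time_s, deflection) samples.
    Returns (latency_s, initial_deflection), or None if no response."""
    for t, deflection in controls:
        if t >= event_time and abs(deflection) > INPUT_THRESHOLD:
            return (t - event_time, deflection)
    return None
```

Comparing latencies between conversation and no-conversation segments of the same scenario gives the labeled contrast the models are trained on.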

Subjects were recruited from flying clubs in the Corvallis, OR and Los Angeles regions. Subjects qualified for the study through meeting either flying hour criteria or flight simulator experience criteria. Each subject was given a practice period with the PTN station and then asked to complete six PTN scenarios. The collected data are being used to train the machine learning model, comparing both the predictive power and the latency and resource requirements for each of our deep learning modeling techniques.

4 Adaptive Instruction

4.1 Adaptive Recommendations Model

OMEGA processes detected engagement levels to generate adaptive recommendations that help an instructor restore lapsed engagement. During a scenario, based on combinations of different state signals, OMEGA will generate a set of intermediate inferences. These inferences include, for example, whether poor performance is due to consistently bad results or to irregular, inconsistent behavior (e.g., carelessness). The model employs both the basic state model and the aggregated inferences as inputs to calculate a ranking score for different adaptive interventions.

Our model considers three levels of outcomes: performance, responsiveness, and efficiency, each representing a distinct dimension of quality. Performance represents the basic ability to complete the assigned tasks, based on the performance criteria for those tasks (e.g., following a set of waypoints). Responsiveness represents the speed and effectiveness for a learner to adjust to new tasks or requirements (e.g., if a waypoint is moved, how quickly does the user adjust heading). Efficiency represents lean and strategic use of resources to complete a scenario (e.g., faster completion times).

These quality criteria build upon one another: a learner must adjust heading to a new waypoint or else there is no way to determine responsiveness. Likewise, efficiency is impossible if the user is not responsive enough to stay on course. This means that only some factors should be addressed with certain types of learners (e.g., high vs. low expertise). For example, if a user is failing to master proper take-off procedures, critiquing fuel efficiency would add no training value. On the other hand, an otherwise high-performing student pilot who is drifting off-course or leaving assigned altitudes may benefit from a prompt noting the need for improved in-flight checks.
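
The layered gating described above can be sketched as a simple priority rule. The threshold and the returned labels are illustrative, not the actual PTN policy:

```python
# Sketch of the layered quality criteria: address performance first,
# then responsiveness, and critique efficiency only when the other two
# are already adequate. The 0.7 threshold and dimension labels are
# illustrative placeholders.

def intervention_focus(performance, responsiveness, efficiency,
                       threshold=0.7):
    """Each score is in [0, 1]. Returns the dimension to address first,
    or None when no intervention is warranted."""
    if performance < threshold:
        return "performance"      # e.g. basic procedures not yet mastered
    if responsiveness < threshold:
        return "responsiveness"   # e.g. slow reaction to a moved waypoint
    if efficiency < threshold:
        return "efficiency"       # e.g. prompt better in-flight checks
    return None
```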

The policy for adaptive interventions is depicted schematically in Fig. 4. The right-hand side of Fig. 4 shows the interventions proposed for PTN. These comprise three distinct types: Messaging (information about the task), Motivation (context about the task and learning goals), and Recommendations (suggestions for different tasks or breaks to improve learning). The left-hand side of Fig. 4 outlines a high-level policy for when specific interventions are expected to be appropriate for users with different skill levels and in different states. These connections between intervention types and student states do not represent the actual model; rather, they represent key dynamics that the model should produce. Because the full state space for calculating an effective intervention policy is too large to render as a compact graph, this schematic captures the key behaviors against which the intervention model will be tested, to ensure it behaves in accordance with theoretical frameworks for engagement and for responding to disengagement. The intervention types are outlined in Table 1.

Fig. 4. High-level policy for adaptive interventions in PTN

Table 1.  Intervention types for PTN

4.2 Generating Adaptive Recommendations

We propose two distinct methods for generating adaptation recommendations based on the internal state of the models used to measure engagement. The first approach is much less computationally-intensive and will produce recommendations with lower latency. However, given the highly contextualized nature of the input data, we expect the second approach to produce more accurate results. Work is currently in-progress for testing the trade-offs between timeliness and quality under different simulation conditions.

In the first approach, we use the calculated values from the engagement model (performance, efficiency, and responsiveness) as inputs to a machine learning classification model. Using the labeled data set produced during the data collection trials discussed above, we apply several traditional machine learning techniques to the classification task, where the outputs are the set of adaptations available in the PTN training environment. We use both Naive Bayes and Support Vector Machine (SVM) approaches. Since the input metrics include variables that are highly interdependent, we expect that SVM will yield superior results. SVMs have been demonstrated to predict the likelihood of learner withdrawal from online courses, for example [14]. These techniques do not account for the contextual nature of the data, instead analyzing each time slice as a separate case. As was the case for detecting engagement levels, the determination of an appropriate adaptation will depend on the context in which the disengagement event occurs, as well as on the environmental conditions being simulated.
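
A minimal sketch of this first, time-slice-based approach follows, using scikit-learn's SVM classifier. The tiny hand-made data set and the adaptation labels are illustrative only; the real model trains on the labeled data from the collection trials:

```python
# Sketch of per-time-slice adaptation classification: each slice's
# (performance, efficiency, responsiveness) triple maps to a recommended
# adaptation. Data rows and labels are fabricated for illustration.

from sklearn.svm import SVC

# Each row: [performance, efficiency, responsiveness] in [0, 1].
X = [[0.9, 0.8, 0.9], [0.3, 0.4, 0.3], [0.8, 0.3, 0.8],
     [0.2, 0.5, 0.4], [0.9, 0.9, 0.8], [0.4, 0.3, 0.2]]
y = ["none", "recommendation", "messaging",
     "recommendation", "none", "recommendation"]

clf = SVC(kernel="rbf")  # rbf handles the interdependent inputs
clf.fit(X, y)
```

Note that each slice is classified independently here, which is exactly the limitation the second, context-aware approach is designed to overcome.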

We define context as the data stream of pilot behaviors and actions preceding and following a time slice, and environment as the set of conditions that obtain for that particular segment of the simulation (e.g. aircraft attitude, airspeed, status of systems). In order to fully account for the contextual nature of both disengagement detection and the recommendation of an appropriate adaptation, we are developing a second, more powerful model for capturing temporal information and learning high-level representations hidden in the metric data stream based on Artificial Neural Networks (ANN).

Much of the research in applying ANN to the interpretation of sensor data streams has been focused on traditional neural network approaches, such as feed-forward neural networks (FFNN) and deep convolutional neural networks (CNN). Al-Shabandar, et al. [15] employed a range of machine learning models including ANNs to investigate factors driving student motivation in massively-online open courses (MOOCs). Recent success of recurrent neural networks (RNN) with long short-term memory (LSTM) in other applications has led to promising trials of this approach in using sensor data to predict highly contextual operational states. RNNs have been used for, among other applications, associating student engagement with outcomes in MOOCs [16].

Variations of this approach have recently explored incorporating operational conditions into the predictive model. These approaches use several BD-LSTM models and a final FFNN layer to integrate both the contextual information encoded in the data stream (representing the sensor data) and the available operating context and environmental data. The model we have designed adapts this approach to the interpretation of Prepar3D data for (1) predicting pilot engagement and detecting disengagement; and (2) using context about the nature of the disengagement to predict the most effective adaptation to recommend to the human instructor in the context of PTN training.

The model is composed of several stacked layers of ANNs. The first BD-LSTM network extracts latent features from the multiple metric data streams describing pilot behavior. The second BD-LSTM network extracts latent features from the metric data stream describing aircraft movement, and the third BD-LSTM network extracts higher level features describing the operational environment. These layers are stacked with a final neural network layer to predict pilot disengagement level and events, depicted schematically in Fig. 5. The states of the internal layers of these BD-LSTM networks are then used as the input into a separate recommendation model to predict the most appropriate adaptation in a given context. The recommendation model layer is a CNN network, which will be trained using the labeled data set from the pilot trials.
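
The stacked architecture described above might be framed in PyTorch as follows. The per-stream input widths, hidden sizes, and mean-pooling fusion are assumptions for illustration, not the actual Fig. 5 configuration (and the final CNN recommendation layer is omitted for brevity):

```python
import torch
import torch.nn as nn

class StackedEngagementModel(nn.Module):
    """Sketch of the stacked design: three BD-LSTMs over the
    pilot-behavior, aircraft-movement, and environment streams, fused by
    a feed-forward layer that predicts disengagement probability.
    Stream widths and hidden sizes are placeholders."""

    def __init__(self, dims=(6, 6, 4), hidden=32):
        super().__init__()
        self.streams = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
            for d in dims)
        self.fuse = nn.Sequential(
            nn.Linear(3 * 2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, behavior, aircraft, environment):
        feats = []
        for lstm, x in zip(self.streams, (behavior, aircraft, environment)):
            out, _ = lstm(x)              # (batch, time, 2 * hidden)
            feats.append(out.mean(dim=1)) # pool each stream over time
        return torch.sigmoid(self.fuse(torch.cat(feats, dim=1))).squeeze(-1)
```

In the full design, the internal states of the three BD-LSTM streams would also be exposed as inputs to the separate recommendation model rather than discarded after pooling.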

Fig. 5. Adaptation recommendation stacked neural network architecture

5 Conclusions

We have concluded data collection and will present our results from the machine learning model development during the conference. A formative evaluation using Air Force instructor pilots to provide feedback on OMEGA’s recommendations will immediately follow the model development. Simulations enhanced with VR provide immersive training that promises to advance learning outcomes and retention. A key factor in achieving positive results is learner engagement, which is more challenging to assess than directly observable or objectively measurable factors. In some instances, the VR environment itself can obscure cues relevant to learning engagement from instructor view. OMEGA addresses this gap by using machine learning models to develop predictive associations between simulation events and learner actions on the one hand, and learner engagement on the other. OMEGA also incorporates a model of adaptive interventions to remedy engagement lapses, and employs machine learning to develop associations between the context and environment of the engagement lapse and the optimal intervention to recommend.

Our results will provide concept validation toward establishing a more general-purpose, service-oriented appliance that client learning applications can employ for detecting lapses in engagement and motivation, and for recommending adaptive interventions. OMEGA can thus address a need, across the service branches, to ensure that simulation-based training, and training incorporating VR, results in engaged and motivated warriors, using adaptive instruction and providing data to help training managers track the efficacy of new technologies and paradigms.