1 Introduction

Findings from a large number of clinical studies [7, 10] have linked cardiovascular disorders to psychological stressors. In particular, long-term exposure to stressful environments, such as those commonly found in offices, has been shown to be a major risk factor for major depressive disorder (MDD), a disorder that affects an estimated 350 million people worldwide [11]. In response, we have witnessed a surge in the development of ambient intelligence systems for mental health and stress monitoring [1].

1.1 HRV Stress Analysis

Historically, heart rate variability (HRV) spectrograms have been used by medical practitioners as the primary tool for externally measuring cardiac neural regulation. This practice grew out of findings that mental and physical states have measurable effects on cardiac rhythm that can be captured easily and unintrusively with electrocardiographs (ECG) [8]. Over the past few decades, studies have established that the connection between HRV and the sympatho-vagal stress response is visible in both the time and frequency domains. Through experiments and empirical analysis, these works have identified indices that correlate strongly with physical and mental stress, particularly in the high- and low-frequency ranges [12].

Despite our growing understanding of physiological signals, there is still no widely accepted standard for stress evaluation based on HRV features. Recent surveys [9, 16] report neurobiological evidence linking some components of HRV to activity in cortical regions involved in stress appraisal, as well as to other brain mechanisms. As it stands, one of the main challenges is identifying which components of HRV are relevant for evaluating stress.

1.2 Deep Learning and Heart Rate Analysis

Traditionally, affective analysis, such as that performed for stress detection, has focused mostly on visual or auditory measures [17], where deep learning has proven effective. In recent years, other works in the field of ECG analysis for medical applications have also pushed for end-to-end applications of deep learning.

In one of the most prominent works [14], the authors used a 34-layer residual convolutional neural network (CNN) to analyze raw ECG sequences and identify 14 different types of arrhythmia. Using a very large dataset of over 60,000 recordings, they trained a model that performed better on average than the cardiologists recruited for the experiment. Their work demonstrates how deep learning can identify structural patterns in heart rate data.

In the current work, we attempt to exclude morphological patterns and instead focus on frequency-related features of heartbeats. Toward this end, we introduce the use of short-time Lomb-Scargle (LS) spectrograms to train CNNs.

2 Methodology

This work aims to use spectral HRV data representations together with deep learning models to predict mental stress. Specifically, we use a sliding-window Lomb-Scargle approach to build a spectrogram of heart rate in lieu of the traditional short-time Fourier transform (STFT). We then take pre-built deep models designed for image recognition, compare them with simpler architectures, and train them to recognize stress. The following sections describe the experiments and tools used in this research.

2.1 Dataset

For the following experiments we use the work stress dataset introduced in a previous work [5]. It consists of ECG recordings of subjects performing naturalistic desk work in front of a PC. The physiological recording device was a wearable ECG sensor by Imec, which allowed unconstrained and continuous measurement of participants’ heart rate. The sensor is worn with leads placed diagonally over the heart, as shown in Fig. 1. RR intervals were then extracted from the raw ECG signal using a QRS detection toolkit [4] and used to build the HRV spectrograms.
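As a rough illustration of this preprocessing step (not the QRS detection toolkit of [4]), RR intervals can be obtained from a raw ECG trace with a simple peak-based sketch such as the one below; the sampling rate handling and detection thresholds are illustrative assumptions.

```python
# Illustrative sketch only: the paper uses a dedicated QRS detection toolkit [4];
# a simple amplitude/refractory-period peak detector stands in for it here.
import numpy as np
from scipy.signal import find_peaks

def ecg_to_rr(ecg, fs):
    """Return R-peak times (s) and RR intervals (s) from a raw ECG trace."""
    # Assume R peaks are the dominant positive deflections and enforce a
    # physiologically plausible refractory period of 0.4 s (< 150 bpm).
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs),
                          height=np.percentile(ecg, 95))
    r_times = peaks / fs          # R-peak timestamps in seconds
    rr = np.diff(r_times)         # RR intervals between successive beats
    return r_times[1:], rr        # timestamp each interval by its closing beat
```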

Fig. 1. The 2-lead wearable ECG sensor used in the experiments.

The dataset was gathered from four subjects, each contributing five separate hour-long sessions annotated with activity and stress labels as shown in Table 1. This provided a total of 20 hours of ECG data for training and testing.

Table 1. Stress annotations distribution

2.2 Signal Processing

Previous works on spectral analysis of HRV applied the fast Fourier transform (FFT) [19] to produce periodograms from which the power of different frequency bands can be measured and compared. However, one limitation of FFT-based analysis is that it assumes all samples are evenly spaced in time. This poses a fundamental issue for heart rate, since the RR intervals used to represent detected heart beats are not evenly sampled. A study [2] has shown that resampling RR interval data to enable FFT analysis can introduce artifacts and increase the error of the final PSD estimate. A better alternative is the Lomb-Scargle periodogram, a least-squares spectral method that does not require evenly sampled data.

Lomb-Scargle Periodogram. Periodograms are some of the most basic tools used for spectral analysis, the most common of which is the discrete Fourier transform (DFT). The DFT can be defined for any sampled dataset \(\{X(t_j),\, j=1,2,\ldots ,N_0\}\) as,

$$\begin{aligned} FT_X(\omega )=\sum _{j=1}^{N_0} X(t_j)\exp (-i \omega t_j). \end{aligned}$$
(1)

The periodogram can then be defined as,

$$\begin{aligned} \begin{aligned} P_X(\omega )&=\frac{1}{N_0}|FT_X(\omega )|^2 \\&= \frac{1}{N_0}\Bigg |\sum _{j=1}^{N_0} X(t_j)\exp (-i \omega t_j)\Bigg |^2 \\&= \frac{1}{N_0}\Bigg [\bigg (\sum _{j} X_j \cos \omega t_j\bigg )^2+\bigg (\sum _{j} X_j \sin \omega t_j\bigg )^2\Bigg ]. \end{aligned} \end{aligned}$$
(2)

This formulation [3], while sufficient for accurately estimating the spectrum of most signals, was found to be sensitive to noise and other artifacts [18]. In practice, additional spectral smoothing, such as spectral window functions, must be applied to reduce the variance of the spectrum. Most of these methods are designed for evenly sampled data. Although some can also be applied to the unevenly sampled case, such as RR intervals, doing so has been found to increase error and sensitivity to signal noise [13]. This can be mitigated by adding more data when calculating the spectra in order to improve the signal-to-noise ratio, but that may not be an option in data-scarce or time-critical domains. A reasonable alternative is the Lomb-Scargle periodogram [15], a method designed for calculating frequency spectra from unevenly sampled data streams such as heart beats. Its formulation closely resembles the classical periodogram but addresses some of the noise-sensitivity and spectral-leakage issues of the standard FFT-based estimate. Here, the periodogram is defined as,

$$\begin{aligned} P_X(\omega )=\frac{1}{2}\left\{ \frac{\left[ \sum _j X_j \cos \omega (t_j-\tau )\right] ^2}{\sum _j \cos ^2 \omega (t_j - \tau )} + \frac{\left[ \sum _j X_j \sin \omega (t_j-\tau )\right] ^2}{\sum _j \sin ^2 \omega (t_j-\tau )}\right\} \end{aligned}$$
(3)

with \(\tau \) defined as,

$$\begin{aligned} \tan (2\omega \tau ) = \left( \sum _j \sin 2 \omega t_j \right) \Bigg / \left( \sum _j \cos 2 \omega t_j \right) . \end{aligned}$$
(4)

This modified formulation outputs values similar to the classical periodogram in most cases, but it has simpler statistical behavior and is equivalent to the reduction of the sum of squares obtained by least-squares fitting of sine waves to the data [15]. It is also time-translation invariant and reduces to a representation similar to the DFT when the data are evenly spaced. This makes it ideal for HRV analysis, and it is the method used in this research. The next issue is how to handle the non-stationary nature of heart rate. For this, we construct spectrograms from the periodograms.
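As a minimal sketch of how such a periodogram can be computed over the standard HRV bands (roughly 0.003–0.4 Hz), the following uses scipy.signal.lombscargle, which expects angular frequencies; the frequency grid and band limits are illustrative choices rather than the exact settings used in this work.

```python
# Minimal sketch of an LS periodogram for an unevenly sampled RR series.
import numpy as np
from scipy.signal import lombscargle

def rr_lomb_scargle(t, rr, f_min=0.003, f_max=0.4, n_freqs=256):
    """Lomb-Scargle PSD estimate of an RR series.

    t  : beat timestamps in seconds (uneven spacing is fine)
    rr : RR intervals in seconds, one per timestamp
    """
    freqs = np.linspace(f_min, f_max, n_freqs)    # Hz, spanning VLF..HF bands
    ang = 2.0 * np.pi * freqs                     # lombscargle expects rad/s
    power = lombscargle(t, rr - rr.mean(), ang)   # Eq. (3), with tau as in Eq. (4)
    return freqs, power
```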

Spectrogram Analysis. Spectrograms are visual representations of frequency spectra over time. They are used extensively in applications dealing with audio and radio signals, and recently they have also seen frequent application in the analysis of physiological signals such as brainwaves and heart rate. The most common method for building spectrograms is the short-time Fourier transform (STFT). Put simply, it is built by applying the DFT over a pre-defined window of data and sliding the windowed function over the data series with a fixed amount of overlap. Formally, for any series \(x(n)\) we can define the STFT as,

$$\begin{aligned} X_w(mS,\omega ) = \sum _{n=-\infty }^{\infty } x(n)w(n-mS)\exp (-j \omega n) \end{aligned}$$
(5)

for analysis window \(w\), frame index \(m\), and step size \(S\) [20]. Each value \(X_w(mS,\omega )\) is the Fourier transform of the windowed segment \(x(n)w(n-mS)\), and the results are stored in a matrix \(\hat{X}\) whose columns are Fourier periodograms \(P_X(\omega )\). To adapt the spectrogram to the unevenly sampled heart rate data, we replace this transform with the Lomb-Scargle formulation,

$$\begin{aligned} X_w(mS,\omega ) = \frac{1}{2}\left\{ \frac{\left[ \sum _j x(t_j) \cos \omega (t_j-\tau )\right] ^2}{\sum _j \cos ^2 \omega (t_j - \tau )} + \frac{\left[ \sum _j x(t_j) \sin \omega (t_j-\tau )\right] ^2}{\sum _j \sin ^2 \omega (t_j-\tau )}\right\} \end{aligned}$$
(6)

where the sums run over the beat samples \(t_j\) that fall within the \(m\)-th analysis window. In this way we mitigate the sensitivity of the standard STFT to uneven sampling while maintaining its ability to capture the non-stationary patterns in the temporal data. In effect, we end up with a high-dimensional representation that could potentially carry more expressive and appropriate features for detecting stress. The next step is to use machine learning to discover these features and compare the resulting performance with models based on traditional HRV features. Given the 2D spatial nature of this representation, convolutional neural networks should provide an efficient learning architecture.
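A minimal sketch of this short-time Lomb-Scargle construction is given below: a fixed-width window slides over the beat series and one LS periodogram is stacked per frame to form the matrix \(\hat{X}\). The window and step sizes shown are placeholders, not the settings used in the experiments.

```python
# Sketch of a short-time Lomb-Scargle spectrogram over an uneven RR series.
import numpy as np
from scipy.signal import lombscargle

def st_lomb_scargle(t, rr, win_s=360.0, step_s=30.0,
                    f_min=0.003, f_max=0.4, n_freqs=128):
    """Return (frame_times, freqs, spectrogram) for beat times t and RR values rr."""
    freqs = np.linspace(f_min, f_max, n_freqs)
    ang = 2.0 * np.pi * freqs
    frames, centers = [], []
    start, t_end = t[0], t[-1]
    while start + win_s <= t_end:
        mask = (t >= start) & (t < start + win_s)        # beats inside this window
        if mask.sum() > 2:                               # need enough beats to fit
            x = rr[mask] - rr[mask].mean()
            frames.append(lombscargle(t[mask], x, ang))  # one column of X-hat
            centers.append(start + win_s / 2.0)
        start += step_s
    return np.array(centers), freqs, np.column_stack(frames)
```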

3 Experiments and Validation

3.1 Dataset and Models

For the following experiments, we applied a sliding-window method to segment the stress dataset; this approach also lends itself to real-time processing in the future. When deciding on an appropriate window length, we observed that shorter segments tended to be more stable but not very expressive, while longer segments were more expressive but tended to overlap across samples with different labels. Through empirical testing, we found that 505-beat windows (\(\sim \)6 min) struck the best balance between stability and expressiveness while staying above the 5-minute minimum length recommended for accurate short-term HRV analysis (see Fig. 2).

Fig. 2. Image samples of extracted ECG spectrograms at different window widths.

We also applied a modal labelling scheme for window segments that overlapped multiple labels. A threshold of 70% was set to mark stable segments, which could be labelled with the modal value of the contained samples, and to separate them from “transitional” segments, which were considered more ambiguous. The final experiment dataset comprised approximately 1,500 windowed samples, each labelled either with the 4-level stress annotation or as a transition.
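The modal labelling rule can be sketched as follows; the `TRANSITION` sentinel and function name are illustrative rather than part of the original toolchain.

```python
# Sketch of the modal labelling rule: a window keeps the modal stress label only
# if that label covers at least 70% of its samples; otherwise it is marked as a
# "transitional" segment.
import numpy as np

TRANSITION = -1  # hypothetical label id for transitional segments

def window_label(sample_labels, threshold=0.70):
    values, counts = np.unique(sample_labels, return_counts=True)
    modal = values[np.argmax(counts)]
    if counts.max() / len(sample_labels) >= threshold:
        return int(modal)     # stable window: use the modal stress level
    return TRANSITION         # ambiguous window: mark as transitional
```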

For modeling, we trained perceptron and autoencoder models on the standard HRV features to serve as baselines and compared them to convolutional neural network (CNN) models trained on our spectral HRV representation. The convolutional layers were topped with a 128-unit dense layer to match the one used in the non-deep HRV stress models. For the CNN models, we tested both standard deep architectures typically used for image classification and shallower CNN architectures in order to verify the hierarchical feature requirements of the dataset. Performance was validated using stratified 10-fold cross-validation.
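The validation loop can be sketched with scikit-learn's StratifiedKFold; `build_model`, `X`, and `y` are placeholders for the models and windowed dataset described above, and an sklearn-style fit/score interface is assumed.

```python
# Sketch of stratified 10-fold cross-validation over the windowed dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_model, X, y, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                      # fresh model for each fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```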

3.2 Results and Discussion

First, we analyze the performance of models trained on the standard HRV features (14 time- and frequency-domain features) and compare them to the deep ResNet [6] model trained on the LS spectrum data. As seen in Fig. 3, the baseline HRV-MLP model achieved 52% test accuracy, which improved to 60% when denoising autoencoders were applied for more efficient latent feature selection. These results are well above the 20% chance level on this 5-class dataset.

Fig. 3. Comparison of test performance results between models trained on standard HRV features (MLP and DA) and the ResNet model trained on STLST heart rate spectrum data.

Deep Spectral Model. For the spectral models, ResNet50 achieved over 80% testing accuracy, well above both of the traditional HRV baselines. These gains can be explained by the combination of the more expressive spectral data with the feature-learning ability of deep learning algorithms. The indices used in traditional HRV analysis help humans interpret physical conditions, but they may also obscure intricate features that are necessary for deeper affective analysis. By allowing the model to learn from a spatial-spectral representation, we are able to train a more effective model. However, this comes at the cost of increased complexity, since the standard 50-layer residual model requires over 25 million parameters. This not only raises data requirements but may also make interpretation and optimization more challenging.

Shallow Spectral Model. To help alleviate the problem of increased complexity, we attempted to train a more compact model. For these tests, the goal was to reduce the depth and overall complexity of the model without compromising performance. This culminated in a model featuring 5 intermediate 128-filter CNN layers with BatchNorm for regularization. The relatively shallow model also meant residual units were unnecessary, and training was much faster than for the deeper model. Tests were also performed on models shallower than 5 layers, but these typically led to larger performance variations and more noticeable overfitting. Figure 4 compares the learning process of the deep and shallow models. Despite using fewer than 1 million parameters, as opposed to the more than 25 million of ResNet50, the shallow model matches the performance of the much larger ResNet model while also avoiding overfitting. Training time could also be reduced by up to 5x, which may be useful in situations where occasional retraining is necessary.
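A minimal Keras sketch of such a shallow spectral model is shown below: five 128-filter convolutional layers with BatchNorm, a 128-unit dense layer, and a 5-way softmax (4 stress levels plus the transitional class). The kernel sizes, pooling, and input shape are illustrative assumptions rather than the exact configuration used here.

```python
# Sketch of a shallow 5-layer spectral CNN (illustrative hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers

def build_shallow_cnn(input_shape=(128, 64, 1), n_classes=5):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=input_shape))      # spectrogram "image" input
    for _ in range(5):                                # 5 intermediate conv layers
        model.add(layers.Conv2D(128, 3, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())        # BatchNorm for regularization
        model.add(layers.MaxPooling2D(pool_size=2))   # halve the spatial dimensions
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(128, activation="relu"))   # matches the 128-unit head
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With a single-channel input of roughly this size, such a configuration stays under one million parameters, consistent with the comparison above.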

Fig. 4. Learning and test performance of ResNet50 and the shallower 5-layer CNN model.

3.3 ROC Analysis

Next, we analyze how well the model discriminates between the different stress classes. For this we examine the receiver operating characteristic (ROC) curves for each class, using a one-versus-all methodology. As seen in Fig. 5, the deep ResNet model discriminates the extreme labels (i.e., very low and very high) very well, while the intermediate labels (low and high) are slightly more challenging to distinguish. This could indicate greater ambiguity in the physiological signals associated with moderate stress levels. Given that the stress levels are expected to be ordinal, it should be easier to separate classes at either end of the spectrum than those in the middle. This may be a sign that the features discovered by the model follow the same intended ordinality as the stress labels.

Alternatively, this could also be due to labelling inconsistencies common to psychological labels: it is simply easier to assign accurate labels to extreme emotions than to moderate ones. Future work should explore this further in order to determine a more efficient annotation methodology, or possibly to correct or omit inconsistent labels.
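A sketch of the one-versus-all ROC computation described above, using scikit-learn, is given below; `y_true` (integer stress labels) and `y_score` (per-class predicted probabilities) are assumed to be collected from the cross-validation folds.

```python
# Sketch of per-class (one-vs-all) ROC curves and AUCs with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def per_class_roc(y_true, y_score, classes):
    y_bin = label_binarize(y_true, classes=classes)   # one indicator column per class
    curves = {}
    for i, c in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
        curves[c] = (fpr, tpr, auc(fpr, tpr))          # store the per-class AUC
    return curves
```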

Fig. 5. ROC for the classification results of the ResNet50 model.

4 Summary and Future Work

In this work we trained a deep model of work stress using a spectral representation of heart rate extracted with the short-time Lomb-Scargle transform. The experiments showed that deep models trained on this spectral representation easily outperform those trained on traditional HRV features when detecting naturalistic stress levels. Additionally, we showed that even relatively shallow models featuring at least 5 convolutional layers can match the performance of ResNet50, which could indicate that at least some shallow hierarchical features are important for effectively modelling naturalistic stress from heart rate.

In future work we plan to investigate the specific features learned by our model through feature activation analysis. We also plan to train and compare the model on other affective computing datasets for emotion. Finally, it would greatly benefit the affective computing community if we could develop optimal representations that enable the aggregation of data from different devices and datasets to build a single model containing the richest features for HRV analysis.