I see it in your eyes: Training the shallowest-possible CNN to recognise emotions and pain from muted web-assisted in-the-wild video-chats in real-time

https://doi.org/10.1016/j.ipm.2020.102347

Highlights

  • We propose the shallowest-possible, and perhaps the shallowest-ever, convolutional neural network model that can predict emotions in real-time from real-life, noisy, laggy, internet-based (in-the-wild) videos, capturing nuances of emotions, i. e. value- and time-continuous affect prediction. The research we present in this paper is directly relevant to healthcare, for applications such as real-time patient monitoring and AI-assisted doctor-patient consultations.

  • The proposed models are computationally inexpensive, and can be embedded into devices such as smartglasses.

  • We use a novel feature selection paradigm that is driven by feature attribution score computations.

  • We investigate and explain the model performance, presenting computations of how exactly the model utilises the input features to make affect-related predictions (Explainable AI).

  • We compute the relevance and utilisation of facial action unit (FAU)-derived features by the model, comparing these against the human perception of emotion expression.

  • We extend this FAU-based ‘affect’ prediction approach to the FAU-based ‘pain-intensity’ prediction problem.

  • As FAUs can be extracted in near real-time, and because the models we developed are exceptionally shallow, this study paves the way for robust, cross-cultural, end-to-end, in-the-wild, real-time affect and pain prediction that is also nuanced, i. e. value- and time-continuous.

Abstract

Robust value- and time-continuous emotion recognition has enormous potential benefits within healthcare. For example, within mental health, a real-time patient monitoring system capable of accurately inferring a patient’s emotional state could help doctors make an appropriate diagnosis and treatment plan. Such interventions could be vital in terms of ensuring a higher quality of life for the patient involved. To make such tools a reality, the associated machine learning systems need to be fast, robust, and generalisable. In this regard, we present herein a novel emotion recognition system consisting of the shallowest realisable Convolutional Neural Network (CNN) architecture. We draw insights from visualisations of the trained filter weights and the facial action unit (FAU) activations, i. e. the inputs to the model, of the participants featured in the in-the-wild, spontaneous video-chat sessions of the SEWA corpus. Further, we demonstrate the generalisability of this approach on the German, Hungarian, and Chinese cultures available in this corpus. The obtained cross-cultural performance is a testimony to the universality of FAUs in the expression and understanding of human affective behaviours. These learnings were moderately consistent with the human perception of emotional expression. The practicality of the proposed approach is also demonstrated on another key healthcare application: pain intensity prediction. Key results from these experiments highlight the transparency of the shallow CNN structure. As FAUs can be extracted in near real-time, and because the models we developed are exceptionally shallow, this study paves the way for robust, cross-cultural, end-to-end, in-the-wild, explainable, real-time affect and pain prediction that is value- and time-continuous.

Introduction

The simultaneous emergence of social media environments as a new source of data and the unprecedented advancement of artificial intelligence (AI) have opened doors to many next-generation applications in healthcare. We first discuss precisely this ever-increasing synergy between the three domains (cf. Figure 1) and the related challenges, followed by our contribution to emotion-aware and pain-aware AI for remote patient monitoring and counselling.

Because social media provides an unparalleled insight into the lives and emotions of real people, it can be leveraged to recognise patterns relating to their health needs. For example, a patient’s feedback on their treatment, in the context of their ethnicity, age, gender, and smoking and drinking habits, gives health professionals a better insight into patient education and demographic-tailored remedies (Antheunis, Tates, & Nieboer, 2013). Social media data helps create strategies for improving customer engagement with different healthcare services, e. g. a targeted stimulus to motivate a fitness regime (Korda & Itani, 2013). Social media enables everyone to exchange their healthcare experiences, and to learn from others (Spink et al., 2004). Information sharing also helps healthcare professionals to identify the gaps towards a better health outcome, e. g. reduction of waiting times, improved customer satisfaction, and a better doctor-patient relationship (Smailhodzic, Hooijsma, Boonstra, & Langley, 2016). In a pandemic like COVID-19, or even otherwise, web-assisted video platforms can be used to address customer grievances, and for remote patient consultations, counselling, and health monitoring (Armfield, Gray, & Smith, 2012) – the related research being the prime focus of this paper. While not the main focus of this paper, for the sake of completeness, we next identify the novel challenges that AI itself poses to the ‘AI-Healthcare-Social data’ synergy, which need consideration for the widespread adoption of any application, including the research presented herein.

The rapid AI advancements have made high-volume, high-velocity social media analytics a feasible reality (Amiriparian, Schmitt, Hantke, Pandit, & Schuller, 2019b; Pandit, Amiriparian, Schmitt, & Schuller, 2019a). Such is the pace of this progress that the finesse with which AI performs complicated tasks remains hard to believe, yet, ironically, is common knowledge – be it coherent synthetic text generation, voice mimicking, or high-resolution ‘deep-fake’ video production. For example, it might be hard to believe that a few of the applications of social data in healthcare in this very manuscript were suggested by the GPT-2 English language model (Radford et al., 2019), with a frighteningly impressive, human-like coherency. Because human-like content generation is now this easy, the same AI advancement simultaneously manifests itself both as a novel challenge and as an opportunity.

Likewise, issues related to data privacy, a central consideration in healthcare, have become complex. Highly personal data (e. g. health issues, vital signs) is what drives healthcare. AI helps with synthetic yet realistic data augmentation, coupled with more advanced anonymisation techniques, addressing privacy-related issues. Simultaneously and ironically, research in AI-based deanonymisation has made anonymisation of data a lot more challenging (Malin & Sweeney, 2004).

The earliest and main criticism of deep learning technology has long been its inexplicability. A white-box AI is a necessity in healthcare, where an incorrect treatment can be fatal, and where each diagnosis should be based on a reliable understanding of the data available. In recent years, there have been monumental developments, with several approaches proposed for model explainability (Alber et al., 2019). In this paper as well, in addition to reporting performance metrics, we compute feature attributions towards a better understanding of the proposed models.

Data mining has helped improve the diagnosis and treatment of various diseases (Rodríguez-González et al., 2012), e. g. the diagnosis of cancer (Ruiz et al., 2018), sleep apnea (Janott et al., 2018; Qian et al., 2017), diabetes (Pratt, Coenen, Broadbent, Harding, & Zheng, 2016), and cardiovascular diseases (Goldberger et al., 2000), and the assessment of psychological stress (Thelwall, 2017; Yoo, Lee, & Ha, 2019). It can also be used as a preventive and diagnostic method to identify ‘red flag’ situations in real-time, e. g. to help identify people prone to suicide, or those with mental conditions such as bipolar disorder (Amiriparian et al., 2019a), autism (Roche et al., 2018), or depression (Schuller, 2016), or those undergoing pain (Lucey, Cohn, Prkachin, Solomon, & Matthews, 2011; Walter et al., 2013) or stress (Zhou, Hansen, & Kaiser, 2001). As discussed previously, an assistive technology recording dyadic conversations and estimating the psychological and physiological state of the recorded individuals can be envisioned.

However, for widespread use in the health sector, such a system needs to meet very high standards, beyond the explainability of its predictions. First, it needs to be robust for use in non-laboratory, unconstrained, noisy, i. e. ‘in-the-wild’, conditions, with data featuring spontaneous behaviours. It should ideally capture the nuances of the emotions/affects of people of different backgrounds, rather than a crude classification into three (positive, negative, and neutral), six, or a similarly tiny number of basic emotion classes. It should ideally track one’s emotions continuously in time, in real-time. In summary, an explainable, value-continuous, time-continuous, multidimensional, subject-independent, light-weight, real-time physiological and affect prediction model trained on in-the-wild data is desired; this is the scope of this paper.

In this paper, we unearth the learnings of the shallowest-possible CNN one can ever realise, as it learns to predict human emotions from in-the-wild videos. The training and predictions are of value- and time-continuous affect dimensions of individuals coming from the very different cultures, contexts, genders, and age-groups featured in the SEWA database, without making use of any audio or textual data. Inspired by these findings, we also explore the use of this network topology for remote video-based pain monitoring on the UNBC-McMaster Shoulder Pain Expression database, and the BioVid database.

In Section 2, we discuss the previous research directly relevant to our current study. We discuss in detail the performance metric we chose for evaluation in Section 3, and the datasets we used in Section 4, complete with a statistical analysis of the features and the labels – which is crucial before beginning with the experiments. We next detail the experimental design pipeline, the different types of models we tried, and how the powerful, yet shallowest-realisable, CNN evolved through these experiments in Section 5. We present the insights gained by analysing the trained weights mapping the features to the output labels in Section 6. We re-evaluate our proposed approach for the time-continuous pain prediction problem in Section 7. We summarise our findings, mentioning briefly the limitations of this study, in Section 8. We also list various avenues for future work as the logical next step – including the research paths we have already begun venturing into.

Section snippets

Related research

The target research problem here is the explainable, robust, value- and time-continuous recognition of affect dimensions (e. g. arousal and valence) on in-the-wild audiovisual recordings, featuring spontaneous behaviours in the conversational context. The publicly available databases (e. g. SMARTKOM (Schiel, Steininger, & Türk, 2002), IEMOCAP (Busso et al., 2008), RECOLA (Ringeval, Sonderegger, Sauer, & Lalanne, 2013), MAHNOB Mimicry (Bilakhia, Petridis, Nijholt, & Pantic, 2015), 4D CCDb (

Choice of evaluation metric

The primary focus of this paper is the explainable, time-continuous prediction of the affective and physiological state of a subject under observation, e. g. a patient. Choosing a metric for evaluation of the predictive capability of the system is a crucial step. For a time-series prediction that is value-continuous (i. e. a regression problem), arguably the most popular performance metrics are: the Mean Squared Error (MSE), the Mean Absolute Error (MAE), the Pearson Correlation Coefficient (CC
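The list above is truncated in this preview; alongside the MSE, MAE, and Pearson's CC, the Concordance Correlation Coefficient (CCC) – which additionally penalises disagreement in mean and scale, and whose relation to the MSE is the subject of the Pandit & Schuller reference below – is a standard choice for such value-continuous affect tasks. A minimal NumPy sketch of these metrics:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error
    return np.mean(np.abs(y_true - y_pred))

def pearson_cc(y_true, y_pred):
    # Pearson Correlation Coefficient
    return np.corrcoef(y_true, y_pred)[0, 1]

def ccc(y_true, y_pred):
    # Concordance Correlation Coefficient (Lin, 1989): scales the covariance
    # by the agreement between the two means and variances, so a prediction
    # that correlates well but is offset or rescaled is still penalised.
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```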

Overview in terms of the subjects, and the data-splits

The SEWA dataset features spontaneous dyadic internet-based conversations between the participants, discussing an advertisement they watched. The recordings were intentionally not standardised, and are truly in-the-wild. The participants were allowed to converse from wherever they wished (e. g. home, office, cafeteria), with no set requirement on the devices (notebooks, microphones, cameras) and internet connection in use. Many of the audiovisuals suffer from bad lighting conditions, noise,

Affect prediction: Experiment and model design pipeline

We train various minimalist CNN topologies using the Keras (v2.2.2) library with the Tensorflow (v1.11.0) backend. Training and evaluations are run on a regular notebook with an Nvidia GeForce GTX 1050 Ti GPU-card. The FAU features obtained from 34 video chat sequences of German participants (cf. Figure 3) and the RMSprop optimiser (learning rate = 0.001) are used for training, which runs for 2 000 epochs. We choose the model weights based on their performance on the development set. Because an
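The exact topology is developed over the following subsections; as a rough, minimal sketch of the kind of model this pipeline trains – a single-filter, linear-activation Conv1D mapping a temporal window of FAU activations to one affect dimension – consider the following (written against the modern tf.keras API rather than Keras v2.2.2; the input dimensionality, kernel width, padding scheme, and MSE loss are illustrative assumptions, not the paper's exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FAUS = 17      # assumption: number of FAU-intensity channels per video frame
KERNEL_WIDTH = 8   # assumption: temporal receptive field of the filter, in frames

# One Conv1D filter with linear activation: every prediction is simply a
# weighted sum of the FAU activations inside a temporal window, plus a bias.
model = keras.Sequential([
    layers.Input(shape=(None, NUM_FAUS)),   # variable-length FAU sequences
    layers.Conv1D(filters=1, kernel_size=KERNEL_WIDTH,
                  padding='valid', activation='linear', name='conv'),
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
              loss='mse')   # assumption: a CCC-based loss may be used instead
```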

Feature attribution calculation

Consider model D consisting of only one intermediate Conv1D layer between the input and the output layer. Because the output is just a single channel (we predict only one emotion dimension at a time), the number of filters associated with the last and the only intermediate Conv1D layer is one. Each output is the element-wise product of trained filter weights and a section of the input matrix, added to the trained bias value. Thus, the Conv1D layer represents the degree of similarity of the
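The snippet breaks off here, but the described computation lends itself to a compact illustration: with a single linear filter, each output is the dot product of the filter weights with a temporal window of FAU activations, plus the bias, so per-feature attributions can be read off directly as the element-wise products of the trained weights with the corresponding input slice. A minimal sketch, assuming the single-Conv1D model sketched above (the layer name 'conv' is our placeholder):

```python
import numpy as np

def conv1d_attributions(model, x):
    """Per-feature attribution scores for a single-filter, linear Conv1D model.

    x : np.ndarray of shape (timesteps, num_faus), the FAU activations of
        one recording.
    Returns an array of shape (num_windows, kernel_width, num_faus); summing
    it over the last two axes and adding the bias reproduces each output.
    """
    weights, bias = model.get_layer('conv').get_weights()
    w = weights[..., 0]   # (kernel_width, num_faus): the single trained filter
    k = w.shape[0]
    # Stack every length-k temporal window of the input ('valid' padding).
    windows = np.stack([x[t:t + k] for t in range(len(x) - k + 1)])
    return windows * w    # element-wise contribution of each weight-input pair
```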

Pain intensity prediction

Inspired by the predictive performance of the minimalist CNN we proposed to model the emotional state of human subjects cross-culturally, using only a handful of FAU activations recorded in noisy conditions, we now turn our attention to the pain intensity prediction problem – equally relevant for many healthcare applications, including the automated remote patient monitoring system. An interpretable and explainable AI continues to be the primary focus of the experiments. One of the popularly
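The sentence above is cut short in this preview. For context: on the UNBC-McMaster corpus, the framewise pain label popularly used is the Prkachin and Solomon Pain Intensity (PSPI) score, itself a fixed combination of FAU intensities – which makes FAU-based models a natural fit for this task. A minimal sketch of that standard formula (the argument names are ours):

```python
def pspi(au4, au6, au7, au9, au10, au43):
    """Prkachin & Solomon Pain Intensity, computed framewise from FAU
    intensities: brow lowering (AU4), orbit tightening (AU6/AU7),
    levator contraction (AU9/AU10), and eye closure (AU43, binary)."""
    return au4 + max(au6, au7) + max(au9, au10) + au43
```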

Conclusions

Towards explainable, robust, value- and time-continuous, real-time affect prediction on social media-based, web-assisted ‘in-the-wild’ video chat sessions, we presented a shallow CNN-based model consisting of a single one-dimensional convolutional layer. Through a statistical analysis of the input features, we investigated how and why the model assigns the filter weights the way it does. Because we used linear activation, we computed the feature attribution scores of the individual features

CRediT authorship contribution statement

Vedhas Pandit: Conceptualization, Methodology, Software, Validation, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization. Maximilian Schmitt: Methodology, Software, Data curation, Writing - review & editing. Nicholas Cummins: Writing - review & editing, Resources. Björn Schuller: Writing - review & editing, Resources, Project administration, Funding acquisition, Supervision.

References (55)

  • Baltrusaitis, T., Zadeh, A., Lim, Y. C., & Morency, L. P. (2018). OpenFace 2.0: Facial behavior analysis toolkit. 13th...
  • C. Busso et al.

    IEMOCAP: Interactive emotional dyadic motion capture database

    Language Resources and Evaluation

    (2008)
  • Cavé, C., Guaïtella, I., Bertrand, R., Santi, S., Harlay, F., & Espesser, R. (1996). About the relationship between...
  • Chen, H., Deng, Y., Cheng, S., Wang, Y., Jiang, D., & Sahli, H. (2019). Efficient spatial temporal convolutional...
  • F. Eyben et al.

    The acoustics of eye contact – Detecting visual attention from conversational audio cues

    Proc. 6th Workshop on Eye Gaze in Intelligent Human Machine Interaction: Gaze in Multimodal Interaction, GAZEIN, 15th ICMI

    (2013)
  • E. Gibaja et al.

    A tutorial on multilabel learning

    ACM Computing Surveys (CSUR)

    (2015)
  • A. Goldberger et al.

PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals

    Circulation

    (2000)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Computation

    (1997)
  • Kaya, H., Fedotov, D., Dresvyanskiy, D., Doyran, M., Mamontov, D., Markitantov, M., Salah, A. A. A., Kavcar, E.,...
  • H. Korda et al.

    Harnessing social media for health promotion and behavior change

    Health Promotion Practice

    (2013)
  • J. Kossaifi et al.

    SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2019)
  • Lucey, P., Cohn, J., Prkachin, K., Solomon, P., & Matthews, I. (2011). Painful data: The UNBC-McMaster shoulder pain...
  • V. Pandit et al.

    Big data multimedia mining: Feature extraction facing volume, velocity, and variety

    Big Data Analytics for Large-Scale Multimedia Search

    (2019)
  • Pandit, V., Cummins, N., Schmitt, M., Hantke, S., Graf, F., Paletta, L., & Schuller, B. (2018a). Tracking authentic and...
  • Pandit, V., Schmitt, M., Cummins, N., Graf, F., Paletta, L., & Schuller, B. (2018b). How good is your model ‘really’?...
  • V. Pandit et al.

    I know how you feel now, and here’s why!: Demystifying time-continuous high resolution text-based affect predictions in the wild

    Proc. 32nd international symposium on computer-based medical systems, CBMS

    (2019)
  • Pandit, V., & Schuller, B. The many-to-many mapping between the concordance correlation coefficient and the mean square...