I see it in your eyes: Training the shallowest-possible CNN to recognise emotions and pain from muted web-assisted in-the-wild video-chats in real-time
Introduction
The simultaneous emergence of social media environments as a new source of data and the unprecedented advancement of artificial intelligence (AI) have opened doors to many next-generation applications in healthcare. We first discuss precisely this ever-increasing synergy between the three domains (cf. Figure 1) and the related challenges, followed by our contribution to emotion-aware and pain-aware AI for remote patient monitoring and counselling.
Because social media provides an unparalleled insight into the lives and emotions of real people, it can be leveraged to recognise patterns relating to their health needs. For example, a patient’s feedback on their treatment, in the context of their ethnicity, age, gender, and smoking and drinking habits, gives health professionals a better insight into patient education and demographic-tailored remedies (Antheunis, Tates, & Nieboer, 2013). Social media data helps create strategies for improving customer engagement with different healthcare services, e.g. targeted stimuli to motivate a fitness regime (Korda & Itani, 2013). Social media enables everyone to exchange their healthcare experiences and to learn from others (Spink et al., 2004). Information sharing also helps healthcare professionals to identify the gaps towards a better health outcome, e.g. reduction of waiting times, and improved customer satisfaction and doctor-patient relationships (Smailhodzic, Hooijsma, Boonstra, & Langley, 2016). In a pandemic like COVID-19, or even otherwise, web-assisted video platforms can be used to address customer grievances, and for remote patient consultations, counselling and health monitoring (Armfield, Gray, & Smith, 2012) – the related research being the prime focus of this paper. While not the main focus of this paper, for the sake of completeness, the novel challenges posed by AI to the ‘AI-Healthcare-Social data’ synergy are identified next, as they bear on the widespread adoption of any application, including the research presented herein.
The rapid AI advancements have made high-volume, high-velocity social media analytics a feasible reality (Amiriparian, Schmitt, Hantke, Pandit, & Schuller, 2019b; Pandit, Amiriparian, Schmitt, & Schuller, 2019a). Such is the pace of this progress that the finesse with which AI performs complicated tasks remains hard to believe, yet is, ironically, common knowledge – be it coherent synthetic text generation, voice mimicking, or high-resolution ‘deep-fake’ video production. For example, it might be hard to believe that a few applications of social data in healthcare in this very manuscript were suggested by the GPT-2 English language model (Radford et al., 2019), with a frighteningly impressive human-like coherence. Because human-like content generation is now this easy, the same AI advancement manifests itself simultaneously as a novel challenge and as an opportunity.
Likewise, issues related to data privacy, a central consideration in healthcare, have become complex. Highly personal data (e.g. health issues, vital signs) is what drives healthcare. AI helps address privacy-related issues through synthetic yet realistic data augmentation, coupled with more advanced anonymisation techniques. Simultaneously and ironically, research in AI-based deanonymisation has made anonymisation of data a lot more challenging (Malin & Sweeney, 2004).
The earliest and the main criticism of deep learning technology has long been its inexplicability. A white-box AI is a necessity in healthcare, where an incorrect treatment can be fatal and each diagnosis should be based on a reliable understanding of the data available. In recent years, there have been monumental developments, with several approaches proposed for model explainability (Alber et al., 2019). In this paper as well, in addition to reporting performance metrics, we compute feature attributions towards a better understanding of the proposed models.
Data mining has helped improve the diagnosis and treatment of various diseases (Rodríguez-González et al., 2012), e.g. diagnosis of cancer (Ruiz et al., 2018), sleep apnea (Janott et al., 2018; Qian et al., 2017), diabetes (Pratt, Coenen, Broadbent, Harding, & Zheng, 2016), cardiovascular diseases (Goldberger et al., 2000), and assessment of psychological stress (Thelwall, 2017; Yoo, Lee, & Ha, 2019). It can also be used as a preventive and diagnostic method to identify ‘red flag’ situations in real-time, e.g. to help identify people prone to suicide, or those with mental conditions such as bipolar disorder (Amiriparian et al., 2019a), autism (Roche et al., 2018), depression (Schuller, 2016), undergoing pain (Lucey, Cohn, Prkachin, Solomon, & Matthews; Walter et al., 2013) or stress (Zhou, Hansen, & Kaiser, 2001). As discussed previously, an assistive technology recording dyadic conversations and estimating the psychological and physiological state of the recorded individuals can be envisioned.
However, for its widespread use in the health sector, it needs to meet very high standards of requirements, aside from the explainability of its predictions. First, it needs to be robust for use in non-laboratory, unconstrained, noisy, i.e. ‘in-the-wild’ conditions, with data featuring spontaneous behaviours. It should ideally capture nuances of the emotions/affects of people of different backgrounds, rather than a crude classification into three (positive, negative, and neutral), six, or a similarly small number of basic emotion classes. It should ideally track one’s emotions continuously in time, in real-time. In summary, an explainable, value-continuous, time-continuous, multidimensional, subject-independent, light-weight, real-time physiological and affect prediction model trained on in-the-wild data is desired; this is the scope of this paper.
In this paper, we unearth the learnings of the shallowest-possible CNN one can realise, as it learns to predict human emotions from in-the-wild videos. The training and predictions are of value- and time-continuous affect dimensions of individuals coming from very different cultures, contexts, genders and age groups featured in the SEWA database, without making use of any audio or textual data. Inspired by these findings, we also explore the use of this network topology for remote video-based pain monitoring on the UNBC-McMaster Shoulder Pain Expression database and the BioVid database.
In Section 2, we discuss the previous research directly relevant to our current study. We discuss in detail the performance metric we chose for evaluation in Section 3, and the datasets we used in Section 4, complete with a statistical analysis of the features and the labels – which is crucial before beginning with the experiments. We next detail the experimental design pipeline, the different types of models we tried, and how the powerful, yet shallowest-realisable CNN evolved through these experiments in Section 5. We present the insights gained by analysing the trained weights mapping the features to the output labels in Section 6. We re-evaluate our proposed approach for the time-continuous pain prediction problem in Section 7. We summarise our findings, briefly mentioning the limitations of this study, in Section 8. We also list various avenues for future work as the logical next step – including the research paths we have already begun venturing into.
Section snippets
Related research
The target research problem here is the explainable, robust, value- and time-continuous recognition of affect dimensions (e.g. arousal and valence) on in-the-wild audiovisual recordings, featuring spontaneous behaviours in the conversational context. The publicly available databases (e.g. SMARTKOM (Schiel, Steininger, & Türk, 2002), IEMOCAP (Busso et al., 2008), RECOLA (Ringeval, Sonderegger, Sauer, & Lalanne, 2013), MAHNOB Mimicry (Bilakhia, Petridis, Nijholt, & Pantic, 2015), 4D CCDb (
Choice of evaluation metric
The primary focus of this paper is the explainable, time-continuous prediction of the affective and physiological state of a subject under observation, e.g. a patient. Choosing a metric for the evaluation of the predictive capability of the system is a crucial step. For a time-series prediction that is value-continuous (i.e. a regression problem), arguably the most popular performance metrics are: the Mean Squared Error (MSE), the Mean Absolute Error (MAE), the Pearson Correlation Coefficient (CC
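The regression metrics named above can be sketched in a few lines of numpy. The snippet below also includes the Concordance Correlation Coefficient (CCC), a common companion metric for time-continuous affect prediction; its inclusion here is an assumption for illustration, since the excerpt is cut off at “CC”:

```python
import numpy as np

def mse(y, p):
    """Mean Squared Error."""
    return np.mean((y - p) ** 2)

def mae(y, p):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - p))

def pearson_cc(y, p):
    """Pearson Correlation Coefficient: invariant to scale and offset."""
    return np.cov(y, p, bias=True)[0, 1] / (np.std(y) * np.std(p))

def ccc(y, p):
    """Concordance Correlation Coefficient: unlike Pearson's CC, it also
    penalises differences in mean and variance between labels and predictions."""
    cov = np.cov(y, p, bias=True)[0, 1]
    return 2 * cov / (np.var(y) + np.var(p) + (np.mean(y) - np.mean(p)) ** 2)
```

Note the design difference: a prediction that is a linear transform of the gold standard scores a perfect Pearson CC, but its CCC drops below 1 because of the mean and variance mismatch.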
Overview in terms of the subjects, and the data-splits
The SEWA dataset features spontaneous dyadic internet-based conversations between the participants, discussing an advertisement they watched. The recordings were not standardised intentionally, and are truly in-the-wild. The participants were allowed to converse from wherever they wished (e.g. home, office, cafeteria), with no set requirement on the devices (notebooks, microphones, cameras) and internet connection in use. Many of the audiovisuals suffer from bad lighting conditions, noise,
Affect prediction: Experiment and model design pipeline
We train various minimalist CNN topologies using the Keras (v2.2.2) library with the Tensorflow (v1.11.0) backend. Training and evaluations are run on a regular notebook with a Nvidia GeForce GTX 1050 Ti GPU card. The FAU features obtained from 34 video chat sequences of German participants (cf. Figure 3) and the RMSprop optimiser (learning rate = 0.001) are used for training, which runs for 2 000 epochs. We choose the model weights based on their performance on the development set. Because an
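A minimal sketch of such a single-Conv1D topology in Keras follows. The feature count (17 FAU channels), kernel size, and causal padding are hypothetical choices for illustration, not the paper's exact configuration; the optimiser and learning rate match the text:

```python
import tensorflow as tf

def build_shallow_cnn(num_features=17, kernel_size=8):
    """One Conv1D layer with a single filter and linear activation:
    the output is a running weighted sum of the input features plus a bias."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, num_features)),  # variable-length sequences
        tf.keras.layers.Conv1D(filters=1, kernel_size=kernel_size,
                               padding="causal", activation="linear"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
        loss="mse",
    )
    return model
```

With a single filter and linear activation, the trained kernel weights can be read off directly, which is what makes the feature-attribution analysis later in the paper possible.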
Feature attribution calculation
Consider model D consisting of only one intermediate Conv1D layer between the input and the output layer. Because the output is just a single channel (we predict only one emotion dimension at a time), the number of filters associated with the last and the only intermediate Conv1D layer is one. Each output is the element-wise product of trained filter weights and a section of the input matrix, added to the trained bias value. Thus, the Conv1D layer represents the degree of similarity of the
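The element-wise computation described above can be sketched without any deep-learning framework; the sizes below are hypothetical, but the arithmetic is exactly the single-filter, linear-activation case:

```python
import numpy as np

def conv1d_single_filter(x, w, b):
    """x: (T, F) input feature matrix; w: (k, F) filter weights; b: scalar bias.
    Each output value is the element-wise product of the filter with a
    k-frame window of the input, summed, plus the bias."""
    k = w.shape[0]
    n_out = x.shape[0] - k + 1
    return np.array([np.sum(x[t:t + k] * w) + b for t in range(n_out)])

def attribution(x_window, w):
    """Per-cell contribution of each input value to one output value.
    With linear activation, this is simply weight times input."""
    return x_window * w  # shape (k, F)
```

Because the activation is linear, the attribution cells sum (together with the bias) exactly to the corresponding output value, so each input feature's share of the prediction can be read off directly.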
Pain intensity prediction
Inspired by the predictive performance of the minimalist CNN we proposed to model the emotional state of human subjects cross-culturally, which uses only a handful of FAU activations recorded in noisy conditions, we now turn our attention to the pain intensity prediction problem – equally relevant for many healthcare applications, including the automated remote patient monitoring system. An interpretable and explainable AI continues to be the primary focus of the experiments. One of the popularly
Conclusions
Towards explainable, robust, value- and time-continuous, real-time affect prediction on social media-based, web-assisted ‘in-the-wild’ video chat sessions, we presented a shallow CNN-based model consisting of a single one-dimensional convolutional layer. Through statistical analysis of the input features, we investigated how and why the model assigns the filter weights the way it does. Because we used linear activation, we computed the feature attribution scores of the individual features
CRediT authorship contribution statement
Vedhas Pandit: Conceptualization, Methodology, Software, Validation, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing, Visualization. Maximilian Schmitt: Methodology, Software, Data curation, Writing - review & editing. Nicholas Cummins: Writing - review & editing, Resources. Björn Schuller: Writing - review & editing, Resources, Project administration, Funding acquisition, Supervision.
References (55)
- et al. Patients’ and health professionals’ use of social media in health care: Motives, barriers and expectations. Patient Education and Counseling (2013)
- et al. The MAHNOB mimicry database: A database of naturalistic human interactions. Pattern Recognition Letters (2015)
- et al. Snoring classified: The Munich Passau snore sound corpus. Computers in Biology and Medicine (2018)
- et al. How (not) to protect genomic data privacy in a distributed network: Using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics (2004)
- et al. Convolutional neural networks for diabetic retinopathy. Procedia Computer Science (2016)
- et al. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
- Amiriparian, S., Awad, A., Gerczuk, M., Stappen, L., Baird, A., Ottl, S., & Schuller, B. (2019a). Audio-based...
- et al. iNNvestigate neural networks. Journal of Machine Learning Research (2019)
- et al. Humans inside: Cooperative big multimedia data mining (2019)
- et al. Clinical use of skype: A review of the evidence base. Journal of Telemedicine and Telecare (2012)
- IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation
- The acoustics of eye contact – Detecting visual attention from conversational audio cues. Proc. 6th Workshop on Eye Gaze in Intelligent Human Machine Interaction: Gaze in Multimodal Interaction, GAZEIN, 15th ICMI
- A tutorial on multilabel learning. ACM Computing Surveys (CSUR)
- Physiobank, physiotoolkit, and physionet: Components of a new research resource for complex physiologic signals. Circulation
- Long short-term memory. Neural Computation
- Harnessing social media for health promotion and behavior change. Health Promotion Practice
- SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Big data multimedia mining: Feature extraction facing volume, velocity, and variety. Big Data Analytics for Large-Scale Multimedia Search
- I know how you feel now, and here’s why!: Demystifying time-continuous high resolution text-based affect predictions in the wild. Proc. 32nd International Symposium on Computer-Based Medical Systems, CBMS
Cited by (12)
- Intangible cultural heritage image classification with multimodal attention and hierarchical fusion. Expert Systems with Applications (2023)
- Which part of a picture is worth a thousand words: A joint framework for finding and visualizing critical linear features from images. Information Processing and Management (2023)
- Automatic assessment of pain based on deep learning methods: A systematic review. Computer Methods and Programs in Biomedicine (2023)
- Introduction to the special issue on Methods and applications in the analysis of social data in healthcare. Information Processing and Management (2021)