
1 Introduction

1.1 Motivations

Locked-In Syndrome (LIS) is a pathological condition in which patients (for instance, people in advanced stages of Amyotrophic Lateral Sclerosis, ALS) lose all capability to move any part of their body except the eyes [1]. When this last capability is also lost, a state of Complete Locked-In Syndrome (CLIS) is reached. LIS and CLIS patients’ abilities are severely limited, including in terms of communication. Therefore, innovative communication systems for such patients are much needed to increase their communication capabilities and their engagement with the people around them. Considering this context, the main goal of this research is to build novel modalities of technologically mediated communication specifically designed to improve LIS patients’ quality of life. In this paper we focus on the skillset of LIS patients and exploit their ability to control their gaze as the input for computer-assisted communication systems [2]. Overall, the requirements for such solutions include the creation of novel user interfaces adapted to LIS patients’ capabilities, providing them with a more extensive and complete communication system than the current state of the art. The solutions should improve their ability to express emotions as well as words.

1.2 State of the Art

The first communication systems for LIS patients consisted of codes using eye blinking to signify yes and no, or of more complex schemes such as Morse code [3]. Other types of communication exist, such as a transparent letter board held by the interlocutor [3] (Fig. 1). In this case, the patient indicates a letter by gazing at it. The interlocutor must then write down or remember the letter sequence to form words.

Fig. 1. E-TRAN letter board (Image Courtesy of Low Tech Solutions).

The letter board is still widely used nowadays. Nonetheless, more advanced systems do exist. Notably, the ability of LIS patients to control their gaze has been used to send commands to computer systems through eye-tracking cameras [2]. This technique enables them to select letters on keyboards displayed on computer screens and to “read” the written sentence out loud using voice synthesis [2]. Such systems mostly focus on composing words letter by letter. However, when we communicate, we do not only use words but also a great range of additional non-verbal communication cues such as voice intonation, facial expression or body gesture [4]. Such additional input helps the interlocutor properly understand the context of the message itself. A simple sentence such as “let’s go now” can be read with excitement or anger and deliver a completely different message. This need for enriching words with emotional features has led to the creation of additional textual communication cues in Computer-Mediated Communication (CMC), such as emoticons [5]. These solutions are now widely used in text communications such as SMS and on social media. For this reason, it is essential for LIS patients to also be able to communicate their affective state to their interlocutors in the most natural way possible. Focusing on the most common emotional cues in communication, voice and facial expression, a great deal of work has been devoted to recreating these cues for CMC. For instance, emotional speech synthesis has been widely studied in the past [6,7,8]. Additionally, facial expression has often been associated with avatars and 3D characters as a way to express emotions online [9,10,11]. The combined use of these two technologies has also been studied for CMC [12].

However, to our knowledge, these advances in technology related to emotion expression have not been adapted for LIS patients yet. Augmentative and Alternative Communication (AAC) systems for persons with disabilities rarely provide tools for emotion expression [13]. Focusing on children with disabilities, [14] reviews past studies on AAC and exposes the great need for emotional communication in such tools. Additionally, the effect of such affective capabilities on the communication abilities of patients with LIS has not been studied so far. To fill this gap in the literature, we propose a novel open-source system controlled with eye gaze, including emotional voice synthesis and an emotional personalized avatar to enhance affective communication.

2 The Proposed Solution

In order to allow LIS patients to communicate their emotions in addition to words, we propose a system including a gaze-based keyboard, an emotional voice synthesizer and a personalized emotional avatar. We focused on the three most common basic emotions: Happy, Sad and Angry. An additional option allows the patients to generate a laughing sound. The laugh consists of a recorded sound and is not created through the voice synthesizer. The system was created using the Unity3D platform (Footnote 1), which is most notably known as a game development engine. However, this versatile platform is also very useful for the development of assistive user interfaces. The work presented in this paper adopts the GUI developed through Unity3D for the TEEP-SLA project (Footnote 2).

Fig. 2. General aspect of the keyboard display (AG: Affective Group; CG: Control Group).

2.1 Gaze-Based Keyboard

The general aspect of the keyboard can be seen in Fig. 2. It uses a standard dwell-time system for key selection [15]. A menu button enables the customization of this dwell time. Autocompletion words are proposed using the Lib-face library [16] and displayed in the center of the keyboard to reduce gaze movements, which have been proven to induce fatigue [17]. Additionally, according to preliminary data, users are more likely to notice the proposed words in this central position than above the keys, since their gaze frequently passes over this area.
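As an illustration of the selection mechanism, the following minimal Python sketch mimics the dwell-time logic; the class and parameter names are assumptions chosen for clarity, as the actual keyboard is implemented as a Unity3D GUI.

```python
# Minimal sketch of dwell-time key selection, for illustration only: the actual
# keyboard is a Unity3D GUI, and the names below are assumptions, not part of
# the released code.
import time
from typing import Optional


class DwellSelector:
    """Fires a key selection once gaze has rested on the same key long enough."""

    def __init__(self, dwell_time_s: float = 1.0):  # 1 s default, as in the study
        self.dwell_time_s = dwell_time_s            # adjustable via the menu
        self._current_key: Optional[str] = None
        self._gaze_start: Optional[float] = None

    def update(self, gazed_key: Optional[str], now: Optional[float] = None) -> Optional[str]:
        """Call on every eye-tracker sample; returns a key when a selection fires."""
        now = time.monotonic() if now is None else now
        if gazed_key != self._current_key:
            # Gaze moved to another key (or off the keyboard): restart the timer.
            self._current_key = gazed_key
            self._gaze_start = now if gazed_key is not None else None
            return None
        if gazed_key is not None and now - self._gaze_start >= self.dwell_time_s:
            self._gaze_start = now  # re-arm so the same key can be selected again
            return gazed_key
        return None
```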

2.2 Emotional Voice Synthesis

The open-source voice modulation platform Emofilt [8] was used to modulate the voice according to emotions. To tune the emotional voice, we hypothesized that a strong differentiation between the emotional voices was essential to ensure long-term recognition of the emotions by the interlocutor. The selected Emofilt settings for the happy (H), sad (S) and angry (A) voices can be found in Fig. 3.

Fig. 3. Emofilt settings.

The settings marked with an asterisk are additions to the original system. The pitch was capped to a maximum and a minimum to avoid unnatural voices. The user is able to select the desired emotion using three emoticon buttons positioned above the keyboard (Fig. 2). If no emotion is selected, the voice is considered neutral.
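A minimal sketch of the capping rule is given below, assuming illustrative bounds; the actual limits are part of the modified Emofilt settings and may differ.

```python
# Illustrative sketch of the pitch-capping addition: the F0 contour produced by
# the emotion modulation is clamped to a natural range. The bounds below are
# assumptions, not the values used in the modified Emofilt configuration.
PITCH_MIN_HZ = 80.0   # assumed lower bound
PITCH_MAX_HZ = 350.0  # assumed upper bound


def cap_pitch(contour_hz):
    """Clamp an F0 contour so the modulation never leaves a natural pitch range."""
    return [min(max(f0, PITCH_MIN_HZ), PITCH_MAX_HZ) for f0 in contour_hz]


# Example: an "angry" modulation that overshoots is brought back into range.
print(cap_pitch([70.0, 120.0, 410.0]))  # -> [80.0, 120.0, 350.0]
```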

2.3 Emotional Avatar

Because LIS patients are not able to communicate their emotions through facial expression, we decided to simulate this ability using a 3D avatar. To do so, the AvatarSDK Unity asset [18] was used. It allows the creation of a 3D avatar from a single picture of the user or of someone else. 3D animations such as blinking and yawning are provided. We created additional 3D animations for the three previously cited emotions. An example of such avatar expressions can be found in Fig. 4. The avatar facial expressions are triggered using the same emoticon buttons used for the emotional voice. The selected facial expression is displayed until the emotion is deactivated by the user.
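The coupling between the emoticon buttons, the avatar and the voice can be sketched as a single shared emotion state (Python sketch for illustration only; the animation trigger names are hypothetical and do not correspond to the AvatarSDK or Emofilt APIs).

```python
# Hedged sketch of how one emotion state can drive both the avatar animation
# and the voice modulation. Trigger names are hypothetical placeholders.
from typing import Optional

ANIMATIONS = {"happy": "Anim_Happy", "sad": "Anim_Sad", "angry": "Anim_Angry"}


class EmotionState:
    """Shared state set by the emoticon buttons above the keyboard."""

    def __init__(self) -> None:
        self.current: Optional[str] = None  # None means neutral voice, idle face

    def toggle(self, emotion: str) -> None:
        # Pressing the active emoticon again deactivates it (back to neutral);
        # otherwise the expression stays displayed until deactivated.
        self.current = None if self.current == emotion else emotion

    def avatar_animation(self) -> str:
        return ANIMATIONS.get(self.current, "Anim_Idle")

    def voice_style(self) -> str:
        return self.current or "neutral"
```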

Fig. 4. Example of the emotional avatar generated from a picture.

3 Methodology

Fig. 5. Experimental flow (AG: Affective Group; CG: Control Group; SP: Subject Patient).

In order to test the capability of this system to enhance patients’ communicative abilities, we performed a between-subject study with 36 subjects (26 males, 10 females) separated into two gender-balanced groups of 18 (each with 5 females and 13 males, avg. age 29 years): a control group (CG) and an affective group (AG). The experimental flow can be found in Fig. 5.

For the control group, unlike the affective group, the affective features (emoticon buttons, emotional voice, emotional avatar, laugh button) were hidden and therefore inaccessible. During each session, a subject was assigned to represent either “the patient” (SP) or “the healthy interlocutor” (SI). After signing the informed consent, we first tested the validity of the emotional voice: 5 sentences were randomly picked among 10 sentences (Table 1) and each was played in the 3 different emotions.

Table 1. Emotional sample sentences.
Fig. 6. Conversation scenarios.

Therefore, in total, 15 sentences were played in random order to the subjects, who had to decide whether each was spoken with a Happy, Sad or Angry voice (Trial 1). Both SP and SI rated the emotional voices separately, in written form, without consulting each other. SP was then seated in front of a commercial eye-tracking monitor system (Tobii 4C [19]) with SI next to them. The eye-tracker was calibrated using the dedicated Tobii software. For the affective group, a picture of SP was taken using the computer’s camera and the 3D avatar was built from this picture. A second screen displayed the emotional avatar, positioned so that both subjects could see it. For both groups, the dwell time was initially set to 1 s, but SP was able to adjust it at any time through the menu. The subjects were then given a talk scenario designed to simulate an emotional conversation (Fig. 6).
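The construction of the Trial 1 stimuli can be sketched as follows (illustrative Python, not the study code; function and variable names are assumptions).

```python
# Sketch of the Trial 1 stimulus construction: 5 of the 10 sample sentences are
# drawn at random, each is rendered in the 3 emotions (15 stimuli) and the list
# is shuffled before playback. Illustrative only.
import random

EMOTIONS = ("happy", "sad", "angry")


def build_trial(sentences, n_sentences=5, seed=None):
    rng = random.Random(seed)
    picked = rng.sample(sentences, n_sentences)            # 5 sentences for Trial 1
    stimuli = [(s, e) for s in picked for e in EMOTIONS]   # 5 x 3 = 15 stimuli
    rng.shuffle(stimuli)                                   # random presentation order
    remaining = [s for s in sentences if s not in picked]  # kept for Trial 2
    return stimuli, remaining
```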

The subjects were asked to have a conversation with each other. They were free to say whatever they desired while respecting the scenario. AG-SP was instructed to use the emotional buttons as much as possible. Once the conversation was finished, both subjects were asked to answer a questionnaire on a 7-point Likert-type scale (Fig. 7). The first two questions were designed to compare the efficiency of the systems. The other questions were added to assess the user experience and are not used to compare the systems; these questions differed between subjects and conditions to fit the context each subject was experiencing. The first part of the study was then repeated with the remaining 5 sentences (Table 1) (Trial 2).

Fig. 7. Questionnaire.

4 Results

4.1 Speech Synthesis Emotion Recognition

The control group (both interlocutors and patients) was able to recognize 81% of the emotions from the emotional voice synthesis in the first trial and 87% in the second trial. The affective group reached an 80% recognition rate in the first trial and 92% in the second trial (Fig. 8).
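These rates are simple proportions of correct answers; a minimal sketch of the computation, assuming the written answers are stored as presented/reported pairs, is given below.

```python
# Minimal sketch of the recognition-rate computation behind Fig. 8. Each answer
# is a (presented_emotion, reported_emotion) pair collected in written form.
# Illustrative only; the data layout is an assumption.
def recognition_rate(answers):
    """Fraction of answers whose reported emotion matches the presented one."""
    correct = sum(1 for presented, reported in answers if presented == reported)
    return correct / len(answers)


# Example: 13 correct answers out of 16 gives a rate of about 0.81.
print(recognition_rate([("happy", "happy")] * 13 + [("sad", "angry")] * 3))
```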

Fig. 8. Recognition rates of emotional sentences for each trial.

4.2 Questionnaire

In the questionnaire, only the first two questions aimed at comparing the two systems and evaluating emotional communication efficiency. The results of these questions can be found in Fig. 9. The questionnaire results (ordinal scale measures, analyzed through the Mann-Whitney U test) show the advantages of using the system proposed in this paper. The subjects considered the conversation significantly (p < 0.05) more similar to a normal dialog (in terms of communication abilities, not in terms of scenario) when they were using the expressive solution (Q1-AG-SP: 4.5 and Q1-AG-SI: 4.75 on average) compared to the control group (Q1-CG-SP: 3.375 and Q1-CG-SI: 3.25). This holds for both patient-like subjects (U = 16 with p = 0.027) and their interlocutors (U = 12.5 with p = 0.012).

The questionnaire data were analyzed through appropriate non-parametric tests because the dependent variables are ordinal scale measures. The assumptions of the tests were checked.
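For reference, a between-group comparison of this kind can be reproduced with a short script using scipy.stats.mannwhitneyu; the score lists below are placeholders, not the collected data.

```python
# Sketch of the between-group comparison on the ordinal questionnaire scores
# using the Mann-Whitney U test. The Likert score lists are hypothetical
# placeholders, not the actual per-subject responses.
from scipy.stats import mannwhitneyu

q1_ag_sp = [5, 4, 5, 4, 5, 4, 5, 4, 5]  # hypothetical scores, AG "patients"
q1_cg_sp = [3, 4, 3, 3, 4, 3, 4, 3, 4]  # hypothetical scores, CG "patients"

u_stat, p_value = mannwhitneyu(q1_ag_sp, q1_cg_sp, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```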

The “patients” from the affective group found that they were more able to express their emotions (Q2-AG-SP: 5.875) compared to the control group (Q2-CG-SP: 3.25). A significant difference was found between the 2 conditions (AG and CG) (Mann-Whitney U(16) = 1.5 with p < 0.001).

The “healthy subjects” from the affective group found that they were more able to identify emotions from their interlocutors (Q2-AG-SI: 5.5) compared to the control group (Q2-CG-SI: 3.75). A significant difference was found between the 2 conditions (Mann-Whitney U(16) = 10.5 with p = 0.008).

Fig. 9. Questionnaire results for the significant comparison between the systems.

Additional questions were asked to collect further insights on both systems. The results can be found in Fig. 10.

The “patients” in the affective condition found that the ability to convey their emotions helped with the communication (Q3-AG-SP: 5.875), and the ones from the control group thought that it would have helped (Q3-CG-SP: 5.2). The “healthy subjects” in the affective condition found that the ability to identify their interlocutor’s emotion helped with the communication (Q3-AG-SI: 6.1), and the ones from the control group thought that it would have helped (Q3-CG-SI: 5.5).

5 Discussion

Firstly, we can see that the overall recognition of the emotional voice in the first trial was sufficient for it to be used meaningfully in this experiment. Additionally, this recognition quickly increases with time, since the recognition rate in Trial 2 is much higher than in Trial 1. This increase is larger for the affective group, which had additional time to familiarize itself with the voice modulation during the scenario part of the experiment, reaching a score of 92%. The ability to successfully express emotions (Q4-AG-SP) and identify emotions (Q4-AG-SI) through the voice synthesizer was confirmed by the questionnaire. Furthermore, the affective group found this emotional voice helpful for the communication (Q7-AG) and the control group thought it would be a useful feature (Q7-CG). This confirms our hypothesis that strongly distinctive emotional voices are easily recognizable in the long term and improve communicative abilities.

Fig. 10. Questionnaire results for additional insights.

The emotional avatar was found to successfully represent the desired emotion (Q5-AG-SP), to provide easily identifiable emotions (Q5-AG-SI) and to help with the communication in the affective group (Q6-AG). It is interesting to note that the affective group found the avatar more helpful for the communication than the voice (Q6-AG and Q7-AG).

Overall, the communication was found to be more natural by the affective group than by the control group (Q1). SP subjects found that they were more able to express their emotions (Q2-SP). This highlights the positive impact of both the emotional avatar and the emotional voice on the communication, which is confirmed by Q3-AG-SP and Q3-AG-SI. Concurrently, the control group, which did not have access to any emotional tools, also found that the ability to express emotions (Q3-CG-SP) and to identify emotions (Q3-CG-SI) would have helped with the communication.

It is interesting to note that in the affective group the “healthy” subjects rated how much the avatar and the voice helped (Q6-AG-SI and Q7-AG-SI) higher than the “patients” did (Q6-AG-SP and Q7-AG-SP). This highlights the fact that the system is particularly useful for the interlocutor, who is the one looking for cues about the emotion felt by the patient. The “patient” subjects often stated that they did not really pay attention to the avatar, as they were focused on writing with the keyboard.

6 Conclusions and Future Work

People in LIS have limited means to communicate. In the past decades, technology has greatly improved their quality of life by providing a range of communication tools. However, AAC systems are still largely constrained to communicating words and rarely include ways of expressing emotions. This work focused on studying the impact of expressing emotions on the communicative abilities of LIS patients. To do so, we created a platform that allows the user to select an emotion among happy, angry and sad. A 3D avatar is then animated according to the selected emotion, alongside an emotionally modulated voice synthesis. This system was tested by 36 subjects, who were successfully able to recognize the emotions from the voice modulation and the avatar. They found that the two emotional tools helped with the communication, as they were more able to convey and identify emotions. This system is available as open source [20] and also includes a gaze-based computer game [21] and a gaze-based web-browsing system [22]. While the avatar currently expresses only fixed emotions, this work shows the need for extending AAC tools to include more non-verbal communication cues. In the future, the system could include additional animations such as lip synchronization, visual reactions to detected skin temperature (sweating, shivering), additional gestures (wink, hand gesture, raised eyes...) and additional types of sounds (“waaaw”, “uhm uhm”, “oooh”).

The avatar could therefore become an extensive communication tool as well as a quick visual aid for the interlocutor, family and caregivers to understand the internal state of the patient. Advanced avatar control could also be used, for instance, to perform art [23].

While voice modulation and facial expression are the most common forms of non-verbal communication, other types of natural communication may be simulated, such as physical contact. Indeed, systems such as heating wristbands placed on family members and loved ones could be activated by the patient using gaze control to convey the idea of touching their arm.

In the future, patients’ emotions could be automatically detected, for instance from physiological signals [24]. However, this should be investigated carefully, as it raises concerns regarding the patients’ willingness to constantly display their emotions without a way to hide them from their interlocutor.