1 Introduction

In the final quarter of 2018, the number of smart speakers sold in China surged. According to quarterly research published in November 2018 by Strategy Analytics, global smart speaker shipments grew an astonishing 197% year-over-year to reach a record 22.7 million units in Q3 2018, putting the market on track to surpass 100 million units in use during the final quarter of the year. In China, Baidu was the biggest mover in the quarter, increasing its share from just 1% in Q2 2018 to 8% in Q3 2018, and has joined Alibaba and Xiaomi in a three-way battle for leadership of the fledgling smart speaker market in China [1]. Designers now face a rapidly growing group of Chinese users who own smart speakers, while noticing that many existing smart speakers in China lack a good emotional interaction experience.

However, from the user's perspective, is an emotional interaction experience with smart speakers necessary?

Emotional Design [2] by Dr. Norman has demonstrated that the emotional experience a product provides is essential and can be even more attractive to users than the functional experience it offers. Pure reason doesn’t always suffice. As Karen Holtzblatt and Hugh Beyer have revealed in Contextual Design [3], successful design now means going far beyond understanding the “cognitive load” or “steps of a task”: designers must understand a much wider life context than they ever had to before, including the various activities of users. Across these activities, a lot of information is conveyed non-verbally, especially as emotional information; Albert Mehrabian’s studies suggested that we overwhelmingly deduce our feelings, attitudes, and beliefs about what someone says not from the actual words spoken, but from the speaker’s body language and tone of voice [4, 5].

Nonetheless, as Rosalind Picard pointed out in Affective Computing, although emotion has a critical role in cognition and in human-computer interaction, it does not need to be put into everything that computes. An affective computer still needs to have logical reasoning abilities. Designers should not go overboard trying to make computers and other smart devices, such as a printer, affective [6].

In this paper we explore users’ needs for emotional interaction with smart speakers, and attempt to leverage interaction design to mimic or enable emotional care and empathy-based feedback on smart speakers.

2 Do Smart Speakers Need to Mimic Emotional Care and Empathy-Based Feedback? User Research and Discoveries

2.1 Observations from Previous User Research and Surveys

What emotions or emotional responses should smart devices like a robot or a smart speaker have? Norman writes in Emotional Design that the answer depends upon the sort of robot we are thinking about, the tasks it is to perform, the nature of the environment, and what its social life is like. Does it interact with other robots or people? If so, it will need to express its own emotional state as well as to assess the emotions of the people it interacts with [2].

According to our previous user investigation, smart speakers are mainly used in the home scenario, and people interact with them directly; a smart speaker will therefore likewise need to express its own emotional state as well as to assess the emotions of the people it interacts with.

From our observation of previous user research and surveys on smart speakers, AI technologies can give smart devices various human-like capabilities such as “speech”, “sight”, and “body language”, which naturally give users the psychological perception that the smart device has human characteristics during the interaction process.

2.2 User Research on Emotional Experience User Needs

In order to find out whether Chinese users commonly expect emotional interaction with smart speakers, and how important the emotional experience of interacting with smart speakers is to them, we carried out a dedicated study on the emotional interaction needs of smart speaker users.

Subject of Research, Method and Assessment Tool

Subject of Research:

A total of 776 users. Male-to-female ratio 4:6, with balanced demographic variables such as age, occupation, and place of residence.

Method:

Questionnaire Assessment.

Assessment Tool:

The Likert scale, from 1 to 5; the higher the score, the more important the user rates the item.

The Research Process

Firstly, we collected the descriptions users apply to their interaction experience with smart speakers through user logs and interviews. In the following brainstorming session, these collected descriptions were refined and then printed on cards for classification. Users then used the classified cards to evaluate the importance of each description.

The classification of these words follows the three levels of Norman’s emotional experience model: Visceral, Behavioral, and Reflective.

Descriptions corresponding to the visceral level are “Engaging ID Design”, “Charming Voice”, “Enjoyable Touch Feeling”, and so on.

Descriptions corresponding to the behavioral level are “Easy to Use”, “Enjoyable Surprise”, and so on.

Descriptions corresponding to the reflective level are “Sense of Humor”, “Ability of Empathy”, “Ability of Caring”, and so on.

Findings of this Research

Emotional experience is as important as functional experience.

The more experienced with the smart speaker a user is, the more important he/she thinks the emotional experience provided by smart speakers is.

Results show that, compared with users who interact with smart speakers less than once a day, users who interact with smart speakers frequently (5 times or more a day) rate the descriptions representing emotional experiences, such as “ability of empathy”, higher. See Table 1.

Table 1. The relationship between usage frequency of smart speakers and user expectation of emotional interaction experience

Results also show that users who have rich experience with smart products value the emotional experience more than users who have little experience with smart speakers. See Table 2.

Table 2. The relationship between smart product experience and user evaluation of the importance of emotional interaction experience

3 How to Leverage Design to Mimic Emotional Care and Empathy-Based Feedback on Smart Speakers

3.1 Studies on How Humans Respond to Emotions

How Professional Counselors Respond to Emotions

We visited three professional counselors in Beijing to learn how they respond to visitors who come to consultations with emotions. Professional counselors have their own strategies for interacting with emotional visitors.

Listed below are typical strategies they will apply during the counseling.

Tone of Voice.

It’s the most timely and efficient way to convey empathy.

“When facing sad visitors, the speed of speech should be slow, and the tone of voice should sound low.”

“The speech speed should be neither too fast nor too slow when talking to a sad visitor; maintain a moderate rate of speech and keep the tone similar to the visitor’s. Facing different emotions, the tone of voice should be modulated to stay on the same frequency as the visitor.”

Identify Emotions.

“I guess you are a little sad now.” “Are you sad now?” “It sounds like / I guess / I sense that you are sad.”

Accept Emotions.

“That happens.” “It’s very common to feel that way.”

Venting Emotions.

Be a good listener. “I’m willing to hear about it.”

Behavior feedback. “I can hear you crying more heavily; are you thinking of things that make you feel more sad?”

“If you want to talk about it, you can. If you don’t want to talk about it, that’s ok, I will be here.”

Offering Help.

Solve problems or change perceptions.

Provide a variety of ways to divert attention, such as watching videos, doing sports, punching sandbags, etc.

Classic Interpersonal Communication Theories

Mehrabian came to two main conclusions in his studies of interpersonal communication [7]:

Firstly, there are basically three elements in any face-to-face communication: words, tone of voice, and nonverbal behavior. Words are what is literally being said; the spoken word is the verbal part of the communication, while intonation and body language are part of the non-verbal communication. Tone of voice, also known as intonation, is how something is said (use of voice); it is the vocal factor. Nonverbal behavior, also known as body language, is the visual factor: the posture, facial expressions, and gestures someone uses.

Secondly, the non-verbal elements are particularly important for communicating feelings and attitudes, especially when the elements are inconsistent: if the words disagree with the tone of voice and the nonverbal behavior, people tend to believe the tonality and the nonverbal behavior.

Classic Social Skills Theories

Definitions of social skill have developed over time [8]:

Phillips (1978) noted that “knowing how to behave in a variety of situations” is part of social skills.

Later, Ellis (1980) pointed out: “By social skills I refer to sequences of individual behavior which are integrated in some way with the behavior of one or more others and which measure up to some pre-determined criterion or criteria.” Other definitions, while focusing upon behavior, have included the concept of positive or negative reactions by the other person as an element of skilled behavior.

Another definition by Becker et al. (1987) highlighted the fact that “to perform skillfully, the individual must be able to identify the emotions or intent expressed by the other person and make sophisticated judgments about the form and timing of the appropriate response”.

Michelson et al. (1983) identified six elements to constitute the core concept of social skills, namely that they (1) are learned; (2) are composed of specific verbal and non-verbal behaviors; (3) entail appropriate initiations and responses; (4) maximize available rewards from others; (5) require appropriate timing and control of specific behaviors; (6) are influenced by prevailing contextual factors.

And the definition adopted by Owen D.W. Hargie is that social skill is the process whereby the individual implements a set of goal-directed, interrelated, situationally appropriate social behaviors which are learned and controlled. This definition emphasizes six main features of social skills.

3.2 Leverage Interaction Design to Mimic Emotional Care

Now we know that tone of voice plays the first key role in interpersonal communication; it is more important than the meaning of the actual spoken words. Thus, for a smart speaker, the tone of voice needs to be designed to mimic empathy when it interacts with users who are emotional.

At present, the voice interaction between smart speakers and users in the Chinese market mainly follows a recognize-the-speech-then-give-feedback process. Whether or not the user interacts with the speaker with emotion, the smart speaker will only fulfill the instruction input by the user. Currently, almost all the TTS voices smart speakers apply are synthetic and sound happy.

The Tone of Voice Design for Smart Speakers

The tone of voice of the smart speaker needs to be adjusted according to the user’s emotional state. For example, when the smart speaker detects that the user is in a sad mood, it is not suitable to give the user feedback in a happy voice; the smart speaker should give feedback in a sad tone to sad users.

At least 3 kinds of tone of voice can be designed for smart speakers:

One opposite to the user’s tone of voice.

One the same as the user’s tone of voice.

And one emotionless.
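The tone choice can be illustrated with a small sketch (not part of the original system; the emotion labels and function name are hypothetical), in which the default “same” strategy mirrors the user’s detected emotion:

```python
# Hypothetical sketch: choose the speaker's feedback tone from the
# user's detected emotion, following the three designed tone kinds.

# The four emotions the paper names as most common for Chinese users.
COMMON_EMOTIONS = {"happy", "neutral", "sad", "angry"}

def choose_feedback_tone(user_emotion: str, strategy: str = "same") -> str:
    """Return the tone the speaker should use for its reply."""
    if user_emotion not in COMMON_EMOTIONS:
        return "neutral"            # fall back when emotion is unrecognized
    if strategy == "same":          # mirror the user's tone of voice
        return user_emotion
    if strategy == "emotionless":   # flat, neutral delivery
        return "neutral"
    if strategy == "opposite":      # e.g. a happy tone for a sad user
        opposites = {"sad": "happy", "happy": "sad",
                     "angry": "happy", "neutral": "neutral"}
        return opposites[user_emotion]
    raise ValueError(f"unknown strategy: {strategy}")

print(choose_feedback_tone("sad"))               # mirrors the sad user
print(choose_feedback_tone("sad", "opposite"))   # replies in a happy tone
```

The “same” default reflects the experiment result reported below, in which sad users preferred a tone similar to their own.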

Mimic Emotional Care and Empathy-Based Feedback on Smart Speakers

An affective computer should not be built with only affective abilities, which would lead to infantile behavior at best. An affective computer still needs to have logical reasoning abilities [6].

Also, being goal-oriented is one of the most important interpersonal communication rules [8]. A smart speaker should not only respond to users’ emotions; it should provide proper functional feedback to mimic empathy: sometimes it responds to the user’s emotion first, sometimes it fulfills what the user instructs it to do first, and sometimes it offers a recommendation first. So which step should the smart speaker take first, responding to the user’s emotion or giving functional feedback? We need to mimic empathy-based feedback strategies for smart speakers.

User’s Instruction and Emotion Status - Criteria for Feedback Strategy Choosing

The mimicked empathy-based feedback strategies help the smart speaker decide whether it should respond to the user’s emotion or fulfill the user’s instruction first. Facial emotion recognition technology can recognize the user’s emotion, and NLP technology can distinguish the user’s intent.

Although humans are very expressive, and our natural emotions can be distinguished into 27 distinct categories [9], computers can currently discriminate only about six different facial expressions and up to eight different vocal expressions under certain conditions. We found that happy, neutral, sad, and angry are the most common emotions occurring while Chinese users interact with smart speakers. The ability to respond to these common emotions can satisfy current user needs.

Summarized from our previous research on smart speakers, there are 4 possible combinations of user instruction and emotion status that the smart speaker faces; see Table 3:

Table 3. Scenes of emotional interaction needs of smart speakers

Situation 1

User gives a clear instruction with emotion directed towards the smart speaker. E.g., “(Angrily) Set an alarm for 6:00 am tomorrow.” This usually happens after multiple failures to recognize the user’s instruction due to reasons such as a heavy accent or background noise.

Situation 2

User gives a clear instruction with emotion not directed towards the smart speaker. E.g., “(Sad and anxious) Please find me a vet nearby.”

Situation 3

User gives a fuzzy instruction with emotion. E.g., “(Sad and anxious) My puppy’s eyes look red and swollen.”

Situation 4

User expresses emotion without an instruction. E.g., “(Sad) I’m tired!”
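The four situations above can be sketched as a simple classifier; the boolean inputs are assumptions standing in for what facial emotion recognition and NLP would provide, not a real API:

```python
# Hypothetical sketch of Table 3's four situations as a classifier.
# Inputs: whether the user gave an instruction at all, whether it is
# clear, and whether the emotion is directed at the speaker itself.

def classify_situation(has_instruction: bool,
                       instruction_clear: bool,
                       emotion_towards_speaker: bool) -> int:
    if not has_instruction:
        return 4   # user only expresses emotion, e.g. "(Sad) I'm tired!"
    if instruction_clear:
        # clear instruction: distinguish by the target of the emotion
        return 1 if emotion_towards_speaker else 2
    return 3       # fuzzy instruction, e.g. "My puppy's eyes look red"

# "(Angrily) Set an alarm ..." after repeated misrecognition:
print(classify_situation(True, True, True))     # Situation 1
# "(Sad) I'm tired!":
print(classify_situation(False, False, False))  # Situation 4
```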

Feedback Strategies

Social skill is the process whereby the individual implements a set of goal-directed, interrelated, situationally appropriate social behaviors which are learned and controlled.

To mimic the social skills needed to interact with emotional users in the first round of HCI, smart speakers will need at least 6 different feedback strategies, including the one (task fulfillment only) that many existing smart speakers in the Chinese market already have; see Table 4:

Table 4. Smart speakers feedback strategy design

4 User Experiments of Emotional Interaction to Mimic Emotional Care on Smart Speakers

4.1 User Experiment of Tone of Voice Design

Smart speakers will need to determine their own feedback voice tone according to the emotional state of the user. Taking sad users as an example, we studied two questions through experiments: firstly, whether users want the smart speaker to respond to them in a voice with an emotional tone; secondly, what kind of tone of voice sad users prefer. The method is a within-group experiment:

The independent variable is the three replies with exactly the same literal content but different tones of voice. The dependent variable is the preference for the tones, measured on a 7-point scale; the higher the score, the greater the liking. There are two controlled factors: one is the literal content of the speech (the three audio segments use the same content); the second is the sequence effect, which is balanced by playing the audio in a completely random order.
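The complete randomization used to balance the sequence effect can be sketched as follows (an illustrative assumption, not the authors’ actual tooling; the clip file names are hypothetical):

```python
import random

# Hypothetical sketch: play the three audio clips in a completely
# random order for each participant to balance the sequence effect.

CLIPS = ["happy_tone.wav", "sad_tone.wav", "neutral_tone.wav"]

def presentation_order(seed=None):
    """Return a fresh random permutation of the clips for one participant."""
    rng = random.Random(seed)
    order = CLIPS[:]       # copy so the master list stays fixed
    rng.shuffle(order)     # every permutation equally likely
    return order

print(presentation_order(seed=1))
```

Drawing a fresh permutation per participant ensures that, across enough participants, each clip appears in each position about equally often.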

Experiment Design

The experiment was carried out using the Wizard of Oz method.

The pre-prepared experiment materials are the 3 listed below:

A piece of emotionally neutral music [10], verified through EEG data, to make sure the user begins the experimental interaction with the smart speaker in a neutral emotional state.

A piece of video that induces sad emotion in more than 90% of users, supported by EEG data.

3 audio clips with exactly the same literal content but different tones of voice. Because current TTS technology cannot synthesize emotional tones of voice, we asked a professional actor to record the audio to make sure the 3 clips have obviously perceptibly different voice tones.

The experiment is executed in 4 steps. The first step is to play the neutral music to make sure the user begins the experiment in a relatively neutral emotional state. The second step is to induce the user’s sadness by showing him or her the sad story video. The third step is to instruct the user to interact with the smart speaker once the EEG and face-reading data show the user is in a sad emotional state, and to let the user experience one feedback strategy. In the fourth step, the user fills in the assessment questionnaire about his or her emotional state and preference for the speaker’s tone.

Result

Core discovery:

When users are in a sad mood, their favorite voice tone for smart speakers is a tone similar to their own: a sad tone.

All users who participated in the experiment could perceive that the 3 given audio clips have 3 tones of voice: one happy, one sad, and one neutral. And 62% of the users in the experiment expect the smart speaker to give them feedback in a sad tone when they are sad.

4.2 User Experiment of Feedback Strategies Design

Core Goal of the Experiment

What kind of emotional feedback strategies should the smart speaker provide in the first round of HCI dialogue when facing sad users?

The smart speaker may encounter 3 scenarios with sad users in this experiment:

Scenario A, the user inputs a clear command with an emotion that is not directed toward the speaker: the user will feel sad after watching the sad video and will be instructed to say to the smart speaker, “I don’t feel well, can you play me some music?”

Scenario B, the user inputs an unclear command with an emotion that is not directed toward the speaker: the user will feel sad after watching the sad video and will be instructed to say to the smart speaker, “I just watched a video.”

Scenario C, the user has a sad emotion that is not directed toward the speaker and will be instructed to express the emotion to the smart speaker: the user will feel sad after watching the sad video and will be instructed to say to the smart speaker, “I don’t feel well.”

The smart speaker will give the 6 kinds of feedback listed below in random order:

Feedback 1, Only respond to emotions. “What makes you not feeling good?”

Feedback 2, Respond to emotions first, then recommend tasks. “What makes you not feeling good? Would you like me to play a song for you?”

Feedback 3, Respond to emotions first, then fulfill tasks. “What makes you not feeling good? Let me play a song for you. (Then the smart speaker plays the song.)”

Feedback 4, Fulfill the task first, then respond to emotions. “Let me play a song for you. (Then the smart speaker plays the song.) Hope you will feel better.”

Feedback 5, Respond to emotions first, then fulfill tasks. “Hope you will feel better. Let me play a song for you. (Then the smart speaker plays the song.)”

Feedback 6, Only fulfill tasks. “Let me play a song for you. (Then the smart speaker plays the song.)”

Currently, Feedback 6 is the most common kind of feedback that smart speakers in the Chinese market give.
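As an illustrative sketch (an assumption about structure, not the authors’ implementation), the six feedbacks above can be modeled as ordered compositions of three primitive acts; judging from the quoted utterances, Feedback 3 and Feedback 5 share the same act order and differ only in wording:

```python
# Hypothetical act names; each strategy is an ordered list of acts.
RESPOND = "respond_to_emotion"   # e.g. "What makes you not feeling good?"
RECOMMEND = "recommend_task"     # e.g. "Would you like me to play a song?"
FULFILL = "fulfill_task"         # e.g. actually play the song

STRATEGIES = {
    1: [RESPOND],             # only respond to emotions
    2: [RESPOND, RECOMMEND],  # respond, then recommend a task
    3: [RESPOND, FULFILL],    # ask about the emotion, then play the song
    4: [FULFILL, RESPOND],    # play the song, then respond to emotions
    5: [RESPOND, FULFILL],    # comfort first, then play the song
    6: [FULFILL],             # task fulfillment only (the common status quo)
}

def acts_for(strategy_id: int):
    """Return the ordered acts the speaker performs for one strategy."""
    return STRATEGIES[strategy_id]

print(acts_for(6))  # the behaviour of most existing speakers in China
```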

Independent variables:

The 6 different feedbacks.

Dependent variables:

A subjective questionnaire on the degree of the user’s aroused sad emotion, and a subjective questionnaire on the user’s preference for the given feedback.

Controlling factors:

Keep the wording of the feedback the same.

Recommended task: all scenarios use the same recommended task. Sequence effect: the order of the feedback strategies is balanced with a completely random method.

Experiment method:

The Wizard of Oz.

The main tester executes the feedback strategy accordingly to simulate a real HCI dialogue.

Experiment for Scenario A

The user will give a clear instruction with an emotion that is not directed toward the speaker: the user will feel sad after watching the sad video and will be instructed to say to the smart speaker, “I don’t feel well, can you play me some music?”

The 3 feedbacks given via the Wizard of Oz on the smart speaker in random order are:

A1. “Let me play a song for you. (Then the smart speaker plays the song.) Hope you will feel better.”

A2. “Hope you will feel better. Let me play a song for you. (Then the smart speaker plays the song.)”

A3. “Let me play a song for you. (Then the smart speaker plays the song.)”

The favorite feedback strategy found in scenario A is feedback Strategy A2. See Table 5.

Table 5. Sad users’ preference for 3 different smart speaker voice tones - within group

However, considering the effect of alleviating sadness, all three strategies can significantly alleviate the user’s sadness. Strategy A2 proved to be the best one and is significantly better than Strategy A1. See Table 5.

In scenario A, Strategy A1 is more preferred than Strategy A2, but there is no significant difference between the two. Both A1 and A2 are significantly more preferred than Strategy A3. See Table 6.

Table 6. Comparison of 3 feedback strategies in scenario A- between groups

Users have different preferences for the 3 strategies in scenario A:

Table 7. Users’ preference of 3 strategies in scenario A - within groups

Reasons for preference of Strategy A1:

A strong sense of companionship. Listed below are some of the user comments:

“I feel that the smart speaker has been paying attention to my emotions and offering its accompany by suggesting me to listen to the song.”

“The timing is closer to a real-life situation. Before listening to the music, I can’t really be happy just because the smart speaker told me so. It makes much more sense for the smart speaker to say so after I listen to the music.”

Reasons for non-preference of Strategy A1:

The wording is superfluous. Listed below are some of the user comments:

“After listening to the song, my sadness has already been eased, so there is no need for the smart speaker to say something like it hopes I feel better.”

Reasons for preference of Strategy A2:

A timely sense of caring. Listed below are some of the user comments:

“The sadness can be quickly eased through the comforting words.”

“In daily life, this is usually what people do, you comfort others first.”

Reasons for non-preference of Strategy A2:

The wording sounds insincere. Listed below are some of the user comments:

“It feels more sincere to say ‘I hope you could feel better’ after I actually listen to the music.”

Reasons for preference of Strategy A3:

A smart speaker is still a machine, and a machine only needs to complete the instructed orders.

Reasons for non-preference of Strategy A3:

Lack of emotion response.

Experiment for Scenario B

The user will give a fuzzy instruction with an emotion that is not directed toward the speaker: the user will feel sad after watching the sad video and will be instructed to say to the smart speaker, “I just watched a video.”

The 2 feedbacks given via the Wizard of Oz on the smart speaker in random order are:

B1. “Are you feeling unwell? Would you like me to play a song for you?”

B2. “Are you feeling unwell?”

Results show that responding to user emotions can significantly help users alleviate sadness, but the music recommendation has no additive effect. See Table 8.

Table 8. Comparison of 2 strategies in scenario B- between groups

In scenario B, Strategy B1 is more preferred than Strategy B2. See Table 9.

Table 9. Users’ preference of 2 strategies - within groups

Reasons for preference of Strategy B1:

A sense of caring and consideration conveyed by recommending music. Listed below are some of the user comments:

“I feel that this smart speaker is being caring. It looks like it’s trying to help me alleviate my sadness by diverting my attention from my sadness by recommending music.”

Reasons for non-preference of Strategy B1:

The recommendation is too limited.

The experiment was executed with the Wizard of Oz method and controlled variables; recommending music was the only recommended task.

Reasons for preference of Strategy B2:

Being smart. Listed below are some of the user comments:

“It looks like that the smart speaker can recognize my emotions!”

Reasons for non-preference of Strategy B2:

It’s not enough to just recognize the emotion. Listed below are some of the user comments:

“Only talking about the negative emotion may strengthen my sadness; when it comes to negative emotions, just talking about them makes me feel more sad.”

“I don’t know how to keep the conversation going when the smart speaker tried to talk about my feelings.”

Experiment for Scenario C

The user will just express his or her emotion to the speaker: the user will feel sad after watching the sad video and will be instructed to say to the smart speaker “I don’t feel well,” “I feel sad,” and so on.

The 2 feedbacks given via the Wizard of Oz on the smart speaker in random order are:

C1. “Are you feeling unwell?”

C2. “Are you feeling unwell? Would you like me to play a song for you? (Then the smart speaker plays the song.)”

Results show that responding to user emotions can significantly help users alleviate sadness, but the music recommendation has no additive effect. See Table 10.

Table 10. Comparison of 2 strategies - between groups

In scenario C, Strategy C1 is more preferred than Strategy C2. See Table 11.

Table 11. User preference of 2 strategies - within groups

Reasons for preference of Strategy C2:

A sense of caring and consideration conveyed by recommending music. Listed below are some of the user comments:

“I feel that this smart speaker is trying to alleviate my sadness by diverting my attention from my sadness by recommending music.”

Reasons for non-preference of Strategy C2:

Being interrupted; the recommendation is too limited. Listed below are some of the user comments:

“I am immersed in my sad mood, and I don’t want to be diverted to something else like a song so quickly. It was a bit awkward for me to have the smart speaker recommend a song.”

5 Discussion

5.1 Insufficiency of the Research, Design and Experiments

Insufficiency of the Design

The turn-on mechanism for the smart speaker to mimic emotional care has not yet been designed. The trigger to give emotional care could be the time, the kind of instruction given by the user, the degree of the user’s emotion, or the user profile.

The interaction design of this case to mimic emotional care only considered the emotional interaction feedback design of the first round of HCI between the smart speaker and the user. However, in real life, the number of HCI rounds may be more than one each day, and the time of the first round of HCI between the user and the smart speaker is unknown: some rounds may occur in the early morning, and some in the middle of the night. A smart speaker should not always respond to users’ emotions; it needs a rule of social timing so it can respond to the user in a timely, well-mannered way.

In real life, the instructions given by users vary: an instruction could be a very simple task, such as setting an alarm clock, or a very complicated task, such as an information search. Does the smart speaker need to respond to the user’s emotion every time, no matter what the given instruction is?

Factors such as user’s gender and personality may affect the user’s preference for the emotional feedback too.

Insufficiency of Experiments

The Experiment Was Only Carried Out on Sad Users

In order to let users enter the experiment in an effective emotional state, and because sadness is the easiest emotion to induce successfully compared with happiness and anger, the experiment only carried out research on the emotion of “sadness”.

Presentation of Experimental Strategy is Limited

Because most existing smart speakers in the Chinese market do not have good emotional interaction abilities, the experiment adopted the Wizard of Oz method and only paid attention to an initial exploration of “speech” as the way of expression.

However, many existing smart speakers have a screen that can display an avatar making facial expressions, and facial expressions could enrich the feedback state of the smart speaker to mimic emotional care. The hardware of the smart speaker, such as lights and light-motion design, can also be considered to enrich its feedback state.

The Recommended Task in the Feedback Strategy Experiment Is Limited

The experiment was conducted with the Wizard of Oz method, and to control variables, the task-recommendation strategy was only presented to users as a music recommendation.

Insufficient Findings

We observed during the experiments that people with different personalities seem to have different feedback strategy preferences in the first round of emotional HCI with smart speakers.

Each user was required to fill out the Big Five personality scale after the experiment, but the results show no clear sign of a significant relationship between personality and strategy preference.

5.2 Future Design and Research Goals

We observed during the experiments that, when watching the same piece of video to arouse sad emotion, some users think the degree of sadness is deep, and some think the degree is very low. Different degrees of sadness may affect the user’s preference for the feedback strategy of the first round of emotional HCI on smart speakers.

Both in interpersonal communication theories and in real life, the emotional state, its degree, and the timing are key factors that make people decide whether or not to initiate interpersonal communication. In the emotional interaction between smart speakers and users, these factors need to be considered and designed for too. There are many more factors that could be considered, and we need to find them through more studies and research on interpersonal communication and social skills.

Also, the multiple channels have different communication effects; for example, people with an intensely angry emotion may not listen to others very well, while they can still get information through looking.

It is necessary to further our study of the different channels, such as the sound channel and the screen channel, from the perspective of human factors, to make the mimicry of emotional care and empathy-based feedback on smart speakers more natural and more effective.