1 Introduction

Conversation involves the use of words, prosody, facial expressions, gestures, and actions, all seamlessly combined to convey meaning [1], and humans manage to decode all these signals to perceive the message. Conversation analysis has long been of interest to researchers across linguistics, sociology, anthropology, communication, and computer science [2], as it holds great promise as an arena for understanding the essence of human interaction. Conversation informatics (CI) builds on conversation analysis and integrates scientific and engineering principles with utility considerations. CI analyzes conversational interactions and thought-sharing using social signals in order to design conversational artifacts that can interact smoothly with people [3].

We propose conversation envisioning (CE) as an approach to bridge the scientific and engineering aspects of conversation informatics. Alongside the explicit aspects of conversation, analyzing tacit information is of paramount importance. In this view, CE aims to unveil the tacit thoughts and mental states of the people who interact during the conversation (Fig. 1). CE tries to look beneath the surface of the interaction and discover tacit information by focusing on how common ground (CG) is formed and updated during the conversation. We emphasize the dynamic nature of the CG by treating conversation as a continuous process of updating the shared space over the course of the interaction.

Fig. 1. (a) The interpretation of one’s own thoughts, mental states, and intentions, both verbal and non-verbal; (b) the perception of the listener, including possible interpretations, mental states, and tacit thoughts; (c) the update of the common ground.

This notion has been investigated as “grounding” in the studies by Traum [4] and Nakano et al. [5], who analyzed the process of building and repairing the CG. A practical example can be found in the study by Visser et al. [6] who proposed a computational model of grounding to provide overlapping verbal and non-verbal behavior by a virtual agent for efficient conversation.

In the case of conversation ‘in-the-wild’, however, the interaction goes beyond the limited incremental process of grounding and includes more active and dynamic aspects. For instance, when negotiating over price, participants combine strategic actions and improvisations to achieve multiple goals, such as maintaining a friendly relationship (a long-term connection), rather than merely following negotiation protocols to obtain the best price. These goals are not always explicit; they are sometimes hidden or contradictory. Thus, for an agent to converse smoothly with humans, it is crucial that it understand the underlying messages exchanged through verbal and non-verbal interactions. As there is no straightforward solution, a longitudinal effort involving (meta-)participants such as instructors, scientists, and engineers is needed to understand and augment conversations.

This paper proposes the conversation envisioner as a computational platform designed to support continuous collaboration and longitudinal effort by (meta-)participants, including data collection, analysis, modeling, building AI (meta-)participants, evaluating hypotheses, and applications such as training and real-time assistance. We use VR technology as a platform to facilitate uncovering tacit thoughts in situated conversation. Furthermore, we propose the conversation description language (CDL) to describe the conversation systematically and to realize envisioning by including the hypotheses brought by (meta-)participants. The goal of virtual reality conversation envisioning (VRCE) is to realize smoother communication and common ground building. We take bargaining as a situated interaction with rich socio-cultural aspects and show how VRCE can be effective by providing timely assistance in cross-cultural communication with limited or no shared background.

The present research has, therefore, three contributions: (1) proposing CE as a new computational framework for analyzing tacit dimensions of conversation, (2) introducing VRCE as a platform for augmenting conversation with CE, and (3) presenting CDL as a language for describing conversational components.

2 Situated Conversation Envisioning: Bargaining Scenario

Many communicative actions rely on what is present in a given situation. In this view, the situation gives meaning to the verbal and non-verbal interactions and influences shared knowledge and common ground building [7]. According to situated cognition theory, “every human thought and action is adapted to the environment that is situated because what people perceive, how they conceive of their activity and what they physically do develop together” [8]. Therefore, this study puts particular emphasis on situated conversations, where conversational interactions involve frequent references to a specific situation comprising not just physical, but also social entities and relations.

We select bargaining as the topic because it is a cardinal illustration of a social interaction that provides useful information for analyzing broad and varied forms of complex social relationships between people. The bargaining relationship is a microcosm within which many of the causes and consequences of social interaction and interdependence may be fruitfully examined [9]. It is an instance of negotiation, which is an essential part of everyday life, and thus demands a thorough understanding of the situation in order to reach an agreement.

Some studies have viewed bargaining as a dynamic decision-making process resembling the ultimatum bargaining game. They started from the basic bidding problem and moved toward more complex situations [10, 11]. Others tried to generate agents that are believable negotiators [12]. However, bargaining can be viewed as a more complex social interaction, which is influenced by multiple factors such as culture or emotion and may involve multiple goals, such as building trust and friendship, rather than merely negotiating over price.

2.1 Cultural Aspects of Bargaining

Culture plays a significant role in bargaining. Hence, a unified bargaining practice or model may not be valid across different cultures. For instance, while in some cultures bargaining may be viewed as a simple trade-off, other cultures may consider bargaining a chance to socialize and build relationships. Such interaction may itself be motivated by expected benefits, such as forming a long-term relationship with the customer in order to guarantee future purchases, or building trust with the shopkeeper to facilitate future transactions by reducing cost and saving time. The influence of culture on negotiation is modeled by Hofstede et al. [12]. According to this model, masculine cultures are interested in fast, profitable trades without considering past trustworthiness in subsequent deals, whereas feminine cultures value building trust and relationships, as this might pay off in future negotiations. Similarly, in collectivist societies, negotiation builds on the established relationship. Individualist societies, however, focus on personal interest and explicitness, which sometimes offends collectivists.

Fig. 2. Analysis and interpretations rubric

2.2 Emotional Aspect of Bargaining

Almost every human interaction involves emotion; therefore, we cannot leave it out when analyzing the conversation between shopkeeper and customer. In fact, negotiation is a complex emotional decision-making process that aims to reach an agreement on the exchange of goods or services [13]. In an attempt to reach a deal, people exchange many verbal and non-verbal signals, each having a specific emotional effect. These can shape the course of the conversation so that the parties are directed toward the deal or distracted toward a subsidiary goal. For instance, what may start as bargaining over the price can turn into a heated conversation to protect one’s pride when an offer is interpreted as an attack on one’s self-image.

2.3 Envisioning of the Bargaining Scenario

Given the bewildering complexity of the bargaining situation and the in-depth knowledge it involves, investigating the complicated interactions that underlie it is essential for designing cognitive agents that are capable of interpreting such situations and interacting with people.

To begin the analysis, the conversation is divided into several quanta as packages including the most relevant information to the meaning and expressions of a significant segment of the conversation [3]. Next, the observers annotate each quantum, based on (i) the interpretation of what is said by each side (for every single utterance accompanied by non-verbals, if any), (ii) the expectations of each side (i.e. what they expect from the other person as a reaction/response in the given situation) and (iii) the mental process and reasoning, which induce those expectations and interpretations. Figure 2 shows the analysis rubric.
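The three annotation dimensions listed above can be pictured as a small record type. The following is only a minimal sketch: the class names `UtteranceAnnotation` and `Quantum` and all field names are hypothetical and are not taken from the rubric in Fig. 2; the sample content paraphrases the opening of the extract discussed later in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UtteranceAnnotation:
    """One observer's annotation of a single utterance (hypothetical schema)."""
    speaker: str               # e.g. "customer" or "shopkeeper"
    utterance: str             # what was said (may be empty for a pure gesture)
    non_verbal: str = ""       # accompanying non-verbal behavior, if any
    interpretation: str = ""   # (i) what the observer thinks was meant
    expectation: str = ""      # (ii) reaction expected from the other side
    reasoning: str = ""        # (iii) mental process behind (i) and (ii)


@dataclass
class Quantum:
    """A significant conversation segment together with its annotations."""
    label: str
    annotations: List[UtteranceAnnotation] = field(default_factory=list)

    def speakers(self) -> List[str]:
        # Distinct speakers in order of first appearance.
        seen: List[str] = []
        for annotation in self.annotations:
            if annotation.speaker not in seen:
                seen.append(annotation.speaker)
        return seen


opening = Quantum(label="opening")
opening.annotations.append(UtteranceAnnotation(
    speaker="customer",
    utterance="You know what?",
    interpretation="Preliminary signal that an announcement follows",
    expectation="A sign of attention from the shopkeeper",
    reasoning="Seeks consent before sharing personal news",
))
opening.annotations.append(UtteranceAnnotation(
    speaker="shopkeeper",
    utterance="",
    non_verbal="nod",
    interpretation="Shows that he is listening attentively",
))
```

Keeping the three dimensions as separate fields, rather than one free-text note, would make it easier to compare how different observers interpret the same utterance.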

The annotation of the reasoning process behind the actions and reactions is based on a model inspired by the belief-desire reasoning of the theory of mind [14], as illustrated in Fig. 3. It is important to note that one’s action can have multiple interpretations, each of which may induce different reactions and thereby lead the conversation into different branches. Moreover, cultural and emotional aspects directly or indirectly affect one’s action.

Fig. 3. Simplified scheme of belief-desire reasoning

The following is an extract of a successful scenario, in which the goals of both parties are realized and the deal is concluded. In this scenario, the customer and the shopkeeper, two strangers from the same cultural background, interact to make a deal. The extract shows how other goals (here, building trust and a relationship) may be pursued prior to or in conjunction with the main goal, i.e., negotiating to maximize the benefit.

figure a

The customer opens with the pre-expansion “You know what?” (line 1) as a preliminary signal of an announcement, which is followed by the shopkeeper’s nod, showing that he is listening attentively. With this, they form a minimal joint project [15, p. 86] in which the customer seeks the shopkeeper’s consent to make her announcement. Upon receiving the signal, the customer proceeds and announces her upcoming marriage. Had she not received any signal of interest, she would have considered it inappropriate to make her announcement and would have chosen either to close the project by saying “Never mind” or to get to the point directly by saying “I’m looking for a table cloth”. However, the customer presupposes that the shopkeeper will engage with her story; she therefore accepts the risk of being ignored for the sake of building a relationship, while keeping a fallback strategy in case she is ignored. The shopkeeper, whether sincerely or out of politeness, shows that he is engaged in her story (nodding with wide eyes, line 4). As the customer shares personal information with the shopkeeper, the two partners grow closer as a result of the grounding. This initiates trust building and a prospective relationship. This type of interaction can be seen in feminine and collectivist cultures, where building trust and forming a relationship come before the trade. The customer explains her purpose in shopping and expects the shopkeeper to reinforce the trust that has been built by providing good suggestions. She also expects the shopkeeper to expand the common ground by referring to the topics she has already added to the CG (marriage, brown table).

figure b

During the rest of the interaction, both participants tried to maintain the positive mood and friendly atmosphere, which resulted in a successful deal and a long-term relationship. As mentioned earlier, however, had the shopkeeper chosen to ignore the customer’s joint project, the conversation could have taken a different path. We refer to this notion as branching, and we have collected and classified a number of such branches for analysis in the present study.

3 Virtual Reality Conversation Envisioning Framework

Many verbal and non-verbal cues are included in a small piece of conversation. Yet, it is not easy to trace all these signals. Speech is transient; likewise, gestures are fleeting and quickly disappear. We therefore need an effective framework that allows for recording, making traces, performing investigations, and extracting the tacit information of the conversation by involving (meta-)participants.

Fig. 4. VRCE framework

The proposed framework, VRCE, provides (meta-)participants with a deft tool that grants multiple features such as abilities to traverse through different time points in the conversation, to experience first and third person view, to add details, analysis and annotations on-the-fly and to trace the changes. For instance, it allows for creating branches in the conversation when necessary, indicating possibilities or alternative interpretations.
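The branching and on-the-fly annotation features described above can be illustrated with a small tree structure. This is a sketch under stated assumptions, not the VRCE implementation: the `Node` class, its fields, and the branch labels are hypothetical; the utterances are borrowed from the bargaining extract in Sect. 2.3.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A point in the conversation (hypothetical structure)."""
    text: str
    notes: List[str] = field(default_factory=list)              # annotations added on the fly
    children: Dict[str, "Node"] = field(default_factory=dict)   # branch label -> next node
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def branch(self, label: str, text: str) -> "Node":
        # Create an alternative continuation under the given label.
        child = Node(text=text, parent=self)
        self.children[label] = child
        return child

    def history(self) -> List[str]:
        # Walk back to the root to recover the path that led here.
        path: List[str] = []
        node: Optional[Node] = self
        while node is not None:
            path.append(node.text)
            node = node.parent
        return list(reversed(path))


root = Node("C: You know what?")
engaged = root.branch("shopkeeper nods", "C: announces her upcoming marriage")
ignored = root.branch("shopkeeper ignores", "C: Never mind")
engaged.notes.append("Trust building starts here")
```

Storing a parent link on every node is one simple way to support moving backward through time from any point, while the labelled children capture alternative interpretations or possibilities.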

To this end, the conversation is reconstructed in the Unity3D environment. The procedure starts by recording the participants’ speech and capturing their gestures using the Perception Neuron motion capture system, followed by constructing the 3D scene and the avatars of the participants. Finally, the transformed speech is synchronized with the gestures so that the full scenario is regenerated in VR, enabling (meta-)participants to participate actively in CE (Fig. 4).

VRCE can serve as a tool for annotation, training, and assistance. The annotation mode aims to facilitate the interpretation of the conversation for (meta-)participants by providing the necessary functions. It can take the client as annotator and record annotations from the first-person view using a head-mounted display (HMD), or take researchers as annotators and capture structured annotations from the third-person view via a user-friendly interface. As a training tool, it can be used for educational purposes to teach discrete points such as cultural differences. Trainees (especially those from different cultures) can use the system for learning by playing the role of the characters, or by taking the first-person view and learning from the interpretations. Such a system can be helpful for understanding other cultures or even one’s own culture, especially for children and elderly people. It also has potential for game-playing situations augmented by live interpretation to detect miscommunication. It allows learners to try different alternatives of the conversation (branches), through which they can experience how small modifications can lead the conversation in a totally different direction. It lets participants find out why a certain situation arose by going back and forth through the conversation. Furthermore, by providing the ability to switch between and converse through different branches, VRCE exposes them to what-if questions to explore different alternatives. It also stimulates participants to think about what else could happen in that situation and hence provides an option to add branches at run time. Finally, it allows researchers to investigate learners’ behavior and the distribution of learners’ choices by recording their interactions. This allows investigators to perform a root-cause analysis of an action and to analyze the probability of its occurrence.

The last mode is the assistance mode, in which the agent can play the role of the (meta-)participants and converse with people, or serve as an assistant by providing interpretations and revealing important tacit information in order to smooth the conversation, expand the common ground, and promote empathy (allowing things to be viewed from the partner’s perspective).

4 Conversation Description Language (CDL)

We aim to obtain and estimate the causal relationships among the events and the mental processes of the participants from observations or interpretations rather than from statistical computations. This highlights the importance of a conversation manager as an automated annotation system that can convert expert knowledge and transfer it into the system. It is anticipated that using such an annotation tool as a scientist’s workbench would allow us to understand and predict behaviors, which can then be exploited to produce embodied conversational agents that are acceptable both perceptually and behaviorally.

The first step toward this goal is to have a method for capturing and encoding the logical structure of the conversation and for transferring meta-participants’ interpretations, given in natural language, into a simulator. This paper introduces CDL as a language for describing the structure of the various components underlying the conversation. As a conceptual framework, it encompasses identities that point to objects in the environment and are referred to in the conversation, actions and events identified in the verbal and non-verbal behaviors of the participants, and estimates of the abstract mental processes of the participants. It also allows a large degree of ontological promiscuity, which the analysts require until they reach a conclusion.

CDL employs an entity-attribute-value representation consisting of statements in theme-rheme form, where the theme is a reference to a value represented by the rheme, e.g. object1.color=red. Alternatively, a functional notation can be used: color(object1)=red. When more than one entity is involved as referent, an (ordered) set representation is used, such as:

{John, Mary}.children={Judy, Peter} or children({John, Mary})={Judy, Peter} or {father:John, mother:Mary}.children={Judy, Peter}.
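To make the notation concrete, the statements above can be mirrored in a small in-memory store. This is an illustrative sketch only: the helper names `state` and `get`, and the choice of a dictionary keyed by (referent, attribute), are assumptions and not part of CDL itself.

```python
from typing import Any, Dict, Hashable, Tuple

Theme = Tuple[Hashable, str]   # (referent, attribute)
Store = Dict[Theme, Any]       # theme -> rheme


def state(store: Store, referent: Hashable, attribute: str, value: Any) -> None:
    """Record the statement referent.attribute=value."""
    store[(referent, attribute)] = value


def get(store: Store, attribute: str, referent: Hashable) -> Any:
    """Functional notation: attribute(referent)."""
    return store[(referent, attribute)]


store: Store = {}

# object1.color=red
state(store, "object1", "color", "red")

# Unordered set of referents: {John, Mary}.children={Judy, Peter}
state(store, frozenset({"John", "Mary"}), "children", frozenset({"Judy", "Peter"}))

# Role-labelled referents: {father:John, mother:Mary}.children={Judy, Peter}
parents = (("father", "John"), ("mother", "Mary"))
state(store, parents, "children", frozenset({"Judy", "Peter"}))
```

A `frozenset` is used for the unordered case so that {John, Mary} and {Mary, John} resolve to the same theme, whereas the role-labelled tuple keeps the roles explicit.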

Unlike classic modal logic, which requires a consistent set of propositions for a possible world, we define a situation to be a series of compatible events that may result in a consistent representation of the local world. For example, the two statements object1.color=red and object1.weight=100g can be embedded into the same situation, while object1.price=100$ and object1.price=20$ cannot, as together they may cause inconsistency in the commonsense world. In such a case, the analyst is requested to embed these statements into two different discourses.
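The compatibility condition in this example can be sketched as a simple check. Two assumptions are made here purely for illustration: every attribute is treated as single-valued, and statements are assigned greedily to the first compatible situation. In the approach described above, it is the analyst who decides where to embed a statement; the function names are hypothetical.

```python
from typing import Any, Dict, Hashable, List, Tuple

Statement = Tuple[Hashable, str, Any]            # (entity, attribute, value)
Situation = Dict[Tuple[Hashable, str], Any]      # (entity, attribute) -> value


def compatible(situation: Situation, stmt: Statement) -> bool:
    """A statement fits if its theme is new or already has the same value."""
    entity, attribute, value = stmt
    key = (entity, attribute)
    return key not in situation or situation[key] == value


def embed(statements: List[Statement]) -> List[Situation]:
    """Place each statement in the first situation that accepts it."""
    situations: List[Situation] = []
    for stmt in statements:
        entity, attribute, value = stmt
        for situation in situations:
            if compatible(situation, stmt):
                situation[(entity, attribute)] = value
                break
        else:
            # No existing situation is compatible: open a new one.
            situations.append({(entity, attribute): value})
    return situations


statements = [
    ("object1", "color", "red"),
    ("object1", "weight", "100g"),
    ("object1", "price", "100$"),
    ("object1", "price", "20$"),
]
situations = embed(statements)
```

With these four statements, the color, weight, and first price share one situation, while the conflicting second price is placed in a separate one.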

A generic form of CDL expressions is: discourse-id [theme-rheme pair(s)]. For example, a CDL expression for a conversation segment:

‘The table cloth’s price is 500.’ might be:

[S.expresses=proposal-1] @discourse-1

[table-cloth-3.price=500] @proposal-1

if the utterance is interpreted as “in the given discourse [discourse-1], S expressed proposal-1 in which the price of a table-cloth [table-cloth-3] is 500.”
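As a rough illustration of how such expressions might be processed, the sketch below parses the two example lines and groups them by the context named after the @ sign. The surface syntax is inferred from these two lines only; the `Expression` type, the regular expression, and the function names are assumptions rather than a CDL specification.

```python
import re
from typing import Dict, List, NamedTuple


class Expression(NamedTuple):
    entity: str
    attribute: str
    value: str
    context: str


# Assumed surface syntax: [entity.attribute=value] @context
PATTERN = re.compile(
    r"^\[(?P<entity>[^.\]=]+)\.(?P<attribute>[^=\]]+)=(?P<value>[^\]]+)\]"
    r"\s*@(?P<context>\S+)$"
)


def parse(line: str) -> Expression:
    """Turn one bracketed statement into its four parts."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised expression: {line!r}")
    return Expression(**match.groupdict())


def group_by_context(lines: List[str]) -> Dict[str, List[Expression]]:
    """Collect the statements that belong to each discourse or proposal."""
    grouped: Dict[str, List[Expression]] = {}
    for line in lines:
        expr = parse(line)
        grouped.setdefault(expr.context, []).append(expr)
    return grouped


lines = [
    "[S.expresses=proposal-1] @discourse-1",
    "[table-cloth-3.price=500] @proposal-1",
]
grouped = group_by_context(lines)
```

Because the value of the first statement (proposal-1) is also the context of the second, the grouped result makes the nesting of the proposal inside the discourse easy to follow.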

Fig. 5. CDL as an integral part of the conversation envisioner

Finally, in cases where the ontological analysis is difficult to resolve at the current state (e.g., C: ‘How much is that?’), the analyst should proceed, assuming that the ontological promiscuity will be resolved later (e.g., S: ‘The red one?’).

The role of CDL in the conversation envisioner is illustrated in Fig. 5. The main idea is to move from annotation to the structural representation of CDL in order to understand and predict the conversation. Given the annotation rubric (Fig. 2), meta-participants augment the raw transcription of the scenario into annotated transcripts. To this end, they consult the CDL knowledge-base to find relevant rules or worked examples that transform an abstract idea into a CDL expression, either by using existing entity-attribute pairs in the CDL ontology or by introducing new ones. The generated entity-attribute-value expressions are then stored in the CDL documents and indexed by the knowledge-base for future use. The whole process is governed by the CDL manager, which consists of: (i) annotated transcripts, including interpretations that abstract annotated events into CDL expressions by unveiling latent information not appearing on the surface, (ii) an ontology, including a basic vocabulary and relations that can be extended on demand, (iii) weakly structured documents that record sessions to be used as the background for new conversations or as a source from which to induce conversation knowledge, and (iv) a knowledge-base for interpreting and predicting conversations, which includes hypotheses proposed by meta-participants and can also be used to implement AI-participants and AI-interpreters.

5 Experimental Evaluation

A preliminary experiment was conducted to investigate the following questions:

  Q1. Does VRCE help participants select better choices during conversation to make a deal?

  Q2. Does VRCE raise the participants’ awareness of the situation so that they refine their choices for successful interaction?

  Q3. Do the participants find VRCE useful for perceiving the situation in cross-cultural interaction?

The participants of this experiment were 20 (under-)graduate students of our university (Japanese, French, Thai, Korean, Palestinian, Chinese, American, etc.), including 11 females and 9 males. We used the bargaining scenario with different branches and cultural points obtained from real interaction in a cultural context that was unfamiliar to the participants. The scenario was reconstructed in VR and augmented with CE provided by (meta-)participants.

5.1 Procedure

The participants were asked to play the role of the customer and try to make a deal with the agent shopkeeper. While most of the conversation was fixed and the participants could only hear and read their own sentences as well as the agent’s sentences, there were some branching points in the conversation where the participants were asked to choose their next utterance to the shopkeeper from the given options (16 branches at 2 levels, 4 choices at each level). There were an equal number of successful (closing the deal successfully) and failure branches. For instance, in the case of a failure, participants’ attitude or offer made the shopkeeper very angry and reluctant to sell anything to them. Even in such a situation, participants were still provided with a chance to repair the conversation and revive the deal. Each branch was given a specific score as all branches were distinct from each other in terms of mood, outcome and final price. The experiment consisted of three parts as follows:

Part I: The participants were randomly divided into two groups (CE and control). The CE group received the interpretations during the conversation, especially before selecting the branches, whereas the control group did not receive anything. The CE was provided by the assistant agent and was taken from the meta-participants’ analysis, which included a summary of the interaction (verbal and non-verbal), a mood analysis, and a description of the situation from the shopkeeper’s point of view, without giving any suggestions on choice selection (Q1).

Part II: The participants in the control group were given a second chance to redo the conversation, this time augmented by CE, and to select their choices again. This was done in an attempt to evaluate the effect of CE on building common ground and raising the participants’ awareness so that they review their choices (Q2).

Part III: To address the third research question (Q3), we conducted another experiment in which all participants received CE. The participants evaluated the usefulness of the interpretations for understanding the situation, given the different cultural background. This part was followed by a questionnaire to elicit participants’ feedback on VRCE.
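The branching design described at the start of this procedure (two levels with four choices each) can be checked with a short enumeration. The constant names and the tuple encoding of a branch are hypothetical; which particular branches count as successful is not specified here and is therefore not modeled.

```python
from itertools import product

LEVELS = 2
CHOICES_PER_LEVEL = 4

# Each branch is identified by the option picked at every level,
# e.g. (0, 3) = first option at level 1, fourth option at level 2.
branches = list(product(range(CHOICES_PER_LEVEL), repeat=LEVELS))

total = len(branches)          # 4 * 4 paths through the two branching points
per_outcome = total // 2       # equal split between successful and failure branches
```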

5.2 Analysis of the Results

Table 1 compares the results of the CE group and the control group (Part I) and shows that the participants’ average score in the CE group (\(M=64.40\)) was significantly higher than that in the control group (\(M=26.20\)) [\(t(18)=2.32, p=.03\)]. This suggests that providing interpretations affected the participants’ scores, as it helped them understand the situation better and hence choose the options that lead to a successful deal. The result provides a positive answer to Q1 regarding the usefulness of CE for selecting better choices and realizing the goal of the task, i.e., closing a deal.
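For readers who wish to see how the reported statistic relates to its p-value, the sketch below derives the degrees of freedom from the 20 participants and integrates the Student t density numerically. It assumes a two-tailed independent-samples test and uses only the reported t value, not the raw scores; the function names are illustrative.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    central_half = total * h / 3       # P(0 < T < |t|)
    return 1 - 2 * central_half


df = 20 - 2                 # two independent groups drawn from 20 participants
p = two_tailed_p(2.32, df)  # reported statistic t(18) = 2.32
```

Under these assumptions, the computed value should land close to the reported \(p=.03\).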

Table 1. T-test analysis of CE vs. control groups

The results in Table 2 show the effect of VRCE in helping the participants gain a better understanding of the situation and revise their choices in the second run (Part II). The results provide a positive answer to Q2, indicating that participants’ average scores increased significantly (by 41%) after they received CE [\(t(4)=2.83, p=.04\)]. While these results might be influenced by a repetition effect, participants commented that the assistant agent substantially helped them to revise their choices.

Table 2. Paired-sample t-test
Table 3. Participants’ feedback on VRCE using a Likert-scale questionnaire

In Part III, the participants evaluated the interpretations as useful in 85.71% of the instances on average. This result was reinforced by participants’ feedback on a Likert-scale questionnaire (1: strongly disagree \(\sim \) 5: strongly agree), as shown in Table 3. As the results suggest, the majority of the participants believed that the interpretations were useful for perceiving the situation in cross-cultural interaction (Q3). Moreover, the idea of VRCE, the setting of the VR environment, and the game-like nature of the experiment received positive feedback.

6 Conclusion

This paper introduces CE as a computational framework to highlight the tacit information of the conversation. Using a VR platform, we envisioned a bargaining scenario as a situated interaction with multiple goals, cultural implications, and emotional effects. VRCE is a platform that allows (meta-)participants to engage in a story and learn it from different perspectives. It aims to provide the (meta-)participants with the maximum degree of freedom with respect to space and time and to support multiple tasks ranging from the end-user service (learning and online assistance) to the meta-user service (analysis and synthesis). In this study, we also introduced CDL as a language to describe the structure of the conversation, and the CDL manager as a means to encode (meta-)participants’ interpretations. While there is still much room for improvement, preliminary experiments showed that VRCE could facilitate common ground building in situated interactions, especially when the cultural backgrounds are different. Future directions include using VRCE as an educational tool to facilitate cross-cultural interactions and as a scientist’s workbench for envisioning tacit dimensions of the conversation in order to design agents that are aware of these dimensions.