
Travel Agency Task Dialogue Corpus: A Multimodal Dataset with Age-Diverse Speakers

Published: 16 August 2024

Abstract

When individuals communicate, they use different vocabularies, speaking speeds, facial expressions, and gestures, depending on those with whom they are speaking. This study focuses on the age of the speaker as a factor that affects communication style. We collected a multimodal dialogue corpus with speakers of various ages. We used travel as the topic, as it interests people of all ages, and we set up a task based on a tourism consultation between an operator and a customer at a travel agency. This article presents the details of the dialogue task, the collection procedures and annotations, and an analysis of the characteristics of the dialogues and facial expressions, focusing on the age of the speakers. The results of the analysis suggest that adult speakers state their opinions more independently, older speakers express their opinions more frequently than the other age groups, and speakers in the operator role smile more frequently at minors.

1 Introduction

Task-oriented dialogue systems have long been a central topic in dialogue research [3, 4, 23, 36]. Recently, deep neural networks have been successfully applied to response generation [11, 29, 33] and dialogue state tracking [9, 19, 35]. These studies mainly focus on generating an appropriate response to user inputs.
When individuals communicate, they use different vocabularies, speaking speeds, facial expressions, and gestures, depending on those with whom they are communicating. For example, with children, speakers may use simpler vocabulary and speak with more emotion, whereas with the elderly, people may speak more slowly. By contrast, dialogue systems currently in use rarely change their speaking style or dialogue strategy according to the user. We believe that dialogue systems should alter their dialogue strategies according to who the user is in order to accomplish their tasks more efficiently and increase user satisfaction.
A range of factors, such as gender, social relationships, and roles, significantly affect communication. Here, however, we focused on the age of the speaker, which dramatically affects communication: speakers of various ages are relatively easy to recruit, and age is among the most influential of these factors. We collected a multimodal dialogue corpus with speakers ranging from children to older adults (Figure 1).
Fig. 1.
Fig. 1. Multimodal dialogue corpus with a wide range of speaker ages (left: operators, right: customers). Customers are minors (upper right), adults (middle right), and older adults (lower right).
For the dialogue topic, we focused on travel, as it interests people of all ages. We created a task based on a tourism consultation between an operator and a customer at a travel agency. During these dialogues, the operator could use a tourist information retrieval system to obtain information on tourist spots. We collected timestamps, queries, and system responses. We also manually transcribed the dialogues and annotated them, noting the tourist spots mentioned and the queries sent to the retrieval system, along with their positions in the dialogue. These data are useful for constructing dialogue systems that access external resources during interactions.
The features of our corpus are as follows:
A wide range of speaker ages, from 7 to 72 years old
More than 115 hours of multimodal dialogue data in Japanese
Operator queries and system outputs aligned with the dialogue
Manual transcriptions with four types of annotation
It is generally unrealistic for children to consult travel agencies. However, children interacting with a dialogue system that provides tourism advice is a realistic scenario, and this corpus can be used as training data for constructing such a system. In addition, because dialogue style can differ significantly between adults and minors, the corpus is valuable for training and evaluating dialogue models that adapt to diverse dialogue strategies. Section 6 shows how dialogue strategies varied among the minor, adult, and elderly participants, and Section 8 presents dialogue act prediction experiments that demonstrate the usefulness of the corpus.
This article describes the dialogue task, the methods and results of the corpus collection, and analyses of the collected corpus with respect to dialogue phase transitions and facial expressions, with the aim of identifying effective interaction strategies for each user age group when constructing dialogue systems.
This article is an updated and extended version of a paper presented at a conference [13]. The conference version included only one type of annotation (general dialogue acts) and an analysis based on it. The revisions include three new types of annotation (task-specific dialogue acts, queries, and tourist spot references), along with new analyses and experiments enabled by these annotations.

2 Related Work

Several multimodal dialogue corpora of two-speaker dialogues have been collected for the analysis of human interactions, facial expressions, emotions, and gestures. The Cardiff Conversation Database (CCDb) [1] contains audio-visual natural conversations with no assigned roles (listener or speaker) and no scenario. Some of the data were annotated with dialogue acts such as Backchannel and Agree, emotions such as Surprise and Happy, and head movements such as Head Nodding and Head Tilt. Each conversation lasted 5 minutes, 300 minutes of dialogue were collected, and participants ranged in age from 25 to 57 years. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [6] is used for communication and gesture analysis. The actors wore markers on their faces, heads, and hands, and two types of dialogue were conducted: improvisations and scripted scenarios. The utterances are annotated with emotion labels, and the total recording time was approximately 12 hours. The NoXi corpus [7] contains dialogues mainly in English, French, and German, annotated with head movements, smiles, gazes, engagement, and so on. The total recording time was approximately 25 hours, and the participants ranged in age from 21 to 50 years. The CANDOR corpus [24] is an English conversation corpus consisting of 1,656 conversations totaling 850 hours. This corpus is larger than the one collected in this study; however, unlike ours, it was transcribed and annotated automatically rather than by hand. In addition, its participants range in age from 19 to 66 years, so it excludes conversations with minors. The CABB dataset [12] is a dialogue corpus of collaborative referential communication games conducted in Dutch; it includes video, audio, and body-motion tracking data. The multimodal dialogue corpus Hazumi [16] includes Japanese conversations between a participant and a system operated using the Wizard of Oz (WoZ) method. It aggregates approximately 65 hours of dialogue1 with 214 participants ranging in age from their 20s to their 70s, excluding minors. Every exchange pair, consisting of a system utterance and the subsequent user utterance, was assigned a sentiment by multiple third-party annotators, and physiological sensor outputs were recorded simultaneously in one version of the corpus. In addition, a multimodal corpus of persuasive dialogues between participants and a WoZ-operated android has been constructed [15]. In this study, we collected more than 115 hours of data, which exceeds the two-party multimodal dialogue corpora noted previously, and the age range of our speakers is also wider than in these previous studies.
Multimodal corpora containing conversations between multiple people have also been collected. RoomReader [25] is a multi-party, multimodal dialogue corpus consisting of approximately 8 hours of dialogue collected via Zoom. The dialogues were transcribed automatically, manually corrected, and annotated with engagement. The Belfast storytelling dataset [17], the AMI meeting corpus [8], the ICSI meeting corpus [14], Computers in the Human Interaction Loop (CHIL) [27], and Video Analysis and Content Extraction (VACE) [10] are well-known examples. These corpora contain both two-person and multi-person dialogues, primarily with speakers in their 20s to 70s, and generally exclude minor speakers.
Several multimodal dialogue datasets were constructed through the extraction of scenes from TV series. The Understanding and Response Prediction dataset [28] is a multimodal dialogue corpus consisting of approximately 42,000 scenes from TV dramas. This corpus was constructed for the boundary prediction of dialogue scenes and response generation from scene recognition. The Multimodal EmotionLines Dataset (MELD) [21] is a dialogue emotion recognition dataset that was created through the annotation of emotional information in dialogue scenes from the television show Friends. Another dialogue emotion recognition dataset, the Multimodal Multi-scene Multi-label Emotional Dialogue (M3ED) dataset [34], was constructed from Chinese TV series.
Monologue corpora include speakers of a broader age range. CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) [32] is one of the largest, containing 23,500 YouTube videos of 1,000 people, each utterance of which is annotated with an emotion label. In addition, several monologue corpora have been shared in the research community, such as the Multimodal Corpus of Sentiment Intensity (CMU-MOSI) [31], the ICT Multi-Modal Movie Opinion (ICT-MMMO) [30], and the Multimodal Opinion Utterances Dataset (MOUD) [20]. These datasets are relatively large, and they include minor and older speakers; however, they are not dialogue corpora.

3 Travel Agency Task

We collected dialogues between two speakers, one playing the role of an operator and the other the role of a customer, simulating a travel agency consultation. The two speakers made a video call using Zoom,2 and each conversation lasted 20 minutes. The customer prepared a travel concept before the dialogue began (described in Section 3.2) and then consulted with the operator to identify tourist spots that matched the concept. The operator elicited the customer’s requests and recommended tourist spots using information obtained from the tourist information retrieval system (described in Section 3.1).

3.1 Tourist Information Retrieval System

We developed a tourist information retrieval system using the Rurubu data3 provided by JTB Publishing Corporation. Rurubu is among the best-known travel guidebooks in Japan, and the data contain information on approximately 45,000 Japanese sightseeing spots.
A screenshot of the system is given in Figure 2. On the left side of the screen, the operator can specify search queries such as region and area, free keywords, genre (e.g., “See – Buildings – Historical Sites – Historical Buildings,” “Eat – Foreign Cuisine – French Cuisine”), and budget. The right side of the screen shows the search results, including descriptions, maps, images, addresses, and access information.
Fig. 2.
Fig. 2. Tourist information retrieval system. The operator retrieves tourist spot information from the system and provides it to the customer.
We instructed the operators to use the information from the retrieval system as much as possible rather than relying on their memory.
We collected timestamped operation logs and the data displayed by the system during the dialogues. The operation logs are useful for training dialogue models that interact with users while accessing external knowledge resources, and the timestamps can be used to estimate when a query should be issued. The operation logs include “dialogue start,” “dialogue end,” “search query,” and “read more.” The “dialogue start” and “dialogue end” logs align the dialogue data with the log timestamps; they are saved when the operator clicks the “start” and “end” buttons on the system at the beginning and end of the dialogue, respectively. A “search query” log records the search conditions specified by the operator when searching for tourist information. Conditions could be specified in three ways: pull-down menus, check boxes, and free-text boxes. The system saves a log when an item is selected in a pull-down menu, when a check box is clicked, and when input to a text box is completed. When multiple search conditions are specified, one query is issued per condition, and a log is saved for each. For example, if an operator searches for a tourist spot using the three conditions “prefecture=Hokkaido,” “genre=sightseeing,” and “subgenre=museum,” queries are issued three times as “prefecture=Hokkaido,” “prefecture=Hokkaido, genre=sightseeing,” and “prefecture=Hokkaido, genre=sightseeing, subgenre=museum.”
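To make this logging behavior concrete, the following is a minimal Python sketch (not the system’s actual implementation; the function and field names are illustrative) of how an ordered set of search conditions expands into the incremental queries saved in the log.

```python
from typing import Dict, List

def incremental_queries(conditions: Dict[str, str]) -> List[str]:
    """Expand an ordered set of search conditions into the incremental
    query strings saved in the log, one per newly added condition."""
    queries, parts = [], []
    for key, value in conditions.items():
        parts.append(f"{key}={value}")
        queries.append(", ".join(parts))
    return queries

# Example from the text: three conditions yield three logged queries.
print(incremental_queries(
    {"prefecture": "Hokkaido", "genre": "sightseeing", "subgenre": "museum"}))
# ['prefecture=Hokkaido',
#  'prefecture=Hokkaido, genre=sightseeing',
#  'prefecture=Hokkaido, genre=sightseeing, subgenre=museum']
```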

3.2 Dialogue Scenario

Before the dialogue began, each customer specified a concept for the trip they were planning, keeping in mind their own circumstances and a destination they actually wanted to visit.
We adopted two types of dialogue scenarios. In Dialogue Scenario 1, the customer prepares a specific concept by specifying the destination (a prefecture or region in Japan), the season (spring, summer, fall, or winter), the number of people, and their relationships (friends, family, etc.). During the dialogue, the customer decides on three specific destinations. In Dialogue Scenario 2, the customer briefly describes the concept of the trip they want to take, giving a short description of the activities or outings they are hoping for. Examples of these descriptions include “I want to relax in a hot spring resort” and “I want to visit shrines and temples during the day and eat local specialties at night.” During the dialogue, the customer decides on at least one destination.
The dialogues for Scenario 1 were expected to follow a typical pattern in which the operator listens to the customer’s travel concept and recommends tourist spots. This resembles task-oriented dialogue corpora such as MultiWOZ [4] and the Schema-Guided Dialogue dataset (SGD) [22], on which research is actively being conducted. We prioritized contributions to this line of work and used Scenario 1 for two of the three dialogues in each screen mode. Scenario 2, in contrast, is a more challenging setting in which the goal of the dialogue is not specifically determined. We consider it important to study task-oriented dialogues with unclear goals, so we also collected data for Scenario 2.

4 Data Collection

4.1 Recording

We recorded dialogues from November 10, 2020, to February 25, 2021, using Zoom’s local recording function. For each dialogue, we collected an mp4 video file, an m4a audio file, and separate m4a audio files for the operator and the customer. The data collection procedure was approved by the ethics committee for experiments on human subjects at the University of Electro-Communications (No. 19061(2)).
The speakers interacted in two ways: looking at each other’s faces and using screen sharing. In the former setting, the speakers’ faces were displayed side by side on the screen (gallery view), and the speakers interacted while looking at each other (see Figure 1); only the operator could see the screen of the tourist information retrieval system. In the latter setting, both the customer and the operator could see the system screen through Zoom’s screen-sharing function. There were two reasons for including a shared-screen condition: (1) screen sharing has recently become common in video calls, and (2) speakers can communicate more smoothly when they view the same screen.
Each customer interacted three times through gallery view and three times through screen sharing over the course of six dialogues. Table 1 presents the order of these dialogues. For each screen setting, Dialogue Scenario 1 was used twice and Dialogue Scenario 2 was used once.
Table 1.
No. | Screen mode | Scenario type
1 | Gallery view | 1
2 | Gallery view | 1
3 | Gallery view | 2
4 | Screen sharing | 1
5 | Screen sharing | 1
6 | Screen sharing | 2
Table 1. Dialogue Order, Screen Mode, and Scenario in the Recording

4.2 Speakers

Fifty-five people played the customer role: 20 minors, 25 adults, and 10 older adults. Because of the difficulty of recruiting older people who were able to use Zoom,4 the number of older participants was relatively small.
A breakdown of customers by age is given in Figure 3. Minors participated in the data collection with the consent of their parents. As noted in the previous section, each customer took part in six dialogues, giving a total of 330 (\(55 \times 6\)) collected dialogues.
Fig. 3.
Fig. 3. Age and gender distribution of customers.
We used five speakers in the operator role: three had experience working at a travel agency (36 years old/male, 41/female, and 57/female), and the remaining two had customer service experience (35/male and 27/male). The three speakers with travel agency experience participated in 78.2% of the dialogues (258 out of 330).

5 Transcription and Annotation

The collected dialogues were manually transcribed, and we annotated the transcriptions with four types of annotation: (1) a subset of the ISO 24617-2 dialogue act tags (general dialogue acts), (2) task-specific dialogue act tags designed for our task (task-specific dialogue acts), (3) mappings from tourist spots mentioned by the speakers to the tourist spot IDs registered in the retrieval system, and (4) mappings from search queries to functional segments in the dialogues.
Functional segments are defined in the ISO 24617-2 annotation scheme as units smaller than utterances [5]. We performed all four types of annotation on functional segments rather than on utterances.
The following sections describe the procedure used to divide utterances into functional segments and then describe each type of annotation. For each annotation type, one annotator was assigned per dialogue. We verified the consistency of the annotations through experiments, which are also described in the corresponding sections.
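As an illustration of how the four annotation layers attach to a single functional segment, the following is a minimal, hypothetical record structure; the field names are illustrative and do not necessarily match the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnnotatedSegment:
    """One functional segment with the four annotation layers (hypothetical schema)."""
    speaker: str                    # "operator" or "customer"
    text: str                       # transcribed segment text
    general_act: str                # ISO 24617-2 subset tag, e.g. "Inform"
    task_act: str                   # task-specific tag, e.g. "SpotRequirement"
    spot_ids: List[int] = field(default_factory=list)  # referenced tourist spot IDs
    reference_level: Optional[int] = None               # 0, 1, or 2 when spot_ids is non-empty
    query: Optional[str] = None                          # search-query difference mapped to this segment

seg = AnnotatedSegment(
    speaker="customer",
    text="I want to go to Hokkaido.",
    general_act="Inform",
    task_act="SpotRequirement",
    query="prefecture=Hokkaido",
)
```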

5.1 Functional Segments

We manually transcribed the collected dialogues and divided the utterances into functional segments according to the definition above. To confirm the consistency of the segmentation, three people independently segmented 2,109 utterances from 10 dialogues, and we calculated perfect and partial matching rates. A partial match is counted when two segmentation results become identical if one of them is divided into smaller segments. For example, consider the two segmentations “AA BB/CC DD” and “AA/BB/CC DD” (where “/” marks a segment boundary). Splitting the first “AA BB” into “AA/BB” makes it identical to the second, so this is a partial match. However, “AA/BB CC/DD” and “AA BB/CC DD” are not a partial match, as neither can be made identical to the other by splitting only one of them. The segmentation experiment showed a mean perfect matching rate of 0.70 and a mean partial matching rate of 0.97, indicating a high degree of agreement.
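The matching criteria can be expressed in terms of segment boundaries: two segmentations match perfectly when their boundary sets are identical, and partially when one boundary set is contained in the other. The sketch below assumes this boundary-set reading and a per-utterance computation, which the text does not state explicitly.

```python
def boundaries(segments):
    """Convert a segmentation (list of segments, each a list of tokens)
    into the set of token positions at which a boundary is placed."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def perfect_match(a, b):
    return boundaries(a) == boundaries(b)

def partial_match(a, b):
    """True if one segmentation can be turned into the other by only
    splitting segments further (i.e., the boundary sets are nested)."""
    ba, bb = boundaries(a), boundaries(b)
    return ba <= bb or bb <= ba

a = [["AA", "BB"], ["CC", "DD"]]    # "AA BB / CC DD"
b = [["AA"], ["BB"], ["CC", "DD"]]  # "AA / BB / CC DD"
print(perfect_match(a, b), partial_match(a, b))  # False True
```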

5.2 General Dialogue Act

We used a subset of the ISO 24617-2 annotation scheme [5] as a general tag set for dialogue acts. In the ISO 24617-2 scheme, tags are grouped into nine dimensions, and tags from different dimensions can be assigned to the same segment. To restrict each segment to a single tag, as in many other dialogue act tag sets, we excluded the tags in the following dimensions, which are annotated in addition to tags from other dimensions: Turn Management (six tags), Own Communication Management (three tags), Partner Communication Management (two tags), and Discourse Structuring (one tag). Table 2 presents the 44 tags used. Note that the Other tag in the table is not an ISO 24617-2 tag; it is largely assigned to sections that were difficult to hear and transcribe. Our data would be compliant with ISO 24617-2 if the tags in the excluded dimensions were additionally annotated.
Table 2.
Inform | AddressSuggest | AcceptSuggest
Disagreement | DeclineSuggest | Instruct
Answer | AutoPositive | Agreement
Confirm | AutoNegative | Correction
Disconfirm | AlloPositive | Question
AlloNegative | SetQuestion | FeedbackElicitation
PropositionalQuestion | Stalling | Pausing
ChoiceQuestion | CheckQuestion | InitGreeting
Offer | ReturnGreeting | InitSelfIntroduction
AddressOffer | AcceptOffer | ReturnSelfIntroduction
DeclineOffer | Apology | AcceptApology
Promise | Request | Thanking
AddressRequest | AcceptThanking | InitGoodbye
AcceptRequest | DeclineRequest | ReturnGoodbye
Suggest | Other |
Table 2. Dialogue Act Tags
To confirm tagging consistency, we conducted an experiment in which three annotators independently annotated randomly selected dialogues: 9,220 functional segments from 10 dialogues were annotated with dialogue act tags. The overall Fleiss’ \(\kappa\) was 0.75, indicating good agreement. The proposers of the ISO 24617-2 annotation scheme [5] reported Cohen’s \(\kappa\) values of 0.21 to 0.58 for the dimensions corresponding to the tags used in this study, which are lower than our result. They calculated Cohen’s \(\kappa\) only per dimension, so a direct comparison is not possible, but our experimental results suggest that our annotations were adequate. For reference, their lowest value (0.21) was for the Auto-Feedback dimension, consisting of AutoPositive and AutoNegative, and the highest (0.58) was for the Time Management dimension, consisting of Stalling and Pausing.
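For reference, Fleiss’ \(\kappa\) for such an agreement experiment can be computed as in the following sketch (a generic implementation, not the authors’ script), given a table of per-segment label counts.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa. `counts` has shape (n_items, n_categories);
    counts[i, j] is the number of annotators who gave item i label j.
    Assumes the same number of annotators rated every item."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal label distribution.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 annotators, 4 segments, 3 candidate tags.
toy = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]])
print(round(fleiss_kappa(toy), 3))
```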

5.3 Task-Specific Dialogue Act

We designed and annotated task-specific dialogue act tags specifically for this dataset. Task-specific tags were assigned to functional segments in the same manner as the general dialogue act tags. The tags were designed according to the following policy: (1) the targets of the annotation are utterances related to the tourism consultation, (2) the tags are associated with the tourist spot information included in the search results of the retrieval system (business hours, address, etc.), and (3) any utterance unrelated to the tourism consultation is assigned the tag “none.”
We used the following procedure for tag design:
(1) Utterances were grouped by content and by their associations with the information contained in the search queries and search results of the retrieval system.
(2) Each group was assigned a dialogue act tag based on the grouping outcomes.
(3) The authors annotated the dialogues to confirm that the tag set created in step 2 was adequate for the annotation.
(4) The tag set was modified on the basis of the annotation results.
(5) Steps 3 and 4 were repeated until no further tag set modification was required.
Using this procedure, we designed the 37 tags shown in Table 3. The details of every tag and annotation examples are presented in the appendix.
Table 3.
DirectionQuestion | SeasonQuestion | PeopleQuestion
AgeQuestion | ExperienceQuestion | RequestQuestion
SearchAdvice | RequestConfirm | ChecklistConfirm
AddChecklist | TravelSummary | SearchInform
PhotoInform | SearchConditionInform | NameInform
IntroductionInform | OfficeHoursInform | PriceInform
FeatureInform | AccessInform | PhoneNumberInform
ParkInform | EmptyInform | MistakeInform
OperatorSpotImpression | SearchResultInform | SpotRequirement
CustomerExperience | SpotRelatedQuestion | RequestRecommendation
SpotDetailsQuestion | SpotImpression | OnScreenSuggest
OnScreenQuestion | OnScreenChoice | SpecifyQuery
None | |
Table 3. Task-Specific Dialogue Act Tags
Three annotators annotated 10 dialogues to confirm the annotation consistency. Fleiss’ \(\kappa\) for the results was 0.61, indicating substantial agreement.

5.4 Query Annotation

Operators could operate the tourist information retrieval system during the dialogue, and we collected the logs of these operations along with the displayed contents. This annotation focuses on the search queries sent to the system, marking the segments that the operators relied on when formulating the queries. The queries are compliant with the Rurubu Data API.5
Annotation was performed on segments. The annotators mapped each query to all segments that motivated it. For example, when the customer said “I want to go to Hokkaido” and the operator then selected the “prefecture: Hokkaido” condition, the query annotation “prefecture=Hokkaido” was mapped to the segment “I want to go to Hokkaido.” The annotation target is the updated difference in the operator-issued queries. For example, when the operator issued the query “prefecture=Hokkaido, genre=sightseeing, subgenre=museum” and the previous query was “prefecture=Hokkaido, genre=sightseeing,” the annotation target was “subgenre=museum.”
When an operator issued a single query based on several segments, the annotator annotated all of those segments. Both customer and operator segments were annotated, as the operator sometimes proposed the query to the customer. In some cases, a query was issued as the result of an erroneous operation, in which case the annotator did not map that query to any segment.
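The query “difference” that serves as the annotation target can be derived mechanically from consecutive logged queries; a minimal sketch (the condition names are illustrative):

```python
def query_diff(previous: dict, current: dict) -> dict:
    """Return the conditions that were added or changed in `current`
    relative to `previous`; these are the annotation targets."""
    return {k: v for k, v in current.items() if previous.get(k) != v}

prev = {"prefecture": "Hokkaido", "genre": "sightseeing"}
curr = {"prefecture": "Hokkaido", "genre": "sightseeing", "subgenre": "museum"}
print(query_diff(prev, curr))  # {'subgenre': 'museum'}
```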
An example of the annotation is shown in Table 4.
Table 4.
Speaker | Utterance/Functional segment | Query
Operator | What kind of trip are you planning? |
Customer | Uh, |
 | to Kyoto. | region=Kinki, prefecture=Kyoto
Operator | Yes. |
Customer | Ah, |
 | I’d like to go there during the fall foliage. | condition=Autumn
Operator | Yes. |
Table 4. Query Annotation Example
Three annotators each independently annotated 10 dialogues containing 180 queries to confirm the consistency of this annotation type. The mean complete match rate was 0.66, and the partial match rate was 0.84, indicating moderate to high agreement.

5.5 Annotation of Tourist Spot References

The operators selected and recommended suitable tourist spots for customers from the spots displayed in the tourist information retrieval system; however, they did not explicitly record which of the displayed spots they selected. This type of annotation therefore marks the tourist spots mentioned in the operator’s segments. Each tourist spot in the system has a unique ID, and we annotated each segment referring to a tourist spot with the corresponding spot ID. Multiple tourist spot IDs were assigned when a segment mentioned several tourist spots.
We also assigned a reference level to each segment to indicate how detailed the reference to a tourist spot was, that is, whether the operator mentioned the spot merely as an option or as a concrete recommendation. Table 5 presents the reference levels: level 1 indicates that the segment does not include the detailed information displayed on the system, and level 2 indicates that it does. The system lists the names and photographs of 10 tourist spots for each query, and details about each spot can be viewed by clicking the “Read more” button; the operation log records when this button was clicked. The annotator assessed the reference level of each segment based on the operation log and the segment content. Reference level 0 was assigned primarily when the operator mentioned tourist spots not displayed on the system at that time, such as in a review of the dialogue, which operators frequently gave at the end of the dialogue.
Table 5.
Level | Description | Sample segment
2 | Cite tourist spots based on detailed information displayed in the retrieval system | It opens at 10:00 a.m.
1 | Mention tourist spots based on the spot search results displayed in the system | There is the Hokkaido Botanical Garden.
0 | Allude to tourist spots without using the system | Looking back, you were planning to visit the Tokyo Tower first.
Table 5. Reference Level
Three annotators independently annotated 10 dialogues to confirm the consistency of this annotation type. Of these dialogues, 1,563 segments were annotated by at least one annotator, yielding a mean exact matching rate of 0.72 and a mean partial matching rate of 0.74 for the tourist spot IDs. Fleiss’ \(\kappa\) was 0.61 for the reference levels assigned to the 919 segments with exactly matching tourist spot IDs, indicating substantial agreement.

5.6 Data Statistics

Table 6 presents the statistics of the corpus. We set the duration of each dialogue to 20 minutes, but dialogues were not automatically terminated after that time, so the total dialogue time is approximately 5% longer than 6,600 \((= 330 \times 20)\) minutes.
Table 6.
Dialogues | 330
Length (minutes) | 6,948
Transcribed utterances | 120,140
– Operator speaker utterances | 66,224
– Customer speaker utterances | 53,916
Functional segments | 245,543
– Operator speaker segments | 152,500
– Customer speaker segments | 93,043
Spot ID annotated segments | 43,768
Total number of annotated spot IDs | 46,330
– Mention level 0 | 2,948
– Mention level 1 | 24,124
– Mention level 2 | 19,258
Query annotated segments | 5,164
Table 6. Corpus Statistics
Table 7 shows a dialogue excerpt with dialogue act annotations for a minor customer, and Table 8 shows one for an elderly customer. These examples confirm that the operator’s speaking style and content differ greatly depending on the customer’s age.
Table 7.
Speaker | Segment | General dialogue act / Specific dialogue act
Operator | Well, | Stalling / None
 | there are a lot of places around the market where you can eat sushi. | Inform / IntroductionInform
Customer | Yes. | AutoPositive / None
Operator | What kind of sushi do you want to eat? | PropositionalQuestion / RequestQuestion
Customer | Well, | Stalling / None
 | a sushi restaurant that has a lot of salmon roe, like a bowl of salmon roe. | Answer / SpotRequirement
Operator | Oh, | Stalling / None
 | a sushi restaurant that has a rice bowl with lots of salmon roe on it, don’t you? | CheckQuestion / RequestConfirm
Customer | Yes. | Answer / None
Operator | Okay, | AutoPositive / None
 | I’ll try to find one. | Inform / SearchInform
Customer | Yes. | AutoPositive / None
Operator | Okay, | AutoPositive / None
 | wait a minute. | Pausing / None
Table 7. Dialogue Examples (Minor Speaker)
Table 8.
Speaker | Segment | General dialogue act / Specific dialogue act
Operator | Is it hard for you to walk? | PropositionalQuestion / None
Customer | Yeah, | AutoPositive / None
 | I used to love climbing mountains when I was younger, but my knees are getting worse with age. | Answer / None
Operator | Uh, | Stalling / None
 | there are various | CheckQuestion / SearchResultInform
 | uh, | Stalling / None
 | types of houses with kabuki roofs, what do you think? | CheckQuestion / SearchResultInform
 | And | Stalling / None
 | there is also a place like a museum, | Inform / SearchResultInform
 | uh, | Stalling / None
 | it seems to exhibit these old houses. | Inform / SearchResultInform
Table 8. Dialogue Example (Elder Speaker)

6 Analysis of Task-Specific Dialogue Act Sequence

In the following sections, we analyze verbal and non-verbal information to determine how speaker behavior differs between age groups. One of the most commonly used methods for visualizing the temporal structure of sequential data is the hidden Markov model (HMM) [18]. In this section, we analyze the verbal information by training HMMs on sequences of dialogue acts and discussing differences in dialogue transitions among the age groups.

6.1 Experimental Conditions

We used all 330 dialogues with dialogue act tags for the analysis. We compared models of dialogue act tag sequences between age groups and between screen modes. For the comparison between age groups, the data were separated into 120, 150, and 60 dialogues for minor, adult, and older customers, respectively. For the screen modes, there were 165 dialogues each for the gallery view and screen-sharing conditions. We used the hmmlearn package6 in Python to train the HMMs. Each speaker’s dialogue act tags were treated as distinct labels. Dialogue act tags that have little impact on the dialogue content were excluded: “Stalling,” “AutoPositive,” “AlloPositive,” “AutoNegative,” “AlloNegative,” “Pausing,” and “Other.” In all, 52 dialogue act tags were used for the analysis. We designed the HMM states to roughly correspond to speaker intentions; Table 9 presents the correspondence between states and dialogue act tags. When training the HMMs, we defined the output probability of each state according to this correspondence, setting the output probability to 0 for the dialogue act tags not listed for that state.
Table 9.
State | Intention | Dialogue act tags
1 | Operator question | DirectionQuestion, SeasonQuestion, PeopleQuestion, AgeQuestion, ExperienceQuestion, RequestQuestion, SearchAdvice, OnScreenQuestion, OnScreenSuggest
2 | Operator confirmation | RequestConfirm, ChecklistConfirm, AddChecklist
3 | Operator information | TravelSummary, SearchInform, PhotoInform, SearchConditionInform, NameInform, IntroductionInform, OfficeHoursInform, PriceInform, FeatureInform, AccessInform, PhoneNumberInform, ParkInform, EmptyInform, MistakeInform, OperatorSpotImpression, SearchResultInform
4 | Customer question | SpotRequirement, SpotRelatedQuestion, RequestRecommendation, SpotDetailsQuestion
5 | Customer inform | SpotImpression, OnScreenChoice, SpecifyQuery, CustomerExperience
Table 9. Correspondence between State and Dialogue Acts for HMM Analysis
In this article, the HMM structure is designed to capture transitions among these states. We therefore constructed ergodic HMMs, which allow transitions from any state to any other, for the analyses. To compare age groups, we trained HMMs for three customer groups: minors, adults, and older customers. To compare screen modes, we trained HMMs for the gallery view and screen-sharing conditions. The models were discrete HMMs whose output probability distributions were multinomial.
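A minimal sketch of this setup with hmmlearn is shown below. It assumes hmmlearn’s CategoricalHMM (the class name and constructor details vary across versions) and uses a reduced toy symbol inventory; the state-to-tag constraint from Table 9 is imposed by zero-initializing the disallowed emission probabilities, which EM re-estimation preserves.

```python
import numpy as np
from hmmlearn import hmm  # assumes a version providing CategoricalHMM

# Toy inventory: the real setup uses 52 speaker-distinguished dialogue act
# symbols and the 5 states of Table 9; 12 symbols keep the sketch small.
N_STATES, N_SYMBOLS = 5, 12
allowed = {0: [0, 1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11]}  # illustrative

# Uniform emission probability over the symbols allowed in each state, 0 elsewhere;
# zero entries stay zero under EM, enforcing the state/tag correspondence.
emission = np.zeros((N_STATES, N_SYMBOLS))
for state, symbols in allowed.items():
    emission[state, symbols] = 1.0 / len(symbols)

model = hmm.CategoricalHMM(n_components=N_STATES, n_iter=100, random_state=0,
                           init_params="st",  # library initializes start/transition probs
                           params="ste")      # EM re-estimates all parameters
model.emissionprob_ = emission

# Each dialogue is a sequence of symbol indices; hmmlearn takes the
# concatenated sequences plus per-dialogue lengths.
dialogues = [[0, 8, 3, 5, 10], [1, 9, 4, 6, 11, 2, 5]]  # toy data
X = np.concatenate(dialogues).reshape(-1, 1)
lengths = [len(d) for d in dialogues]
model.fit(X, lengths)
print(model.transmat_.round(2))  # ergodic 5x5 state-transition matrix
```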

6.2 Experimental Results

6.2.1 Comparison between Age Groups.

Figures 4, 5, and 6 show the HMMs trained for each age group. First, the structures of the models are evidently similar, suggesting that the operators applied similar strategies, based on their expertise, in conversations with all customers. A characteristic of our corpus is that it includes dialogues with customers of widely varying ages, from minors to the elderly. To investigate age-related variations in dialogue strategies, we compared the models for minor customers (Figure 4) and elderly customers (Figure 6) against the model for adult customers (Figure 5). Comparing Figures 4 and 5, the self-loop transition probabilities of the customer states are smaller in the minor customer model than in the adult model. In addition, the transition probabilities from “Customer question” to “Operator confirmation” and from “Customer information” to “Operator question” are larger in the minor customer model than in the adult customer model. These results suggest that the operators actively elicited opinions from minor customers, who could not clearly express their opinions.
Fig. 4.
Fig. 4. HMM for dialogues with minor customers. The letter before each tag indicates the speaker: “O” is the operator, and “C” is the customer. Operator states are represented by red circles, and customer states by blue circles.
Fig. 5.
Fig. 5. HMM for dialogues with adult customers. The letter before each tag indicates the speaker: “O” is the operator, and “C” is the customer. Operator states are represented by red circles, and customer states by blue circles.
Fig. 6.
Fig. 6. HMM for dialogues with elderly customers. The letter before each tag indicates the speaker: “O” is the operator, and “C” is the customer. Operator states are represented by red circles, and customer states by blue circles.
Comparing Figures 5 and 6, the self-loop transition probabilities of the customer states are larger in the elderly customer model than in the adult customer model. Additionally, the transition probabilities from the operator states to the customer states are greater in the elderly customer model than in the models of the other age groups. These results indicate that elderly customers took more initiative in talking about their motivations and expectations for the trip than the other age groups.
The preceding analysis indicates that dialogue characteristics differ depending on the age of the customer. Therefore, a travel guidance dialogue system should adjust its dialogue strategy according to the user’s age group to achieve high user satisfaction.

6.2.2 Comparisons between Screen Modes.

Figures 7 and 8 show the HMMs trained for each screen mode. The results show large transition probabilities in the screen-sharing condition from “Operator information” to “Customer information” and from “Operator question” to “Customer information.” These results imply that screen sharing encourages interactions between operators and customers, making it easier for the speakers to share their images of the trip. Interestingly, utterances expressing impressions about locations (“O_OperatorSpotImpression”) appear frequently in Figure 8: operators and customers appear to exchange opinions and impressions more concretely when sharing the screen.
Fig. 7.
Fig. 7. HMM for dialogues with gallery view. The letter before each tag indicates the speaker: “O” is the operator, and “C” is the customer. Operator states are represented by red circles, and customer states by blue circles.
Fig. 8.
Fig. 8. HMM for dialogues with screen sharing. The letter before each tag indicates the speaker: “O” is the operator, and “C” is the customer. Operator states are represented by red circles, and customer states by blue circles.
Tables 10 and 11 show the 20 most frequent bi-grams of dialogue act tags, excluding pairs in which the preceding and succeeding utterances have the same speaker. In both tables, the bi-grams from “O_RequestQuestion” (operator question) to “C_SpotRequirement” (the customer’s requirement for a spot) and from “C_SpotRequirement” to “O_RequestConfirm” (operator confirmation) appear frequently. These bi-grams represent the basic interactions of the travel agency task dialogues. In addition, Table 11 contains dialogue acts specific to the screen-sharing condition, such as “O_OnScreenSuggest” and “C_OnScreenChoice,” showing that operators and customers interacted actively through the shared screen. The tables also show that operators and customers exchanged impressions and feelings more frequently in the screen-sharing condition (e.g., “C_SpotImpression” \(\rightarrow\) “O_OperatorSpotImpression” and “O_OperatorSpotImpression” \(\rightarrow\) “C_SpotImpression”); these bi-grams are ranked considerably higher in the screen-sharing condition than in the gallery view condition. Screen sharing thus appears to encourage operators in particular to convey their impressions.
Table 10.
Order | Bi-gram | Frequency
1 | O_RequestQuestion \(\rightarrow\) C_SpotRequirement | 947
2 | C_SpotRequirement \(\rightarrow\) O_RequestConfirm | 661
3 | O_IntroductionInform \(\rightarrow\) C_SpotImpression | 521
4 | C_SpotImpression \(\rightarrow\) O_IntroductionInform | 338
5 | C_SpotDetailsQuestion \(\rightarrow\) O_IntroductionInform | 300
6 | C_SpotRequirement \(\rightarrow\) O_SearchInform | 292
7 | O_OperatorSpotImpression \(\rightarrow\) C_SpotImpression | 291
8 | O_SearchAdvice \(\rightarrow\) C_SpotRequirement | 275
9 | C_SpotRequirement \(\rightarrow\) O_RequestQuestion | 270
10 | C_SpotImpression \(\rightarrow\) O_OperatorSpotImpression | 253
11 | O_IntroductionInform \(\rightarrow\) C_SpotDetailsQuestion | 242
12 | O_RequestQuestion \(\rightarrow\) C_SpotImpression | 208
13 | O_RequestConfirm \(\rightarrow\) C_SpotRequirement | 193
14 | O_SearchAdvice \(\rightarrow\) C_SpotImpression | 173
15 | C_SpotRequirement \(\rightarrow\) O_SearchAdvice | 172
16 | C_SpotImpression \(\rightarrow\) O_ChecklistConfirm | 145
17 | C_SpotRequirement \(\rightarrow\) O_NameInform | 145
18 | C_SpotImpression \(\rightarrow\) O_AddChecklist | 139
19 | O_DirectionQuestion \(\rightarrow\) C_SpotRequirement | 136
20 | C_SpotImpression \(\rightarrow\) O_NameInform | 134
Table 10. 20 Most Frequent Bi-grams of Dialogue Acts for the Gallery View Condition
Table 11.
Order | Bi-gram | Frequency
1 | O_RequestQuestion \(\rightarrow\) C_SpotRequirement | 721
2 | C_SpotImpression \(\rightarrow\) O_OperatorSpotImpression | 598
3 | C_SpotRequirement \(\rightarrow\) O_RequestConfirm | 582
4 | O_OperatorSpotImpression \(\rightarrow\) C_SpotImpression | 550
5 | O_IntroductionInform \(\rightarrow\) C_SpotImpression | 497
6 | O_OnScreenSuggest \(\rightarrow\) C_OnScreenChoice | 381
7 | C_SpotImpression \(\rightarrow\) O_IntroductionInform | 361
8 | O_OnScreenQuestion \(\rightarrow\) C_OnScreenChoice | 351
9 | C_OnScreenChoice \(\rightarrow\) O_IntroductionInform | 235
10 | O_SearchAdvice \(\rightarrow\) C_SpotRequirement | 230
11 | C_OnScreenChoice \(\rightarrow\) O_OnScreenQuestion | 223
12 | C_SpotRequirement \(\rightarrow\) O_SearchInform | 204
13 | C_OnScreenChoice \(\rightarrow\) O_OperatorSpotImpression | 204
14 | O_OnScreenSuggest \(\rightarrow\) C_SpotImpression | 196
15 | O_RequestConfirm \(\rightarrow\) C_SpotRequirement | 184
16 | O_OperatorSpotImpression \(\rightarrow\) C_OnScreenChoice | 179
17 | C_SpotDetailsQuestion \(\rightarrow\) O_IntroductionInform | 177
18 | C_SpotRequirement \(\rightarrow\) O_SearchAdvice | 171
19 | C_OnScreenChoice \(\rightarrow\) O_RequestConfirm | 167
20 | O_IntroductionInform \(\rightarrow\) C_OnScreenChoice | 165
Table 11. 20 Most Frequent Bi-grams of Dialogue Acts for the Screen-Sharing Condition
These analyses indicate that dialogue characteristics also differ depending on the screen mode. Therefore, a travel guidance dialogue system should also adapt its dialogue strategy to the available modalities.
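A sketch of the bi-gram counting used for Tables 10 and 11 is given below. It assumes each functional segment contributes one speaker-prefixed task-specific tag and that “None” tags are skipped (assumptions not spelled out in the text); only pairs with a speaker change are counted, as stated above.

```python
from collections import Counter

def dialogue_act_bigrams(segments):
    """Count bi-grams of speaker-prefixed task-specific dialogue act tags,
    keeping only pairs in which the speaker changes between the two segments.
    `segments` is a list of (speaker, tag) pairs in dialogue order."""
    labeled = [(spk, f"{'O' if spk == 'operator' else 'C'}_{tag}")
               for spk, tag in segments if tag != "None"]
    counts = Counter()
    for (spk_a, tag_a), (spk_b, tag_b) in zip(labeled, labeled[1:]):
        if spk_a != spk_b:
            counts[(tag_a, tag_b)] += 1
    return counts

toy = [("operator", "RequestQuestion"), ("customer", "None"),
       ("customer", "SpotRequirement"), ("operator", "RequestConfirm")]
print(dialogue_act_bigrams(toy).most_common())
# [(('O_RequestQuestion', 'C_SpotRequirement'), 1),
#  (('C_SpotRequirement', 'O_RequestConfirm'), 1)]
```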

7 Analysis of Facial Expression

For the non-verbal information, we investigated how the operator’s facial expressions differed across customer age groups. We cropped the operator’s face region from the video frames and extracted facial action units (AUs) using OpenFace [2]. Because the facial regions were too small to extract AUs in the screen-sharing condition, which accounts for half of the data, we used the 165 dialogues recorded in gallery view for this analysis.
The AUs were obtained frame by frame and averaged over each dialogue to compare the age groups. We conducted a one-way ANOVA with age group as the factor and performed multiple comparison tests for the AUs that showed significant differences. We focused on the AUs related to smiling. Table 12 shows the results of the multiple comparison tests for AU06 (cheek raiser) and AU12 (lip corner puller), which exhibited significant differences in the ANOVA. These AUs take high values when the speaker smiles.
Table 12.
AU | Comparison | Diff. | \(p\)-Value
AU06 | Minor\(-\)Adult | 0.382 | \(\lt 0.001\)\(^{***}\)
 | Minor\(-\)Older | 0.350 | 0.011\(^{*}\)
 | Adult\(-\)Older | \(-0.031\) | 1.000
AU12 | Minor\(-\)Adult | 0.507 | \(\lt 0.001\)\(^{***}\)
 | Minor\(-\)Older | 0.423 | 0.023\(^{*}\)
 | Adult\(-\)Older | \(-0.084\) | 1.000
Table 12. Results of Multiple Comparison Tests for the Operator’s Facial Expression
\(^{*}\) \(p\lt 0.05\), \(^{**}p\lt 0.01\), \(^{***}\)\(p\lt 0.001\).
As shown in the table, significant differences were observed between the minor and adult groups and between the minor and older groups for both AU06 and AU12, with larger values for the minor group. These results indicate that the operators smiled more frequently at minor customers. A system should therefore adapt not only its verbal but also its non-verbal behavior to the user’s age to provide natural travel guidance.
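The per-dialogue averaging, one-way ANOVA, and post-hoc comparisons described above could be reproduced along the lines of the following sketch; the input file, column names, and the use of Tukey’s HSD as the multiple-comparison procedure are assumptions, since the exact procedure is not specified here.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-dialogue table: one row per gallery-view dialogue with the
# operator's AU06/AU12 intensities averaged over all frames (column names
# follow OpenFace's "AU06_r"/"AU12_r" convention, but the file is illustrative).
df = pd.read_csv("operator_au_means.csv")  # columns: dialogue_id, age_group, AU06_r, AU12_r

for au in ["AU06_r", "AU12_r"]:
    groups = [g[au].values for _, g in df.groupby("age_group")]
    f_stat, p_value = f_oneway(*groups)      # one-way ANOVA over the three age groups
    print(au, f"F={f_stat:.2f}", f"p={p_value:.4f}")
    if p_value < 0.05:
        # Post-hoc pairwise comparisons (Tukey HSD here; an assumed choice).
        print(pairwise_tukeyhsd(df[au], df["age_group"]))
```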

8 Task-Specific Dialogue Act Estimation Experiment

8.1 Experimental Setting

As demonstrated in the previous sections, operators flexibly change their dialogue strategies according to the characteristics of the customer. One use of this dataset is to train and evaluate dialogue models intended to adapt to users by employing different dialogue strategies. In this section, we focus on the dialogues with minors, where adaptation of the dialogue strategy is particularly required, and conduct experiments on predicting the operator’s dialogue acts.
We conducted experiments in three settings: zero-shot, low-resource, and medium-resource. In the zero-shot setting, all data with adult and elderly customers (210 dialogues) were used as training data. In the low-resource setting, we added 18 dialogues (with three minors) to the zero-shot training data. In the medium-resource setting, we added a further 42 dialogues, for a total of 60 dialogues with minors (10 minors) in the training data. The evaluation data were the same in all three settings: 60 dialogues with 10 minors. In the low-resource and medium-resource settings, there was no overlap of customer speakers between the training and evaluation data.
We used the following pre-trained Japanese language models: Japanese T5\(_{large}\),7 Japanese GPT-NeoX,8 and Japanese Dialog Transformers (JDT) [26]. We input six turns of customer and operator utterances as context, together with their task-specific dialogue act tags, into each model, and the model predicted the tags of the next operator utterance. Because one utterance can include multiple segments, there can be more than one correct tag. Evaluation was based on the exact match rate and the partial match rate with respect to the correct tags.
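The two evaluation measures can be implemented as in the following sketch; treating a “partial match” as a non-empty overlap between the predicted and gold tag sets is our assumption, as it is not defined formally here.

```python
from typing import List, Set, Tuple

def tag_match_rates(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float]:
    """Exact match: the predicted tag set equals the gold set for the next
    operator utterance. Partial match: the two sets share at least one tag
    (an assumed reading of "partial")."""
    exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    partial = sum(bool(g & p) for g, p in zip(gold, pred)) / len(gold)
    return exact, partial

gold = [{"RequestConfirm"}, {"SearchInform", "RequestQuestion"}]
pred = [{"RequestConfirm"}, {"SearchInform"}]
print(tag_match_rates(gold, pred))  # (0.5, 1.0)
```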
We used four NVIDIA A100 40-GB GPUs to train the models. We trained each model for 8 epochs, with the batch size set to the maximum the model could handle. We tuned only the learning rate, performing a grid search over {\(10^{-4}\), \(5\times 10^{-5}\), \(10^{-5}\), \(5\times 10^{-6}\), \(10^{-6}\)}.

8.2 Results

The experimental results are presented in Table 13. Training was performed five times with different seeds, and the average values are shown. For all models, the medium-resource setting achieved the highest performance in both exact and partial matches, followed by the low-resource setting and then the zero-shot setting. The performance gain from the low-resource to the medium-resource setting was larger than that from the zero-shot to the low-resource setting, suggesting that a small amount of data is not sufficient to adapt to the change in the operators’ conversational strategies toward minors. Adding further training data could improve performance; however, collecting data from speakers with distinctive conversational strategies is generally costly. It is therefore crucial to develop models that can conduct appropriate dialogues with limited or no data, and the corpus collected in this study is important for this purpose. We confirmed that our corpus is suitable as evaluation data for developing dialogue models that adapt to users with different dialogue strategies.
Table 13.
Model | Zero-shot Exact | Zero-shot Partial | Low-resource Exact | Low-resource Partial | Medium-resource Exact | Medium-resource Partial
Japanese T5\(_{large}\) | .239 | .399 | .247 | .404 | .271 | .439
Japanese GPT-NeoX | .209 | .358 | .223 | .382 | .245 | .402
JDT | .208 | .338 | .209 | .350 | .228 | .358
Table 13. Experimental Results of Task-Specific Dialogue Act Estimation

9 Conclusion

This article presented our multimodal dialogue corpus with a wide range of speaker ages, from children to older adults. The corpus includes 330 dialogues of 20 minutes each. The dialogue task was based on a consultation about tourist spots at a travel agency between two speakers, one playing the role of a travel agency operator and the other a customer. The dialogues were manually transcribed and annotated with four types of annotation: general dialogue acts, task-specific dialogue acts, queries, and tourist spot references.
The number of individuals in each age group included in this corpus is uneven, with data from older speakers being relatively scarce. In future work, we plan to expand our corpus to include age groups that are currently underrepresented. We also plan to analyze dialogues taking into account cultural backgrounds and personalities, and to develop a dialogue model that adapts its strategy based on user characteristics.
Our corpus has been distributed by the Informatics Research Data Repository at the National Institute of Informatics (NII-IDR).9

Acknowledgments

The authors would like to thank Koki Washio for useful discussions.

Footnotes

4
The reason for using Zoom for the collection was the COVID-19 pandemic.

A Details of Task-Specific Dialogue Act Tags

The task-specific dialogue act tags can be categorized into those assigned to operator segments and those assigned to customer segments. Tables 14 and 15 display and define every tag along with sample segments.
Table 14.
Dialogue act | Description | Example
DirectionQuestion | Question on areas for the desired travel | To which destination are you planning to travel?
SeasonQuestion | Question on the desired season | When will you go?
PeopleQuestion | Question about the number of people traveling and their relationships with the customer | How many people are traveling with you?
AgeQuestion | Question on the age of customers or their companions | How old are your children?
ExperienceQuestion | Question about the customer’s experience | Have you ever been to Osaka?
RequestQuestion | Question about the tourist spot request | What would you like to do there?
SearchAdvice | Questions or suggestions related to the tourist spot information retrieval system | Should I look for a restaurant there?
RequestConfirm | Confirmation of requests for tourist spots | You want to go to a spa, don’t you?
DestinationConfirm | Confirmation of destination | Am I correct in assuming that you are going to Yashi Park?
AddDestinationList | Addition to destination list by operator | I’ll add this location to the list.
TravelSummary | Summary of trip planning | Looking back, you plan to visit the Toshogu Shrine first.
SearchInform | Operator’s declaration of intent to search tourist spots in the system | I will now search.
PhotoInform | Provide information on photos displayed on the system | Here is a picture of a meal containing a lot of salmon roe.
SearchConditionInform | Provide information on search conditions | I can also filter by the time required.
NameInform | Provide information on the names of tourist spots | There is a commercial complex called the Sapporo Factory.
IntroductionInform | Provide information on tourist spots based on the system search results | It was established in 1876.
OfficeHoursInform | Provide information on hours of operation and closing dates | Our business hours span 10:00 a.m. to 10:00 p.m.
PriceInform | Provide information on fees and price range | The admission fee is 360 yen.
FeatureInform | Provide information about the characteristics of tourist spots | It is recommended for women even when it rains.
AccessInform | Provide information on access | This location is a five-minute walk from the railway station.
PhoneNumberInform | Provide information on telephone numbers | The phone number is 095 824.
ParkInform | Provide information on parking | There are three parking lots.
EmptyInform | Statement that there are no search results or specific description | I do not see anything in the search results.
MistakeInform | Correcting errors in tourist spot information | Sorry, this store is open on all days of the week.
OperatorSpotImpression | Subjective evaluations and assumptions about a tourist spot by operators | This restaurant looks nice and inexpensive.
SearchResultInform | Report overall search results | It appears there are numerous stores in this location.
OnScreenSuggest | Suggestions for tourist spots on the shared screen | How about this site?
OnScreenQuestion | Questions about tourist spots on the shared screen | Which one looks the best, number 1, 2, or 3?
Table 14. Task-Specific Dialogue Act Tags for Operator Segments
Table 15.
Dialogue act | Description | Example
SpotRequirement | Requests and conditions for tourist spots | I want to go to a hot spring.
CustomerExperience | Statement of past experience | I went there once around five years ago.
SpotRelatedQuestion | Questions about information related to tourist spots | Can I buy souvenirs somewhere in that location?
RequestRecommendation | Questions about the operator’s recommendations | Do you have any recommendations for a place in Hokkaido?
SpotDetailsQuestion | Questions seeking more information about a tourist spot | What are the hours of operation for that store?
SpotImpression | Evaluation of a tourist spot and opinion statement | The scenery looks beautiful.
OnScreenChoice | Expression of interest in tourist spots on the shared screen | Number 3 looks good.
SpecifyQuery | Specification of search conditions on the shared screen | Please launch a query for Ishikawa and Toyama prefectures.
Table 15. Task-Specific Dialogue Act Tags for Customer Segments

References

[1]
Andrew J. Aubrey, David Marshall, Paul L. Rosin, Jason Vendeventer, Douglas W. Cunningham, and Christian Wallraven. 2013. Cardiff Conversation Database (CCDb): A database of natural dyadic conversations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 277–282.
[2]
Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of IEEE Winter Conference on Applications of Computer Vision. 1–10.
[3]
Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, Donald A. Norman, Henry Thompson, and Terry Winograd. 1977. GUS, a frame-driven dialog system. Artificial Intelligence 8, 2 (1977), 155–173.
[4]
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ—A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5016–5026.
[5]
Harry Bunt, Volha Petukhova, David Traum, and Jan Alexandersson. 2017. Dialogue act annotation with the ISO 24617-2 standard. In Multimodal Interaction with W3C Standards. Springer, 109–135.
[6]
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 4 (2008), 335–359.
[7]
Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar. 2017. The NoXi database: Multimodal recordings of mediated novice-expert interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. 350–359.
[8]
Jean Carletta. 2007. Unleashing the killer corpus: Experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation 41, 2 (2007), 181–190.
[9]
Lu Chen, Boer Lv, Chi Wang, Su Zhu, Bowen Tan, and Kai Yu. 2020. Schema-guided multi-domain dialogue state tracking with graph attention neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7521–7528.
[10]
Lei Chen, R. Travis Rose, Ying Qiao, Irene Kimbara, Fey Parrill, Haleema Welji, Tony Xu Han, Jilin Tu, Zhongqiang Huang, Mary Harper, Francis Quek, Yingen Xiong, David McNeill, Ronald Tuttle, and Thomas Huang. 2005. VACE multimodal meeting corpus. In Proceedings of the 2005 International Workshop on Machine Learning for Multimodal Interaction. 40–51.
[11]
Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3696–3709.
[12]
Lotte Eijk, Marlou Rasenberg, Flavia Arnese, Mark Blokpoel, Mark Dingemanse, Christian F. Doeller, Mirjam Ernestus, Judith Holler, Branka Milivojevic, Asli Özyürek, Wim Pouw, Iris van Rooij, Herbert Schriefers, Ivan Toni, James Trujillo, and Sara Bogels. 2022. The CABB dataset: A multimodal corpus of communicative interactions for behavioural and neural analyses. NeuroImage 264 (2022), 119734.
[13]
Michimasa Inaba, Yuya Chiba, Ryuichiro Higashinaka, Kazunori Komatani, Yusuke Miyao, and Takayuki Nagai. 2022. Collection and analysis of travel agency task dialogues with age-diverse speakers. In Proceedings of the 13th Language Resources and Evaluation Conference. 5759–5767.
[14]
Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. 2003. The ICSI meeting corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. I-364–I-367.
[15]
Seiya Kawano, Muteki Arioka, Akishige Yuguchi, Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara, Satoshi Nakamura, and Koichiro Yoshino. 2022. Multimodal persuasive dialogue corpus using a teleoperated Android. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH’22). 2308–2312.
[16]
Kazunori Komatani and Shogo Okada. 2021. Multimodal human-agent dialogue corpus with annotations at utterance and dialogue levels. In Proceedings of the 9th International Conference on Affective Computing and Intelligent Interaction (ACII ’21). 1–8.
[17]
Gary McKeown, William Curran, Johannes Wagner, Florian Lingenfelser, and Elisabeth André. 2015. The Belfast storytelling database: A spontaneous social interaction database with laughter focused annotation. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction. 166–172.
[18]
Toyomi Meguro, Ryuichiro Higashinaka, Kohji Dohsaka, Yasuhiro Minami, and Hideki Isozaki. 2009. Analysis of listening-oriented dialogue for building listening agents. In Proceedings of the SIGDIAL 2009 Conference. 124–127.
[19]
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1777–1788.
[20]
Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 973–982.
[21]
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
[22]
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8689–8696.
[23]
Antoine Raux, Brian Langner, Dan Bohus, Alan W. Black, and Maxine Eskenazi. 2005. Let’s go public! Taking a spoken dialog system to the real world. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH’05).
[24]
Andrew Reece, Gus Cooney, Peter Bull, Christine Chung, Bryn Dawson, Casey Fitzpatrick, Tamara Glazer, Dean Knox, Alex Liebscher, and Sebastian Marin. 2023. The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances 9, 13 (2023), eadf3197.
[25]
Justine Reverdy, Sam O’Connor Russell, Louise Duquenne, Diego Garaialde, Benjamin R. Cowan, and Naomi Harte. 2022. RoomReader: A multimodal corpus of online multiparty conversational interactions. In Proceedings of the 13th Language Resources and Evaluation Conference. 2517–2527.
[26]
Hiroaki Sugiyama, Masahiro Mizukami, Tsunehiro Arimoto, Hiromi Narimatsu, Yuya Chiba, Hideharu Nakajima, and Toyomi Meguro. 2021. Empirical analysis of training strategies of Transformer-based Japanese chit-chat systems. arXiv preprint arXiv:2109.05217 (2021).
[27]
Alex Waibel, Hartwig Steusloff, and Rainer Stiefelhagen. 2005. CHIL: Computers in the Human Interaction Loop. In Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services.
[28]
Yueqian Wang, Yuxuan Wang, and Dongyan Zhao. 2023. Overview of the NLPCC 2023 shared task 10: Learn to watch TV: Multimodal dialogue understanding and response generation. In Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing. 412–419.
[29]
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 438–449.
[30]
Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems 28, 3 (2013), 46–53.
[31]
Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016).
[32]
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2236–2246.
[33]
Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020. Task-oriented dialog systems that consider multiple appropriate responses under the same context. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9604–9611.
[34]
Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, and Haizhou Li. 2022. M3ED: Multi-modal multi-scene multi-label emotional dialogue database. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5699–5710.
[35]
Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1458–1467.
[36]
Victor Zue, James Glass, David Goodine, Hong Leung, Michael Phillips, Joseph Polifroni, and Stephanie Seneff. 1991. Integration of speech recognition and natural language processing in the MIT VOYAGER system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 713–716.

