Several multimodal dialogue corpora containing dialogues between two speakers have been collected for the analysis of human interactions, facial expressions, emotions, and gestures. The Cardiff Conversation Database (CCDb) [1] contains audio-visual natural conversations with no assigned roles (listener or speaker) and no scenario. Part of the data is annotated with dialogue acts such as Backchannel and Agree, emotions such as Surprise and Happy, and head movements such as Head Nodding and Head Tilt. Each conversation lasted 5 minutes, 300 minutes of dialogue were collected in total, and the participants ranged in age from 25 to 57 years. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [6] is used for communication and gesture analysis. The actors wore markers on their faces, heads, and hands, and two types of dialogue were conducted: improvisations and scripted scenarios. The utterances are annotated with emotion labels, and the total recording time was approximately 12 hours. The NoXi corpus [7] contains dialogues mainly in English, French, and German, annotated with head movements, smiles, gaze, engagement, and so on. The total recording time was approximately 25 hours, and the participants ranged in age from 21 to 50 years. The CANDOR corpus [24] is an English conversation corpus consisting of 1,656 conversations totaling 850 hours. Although it is larger than the corpus collected in this study, it was transcribed and annotated automatically rather than by hand, unlike ours. In addition, its participants range in age from 19 to 66 years, so conversations with minors are excluded. The CABB dataset [12] is a dialogue corpus of collaborative referential communication games conducted in Dutch; it includes video, audio, and body-motion tracking data. The multimodal dialogue corpus Hazumi [16] includes Japanese conversations between a participant and a system operated using the Wizard of Oz (WoZ) method. It comprises approximately 65 hours of dialogue with 214 participants ranging in age from their 20s to their 70s, excluding minors. Every exchange pair, consisting of a system utterance and the subsequent user utterance, was assigned sentiment labels by multiple third-party annotators, and one version of the corpus also includes simultaneously recorded physiological sensor outputs. In addition, a multimodal corpus of persuasive dialogues between participants and an android operated via WoZ has been constructed [15].
In this study, we collected more than 115 hours of data, which exceeds the two-party multimodal dialogue corpora noted above, with the exception of CANDOR. The age range of the speakers in our data is also wider than that in previous studies.
Multimodal corpora containing conversations among multiple people have also been collected. RoomReader [25] is a multi-party, multimodal dialogue corpus consisting of approximately 8 hours of dialogue collected via Zoom. The dialogues were automatically transcribed and then manually corrected, and they were annotated for engagement. The Belfast storytelling dataset [17], the AMI meeting corpus [8], the ICSI meeting corpus [14], Computers in the Human Interaction Loop (CHIL) [27], and Video Analysis and Content Extraction (VACE) [10] are well-known examples of such multi-party corpora. These corpora contain both two-person and multi-person dialogues; they primarily cover speakers in their 20s to 70s and generally exclude minors.
Several multimodal dialogue datasets have been constructed by extracting scenes from TV series. The Understanding and Response Prediction dataset [28] is a multimodal dialogue corpus consisting of approximately 42,000 scenes from TV dramas; it was constructed for predicting the boundaries of dialogue scenes and for generating responses based on scene recognition. The Multimodal EmotionLines Dataset (MELD) [21] is a dialogue emotion recognition dataset created by annotating emotional information in dialogue scenes from the television show Friends. Another dialogue emotion recognition dataset, the Multimodal Multi-scene Multi-label Emotional Dialogue (M3ED) dataset [34], was constructed from Chinese TV series.
Monologue corpora cover speakers of a broader age range. The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset [32] is one of the largest, containing 23,500 YouTube videos from 1,000 speakers, with each utterance annotated with an emotion label. In addition, several other monologue corpora have been shared with the research community, such as the Multimodal Corpus of Sentiment Intensity (CMU-MOSI) [31], the ICT Multi-Modal Movie Opinion (ICT-MMMO) corpus [30], and the Multimodal Opinion Utterances Dataset (MOUD) [20]. These datasets are relatively large and include both minors and older speakers; however, they are not dialogue corpora.