Abstract
In human-to-human communication, gesture and speech co-occur in tight temporal synchrony, and gestures often complement or emphasize speech. In human–computer interaction systems, natural, affective, and believable use of gestures would be a valuable key component in adopting and emphasizing human-centered design. However, natural and affective multimodal data for studying computational models of gesture and speech are scarce. In this study, we introduce the JESTKOD database, which consists of speech and full-body motion-capture recordings of dyadic interactions under agreement and disagreement scenarios. The participants of the dyadic interactions are native Turkish speakers, and the recordings of each participant are rated in a dimensional affect space. We present our multimodal data collection and annotation process, as well as preliminary experimental studies on agreement/disagreement classification of dyadic interactions using body gesture and speech data. The JESTKOD database provides a valuable asset for investigating gesture and speech toward designing more natural and affective human–computer interaction systems.
Notes
1. Flex 13 system: http://www.optitrack.com/products/flex-13/.
2. Motive optical motion capture software: http://www.optitrack.com/products/motive/.
3. The JESTKOD project is supported by TÜBİTAK under Grant Number 113E102.
4. The JESTKOD database: http://mvgl.ku.edu.tr/databases/.
Acknowledgements
This work is supported by TÜBİTAK under Grant Number 113E102.
Cite this article
Bozkurt, E., Khaki, H., Keçeci, S. et al. The JESTKOD database: an affective multimodal database of dyadic interactions. Lang Resources & Evaluation 51, 857–872 (2017). https://doi.org/10.1007/s10579-016-9377-0