
Emotion Aided Dialogue Act Classification for Task-Independent Conversations in a Multi-modal Framework

Published in: Cognitive Computation

Abstract

Dialogue act classification (DAC) provides significant insight into understanding the communicative intention of the user. Numerous machine learning (ML) and deep learning (DL) approaches have been proposed over the years in this regard for task-oriented and task-independent conversations in textual form. However, the effect of the emotional state on determining the dialogue acts (DAs) has not been studied in depth in a multi-modal framework involving text, audio, and visual features. Conversations are intrinsically determined and regulated by direct, exquisite, and subtle emotions, and the emotional state of a speaker has a considerable effect on the intentional or pragmatic content of an utterance. This paper thoroughly investigates the role of emotions in the automatic identification of DAs in task-independent conversations in a multi-modal framework (specifically audio and text). A DL-based multi-tasking network for DAC and emotion recognition (ER) has been developed, incorporating attention to facilitate the fusion of different modalities. IEMOCAP, an open-source, benchmark multi-modal ER dataset, has been manually annotated with its corresponding DAs to make it suitable for multi-task learning and to further advance research in multi-modal DAC. The proposed multi-task framework attains an improvement of 2.5% over its single-task DAC counterpart on the manually annotated IEMOCAP dataset. Comparisons with several baselines establish the efficacy of the proposed approach and the importance of incorporating emotion while identifying DAs.
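The abstract describes the architecture only at a high level: a text encoder and an audio encoder whose representations are fused through attention and shared by two classification heads (DAs and emotions). The minimal Python/Keras sketch below (the article's footnote points to Keras) illustrates one way such a multi-task, attention-fused model can be wired up; all layer sizes, feature dimensions, label counts, and names are illustrative assumptions, not the authors' reported configuration.

    # Minimal multi-task sketch: text + audio branches, attention-based fusion,
    # and separate dialogue-act and emotion heads. All sizes are assumptions.
    from tensorflow.keras import layers, Model

    MAX_WORDS, VOCAB, EMB_DIM = 50, 20000, 300   # assumed text settings
    AUDIO_DIM = 384                              # assumed utterance-level audio feature size
    N_DA, N_EMO = 12, 6                          # assumed numbers of DA and emotion labels

    # Text branch (a pre-trained embedding such as GloVe would normally initialise it).
    txt_in = layers.Input(shape=(MAX_WORDS,), name="text_ids")
    x = layers.Embedding(VOCAB, EMB_DIM, mask_zero=True)(txt_in)
    x = layers.Bidirectional(layers.LSTM(128))(x)            # utterance-level text vector

    # Audio branch over utterance-level acoustic features (e.g. openSMILE statistics).
    aud_in = layers.Input(shape=(AUDIO_DIM,), name="audio_feats")
    a = layers.Dense(128, activation="relu")(aud_in)

    # Attention-based fusion: project both modalities to a common space,
    # learn a scalar weight per modality, and take the weighted sum.
    stack = layers.Concatenate(axis=1)(
        [layers.Reshape((1, 256))(layers.Dense(256)(x)),
         layers.Reshape((1, 256))(layers.Dense(256)(a))])    # (batch, 2, 256)
    scores = layers.Dense(1)(stack)                          # (batch, 2, 1)
    weights = layers.Softmax(axis=1)(scores)                 # attention over modalities
    fused = layers.Flatten()(layers.Dot(axes=1)([weights, stack]))  # (batch, 256)

    # Two task-specific heads sharing the fused representation.
    da_out = layers.Dense(N_DA, activation="softmax", name="dialogue_act")(fused)
    emo_out = layers.Dense(N_EMO, activation="softmax", name="emotion")(fused)

    model = Model([txt_in, aud_in], [da_out, emo_out])
    model.compile(optimizer="adam",
                  loss={"dialogue_act": "sparse_categorical_crossentropy",
                        "emotion": "sparse_categorical_crossentropy"},
                  loss_weights={"dialogue_act": 1.0, "emotion": 0.5})
    model.summary()

Sharing the fused representation between the two heads is what lets the emotion supervision act as an auxiliary signal for the DA head, which is the intuition behind the reported multi-task gain.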



Notes

  1. The DA-annotated dataset will be made publicly available to the research community.

  2. https://keras.io/


Acknowledgments

Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya PhD Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, and implemented by Digital India Corporation (formerly Media Lab Asia), for carrying out this research.

Author information


Corresponding author

Correspondence to Tulika Saha.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Saha, T., Gupta, D., Saha, S. et al. Emotion Aided Dialogue Act Classification for Task-Independent Conversations in a Multi-modal Framework. Cogn Comput 13, 277–289 (2021). https://doi.org/10.1007/s12559-019-09704-5
