
Linguistic analysis for emotion recognition: a case of Chinese speakers

Published in: International Journal of Speech Technology

Abstract

This study investigates the acoustic features of emotion in the pronunciation of English and Mandarin Chinese words, and then presents a series of emotion recognition experiments. To this end, sound recordings of 91 speakers were analyzed. In the test experiment, a linguistic data set was used to determine which acoustic features contribute most to the representation of emotion during signal acquisition, segmentation, construction, and encoding. Words, syllables, phonemes (vowels and consonants), stress, and tone frequencies were taken into consideration. The emotions considered in the experiment were neutral, happy, and sad. Differences in duration, F0 frequency, and intensity level (dB) were used in conjunction with unsupervised and supervised machine learning approaches for emotion recognition.
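The three cues named in the abstract (duration, F0, and dB intensity) can feed a very simple classifier. The sketch below is illustrative only, not the authors' method: it synthesizes pure tones as stand-ins for recorded words, extracts the three features (F0 via a crude autocorrelation search), and assigns an emotion by nearest centroid. The per-emotion prototype values are hypothetical; the study estimates such patterns from real recordings of 91 speakers.

```python
import numpy as np

SR = 16000  # sample rate (Hz)

def make_tone(f0, dur, amp):
    """Synthesize a pure tone as a stand-in for a recorded word."""
    t = np.arange(int(SR * dur)) / SR
    return amp * np.sin(2 * np.pi * f0 * t)

def features(signal):
    """Extract the three cues used in the study: duration (s), F0 (Hz), intensity (dB)."""
    dur = len(signal) / SR
    rms = np.sqrt(np.mean(signal ** 2))
    db = 20 * np.log10(rms + 1e-12)
    # crude autocorrelation-based F0 estimate, searched over 50-500 Hz
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = SR // 500, SR // 50
    lag = lo + int(np.argmax(ac[lo:hi]))
    return np.array([dur, SR / lag, db])

# Hypothetical per-emotion prototypes (higher/faster for happy, lower/slower for sad).
protos = {
    "neutral": features(make_tone(200, 0.40, 0.30)),
    "happy":   features(make_tone(280, 0.30, 0.50)),
    "sad":     features(make_tone(160, 0.60, 0.20)),
}

def classify(signal):
    """Nearest-centroid decision after per-dimension scaling (units differ)."""
    x = features(signal)
    stack = np.stack(list(protos.values()))
    scale = stack.std(axis=0) + 1e-12
    return min(protos, key=lambda k: np.linalg.norm((x - protos[k]) / scale))
```

A tone with high pitch, short duration, and high amplitude, e.g. `classify(make_tone(270, 0.32, 0.45))`, lands nearest the "happy" prototype. Real systems replace the autocorrelation step with a robust pitch tracker (e.g. Praat's, cited in the references) and the centroids with a trained model.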



Author information

Contributions

All authors contributed equally to this work.

Corresponding author

Correspondence to Paolo Mengoni.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Consent for publication

All authors have read and approved the final manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Schirru, C., Simin, S., Mengoni, P. et al. Linguistic analysis for emotion recognition: a case of Chinese speakers. Int J Speech Technol 26, 417–432 (2023). https://doi.org/10.1007/s10772-023-10028-x
