Abstract
This study investigates the acoustic features of emotions in the pronunciation of English and Mandarin Chinese words and presents a series of emotion recognition experiments. To this end, sound recordings of 91 speakers were analyzed. In the experiments, a linguistic data set was used to examine which acoustic features are most important for representing emotion during signal acquisition, segmentation, construction, and encoding. Words, syllables, phonemes (vowels and consonants), stress, and tone frequencies were taken into consideration. The emotions considered in the experiments were neutral, happy, and sad. Differences in duration, F0 frequency, and intensity (dB) were used as variables in conjunction with unsupervised and supervised machine learning approaches for emotion recognition.
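The pipeline described above — per-utterance duration, F0, and intensity features fed to both supervised and unsupervised learners — can be sketched as follows. This is a minimal illustration, not the study's actual code: the feature values are invented toy data, and the choice of SVM and k-means (via scikit-learn) is an assumption standing in for whichever supervised and unsupervised methods the authors used.

```python
# Hypothetical sketch of the described pipeline: acoustic features
# (duration, F0, intensity) classified with supervised (SVM) and
# unsupervised (k-means) learners. All numbers are invented toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy per-emotion feature centers: [duration (s), mean F0 (Hz), mean intensity (dB)]
centers = {"neutral": [0.30, 180.0, 60.0],
           "happy":   [0.25, 230.0, 68.0],
           "sad":     [0.40, 150.0, 55.0]}

# 30 simulated utterances per emotion, jittered around each center.
X = np.vstack([c + rng.normal(0, [0.02, 5.0, 1.0], size=(30, 3))
               for c in centers.values()])
y = np.repeat(list(centers), 30)

# Standardize so Hz and dB scales do not dominate the seconds scale.
Xs = StandardScaler().fit_transform(X)

# Supervised: train an SVM on labeled utterances.
clf = SVC().fit(Xs, y)
train_acc = clf.score(Xs, y)

# Unsupervised: cluster the same features without using the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
```

In a real experiment the feature vectors would come from acoustic measurement (e.g., in Praat) rather than simulation, and accuracy would be evaluated on held-out speakers rather than the training set.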



Contributions
All authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Consent for publication
All authors of the manuscript have read and approved the final manuscript.
Cite this article
Schirru, C., Simin, S., Mengoni, P. et al. Linguistic analysis for emotion recognition: a case of Chinese speakers. Int J Speech Technol 26, 417–432 (2023). https://doi.org/10.1007/s10772-023-10028-x