Abstract
Stress in speech conveys additional information to the listener but complicates automatic speech recognition. A first step toward recognizing stressed speech is detecting the boundaries of stress. In this study, the boundaries of prosodic stress were extracted from stressed Farsi sentences. Acoustic and prosodic features were used to train hidden Markov models (HMMs) for stress boundary recognition. The features most useful for stress recognition were selected with the fast correlation-based filter (FCBF) method, and the influence of different feature sets on stress boundary recognition performance was evaluated. Based on this evaluation, a combined classifier scheme was proposed. Experimental results showed that the proposed combined model improved stress boundary detection performance by 12% compared to the baseline model, giving a final recognition rate of 85% for prosodic stress boundaries.
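
To illustrate the feature selection step named in the abstract, here is a minimal sketch of FCBF as it is commonly described: features are ranked by symmetrical uncertainty (SU) with the class label, and a feature is dropped when an already selected feature predicts it at least as well as it predicts the class. The function names, the feature names, the toy data, and the assumption that features are already discretized are all illustrative; the abstract does not give the feature sets, discretization, or threshold actually used.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))  # joint entropy H(x, y)
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


def fcbf(features, labels, threshold=0.0):
    """Return the names of the selected features.

    features:  dict mapping a (hypothetical) feature name to a list of
               discrete values, one per sample
    labels:    list of class labels, one per sample
    threshold: minimum SU with the class for a feature to be considered
    """
    # Step 1: keep features relevant to the class, most relevant first.
    relevance = {name: symmetrical_uncertainty(col, labels)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=lambda n: relevance[n], reverse=True)

    # Step 2: drop a feature if some already selected feature is at least
    # as correlated with it as the feature is with the class.
    selected = []
    for name in ranked:
        redundant = any(
            symmetrical_uncertainty(features[kept], features[name])
            >= relevance[name]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Toy data only: 1 = stressed, 0 = unstressed (hypothetical labels).
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    features = {
        "energy": [0, 0, 0, 0, 1, 1, 1, 1],      # tracks the label
        "energy_dup": [1, 1, 1, 1, 0, 0, 0, 0],  # redundant with "energy"
        "duration": [0, 1, 0, 1, 0, 1, 0, 1],    # independent of the label
    }
    print(fcbf(features, labels))
```

In this toy run only one of the two mutually redundant features survives, and the feature that carries no information about the label is filtered out in the first step. Real acoustic and prosodic features are continuous, so they would need to be discretized (or SU estimated another way) before a filter like this could be applied.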




Cite this article
Gharavian, D., Sheikhan, M. & Ghasemi, S.S. Combined classification method for prosodic stress recognition in Farsi language. Int J Speech Technol 21, 333–341 (2018). https://doi.org/10.1007/s10772-018-9508-7
DOI: https://doi.org/10.1007/s10772-018-9508-7