Phone duration modeling: overview of techniques and performance optimization via feature selection in the context of emotional speech

Lazaridis, Alexandros; Ganchev, Todor; Kostoulas, Theodoros; Mporas, Iosif; Fakotakis, Nikos

doi:10.1007/s10772-010-9077-x

Published: 30 July 2010

Volume 13, pages 175–188, (2010)

International Journal of Speech Technology

Abstract

Accurate modeling of prosody is a prerequisite for producing high-quality synthetic speech. Phone duration, one of the key prosodic parameters, plays an important role in generating natural-sounding emotional synthetic speech. In the present work we offer an overview of various phone duration modeling techniques and subsequently evaluate ten models, based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, which over the past decades have been used successfully in various modeling tasks. Furthermore, we study the opportunity for performance optimization by applying two feature selection techniques, RReliefF and Correlation-based Feature Selection, to a large set of numerical and nominal linguistic features extracted from text (phonetic, phonological and morphosyntactic), which have been reported as successful in phone and syllable duration modeling. We investigate the practical usefulness of these phone duration modeling techniques on a Modern Greek emotional speech database, which comprises five categories of emotional speech: anger, fear, joy, neutral and sadness. The experimental results demonstrate that feature selection significantly improves the accuracy of phone duration prediction regardless of the type of machine learning algorithm used for phone duration modeling. Specifically, in four out of the five categories of emotional speech, feature selection improved phone duration modeling compared to the case without feature selection. The M5p-tree-based phone duration model achieved the best phone duration prediction accuracy in terms of root mean square error (RMSE) and mean absolute error (MAE).
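The two accuracy measures named in the abstract can be illustrated with a minimal sketch. The function names and the duration values below are hypothetical and chosen only for illustration; they are not taken from the article or its database, and the sketch assumes durations are expressed in milliseconds.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical phone durations in milliseconds (illustrative values only).
actual_ms = [80.0, 120.0, 60.0, 100.0]
predicted_ms = [90.0, 110.0, 60.0, 120.0]

print(f"RMSE: {rmse(actual_ms, predicted_ms):.2f} ms")
print(f"MAE:  {mae(actual_ms, predicted_ms):.2f} ms")
```

Because RMSE squares each error before averaging, it penalizes a few large duration errors more heavily than MAE does, which is one reason the two measures are often reported together.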

Author information

Authors and Affiliations

Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras, 26500, Greece
Alexandros Lazaridis, Todor Ganchev, Theodoros Kostoulas, Iosif Mporas & Nikos Fakotakis

Corresponding author

Correspondence to Alexandros Lazaridis.

About this article

Cite this article

Lazaridis, A., Ganchev, T., Kostoulas, T. et al. Phone duration modeling: overview of techniques and performance optimization via feature selection in the context of emotional speech. Int J Speech Technol 13, 175–188 (2010). https://doi.org/10.1007/s10772-010-9077-x

Received: 20 May 2010
Accepted: 11 July 2010
Published: 30 July 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s10772-010-9077-x
