Abstract
Prosody modification is the modification of prosodic parameters such as pitch, duration and strength of excitation by a desired factor. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). Existing epoch-based prosody modification techniques use epochs as pitch markers and achieve the required modification by interpolating the epoch interval plot. As an alternative, this work proposes a method that performs prosody modification by resampling the ZFFS. The existing epoch-based method is also refined to modify the prosodic parameters at the level of every epoch, which provides more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used to obtain dynamic prosody modification from the existing PSOLA and epoch-based methods. The quality of the prosody-modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated on a neutral-to-emotional conversion task. Subjective evaluations of the emotion conversion indicate that dynamic prosody modification is more effective than fixed prosody modification. The dynamic prosody modified speech files synthesized using the proposed, epoch-based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.
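The idea of deriving modified epoch locations by resampling the ZFFS can be illustrated with a small sketch. This is not the authors' implementation. It assumes, as is common in the ZFF literature, that epochs are taken at the negative-to-positive zero crossings of the ZFFS. A sinusoid serves as a toy stand-in for a real ZFFS, linear interpolation stands in for whatever resampler is actually used, and the function names `positive_zero_crossings` and `resample_linear` are hypothetical.

```python
import numpy as np


def positive_zero_crossings(x):
    """Return indices n where x[n-1] < 0 <= x[n] (negative-to-positive crossings)."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1


def resample_linear(x, factor):
    """Stretch (factor > 1) or compress (factor < 1) x along the time axis.

    Linear interpolation is an assumption made for this sketch; any
    resampler could be substituted.
    """
    x = np.asarray(x, dtype=float)
    n_out = int(round(len(x) * factor))
    # Position in the original signal that each output sample reads from.
    src = np.arange(n_out) / factor
    return np.interp(src, np.arange(len(x)), x)


# Toy stand-in for a ZFFS: a sinusoid with an 80-sample period.
# The half-sample phase offset keeps samples away from exact zeros.
period = 80
n = np.arange(800)
zffs = np.sin(2 * np.pi * (n + 0.5) / period)

# Epochs of the original toy signal and their intervals.
orig_epochs = positive_zero_crossings(zffs)
orig_intervals = np.diff(orig_epochs)

# Resample by 1.25: epoch intervals lengthen by 1.25, so pitch is scaled by 0.8.
factor = 1.25
new_epochs = positive_zero_crossings(resample_linear(zffs, factor))
new_intervals = np.diff(new_epochs)

print(orig_intervals)
print(new_intervals)
```

In this toy case every epoch interval is scaled by the same factor. A dynamic modification would instead let the factor vary along the signal, for example by resampling successive segments with different factors.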
Acknowledgements
The work reported in this paper is funded by the ongoing UK-India Education Research Initiative (UKIERI) project titled “study of source features for speech synthesis and speaker recognition” between IIT Guwahati, IIIT Hyderabad and the University of Edinburgh.
Cite this article
Govind, D., Mahadeva Prasanna, S.R. Dynamic prosody modification using zero frequency filtered signal. Int J Speech Technol 16, 41–54 (2013). https://doi.org/10.1007/s10772-012-9155-3