Improving the Flexibility of Dynamic Prosody Modification Using Instants of Significant Excitation

Circuits, Systems, and Signal Processing

Abstract

Modification of suprasegmental features such as pitch and duration of original speech by fixed scaling factors is referred to as static prosody modification. In dynamic prosody modification, the prosodic scaling factors (time-varying modification factors) are defined for all the pitch cycles present in the original speech. The present work focuses on improving the naturalness of prosody-modified speech by reducing the generation of piecewise constant segments in the modified pitch contour. The prosody modification is performed by anchoring around the accurate instants of significant excitation estimated from the original speech. The division of longer pitch intervals into many equal intervals over long speech segments introduces step-like discontinuities in the form of piecewise constant segments in the modified pitch contours. The effectiveness of the proposed dynamic modification method is initially confirmed from the smooth modified pitch contour plots obtained for finer static prosody scaling factors, from waveform and spectrogram plots, and from comparative subjective evaluations. Also, the average \(F_0\) jitter computed from the pitch segments of each glottal activity region in the modified speech is proposed as an objective measure for the prosody modification. The naturalness of the prosody-modified speech using the proposed method is objectively and subjectively compared with that of the existing zero frequency filtered signal-based dynamic prosody modification. Also, the proposed algorithm effectively preserves the dynamics of the prosodic patterns in singing voices, wherein the \(F_0\) parameters fluctuate rapidly and continuously within a higher \(F_0\) range.
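
To make the objective measure concrete, the following is a minimal sketch of an average \(F_0\) jitter computation over glottal activity regions. The abstract does not give the exact formula, so the sketch assumes one common definition of jitter (the mean absolute difference between consecutive instantaneous \(F_0\) values); the function names, the sample-index representation of epochs and the sampling rate `fs` are illustrative assumptions, not the authors' implementation.

```python
def f0_from_epochs(epochs, fs):
    """Instantaneous F0 (Hz) from consecutive epoch locations given in samples."""
    return [fs / (b - a) for a, b in zip(epochs, epochs[1:])]


def f0_jitter(f0):
    """Assumed jitter definition: mean absolute difference of consecutive F0 values (Hz)."""
    if len(f0) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(f0, f0[1:])]
    return sum(diffs) / len(diffs)


def average_f0_jitter(regions, fs):
    """Average the per-region jitter over all glottal activity regions.

    `regions` is a list of epoch lists, one per region (hypothetical layout).
    Regions with fewer than three epochs yield fewer than two F0 values and are skipped.
    """
    per_region = [f0_jitter(f0_from_epochs(r, fs)) for r in regions if len(r) >= 3]
    return sum(per_region) / len(per_region) if per_region else 0.0


if __name__ == "__main__":
    fs = 8000
    # Region 1: constant 80-sample periods; region 2: periods of 100 then 80 samples.
    regions = [[0, 80, 160, 240], [1000, 1100, 1180]]
    print(average_f0_jitter(regions, fs))
```

Under this assumed definition, a region whose epoch intervals are all equal contributes zero jitter, while abrupt steps between neighbouring \(F_0\) values raise the measure.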

Notes

  1. The terms epoch and ISE are used interchangeably throughout this article.

  2. The epoch intervals and instantaneous pitch periods are considered as the same parameter in the context of prosody modification.

  3. Since pitch cycles are either repeated or dropped in the case of duration modification, no overlap occurs between successive pitch intervals, and hence the samples in the pitch intervals are not copied in an overlap-add manner. For pitch modification, to reduce the effect of truncation and expansion of pitch cycles in the waveform reconstruction, the samples in each pitch interval of the original speech signal are copied in an overlap-add manner.

  4. Methods for subjective determination of transmission quality: ITU-T Recommendation P.800 is available from the ITU Web site: http://www.itu.int/rec/T-REC-P.800-199608-I/en.
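
The distinction drawn in Notes 2 and 3 can be illustrated with a minimal sketch: epoch intervals are treated as pitch periods, duration is changed by repeating or dropping whole pitch cycles without overlap, and pitch is changed by overlap-adding windowed segments at re-spaced epoch positions. This is a generic PSOLA-style illustration, not the authors' algorithm; the function names, the Hann window, the nearest-epoch mapping and the assumption of at least two epochs given as sample indices are all choices made for this example.

```python
import math


def modify_duration(signal, epochs, factor):
    """Repeat or drop whole pitch cycles; cycles are concatenated without overlap."""
    cycles = [signal[a:b] for a, b in zip(epochs, epochs[1:])]
    n_out = max(1, round(len(cycles) * factor))
    out = []
    for k in range(n_out):
        # Map each output cycle back to a source cycle (repeats if factor > 1).
        src = min(int(k / factor), len(cycles) - 1)
        out.extend(cycles[src])
    return out


def modify_pitch(signal, epochs, factor):
    """Scale F0 by `factor` by re-spacing epochs and overlap-adding windowed segments.

    Assumes at least two epochs; the output has the same length as the input.
    """
    out = [0.0] * len(signal)
    t = float(epochs[0])
    while t <= epochs[-1]:
        # Nearest original epoch to the current synthesis position.
        i = min(range(len(epochs)), key=lambda j: abs(epochs[j] - t))
        # Local pitch period taken from the adjacent epoch interval.
        if i + 1 < len(epochs):
            period = epochs[i + 1] - epochs[i]
        else:
            period = epochs[i] - epochs[i - 1]
        center = round(t)
        # Two-period Hann-windowed segment centred on the source epoch.
        for n in range(-period, period + 1):
            s, d = epochs[i] + n, center + n
            if 0 <= s < len(signal) and 0 <= d < len(out):
                w = 0.5 * (1.0 + math.cos(math.pi * n / period))
                out[d] += w * signal[s]
        # New epoch spacing is the original period divided by the pitch factor.
        t += period / factor
    return out


if __name__ == "__main__":
    x = [math.sin(2 * math.pi * n / 80) for n in range(400)]
    marks = [0, 80, 160, 240, 320]
    print(len(modify_duration(x, marks, 2.0)))  # twice the number of cycle samples
    y = modify_pitch(x, marks, 1.0)
    print(max(abs(a - b) for a, b in zip(x[:321], y[:321])))  # close to zero
```

With a pitch factor of 1 and equally spaced epochs, the overlapping Hann windows sum to one, so the region between the first and last epoch is reconstructed unchanged. For other factors, the windowing smooths the truncation or expansion of pitch cycles, which is the purpose Note 3 attributes to overlap-add copying.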

Acknowledgments

The present work is supported by the Department of Science and Technology sponsored project entitled “Analysis, processing and synthesis of emotions in speech” (project Reference No. SB/FTP/ETA-370/2012).

Author information

Corresponding author

Correspondence to D. Govind.

About this article

Cite this article

Govind, D., Joy, T.T. Improving the Flexibility of Dynamic Prosody Modification Using Instants of Significant Excitation. Circuits Syst Signal Process 35, 2518–2543 (2016). https://doi.org/10.1007/s00034-015-0159-5

  • DOI: https://doi.org/10.1007/s00034-015-0159-5
