Abstract
Modification of suprasegmental features such as the pitch and duration of the original speech by fixed scaling factors is referred to as static prosody modification. In dynamic prosody modification, the prosodic scaling factors (time-varying modification factors) are defined for all the pitch cycles present in the original speech. The present work is focused on improving the naturalness of prosody-modified speech by reducing the generation of piecewise constant segments in the modified pitch contour. The prosody modification is performed by anchoring around the accurate instants of significant excitation estimated from the original speech. The division of longer pitch intervals into many equal intervals over long speech segments introduces step-like discontinuities in the form of piecewise constant segments in the modified pitch contours. The effectiveness of the proposed dynamic modification method is initially confirmed from the smooth modified pitch contours obtained for finer static prosody scaling factors, and from waveforms, spectrogram plots and comparative subjective evaluations. In addition, the average \(F_0\) jitter computed from the pitch segments of each glottal activity region in the modified speech is proposed as an objective measure for the prosody modification. The naturalness of the prosody-modified speech obtained using the proposed method is objectively and subjectively compared with that of the existing zero frequency filtered signal-based dynamic prosody modification. The proposed algorithm also effectively preserves the dynamics of the prosodic patterns in singing voices, in which the \(F_0\) parameters fluctuate rapidly and continuously within a higher \(F_0\) range.
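The average \(F_0\) jitter measure mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact jitter formula, so the code assumes a common relative definition (mean absolute difference between consecutive \(F_0\) values, divided by the mean \(F_0\)) and assumes each glottal activity region is supplied as a list of epoch sample indices. The function names and the sampling rate are hypothetical and are not taken from the article.

```python
def f0_from_epochs(epochs, fs):
    """Instantaneous F0 (Hz) from consecutive epoch locations (sample indices)."""
    return [fs / (b - a) for a, b in zip(epochs, epochs[1:])]


def f0_jitter(f0):
    """Relative jitter of one F0 segment (assumed definition, see lead-in):
    mean absolute difference of consecutive F0 values over the mean F0."""
    if len(f0) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(f0, f0[1:])]
    return (sum(diffs) / len(diffs)) / (sum(f0) / len(f0))


def average_f0_jitter(regions, fs):
    """Average the per-region jitter over all glottal activity regions.

    regions: list of epoch-index lists, one list per region (assumed input).
    Regions with fewer than three epochs give fewer than two F0 values
    and are skipped.
    """
    values = [f0_jitter(f0_from_epochs(r, fs)) for r in regions if len(r) >= 3]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    fs = 8000  # assumed sampling rate in Hz
    flat = [[0, 80, 160, 240]]     # constant 80-sample pitch period
    varying = [[0, 80, 180, 260]]  # pitch periods of 80, 100 and 80 samples
    print(average_f0_jitter(flat, fs))
    print(average_f0_jitter(varying, fs))
```

With this definition, a perfectly flat (piecewise constant) pitch segment yields zero jitter, whereas a segment whose pitch period changes from cycle to cycle yields a positive value.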
Notes
The terms epochs and instants of significant excitation (ISE) are used interchangeably throughout this article.
The epoch intervals and instantaneous pitch periods are considered as the same parameter in the context of prosody modification.
Since pitch cycles are either repeated or dropped in the case of duration modification, successive pitch intervals do not overlap, and hence the samples in the pitch intervals are not copied in an overlap-add manner. For pitch modification, to reduce the effect of truncation and expansion of pitch cycles in the waveform reconstruction, the samples in each pitch interval of the original speech signal are copied in an overlap-add manner.
Methods for subjective determination of transmission quality: ITU-T Recommendation P.800 is available from the ITU Web site: http://www.itu.int/rec/T-REC-P.800-199608-I/en.
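The duration-modification case described in the note on repeated and dropped pitch cycles can be sketched as follows. This is only an illustration under stated assumptions, not the article's algorithm: it applies a single fixed (static) scaling factor, takes the epoch locations as given, and maps each output cycle to an original cycle by simple index scaling. The function name `modify_duration` and its parameters are hypothetical. The overlap-add copying used for pitch modification is not shown.

```python
def modify_duration(signal, epochs, factor):
    """Time-scale a voiced segment by repeating or dropping whole pitch cycles.

    signal: list of samples.
    epochs: increasing sample indices marking pitch-cycle boundaries.
    factor: duration scaling factor (> 1 lengthens, < 1 shortens).
    """
    # Cut the signal into pitch cycles delimited by consecutive epochs.
    cycles = [signal[a:b] for a, b in zip(epochs, epochs[1:])]
    if not cycles:
        return []
    n_out = max(1, round(len(cycles) * factor))
    out = []
    for k in range(n_out):
        # Pick the original cycle for output slot k; cycles are repeated
        # when factor > 1 and skipped when factor < 1.
        src = min(int(k / factor), len(cycles) - 1)
        # Cycles do not overlap, so a plain copy is enough (no overlap-add).
        out.extend(cycles[src])
    return out


if __name__ == "__main__":
    samples = list(range(16))
    marks = [0, 4, 8, 12, 16]  # four pitch cycles of four samples each
    print(len(modify_duration(samples, marks, 2.0)))
    print(modify_duration(samples, marks, 0.5))
```

Because whole cycles are copied back to back, the pitch period of every copied cycle is unchanged and only the number of cycles, and hence the duration, varies.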
Acknowledgments
The present work is supported by the Department of Science and Technology sponsored project entitled “Analysis, processing and synthesis of emotions in speech” (Project Reference No. SB/FTP/ETA-370/2012).
Cite this article
Govind, D., Joy, T.T. Improving the Flexibility of Dynamic Prosody Modification Using Instants of Significant Excitation. Circuits Syst Signal Process 35, 2518–2543 (2016). https://doi.org/10.1007/s00034-015-0159-5
DOI: https://doi.org/10.1007/s00034-015-0159-5