Abstract
Phonetic alignment of spoken utterances for speech research is commonly performed by HMM-based speech recognizers in forced-alignment mode, but training the phonetic segment models requires considerable amounts of annotated data. When no such material is available, a possible solution is to synthesize the same phonetic sequence and align the resulting speech signal with the spoken utterance. However, unless the acoustic features used in this procedure are chosen carefully, it can perform poorly on continuous speech utterances. In this paper we propose a new method for selecting the best features to use in the alignment procedure for each pair of phonetic segment classes. The results show that this selection considerably reduces segment boundary location errors.
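The idea of aligning a synthesized signal with a spoken one can be illustrated with a minimal sketch. It assumes dynamic time warping (DTW) as the alignment procedure and a plain Euclidean distance between frames; the abstract specifies neither, and the per-class-pair feature selection proposed in the paper is not reproduced here. The function names `dtw_path` and `map_boundaries` and the toy one-dimensional feature sequences are hypothetical, chosen only for illustration.

```python
import math


def dtw_path(ref, target):
    """Return a minimum-cost warping path between two feature sequences.

    ref and target are lists of equal-dimension feature vectors.
    The path is a list of (ref_frame, target_frame) index pairs.
    Assumption: Euclidean frame distance and the basic symmetric step pattern.
    """
    n, m = len(ref), len(target)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of ref[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the last frame pair to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map boundary frames known in the reference onto the target signal."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)  # keep the earliest target frame per ref frame
    return [first_match[b] for b in ref_boundaries]


# Toy data: the synthetic signal has a segment boundary at frame 2;
# the "spoken" signal contains the same two segments at a slower rate.
synthetic = [[0.0], [0.0], [1.0], [1.0]]
spoken = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]

path = dtw_path(synthetic, spoken)
print(map_boundaries(path, [2]))  # [3]
```

Because the synthesizer knows where each phonetic segment starts in its own output, the warping path carries those boundaries over to the spoken utterance. In this sketch, changing which acoustic features make up each frame vector changes the frame distances and therefore the resulting path, which is where the choice of features enters.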
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Paulo, S., Oliveira, L.C. (2003). Improving the Accuracy of the Speech Synthesis Based Phonetic Alignment Using Multiple Acoustic Features. In: Mamede, N.J., Trancoso, I., Baptista, J., das Graças Volpe Nunes, M. (eds) Computational Processing of the Portuguese Language. PROPOR 2003. Lecture Notes in Computer Science, vol 2721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45011-4_5
DOI: https://doi.org/10.1007/3-540-45011-4_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40436-1
Online ISBN: 978-3-540-45011-5
eBook Packages: Springer Book Archive