Abstract
This paper addresses forced alignment of text and audio in news and songs, with the goal of obtaining the times at which each word of the transcription begins and ends. Two methods are used. The first is a conventional forced alignment of the audio and text based on pre-existing models. The second is a model-free method, in which new models are trained on the very audio to be aligned, yielding the aligned text and audio as a result. For the songs, we consider two versions of each song: an a cappella version (voice only, no music) and the full version (with instrumental music). Three songs by different singers and in different styles were selected. For news, we analyze four speakers (two female and two male). Across all results, news is aligned better than songs, as expected. The two methods perform similarly on a cappella songs and news, but on songs that include the instrumental part the model-free method performs much better.
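To make the notion of forced alignment concrete, the following is a minimal sketch of the underlying dynamic-programming idea, not the system used in the paper. It assumes that some acoustic model has already produced a per-frame log-likelihood for each transcript unit (word or phone) in transcript order. The function name `forced_align`, the input layout `log_lik[t][k]`, and the 10 ms default frame shift are all illustrative assumptions.

```python
import math


def forced_align(log_lik, frame_shift=0.01):
    """Toy forced alignment by Viterbi-style dynamic programming.

    log_lik[t][k] is the (assumed, precomputed) log-likelihood of frame t
    under transcript unit k, with units listed in transcript order.
    Returns one (start_time, end_time) pair per unit, in seconds.
    """
    n_frames, n_units = len(log_lik), len(log_lik[0])
    if n_frames < n_units:
        raise ValueError("need at least one frame per transcript unit")

    neg_inf = -math.inf
    score = [[neg_inf] * n_units for _ in range(n_frames)]
    back = [[0] * n_units for _ in range(n_frames)]
    score[0][0] = log_lik[0][0]  # the path must start in the first unit

    for t in range(1, n_frames):
        for k in range(n_units):
            stay = score[t - 1][k]                            # remain in unit k
            move = score[t - 1][k - 1] if k > 0 else neg_inf  # advance from k-1
            if move > stay:
                score[t][k] = move + log_lik[t][k]
                back[t][k] = k - 1
            else:
                score[t][k] = stay + log_lik[t][k]
                back[t][k] = k

    # Backtrack from the last unit at the last frame.
    k = n_units - 1
    path = [k]
    for t in range(n_frames - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()

    # Convert the frame-level path into (start, end) times per unit.
    segments, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            segments.append((start * frame_shift, t * frame_shift))
            start = t
    return segments
```

Under these assumptions, the two methods in the abstract would differ only in where `log_lik` comes from: from pre-existing models in the first case, and from models trained on the audio being aligned in the second.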
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Córdova Lucero, D.P., Toledano, D.T. (2012). Preliminary Results of Alignment of Text and Audio in News and Songs. In: Torre Toledano, D., et al. Advances in Speech and Language Technologies for Iberian Languages. Communications in Computer and Information Science, vol 328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35292-8_7
DOI: https://doi.org/10.1007/978-3-642-35292-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35291-1
Online ISBN: 978-3-642-35292-8
eBook Packages: Computer Science (R0)