Abstract
The phoneme-to-audio alignment task aims to align every phoneme to its corresponding speech or singing audio segment, and it has many applications in both research and commercial settings. Here we propose an easy-to-train, compact phoneme-to-audio alignment model that is especially effective for singing audio. Specifically, we design a compact model with a simple encoder-decoder architecture that omits the popular but, in this setting, redundant attention component. The model can be trained well in relatively few epochs on different datasets using a combination of CTC loss and mel-spectrogram reconstruction loss. We then apply a dedicated dynamic programming algorithm to the likelihood matrix output by the model to obtain the alignment. We conduct extensive experiments to verify the effectiveness of our method, and they show that it outperforms the baseline models on different datasets. Our code is available on GitHub.
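The abstract does not spell out the paper's dedicated dynamic programming algorithm, but the general idea of extracting an alignment from a frame-by-phoneme likelihood matrix can be sketched with a standard monotonic forced-alignment DP (a Viterbi-style pass). The function below is an illustrative assumption, not the authors' algorithm: `log_probs` is a `(frames, vocab)` matrix of per-frame log-likelihoods, and each target phoneme is assigned a contiguous span of frames.

```python
import numpy as np

def align_phonemes(log_probs: np.ndarray, phoneme_ids: list) -> list:
    """Monotonic forced alignment over a frame-by-phoneme likelihood matrix.

    log_probs: (T, V) per-frame log-likelihoods over the phoneme vocabulary.
    phoneme_ids: target phoneme sequence of length N (N <= T).
    Returns one (start_frame, end_frame) span per target phoneme.
    """
    T, N = log_probs.shape[0], len(phoneme_ids)
    assert T >= N, "need at least one frame per phoneme"
    NEG = -np.inf
    # dp[t, n]: best score aligning frames 0..t with phonemes 0..n,
    # where frame t is emitted by phoneme n.
    dp = np.full((T, N), NEG)
    back = np.zeros((T, N), dtype=np.int64)  # 0 = stay on phoneme, 1 = advance
    dp[0, 0] = log_probs[0, phoneme_ids[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = dp[t - 1, n]
            adv = dp[t - 1, n - 1] if n > 0 else NEG
            if adv > stay:
                dp[t, n] = adv + log_probs[t, phoneme_ids[n]]
                back[t, n] = 1
            else:
                dp[t, n] = stay + log_probs[t, phoneme_ids[n]]
                back[t, n] = 0
    # Backtrace from the final frame/phoneme to recover segment boundaries.
    spans = []
    n, end = N - 1, T - 1
    for t in range(T - 1, 0, -1):
        if back[t, n] == 1:          # phoneme boundary between t-1 and t
            spans.append((t, end))
            end = t - 1
            n -= 1
    spans.append((0, end))
    spans.reverse()
    return spans
```

Because the path is constrained to be monotonic and to visit every target phoneme once, the backtrace yields non-overlapping spans that cover all frames, which is the form of output a phoneme-to-audio aligner needs.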
Acknowledgment
This work is supported by the National Key R&D Program of China (Grant No. 2020AAA0107904), the Key Support Project of the NSFC-Liaoning Joint Foundation (Grant No. U1908216), and the Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan (Grant No. WT135-38).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zheng, M., Bai, P., Shi, X. (2023). A Compact Phoneme-To-Audio Aligner for Singing Voice. In: Yang, X., et al. (eds.) Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol. 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_13
Print ISBN: 978-3-031-46663-2
Online ISBN: 978-3-031-46664-9