Abstract
The phoneme-to-audio alignment task aims to align every phoneme to its corresponding speech or singing audio segment, and it has many applications in both research and commercial settings. Here we propose an easy-to-train, compact phoneme-to-audio alignment model that is especially effective for singing audio. Specifically, we design a compact model with a simple encoder-decoder architecture that omits the popular but, in this setting, redundant attention component. The model can be trained well in relatively few epochs on different datasets using a combination of CTC loss and mel-spectrogram reconstruction loss. We then apply a dedicated dynamic programming algorithm to the likelihood matrix output by the model to obtain the alignment. We conduct extensive experiments to verify the effectiveness of our method, and they show that it outperforms the baseline models on different datasets. Our code is available on GitHub.
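The abstract does not spell out the paper's dedicated dynamic programming algorithm, but the general idea of extracting an alignment from a frame-by-phoneme likelihood matrix can be sketched with a standard monotonic forced-alignment DP (a Viterbi-style pass). The function below is an illustrative assumption, not the authors' algorithm: `log_probs` is a `(frames, vocab)` matrix of per-frame log-likelihoods, and each target phoneme is assigned a contiguous span of frames.

```python
import numpy as np

def align_phonemes(log_probs: np.ndarray, phoneme_ids: list) -> list:
    """Monotonic forced alignment over a frame-by-phoneme likelihood matrix.

    log_probs: (T, V) per-frame log-likelihoods over the phoneme vocabulary.
    phoneme_ids: target phoneme sequence of length N (N <= T).
    Returns one (start_frame, end_frame) span per target phoneme.
    """
    T, N = log_probs.shape[0], len(phoneme_ids)
    assert T >= N, "need at least one frame per phoneme"
    NEG = -np.inf
    # dp[t, n]: best score aligning frames 0..t with phonemes 0..n,
    # where frame t is emitted by phoneme n.
    dp = np.full((T, N), NEG)
    back = np.zeros((T, N), dtype=np.int64)  # 0 = stay on phoneme, 1 = advance
    dp[0, 0] = log_probs[0, phoneme_ids[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = dp[t - 1, n]
            adv = dp[t - 1, n - 1] if n > 0 else NEG
            if adv > stay:
                dp[t, n] = adv + log_probs[t, phoneme_ids[n]]
                back[t, n] = 1
            else:
                dp[t, n] = stay + log_probs[t, phoneme_ids[n]]
                back[t, n] = 0
    # Backtrace from the final frame/phoneme to recover segment boundaries.
    spans = []
    n, end = N - 1, T - 1
    for t in range(T - 1, 0, -1):
        if back[t, n] == 1:          # phoneme boundary between t-1 and t
            spans.append((t, end))
            end = t - 1
            n -= 1
    spans.append((0, end))
    spans.reverse()
    return spans
```

Because the path is constrained to be monotonic and to visit every target phoneme once, the backtrace yields non-overlapping spans that cover all frames, which is the form of output a phoneme-to-audio aligner needs.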
Acknowledgment
This work is supported by the National Key R&D Program of China (Grant No. 2020AAA0107904), the Key Support Project of the NSFC-Liaoning Joint Foundation (Grant No. U1908216), and the Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan (Grant No. WT135-38).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zheng, M., Bai, P., Shi, X. (2023). A Compact Phoneme-To-Audio Aligner for Singing Voice. In: Yang, X., et al. (eds.) Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol. 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_13
Print ISBN: 978-3-031-46663-2
Online ISBN: 978-3-031-46664-9