
A Compact Phoneme-To-Audio Aligner for Singing Voice

  • Conference paper
  • First Online:
Advanced Data Mining and Applications (ADMA 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14177)


Abstract

The phoneme-to-audio alignment task aims to align every phoneme with its corresponding speech or singing audio segment, and has many applications in both research and commercial settings. Here we propose an easy-to-train, compact phoneme-to-audio alignment model that is especially effective for singing audio. Specifically, we design a compact model with a simple encoder-decoder architecture that omits the popular but redundant attention component. The model can be trained in relatively few epochs on different datasets using a combination of CTC loss and mel-spectrogram reconstruction loss. We then apply a dedicated dynamic programming algorithm to the model's output likelihood matrix to obtain the alignment. Extensive experiments verify the effectiveness of our method and show that it outperforms baseline models on different datasets. Our code is available on GitHub.
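The dynamic programming step described in the abstract can be illustrated with a small sketch: given a frame-by-class log-likelihood matrix from an acoustic model and the known phoneme sequence, a Viterbi-style recursion finds the monotonic frame-to-phoneme assignment with the highest total score. This is a hypothetical illustration of such an aligner, not the paper's exact algorithm; the function name and interface are our own.

```python
import numpy as np

def forced_align(log_probs, phonemes):
    """Monotonic forced alignment via dynamic programming.

    log_probs: (T, C) array of frame-level log-likelihoods over C classes.
    phonemes:  length-S sequence of class indices to align, in order.
    Returns a length-T list assigning each frame to a phoneme position 0..S-1.
    """
    T, _ = log_probs.shape
    S = len(phonemes)
    neg = -np.inf
    # dp[t, s]: best score over frames 0..t with frame t assigned to phoneme s.
    dp = np.full((T, S), neg)
    back = np.zeros((T, S), dtype=int)  # 0 = stay on phoneme, 1 = advance
    dp[0, 0] = log_probs[0, phonemes[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            adv = dp[t - 1, s - 1] if s > 0 else neg
            if adv > stay:
                dp[t, s] = adv + log_probs[t, phonemes[s]]
                back[t, s] = 1
            else:
                dp[t, s] = stay + log_probs[t, phonemes[s]]
                back[t, s] = 0
    # Backtrace from the final frame, which must land on the last phoneme.
    path = [S - 1]
    s = S - 1
    for t in range(T - 1, 0, -1):
        if back[t, s]:
            s -= 1
        path.append(s)
    path.reverse()
    return path
```

Because the recursion only allows "stay" or "advance by one" transitions, the recovered path is guaranteed to be monotonic and to visit every phoneme at least once, which is exactly the constraint a forced aligner needs.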


Notes

  1. https://github.com/zhengmidon/singaligner.



Acknowledgment

This work is supported by the National Key R&D Program of China (Grant No. 2020AAA0107904), the Key Support Project of the NSFC-Liaoning Joint Foundation (Grant No. U1908216), and the Major Scientific Research Project of the State Language Commission in the 13th Five-Year Plan (Grant No. WT135-38).


Corresponding author

Correspondence to Xiaodong Shi.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zheng, M., Bai, P., Shi, X. (2023). A Compact Phoneme-To-Audio Aligner for Singing Voice. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol. 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_13


  • DOI: https://doi.org/10.1007/978-3-031-46664-9_13


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46663-2

  • Online ISBN: 978-3-031-46664-9

  • eBook Packages: Computer Science, Computer Science (R0)
