DOI: 10.1145/3625135.3625154
short-paper

Aligning Incomplete Lyrics of Korean Folk Song Dataset using Whisper

Published: 10 November 2023

ABSTRACT

In this study, we introduce a method for time-alignment of lyrics in Korean folk song audio using a transformer encoder-decoder model specifically designed to utilize incomplete lyric data. We analyzed the characteristics of Korean folk song lyrics and found some discrepancies between the lyrics and the corresponding audio recordings. To address these challenges and maximize the use of existing transcriptions, we introduce RefWhisper. This is a variant of OpenAI’s Whisper and includes an extra encoder module and cross-attention layer, enabling the model to consult incomplete lyrics during the transcription process. The added cross-attention layer facilitates not only the alignment of the reference text with the predicted transcription but also with the audio. We make public the transcribed outcomes and timestamp data, which are aligned at both the sentence and word levels, for a corpus of 13,801 Korean folk songs.
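
As a rough illustration of how a decoder might consult reference lyrics through cross-attention, the sketch below computes scaled dot-product attention from decoder states (queries) to encoded reference tokens (keys and values). It is not the RefWhisper implementation: the function names, the toy vectors, and the reuse of the reference encodings as both keys and values are assumptions made only for this example. The resulting weights can be read as a soft alignment between predicted tokens and reference tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned projections).

    queries: decoder states, one vector per predicted token
    keys, values: reference-lyric encodings, one vector per reference token
    Returns (contexts, weights), where weights[i][j] says how strongly
    predicted token i attends to reference token j.
    """
    dim = len(keys[0])
    contexts, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        ctx = [
            sum(w[j] * values[j][d] for j in range(len(values)))
            for d in range(len(values[0]))
        ]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


# Hypothetical toy data: three reference tokens and two decoder steps.
reference = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
decoder_states = [[0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

contexts, weights = cross_attention(decoder_states, reference, reference)

# Hard alignment: the reference token each decoder step attends to most.
alignment = [max(range(len(w)), key=w.__getitem__) for w in weights]
print(alignment)
```

In a full model the queries, keys, and values would pass through learned projections and multiple heads; the sketch omits these to keep the attention step itself visible.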


      • Published in

        DLfM '23: Proceedings of the 10th International Conference on Digital Libraries for Musicology
        November 2023
        139 pages
        ISBN: 9798400708336
        DOI: 10.1145/3625135

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • short-paper
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate: 27 of 48 submissions, 56%