ABSTRACT
In this study, we introduce a method for time-aligning lyrics with Korean folk song audio using a transformer encoder-decoder model designed to make use of incomplete lyric data. We analyzed the characteristics of Korean folk song lyrics and found discrepancies between the transcribed lyrics and the corresponding audio recordings. To address these discrepancies while making the most of the existing transcriptions, we propose RefWhisper, a variant of OpenAI’s Whisper with an extra encoder module and an extra cross-attention layer that let the model consult the incomplete lyrics during transcription. The added cross-attention layer helps align the reference text not only with the predicted transcription but also with the audio. We publicly release the transcriptions and timestamp data, aligned at both the sentence and word levels, for a corpus of 13,801 Korean folk songs.
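The idea of letting a decoder consult a reference text through cross-attention can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the RefWhisper implementation: it uses single-head scaled dot-product attention with no learned projections, and every name in it (`cross_attention`, `decoder_states`, `reference_states`) is hypothetical. Each decoder token state queries the encoded reference lyrics, and the resulting attention weights can be read as a soft alignment between predicted tokens and reference tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: decoder token states, one vector per predicted token
    keys/values: encoded reference-lyric tokens
    Returns (outputs, weights); weights[i][j] is how strongly predicted
    token i attends to reference token j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy vectors standing in for real hidden states (assumed values).
decoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
reference_states = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

attended, weights = cross_attention(decoder_states, reference_states, reference_states)

# Hard alignment: the most-attended reference token for each predicted token.
alignment = [max(range(len(w)), key=w.__getitem__) for w in weights]

# Residual connection, as in a standard transformer block: the decoder state
# is updated with the information read from the reference lyrics.
fused = [[s + a for s, a in zip(state, att)] for state, att in zip(decoder_states, attended)]
```

In a real model the queries, keys, and values would pass through learned projections and multiple heads, and the reference tokens would come from a trained text encoder; the sketch keeps only the attention arithmetic so the alignment reading of the weights is visible.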
Index Terms
- Aligning Incomplete Lyrics of Korean Folk Song Dataset using Whisper