Short paper
DOI: 10.1145/3591106.3592237

More Than Simply Masking: Exploring Pre-training Strategies for Symbolic Music Understanding

Published: 12 June 2023

Abstract

Pre-trained language models have become the prevailing approach to natural language processing tasks in recent years. Because symbolic music and natural language text are both sequential, it is natural to apply pre-training methods to symbolic music data. However, music differs from natural language text in ways that make it difficult to model its unique features comprehensively with traditional text-based pre-training strategies alone. To address this challenge, we design the quad-attribute masking (QM) strategy and propose the key prediction (KP) task to improve the extraction of generic knowledge from symbolic music. We evaluate the impact of various pre-training strategies on several public symbolic music datasets. The results show that the proposed multi-task pre-training model effectively captures music domain knowledge from symbolic music data and significantly improves performance on downstream tasks.
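
The abstract does not spell out how QM or KP work, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes each note is a tuple of four attributes (bar, position, pitch, duration), that masking hides all four attributes of a selected note together, and that the key prediction target is a single key label per piece. The names `MASK`, `mask_notes`, and `make_example`, the mask probability, and the label format are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_notes(notes, mask_prob=0.15, rng=None):
    """Hide all four attributes of randomly selected notes.

    Assumption: a note is a (bar, position, pitch, duration) tuple and the
    four attributes of a selected note are masked jointly.
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for note in notes:
        if rng.random() < mask_prob:
            masked.append((MASK, MASK, MASK, MASK))
            targets.append(note)   # the model would have to recover this note
        else:
            masked.append(note)
            targets.append(None)   # no reconstruction target at this position
    return masked, targets


def make_example(notes, key_label, mask_prob=0.15, rng=None):
    """Bundle the masked sequence with a piece-level key label (multi-task)."""
    masked, targets = mask_notes(notes, mask_prob, rng)
    return {"inputs": masked, "mask_targets": targets, "key_target": key_label}


# Toy input: three notes of a C major triad in the first bar.
notes = [(0, 0, 60, 4), (0, 4, 64, 4), (0, 8, 67, 8)]
example = make_example(notes, key_label="C major", mask_prob=0.5,
                       rng=random.Random(0))
```

In a multi-task setup of this kind, a model would be trained to reconstruct the entries of `mask_targets` and to classify `key_target` at the same time, typically by summing the two losses; how the paper weights or schedules the two objectives is not stated in the abstract.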

Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023
694 pages
ISBN: 9798400701788
DOI: 10.1145/3591106

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. deep learning
2. masking strategies
3. multi-task pre-training
4. pre-trained models
5. symbolic music

Qualifiers

• Short-paper
• Research
• Refereed limited

Conference

ICMR '23

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
