Abstract
Many deep learning-based automatic music generation models have been proposed in recent years. Generating long pieces of pop music with distinctive musical characteristics nevertheless remains challenging, because it depends heavily on musical structure. Some transformer-based models exploit self-attention to generate long music sequences, but most pay little attention to well-organized musical structure. In this article, we propose the Bar Transformer, a novel note-to-bar hierarchical model that addresses long-term dependency issues and generates impressive, structurally meaningful music. In particular, we propose a note-to-bar approach that pre-processes the notes within each individual bar, providing a strong structural constraint that increases the model's awareness of the note-to-bar structure of music. The Bar Transformer follows an encoder-decoder framework consisting of a two-layer encoder and an arrangement decoder. In the two-layer encoder, the bottom layer is a note-level encoder that produces embeddings by learning the relations among the notes within an individual bar, and the top layer is a bar-level encoder that uses these embeddings to encode each bar of the melody and chords. The arrangement decoder generalizes the interrelationships among the bars and generates melodies and chords simultaneously. Experimental results from structural analyses and aural evaluations demonstrate that our approach outperforms the Music Transformer and other autoregressive models used for music generation.
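The note-to-bar hierarchy described above can be sketched in code. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a note-level encoder learns relations among the note tokens within each bar, each bar is pooled into a single embedding, and a bar-level encoder then relates the bars to one another. All module names, dimensions, and the mean-pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoteToBarEncoder(nn.Module):
    """Hypothetical sketch of a two-layer note-to-bar encoder."""

    def __init__(self, vocab_size=128, d_model=64, nhead=4):
        super().__init__()
        self.note_emb = nn.Embedding(vocab_size, d_model)
        # Bottom layer: encodes relations among notes within one bar.
        note_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.note_encoder = nn.TransformerEncoder(note_layer, num_layers=2)
        # Top layer: encodes relations among the per-bar embeddings.
        bar_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.bar_encoder = nn.TransformerEncoder(bar_layer, num_layers=2)

    def forward(self, notes):
        # notes: (batch, n_bars, notes_per_bar) integer note tokens
        b, n_bars, n_notes = notes.shape
        x = self.note_emb(notes.view(b * n_bars, n_notes))
        x = self.note_encoder(x)              # note-level self-attention per bar
        bar_emb = x.mean(dim=1)               # pool each bar into one embedding
        bar_emb = bar_emb.view(b, n_bars, -1)
        return self.bar_encoder(bar_emb)      # bar-level self-attention

enc = NoteToBarEncoder()
out = enc(torch.randint(0, 128, (2, 8, 16)))  # 2 songs, 8 bars, 16 notes per bar
print(out.shape)
```

In such a design the bar-level encoder never sees individual notes, only one embedding per bar, which is what constrains the model toward bar-level structure; the arrangement decoder (omitted here) would attend over these bar encodings to generate melody and chords.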
References
Briot JP (2021) From artificial neural networks to deep learning for music generation: history, concepts and trends. Neural Comput Applic 33(1):39–65. https://doi.org/10.1007/s00521-020-05399-0
Briot JP, Pachet F (2020) Deep learning for music generation: challenges and directions. Neural Comput Applic 32(4):981–993. https://doi.org/10.1007/s00521-018-3813-6
Brown T, Mann B, Ryder N et al (2020) Language models are few-shot learners. In: Advances in neural information processing systems (NeurIPS), pp 1877–1901
Brunner G, Wang Y, Wattenhofer R et al (2017) Jambot: Music theory aware chord based generation of polyphonic music with lstms. In: 2017 IEEE 29th international conference on tools with artificial intelligence (ICTAI), IEEE, pp 519–526. https://doi.org/10.1109/ICTAI.2017.00085
Brunner G, Konrad A, Wang Y et al (2018) Midi-vae: Modeling dynamics and instrumentation of music with applications to style transfer. In: Proceedings of the 19th international society for music information retrieval conference (ISMIR), pp 747–754
Choi K, Hawthorne C, Simon I et al (2020) Encoding musical style with transformer autoencoders. In: International conference on machine learning (ICML), pp 1899–1908
Chu H, Urtasun R, Fidler S (2017) Song from pi: a musically plausible network for pop music generation. In: 5th International conference on learning representations (ICLR)
Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation. In: Proceedings of the AAAI conference on artificial intelligence (AAAI), pp 2159–2166
Chung J, Ahn S, Bengio Y (2017) Hierarchical multiscale recurrent neural networks. In: 5th International conference on learning representations (ICLR)
Devlin J, Chang MW, Lee K et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the north american chapter of the association for computational linguistics: human language technologies (NAACL-HLT), pp 4171–4186. https://doi.org/10.18653/v1/N19-1423
Dong HW, Yang YH (2018) Convolutional generative adversarial networks with binary neurons for polyphonic music generation. In: Proceedings of the 19th international society for music information retrieval conference (ISMIR), pp 190–196
Dong HW, Hsiao WY, Yang LC et al (2018) Musegan: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Proceedings of the AAAI conference on artificial intelligence (AAAI), pp 34–41
Furner M, Islam MZ, Li CT (2021) Knowledge discovery and visualisation framework using machine learning for music information retrieval from broadcast radio data. Expert Syst Appl 182:115236. https://doi.org/10.1016/j.eswa.2021.115236
Gao T, Cui Y, Ding F (2021) Seqvae: sequence variational autoencoder with policy gradient. Appl Intell, pp 1–8
Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE International conference on acoustics, speech and signal processing, IEEE, pp 6645–6649 https://doi.org/10.1109/ICASSP.2013.6638947
Guo Z, Makris D, Herremans D (2021) Hierarchical recurrent neural networks for conditional melody generation with long-term structure. In: International joint conference on neural networks (IJCNN), pp 1–8. https://doi.org/10.1109/IJCNN52387.2021.9533493
Hadjeres G, Pachet F, Nielsen F (2017) Deepbach: a steerable model for bach chorales generation. In: International conference on machine learning (ICML), pp 1362–1371
Huang CZA, Vaswani A, Uszkoreit J et al (2019) Music transformer: generating music with long-term structure. In: 7th International conference on learning representations (ICLR)
Huang YS, Yang YH (2020) Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In: Proceedings of the 28th ACM international conference on multimedia, pp 1180–1188
Kingma D, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations (ICLR)
Liang FT, Gotham M, Johnson M et al (2017) Automatic stylistic composition of bach chorales with deep lstm. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR), pp 449–456
Ockelford A (2017) Repetition in music: Theoretical and metatheoretical perspectives. Routledge
Oord AVD, Dieleman S, Zen H et al (2016) Wavenet: a generative model for raw audio. In: The 9th ISCA speech synthesis workshop, p 125
Pappagari R, Zelasko P, Villalba J et al (2019) Hierarchical transformers for long document classification. In: 2019 IEEE Automatic speech recognition and understanding workshop (ASRU), IEEE, pp 838–844. https://doi.org/10.1109/ASRU46091.2019.9003958
Paszke A, Gross S, Massa F et al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems (NeurIPS), pp 8024–8035
Pauwels J, O’Hanlon K, Gómez E et al (2019) 20 years of automatic chord recognition from audio. In: Proceedings of the 20th International society for music information retrieval conference (ISMIR), pp 54–63
Payne C (2019) Musenet. https://openai.com/blog/musenet
Roberts A, Engel J, Raffel C et al (2018) A hierarchical latent vector model for learning long-term structure in music. In: International conference on machine learning (ICML), pp 4364–4373
Roig C, Tardón LJ, Barbancho I et al (2018) A non-homogeneous beat-based harmony markov model. Knowl-Based Syst 142:85–94. https://doi.org/10.1016/j.knosys.2017.11.027
Shaw P, Uszkoreit J, Vaswani A (2018) Self-attention with relative position representations. In: Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: human language technologies (NAACL-HLT), pp 464–468
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems (NeurIPS), pp 5998–6008
Villegas R, Yang J, Zou Y et al (2017) Learning to generate long-term future via hierarchical prediction. In: International conference on machine learning (ICML), pp 3560–3569
Waite E (2016) Project magenta: generating long-term structure in songs and stories. https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn/
Wang Z, Zhang Y, Zhang Y et al (2020) Pianotree vae: Structured representation learning for polyphonic music. In: Proceedings of the 21st international society for music information retrieval conference (ISMIR), pp 368–375
Wu J, Hu C, Wang Y et al (2020) A hierarchical recurrent neural network for symbolic melody generation. IEEE Trans Cybern 50(6):2749–2757. https://doi.org/10.1109/TCYB.2019.2953194
Wu J, Liu X, Hu X et al (2020) Popmnet: Generating structured pop music melodies using neural networks. Artif Intell 286:103303. https://doi.org/10.1016/j.artint.2020.103303
Yang LC, Chou SY, Yang YH (2017) Midinet: A convolutional generative adversarial network for symbolic-domain music generation. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR), pp 324–331
Ycart A, Benetos E (2020) Learning and evaluation methodologies for polyphonic music sequence prediction with lstms. IEEE/ACM Trans Audio Speech Lang Process 28:1328–1341
Zhang N (2020) Learning adversarial transformer for symbolic music generation. IEEE Trans Neural Netw Learn Syst, pp 1–10. https://doi.org/10.1109/TNNLS.2020.2990746
Zhu H, Liu Q, Yuan NJ et al (2018) Xiaoice band: a melody and arrangement generation framework for pop music. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2837–2846
Zhu H, Liu Q, Yuan NJ et al (2020) Pop music generation: from melody to multi-style arrangement. ACM Trans Knowl Discov Data 14(5):1–31. https://doi.org/10.1145/3374915
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62076077 and No. 61903090, the Guangxi Science and Technology Major Project under Grant No. AA22068057, the Guangxi Scientific Research Basic Ability Enhancement Program for Young and Middle-aged Teachers under Grant No. 2022KY0183, and the School Foundation of Guilin University of Aerospace Technology under Grant No. XJ21KT32.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Additional Figures
To facilitate reading and comparison, we provide additional three-dimensional structure histograms for Figs. 5 and 6, computed from the real melodies in the dataset and from sample melodies generated by the different models. Each histogram represents the structure of the first 32 bars of a melody. A nonzero element at position (i, j) on the X and Y axes, with i > j, denotes that the i-th bar repeats the j-th bar; the height of the colored cylinder along the z-axis denotes the repetition distance. Best viewed in color and zoomed in.
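The repetition structure plotted in these histograms can be illustrated with a short sketch. The function below is a hypothetical reconstruction, under the assumption that "bar i repeats bar j" means the two bars contain identical note sequences; the name `repetition_structure` and the bar representation are our own illustrative choices.

```python
def repetition_structure(bars):
    """Map each repeating bar pair (i, j), i > j, to its repetition distance.

    bars: list of tuples of note tokens, one tuple per bar (e.g. the
    first 32 bars of a melody).
    """
    heights = {}
    for i in range(len(bars)):
        for j in range(i):
            if bars[i] == bars[j]:       # bar i repeats bar j
                heights[(i, j)] = i - j  # z-axis height: repetition distance
    return heights

# Toy melody: bar 2 repeats bar 0 and bar 3 repeats bar 1.
bars = [(60, 62), (64, 65), (60, 62), (64, 65)]
print(repetition_structure(bars))  # {(2, 0): 2, (3, 1): 2}
```

Plotting these (i, j) positions with cylinder heights equal to the stored distances yields histograms of the form shown in the appendix figures.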
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qin, Y., Xie, H., Ding, S. et al. Bar transformer: a hierarchical model for learning long-term structure and generating impressive pop music. Appl Intell 53, 10130–10148 (2023). https://doi.org/10.1007/s10489-022-04049-3