Music performance style transfer for learning expressive musical performance

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Generating expressive musical performance (EMP) is a prominent challenge in the field of music generation. Music played by humans is consistently more expressive than music produced by machines. To close this gap, it is crucial to explore the role of human performance in the production of music. This paper proposes a performance style transfer model that learns human performance style and implements an EMP system. Our model is built on generative adversarial networks (GANs), with a multi-channel image composed of four carefully designed spectrograms serving as the input for decomposing and reconstructing music audio. To ensure training stability, we design a multi-channel consistency loss for the GANs. Furthermore, given the lack of objective evaluation criteria for music generation, we propose a hybrid evaluation method that combines qualitative and quantitative approaches to assess how well the generated music satisfies human listeners, introducing three quantitative criteria at the feature and audio levels. The effectiveness of our method is verified on a public dataset through objective evaluation, demonstrating performance comparable to state-of-the-art algorithms. Additionally, subjective evaluations are conducted through visual analyses of both audio content and style. Finally, we conduct a musical Turing test in which subjects score the performance of the generated music. A series of experimental results show that our method is highly competitive.
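The abstract names two concrete technical components: a four-channel spectrogram "image" used to decompose and reconstruct the audio, and a multi-channel consistency loss that stabilizes GAN training. Neither is specified in detail in this preview, so the Python sketch below is an illustrative assumption only: the four representations chosen here (log-magnitude STFT, log-mel spectrogram, chroma, and MFCC) are hypothetical stand-ins for the paper's four elaborated spectrograms, resized and stacked into a single GAN-ready image.

```python
# A minimal sketch of assembling a four-channel "spectrogram image" from one
# audio clip. The paper does not specify its four spectrograms in this
# preview, so these feature choices are assumptions, not the authors' actual
# input pipeline.
import numpy as np
import librosa
from scipy.ndimage import zoom

def spectrogram_image(path, size=256, sr=22050, hop=512):
    y, _ = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y, hop_length=hop))
    feats = [
        librosa.amplitude_to_db(stft, ref=np.max),                    # log-magnitude STFT
        librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop),
            ref=np.max),                                              # log-mel spectrogram
        librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop),      # pitch-class energy
        librosa.feature.mfcc(y=y, sr=sr, hop_length=hop, n_mfcc=20),  # timbral envelope
    ]
    channels = []
    for f in feats:
        f = zoom(f, (size / f.shape[0], size / f.shape[1]), order=1)  # resize to size x size
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)                # normalize to [0, 1]
        channels.append(f.astype(np.float32))
    return np.stack(channels)  # shape (4, size, size): one multi-channel image
```

Likewise, the multi-channel consistency loss is not given here. A plausible minimal form, sketched below under the assumption of a CycleGAN-style reconstruction path, is a per-channel weighted L1 penalty between an input image and its reconstruction, so that no single spectrogram channel dominates training; the L1 form and the uniform weights are assumptions, not the authors' published loss.

```python
import torch

# A hedged sketch of a per-channel consistency term for the 4-channel input
# above. The L1 form and uniform weights are assumptions; the paper's exact
# loss is not reproduced in this preview.
def multichannel_consistency_loss(x, x_rec, weights=(1.0, 1.0, 1.0, 1.0)):
    # x, x_rec: tensors of shape (batch, 4, height, width)
    w = torch.tensor(weights, device=x.device).view(1, -1, 1, 1)
    return (w * (x - x_rec).abs()).mean()
```

In a CycleGAN-style training loop, such a term would typically be added, suitably weighted, to the adversarial objectives of both generators.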

Availability of data and materials

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

  1. https://github.com/XiaoZhecug/MST

  2. https://www.piano-e-competition.com/

Funding

This work is supported by the Technical Innovation Major Project of Hubei Province, China, under Grant 2020AEA010 and the Natural Science Foundation of Hubei Province, China, under Grant 2020CFA031 and Grant 2019CFB581.

Author information

Authors and Affiliations

Authors

Contributions

ZX and XC wrote the main manuscript text, and LZ polished it. All authors reviewed the manuscript.

Corresponding author

Correspondence to Xin Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethical approval

This declaration is not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xiao, Z., Chen, X. & Zhou, L. Music performance style transfer for learning expressive musical performance. SIViP 18, 889–898 (2024). https://doi.org/10.1007/s11760-023-02788-5
