CST: a melody generation method based on ChatGPT and Structure Transformer

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Lyric-to-melody (L2M) generation has garnered significant attention in recent years. However, existing L2M systems face challenges in integrating user-specified and automatically generated lyrics, ensuring structural coherence in melodies, and achieving precise lyric-melody alignment. To address these issues, this paper proposes the Chat Structure Transformer (CST), an L2M system that combines ChatGPT with a Structure Transformer. Specifically, CST leverages ChatGPT’s advanced text generation capabilities to produce thematically consistent lyrics while also accommodating user-specified lyrics, thereby enhancing the flexibility of lyric generation. Meanwhile, the Structure Transformer introduces the StruAttention module for automatic recognition of musical structure and employs a customized loss function based on reinforcement-learning principles. Together, these components enhance the structural coherence and lyric-melody alignment of the generated melodies. Both subjective and objective evaluations demonstrate that CST produces higher-quality melodies than previous systems. Our code is available at https://github.com/liuasdeu/cst. Our music is available at https://github.com/liuasdeu/cst/tree/main/evaluation/cst.
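
The full system description is not reproduced on this page, so only the abstract's high-level design is available. As an illustration of the kind of mechanism a structure-aware attention module like StruAttention might use, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes structure is given as per-token section labels (the `structure_ids` argument is a hypothetical name) and adds a learned per-head bias to the attention logits between tokens that share a section, so that repeated sections such as two choruses can attend to one another more strongly.

```python
# Illustrative sketch only (assumption): the abstract does not expose
# StruAttention's formulation, so this shows one plausible structure-aware
# attention variant. Names such as `structure_ids` are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StruAttention(nn.Module):
    """Self-attention with a learned bias between tokens that share a
    structural label (e.g. verse/chorus section IDs)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Per-head scalar bonus added to logits of same-section token pairs.
        self.struct_bias = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor, structure_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); structure_ids: (batch, seq) integer labels
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.d_head**0.5  # (b, h, t, t)
        # Boost attention between positions carrying the same structure label.
        same = structure_ids.unsqueeze(-1) == structure_ids.unsqueeze(-2)  # (b, t, t)
        logits = logits + self.struct_bias.view(1, -1, 1, 1) * same.unsqueeze(1)
        attn = F.softmax(logits, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)


# Usage: 2 sequences of 16 tokens, 3 section types (e.g. intro/verse/chorus).
x = torch.randn(2, 16, 256)
structure_ids = torch.randint(0, 3, (2, 16))
out = StruAttention(d_model=256, n_heads=8)(x, structure_ids)  # (2, 16, 256)
```

In a full L2M pipeline a block like this would stand in for the standard self-attention inside a Transformer, with the section labels supplied by the automatic structure recognition the abstract describes; how the paper actually realizes this is an open question from the abstract alone.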


Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Author information

Contributions

R.H. and R.L. proposed the main ideas and conducted the related experiments, while T.P. and X.H. supervised and guided the research project. R.H. and R.L. wrote the manuscript, and T.P. and X.H. revised the manuscript and organized the figures and tables. All authors reviewed the manuscript.

Corresponding author

Correspondence to Ruixue Liu.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, R., Liu, R., Peng, T. et al. CST: a melody generation method based on ChatGPT and Structure Transformer. Multimedia Systems 31, 244 (2025). https://doi.org/10.1007/s00530-025-01802-9

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-025-01802-9
