Abstract
Lyric-to-melody (L2M) generation has attracted significant attention in recent years. However, existing L2M systems struggle to integrate user-specified and automatically generated lyrics, to ensure structural coherence in melodies, and to achieve precise lyric-melody alignment. To address these issues, this paper proposes the Chat Structure Transformer (CST), an L2M system that combines ChatGPT with a Structure Transformer. Specifically, CST leverages ChatGPT's advanced text generation capabilities to produce thematically consistent lyrics while also accommodating user-specified lyrics, thereby making lyric generation more flexible. The Structure Transformer, in turn, introduces the StruAttention module for automatic recognition of musical structures and employs a customized loss function based on reinforcement learning principles. Together, these components improve the structural coherence and lyric-melody alignment of the generated melodies. Both subjective and objective evaluations demonstrate that CST produces higher-quality melodies than previous systems. Our code is available at https://github.com/liuasdeu/cst, and generated music samples are available at https://github.com/liuasdeu/cst/tree/main/evaluation/cst.
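As a rough illustration of the two mechanisms the abstract names, the sketch below shows, in PyTorch-style Python, (1) an attention step biased by automatically recognized musical sections and (2) a loss that mixes cross-entropy with a reinforcement-learning-style reward term. This is a minimal sketch under stated assumptions, not the authors' implementation: the names stru_attention and cst_style_loss, the additive same-section bias, and the REINFORCE-style surrogate are illustrative choices.

```python
import torch
import torch.nn.functional as F

def stru_attention(q, k, v, section_ids):
    """Scaled dot-product attention biased toward tokens in the same
    (automatically recognized) musical section.
    q, k, v: (batch, seq, dim); section_ids: (batch, seq) integer tensor."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (batch, seq, seq)
    # Hypothetical structural bias: raise scores between positions that
    # share a section label (e.g. verse/chorus).
    same_section = section_ids.unsqueeze(-1) == section_ids.unsqueeze(-2)
    scores = scores + same_section.float()
    return F.softmax(scores, dim=-1) @ v

def cst_style_loss(logits, targets, reward, beta=0.1):
    """Token-level cross-entropy plus a REINFORCE-style surrogate that
    weights sequence log-probability by a scalar alignment reward.
    logits: (batch, seq, vocab); targets: (batch, seq); reward: (batch,)."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets)
    log_probs = F.log_softmax(logits, dim=-1)
    tok_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    rl = -(reward * tok_lp.mean(dim=-1)).mean()          # reward-weighted log-prob
    return ce + beta * rl

# Example usage: 2 songs, 16 notes, 64-dim embeddings, 4 section labels.
q = k = v = torch.randn(2, 16, 64)
sections = torch.randint(0, 4, (2, 16))
out = stru_attention(q, k, v, sections)
```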





Data availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
Author information
Contributions
R.H. and R.L. proposed the main ideas and conducted the related experiments, while T.P. and X.H. supervised and guided the research project. R.H. and R.L. wrote the manuscript, and T.P. and X.H. revised the manuscript and organized the figures and tables. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, R., Liu, R., Peng, T. et al. CST: a melody generation method based on ChatGPT and Structure Transformer. Multimedia Systems 31, 244 (2025). https://doi.org/10.1007/s00530-025-01802-9