Abstract
Writing a video from a text script (i.e., video editing) is an important but challenging multimedia task. Although a number of recent works have begun to develop deep learning models for video editing, they mainly focus on writing a video from a generic text script and are not suitable for specific domains such as song lyrics. In this paper, we therefore introduce a novel video editing task called song-to-video translation (S2VT), which aims to write a video from song lyrics based on multimodal pre-training. Like generic video editing, the S2VT task has three main steps: lyric-to-shot retrieval, shot selection, and shot stitching. However, it differs substantially from generic video editing in that song lyrics are often more abstract and harder to interpret than a common text script, so a large-scale multimodal pre-training model is needed for lyric-to-shot retrieval. To facilitate research on S2VT, we construct a benchmark dataset with human annotations according to three evaluation metrics (i.e., semantic consistency, content coherence, and rhythm matching). Further, we propose a baseline method for S2VT that trains three classifiers (one per metric) and develops a beam shot-selection algorithm based on the trained classifiers. Extensive experiments show the effectiveness of the proposed baseline method on the S2VT task.
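The abstract does not spell out the beam shot-selection algorithm, so the following is only a minimal sketch of how a beam search over retrieved shots could be organised. It assumes each lyric line already has a list of candidate shots, and that the three trained classifiers are exposed as lookup scores. The names `beam_select`, `score_step`, `SEMANTIC`, `COHERENCE`, and `RHYTHM`, the additive scoring rule, the no-reuse constraint, and all numbers are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the three trained classifiers (toy numbers).
SEMANTIC = {(0, "a1"): 0.9, (0, "a2"): 0.8, (1, "b1"): 0.5, (1, "b2"): 0.6}
RHYTHM = {(0, "a1"): 0.5, (0, "a2"): 0.5, (1, "b1"): 0.5, (1, "b2"): 0.5}
COHERENCE = {("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
             ("a2", "b1"): 0.9, ("a2", "b2"): 0.3}


def score_step(line_idx: int, path: List[str], shot: str) -> float:
    """Score for appending `shot` at lyric line `line_idx` (assumed additive)."""
    score = SEMANTIC[(line_idx, shot)] + RHYTHM[(line_idx, shot)]
    if path:  # coherence is only defined relative to the previous shot
        score += COHERENCE[(path[-1], shot)]
    return score


def beam_select(
    candidates: List[List[str]],
    step_score: Callable[[int, List[str], str], float],
    beam_width: int = 3,
) -> Tuple[float, List[str]]:
    """Pick one shot per lyric line, keeping the `beam_width` best partial paths."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for line_idx, shots in enumerate(candidates):
        expanded = []
        for total, path in beams:
            for shot in shots:
                if shot in path:
                    continue  # assumption: a shot is used at most once
                expanded.append(
                    (total + step_score(line_idx, path, shot), path + [shot])
                )
        if not expanded:
            raise ValueError(f"no usable shot for lyric line {line_idx}")
        # Sort by accumulated score only, then prune to the beam width.
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


# Candidate shots retrieved for two lyric lines (toy data).
candidates = [["a1", "a2"], ["b1", "b2"]]
best_score, best_path = beam_select(candidates, score_step, beam_width=2)
print(best_path)  # ['a2', 'b1']
```

On this toy data, a beam width of 1 (greedy selection) commits to `a1` for the first line and ends with `['a1', 'b2']`, whereas a width of 2 keeps `a2` alive long enough for its strong coherence with `b1` to win. That is the usual motivation for a beam over a purely greedy choice.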
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61976220).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fu, F., Sun, Z., Yang, G., He, X., Lu, Z. (2023). Song-to-Video Translation: Writing a Video from Song Lyrics Based on Multimodal Pre-training. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14177. Springer, Cham. https://doi.org/10.1007/978-3-031-46664-9_14
DOI: https://doi.org/10.1007/978-3-031-46664-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46663-2
Online ISBN: 978-3-031-46664-9
eBook Packages: Computer Science, Computer Science (R0)