Abstract
This paper presents Scene2Wav, a novel deep convolutional model for music generation from emotionally annotated video. The task matters because a video paired with appropriate audio can substantially strengthen the emotional effect the resulting music video has on viewers. The challenge lies in crossing from the visual domain to the audio domain while generating coherent music. The Scene2Wav encoder is a convolutional sequence encoder that embeds dynamic emotional visual features from low-level colour-space features, namely Hue, Saturation and Value. The Scene2Wav decoder is a conditional SampleRNN that uses this emotional visual embedding as its condition to generate novel emotional music. The entire model is fine-tuned end to end so that the generated music signal evokes the intended emotional response in the listener. By addressing both the emotional and the generative aspects of the task, this work contributes to the field of Human-Computer Interaction and is a stepping stone towards an AI movie or drama director able to automatically generate appropriate music for trailers and films. Experimental results show that the model generates music that users prefer over that of the baseline model and that evokes the intended emotions.
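To make this conditioning scheme concrete, the following is a minimal PyTorch sketch of the pipeline described above: a per-frame CNN followed by a GRU sequence encoder embeds the HSV frames into a single condition vector, which is then broadcast to every time step of a simplified, single-tier SampleRNN-style decoder. All module names, layer sizes and the single-tier decoder are illustrative assumptions rather than the authors' implementation; in particular, the actual SampleRNN generates audio with multiple tiers operating at different temporal resolutions.

```python
# Minimal sketch of a Scene2Wav-style pipeline (illustrative assumptions only):
# an HSV frame-sequence encoder conditioning an autoregressive audio decoder.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Embeds a sequence of HSV video frames into one visual condition vector."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame convolutional features
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(64, emb_dim, batch_first=True)  # sequence encoder

    def forward(self, frames):                    # frames: (B, T, 3, H, W) in HSV
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
        _, h = self.rnn(feats)
        return h[-1]                              # (B, emb_dim) emotional embedding

class ConditionalSampleDecoder(nn.Module):
    """Single-tier, SampleRNN-style decoder over 8-bit quantized audio samples."""
    def __init__(self, emb_dim=256, hidden=512, quant=256):
        super().__init__()
        self.sample_emb = nn.Embedding(quant, 64)
        self.rnn = nn.GRU(64 + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, quant)       # logits over quantized samples

    def forward(self, samples, cond):             # samples: (B, L) int64, cond: (B, emb_dim)
        x = self.sample_emb(samples)
        c = cond.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast condition per step
        h, _ = self.rnn(torch.cat([x, c], dim=-1))
        return self.out(h)                        # next-sample logits, (B, L, quant)

# Usage: encode a 16-frame clip, then score next-sample distributions for a
# 1024-sample audio context; training would apply cross-entropy to these logits.
enc, dec = SceneEncoder(), ConditionalSampleDecoder()
cond = enc(torch.rand(1, 16, 3, 64, 64))
logits = dec(torch.randint(0, 256, (1, 1024)), cond)   # (1, 1024, 256)
```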