Scene2Wav: a deep convolutional sequence-to-conditional SampleRNN for emotional scene musicalization

Published in: Multimedia Tools and Applications

Abstract

This paper presents Scene2Wav, a novel deep convolutional model for generating music from emotionally annotated video. The task matters because, when paired with appropriate audio, a video's emotional effect on viewers is strengthened; the challenge lies in crossing from the visual to the audio domain while generating coherent music. The Scene2Wav encoder is a convolutional sequence encoder that embeds dynamic emotional visual features computed from low-level colour-space features, namely Hue, Saturation and Value (HSV). The Scene2Wav decoder is a conditional SampleRNN that uses this emotional visual embedding as its condition to generate novel emotional music. The entire model is fine-tuned end-to-end so that the generated music signal evokes the intended emotional response in the listener. By addressing both the emotional and the generative aspects of the task, this work contributes to the field of Human-Computer Interaction and is a stepping stone towards an AI movie or drama director able to automatically generate appropriate music for trailers and films. Experimental results show that the model generates music that users prefer over the baseline model's output and that evokes the intended emotions.
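
To make the encoder-decoder conditioning concrete, below is a minimal PyTorch sketch of the structure the abstract describes: a convolutional sequence encoder that turns a clip of HSV frames into a single visual embedding, and an autoregressive decoder conditioned on that embedding at every step. All layer sizes and names here are illustrative assumptions, and the single-tier decoder is a deliberate simplification of SampleRNN's multi-tier hierarchy; the authors' actual implementation is in the linked Scene2Wav repository.

```python
# Hedged sketch of a Scene2Wav-style encoder/decoder, not the paper's code.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Embeds a sequence of HSV video frames into one visual feature vector."""
    def __init__(self, embed_dim=128):
        super().__init__()
        # Per-frame CNN over the 3 HSV channels (filter sizes are assumptions).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # A GRU aggregates the per-frame features across time.
        self.rnn = nn.GRU(32, embed_dim, batch_first=True)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(b, t, 32)
        _, h = self.rnn(f)
        return h[-1]                            # final state: (B, embed_dim)

class ConditionalDecoder(nn.Module):
    """Autoregressive next-sample predictor conditioned on the visual
    embedding (a single-tier simplification of SampleRNN's hierarchy)."""
    def __init__(self, embed_dim=128, hidden=256, quant=256):
        super().__init__()
        self.emb = nn.Embedding(quant, 64)      # 8-bit quantized audio samples
        self.rnn = nn.GRU(64 + embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, quant)     # distribution over next sample

    def forward(self, samples, cond):           # samples: (B, L), cond: (B, E)
        x = self.emb(samples)
        c = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.rnn(torch.cat([x, c], dim=-1))
        return self.out(h)                      # (B, L, quant) logits

# End-to-end: video frames in, next-sample logits out.
encoder, decoder = SceneEncoder(), ConditionalDecoder()
frames = torch.rand(2, 8, 3, 64, 64)            # 2 clips of 8 HSV frames
history = torch.randint(0, 256, (2, 100))       # quantized waveform history
logits = decoder(history, encoder(frames))      # (2, 100, 256)
```

Training such a model end-to-end would minimize cross-entropy between the predicted logits and the next quantized audio sample, so gradients from the audio loss also fine-tune the visual encoder, matching the end-to-end fine-tuning the abstract describes.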

Notes

  1. Code available at https://github.com/gcunhase/Scene2Wav

  2. Pre-processing code: https://github.com/gcunhase/AnnotatedMV-PreProcessing (a sketch of the kind of HSV feature extraction this step involves follows these notes)

  3. Audio available online at https://github.com/gcunhase/Scene2Wav/tree/master/results_generated_samples

  4. https://tinyurl.com/y7rn8jqj
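
As a rough illustration of the HSV pre-processing referenced in note 2, the sketch below samples frames from a video and computes frame-level Hue, Saturation and Value statistics with OpenCV. The function name and the choice of per-frame channel means are assumptions made for illustration; the actual pipeline in AnnotatedMV-PreProcessing may differ.

```python
# Hedged sketch of low-level HSV feature extraction; requires opencv-python
# and numpy. Not the authors' pre-processing code.
import cv2
import numpy as np

def hsv_features(video_path, n_frames=8):
    """Return mean Hue/Saturation/Value per sampled frame: (n_frames, 3)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    feats = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)   # OpenCV decodes as BGR
        feats.append(hsv.reshape(-1, 3).mean(axis=0))  # frame-level H, S, V
    cap.release()
    return np.array(feats)
```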

Author information

Corresponding author

Correspondence to Minho Lee.

Cite this article

Sergio, G.C., Lee, M. Scene2Wav: a deep convolutional sequence-to-conditional SampleRNN for emotional scene musicalization. Multimed Tools Appl 80, 1793–1812 (2021). https://doi.org/10.1007/s11042-020-09636-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-020-09636-5

Keywords

Navigation