Abstract
This paper presents Scene2Wav, a novel deep convolutional model for music generation from emotionally annotated video. The task matters because a video paired with appropriate audio can substantially strengthen the emotional effect the resulting music video has on viewers. The challenge lies in crossing from the visual domain to the audio domain while generating coherent music. The Scene2Wav encoder is a convolutional sequence encoder that embeds dynamic emotional visual features from low-level colour-space features, namely Hue, Saturation and Value. The Scene2Wav decoder is a conditional SampleRNN that uses this emotional visual embedding as its condition to generate novel emotional music. The entire model is fine-tuned end to end so that the generated music signal evokes the intended emotional response in the listener. By addressing both the emotional and the generative aspects of the task, this work contributes to the field of Human-Computer Interaction and is a stepping stone towards an AI movie or drama director able to automatically generate appropriate music for trailers and films. Experimental results show that the model generates music that users prefer over that of the baseline model and that evokes the intended emotions.
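To make this conditioning scheme concrete, the following is a minimal PyTorch sketch of the pipeline described above: a per-frame CNN followed by a GRU sequence encoder embeds the HSV frames into a single condition vector, which is then broadcast to every time step of a simplified, single-tier SampleRNN-style decoder. All module names, layer sizes and the single-tier decoder are illustrative assumptions rather than the authors' implementation; in particular, the actual SampleRNN generates audio with multiple tiers operating at different temporal resolutions.

```python
# Minimal sketch of a Scene2Wav-style pipeline (illustrative assumptions only):
# an HSV frame-sequence encoder conditioning an autoregressive audio decoder.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Embeds a sequence of HSV video frames into one visual condition vector."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame convolutional features
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(64, emb_dim, batch_first=True)  # sequence encoder

    def forward(self, frames):                    # frames: (B, T, 3, H, W) in HSV
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
        _, h = self.rnn(feats)
        return h[-1]                              # (B, emb_dim) emotional embedding

class ConditionalSampleDecoder(nn.Module):
    """Single-tier, SampleRNN-style decoder over 8-bit quantized audio samples."""
    def __init__(self, emb_dim=256, hidden=512, quant=256):
        super().__init__()
        self.sample_emb = nn.Embedding(quant, 64)
        self.rnn = nn.GRU(64 + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, quant)       # logits over quantized samples

    def forward(self, samples, cond):             # samples: (B, L) int64, cond: (B, emb_dim)
        x = self.sample_emb(samples)
        c = cond.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast condition per step
        h, _ = self.rnn(torch.cat([x, c], dim=-1))
        return self.out(h)                        # next-sample logits, (B, L, quant)

# Usage: encode a 16-frame clip, then score next-sample distributions for a
# 1024-sample audio context; training would apply cross-entropy to these logits.
enc, dec = SceneEncoder(), ConditionalSampleDecoder()
cond = enc(torch.rand(1, 16, 3, 64, 64))
logits = dec(torch.randint(0, 256, (1, 1024)), cond)   # (1, 1024, 256)
```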