ABSTRACT
It is widely accepted that auditory and visual information can be quite similar in how they express emotion and meaning. To explore this relationship with machine learning, this paper proposes a feasible system for generating drum beats from images. Specifically, the model converts the input image to an embedding vector, computes a corresponding music embedding representing a 4-bar drum set performance for that image embedding, and converts it to a playable MIDI file. The model is trained by categorising the source datasets into a shared set of genres and training on different pairings of images and drum beats within each genre. This paper also includes an evaluation of the system's performance under different configurations.
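The pipeline described above (image → image embedding → music embedding → drum pattern) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the encoder, the mapping, and the decoder are stand-in random linear maps, and the embedding size, grid resolution, and instrument count are assumed values chosen for the example.

```python
import numpy as np

EMB_DIM = 64        # shared embedding size (assumed)
STEPS_PER_BAR = 16  # 16th-note grid (assumed)
N_BARS = 4          # the paper generates 4-bar performances
N_DRUMS = 9         # e.g. kick, snare, hats, toms, cymbals (assumed)

rng = np.random.default_rng(0)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained CNN encoder: mean-pool the image
    over its spatial axes and project to a fixed-size embedding."""
    pooled = image.mean(axis=(0, 1))            # (channels,)
    W = rng.standard_normal((pooled.size, EMB_DIM))
    return pooled @ W                           # (EMB_DIM,)

def image_to_music_embedding(img_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the learned image-to-music mapping;
    here a fixed linear map with a tanh squashing."""
    M = rng.standard_normal((EMB_DIM, EMB_DIM))
    return np.tanh(img_emb @ M)

def decode_drum_grid(music_emb: np.ndarray) -> np.ndarray:
    """Decode the music embedding into a binary 4-bar drum grid
    (time steps x instruments) by thresholding decoder logits."""
    steps = N_BARS * STEPS_PER_BAR
    D = rng.standard_normal((EMB_DIM, steps * N_DRUMS))
    logits = music_emb @ D
    return (logits.reshape(steps, N_DRUMS) > 0).astype(np.int8)

image = rng.random((224, 224, 3))               # dummy RGB input
grid = decode_drum_grid(image_to_music_embedding(encode_image(image)))
print(grid.shape)                               # (64, 9): 4 bars of 16 steps
```

Rendering such a grid as a playable MIDI file, as the system does, could be done with a MIDI library such as pretty_midi by emitting one percussion note per active cell on MIDI channel 10.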
Groovy Pixels: Generating Drum Set Rhythms from Images