Similarity Analysis of Visual Sketch-based Search for Sounds

ABSTRACT
Searching a large audio database for a specific sound can be a slow and tedious task with detrimental effects on creative workflow. Listening to each sample is time consuming, while textual descriptions or tags may be insufficient, unavailable, or simply unable to meaningfully capture certain sonic qualities. This paper explores the use of visual sketches that express the mental model associated with a sound to accelerate the search process. To this end, a study was conducted in which 30 participants provided hand-sketched visual representations for 30 different sounds, yielding data on how people visually represent sound. After augmenting this sparse data set to 855 samples, two different autoencoders were trained. The first finds similar sketches in latent space and returns the audio files associated with them. The second is a multimodal autoencoder that combines visual and sonic cues in a common feature space, with the limitation that no audio input is available at search time. Both models were then used to implement and discuss a visual query-by-sketch search interface for sounds.
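As a rough illustration of the first, sketch-only approach, the snippet below shows a minimal convolutional autoencoder over rasterized sketches together with cosine-similarity retrieval of nearest neighbours in latent space. This is a sketch of the general technique, not the authors' implementation: the input resolution, layer sizes, latent dimensionality, and function names are assumptions made for illustration.

```python
# Illustrative only: a minimal convolutional sketch autoencoder plus
# latent-space retrieval. All architecture choices (64x64 grayscale input,
# 32-dim latent code) are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)            # latent code used for retrieval
        return self.decoder(z), z      # reconstruction drives training


def retrieve(query_sketch, db_latents, db_audio_paths, model, k=5):
    """Embed a query sketch and return the audio files whose stored
    sketch embeddings are nearest (cosine similarity) in latent space."""
    model.eval()
    with torch.no_grad():
        _, zq = model(query_sketch.unsqueeze(0))    # (1, latent_dim)
    sims = F.cosine_similarity(zq, db_latents)      # broadcast to (N,)
    top = sims.topk(min(k, sims.numel())).indices.tolist()
    return [db_audio_paths[i] for i in top]
```

The multimodal variant described in the abstract would replace the single encoder with one encoder per modality (sketch raster and an audio representation such as a mel spectrogram), trained so that both map into a shared latent space in the spirit of correspondence autoencoders; at query time only the sketch encoder is used, which is exactly the limitation the abstract notes.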