DOI: 10.1145/3478384.3478423

Similarity Analysis of Visual Sketch-based Search for Sounds

Published: 15 October 2021

ABSTRACT

Searching through a large audio database for a specific sound can be a slow and tedious task with detrimental effects on creative workflow. Listening to each sample is time consuming, while textual descriptions or tags may be insufficient, unavailable, or simply unable to meaningfully capture certain sonic qualities. This paper explores the use of visual sketches that express the mental model associated with a sound to accelerate the search process. To this end, a study was conducted in which 30 people provided hand-sketched visual representations for a range of 30 different sounds. After augmenting the data to a sparse set of 855 samples, two different autoencoders were trained. The first finds similar sketches in latent space and returns the associated audio files. The second is a multimodal autoencoder that combines visual and sonic cues in a common feature space, but has the drawback that no audio input is available during the search task. Both models were then used to implement and discuss a visual query-by-sketch search interface for sounds.
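To make the retrieval idea concrete, the following is a minimal, illustrative sketch of the first approach only, not the authors' actual architecture or training setup: a small convolutional autoencoder over sketch images whose encoder embeds a query sketch into latent space, so that the audio files associated with the nearest reference sketches can be returned. The 64x64 grayscale input size, layer shapes, latent dimension, and the `retrieve` helper are assumptions made for illustration.

```python
# Hypothetical sketch autoencoder + latent-space nearest-neighbour retrieval.
# Assumes 64x64 grayscale sketch images in [0, 1]; all sizes are illustrative.
import torch
import torch.nn as nn

class SketchAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(), # 32 -> 64
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def retrieve(query_sketch, model, db_latents, db_audio_paths, k=5):
    """Return the audio files whose reference sketches lie closest in latent space.

    query_sketch: (1, 64, 64) tensor; db_latents: (N, latent_dim) tensor of
    precomputed encodings; db_audio_paths: list of N associated audio files.
    """
    model.eval()
    with torch.no_grad():
        _, z = model(query_sketch.unsqueeze(0))        # (1, latent_dim)
    dists = torch.cdist(z, db_latents).squeeze(0)       # Euclidean distance to each stored sketch
    nearest = torch.topk(dists, k, largest=False).indices
    return [db_audio_paths[i] for i in nearest]
```

In such a setup, `db_latents` would be precomputed by encoding every sketch in the collected dataset, and `db_audio_paths` would map each reference sketch back to the sound it was drawn for; the multimodal variant described above would instead learn a joint embedding of sketches and audio features, which this snippet does not attempt to show.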

Published in

AM '21: Proceedings of the 16th International Audio Mostly Conference
September 2021, 283 pages
ISBN: 9781450385695
DOI: 10.1145/3478384
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


