Similarity Analysis of Visual Sketch-based Search for Sounds

ABSTRACT
Searching a large audio database for a specific sound can be a slow and tedious task with detrimental effects on creative workflow. Listening to each sample is time consuming, while textual descriptions or tags may be insufficient, unavailable, or simply unable to meaningfully capture certain sonic qualities. This paper explores the use of visual sketches that express the mental model associated with a sound to accelerate the search process. To this end, a study was conducted in which 30 participants provided hand-sketched visual representations for 30 different sounds, yielding data on how people visually represent sound. After augmenting this sparse data set to 855 samples, two different autoencoders were trained. The first finds similar sketches in latent space and returns the audio files associated with them. The second is a multimodal autoencoder that combines visual and sonic cues in a common feature space, with the limitation that no audio input is available at search time. Both models were then used to implement and discuss a visual query-by-sketch search interface for sounds.
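As a rough illustration of the first, sketch-only approach, the snippet below shows a minimal convolutional autoencoder over rasterized sketches together with cosine-similarity retrieval of nearest neighbours in latent space. This is a sketch of the general technique, not the authors' implementation: the input resolution, layer sizes, latent dimensionality, and function names are assumptions made for illustration.

```python
# Illustrative only: a minimal convolutional sketch autoencoder plus
# latent-space retrieval. All architecture choices (64x64 grayscale input,
# 32-dim latent code) are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)            # latent code used for retrieval
        return self.decoder(z), z      # reconstruction drives training


def retrieve(query_sketch, db_latents, db_audio_paths, model, k=5):
    """Embed a query sketch and return the audio files whose stored
    sketch embeddings are nearest (cosine similarity) in latent space."""
    model.eval()
    with torch.no_grad():
        _, zq = model(query_sketch.unsqueeze(0))    # (1, latent_dim)
    sims = F.cosine_similarity(zq, db_latents)      # broadcast to (N,)
    top = sims.topk(min(k, sims.numel())).indices.tolist()
    return [db_audio_paths[i] for i in top]
```

The multimodal variant described in the abstract would replace the single encoder with one encoder per modality (sketch raster and an audio representation such as a mel spectrogram), trained so that both map into a shared latent space in the spirit of correspondence autoencoders; at query time only the sketch encoder is used, which is exactly the limitation the abstract notes.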