
SAMAF: Sequence-to-sequence Autoencoder Model for Audio Fingerprinting

Published: 22 May 2020

Abstract

Audio fingerprinting techniques were developed to index and retrieve audio samples by comparing a content-based compact signature of the audio instead of the entire audio sample, thereby reducing memory and computational expense. Different techniques have been applied to create audio fingerprints; however, with the introduction of deep learning, new data-driven unsupervised approaches are available. This article presents the Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF), which improves hash generation through a novel loss function composed of three terms: Mean Square Error, minimizing the reconstruction error; Hash Loss, minimizing the distance between similar hashes and encouraging clustering; and Bitwise Entropy Loss, minimizing the variation inside the clusters. The performance of the model was assessed on a subset of the VoxCeleb1 dataset, a "speech in-the-wild" dataset. Furthermore, the model was compared against three baselines: Dejavu, a Shazam-like algorithm; Robust Audio Fingerprinting System (RAFS), a Bit Error Rate (BER) methodology robust to time-frequency distortions and coding/decoding transformations; and Panako, a constellation-based algorithm adding time-frequency distortion resilience. Extensive empirical evidence showed that our approach outperformed all the baselines in the audio identification task and in other classification tasks related to the attributes of the audio signal, with an economical hash size of either 128 or 256 bits for one second of audio.
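The three-term objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `samaf_loss`, the weighting coefficients `lam_hash` and `lam_ent`, and the use of a per-group centroid for the Hash Loss are assumptions made for the sketch.

```python
import numpy as np

def samaf_loss(x, x_hat, h, labels, lam_hash=1.0, lam_ent=1.0):
    """Sketch of a three-term objective in the spirit of SAMAF.

    x, x_hat : (batch, T, F) original and reconstructed feature sequences
    h        : (batch, bits) real-valued hash activations in [0, 1]
    labels   : (batch,) group id; items with the same id should hash alike
    lam_*    : hypothetical weighting coefficients (not from the paper)
    """
    # 1) Mean Square Error: reconstruction fidelity of the autoencoder.
    mse = np.mean((x - x_hat) ** 2)

    # 2) Hash Loss: pull hashes of same-group items toward their group
    #    centroid, encouraging clustering of similar audio.
    hash_loss = 0.0
    for g in np.unique(labels):
        grp = h[labels == g]
        centroid = grp.mean(axis=0)
        hash_loss += np.mean((grp - centroid) ** 2)

    # 3) Bitwise Entropy Loss: push each activation toward 0 or 1,
    #    reducing variation inside the clusters when binarized.
    eps = 1e-8
    ent = -np.mean(h * np.log(h + eps) + (1 - h) * np.log(1 - h + eps))

    return mse + lam_hash * hash_loss + lam_ent * ent
```

Under this sketch, near-binary activations (e.g., 0.9/0.1) incur a smaller entropy penalty than uncertain ones at 0.5, which is the intended pressure toward stable hash bits.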

Supplementary Material

a43-baez-suarez-suppl.pdf (baez-suarez.zip)
Supplemental movie, appendix, image, and software files for SAMAF: Sequence-to-sequence Autoencoder Model for Audio Fingerprinting

References

[1]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Software. Retrieved from https://www.tensorflow.org/. Version 1.13.0.
[2]
Shahin Amiriparian, Michael Freitag, Nicholas Cummins, and Björn Schuller. 2017. Sequence to sequence autoencoders for unsupervised representation learning from audio. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE’17).
[3]
Xavier Anguera, Antonio Garzon, and Tomasz Adamek. 2012. MASK: Robust local features for audio fingerprinting. In Proceedings of the International Conference on Multimedia and Expo (ICME’12). 455--460.
[4]
Andreas Arzt, Sebastian Böck, and Gerhard Widmer. 2012. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the 13th International Symposium on Music Information Retrieval (ISMIR’12).
[5]
Chris Bagwell. 2015. SoX—Sound eXchange. Software. Retrieved from http://sox.sourceforge.net/. Version 14.4.2.
[6]
Shumeet Baluja and Michele Covell. 2007. Audio fingerprinting: Combining computer vision and data stream processing. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’07).
[7]
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 2 (Mar. 1994), 157--166.
[8]
Judith C. Brown and Miller S. Puckette. 1992. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Amer. 92, 5 (June 1992), 2698--2701.
[9]
Christopher J. C. Burges, John C. Platt, and Soumya Jana. 2003. Distortion discriminant analysis for audio fingerprinting. IEEE Trans. Speech Aud. Proc. 11, 3 (May 2003), 165--174.
[10]
Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. 2005. A review of audio fingerprinting. J. VLSI Sig. Proc. Syst. Sig. Image Vid. Technol. 41, 3 (Nov. 2005), 271--284.
[11]
Yue Cao, Mingsheng Long, Jianmin Wang, Qiang Yang, and Philip S. Yu. 2016. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD). 1445--1454.
[12]
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1724--1734.
[13]
Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder. Research Note. College of Electrical Engineering and Computer Science, National Taiwan University, Taipei City, Taiwan.
[14]
George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Aud. Speech Lang. Proc. 20, 1 (Jan. 2012), 30--42.
[15]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). 248--255.
[16]
Will Drevo. 2013. Audio Fingerprinting with Python and Numpy. Website. Retrieved from http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/.
[17]
Yong Fan and Shuang Feng. 2016. A music identification system based on audio fingerprint. In Proceedings of the International Conference on Applied Computing and Information Technology (ACIT’16). 363--367.
[18]
Jinyang Gao, H. V. Jagadish, Wei Lu, and Beng Chin Ooi. 2014. DSH: Data sensitive hashing for high-dimensional k-NN search. In Proceedings of the International Conference on Management of Data (SIGMOD’14).
[19]
Yun Gu, Chao Ma, and Jie Yang. 2016. Supervised recurrent hashing for large scale video retrieval. In Proceedings of the ACM on Multimedia Conference (MM’16). 272--276.
[20]
Vishwa Gupta, Gilles Boulianne, and Patrick Cardinal. 2010. Content-based audio copy detection using nearest-neighbor mapping. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’10). 261--264.
[21]
Jaap Haitsma and Ton Kalker. 2002. A highly robust audio fingerprinting system. In Proceedings of the International Conference on Music Information Retrieval (ISMIR’02).
[22]
Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. 2011. Unsupervised learning of sparse features for scalable audio classification. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR’11). 681--686.
[23]
Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel Rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Sig. Proc. Mag. 29, 6 (Nov. 2012), 82--97.
[24]
Geoffrey E. Hinton and Ruslan Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (July 28, 2006), 504--507.
[25]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (Nov. 1997), 1735--1780.
[26]
Che-Jen Hsieh, Jung-Shian Li, and Cheng-Fu Hung. 2007. A robust audio fingerprinting scheme for MP3 copyright. In Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP’07).
[27]
Corey Kereliuk, Bob L. Sturm, and Jan Larsen. 2015. Deep learning and music adversaries. IEEE Trans. Multimedia 17, 11 (Nov. 2015), 2059--2071.
[28]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS’12), Vol. 1. Curran Associates Inc., Lake Tahoe, NV, 1097--1105. Retrieved from http://dl.acm.org/citation.cfm?id=2999134.2999257.
[29]
Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. 2015. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’15). 3270--3278.
[30]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (May 28, 2015), 436--444.
[31]
Hanchao Li, Xiang Fei, Kuo-Ming Chao, Ming Yang, and Chaobo He. 2016. Towards a hybrid deep-learning method for music classification and similarity measurement. In Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE’16).
[32]
Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. 2015. Deep hashing for compact binary codes learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’15). 2475--2483.
[33]
James Lyons. 2017. python_speech_features. Software. Retrieved from https://github.com/jameslyons/python_speech_features. Version 0.6.
[34]
A. Nagrani, J. S. Chung, and A. Zisserman. 2017. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’17).
[35]
Viet-Anh Nguyen and Minh N. Do. 2016. Deep learning based supervised hashing for efficient image retrieval. In Proceedings of the International Conference on Multimedia and Expo (ICME’16). 1--6.
[36]
Chahid Ouali, Pierre Dumouchel, and Vishwa Gupta. 2015. Content-based multimedia copy detection. In Proceedings of the IEEE International Symposium on Multimedia (ISM’15).
[37]
Hamza Özer, Bulent Sankur, and Nasir Memon. 2004. Robust audio hashing for audio identification. In Proceedings of the European Signal Processing Conference (EUSIPCO’04).
[38]
Yongjoo Park, Michael Cafarella, and Barzan Mozafari. 2015. Neighbor-sensitive hashing. J. Proc. VLDB Endow. 9, 3 (Nov. 2015), 144--155.
[39]
Yohan Petetin, Cyrille Laroche, and Aurélien Mayoue. 2015. Deep neural networks for audio scene recognition. In Proceedings of the European Signal Processing Conference (EUSIPCO’15).
[40]
R. Roopalakshmi and G. Ram Mohana Reddy. 2015. A framework for estimating geometric distortions in video copies based on visual-audio fingerprints. J. VLSI Sig. Proc. Syst. Sig. Image Vid. Technol. 9, 1 (Jan. 2015), 201--210.
[41]
Ruslan Salakhutdinov and Geoffrey E. Hinton. 2009. Semantic hashing. Int. Approx. Reas. 50, 7 (July 2009), 969--978.
[42]
Joren Six, Olmo Cornelis, and Marc Leman. 2014. TarsosDSP, a real-time audio processing framework in Java. In Proceedings of the 53rd Audio Engineering Society Conference (AES’14).
[43]
Joren Six and Marc Leman. 2014. Panako: A scalable acoustic fingerprinting system handling time-scale and pitch modification. In Proceedings of the Conference of the International Society for Music Information Retrieval (ISMIR’14).
[44]
Reinhard Sonnleitner and Gerhard Widmer. 2016. Robust quad-based audio fingerprinting. IEEE/ACM Trans. Aud. Speech Lang. Proc. 24, 3 (Mar. 2016), 409--421.
[45]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS’14), Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3104--3112.
[46]
Christian Szegedy, Alexander Toshev, and Dumitru Erhan. 2013. Deep neural networks for object detection. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS’13), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates Inc., 2553--2561. Retrieved from http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf.
[47]
Avery Li-Chun Wang. 2003. An industrial-strength audio search algorithm. In Proceedings of the International Conference on Music Information Retrieval (ISMIR’03).



    Published In

    cover image ACM Transactions on Multimedia Computing, Communications, and Applications
    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 2
    May 2020
    390 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3401894
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 May 2020
    Online AM: 07 May 2020
    Accepted: 01 December 2019
    Revised: 01 August 2019
    Received: 01 September 2018
    Published in TOMM Volume 16, Issue 2


    Author Tags

    1. Deep learning
    2. audio fingerprinting
    3. audio identification
    4. sequence-to-sequence autoencoder

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • North Atlantic Treaty Organization (NATO) Science for Peace and Security Program
    • Department of Homeland Security (DHS)
    • University of Houston I2C Lab (https://i2c.cs.uh.edu/)
    • AWS Cloud Credits for Research (https://aws.amazon.com/research-credits/)
    • Mexican National Council for Science and Technology (CONACYT)


    Cited By

    • (2024) AMG-Embedding: A Self-Supervised Embedding Approach for Audio Identification. In Proceedings of the 32nd ACM International Conference on Multimedia, 9544--9553. DOI: 10.1145/3664647.3681647. Online publication date: 28-Oct-2024.
    • (2024) FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, 4961--4970. DOI: 10.1109/TASLP.2024.3486227.
    • (2024) Metric Learning with Sequence-to-sequence Autoencoder for Content-based Music Identification. ITM Web of Conferences 60, 00007. DOI: 10.1051/itmconf/20246000007. Online publication date: 9-Jan-2024.
    • (2023) Pseudo-Broadcast Music-Speech and Cuesheet Dataset for Background Music Identification/Separation/Detection in TV Broadcast Audio. Journal of Digital Contents Society 24, 1, 59--68. DOI: 10.9728/dcs.2023.24.1.59. Online publication date: 31-Jan-2023.
    • (2023) Effect of Spectrogram Parameters and Noise Types on the Performance of Spectro-temporal Peaks Based Audio Search Method. Gazi University Journal of Science 36, 2, 624--643. DOI: 10.35378/gujs.1000594. Online publication date: 1-Jun-2023.
    • (2023) A Simple and Efficient Method for Dubbed Audio Sync Detection Using Compressive Sensing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW’23), 565--572. DOI: 10.1109/WACVW58289.2023.00063. Online publication date: Jan-2023.
    • (2023) Accuracy Comparisons of Fingerprint Based Song Recognition Approaches Using Very High Granularity. Multimedia Tools and Applications 82, 20, 31591--31606. DOI: 10.1007/s11042-023-14787-2. Online publication date: 1-Aug-2023.
    • (2023) Pied Piper: Meta Search for Music. In Innovations in Computational Intelligence and Computer Vision, 713--721. DOI: 10.1007/978-981-99-2602-2_54. Online publication date: 13-Oct-2023.
    • (2022) Asymmetric Contrastive Learning for Audio Fingerprinting. IEEE Signal Processing Letters 29, 1873--1877. DOI: 10.1109/LSP.2022.3201430.