Abstract
Recent advances in digital audio source recognition, particularly in judicial forensics and intellectual property protection, have been driven largely by deep learning. As these methods evolve, they introduce new models and processing capabilities that are crucial to audio source recognition research. Despite this progress, the scarcity of high-quality labeled samples and the labor-intensive nature of data labeling remain substantial challenges. This paper addresses these challenges by examining the efficacy of self-attention, specifically through a novel neural network that integrates the Squeeze-and-Excitation (SE) self-attention mechanism for recording device identification. Our study demonstrates a relative improvement of approximately 1.5% on all four evaluation metrics over traditional convolutional neural networks and compares performance across two public datasets. We further investigate the adaptability of self-attention to different network architectures by embedding the SE mechanism in both residual and conventional convolutional frameworks; ablation studies and comparative analyses reveal that the impact of self-attention varies significantly with the underlying architecture. In addition, a transfer learning strategy allows us to transfer knowledge from a baseline network trained on abundant samples to a smaller dataset, successfully identifying 141 devices with performance gains of 4% to 7% across metrics. These findings validate the effectiveness of the SE self-attention mechanism in audio source recognition and illustrate the broader value of advanced learning strategies for overcoming data scarcity and enhancing model adaptability.
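For readers unfamiliar with the mechanism, the sketch below illustrates the two ingredients the abstract describes: a standard Squeeze-and-Excitation block (global average pooling that "squeezes" each channel, followed by a small bottleneck network that "excites" channels with learned gates) and a transfer-learning head swap that reuses a pretrained feature extractor for a new 141-way device classifier. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the reduction ratio of 16, the frozen-backbone recipe, and the make_device_classifier helper are illustrative choices.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation block: a global average pool
    summarizes each channel, and a two-layer bottleneck produces
    per-channel gates in (0, 1) that rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # excite: channel-wise rescaling

def make_device_classifier(backbone: nn.Module, feat_dim: int,
                           num_devices: int = 141) -> nn.Module:
    # Transfer-learning step: freeze features learned on the large
    # baseline corpus and train only a new classification head.
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.Sequential(backbone, nn.Flatten(),
                         nn.Linear(feat_dim, num_devices))

In such a sketch the SE block can be inserted after any convolutional or residual stage, which is how the architecture comparison the abstract mentions would plausibly be set up: the same channel gate is embedded in a plain CNN and in a ResNet-style block, and the ablation toggles it on and off.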
Data Availability
Data will be made available on reasonable request.
Acknowledgements
The research work of this paper was supported by the National Natural Science Foundation of China (Nos. 62177022, 61901165, 61501199), the Self-determined Research Funds of CCNU from the Colleges' Basic Research and Operation of MOE (No. CCNU24JC033), the Natural Science Foundation of Hubei Province (No. 2022CFA007), and the Wuhan Knowledge Innovation Project (No. 2022020801010258).
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
About this article
Cite this article
Zeng, C., Zhao, Y., Wang, Z. et al. Squeeze-and-Excitation Self-Attention Mechanism Enhanced Digital Audio Source Recognition Based on Transfer Learning. Circuits Syst Signal Process 44, 480–512 (2025). https://doi.org/10.1007/s00034-024-02850-8