Abstract
Recent advances in digital audio source recognition, particularly in judicial forensics and intellectual property protection, have been driven largely by deep learning. As these methods evolve, they introduce new models and processing capabilities that are crucial to audio source recognition research. Despite this progress, the scarcity of high-quality labeled samples and the labor-intensive nature of data labeling remain substantial challenges. This paper addresses these challenges by examining the efficacy of self-attention, specifically through a novel neural network that integrates the Squeeze-and-Excitation (SE) self-attention mechanism for recording device identification. Our study demonstrates a relative improvement of approximately 1.5% on all four evaluation metrics over traditional convolutional neural networks and compares performance across two public datasets. We further investigate the adaptability of self-attention to different network architectures by embedding the SE mechanism in both residual and conventional convolutional frameworks; ablation studies and comparative analyses reveal that the impact of self-attention varies significantly with the underlying architecture. In addition, a transfer learning strategy allows us to transfer knowledge from a baseline network trained on abundant samples to a smaller dataset, successfully identifying 141 devices with performance gains of 4% to 7% across metrics. These findings validate the effectiveness of the SE self-attention mechanism in audio source recognition and illustrate the broader value of advanced learning strategies for overcoming data scarcity and enhancing model adaptability.
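For readers unfamiliar with the mechanism, the sketch below illustrates the two ingredients the abstract describes: a standard Squeeze-and-Excitation block (global average pooling that "squeezes" each channel, followed by a small bottleneck network that "excites" channels with learned gates) and a transfer-learning head swap that reuses a pretrained feature extractor for a new 141-way device classifier. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the reduction ratio of 16, the frozen-backbone recipe, and the make_device_classifier helper are illustrative choices.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation block: a global average pool
    summarizes each channel, and a two-layer bottleneck produces
    per-channel gates in (0, 1) that rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # excite: channel-wise rescaling

def make_device_classifier(backbone: nn.Module, feat_dim: int,
                           num_devices: int = 141) -> nn.Module:
    # Transfer-learning step: freeze features learned on the large
    # baseline corpus and train only a new classification head.
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.Sequential(backbone, nn.Flatten(),
                         nn.Linear(feat_dim, num_devices))

In such a sketch the SE block can be inserted after any convolutional or residual stage, which is how the architecture comparison the abstract mentions would plausibly be set up: the same channel gate is embedded in a plain CNN and in a ResNet-style block, and the ablation toggles it on and off.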
Data Availability
Data will be made available on reasonable request.
Acknowledgements
The research work of this paper was supported by the National Natural Science Foundation of China (Nos. 62177022, 61901165, 61501199), the Self-determined Research Funds of CCNU from the Colleges' Basic Research and Operation of MOE (No. CCNU24JC033), the Natural Science Foundation of Hubei Province (No. 2022CFA007), and the Wuhan Knowledge Innovation Project (No. 2022020801010258).
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
About this article
Cite this article
Zeng, C., Zhao, Y., Wang, Z. et al. Squeeze-and-Excitation Self-Attention Mechanism Enhanced Digital Audio Source Recognition Based on Transfer Learning. Circuits Syst Signal Process 44, 480–512 (2025). https://doi.org/10.1007/s00034-024-02850-8