research-article

Local Self-attention-based Hybrid Multiple Instance Learning for Partial Spoof Speech Detection

Published: 09 October 2023

Abstract

The development of speech synthesis technology has drawn increasing attention to the threat of spoofed speech. Although various high-performance spoofing countermeasures have been proposed in recent years, one scenario has been overlooked: partially spoofed audio, in which an utterance contains both spoofed and bona fide segments. Research on partially spoofed speech detection is still lacking. Existing methods either train on partially spoofed speech with utterance-level labels, which causes gradient conflicts at the segment level, or train directly on segment-level data, which requires segment labels that are difficult to obtain in practice. In this study, to better detect partially spoofed speech when only utterance labels are available, we formulate partially spoofed speech detection as a multiple instance learning (MIL) problem. Typical MIL uses a pooling layer to fuse patch scores into a single score; we propose a hybrid MIL (H-MIL) framework based on max and log-sum-exp pooling, which learns better segment representations and improves partially spoofed speech detection performance. Theoretical and experimental verification shows that H-MIL effectively relieves the gradient conflict and gradient vanishing problems. In addition, we analyze the local correlations between segments and introduce a local self-attention mechanism to enhance segment features, which further improves detection performance.
In our experiments, we report detection results at both the segment and utterance levels, together with detailed visualization analysis, including the effect of spoof ratio and cross-dataset detection. The experimental results demonstrate that our method is effective at both the utterance and segment levels, especially against attacks with a low spoof ratio. The results confirm that our approach handles partially spoofed speech detection better than previous methods.
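
To make the building blocks named in the abstract concrete, the following is a minimal NumPy sketch of max pooling, log-sum-exp pooling, and a windowed (local) self-attention over segment features. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sharpness parameter `r`, the `window` size, and the use of unprojected dot-product attention are all hypothetical choices made here. The abstract does not say how H-MIL combines the two pooling methods, so no combination is attempted.

```python
import numpy as np


def max_pool(scores):
    """Utterance score taken as the highest segment score."""
    return float(np.max(scores))


def lse_pool(scores, r=1.0):
    """Log-sum-exp pooling: (1/r) * log(mean(exp(r * s))).

    `r` is an assumed sharpness parameter; the max is factored out
    so the exponentials cannot overflow.
    """
    z = r * np.asarray(scores, dtype=float)
    m = np.max(z)
    return float((m + np.log(np.mean(np.exp(z - m)))) / r)


def max_pool_grad(scores):
    """Gradient of max pooling w.r.t. the scores: one-hot at the argmax."""
    g = np.zeros(len(scores), dtype=float)
    g[int(np.argmax(scores))] = 1.0
    return g


def lse_pool_grad(scores, r=1.0):
    """Gradient of LSE pooling w.r.t. the scores: softmax(r * s)."""
    z = r * np.asarray(scores, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()


def local_self_attention(x, window):
    """Scaled dot-product self-attention limited to |i - j| <= window.

    x has shape (num_segments, feature_dim). No learned projections are
    used; this only illustrates the banded attention pattern.
    """
    n, d = x.shape
    logits = x @ x.T / np.sqrt(d)
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= window
    logits = np.where(band, logits, -np.inf)      # block distant segments
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    # Hypothetical segment scores for one utterance with one suspicious segment.
    scores = np.array([0.1, 0.2, 3.0, 0.3])
    print("max pooling:", max_pool(scores), "grad:", max_pool_grad(scores))
    print("LSE pooling:", lse_pool(scores), "grad:", lse_pool_grad(scores))

    feats = np.random.default_rng(0).normal(size=(5, 4))
    _, w = local_self_attention(feats, window=1)
    print("attention weights (banded):")
    print(np.round(w, 3))
```

In this sketch, max pooling passes gradient to only one segment, while log-sum-exp pooling spreads a softmax-weighted gradient over all segments. That difference is one way to read the abstract's remarks about gradient behavior at the segment level.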

Cited By

  • (2024) Coarse-to-Fine Proposal Refinement Framework for Audio Temporal Forgery Detection and Localization. Proceedings of the 32nd ACM International Conference on Multimedia, 7395-7403. DOI: 10.1145/3664647.3680585. Online publication date: 28-Oct-2024
  • (2024) SSLCT: A Convolutional Transformer for Synthetic Speech Localization. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 134-140. DOI: 10.1109/MIPR62202.2024.00028. Online publication date: 7-Aug-2024
  • (2024) Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10761-10765. DOI: 10.1109/ICASSP48485.2024.10447500. Online publication date: 14-Apr-2024

      Published In

      ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 5
      October 2023
      472 pages
      ISSN: 2157-6904
      EISSN: 2157-6912
      DOI: 10.1145/3615589
      Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 October 2023
      Online AM: 19 August 2023
      Accepted: 27 July 2023
      Revised: 16 May 2023
      Received: 28 April 2022
      Published in TIST Volume 14, Issue 5

      Author Tags

      1. Partial spoof
      2. multiple instance learning
      3. gradient
      4. hybrid
      5. local self-attention

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China

      Bibliometrics & Citations

      Article Metrics

      • Downloads (Last 12 months): 301
      • Downloads (Last 6 weeks): 32
      Reflects downloads up to 02 Mar 2025

      Citations

      Cited By

      • (2024) Mdrt: Multi-Domain Synthetic Speech Localization. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11171-11175. DOI: 10.1109/ICASSP48485.2024.10446471. Online publication date: 14-Apr-2024
      • (2024) FairSSD: Understanding Bias in Synthetic Speech Detectors. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4418-4428. DOI: 10.1109/CVPRW63382.2024.00445. Online publication date: 17-Jun-2024
