
Identification of Reconstructed Speech

Published: 17 January 2017

Abstract

Both voice conversion and hidden Markov model (HMM)-based speech synthesis can produce artificial voices that mimic a target speaker, and both have been shown to pose serious threats to speaker verification (SV) systems. To enhance the security of SV systems, techniques for detecting converted/synthesized speech must therefore be considered. In both voice conversion and HMM-based synthesis, speech reconstruction transforms a set of acoustic parameters into a reconstructed waveform; identifying reconstructed speech can thus distinguish converted/synthesized speech from human speech. Several works on such identification have been reported, achieving equal error rates (EERs) below 5% in detecting reconstructed speech. However, in cross-database evaluations on different speech databases, we find that the EERs of several test cases exceed 10%, so the robustness of detection algorithms across speech databases needs to be improved. In this article, we propose an algorithm to identify reconstructed speech. Three speech databases and two reconstruction methods are considered in our work, a setting not addressed in previous reports. A high-dimensional data visualization approach is used to analyze the effect of speech reconstruction on the Mel-frequency cepstral coefficients (MFCCs) of speech signals, and Gaussian mixture model (GMM) supervectors of MFCCs are used as acoustic features. A set of commonly used classification algorithms is then applied to identify reconstructed speech; based on a comparison among these methods, linear discriminant analysis (LDA)-ensemble classifiers are chosen for our algorithm. Extensive experimental results show that the proposed algorithm achieves EERs below 1% in most cases, outperforming the reported state-of-the-art identification techniques.
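As a concrete illustration, the sketch below mirrors the pipeline the abstract describes: MFCC frames are pooled to train a universal background model (UBM), each utterance is represented by the MAP-adapted mean supervector of that GMM, and an ensemble of LDA base learners trained on random feature subspaces separates human from reconstructed speech. This is a minimal sketch in Python with scikit-learn, not the authors' implementation; the MFCC order, UBM size, relevance factor, and ensemble settings are illustrative assumptions, and random frame matrices stand in for real MFCC features.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
D = 13   # MFCC order (illustrative placeholder)
K = 64   # number of UBM components (illustrative placeholder)

def map_adapt_supervector(ubm, frames, relevance=16.0):
    # MAP-adapt the UBM means to one utterance (means only, as is standard
    # for GMM-supervector front ends) and stack them into one long vector.
    post = ubm.predict_proba(frames)            # (T, K) frame responsibilities
    n_k = post.sum(axis=0)                      # soft frame counts per component
    f_k = post.T @ frames                       # (K, D) first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]  # data/prior interpolation weights
    mu = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return mu.ravel()                           # supervector of length K * D

# Stand-in data: random frame matrices in place of real MFCC features. In
# practice each "utterance" would be the (n_frames, D) MFCC matrix of one
# speech file, with label 0 for human speech and 1 for reconstructed speech.
utterances = [rng.normal(loc=0.5 * lab, size=(200, D))
              for lab in (0, 1) for _ in range(50)]
labels = np.repeat([0, 1], 50)

# Step 1: train a universal background model (UBM) on pooled frames.
ubm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
ubm.fit(np.vstack(utterances))

# Step 2: represent each utterance by its GMM mean supervector.
X = np.vstack([map_adapt_supervector(ubm, u) for u in utterances])

# Step 3: LDA-ensemble classifier -- many LDA base learners, each trained
# on a random subspace of the supervector, combined by voting.
clf = BaggingClassifier(LinearDiscriminantAnalysis(),
                        n_estimators=50, max_features=0.25, bootstrap=False)
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))

In a real evaluation, training and test utterances would be drawn from different speech databases to measure the cross-database robustness the abstract emphasizes, with EERs computed from the ensemble's decision scores.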




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 1
    February 2017, 278 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3012406
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 January 2017
    Accepted: 01 September 2016
    Revised: 01 July 2016
    Received: 01 April 2016
    Published in TOMM Volume 13, Issue 1


    Author Tags

    1. Audio forensics
    2. GMM supervectors
    3. LDA-ensemble classification
    4. MFCC
    5. identification
    6. reconstructed speech
    7. speaker verification

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Characteristic innovation project of Guangdong Province Ordinary University
    • Shenzhen R&D Program
    • National Natural Science Foundation of China


    Cited By

    • (2025) Perceptual visual security index: Analyzing image content leakage for vision language models. Journal of Information Security and Applications, 89, 103988. DOI: 10.1016/j.jisa.2025.103988. Online publication date: Mar-2025.
    • (2023) Ensemble deep learning in speech signal tasks: A review. Neurocomputing, 550, 126436. DOI: 10.1016/j.neucom.2023.126436. Online publication date: Sep-2023.
    • (2022) CAQoE: A Novel No-Reference Context-aware Speech Quality Prediction Metric. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(1s), 1-23. DOI: 10.1145/3529394. Online publication date: 13-Apr-2022.
    • (2022) End-to-End Spoofing Speech Detection based on CNN-LSTM. In Proceedings of the 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC), 755-758. DOI: 10.1109/ICFTIC57696.2022.10075096. Online publication date: 2-Dec-2022.
    • (2022) HTK-based speech recognition and corpus-based English vocabulary online guiding system. International Journal of Speech Technology, 25(4), 921-931. DOI: 10.1007/s10772-022-09968-7. Online publication date: 1-Dec-2022.
    • (2020) Identification of VoIP Speech With Multiple Domain Deep Features. IEEE Transactions on Information Forensics and Security, 15, 2253-2267. DOI: 10.1109/TIFS.2019.2960635. Online publication date: 2020.
