Statistical voice activity detection based on sparse representation over learned dictionary

https://doi.org/10.1016/j.dsp.2013.03.005

Abstract

In this paper, we present a novel approach to voice activity detection (VAD) based on the sparse representation of noisy input speech over a learned dictionary. First, we investigate the relationship between signal detection and sparse representation within a Bayesian framework. Second, we derive a decision rule and an adaptive threshold from a likelihood ratio test by modeling the non-zero elements of the sparse representation as Gaussian distributed. Experimental results show that the proposed approach outperforms current statistical model-based methods, such as those based on Gaussian, Laplacian, and Gamma models, under white, babble, and vehicle noise conditions.
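The pipeline outlined in the abstract (sparse coding of a frame over a dictionary, then a Gaussian likelihood ratio test on the non-zero coefficients) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the decision rule or the threshold formula, so everything below is an assumption made for illustration. A random unit-norm dictionary stands in for a learned one, orthogonal matching pursuit is used as the sparse coder, the variances `var_noise` and `var_speech` are chosen by hand, and the threshold is calibrated from noise-only frames instead of being derived analytically.

```python
import numpy as np


def omp(D, x, k):
    """Orthogonal matching pursuit: approximate x with at most k atoms of D."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    sol = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef


def log_likelihood_ratio(nonzero, var_noise, var_speech):
    """Mean log-likelihood ratio of zero-mean Gaussians.

    H0 (noise only): variance var_noise.
    H1 (speech + noise): variance var_noise + var_speech.
    """
    if nonzero.size == 0:
        return float("-inf")  # assumption: no active atoms means no speech evidence
    v0 = var_noise
    v1 = var_noise + var_speech
    per_coef = 0.5 * np.log(v0 / v1) + 0.5 * nonzero**2 * (1.0 / v0 - 1.0 / v1)
    return float(np.mean(per_coef))


def frame_llr(D, x, k, var_noise, var_speech):
    """Sparse-code one frame and score its non-zero coefficients."""
    coef = omp(D, x, k)
    return log_likelihood_ratio(coef[coef != 0], var_noise, var_speech)


# Hypothetical setup: all sizes and variances are illustrative values.
rng = np.random.default_rng(0)
n_dim, n_atoms, k, sigma = 64, 128, 5, 0.1
D = rng.standard_normal((n_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms (stand-in for a learned dictionary)


def noise_frame():
    return sigma * rng.standard_normal(n_dim)


def speech_frame():
    # Synthetic "speech": three dictionary atoms plus background noise.
    atoms = rng.choice(n_atoms, size=3, replace=False)
    return 2.0 * D[:, atoms].sum(axis=1) + noise_frame()


# Calibrate the threshold on noise-only frames (assumption, not the paper's rule).
calib = np.array([frame_llr(D, noise_frame(), k, sigma**2, 1.0) for _ in range(200)])
threshold = calib.mean() + 5.0 * calib.std()

llr_speech = frame_llr(D, speech_frame(), k, sigma**2, 1.0)
llr_noise = frame_llr(D, noise_frame(), k, sigma**2, 1.0)
print("speech frame detected:", llr_speech > threshold)
print("noise frame detected:", llr_noise > threshold)
```

The threshold is calibrated empirically here because a greedy sparse coder always selects the largest correlations, so even noise-only frames yield coefficients that look larger than the nominal noise variance would suggest; a fixed threshold of zero would therefore be biased toward detecting speech.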

Cited by (15)

    • Audiovisual speaker indexing for Web-TV automations

      2021, Expert Systems with Applications
      Citation Excerpt:

      Varela, San-Segundo, and Hernández (2011) propose the utilization of pulse-based measurements in a Decision Tree which makes use of the predictions of a baseline VAD, an HMM segmentation module, and a pulse detection module. In the article of (Deng, & Han, 2013), a decision-based statistical approach is evaluated, by creating a sparse representation of audio signals over a learned dictionary, which is proved to outperform Gaussian, Laplacian, and Gamma representations. Mak and Yu, (2014) investigate the effect of pre-processing and speech enhancement in the improvement of the robustness of a statistic-based VAD, directed to NIST Speaker Recognition Evaluation tasks.

    • Optimization of learned dictionary for sparse coding in speech processing

      2016, Neurocomputing
      Citation Excerpt:

      Up to now, there is no report about CD-DNN-HMM speech recognition combined with sparse coding. Moreover, sparse coding has been used in voice activity detection [13,14], which can be treated as outlier signal detection [15,16] because it is a binary classification. Typical work for outlier detection can be found in [16], where an integrated incremental self-organizing map and hierarchical neural network approach is proposed.

    • A speech enhancement method based on sparse reconstruction of power spectral density

      2014, Computers and Electrical Engineering
      Citation Excerpt:

      Other applications have been found in cognitive radios [7], direction-of-arrival estimation [8] and so forth. The applications of sparse representation in audio signal processing mainly focus on voice activity detection (VAD) [9], pitch estimation [10], speaker identification [11] and speech recognition [5]. Currently, speech enhancement methods based on sparse representation have been discussed in some references.

    • A Cross Dataset Approach for Noisy Speech Identification

      2023, Lecture Notes in Electrical Engineering
    • AUC optimization for deep learning-based voice activity detection

      2022, Eurasip Journal on Audio, Speech, and Music Processing
    • Real time implementation of voice activity detection based on false acceptance regulation

      2020, International Journal on Electrical Engineering and Informatics
Shi-Wen Deng received the B.E. degree from the Institute of Technology, Jia Mu Si University, JiaMuSi, China, in 1997, the M.E. degree from the School of Computer Science, Harbin Normal University, Harbin, China, in 2005, and the Ph.D. degree from the School of Computer Science, Harbin Institute of Technology, in 2012. Currently, he is with the School of Mathematical Sciences, Harbin Normal University, Harbin, China. His research interests are in the area of speech and audio signal processing, including content-based audio analysis, noise suppression, and speech/audio classification and detection.

Ji-Qing Han received the B.S. and M.S. degrees in electrical engineering and the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1987, 1990, and 1998, respectively. Currently, he is the associate dean of the School of Computer Science and Technology, Harbin Institute of Technology. He is a member of IEEE, a member of the editorial board of the Journal of Chinese Information Processing, and a member of the editorial board of the Journal of Data Acquisition & Processing. Prof. Han is undertaking several projects funded by the National Natural Science Foundation, the 863 Hi-tech Program, and the National Basic Research Program. He has won three Second Prize and two Third Prize awards of Science and Technology of Ministry/Province. He has published more than 100 papers and 2 books. His research fields include speech signal processing and audio information processing.