Digital Signal Processing

Volume 81, October 2018, Pages 8-15

An adapted data selection for deep learning-based audio segmentation in multi-genre broadcast channel

https://doi.org/10.1016/j.dsp.2018.03.004

Abstract

Broadcast audio transcription is still a challenging problem because of the complexity of diverse speech and audio signals. Audio segmentation, an essential module in a broadcast audio transcription system, has benefited greatly from the development of deep learning. However, the need for large amounts of labeled training data becomes a bottleneck for deep learning-based audio segmentation methods. To tackle this problem, an adapted segmentation method is proposed to select speech/non-speech segments with high confidence from unlabeled training data as complements to the labeled training data. The new method relies on GMM-based speech/non-speech models trained on an utterance-by-utterance basis. Long-term information is used to choose reliable training data for the speech/non-speech models from the utterances at hand. Experimental results show that this data selection method is a powerful audio segmentation algorithm in its own right. We also observed that deep neural networks trained on data selected by this method are superior to those trained on data chosen by two comparison methods. Moreover, better performance can be obtained by combining the deep learning-based audio segmentation method with the adapted data selection method.

Introduction

Automatic transcription and retrieval of broadcast audio [1], [2] has become one of the most attractive applications in the fields of audio signal processing and recognition. However, processing general broadcast audio is still a challenging task because of the variety of data content, channels, and environments. Currently, many evaluations, such as the multi-genre broadcast (MGB) challenge [3] and the Albayzin evaluation [4], focus on audio data processing or speech recognition for broadcast channels and have attracted wide attention. The content of broadcast audio is quite rich, including speech, music, and different types of noise or sound effects. Moreover, the speech data are very complex because of varied speaking styles, different accents, mixed dialects, and different types of background music or noise. Hence, automatic audio segmentation is a necessary front-end procedure for broadcast audio processing.

The purpose of audio segmentation is to split an audio recording into segments of homogeneous content. Depending on the application, the term ‘homogeneous’ can be defined in terms of speaker, channel, or audio type. Generally, the first stage of audio segmentation is speech/non-speech detection, which locates the regions containing speech and is also referred to as voice activity detection (VAD). A further step of speaker segmentation/clustering may then partition the speech regions into speaker-homogeneous segments. In this paper, we focus on voice activity detection.

Voice activity detection is an indispensable module for most speech-related applications and has a great influence on system performance. With the development of deep learning, many deep neural network (DNN)-based VAD methods [5], [6], [7], [8] have been proposed. Owing to their success in modeling the long-term dependencies of input signals, recurrent neural networks (RNN) [9] and long short-term memory (LSTM) [10] recurrent neural networks have also been adopted. The convolutional neural network (CNN), known as the time-delay neural network (TDNN) [11] in speech research, is also widely used because it learns spatio-temporal connectivity while reducing the number of free parameters.

Compared with traditional VAD algorithms, deep learning-based VAD achieves much higher classification accuracy, which benefits not only from the non-linear discriminative nature of the models but also from the enormous amounts of precisely labeled training data (at least hundreds of hours of audio). However, it is still difficult to collect such a large amount of audio data, let alone label it exactly, and this problem partly restricts the applicability of deep learning-based VAD.

In the 2015 MGB challenge [12], several data selection methods were proposed, for example data selection based on lightly supervised alignments [13] or on phone-level forced alignments [14], [15]. These methods need a pre-trained automatic speech recognizer (ASR), which is difficult to train in many situations. It is also difficult to ensure the reliability of the selected data, since label accuracy largely depends on the accuracy of the ASR outputs.

In this paper, we propose an adapted training data selection method for the multi-genre broadcast channel. Without requiring any alignments, each audio file in the unlabeled dataset is labeled using an audio segmentation method. The main steps are as follows: first, the long-term Mel spectral divergence (LTMD) [16] of each frame is used to classify frames into the speech or non-speech class, and the frames with the highest confidence are selected; then, speech/non-speech models are trained on features extracted from the selected frames; finally, all frames in the same audio are classified by the speech/non-speech models, and the speech segments are fine-tuned by threshold detection based on the long-term pitch divergence (LTPD) [17]. After this processing, reliable segments are chosen for DNN training according to several selection strategies. Experimental results show that this data selection method is itself a powerful audio segmentation algorithm, and the DNN models trained on the data it selects are more discriminative than those trained with two comparison methods using lightly supervised alignments or phone-level forced alignments. Moreover, the performance can be improved further by fusing the outputs of the adapted data selection method with those of the DNN models.
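To make these steps concrete, the fragment below is a minimal Python sketch of the per-utterance pipeline, not the authors' implementation: the `ltmd` and `ltpd` inputs stand in for the LTMD [16] and LTPD [17] measures (their computation is not reproduced here), and the selection ratio, number of GMM components, thresholds, and the way the LTPD fine-tuning demotes frames are assumptions made for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def adapted_segmentation(features, ltmd, ltpd,
                         select_ratio=0.2, ltpd_threshold=0.5):
    """Label one utterance as speech/non-speech without external alignments.

    features : (n_frames, dim) acoustic feature vectors of the utterance
    ltmd     : (n_frames,) long-term Mel spectral divergence per frame [16]
    ltpd     : (n_frames,) long-term pitch divergence per frame [17]
    The selection ratio, GMM sizes, and thresholds are illustrative only.
    """
    n_frames = len(ltmd)
    k = max(1, int(select_ratio * n_frames))

    # Step 1: treat the frames with the highest LTMD as confident speech and
    # those with the lowest LTMD as confident non-speech (assumed polarity).
    order = np.argsort(ltmd)
    nonspeech_idx, speech_idx = order[:k], order[-k:]

    # Step 2: train per-utterance GMM speech/non-speech models
    # on the selected high-confidence frames only.
    gmm_speech = GaussianMixture(n_components=4).fit(features[speech_idx])
    gmm_nonspeech = GaussianMixture(n_components=4).fit(features[nonspeech_idx])

    # Step 3: classify every frame of the same utterance with the two GMMs.
    speech_llh = gmm_speech.score_samples(features)
    nonspeech_llh = gmm_nonspeech.score_samples(features)
    labels = (speech_llh > nonspeech_llh).astype(int)  # 1 = speech

    # Step 4: fine-tune the speech decisions with an LTPD threshold, here by
    # demoting speech frames whose pitch evidence is too weak (one possible
    # reading of the fine-tuning step).
    labels[(labels == 1) & (ltpd < ltpd_threshold)] = 0
    return labels
```

Because the GMMs are re-estimated for every utterance, the speech/non-speech models adapt to the channel and background conditions of that particular recording, which is the property the method relies on.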

The outline of this paper is as follows. Section 2 describes the DNN-based VAD procedure. A detailed description of the data selection methods for DNN training is given in Section 3. Section 4 presents the experimental data, setup, and results. Conclusions are drawn in Section 5.

Section snippets

VAD based on deep learning

Thanks to their high classification accuracy, DNN models have been widely used in pattern recognition tasks, including VAD. A DNN can be seen as a non-linear classifier that is able to learn complex patterns with its deep structure. As shown in Fig. 1, the procedure of a DNN-based voice activity detector is as follows. First, a set of feature vectors is extracted from the audio data. These feature vectors are then fed into a pre-trained DNN model and transformed into speech
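A rough sketch of this frame-wise procedure is given below. It applies a pre-trained classifier to a feature matrix and converts smoothed speech posteriors into segments; the `dnn` object with a `predict_proba` interface, the smoothing window, and the decision threshold are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def dnn_vad(features, dnn, threshold=0.5, smooth_win=11):
    """Frame-level VAD with a pre-trained DNN (illustrative sketch).

    features : (n_frames, dim) feature vectors extracted from the audio
    dnn      : any object with predict_proba(features) -> (n_frames, 2)
               posteriors over {non-speech, speech}; assumed interface
    Returns a list of (start_frame, end_frame) speech segments.
    """
    post = dnn.predict_proba(features)[:, 1]  # speech posterior per frame

    # Median-smooth the posterior track to suppress isolated label flips.
    pad = smooth_win // 2
    padded = np.pad(post, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + smooth_win])
                         for i in range(len(post))])

    # Threshold and merge consecutive speech frames into segments.
    is_speech = smoothed > threshold
    segments, start = [], None
    for i, s in enumerate(is_speech):
        if s and start is None:
            start = i
        elif not s and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(is_speech)))
    return segments
```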

Data selection for DNN training

One of the essential factors that enable DNN models to achieve much better performance is the large amount of labeled training data. However, manual labeling is very costly, making labeled training data a bottleneck for DNN-based methods. To tackle this problem, some research focuses on how to enable DNN models to learn automatically from unlabeled data, for example transfer learning [22], [23] or, more recently, the dual-learning mechanism [24] in machine translation. However, for VAD related
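As one illustration of what selecting reliable pseudo-labeled data can look like in practice, the short sketch below keeps only segments that are long enough and confidently classified. The duration and confidence thresholds are assumed values for illustration, not the selection strategies reported in the paper.

```python
def select_reliable_segments(segments, frame_confidence,
                             min_frames=50, min_conf=0.9):
    """Keep pseudo-labeled segments that look reliable enough for DNN training.

    segments         : list of (start_frame, end_frame, label) triples produced
                       by the adapted segmentation of one audio file
    frame_confidence : sequence of per-frame confidence scores for that file
    min_frames, min_conf : assumed duration/confidence thresholds (illustrative)
    """
    selected = []
    for start, end, label in segments:
        if end - start < min_frames:
            continue  # too short to be a trustworthy training example
        mean_conf = sum(frame_confidence[start:end]) / (end - start)
        if mean_conf < min_conf:
            continue  # the automatic label is not confident enough
        selected.append((start, end, label))
    return selected
```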

Experimental data

The 2016 MGB challenge evaluates state-of-the-art transcription systems for Arabic TV programs. Three datasets, termed training, development, and evaluation, are released for model training, parameter setting, and performance evaluation, respectively. Since the labels of the evaluation dataset are not available, we use only the training and development datasets in our experiments.

The dataset used for DNN training was selected from the training set of the 2016 MGB challenge, which contains audios

Conclusions

In this paper, we proposed an adapted data selection method for training deep learning-based audio segmentation systems. As an audio segmentation algorithm, the adapted data selection method achieves good performance, but it suffers from the shortcoming that the audio it segments must contain both speech and non-speech data, and neither type should account for less than 10% of the whole recording. Thus, DNN models were trained with data selected from large amounts of

References (36)

• X. Yang et al., Voice activity detection algorithm based on long-term pitch information, EURASIP J. Audio Speech Music Process. (2016)
• P. Lopez-Otero et al., Ensemble audio segmentation for radio and television programmes, Multimed. Tools Appl. (2017)
• S. Khurana et al., QCRI advanced transcription system (QATS) for the Arabic multi-dialect broadcast media recognition: MGB-2 challenge
• A. Ali et al., The MGB-2 challenge: Arabic multi-dialect broadcast media recognition
• A. Ortega et al., The Albayzin 2016 speaker diarization evaluation
• N. Ryant et al., Speech activity detection on YouTube using deep neural networks
• X.-L. Zhang et al., Deep belief networks based voice activity detection, IEEE Trans. Audio Speech Lang. Process. (2013)
• X.-L. Zhang et al., Boosting contextual information for deep neural network based voice activity detection, IEEE/ACM Trans. Audio Speech Lang. Process. (2016)
• Z. Koldovský et al., CHiME4: multichannel enhancement using beamforming driven by DNN-based voice activity detection
• T. Hughes et al., Recurrent neural networks for voice activity detection
• F. Eyben et al., Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies
• S. Thomas et al., Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions
• P. Bell et al., The MGB challenge: evaluating multi-genre broadcast media recognition
• Quoc Do Truong et al., The NAIST ASR system for the 2015 multi-genre broadcast challenge: on combination of deep learning systems using a rank-score function
• O. Saz et al., The 2015 Sheffield system for transcription of multi-genre broadcast media
• P.C. Woodland et al., Cambridge university transcription systems for the multi-genre broadcast challenge
• X. Yang et al., The NDSC transcription system for the 2016 multi-genre broadcast challenge
• X. Glorot et al., Understanding the difficulty of training deep feed-forward neural networks, J. Mach. Learn. Res. (2010)

Xu-Kui Yang was born in Fujian, China, in 1988. He received the B.S. and M.S. degrees in information and communication from the Zhengzhou Information Science and Technology Institute, Zhengzhou, China, in 2011 and 2014, respectively. He is currently working towards the Ph.D. degree in speech recognition at the National Digital Switching System Engineering and Technological R&D Center.

His research interests are in speech signal processing, continuous speech recognition, and machine learning.

Dan Qu was born in Jilin, China, in 1974. She received the B.S., M.S., and Ph.D. degrees in information and communication engineering from the Zhengzhou Information Science and Technology Institute, Zhengzhou, China, in 2004, 2007, and 2013, respectively. From 2016 to 2017, she was a visiting scholar at the Computer Science Institute of Carnegie Mellon University.

She is an Associate Professor at the National Digital Switching System Engineering and Technological R&D Center. Her research interests are in speech signal processing, pattern recognition and machine learning, and natural language processing.

Wen-Lin Zhang was born in Hubei, China, in 1982. He received the B.S., M.S., and Ph.D. degrees in information and communication engineering from the Zhengzhou Information Science and Technology Institute, Zhengzhou, China, in 2004, 2007, and 2013, respectively.

He is an Assistant Professor at the National Digital Switching System Engineering and Technological R&D Center. His research interests are in speech signal processing, speech recognition, and machine learning.

Wei-Qiang Zhang was born in Hebei, China, in 1979. He received the B.S. degree in applied physics from the University of Petroleum, Shandong, in 2002, the M.S. degree in communication and information systems from the Beijing Institute of Technology, Beijing, in 2005, and the Ph.D. degree in information and communication engineering from Tsinghua University, Beijing, in 2009. From 2016 to 2017, he was a visiting scholar at the Center for Computer Research in Music and Acoustics (CCRMA), Stanford University.

He is an Associate Professor at the Department of Electronic Engineering, Tsinghua University, Beijing. His research interests are in radar signal processing, acoustic signal processing, speech signal processing, machine learning, and statistical pattern recognition.

This work was supported in part by the National Natural Science Foundation of China under Grants 61673395 and 61403415, and by the Henan Province Natural Science Foundation under Grant 162300410331. The associate editor coordinating the review of this manuscript and approving it for publication was xxxx.
