An adapted data selection for deep learning-based audio segmentation in multi-genre broadcast channel☆
Introduction
Automatic transcription and retrieval for broadcast channels [1], [2] have become one of the most attractive applications in the fields of audio signal processing and recognition. However, processing general broadcast audio is still a challenging task because of the wide variation in data content, channel, and environment. Currently, many evaluations, such as the multi-genre broadcast (MGB) challenge [3] and the Albayzin evaluation [4], focus on audio data processing or speech recognition over broadcast channels and have attracted wide attention. The content of broadcast audio is quite rich, including speech, music, and different types of noise or sound effects. Moreover, the speech data are very complex, with various speaking styles, different accents, mixed dialects, and different types of background music or noise. Hence, automatic audio segmentation is a necessary front-end procedure for broadcast audio processing.
The purpose of audio segmentation is to split an audio recording into segments of homogeneous content. Depending on the application, the term ‘homogeneous’ can be defined in terms of speaker, channel, or audio type. Generally, the first stage of audio segmentation is speech/non-speech detection to locate regions containing speech signals, which is also referred to as voice activity detection (VAD). A further step of speaker segmentation/clustering may then partition the speech regions into speaker-homogeneous segments. In this paper, we focus on voice activity detection.
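To make the segmentation output concrete: once each frame has been assigned a speech/non-speech label, runs of consecutive frames with the same label can be merged into timed segments. The following is a minimal sketch of that merging step; the function name and the frame-shift convention are illustrative, not taken from the paper:

```python
def frames_to_segments(labels, frame_shift=0.01):
    """Merge consecutive frame-level labels into (start, end, label) segments.

    `labels` is a per-frame list (e.g. 1 = speech, 0 = non-speech);
    `frame_shift` converts frame indices to time units.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment when the label changes or input ends.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * frame_shift, i * frame_shift, labels[start]))
            start = i
    return segments
```

For example, six frames labeled `[0, 0, 1, 1, 1, 0]` collapse into three segments: a non-speech segment, a speech segment, and a trailing non-speech segment.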
Voice activity detection is an indispensable module for most speech-related applications and has a great influence on overall system performance. With the development of deep learning, many deep neural network (DNN)-based VAD methods [5], [6], [7], [8] have been proposed. Owing to their success in modeling long-term dependencies in the input signal, recurrent neural networks (RNNs) [9] and long short-term memory (LSTM) [10] recurrent neural networks have also been adopted. The convolutional neural network (CNN), known as the time-delay neural network (TDNN) [11] in speech research, is also widely used for its ability to learn spatio-temporal structure while reducing the number of free parameters.
Compared with traditional VAD algorithms, deep learning-based VAD achieves much higher classification accuracy, which benefits not only from the non-linear discriminative power of the models but also from the enormous amount of precisely labeled training data (at least hundreds of hours of audio). However, collecting such a large amount of audio data is still difficult, let alone labeling it accurately, and this problem partly restricts the applicability of deep learning-based VAD.
In the 2015 MGB challenge task [12], several data selection methods were proposed, for example data selection based on lightly supervised alignments [13] and on phone-level forced alignments [14], [15]. These methods need a pre-trained automatic speech recognizer (ASR), which is difficult to train in many situations. It is also difficult to ensure the reliability of the selected data, since the accuracy of the labels largely depends on the accuracy of the ASR outputs.
In this paper, we propose an adapted training data selection method for the multi-genre broadcast channel. Without requiring any alignments, each audio file in the unlabeled dataset is labeled using an audio segmentation method. The main steps are as follows: firstly, the long-term Mel spectral divergence (LTMD) [16] of each frame is used to classify frames into the speech or non-speech class, and the frames with the highest confidence are selected; then speech/non-speech models are trained on features extracted from the selected frames; finally, all frames in the same audio file are classified by the speech/non-speech models, and the speech segments are fine-tuned by threshold detection based on the long-term pitch divergence (LTPD) [17]. After this processing, reliable segments are chosen for DNN training according to several selection strategies. Experimental results show that this data selection method is itself a powerful audio segmentation algorithm, and that DNN models trained on the data it selects are more discriminative than those trained with two competing methods based on lightly supervised alignments or phone-level forced alignments. Moreover, performance can be improved further by fusing the outputs of the adapted data selection method with those of the DNN models.
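The seed-and-retrain loop above can be sketched in a deliberately simplified form. In this sketch a single scalar divergence score per frame stands in for the LTMD feature, and each class is modeled by a one-dimensional Gaussian rather than the models used in the paper; all function names and the confidence fraction are illustrative assumptions:

```python
import math
import statistics

def select_seed_frames(scores, frac=0.1):
    """Pick the most confident frames: the top `frac` by divergence score
    as speech seeds, the bottom `frac` as non-speech seeds."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, int(frac * len(scores)))
    return order[-k:], order[:k]  # (speech seeds, non-speech seeds)

def fit_gaussian(values, sigma_floor=0.05):
    """Toy per-class model: a 1-D Gaussian with a floored std deviation."""
    return statistics.fmean(values), max(statistics.pstdev(values), sigma_floor)

def log_likelihood(x, model):
    mu, sigma = model
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def label_frames(scores, frac=0.1):
    """Train on the seed frames, then classify every frame in the file."""
    sp_idx, ns_idx = select_seed_frames(scores, frac)
    sp_model = fit_gaussian([scores[i] for i in sp_idx])
    ns_model = fit_gaussian([scores[i] for i in ns_idx])
    return [1 if log_likelihood(x, sp_model) > log_likelihood(x, ns_model) else 0
            for x in scores]
```

The key property the sketch preserves is that the models are trained per audio file on its own most confident frames, so no external alignments or pre-trained ASR are needed; the LTPD-based fine-tuning of segment boundaries is omitted here.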
The outline of this paper is as follows. Section 2 describes the DNN-based VAD procedure. A detailed description of the data selection methods for DNN training is given in Section 3. Section 4 presents the experimental data, setup, and results. Conclusions are drawn in Section 5.
VAD based on deep learning
With their high accuracy in classification tasks, DNN models have been widely used across pattern recognition fields, VAD included. A DNN can be seen as a type of non-linear classifier that is able to learn complex patterns through its deep structure. As shown in Fig. 1, the procedure of a DNN-based voice activity detector is as follows. Firstly, a set of feature vectors is extracted from the audio data. These feature vectors are then fed into a pre-trained DNN model and transformed into speech/non-speech posterior probabilities.
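Frame posteriors from such a model are usually thresholded and then smoothed before segment boundaries are derived. Below is a hedged sketch of one common post-processing step, removing speech runs shorter than a minimum duration; the threshold, minimum run length, and function name are illustrative choices, not values from the paper:

```python
def smooth_decisions(posteriors, threshold=0.5, min_run=3):
    """Threshold per-frame speech posteriors, then zero out speech runs
    shorter than `min_run` frames (a simple duration-based smoothing)."""
    raw = [1 if p >= threshold else 0 for p in posteriors]
    out = raw[:]
    i = 0
    while i < len(raw):
        if raw[i] == 1:
            j = i
            while j < len(raw) and raw[j] == 1:
                j += 1          # find the end of this speech run
            if j - i < min_run:
                for k in range(i, j):
                    out[k] = 0  # run too short: relabel as non-speech
            i = j
        else:
            i += 1
    return out
```

A symmetric pass can likewise fill short non-speech gaps; either way, the smoothing trades a small amount of temporal precision for far fewer spurious segment boundaries.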
Data selection for DNN training
One of the essential factors behind the strong performance of DNN models is the large amount of labeled training data. Manual labeling is very costly, making labeled training data a bottleneck for DNN-based methods. To tackle this problem, some research focuses on enabling DNN models to learn automatically from unlabeled data, for example transfer learning [22], [23] or, more recently, the dual-learning mechanism [24] in machine translation. However, for VAD-related
Experimental data
The 2016 MGB challenge is a task for state-of-the-art transcription systems for Arabic TV programs. Three datasets, termed training, development, and evaluation, are released for model training, parameter setting, and performance evaluation, respectively. Since the labels of the evaluation dataset are not available, we use only the training and development datasets in our experiments.
The dataset used for DNN training was selected from the training set of the 2016 MGB challenge, which contains audios
Conclusions
In this paper, we proposed an adapted data selection method for training deep learning-based audio segmentation systems. As an audio segmentation algorithm, the adapted data selection method obtains good performance, but it has one shortcoming: each audio file it segments must contain both speech and non-speech data, and neither type should account for less than 10% of the whole file. Thus, DNN models were trained with data selected from large amounts of
Xu-Kui Yang was born in Fujian, China, in 1988. He received the B.S. and M.S. degrees in information and communication from the Zhengzhou Information Science and Technology Institute, Zhengzhou, China, in 2011 and 2014, respectively. He is currently working towards the Ph.D. degree on speech recognition at the National Digital Switching System Engineering and Technological R&D Center.
His research interests are in speech signal processing, continuous speech recognition, and machine learning.
References (36)
- et al., Voice activity detection algorithm based on long-term pitch information, EURASIP J. Audio Speech Music Process. (2016)
- et al., Ensemble audio segmentation for radio and television programmes, Multimed. Tools Appl. (2017)
- et al., QCRI advanced transcription system (QATS) for the Arabic multi-dialect broadcast media recognition: MGB-2 challenge
- et al., The MGB-2 challenge: Arabic multi-dialect broadcast media recognition
- et al., The Albayzin 2016 speaker diarization evaluation
- et al., Speech activity detection on YouTube using deep neural networks
- et al., Deep belief networks based voice activity detection, IEEE Trans. Audio Speech Lang. Process. (2013)
- et al., Boosting contextual information for deep neural network based voice activity detection, IEEE/ACM Trans. Audio Speech Lang. Process. (2016)
- et al., CHiME4: multichannel enhancement using beamforming driven by DNN-based voice activity detection
- et al., Recurrent neural networks for voice activity detection
- Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies
- Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions
- The MGB challenge: evaluating multi-genre broadcast media recognition
- The NAIST ASR system for the 2015 multi-genre broadcast challenge: on combination of deep learning systems using a rank-score function
- The 2015 Sheffield system for transcription of multi-genre broadcast media
- Cambridge university transcription systems for the multi-genre broadcast challenge
- The NDSC transcription system for the 2016 multi-genre broadcast challenge
- Understanding the difficulty of training deep feed-forward neural networks, J. Mach. Learn. Res.
Dan Qu was born in Jilin, China, in 1974. She received the B.S., M.S. and Ph.D. degrees in information and communication engineering from the Zhengzhou Information Science and Technology Institute, Zhengzhou, China, in 2004, 2007 and 2013, respectively. From 2016 to 2017, she was a visiting scholar in Computer Science Institute of Carnegie Mellon University.
She is an Associate Professor in the National Digital Switching System Engineering and Technological R&D Center. Her research interests are in speech signal processing and pattern recognition & machine learning, and natural language processing.
Wen-Lin Zhang was born in Hubei, China, in 1982. He received the B.S., M.S. and Ph.D. degrees in information and communication engineering from the Zhengzhou Information Science and Technology Institute, Zhengzhou, China, in 2004, 2007 and 2013, respectively.
He is an Assistant Professor in the National Digital Switching System Engineering and Technological R&D Center. His research interests are in speech signal processing, speech recognition, and machine learning.
Wei-Qiang Zhang was born in Hebei, China, in 1979. He received the B.S. degree in applied physics from University of Petroleum, Shandong, in 2002, the M.S. degree in communication and information systems from Beijing Institute of Technology, Beijing, in 2005, and the Ph.D. degree in information and communication engineering from Tsinghua University, Beijing, in 2009. From 2016 to 2017, he was a visiting scholar at the Center for Computer Research in Music and Acoustics (CCRMA), Stanford University.
He is an Associate Professor at the Department of Electronic Engineering, Tsinghua University, Beijing. His research interests are in the area of radar signal processing, acoustic signal processing, speech signal processing, machine learning and statistical pattern recognition.
☆ This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61673395 and 61403415, and by the Henan Province Natural Science Foundation under Grant No. 162300410331. The associate editor coordinating the review of this manuscript and approving it for publication was xxxx.