Cepstral distance based channel selection for distant speech recognition☆
Introduction
Despite the extensive efforts that have been made toward reliable automatic speech recognition (ASR), the performance of many voice-based systems is still inadequate under certain conditions. For example, ASR is seriously affected by reverberation, background noise, and overlapping speakers. To overcome these limitations in distant-talking scenarios, some of the most effective strategies use multiple microphones (Wölfel, McDonough, 2009, Brandstein, Ward, 2001). There are many applications, e.g., in domestic environments, for which a significant improvement in speech recognition rate can be obtained by deploying a large number of microphones, clustered in arrays with specific geometries and distributed so as to cover the whole environment. A sparse distribution of single microphones in space, combined with automatic channel selection (CS), represents a simple and effective way to limit the overall complexity of a distant speech recognition (DSR) system.
CS makes the reasonable assumption that among the acquired microphone signals there is one that can lead to better recognition performance than the others. To identify that microphone, it is worth examining the attributes of the signal and the characteristics of the communication channel that shapes the uttered speech on its way from the source to the sensor. This channel depends on the speaker location, the head orientation, and the room acoustics, and these factors determine the overall reverberation effects that can be observed in the distant microphone signal. Environmental noise, although it is not the main focus of this work, is also a relevant issue, in particular when it is concentrated in some areas, i.e., when it introduces more distortion into a subset of the available microphones.
Various CS methods have been presented in the literature. Some of them rely on measures that quantify the effect of the channel on the speech signals. Examples of these measures are the envelope variance (EV) (Wolf and Nadeu, 2014) and the modulation spectrum ratio (Himawan et al., 2015). Energy-based techniques can also be applied to CS, in particular under controlled conditions such as when a calibrated set of microphones is available (Wolf and Nadeu, 2010).
In a previous work, we presented an initial study of how objective signal quality measures, in particular the cepstral distance (CD), can be successfully applied to the CS problem (Guerrero et al., 2016). However, we believe that a more effective application of these quality measures to our problem requires an in-depth understanding of their relationship with DSR performance. To address this missing link between CS and DSR, this work aims to provide a novel methodology for assessing the performance and limitations of CS methods, as far as reverberation effects are concerned. To the best of our knowledge, this represents the first empirical study that characterizes, from a quantitative standpoint, the overall system behavior as a function of parameters such as the distance between the speaker and the microphones, the speaker orientation, and the microphone network configuration. Additionally, this work constitutes a more extensive and deeper investigation of the CD based technique outlined in Guerrero et al. (2016). We discuss the effectiveness of CD in characterizing the reverberation in a room, e.g., by relating it to the direct-to-reverberant ratio (DRR), which supports its application to CS for DSR. We also present evidence that CD based CS is strongly related to an oracle selection of the best recognized channels. The investigated approach is then analyzed under variations of the setup that concern the speaker position and orientation, and the microphone network configuration. Finally, we extend our findings and confirm the benefits of applying CS to DSR with real data, on which the proposed method performs better than an EV based state-of-the-art method.
The remainder of this paper is organized as follows. In Section 2, multi-microphone processing for DSR is discussed. Specific parameters of the room acoustics are presented in Section 3. An overview of the most relevant CS methods is given in Section 4. CD based CS is elaborated in Section 5. In Section 6, details about the experimental framework are provided. The activities and analysis performed on the different experimental settings, and their corresponding results, are presented in Sections 7 and 8. Finally, in Section 9, the conclusions of the study and possible directions for future activities are discussed.
Multi-microphone processing for DSR
The problem of DSR in a multi-microphone setting comprises, on the one hand, the techniques used for multi-microphone speech processing and, on the other hand, the acoustic properties of the reverberant environments.
Multi-microphone speech processing approaches have proved their potential to significantly improve DSR performance in comparison to single channel solutions. Various architectures can be adopted to process the multiple inputs in order to derive a single recognition output of a spoken
Reverberation time and direct-to-reverberant ratio
When available, impulse responses (IRs) can be exploited to estimate parameters that characterize the reverberation in a non-anechoic room. Two important parameters are the reverberation time (T60) and the DRR (Kuttruff, 2009, Jo, Koyasu, 1975). The T60 is defined as the time required for a sound to decay by 60 dB from its initial level after the source abruptly stops (Kuttruff, 2007). The DRR is defined as the ratio of the sound energy that arrives at the microphone through a direct path, over the sound energy
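As an illustration of how these two parameters can be estimated from an IR, the following is a minimal sketch rather than the procedure used in this study. It assumes that the direct path is the largest-magnitude tap of the IR, that energy within a short window around it (the hypothetical parameter `direct_window_ms`) counts as direct sound while everything after it counts as reverberant, and that T60 is extrapolated from a line fitted to the Schroeder backward-integrated decay curve between -5 dB and -35 dB. The function names and default values are illustrative assumptions.

```python
import math


def drr_db(ir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB (sketch).

    Assumption: the direct path is the largest-magnitude tap, and energy
    within +/- direct_window_ms of it is treated as direct sound.
    Energy after that window is treated as reverberant.
    """
    peak = max(range(len(ir)), key=lambda n: abs(ir[n]))
    half = int(round(direct_window_ms * 1e-3 * fs))
    start, stop = max(0, peak - half), min(len(ir), peak + half + 1)
    direct = sum(x * x for x in ir[start:stop])
    reverberant = sum(x * x for x in ir[stop:])
    return 10.0 * math.log10(direct / reverberant)


def t60_schroeder(ir, fs, fit_range_db=(-5.0, -35.0)):
    """Estimate T60 from a line fitted to the Schroeder decay curve (sketch)."""
    # Backward-integrated energy decay curve (EDC), normalised to 0 dB.
    edc, total = [], 0.0
    for x in reversed(ir):
        total += x * x
        edc.append(total)
    edc.reverse()
    edc_db = [10.0 * math.log10(v / edc[0]) for v in edc if v > 0.0]

    # Keep only the part of the decay inside the fitting range.
    upper, lower = fit_range_db
    pts = [(n / fs, d) for n, d in enumerate(edc_db) if lower <= d <= upper]

    # Least-squares slope of the decay in dB per second.
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    num = sum((t - mean_t) * (d - mean_d) for t, d in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    slope = num / den

    # Extrapolate the fitted slope to a 60 dB decay.
    return -60.0 / slope
```

On measured IRs the result depends strongly on the noise floor and on the chosen fitting range, so these defaults should be read as placeholders.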
Channel selection
CS methods share the objective of detecting the least distorted channel among the available ones, assuming that a better match will result between the selected channel and the acoustic models of the DSR system. CS can be applied either at the front-end or at the post-decoding level, commonly referred to as signal based and decoder based approaches, respectively. In both cases, one relies on a specific measure which is optimized for the final selection. According to the type of information exploited for
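The selection step itself can be sketched as an optimization over per-channel scores. The snippet below is only a schematic illustration: `select_channel` and `negative_energy` are hypothetical names, and the energy-based score is a toy placeholder standing in for whichever measure a given CS method actually uses.

```python
def negative_energy(signal):
    """Toy distortion score: lower (more negative) means higher energy.

    Placeholder only; a real CS method would plug in its own measure.
    """
    return -sum(x * x for x in signal)


def select_channel(channels, distortion):
    """Return the name of the channel with the lowest distortion score.

    channels:   dict mapping a channel name to its list of samples
    distortion: callable that maps a signal to a score (lower is better)
    """
    scores = {name: distortion(signal) for name, signal in channels.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

Passing the measure as a callable keeps the selection logic independent of the measure, so a signal based score can be swapped for another without touching the rest.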
Cepstral distance based CS
Objective signal quality measures have been exploited for many years in various speech processing applications (Quackenbush, Barnwell, Clements, 1988, Loizou, 2013). Measures such as the CD, the log-likelihood ratio (LLR) (Hansen and Pellom, 1998) and the frequency weighted segmental SNR (fwSSNR) (Tribolet et al., 1978) were initially introduced in the speech coding community (Gray, Markel, 1976, Kitawaki, Nagabuchi, Itoh, 1988, Furui, Sondhi, 1991) as a means of measuring the amount of
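To make the notion of a cepstral distance concrete, here is a minimal sketch under several assumptions that are not taken from this paper: cepstra are computed from the log-magnitude DFT of Hamming-windowed frames (LPC-derived cepstra are also common in the literature), the per-frame distance is (10 / ln 10) * sqrt(2 * sum_k (c_ref[k] - c_test[k])^2) over coefficients 1 to p, and the two signals are time-aligned. The frame length, hop, and number of coefficients are arbitrary illustrative values, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def real_cepstrum(frame, n_coeffs):
    """Real-cepstrum coefficients c_1..c_p of one frame (c_0 is skipped)."""
    n = len(frame)
    # Naive O(n^2) DFT; an FFT routine would be used in practice.
    log_mag = []
    for k in range(n):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        log_mag.append(math.log(abs(s) + 1e-12))  # small floor avoids log(0)
    # Inverse DFT of the real, even log-magnitude spectrum.
    return [
        sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
        for q in range(1, n_coeffs + 1)
    ]


def cepstral_distance(ref, test, frame_len=64, hop=32, n_coeffs=12):
    """Mean frame-wise cepstral distance in dB between two aligned signals."""
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
        for i in range(frame_len)
    ]
    scale = 10.0 / math.log(10.0)
    n_samples = min(len(ref), len(test))
    dists = []
    for start in range(0, n_samples - frame_len + 1, hop):
        ref_frame = [w * x for w, x in zip(window, ref[start:start + frame_len])]
        tst_frame = [w * x for w, x in zip(window, test[start:start + frame_len])]
        c_ref = real_cepstrum(ref_frame, n_coeffs)
        c_tst = real_cepstrum(tst_frame, n_coeffs)
        sq = sum((a - b) ** 2 for a, b in zip(c_ref, c_tst))
        dists.append(scale * math.sqrt(2.0 * sq))
    return sum(dists) / len(dists)
```

Because c_0 is excluded, a pure gain change between the two signals leaves the distance essentially at zero, whereas spectral coloration such as an added echo increases it.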
Multi-microphone environments
In this study, we use two experimental multi-microphone environments, namely the SQUARE and the DIRHA rooms. These two rooms are schematically presented in Fig. 2 and Fig. 3, respectively. Their detailed characteristics are given in Table 1. In both settings, the average distance between the speaker and the microphones lies in the range of 1–4 meters. In contrast to other studies performed in much smaller spaces, the distance explored in this work implies that reverberation significantly
Experiments in the SQUARE room
In this section we report the experiments performed in the SQUARE room setting, based on the use of IM generated IRs. Concerning speech recognition, all the experiments were conducted using the dnn system detailed in the previous section.
Experiments in the DIRHA room
This section is concerned with recognition experiments in the DIRHA room, which involve the use of two data sets. The first one consists of reverberated speech generated by convolving the IRs measured in the real environment and the clean speech acquired in the FBK recording studio. The second one includes real data recorded in a reverberant room, as reported in Section 6.2. The speech recognition results presented in this section were produced using the dnn system detailed in Section 6.3.
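The generation of reverberated speech by convolution can be sketched as follows. This is a plain time-domain linear convolution written for clarity; the function name `reverberate` is an illustrative assumption, and for signals and IRs of realistic length an FFT-based routine (for example `scipy.signal.fftconvolve`) would normally be preferred for speed.

```python
def reverberate(clean, ir):
    """Full linear convolution of a clean signal with an impulse response.

    The output has len(clean) + len(ir) - 1 samples, so the reverberant
    tail that extends past the end of the clean signal is preserved.
    """
    out = [0.0] * (len(clean) + len(ir) - 1)
    for i, x in enumerate(clean):
        if x == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(ir):
            out[i + j] += x * h
    return out
```

Convolving with a single unit tap returns the clean signal unchanged, which is a convenient sanity check before using measured IRs.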
The
Conclusions and future directions
This work has proposed an effective approach to study CS for DSR. The focus was on CS based on objective quality measures, and particularly on the use of CD in an informed and a blind fashion. Using simulated material, we studied the relation between the CD and specific characteristics of the acoustic conditions. It was shown that CD is closely related both to T60 and to DRR, a finding that endorses the use of the CD measure in the context of CS. Furthermore, CD was found to be
References (59)
- Speech Dereverberation (2010)
- Objective measures of speech quality (1988)
- Distant Speech Recognition (2009)
- Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. (1979)
- The third CHiME speech separation and recognition challenge: dataset, task and baselines. Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (2015)
- Microphone Arrays: Signal Processing Techniques and Applications (2001)
- Calculating reverberation time from impulse responses: a comparison of software implementations. Acoust. Aust. (2016)
- The DIRHA simulated corpus. Proceedings of International Conference on Language Resources and Evaluation (2014)
- Estimation of room acoustic parameters: the ACE challenge. IEEE/ACM Trans. Audio Speech Lang. Process. (2016)
- Posterior probability decoding, confidence estimation and system combination. Proceedings of Speech Transcription Workshop (2000)
- Simultaneous measurement of impulse response and distortion with a swept-sine technique. Proceedings of 108th Audio Engineering Society Convention
- A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER). Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding
- Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Am.
- Advances in Speech Signal Processing
- Continuous speech recognition (CSR-I) Wall Street Journal (WSJ0) News Complete
- Distance measures for speech processing. IEEE Trans. Acoust. Speech Signal Process.
- Channel selection for distant speech recognition - exploiting cepstral distance. Proceedings of Interspeech - Annual Conference of the International Speech Communication Association
- An effective quality evaluation protocol for speech enhancement algorithms. Proceedings of International Conference on Spoken Language Processing
- Channel selection in the short-time modulation domain for distant speech recognition. Proceedings of Interspeech - Annual Conference of the International Speech Communication Association
- Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process.
- Spoken Language Processing: A Guide to Theory, Algorithm, and System Development
- Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks. Proceedings of Interspeech - Annual Conference of the International Speech Communication Association
- Acoustics - Measurement of Room Acoustic Parameters - Part 2: Reverberation Time in Ordinary Rooms
- Measurement of reverberation time based on the direct-reverberant sound energy ratio in steady state. Proceedings of INTER-NOISE and NOISE-CON Congress and Conference Proceedings
- The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
- Objective quality evaluation for low-bit-rate speech coding systems. IEEE J. Sel. Areas Commun.
- Channel selection based on multichannel cross-correlation coefficients for distant speech recognition. Proceedings of Joint Workshop on Hands-free Speech Communication and Microphone Arrays
- Acoustics: An Introduction
- Room Acoustics
☆ This paper has been recommended for acceptance by Roger Moore.