Computer Speech & Language

Volume 47, January 2018, Pages 314-332

Cepstral distance based channel selection for distant speech recognition

https://doi.org/10.1016/j.csl.2017.08.003

Highlights

  • This work concerns distant speech recognition (DSR) with microphones sparsely distributed in space.

  • We introduce a methodology to conduct studies on channel selection (CS) for DSR.

  • A CS method is proposed that relies on cepstral distance as a measure of signal quality.

  • Experiments are conducted on both simulated and real multi-microphone data sets.

  • Results demonstrate the effectiveness of the proposed methodology and techniques.

Abstract

Shifting from a single to a multi-microphone setting, distant speech recognition can benefit from the multiple instances of the same utterance in many ways. An effective approach, especially when microphones are not organized in an array fashion, is given by channel selection (CS), which assumes that for each utterance there is at least one channel that can improve the recognition results when compared to the decoding of the remaining channels. In order to identify this most favourable channel, a possible approach is to estimate the degree of distortion that characterizes each microphone signal. In a reverberant environment, this distortion can vary significantly across microphones, for instance due to the orientation of the speaker’s head. In this work, we investigate the application of cepstral distance as a distortion measure, which turns out to be closely related to properties of the room acoustics, such as reverberation time and direct-to-reverberant ratio. From this measure, a blind CS method is derived, which relies on a reference computed by averaging the log magnitude spectra of all the microphone signals. Another aim of our study is to propose a novel methodology to analyze CS under a wide set of experimental conditions and setup variations, which depend on the sound source position, its orientation, and the microphone network configuration. Based on the use of prior information, we introduce an informed technique to predict CS performance. Experimental results show both the effectiveness of the proposed blind CS method and the value of the aforementioned analysis methodology. The experiments were conducted using different sets of real and simulated data, the latter derived from both synthetic and measured impulse responses. It is demonstrated that the proposed blind CS method is closely related to the oracle selection of the best recognized channel. Moreover, our method outperforms a state-of-the-art one, especially on real data.

Introduction

Despite the extensive efforts that have been made towards reliable automatic speech recognition (ASR), the performance of many voice-based systems is still inadequate under certain conditions. For example, ASR is seriously affected by the presence of reverberation, background noise, and overlapping speakers. In order to overcome these limitations in distant-talking scenarios, some of the most effective strategies adopt the use of multiple microphones (Wölfel, McDonough, 2009, Brandstein, Ward, 2001). There are many applications, e.g., in domestic environments, for which a significant improvement in terms of speech recognition rate can be obtained by deploying a large number of microphones, clustered in arrays with specific geometries, and distributed so as to cover the whole environment. A sparse distribution of single microphones in space, combined with automatic channel selection (CS), represents a simple and effective solution to limit the overall complexity of a distant speech recognition (DSR) system.

CS makes the reasonable assumption that among the acquired microphone signals there is one that can lead to better recognition performance than the others. In order to identify this microphone, it is worth addressing the attributes of the signal and the characteristics of the communication channel that shape the uttered speech on its way from the source to the sensor, and that depend on the speaker location, the head orientation, and the room acoustics. These variabilities determine the overall reverberation effects that can be observed in the distant microphone signal. Environmental noise, although it is not the main focus of this work, also represents a relevant issue, in particular when it is more concentrated in some areas, i.e., when it introduces more distortion into a subset of the available microphones.

Various CS methods have been presented in the literature, as reviewed in the following. Some of them rely on measures that quantify the effect of the channel on the speech signals. Examples of these measures are the envelope variance (EV) (Wolf and Nadeu, 2014) and the modulation spectrum ratio (Himawan et al., 2015); the intuition behind envelope-based measures is sketched below. Energy-based techniques can also be applied to CS, in particular under controlled conditions, e.g., when a calibrated set of microphones is available (Wolf and Nadeu, 2010).
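
Envelope-based measures build on the observation that reverberation smooths the temporal envelopes of sub-band energies, so a channel whose envelopes retain more variance is likely to be less reverberant. The following sketch illustrates only this intuition; the filterbank, the compression, and the absence of any cross-channel normalization are placeholders of this illustration, not the exact formulation of Wolf and Nadeu (2014).

```python
import numpy as np

def envelope_variance_score(x, frame_len=400, hop=160, n_bands=20):
    """Schematic envelope-based score: variance over time of compressed
    sub-band energy envelopes, averaged across bands. Larger values suggest
    envelopes that are less smoothed by reverberation."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # per-frame power spectra
    bands = np.array_split(power, n_bands, axis=1)          # crude equal-width bands
    env = np.stack([b.sum(axis=1) for b in bands], axis=1)  # (frames, bands) envelopes
    env = np.cbrt(env)                                      # dynamic range compression
    return float(np.mean(np.var(env, axis=0)))              # variance over time, mean over bands

def select_channel_ev(channels):
    """Pick the channel whose compressed sub-band envelopes vary the most."""
    return int(np.argmax([envelope_variance_score(x) for x in channels]))
```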

In a previous work, we presented an initial study of how objective signal quality measures, in particular the cepstral distance (CD), can be successfully applied to the CS problem (Guerrero et al., 2016). However, we believe that an important requirement for a more effective application of these quality measures to our problem is an in-depth understanding of their relationship with DSR performance. In order to address this missing link between CS and DSR, this work aims to provide a novel methodology for assessing the performance and limitations of CS methods, as far as reverberation effects are concerned. To the best of our knowledge, this represents the first empirical study that characterizes, from a quantitative standpoint, the overall system behavior under parameters such as the distance between the speaker and the microphones, the speaker orientation, and the microphone network configuration. Additionally, this work constitutes a more extensive and deeper investigation of the CD based technique outlined in Guerrero et al. (2016). We discuss the effectiveness of CD in characterizing the reverberation in a room, e.g., by relating it to the direct-to-reverberant ratio (DRR), which supports its application to CS for DSR. We also present evidence showing that CD based CS is strongly related to an oracle selection of the best recognized channels. Then, the investigated approach is analyzed under setup variations that concern the speaker position and orientation, and the microphone network configuration. Finally, we extend our findings and confirm the benefits of applying CS to DSR with the use of real data, on which the proposed method achieves better performance than an EV based state-of-the-art method.

The remainder of this paper is organized as follows. In Section 2, multi-microphone processing for DSR is discussed. Specific parameters of the room acoustics are presented in Section 3. An overview of the most relevant CS methods is given in Section 4. CD based CS is elaborated in Section 5. In Section 6, details about the experimental framework are provided. The activities and analysis performed on the different experimental settings, and their corresponding results, are presented in Sections 7 and 8. Finally, in Section 9 the conclusions of the study and possible directions for future activities are discussed.

Section snippets

Multi-microphone processing for DSR

The problem of DSR in a multi-microphone setting involves, on the one hand, the techniques used for multi-microphone speech processing and, on the other hand, the acoustic properties of reverberant environments.

Multi-microphone speech processing approaches have proved their potential to significantly improve DSR performance in comparison to single channel solutions. Various architectures can be adopted to process the multiple inputs in order to derive a single recognition output of a spoken

Reverberation time and direct-to-reverberant ratio

When available, IRs can be exploited to estimate parameters that characterize the reverberation in a non-anechoic room. Two important parameters are the reverberation time (T60) and the DRR (Kuttruff, 2009, Jo, Koyasu, 1975). The T60 is defined as the time required for a sound to decay 60 dB from its initial level, after an abrupt cessation of the source (Kuttruff, 2007). The DRR is defined as the ratio of the sound energy that arrives at the microphone through a direct path, over the sound energy
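
For concreteness, both parameters can be written compactly in terms of a measured IR h. The formulation below uses Schroeder's backward integration for the decay curve and an energy ratio for the DRR; the boundary n_d separating the direct-path component (plus a small tolerance window) from the reverberant tail is an assumption of this illustration rather than a value fixed by the paper, and in practice T60 is usually extrapolated from a shorter decay range (e.g., T20 or T30).

```latex
% Energy decay curve (Schroeder backward integration) of an IR h, and the
% reverberation time read off as the instant of a 60 dB decay:
\[
\mathrm{EDC}(t) = \int_{t}^{\infty} h^{2}(\tau)\,\mathrm{d}\tau,
\qquad
T_{60} = \min\Big\{\, t \;:\; 10\log_{10}\tfrac{\mathrm{EDC}(t)}{\mathrm{EDC}(0)} \le -60 \,\Big\}.
\]

% Direct-to-reverberant ratio, assuming the direct path (plus a small
% tolerance window) is contained in the first n_d samples of the sampled IR:
\[
\mathrm{DRR} = 10\log_{10}
\frac{\sum_{n=0}^{n_d} h^{2}(n)}{\sum_{n=n_d+1}^{\infty} h^{2}(n)}
\;\;\text{dB}.
\]
```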

Channel selection

CS methods share the objective of detecting the least distorted channel among the available ones, assuming that the selected channel will better match the acoustic models of the DSR system. CS can be applied either at the front-end or at the post-decoding level, commonly referred to as signal based and decoder based approaches, respectively. In both cases, one relies on a specific measure which is optimized for the final selection. According to the type of information exploited for

Cepstral distance based CS

Objective signal quality measures have been exploited for many years in various speech processing applications (Quackenbush, Barnwell, Clements, 1988, Loizou, 2013). Measures such as the CD, the log-likelihood ratio (LLR) (Hansen and Pellom, 1998) and the frequency weighted segmental SNR (fwSSNR) (Tribolet et al., 1978) were initially introduced in the speech coding community (Gray, Markel, 1976, Kitawaki, Nagabuchi, Itoh, 1988, Furui, Sondhi, 1991) as a means of measuring the amount of
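
As a concrete illustration, the sketch below follows the blind selection rule summarized in the abstract: the reference is obtained by averaging the log magnitude spectra of all microphone signals, and the channel with the smallest cepstral distance to this reference is selected, mirroring the intrusive use of CD against a clean reference. The frame length, the number of cepstral coefficients, and the omission of the usual constant scaling of the CD are assumptions of this sketch, not the settings used in the paper.

```python
import numpy as np

def log_mag_spectra(x, frame_len=512, hop=256):
    """Framed log magnitude spectra of a 1-D signal (Hann window)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-12)

def cepstra(log_spec, n_coef=24):
    """Real cepstrum of each frame; keep coefficients 1..n_coef (c0 excluded)."""
    c = np.fft.irfft(log_spec, axis=1)
    return c[:, 1:n_coef + 1]

def select_channel_cd(channels):
    """Blind CD-based channel selection: the reference is the per-frame average
    of the channels' log magnitude spectra; the channel with the smallest mean
    cepstral distance to that reference is returned. Constant factors of the
    standard CD formula are omitted since they do not change the argmin."""
    specs = [log_mag_spectra(x) for x in channels]
    n = min(s.shape[0] for s in specs)            # align the number of frames
    specs = [s[:n] for s in specs]
    ref_cep = cepstra(np.mean(np.stack(specs), axis=0))
    dists = []
    for s in specs:
        d = cepstra(s) - ref_cep
        dists.append(np.mean(np.sqrt(np.sum(d ** 2, axis=1))))  # frame-wise Euclidean CD
    return int(np.argmin(dists))
```

In practice, the distance computation could be restricted to speech frames only; any such refinement, as well as the exact cepstral order, would follow the configuration described later in the paper.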

Multi-microphone environments

In this study, we use two experimental multi-microphone environments, namely the SQUARE and the DIRHA rooms. These two rooms are schematically presented in Fig. 2 and Fig. 3, respectively. Their detailed characteristics are given in Table 1. In both settings, the average distance between the speaker and the microphones varies in the range of 1–4 meters. In contrast to other studies performed in much smaller spaces, the distances explored in this work imply that reverberation significantly

Experiments in the SQUARE room

In this section we report the experiments performed in the SQUARE room setting, based on the use of IM-generated IRs. Concerning speech recognition, all the experiments were conducted using the DNN system detailed in the previous section.

Experiments in the DIRHA room

This section is concerned with recognition experiments in the DIRHA room, which involve the use of two data sets. The first one consists of reverberated speech generated by convolving the IRs measured in the real environment with the clean speech acquired in the FBK recording studio. The second one includes real data recorded in a reverberant room, as reported in Section 6.2. The speech recognition results presented in this section were produced using the DNN system detailed in Section 6.3.

Conclusions and future directions

This work has proposed an effective approach to study CS for DSR. The focus was on CS based on objective quality measures, and particularly on the use of CD in an informed and in a blind fashion. With the use of simulated material we studied the relation between the CD and specific characteristics of the acoustic conditions. It was shown that CD is closely related both to T60 and to DRR, a finding that endorses the use of the CD measure in the context of CS. Furthermore, CD was found to be

References (59)

  • P.A. Naylor et al.

    Speech Dereverberation

    (2010)
  • S. Quackenbush et al.

    Objective measures of speech quality

    (1988)
  • M. Wölfel et al.

    Distant Speech Recognition

    (2009)
  • J.B. Allen et al.

    Image method for efficiently simulating small-room acoustics

    J. Acoust. Soc. Am.

    (1979)
  • J. Barker et al.

    The third CHiME speech separation and recognition challenge: dataset, task and baselines

    Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding

    (2015)
  • M. Brandstein et al.

    Microphone Arrays: Signal Processing Techniques and Applications

    (2001)
  • D. Cabrera et al.

    Calculating reverberation time from impulse responses: a comparison of software implementations

    Acoust. Aust.

    (2016)
  • L. Cristoforetti et al.

    The DIRHA simulated corpus

    Proceedings of International Conference on Language Resources and Evaluation

    (2014)
  • J. Eaton et al.

    Estimation of room acoustic parameters: the ACE challenge

    IEEE/ACM Trans. Audio Speech Lang. Process.

    (2016)
  • G. Evermann et al.

    Posterior probability decoding, confidence estimation and system combination

    Proceedings of Speech Transcription Workshop

    (2000)
  • A. Farina

    Simultaneous measurement of impulse response and distortion with a swept-sine technique

    Proceedings of 108th Audio Engineering Society Convention

    (2000)
  • J.G. Fiscus

    A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER)

    Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding

    (1997)
  • J. Flanagan et al.

    Computer-steered microphone arrays for sound transduction in large rooms

    J. Acoust. Soc. Am.

    (1985)
  • S. Furui et al.

    Advances in Speech Signal Processing

    (1991)
  • J. Garofalo et al.

    Continuous speech recognition (CSR-I) Wall Street Journal (WSJ0) News Complete

    (1993)
  • A. Gray et al.

    Distance measures for speech processing

    IEEE Trans. Acoust. Speech Signal Process.

    (1976)
  • C. Guerrero et al.

    Channel selection for distant speech recognition - exploiting cepstral distance

    Proceedings of Interspeech - Annual Conference of the International Speech Communication Association

    (2016)
  • J.H. Hansen et al.

    An effective quality evaluation protocol for speech enhancement algorithms

    Proceedings of International Conference on Spoken Language Processing

    (1998)
  • I. Himawan et al.

    Channel selection in the short-time modulation domain for distant speech recognition

    Proceedings of Interspeech - Annual Conference of the International Speech Communication Association

    (2015)
  • Y. Hu et al.

    Evaluation of objective quality measures for speech enhancement

    IEEE Trans. Audio Speech Lang. Process.

    (2008)
  • X. Huang et al.

    Spoken Language Processing: A Guide to Theory, Algorithm, and System Development

    (2001)
  • Y. Huang et al.

    Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks

    Proceedings of Interspeech - Annual Conference of the International Speech Communication Association

    (2014)
  • EN ISO

    Acoustics – Measurement of Room Acoustic Parameters – Part 2: Reverberation Time in Ordinary Rooms

    (2008)
  • T. Jo et al.

    Measurement of reverberation time based on the direct-reverberant sound energy ratio in steady state

    Proceedings of INTER-NOISE and NOISE-CON Congress and Conference Proceedings

    (1975)
  • K. Kinoshita et al.

    The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech

    Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

    (2013)
  • N. Kitawaki et al.

    Objective quality evaluation for low-bit-rate speech coding systems

    IEEE J. Sel. Areas Commun.

    (1988)
  • K. Kumatani et al.

    Channel selection based on multichannel cross-correlation coefficients for distant speech recognition

    Proceedings of Joint Workshop on Hands-free Speech Communication and Microphone Arrays

    (2011)
  • H. Kuttruff

    Acoustics: An Introduction

    (2007)
  • H. Kuttruff

    Room Acoustics

    (2009)

    This paper has been recommended for acceptance by Roger Moore.
