Computer Speech & Language

Volume 47, January 2018, Pages 314-332

Cepstral distance based channel selection for distant speech recognition

https://doi.org/10.1016/j.csl.2017.08.003

Highlights

  • This work concerns distant speech recognition (DSR) with microphones sparsely distributed in space.

  • We introduce a methodology to conduct studies on channel selection (CS) for DSR.

  • A CS method is proposed that relies on cepstral distance as a measure of signal quality.

  • Experiments are conducted on both simulated and real multi-microphone data sets.

  • Results demonstrate the effectiveness of the proposed methodology and techniques.

Abstract

Shifting from a single to a multi-microphone setting, distant speech recognition can benefit from the multiple instances of the same utterance in many ways. An effective approach, especially when microphones are not organized in an array fashion, is given by channel selection (CS), which assumes that for each utterance there is at least one channel that can improve the recognition results when compared to the decoding of the remaining channels. In order to identify this most favourable channel, a possible approach is to estimate the degree of distortion that characterizes each microphone signal. In a reverberant environment, this distortion can vary significantly across microphones, for instance due to the orientation of the speaker’s head. In this work, we investigate the application of cepstral distance as a distortion measure, which turns out to be closely related to properties of the room acoustics, such as reverberation time and direct-to-reverberant ratio. From this measure, a blind CS method is derived, which relies on a reference computed by averaging the log magnitude spectra of all the microphone signals. Another aim of our study is to propose a novel methodology to analyze CS under a wide set of experimental conditions and setup variations, which depend on the sound source position, its orientation, and the microphone network configuration. Based on the use of prior information, we introduce an informed technique to predict CS performance. Experimental results show both the effectiveness of the proposed blind CS method and the value of the aforementioned analysis methodology. The experiments were conducted using different sets of real and simulated data, the latter derived from both synthetic and measured impulse responses. It is demonstrated that the proposed blind CS method is closely related to the oracle selection of the best recognized channel. Moreover, our method outperforms a state-of-the-art one, especially on real data.

Introduction

Despite the extensive efforts that have been made towards reliable automatic speech recognition (ASR), the performance of many voice-based systems is still inadequate under certain conditions. For example, ASR is seriously affected by the presence of reverberation, background noise, and overlapping speakers. In order to overcome these limitations in distant-talking scenarios, some of the most effective strategies adopt the use of multiple microphones (Wölfel, McDonough, 2009, Brandstein, Ward, 2001). There are many applications, e.g., in domestic environments, for which a significant improvement in terms of speech recognition rate can be obtained by deploying a large number of microphones, clustered in arrays with specific geometries, and distributed so as to cover the whole environment. A sparse distribution of single microphones in space, combined with automatic channel selection (CS), represents a simple and effective solution to limit the overall complexity of a distant speech recognition (DSR) system.

CS makes the reasonable assumption that among the acquired microphone signals there is one that can lead to better recognition performance than the others. In order to identify this microphone, it is worth addressing the attributes of the signal and the characteristics of the communication channel that shape the uttered speech on its way from the source to the sensor, and that depend on the speaker location, the head orientation, and the room acoustics. These variabilities determine the overall reverberation effects that can be observed in the distant microphone signal. Environmental noise, although it is not the main focus of this work, also represents a relevant issue, in particular when it is more concentrated in some areas, i.e., when it introduces more distortion into a subset of the available microphones.

Various CS methods have been presented in the literature, as reviewed in the following. Some of them rely on measures that quantify the effect of the channel on the speech signals. Examples of these measures are the envelope variance (EV) (Wolf and Nadeu, 2014) and the modulation spectrum ratio (Himawan et al., 2015); the intuition behind envelope-based measures is sketched below. Energy-based techniques can also be applied to CS, in particular under controlled conditions, e.g., when a calibrated set of microphones is available (Wolf and Nadeu, 2010).
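
Envelope-based measures build on the observation that reverberation smooths the temporal envelopes of sub-band energies, so a channel whose envelopes retain more variance is likely to be less reverberant. The following sketch illustrates only this intuition; the filterbank, the compression, and the absence of any cross-channel normalization are placeholders of this illustration, not the exact formulation of Wolf and Nadeu (2014).

```python
import numpy as np

def envelope_variance_score(x, frame_len=400, hop=160, n_bands=20):
    """Schematic envelope-based score: variance over time of compressed
    sub-band energy envelopes, averaged across bands. Larger values suggest
    envelopes that are less smoothed by reverberation."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # per-frame power spectra
    bands = np.array_split(power, n_bands, axis=1)          # crude equal-width bands
    env = np.stack([b.sum(axis=1) for b in bands], axis=1)  # (frames, bands) envelopes
    env = np.cbrt(env)                                      # dynamic range compression
    return float(np.mean(np.var(env, axis=0)))              # variance over time, mean over bands

def select_channel_ev(channels):
    """Pick the channel whose compressed sub-band envelopes vary the most."""
    return int(np.argmax([envelope_variance_score(x) for x in channels]))
```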

In a previous work, we presented an initial study of how objective signal quality measures, in particular the cepstral distance (CD), can be successfully applied to the CS problem (Guerrero et al., 2016). However, we believe that an important requirement for a more effective application of these quality measures to our problem is an in-depth understanding of their relationship with DSR performance. In order to address this missing link between CS and DSR, this work aims to provide a novel methodology for assessing the performance and limitations of CS methods, as far as reverberation effects are concerned. To the best of our knowledge, this represents the first empirical study that characterizes, from a quantitative standpoint, the overall system behavior under parameters such as the distance between the speaker and the microphones, the speaker orientation, and the microphone network configuration. Additionally, this work constitutes a more extensive and deeper investigation of the CD based technique outlined in Guerrero et al. (2016). We discuss the effectiveness of CD in characterizing the reverberation in a room, e.g., by relating it to the direct-to-reverberant ratio (DRR), which supports its application to CS for DSR. We also present evidence showing that CD based CS is strongly related to an oracle selection of the best recognized channels. Then, the investigated approach is analyzed under setup variations that concern the speaker position and orientation, and the microphone network configuration. Finally, we extend our findings and confirm the benefits of applying CS to DSR with the use of real data, on which the proposed method achieves better performance than an EV based state-of-the-art method.

The remainder of this paper is organized as follows. In Section 2, multi-microphone processing for DSR is discussed. Specific parameters of the room acoustics are presented in Section 3. An overview of the most relevant CS methods is given in Section 4. CD based CS is elaborated in Section 5. In Section 6, details about the experimental framework are provided. The activities and analysis performed on the different experimental settings, and their corresponding results, are presented in Sections 7 and 8. Finally, in Section 9 the conclusions of the study and possible directions for future activities are discussed.

Section snippets

Multi-microphone processing for DSR

The problem of DSR in a multi-microphone setting involves, on the one hand, the techniques used for multi-microphone speech processing and, on the other hand, the acoustic properties of reverberant environments.

Multi-microphone speech processing approaches have proved their potential to significantly improve DSR performance in comparison to single channel solutions. Various architectures can be adopted to process the multiple inputs in order to derive a single recognition output of a spoken

Reverberation time and direct-to-reverberant ratio

When available, IRs can be exploited to estimate parameters that characterize the reverberation in a non-anechoic room. Two important parameters are the reverberation time (T60) and the DRR (Kuttruff, 2009, Jo, Koyasu, 1975). The T60 is defined as the time required for a sound to decay 60 dB from its initial level, after an abrupt cessation of the source (Kuttruff, 2007). The DRR is defined as the ratio of the sound energy that arrives at the microphone through a direct path, over the sound energy
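
For concreteness, both parameters can be written compactly in terms of a measured IR h. The formulation below uses Schroeder's backward integration for the decay curve and an energy ratio for the DRR; the boundary n_d separating the direct-path component (plus a small tolerance window) from the reverberant tail is an assumption of this illustration rather than a value fixed by the paper, and in practice T60 is usually extrapolated from a shorter decay range (e.g., T20 or T30).

```latex
% Energy decay curve (Schroeder backward integration) of an IR h, and the
% reverberation time read off as the instant of a 60 dB decay:
\[
\mathrm{EDC}(t) = \int_{t}^{\infty} h^{2}(\tau)\,\mathrm{d}\tau,
\qquad
T_{60} = \min\Big\{\, t \;:\; 10\log_{10}\tfrac{\mathrm{EDC}(t)}{\mathrm{EDC}(0)} \le -60 \,\Big\}.
\]

% Direct-to-reverberant ratio, assuming the direct path (plus a small
% tolerance window) is contained in the first n_d samples of the sampled IR:
\[
\mathrm{DRR} = 10\log_{10}
\frac{\sum_{n=0}^{n_d} h^{2}(n)}{\sum_{n=n_d+1}^{\infty} h^{2}(n)}
\;\;\text{dB}.
\]
```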

Channel selection

CS methods share the objective of detecting the least distorted channel among the available ones, assuming that the selected channel will better match the acoustic models of the DSR system. CS can be applied either at the front-end or at the post-decoding level, commonly referred to as signal based and decoder based approaches, respectively. In both cases, one relies on a specific measure which is optimized for the final selection. According to the type of information exploited for

Cepstral distance based CS

Objective signal quality measures have been exploited for many years in various speech processing applications (Quackenbush, Barnwell, Clements, 1988, Loizou, 2013). Measures such as the CD, the log-likelihood ratio (LLR) (Hansen and Pellom, 1998) and the frequency weighted segmental SNR (fwSSNR) (Tribolet et al., 1978) were initially introduced in the speech coding community (Gray, Markel, 1976, Kitawaki, Nagabuchi, Itoh, 1988, Furui, Sondhi, 1991) as a means of measuring the amount of
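
As a concrete illustration, the sketch below follows the blind selection rule summarized in the abstract: the reference is obtained by averaging the log magnitude spectra of all microphone signals, and the channel with the smallest cepstral distance to this reference is selected, mirroring the intrusive use of CD against a clean reference. The frame length, the number of cepstral coefficients, and the omission of the usual constant scaling of the CD are assumptions of this sketch, not the settings used in the paper.

```python
import numpy as np

def log_mag_spectra(x, frame_len=512, hop=256):
    """Framed log magnitude spectra of a 1-D signal (Hann window)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-12)

def cepstra(log_spec, n_coef=24):
    """Real cepstrum of each frame; keep coefficients 1..n_coef (c0 excluded)."""
    c = np.fft.irfft(log_spec, axis=1)
    return c[:, 1:n_coef + 1]

def select_channel_cd(channels):
    """Blind CD-based channel selection: the reference is the per-frame average
    of the channels' log magnitude spectra; the channel with the smallest mean
    cepstral distance to that reference is returned. Constant factors of the
    standard CD formula are omitted since they do not change the argmin."""
    specs = [log_mag_spectra(x) for x in channels]
    n = min(s.shape[0] for s in specs)            # align the number of frames
    specs = [s[:n] for s in specs]
    ref_cep = cepstra(np.mean(np.stack(specs), axis=0))
    dists = []
    for s in specs:
        d = cepstra(s) - ref_cep
        dists.append(np.mean(np.sqrt(np.sum(d ** 2, axis=1))))  # frame-wise Euclidean CD
    return int(np.argmin(dists))
```

In practice, the distance computation could be restricted to speech frames only; any such refinement, as well as the exact cepstral order, would follow the configuration described later in the paper.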

Multi-microphone environments

In this study, we use two experimental multi-microphone environments, namely the SQUARE and the DIRHA rooms. These two rooms are schematically presented in Fig. 2 and Fig. 3, respectively. Their detailed characteristics are given in Table 1. In both settings, the average distance between the speaker and the microphones varies in the range of 1–4 meters. In contrast to other studies performed in much smaller spaces, the distances explored in this work imply that reverberation significantly

Experiments in the SQUARE room

In this section we report the experiments performed in the SQUARE room setting, based on the use of IM-generated IRs. Concerning speech recognition, all the experiments were conducted using the DNN system detailed in the previous section.

Experiments in the DIRHA room

This section is concerned with recognition experiments in the DIRHA room, which involve the use of two data sets. The first one consists of reverberated speech generated by convolving the IRs measured in the real environment with the clean speech acquired in the FBK recording studio. The second one includes real data recorded in a reverberant room, as reported in Section 6.2. The speech recognition results presented in this section were produced using the DNN system detailed in Section 6.3.

Conclusions and future directions

This work has proposed an effective approach to study CS for DSR. The focus was on CS based on objective quality measures, and particularly on the use of CD in an informed and in a blind fashion. With the use of simulated material we studied the relation between the CD and specific characteristics of the acoustic conditions. It was shown that CD is closely related both to T60 and to DRR, a finding that endorses the use of the CD measure in the context of CS. Furthermore, CD was found to be

References (59)

  • P.A. Naylor et al.

    Speech Dereverberation

    (2010)
  • S. Quackenbush et al.

    Objective measures of speech quality

    (1988)
  • M. Wölfel et al.

    Distant Speech Recognition

    (2009)
  • J.B. Allen et al.

    Image method for efficiently simulating small-room acoustics

    J. Acoust. Soc. Am.

    (1979)
  • J. Barker et al.

    The third CHiME speech separation and recognition challenge: dataset, task and baselines

    Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding

    (2015)
  • M. Brandstein et al.

    Microphone Arrays: Signal Processing Techniques and Applications

    (2001)
  • D. Cabrera et al.

    Calculating reverberation time from impulse responses: a comparison of software implementations

    Acoust. Aust.

    (2016)
  • L. Cristoforetti et al.

    The DIRHA simulated corpus

    Proceedings of International Conference on Language Resources and Evaluation

    (2014)
  • J. Eaton et al.

    Estimation of room acoustic parameters: the ACE challenge

    IEEE/ACM Trans. Audio Speech Lang. Process.

    (2016)
  • G. Evermann et al.

    Posterior probability decoding, confidence estimation and system combination

    Proceedings of Speech Transcription Workshop

    (2000)
  • A. Farina

    Simultaneous measurement of impulse response and distortion with a swept-sine technique

    Proceedings of 108th Audio Engineering Society Convention

    (2000)
  • J.G. Fiscus

    A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER)

    Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding

    (1997)
  • J. Flanagan et al.

    Computer-steered microphone arrays for sound transduction in large rooms

    J. Acoust. Soc. Am.

    (1985)
  • S. Furui et al.

    Advances in Speech Signal Processing

    (1991)
  • J. Garofalo et al.

    Continuous speech recognition (CSR-I) Wall Street Journal (WSJ0) News Complete

    (1993)
  • A. Gray et al.

    Distance measures for speech processing

    IEEE Trans. Acoust. Speech Signal Process.

    (1976)
  • C. Guerrero et al.

    Channel selection for distant speech recognition - exploiting cepstral distance

    Proceedings of Interspeech - Annual Conference of the International Speech Communication Association

    (2016)
  • J.H. Hansen et al.

    An effective quality evaluation protocol for speech enhancement algorithms

    Proceedings of International Conference on Spoken Language Processing

    (1998)
  • I. Himawan et al.

    Channel selection in the short-time modulation domain for distant speech recognition

    Proceedings of Interspeech - Annual Conference of the International Speech Communication Association

    (2015)
  • Y. Hu et al.

    Evaluation of objective quality measures for speech enhancement

    IEEE Trans. Audio Speech Lang. Process.

    (2008)
  • X. Huang et al.

    Spoken Language Processing: A Guide to Theory, Algorithm, and System Development

    (2001)
  • Y. Huang et al.

    Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks

    Proceedings of Interspeech - Annual Conference of the International Speech Communication Association

    (2014)
  • EN ISO

    Acoustics – Measurement of Room Acoustic Parameters – Part 2: Reverberation Time in Ordinary Rooms

    (2008)
  • T. Jo et al.

    Measurement of reverberation time based on the direct-reverberant sound energy ratio in steady state

    Proceedings of INTER-NOISE and NOISE-CON Congress and Conference Proceedings

    (1975)
  • K. Kinoshita et al.

    The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech

    Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

    (2013)
  • N. Kitawaki et al.

    Objective quality evaluation for low-bit-rate speech coding systems

    IEEE J. Sel. Areas Commun.

    (1988)
  • K. Kumatani et al.

    Channel selection based on multichannel cross-correlation coefficients for distant speech recognition

    Proceedings of Joint Workshop on Hands-free Speech Communication and Microphone Arrays

    (2011)
  • H. Kuttruff

    Acoustics: An Introduction

    (2007)
  • H. Kuttruff

    Room Acoustics

    (2009)

    This paper has been recommended for acceptance by Roger Moore.
