Elsevier

Speech Communication

Volume 76, February 2016, Pages 170-185
On the relationship between Early-to-Late Ratio of Room Impulse Responses and ASR performance in reverberant environments

https://doi.org/10.1016/j.specom.2015.09.004

Highlights

  • Study on the correlation between ASR accuracy and reverberant acoustic conditions.

  • Experiments involve a large variety of simulated and measured room impulse responses.

  • The best duration of early arrivals is determined experimentally.

  • The approach is applied to data contamination for acoustic model training and model selection.

  • A large vocabulary recognition task (WSJ) is considered using both GMM and DNN.

Abstract

This work presents an experimental analysis of distant-talking speech recognition in a variety of reverberant conditions, correlating ASR performance to a compact representation of the propagation channel (i.e., the room impulse response).

It is well known that reverberation and background noise degrade speech recognition performance, but few studies have investigated the relation between room impulse responses and recognition rates in a comprehensive manner. In particular, we show how ASR accuracy is related to features derived from the structure of the early arrivals and the reverberation tail. A representation based on the combination of a few parameters is hence proposed, and the impact of reverberation on different speech recognition tasks is analysed. The derived measure can be applied to data contamination for acoustic modeling, where it can be employed either to select the most suitable model for a given acoustic condition or to define the subset of room impulse responses to be used for the creation of partially matched reverberant models. Recognition results using different back-end solutions (GMM, DNN) on data generated with the image method and with real impulse responses validate the effectiveness of the approach.

Introduction

Automatic speech recognition (ASR) can be considered a reliable technology only in controlled environments where the quality of the acquired voice signal is not critically degraded by environmental factors such as background noise or reverberation. Close-talking microphones are often required to guarantee such quality, but in specific scenarios these microphones are inconvenient or intrusive. As a result, distant speech recognition is gaining major attention and a number of studies have addressed the need for effective speech acquisition and processing (Wölfel and McDonough, 2009). As a primary approach, microphone arrays allow the implementation of selective spatial acquisition or other speech enhancement techniques (Brandstein and Ward, 2001), such as beamforming (Benesty et al., 2008), dereverberation and denoising (Naylor and Gaubitch, 2010). However, compact microphone arrays have a restricted spatial view and are therefore not suitable for wide enclosures where the source may be in an unfavorable position (e.g., not frontal to the microphones, without line of sight). As an alternative, networks of distributed microphones could be employed, guaranteeing a uniform acoustic coverage of the monitored area, independently of the source position and orientation. The latter scenario and the related challenges have been investigated under the EU project DIRHA, which targets a voice-controlled system for home devices. In such configurations the adoption of array-processing techniques is often neither possible nor effective (Kumatani et al., 2011) and alternative approaches can be successfully applied, for example channel selection (Wolf and Nadeu, 2014) or source separation (Vincent et al., 2013, Delcroix et al., 2013).

Although several studies have investigated how ASR performance decreases as a function of the actual noise level (i.e., the SNR), few studies have addressed the relation between acoustic propagation and recognition rates (Nishiura et al., 2007, Fukumori et al., 2013). Indeed, in a distributed microphone setup, various spatial factors have an impact on the signal quality: the distance and orientation of each source-microphone pair, the consequent differences in SNR, and the acoustic propagation. Hence, it is of interest to correlate ASR performance with some purely acoustic measurements. In this direction, an estimation of the Signal-to-Noise Ratio (SNR) in multiple sub-bands was introduced in (Takeda et al., 2000). In (Hermansky et al., 2013), the authors proposed a temporal-domain method for predicting recognition performance in unseen noisy environments. This estimate can be usefully exploited during setup to increase the robustness of the resulting system, for example by selecting or training a suitable acoustic model. The authors in (Petrick et al., 2007) studied the harmful parts of room impulse responses, discussing the contribution of early and late reflections to ASR performance, while in (Sehr and Kellermann, 2010) the inter-frame correlation of reverberant feature vectors is analyzed. In a related work, the adjustments of dereverberation algorithms to ASR systems are evaluated (Sehr et al., 2010). More recently the reverberation problem has been addressed from different perspectives (Krueger and Haeb-Umbach, 2010, Sadjadi et al., 2012, Gomez et al., 2013, Wang et al., 2012), showing the need for effective solutions to cope with the related masking effects.

This work presents an experimental analysis of distant-talking speech recognition in reverberant conditions, correlating the recognition performance with the acoustic characteristics of the propagation channel, namely the structure of the Room Impulse Response (RIR). It extends our previous work (Brutti and Matassoni, 2014), providing a more exhaustive experimental analysis and considering a large number of acoustic conditions (including two databases of real measured RIRs). We show that, given a certain recognition task (e.g., WSJ), the recognition accuracy varies according to the actual reverberant condition and is directly related to the Early-to-Late Reverberation Ratio (ELR) of the corresponding RIR, similarly to the SNR in the case of background noise. In particular, we compare three different back-end solutions, based on Gaussian Mixture Models (GMM), Feature space Maximum Likelihood Linear Regression (fMLLR) and Deep Neural Networks (DNN), to further validate our hypothesis, independently of the recognition approach.
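To make the ELR concrete, the following is a minimal sketch of how an early-to-late energy ratio could be computed from a sampled RIR. It is not the formulation used in the paper, whose exact definition is not reproduced here. The function name `early_to_late_ratio`, the choice of the strongest tap as the time origin, and the expression of the ratio in dB are all assumptions made for illustration.

```python
import numpy as np

def early_to_late_ratio(rir, fs, t_ms):
    """ELR in dB: energy up to t_ms after the direct path over the energy after it."""
    h = np.asarray(rir, dtype=float)
    # Assumption: the strongest tap marks the direct-path arrival.
    onset = int(np.argmax(np.abs(h)))
    # Last sample that still counts as an "early" arrival.
    split = onset + int(round(t_ms * 1e-3 * fs))
    early = np.sum(h[onset:split + 1] ** 2)
    late = np.sum(h[split + 1:] ** 2)
    if late == 0.0:
        # No energy after the split point: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(early / late)

# Toy RIR (hypothetical): a direct path plus one weaker late reflection.
toy_rir = np.zeros(100)
toy_rir[10] = 1.0
toy_rir[60] = 0.1
toy_elr = early_to_late_ratio(toy_rir, fs=1000, t_ms=20)
```

When the line of sight is missing, the strongest tap may not be the direct path, so the onset rule above would need to be replaced by something more robust.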

In essence, given a variety of acoustic conditions typical of a domestic scenario (different room layouts, source and microphone positions, orientations and directivity patterns), ASR performance in reverberant conditions can be predicted using a few parameters that characterize the acoustic environment. As a result, a suitable data contamination strategy (Matassoni et al., 2002) for acoustic modeling can be derived, associating reverberant conditions with similar ELR to the same acoustic model, independently of the actual room characteristics. Indeed, given a set of pre-trained acoustic models associated with a variety of reverberant conditions, it is possible to select the best model as the one whose ELR most closely matches the ELR of the test channel. A further application is in multi-condition training: the ELR metric can be used to efficiently define a limited set of acoustic conditions that are sufficient to derive a robust multi-condition model.
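The model selection rule described above can be sketched as a nearest-neighbour lookup on the ELR value. The function `select_model`, the model names, and the ELR values below are hypothetical and serve only to illustrate the idea. The sketch also assumes that each pre-trained model is summarized by a single ELR figure in dB.

```python
def select_model(test_elr_db, model_elrs_db):
    """Return the name of the model whose training ELR is closest to the test ELR."""
    if not model_elrs_db:
        raise ValueError("no models available")
    return min(model_elrs_db, key=lambda name: abs(model_elrs_db[name] - test_elr_db))

# Hypothetical model names and training ELR values (dB), for illustration only.
models = {"elr_low": 0.0, "elr_mid": 8.0, "elr_high": 16.0}
chosen = select_model(9.5, models)
```

In practice the ELR of the test channel would itself have to be estimated, which the text below notes is outside the scope of this analysis.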

Note that the estimation of the ELR, or similar metrics, directly from the recorded audio signals is beyond the scope of this experimental analysis. Recently, solutions employing GMM (Matassoni and Brutti, 2014) or DNN (Parada et al., 2014) classifiers have been presented. Several methods addressing the estimation of similar metrics are available in the literature (Naylor et al., 2010, Jeub et al., 2011, Falk and Chan, 2010, Georganti et al., 2014). Finally, Blind System Identification (BSI) algorithms (Kowalczyk et al., 2013, Huang and Benesty, 2003) could be employed to obtain an estimation of the propagation channel from which the ELR metric can be obtained.

The paper is organized as follows. Section 2 introduces the problem of ASR in reverberant conditions while Section 3 presents the proposed characterization of the impulse responses based on a low-rank representation. The experimental framework is introduced in Section 5, describing the data and the recognition tasks. Results on simulated data are discussed in Section 6 while results obtained on real measured RIRs are reported in Sections 7 and 8. Finally, Section 9 draws some conclusions and introduces possible directions for future investigations.

Metrics similar to the ELR, such as the definition and the clarity, were already introduced in (Kuttruff, 1991) without any reference to ASR performance or acoustic modeling in general. In (Nishiura et al., 2007, Fukumori et al., 2013), the correlation between these two metrics and the ASR performance is investigated with interesting results. Recently, the definition was adopted in (Sehr et al., 2010) in relation to the performance of dereverberation algorithms. Similarly, in (Couvreur and Couvreur, 2000, Wolf and Nadeu, 2014), equivalent metrics are adopted to select the most promising microphones for ASR. However, their correlation with the recognition performance is not investigated. Recently, some works have targeted model selection based on the similarity between the definition of the test and training channels (Parada et al., 2014, Xiong et al., 2014). In this paper, we propose a novel representation that is more effective than other descriptors commonly used in the literature.

Section snippets

Room acoustic and ASR performance

In enclosures, acoustic waves propagate from a source to the acquisition device through multiple paths due to the presence of reflecting surfaces (e.g., walls, furniture). As a consequence, multiple distorted replicas of the emitted signal reach the microphone, resulting in the so-called reverberation and deteriorating the quality of the received signal.

The effects of the enclosure acoustics are usually described through the convolution between the RIR h and the clean speech signal s(t)
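The snippet above is cut off before its equation. As a sketch of the convolution it refers to, assuming a linear time-invariant channel in discrete time and leaving out any additive noise term, the reverberant signal can be written as follows. The symbol y(t) is introduced here for illustration and may differ from the paper's notation.

```latex
y(t) = (h * s)(t) = \sum_{\tau} h(\tau)\, s(t - \tau)
```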

Characterization of room impulse responses

In the past, ASR performance in reverberant environments has been mainly associated with the reverberation time T60 or with the distance between the source and the microphones (Kingsbury and Morgan, 1997, Seltzer, 2003). However, these metrics make sense only if all the other factors contributing to the RIR definition are kept fixed: source directivity and orientation, room dimensions and wall absorption coefficients, to mention a few. Recently, the Direct-to-Reverberant Ratio (DRR) has become a

Classification of RIRs

Our working hypothesis is that λT, or one of the two modified versions introduced here, characterizes the impact of the RIR on ASR: we can then use this parameter to cluster different propagation channels (i.e., diverse RIRs) which are similar in terms of ASR performance. Let us assume that C channels are available. Denoting with λT(c) the ELR of a specific channel c (c = 1, …, C), we indicate with W(c) the related ASR Word Accuracy (WA). Basically, our claim is that W(c) can be expressed as a
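One simple way to examine the clustering idea above is to group channels by their ELR and average the word accuracy within each group. The sketch below assumes equal-width ELR bins, which may differ from the clustering actually used in the paper. The function name `accuracy_per_cluster` is hypothetical, and the numbers in the usage lines are toy values, not experimental results.

```python
import numpy as np

def accuracy_per_cluster(elr_db, word_acc, n_clusters=10):
    """Group channels into equal-width ELR bins and average WA within each bin."""
    elr = np.asarray(elr_db, dtype=float)
    wa = np.asarray(word_acc, dtype=float)
    edges = np.linspace(elr.min(), elr.max(), n_clusters + 1)
    # Use interior edges only, so that the maximum ELR falls in the last bin.
    labels = np.digitize(elr, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(n_clusters, np.nan)  # NaN marks an empty cluster
    for k in range(n_clusters):
        members = labels == k
        if members.any():
            means[k] = wa[members].mean()
    return centers, means

# Toy values (not experimental data): four channels, two clusters.
centers, means = accuracy_per_cluster([0.0, 1.0, 9.0, 10.0], [10.0, 20.0, 80.0, 90.0], n_clusters=2)
```

Plotting `means` against `centers` gives a curve of average accuracy as a function of ELR, one point per cluster.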

Experimental setup

Starting from the clean signals, a large variety of reverberant speech was created by means of convolution with both synthetic and real RIRs.
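The contamination step described above can be sketched as a plain convolution of a clean signal with an RIR. The function name `reverberate` is hypothetical, and trimming the output to the length of the clean signal is an assumption made here for convenience, not something stated in the text.

```python
import numpy as np

def reverberate(clean, rir):
    """Convolve clean speech with an RIR and trim to the original length."""
    s = np.asarray(clean, dtype=float)
    h = np.asarray(rir, dtype=float)
    # Full convolution has len(s) + len(h) - 1 samples; keep the first len(s).
    return np.convolve(s, h)[: len(s)]

# Toy example: a unit impulse through a two-tap channel.
toy_out = reverberate([1.0, 0.0, 0.0, 0.0], [1.0, 0.5])
```

For long signals and RIRs, an FFT-based convolution would be faster, but the result is the same.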

Results on synthetic RIRs

The first analysis aims at verifying if the proposed parameter λT(c) is truly correlated with W(c), estimating the best value of T. Along this direction, Fig. 3 plots W(c), on the WSJ recognition task, for the 1000 randomly generated RIRs as a function of λT(c). Three values of T are considered: 50, 100, 250 ms. Acoustic models are trained on clean material. The reverberation time T60 is also considered for comparison. Given that in some experimental conditions the line-of-sight may be missing
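The comparison across values of T can be illustrated by scoring each candidate duration on how strongly its ELR values track word accuracy. The sketch below uses the absolute Pearson correlation as the score, which is an assumption: the criterion actually used in the paper to pick T is not given in this excerpt. The function name `best_early_duration` and the toy inputs are hypothetical.

```python
import numpy as np

def best_early_duration(elr_by_t, word_acc):
    """Return the T (ms) whose ELR values correlate most strongly with WA."""
    wa = np.asarray(word_acc, dtype=float)
    # Absolute Pearson correlation between ELR and WA for each candidate T.
    scores = {
        t: abs(np.corrcoef(np.asarray(elr, dtype=float), wa)[0, 1])
        for t, elr in elr_by_t.items()
    }
    return max(scores, key=scores.get), scores

# Toy values (not experimental data): ELR per channel for two candidate durations.
toy_elr_by_t = {50: [1.0, 3.0, 2.0, 4.0], 100: [1.0, 2.0, 3.0, 4.0]}
best_t, scores = best_early_duration(toy_elr_by_t, [10.0, 20.0, 30.0, 40.0])
```

A rank correlation or a fitted regression error could be used in place of the Pearson score if the relation between ELR and accuracy is not linear.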

Results on AIR RIRs

To complete our analysis, we report some further experimental evidence on the real RIRs of the AIR database which offers a wide range of reverberant conditions (168 RIRs with T60 ranging from 0.1 s to about 4 s) (Jeub et al., 2009). From now on, we report results on GMM-based acoustic models since we have already observed that similar trends are observable when using alternative state-of-the-art solutions. Interestingly, for a given recognition task, λ110 can be successfully used also for real

Results on DIRHA RIRs

Further evidence is offered by results on the real DIRHA RIRs (Cristoforetti et al., 2014). Fig. 17 reports W(λ110) considering 10 clusters, when employing acoustic models trained on the simulated data for a few values of λ110 (panel a) and acoustic models trained on 4 measured RIRs, 2 in the kitchen and 2 in the living room, as summarized in Table 5 (panel b). Fig. 17a shows that, although completely different, the simulated RIRs enhance the recognition performance over the clean models if

Discussion and conclusions

In this work a study of distant-talking speech recognition in reverberant conditions is presented. Exploiting a large variety of both simulated and real measured impulse responses, we showed that ASR performance correlates strongly with features derived from the Room Impulse Responses. Although the distribution of ASR errors is also influenced by the dictionary and the language model, the analysis confirms that it is possible to approximately predict recognition accuracy by using a compact representation of

Acknowledgments

The research leading to these results has partially received funding from the European Union’s 7th Framework Programme (FP7/2007–2013) under Grant agreement no. 288121 – DIRHA.

References (59)

  • M. Delcroix et al.

    Speech recognition in living rooms: integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

    Comput. Speech Lang.

    (2013)
  • M. Wolf et al.

    Channel selection measures for multi-microphone speech recognition

    Speech Commun.

    (2014)
  • Abel, J.S., Huang, P., 2006. A simple, robust measure of reverberation echo density. In: Audio Engineering Society...
  • J. Allen et al.

    Image method for efficiently simulating small-room acoustics

    J. Acoust. Soc. Am.

    (1979)
  • I. Arweiler et al.

    The influence of spectral characteristics of early reflections on speech intelligibility

    J. Acoust. Soc. Am.

    (2011)
  • Au Yeung, S.-K., Siu, M.-H., 2004. Improved performance of Aurora 4 using HTK and unsupervised MLLR adaptation. In:...
  • Bahl, L., Balakrishnan-Aiyer, S., Bellgarda, J., Franz, M., Gopalakrishnan, P., Nahamoo, D., Novak, M., Padmanabhan,...
  • Barker, J., Marxer, R., Vincent, E., Watanabe, S., 2015. The third ‘CHiME’ speech separation and recognition challenge:...
  • J. Benesty et al.

    Microphone Array Signal Processing

    (2008)
  • M. Brandstein et al.

    Microphone Arrays: Signal Processing Techniques and Applications

    (2001)
  • Brutti, A., Matassoni, M., 2014. On the use of early-to-late reverberation ratio for ASR in reverberant environments....
  • A. Brutti et al.

    An environment aware ML estimation of acoustic radiation pattern with distributed microphone pairs

    Signal Process

    (2013)
  • Couvreur, L., Couvreur, C., 2000. On the use of artificial reverberation for ASR in highly reverberant environments....
  • Cristoforetti, L., Ravanelli, M., Omologo, M., Sosi, A., Abad, A., Hagmueller, M., Maragos, P., 2014. The DIRHA...
  • T. Falk et al.

    Temporal dynamics for blind measurement of room acoustical parameters

    IEEE Trans. Instrum. Meas.

    (2010)
  • Fukumori, T., Nakayama, M., Nishiura, T., Yamashita, Y., October 2013. Estimation of speech recognition performance in...
  • Georganti, E., Mourjopoulos, J., van de Par, S., 2014. Room statistics and direct-to-reverberant ratio estimation from...
  • Gomez, R., Nakamura, K., Nakadai, K., 2013. Robustness to speaker position in distant-talking automatic speech...
  • Halmrast, T., 2001. Sound coloration from (very) early reflections. In: Proc. Meeting Acoust. Soc....
  • Hermansky, H., Variani, E., Peddinti, V., 2013. Mean temporal distance: predicting ASR error from temporal properties...
  • T. Houtgast et al.

    A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria

    J. Acoust. Soc. Am.

    (1985)
  • Y. Huang et al.

    A class of frequency-domain adaptive approaches to blind multichannel identification

    IEEE Trans. Signal Process.

    (2003)
  • Y. Huang et al.

    Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks

  • Jeub, M., Nelke, C., Beaugeant, C., Vary, P., 2011. Blind estimation of the coherent-to-diffuse energy ratio from noisy...
  • Jeub, M., Schafer, M., Vary, P., 2009. A binaural room impulse response database for the evaluation of dereverberation...
  • Kingsbury, B., Morgan, N., 1997. Recognizing reverberant speech with RASTA-PLP. In:...
  • Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Sehr, A., Kellermann, W., Maas, R., October 2013. The reverb...
  • K. Kowalczyk et al.

    Blind system identification using sparse learning for TDOA estimation of room reflections

    IEEE Signal Process. Lett.

    (2013)
  • A. Krueger et al.

    Model-based feature enhancement for reverberant speech recognition

    IEEE Trans. Audio Speech Lang. Process.

    (2010)