On the relationship between Early-to-Late Ratio of Room Impulse Responses and ASR performance in reverberant environments

doi:10.1016/j.specom.2015.09.004

Speech Communication

Volume 76, February 2016, Pages 170-185

https://doi.org/10.1016/j.specom.2015.09.004

Highlights

• Study on the correlation between ASR accuracy and reverberant acoustic conditions.
• Experiments involve a large variety of simulated and measured room impulse responses.
• The best duration of early arrivals is determined experimentally.
• The approach is applied to data contamination for acoustic model training and model selection.
• A large vocabulary recognition task (WSJ) is considered using both GMM and DNN.

Abstract

This work presents an experimental analysis of distant-talking speech recognition in a variety of reverberant conditions, correlating ASR performance to a compact representation of the propagation channel (i.e., the room impulse response).

It is well known that reverberation and background noise degrade speech recognition performance, but few studies have investigated the relation between room impulse responses and recognition rates in a comprehensive manner. In particular, we show how ASR accuracy is related to features derived from the structure of the early arrivals and the reverberation tail. A representation based on the combination of a few parameters is hence proposed, and the impact of reverberation on different speech recognition tasks is analysed. Possible applications of the derived measure are in data contamination for acoustic modeling, where this feature can be employed either to select the most suitable model for a given acoustic condition or to define the subset of room impulse responses to be used for the creation of partially matched reverberant models. Recognition results using different back-end solutions (GMM, DNN) on data generated with the image method and with real impulse responses validate the effectiveness of the approach.

Introduction

Automatic speech recognition (ASR) can be considered a reliable technology only in controlled environments where the quality of the acquired voice signal is not critically degraded by environmental factors such as background noise or reverberation. Close-talking microphones are often required to guarantee such quality, but in specific scenarios these microphones are inconvenient or intrusive. As a result, distant speech recognition is gaining major attention and a number of studies have addressed the need for effective speech acquisition and processing (Wölfel and McDonough, 2009). As a primary approach, microphone arrays allow the implementation of selective spatial acquisition or other speech enhancement techniques (Brandstein and Ward, 2001), such as beamforming (Benesty et al., 2008), dereverberation and denoising (Naylor and Gaubitch, 2010). However, compact microphone arrays have a restricted spatial view and are therefore not suitable for wide enclosures where the source may be in an unfavorable position (e.g., not facing the microphones, or without line of sight). As an alternative, networks of distributed microphones could be employed, guaranteeing a uniform acoustic coverage of the monitored area, independently of the source position and orientation. The latter scenario and the related challenges have been investigated under the EU project DIRHA, which targets a voice-controlled system for home devices. In such configurations the adoption of array-processing techniques is often neither possible nor effective (Kumatani et al., 2011) and alternative approaches can be successfully applied, for example channel selection (Wolf and Nadeu, 2014) or source separation (Vincent et al., 2013, Delcroix et al., 2013).

Although several studies have investigated how ASR performance decreases as a function of the noise level (i.e., the SNR), few have addressed the relation between acoustic propagation and recognition rates (Nishiura et al., 2007, Fukumori et al., 2013). Indeed, in a distributed microphone setup, various spatial factors have an impact on the signal quality: the distance and orientation of each source-microphone pair, the resulting differences in SNR, and the acoustic propagation. Hence, it is of interest to correlate ASR performance with some purely acoustic measurements. In this direction, an estimation of the Signal-to-Noise Ratio (SNR) in multiple sub-bands was introduced in (Takeda et al., 2000). In (Hermansky et al., 2013), the authors proposed a temporal-domain method for predicting recognition performance in unseen noisy environments. This estimate can be usefully exploited during setup to increase the robustness of the resulting system, for example by selecting or training a suitable acoustic model. The authors of (Petrick et al., 2007) studied the harmful parts of room impulse responses, discussing the contribution of early and late reflections to ASR performance, while in (Sehr and Kellermann, 2010) the inter-frame correlation of reverberant feature vectors is analyzed. In a related work, the adjustment of dereverberation algorithms to ASR systems is evaluated (Sehr et al., 2010). More recently the reverberation problem has been addressed from different perspectives (Krueger and Haeb-Umbach, 2010, Sadjadi et al., 2012, Gomez et al., 2013, Wang et al., 2012), showing the need for effective solutions to cope with the related masking effects.

This work presents an experimental analysis of distant-talking speech recognition in reverberant conditions, correlating the recognition performance to the acoustic characteristics of the propagation channel, namely the structure of the Room Impulse Response (RIR). It extends our previous work (Brutti and Matassoni, 2014), providing a more exhaustive experimental analysis and considering a large number of acoustic conditions (including two databases of real measured RIRs). We show that, given a certain recognition task (e.g., WSJ), the recognition accuracy varies according to the actual reverberant condition and is directly related to the Early-to-Late Reverberation Ratio (ELR) of the corresponding RIR, similarly to the SNR in case of background noise. In particular, we compare three different back-end solutions, based on Gaussian Mixture Model (GMM), Feature space Maximum Likelihood Linear Regression (fMLLR) and Deep Neural Networks (DNN), to further validate our hypothesis, independently of the recognition approach.
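As an illustration of the quantity discussed above, the following is a minimal sketch of how an ELR value could be computed from a sampled RIR. It assumes the common definition of the ELR as the ratio, in dB, between the energy up to T ms after the direct path and the energy of the remaining tail; the exact definition used in the paper may differ. The function name `early_to_late_ratio`, the parameter `T_ms`, and the choice of locating the direct path at the largest-magnitude sample are all assumptions made for this example (the latter may fail when the line of sight is missing).

```python
import numpy as np

def early_to_late_ratio(h, fs, T_ms=50.0):
    """Return an ELR estimate in dB for the RIR `h` sampled at `fs` Hz.

    Hypothetical helper: the early part is everything up to T_ms after
    the direct path, the late part is the rest of the response.
    """
    h = np.asarray(h, dtype=float)
    energy = h ** 2
    # Assumption: the direct path is the largest-magnitude sample.
    direct = int(np.argmax(np.abs(h)))
    split = direct + int(round(T_ms * 1e-3 * fs))
    early = energy[:split].sum()
    late = energy[split:].sum()
    if late == 0.0:
        # No energy after the split point: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(early / late)

# Toy RIR: exponentially decaying noise tail (illustrative data only).
fs = 16000
rng = np.random.default_rng(0)
t = np.arange(int(0.5 * fs)) / fs
toy_rir = rng.standard_normal(t.size) * np.exp(-t / 0.1)
toy_rir[0] = 5.0  # strong direct path at the first sample
for T in (50.0, 100.0, 250.0):
    print(T, early_to_late_ratio(toy_rir, fs, T_ms=T))
```

Longer values of T move more of the tail into the early part, so the ratio grows with T for a decaying response.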

In essence, given a variety of acoustic conditions typical of a domestic scenario (different room layouts, source and microphone positions, orientations and directivity patterns), ASR performance in reverberant conditions can be predicted using a few parameters that characterize the acoustic environment. As a result, a suitable data contamination strategy (Matassoni et al., 2002) for acoustic modeling can be derived, associating reverberant conditions with similar ELR to the same acoustic model, independently of the actual room characteristics. Indeed, given a set of pre-trained acoustic models associated with a variety of reverberant conditions, it is possible to select the best model as the one whose ELR most closely matches the ELR of the test channel. A further application is in multi-condition training: the ELR metric can be used to efficiently define a limited set of acoustic conditions that are sufficient to derive a robust multi-condition model.
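The model selection idea described above can be sketched as a nearest-neighbour lookup on ELR values. This is only an illustration under stated assumptions: the function `select_model`, the model names, and the ELR values in dB are hypothetical and are not taken from the paper.

```python
def select_model(test_elr_db, model_elrs_db):
    """Return the name of the model whose training ELR is closest to the test ELR.

    `model_elrs_db` maps a (hypothetical) model name to the ELR, in dB,
    of the reverberant condition the model was trained on.
    """
    if not model_elrs_db:
        raise ValueError("no acoustic models available")
    return min(model_elrs_db, key=lambda name: abs(model_elrs_db[name] - test_elr_db))

# Illustrative pool of pre-trained models (names and values are made up).
models = {"am_elr_0dB": 0.0, "am_elr_5dB": 5.0, "am_elr_10dB": 10.0}
print(select_model(6.2, models))
```

The lookup depends only on the ELR of the test channel, not on the room geometry, which is the property the text relies on when it associates conditions with similar ELR to the same model.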

Note that the estimation of the ELR, or similar metrics, directly from the recorded audio signals is beyond the scope of this experimental analysis. Recently, solutions employing GMM (Matassoni and Brutti, 2014) or DNN (Parada et al., 2014) classifiers have been presented. Several methods addressing the estimation of similar metrics are available in the literature (Naylor et al., 2010, Jeub et al., 2011, Falk and Chan, 2010, Georganti et al., 2014). Finally, Blind System Identification (BSI) algorithms (Kowalczyk et al., 2013, Huang and Benesty, 2003) could be employed to obtain an estimation of the propagation channel from which the ELR metric can be obtained.

The paper is organized as follows. Section 2 introduces the problem of ASR in reverberant conditions while Section 3 presents the proposed characterization of the impulse responses based on a low-rank representation. The experimental framework is introduced in Section 5, describing the data and the recognition tasks. Results on simulated data are discussed in Section 6, while results obtained on real measured RIRs are reported in Section 7 (AIR RIRs) and Section 8 (DIRHA RIRs). Finally, Section 9 draws some conclusions and introduces possible directions for future investigations.

Metrics similar to the ELR, such as the definition and the clarity, were already introduced in (Kuttruff, 1991), without any reference to ASR performance or acoustic modeling in general. In (Nishiura et al., 2007, Fukumori et al., 2013), the correlation between these two metrics and ASR performance is investigated with interesting results. Recently, the definition was adopted in (Sehr et al., 2010) in relation to the performance of dereverberation algorithms. Similarly, in (Couvreur and Couvreur, 2000, Wolf and Nadeu, 2014), equivalent metrics are adopted to select the most promising microphones for ASR. However, their correlation with the recognition performance is not investigated. Recently, some works have targeted model selection based on the similarity between the definition of the test and training channels (Parada et al., 2014, Xiong et al., 2014). In this paper, we propose a novel representation that is more effective than other descriptors commonly used in the literature.

Section snippets

Room acoustic and ASR performance

In enclosures, acoustic waves propagate from a source to the acquisition device through multiple paths due to the presence of reflecting surfaces (e.g., walls, furniture). As a consequence, multiple distorted replicas of the emitted signal reach the microphone, resulting in the so-called reverberation and deteriorating the quality of the received signal.

The effects of the enclosure acoustics are usually described through the convolution $*$ between the RIR $h$ and the clean speech signal $s(t)$

Characterization of room impulse responses

In the past, ASR performance in reverberant environments has been mainly associated with the reverberation time $T_{60}$ or to the distance between the source and the microphones (Kingsbury and Morgan, 1997, Seltzer, 2003). However, these metrics make sense only if all the other factors contributing to the RIR definition are kept fixed: source directivity and orientation, room dimensions and wall absorption coefficients, to mention a few. Recently, the Direct-to-Reverberant Ratio (DRR) has become a

Classification of RIRs

Our working hypothesis is that $λ_{T}$, or one of the two modified versions introduced here, characterizes the impact of the RIR on ASR: we can then use this parameter to cluster different propagation channels (i.e., diverse RIRs) which are similar in terms of ASR performance. Let us assume that $C$ channels are available. Denoting with $λ_{T}(c)$ the ELR of a specific channel $c$ $(c = 1, \dots, C)$, we indicate with $W(c)$ the related ASR Word Accuracy (WA). Basically, our claim is that $W(c)$ can be expressed as a

Experimental setup

Starting from the clean signals, a large variety of reverberant speech was created by means of convolution with both synthetic and real RIRs.
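The contamination step described above can be sketched as follows. This is a minimal example, not the authors' pipeline: the function name `reverberate` and the choice of truncating the output to the length of the clean signal are assumptions, and the signals below are synthetic placeholders rather than speech or measured RIRs.

```python
import numpy as np

def reverberate(clean, rir):
    """Convolve a clean signal with a RIR and return the reverberant signal.

    Assumption: the output is truncated to the clean-signal length so that
    the reverberant utterance stays aligned with the original one.
    """
    clean = np.asarray(clean, dtype=float)
    rir = np.asarray(rir, dtype=float)
    wet = np.convolve(clean, rir, mode="full")
    return wet[: len(clean)]

# Placeholder data: white noise instead of speech, a decaying tail as RIR.
fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)  # 1 s of "speech"
t = np.arange(int(0.3 * fs)) / fs
rir = rng.standard_normal(t.size) * np.exp(-t / 0.05)
rir[0] = 1.0  # direct path
reverberant = reverberate(clean, rir)
print(reverberant.shape)
```

For long signals and RIRs, an FFT-based convolution would typically be faster than `np.convolve`; the direct form is used here only to keep the sketch dependency-free beyond NumPy.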

Results on synthetic RIRs

The first analysis aims at verifying whether the proposed parameter $λ_{T}(c)$ is truly correlated with $W(c)$, estimating the best value of $T$. Along this direction, Fig. 3 plots $W(c)$, on the WSJ recognition task, for the 1000 randomly generated RIRs as a function of $λ_{T}(c)$. Three values of $T$ are considered: 50, 100, 250 ms. Acoustic models are trained on clean material. The reverberation time $T_{60}$ is also considered for comparison. Given that in some experimental conditions the line-of-sight may be missing

Results on AIR RIRs

To complete our analysis, we report some further experimental evidence on the real RIRs of the AIR database which offers a wide range of reverberant conditions (168 RIRs with $T_{60}$ ranging from 0.1 s to about 4 s) (Jeub et al., 2009). From now on, we report results on GMM-based acoustic models since we have already observed that similar trends are observable when using alternative state-of-the-art solutions. Interestingly, for a given recognition task, $λ_{110}$ can be successfully used also for real

Results on DIRHA RIRs

Further evidence is offered by results on the real DIRHA RIRs (Cristoforetti et al., 2014). Fig. 17 reports $W(\overline{λ_{110}})$ considering 10 clusters, when employing acoustic models trained on the simulated data for a few values of $λ_{110}$ (panel a) and acoustic models trained on 4 measured RIRs, 2 in the kitchen and 2 in the living room, as summarized in Table 5 (panel b). Fig. 17a shows that, although completely different, the simulated RIRs enhance the recognition performance over the clean models if

Discussion and conclusions

In this work a study of distant-talking speech recognition in reverberant conditions is presented. Exploiting a large variety of both simulated and real measured impulse responses, we showed that ASR performance highly correlates with features derived from the Room Impulse Responses. Although the distribution of ASR errors is also influenced by the dictionary and the language model, the analysis confirms that it is possible to approximately predict recognition accuracy by using a compact representation of

Acknowledgments

The research leading to these results has partially received funding from the European Union’s 7th Framework Programme (FP7/2007–2013) under Grant agreement no. 288121 – DIRHA.

References (59)

Delcroix, M., et al., 2013. Speech recognition in living rooms: integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds. Comput. Speech Lang.

Wolf, M., et al., 2014. Channel selection measures for multi-microphone speech recognition. Speech Commun.

Abel, J.S., Huang, P., 2006. A simple, robust measure of reverberation echo density. In: Audio Engineering Society...

Allen, J., et al., 1979. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am.

Arweiler, I., et al., 2011. The influence of spectral characteristics of early reflections on speech intelligibility. J. Acoust. Soc. Am.

Au Yeung, S.-K., Siu, M.-H., 2004. Improved performance of Aurora 4 using HTK and unsupervised MLLR adaptation. In:...

Bahl, L., Balakrishnan-Aiyer, S., Bellgarda, J., Franz, M., Gopalakrishnan, P., Nahamoo, D., Novak, M., Padmanabhan,...

Barker, J., Marxer, R., Vincent, E., Watanabe, S., 2015. The third ‘CHiME’ speech separation and recognition challenge:...

Benesty, J., et al., 2008. Microphone Array Signal Processing.

Brandstein, M., et al., 2001. Microphone Arrays: Signal Processing Techniques and Applications.

Brutti, A., Matassoni, M., 2014. On the use of early-to-late reverberation ratio for ASR in reverberant environments....

Brutti, A., et al., 2013. An environment aware ML estimation of acoustic radiation pattern with distributed microphone pairs. Signal Process.

Couvreur, L., Couvreur, C., 2000. On the use of artificial reverberation for ASR in highly reverberant environments....

Cristoforetti, L., Ravanelli, M., Omologo, M., Sosi, A., Abad, A., Hagmueller, M., Maragos, P., 2014. The DIRHA...

Falk, T., et al., 2010. Temporal dynamics for blind measurement of room acoustical parameters. IEEE Trans. Instrum. Meas.

Fukumori, T., Nakayama, M., Nishiura, T., Yamashita, Y., October 2013. Estimation of speech recognition performance in...

Georganti, E., Mourjopoulos, J., van de Par, S., 2014. Room statistics and direct-to-reverberant ratio estimation from...

Gomez, R., Nakamura, K., Nakadai, K., 2013. Robustness to speaker position in distant-talking automatic speech...

Halmrast, T., 2001. Sound coloration from (very) early reflections. In: Proc. Meeting Acoust. Soc....

Hermansky, H., Variani, E., Peddinti, V., 2013. Mean temporal distance: predicting ASR error from temporal properties...

Houtgast, T., et al., 1985. A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria. J. Acoust. Soc. Am.

Huang, Y., et al., 2003. A class of frequency-domain adaptive approaches to blind multichannel identification. IEEE Trans. Signal Process.

Huang, Y., et al. Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks

Jeub, M., Nelke, C., Beaugeant, C., Vary, P., 2011. Blind estimation of the coherent-to-diffuse energy ratio from noisy...

Jeub, M., Schafer, M., Vary, P., 2009. A binaural room impulse response database for the evaluation of dereverberation...

Kingsbury, B., Morgan, N., 1997. Recognizing reverberant speech with RASTA-PLP. In:...

Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Sehr, A., Kellermann, W., Maas, R., October 2013. The reverb...

Kowalczyk, K., et al., 2013. Blind system identification using sparse learning for TDOA estimation of room reflections. IEEE Signal Process. Lett.

Krueger, A., et al., 2010. Model-based feature enhancement for reverberant speech recognition. IEEE Trans. Audio Speech Lang. Process.

