Room-localized spoken command recognition in multi-room, multi-microphone environments☆,☆☆
Introduction
Significant research effort has been devoted over the past decades to the design of Voice-enabled User Interfaces (VUIs) for natural, hands-free human-computer interaction. Such interfaces have typically been employed in interactive voice response systems at call centers and, more recently, in personal assistant applications on personal computers or smartphones (Schalkwyk et al., 2010). State-of-the-art developments in acoustic modeling for speech recognition (Hinton et al., 2012; Yu and Deng, 2015) have contributed substantially to making VUIs practically usable in a variety of everyday environments; however, untethered, far-field, always-listening operation that is robust to noise still constitutes a challenge limiting their universal applicability.
This challenge remains prominent in the very active research area of ambient assisted living inside smart homes, where, among others, VUIs are seen as crucial to the occupants’ safety and well-being (Edwards and Grinter, 2001; Chan et al., 2008; Vacher et al., 2015). Indeed, when the acoustic scene is captured by far-field microphones, as desired in an always-listening, untethered operation scenario, domestic environments typically exhibit inter-room interference, frequent overlaps of various acoustic events and background noise with target speech, and moderate-to-high reverberation. Similar conditions are present in other everyday indoor environments, for example multi-room offices. Not surprisingly, Distant Speech Recognition (DSR) performance under such conditions lags dramatically behind that achieved in close-talking, noise-free scenarios (Kumatani et al., 2012).
A promising course for improving DSR in indoor environments is to exploit information from multiple audio channels, when these are available from distributed microphone arrays (Brandstein and Ward, 2001) located inside the smart space and providing sufficient spatio-temporal sampling of the acoustic scene. Such a solution has been investigated, for example, in the recent EU-funded project DIRHA.1 The project focused on the design of a VUI for home automation, supporting distant speech interaction in different languages and targeting, in particular, people with kinetic disabilities. The basic use-case involved command-like voice control of automated home equipment, for example room lights, temperature settings, and door, window, and shutter operation. To enable hands-free operation, the VUI was designed to be always-listening, employing key-phrase based activation. Further, to achieve appropriate disambiguation of uttered commands, allow possible interaction with multiple users in different rooms, and provide localized feedback (VUI confirmation using room loudspeakers), room-level localization of the recognized commands was also performed. An example of the DIRHA challenging acoustic scene is depicted in Fig. 1.
In this paper, we describe in detail the design of a robust multichannel distant speech processing pipeline, developed for the purposes of the aforementioned DIRHA domestic interaction scenario in the Greek language. The adopted methodology is rather general, being readily applicable to support VUIs in other everyday indoor multi-room environments equipped with multiple microphone sensors, such as smart offices. The work addresses a wide range of challenging topics in the area of distant speech processing, where its contributions lie, namely:
- Always-listening operation, achieved by employing Speech Activity Detection (SAD), key-phrase detection, and DSR.
- Room-localized operation, based on a multi-room SAD component used to drive separate, parallel cascades of key-phrase detection and DSR for each room of the smart space.
- Multichannel speech processing beyond beamforming, such as channel selection and decision fusion of single-channel results, considered in all pipeline components.
- Robust acoustic modeling, based on far-field data simulation and per-channel adaptation with little training data available in the target environment.
- Pipeline component optimization, studying component inter-dependencies and optimizing their tunable parameters separately, sequentially, or jointly.
- System and pipeline component evaluation on both simulated and real corpora in two multi-room, multichannel smart environments.
In more detail, to support always-listening operation, we build on the widely used cascade of three speech processing stages, as overviewed in Fig. 2, namely: (a) SAD, to separate speech from non-speech events; (b) key-phrase detection, to identify a predefined system activation phrase; and (c) DSR, to recognize the issued command. Combinations of some of the above components can be found in a variety of VUIs, providing partial robustness against non-speech events and increased efficiency, by processing only the speech segments of the incoming signals.
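The control flow of this three-stage cascade can be illustrated with a minimal toy sketch; the segment representation, function names, and the activation phrase ("system wake") are hypothetical placeholders, not the system's actual interfaces:

```python
from typing import List, Tuple

# Toy segment: (transcript, energy). A real system would carry audio frames;
# simple tuples suffice to illustrate the cascade's gating logic.
Segment = Tuple[str, float]

KEY_PHRASE = "system wake"  # hypothetical activation phrase

def detect_speech(segments: List[Segment], threshold: float = 0.5) -> List[Segment]:
    """(a) SAD stand-in: keep only segments whose energy exceeds a threshold."""
    return [s for s in segments if s[1] > threshold]

def detect_key_phrase(segment: Segment) -> bool:
    """(b) Key-phrase detection stand-in: does the segment start with the phrase?"""
    return segment[0].startswith(KEY_PHRASE)

def recognize_command(segment: Segment) -> str:
    """(c) DSR stand-in: strip the activation phrase, return the command text."""
    return segment[0][len(KEY_PHRASE):].strip()

def process_stream(segments: List[Segment]) -> List[str]:
    """Cascade: SAD gates key-phrase detection, which gates recognition."""
    commands = []
    for seg in detect_speech(segments):
        if detect_key_phrase(seg):
            commands.append(recognize_command(seg))
    return commands

stream = [
    ("door slam", 0.9),                       # passes SAD, but no activation phrase
    ("background hum", 0.1),                  # rejected by SAD
    ("system wake turn on the lights", 0.8),  # activation phrase + command
]
print(process_stream(stream))  # ['turn on the lights']
```

Only segments that survive both gates reach the (expensive) recognizer, which is the efficiency argument made above.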
Further, to allow room-localized operation, we modify the aforementioned traditional cascade by designing a multi-room SAD component, instead of employing a generic, room-independent SAD. This component identifies speech segments together with their room of origin, robustly addressing the problem of inter-room interference. It drives separate cascades of key-phrase detection and DSR for each room of the smart space, operating in parallel. The process yields room-localized speech command recognition, as required by the VUI scenario considered in this paper.
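A toy sketch of this room-localized dispatch, assuming a hypothetical activation phrase and segments already transcribed as text (a real multi-room SAD would of course operate on the microphone signals, not transcripts):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

KEY_PHRASE = "system wake"  # hypothetical activation phrase

def multi_room_sad(events: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Toy multi-room SAD: group detected speech segments by room of origin.
    A real detector would infer the room from the distributed microphones."""
    by_room: Dict[str, List[str]] = defaultdict(list)
    for room, text in events:
        by_room[room].append(text)
    return by_room

def room_cascade(segments: List[str]) -> List[str]:
    """Per-room cascade: key-phrase detection followed by command recognition."""
    return [s[len(KEY_PHRASE):].strip() for s in segments if s.startswith(KEY_PHRASE)]

def recognize_localized(events: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """One independent cascade per room, yielding room-tagged commands."""
    return {room: room_cascade(segs) for room, segs in multi_room_sad(events).items()}

events = [
    ("kitchen", "system wake close the shutters"),
    ("livingroom", "it is a nice day"),        # speech, but no activation phrase
    ("livingroom", "system wake lights off"),
]
print(recognize_localized(events))
# {'kitchen': ['close the shutters'], 'livingroom': ['lights off']}
```

Because each command arrives tagged with its room, the VUI can act on the correct equipment and confirm through that room's loudspeaker.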
To fulfill the needs of the detection and recognition tasks involved in the system, we elaborate and combine multichannel speech processing methods that have been explored in our previous preliminary studies (Giannoulis et al., 2015; Katsamanis et al., 2014; Tsiami et al., 2014), achieving promising results and robustness in the challenging conditions considered. The implemented components make extensive use of channel selection and combination strategies to benefit from the available network of microphones inside the rooms. The advantage of these approaches is that they require no prior information regarding microphone network topology, other than mere room-microphone association. The proposed channel combination methods are based on decision fusion schemes, and they appear to outperform beamforming in most cases.
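A minimal sketch of one such decision fusion scheme, here confidence-weighted voting over per-channel recognition hypotheses; the hypotheses and scores below are invented for illustration, and the system's actual fusion rules may differ:

```python
from collections import defaultdict
from typing import List, Tuple

def fuse_decisions(hyps: List[Tuple[str, float]]) -> str:
    """Decision-level fusion over a room's channels: each channel contributes
    its hypothesis weighted by its confidence, and the highest-scoring
    hypothesis wins. Only room-microphone association is assumed; no array
    geometry is needed, unlike beamforming."""
    scores: defaultdict = defaultdict(float)
    for hypothesis, confidence in hyps:
        scores[hypothesis] += confidence
    return max(scores, key=scores.get)

# Three channels agree; one (perhaps facing a noise source) disagrees:
per_channel = [("lights on", 0.7), ("lights on", 0.6),
               ("lights off", 0.9), ("lights on", 0.5)]
print(fuse_decisions(per_channel))  # 'lights on' (total 1.8 vs 0.9)
```

The fusion is thus robust to a single badly placed or noise-corrupted channel, which is one intuition behind its competitive performance against beamforming.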
We gain additional benefits by employing robust modeling to reduce the mismatch between training and test conditions. In particular, we generate artificial training data simulating the test conditions, and, furthermore, we employ statistical model adaptation for each microphone channel, using a small amount of data from the target environment, if available.
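The data simulation idea can be sketched as convolving clean speech with a room impulse response and adding background noise at a target SNR; the surrogate signals and toy impulse response below are illustrative only, not the paper's actual simulation setup:

```python
import numpy as np

def contaminate(clean: np.ndarray, rir: np.ndarray, noise: np.ndarray,
                snr_db: float) -> np.ndarray:
    """Simulate a far-field channel: convolve clean speech with a room impulse
    response, then add noise scaled to the requested signal-to-noise ratio."""
    reverberant = np.convolve(clean, rir)[: len(clean)]
    noise = noise[: len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that speech_power / (scale**2 * noise_power) hits the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # 1 s of surrogate "clean speech" at 16 kHz
rir = np.exp(-np.arange(4000) / 800.0)   # toy exponentially decaying impulse response
noise = rng.standard_normal(16000)
far_field = contaminate(clean, rir, noise, snr_db=10.0)
```

Training on such contaminated data exposes the acoustic models to reverberation and noise statistics resembling those of the deployment environment.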
Further, we consider optimization of a number of tunable system component parameters, while taking into consideration their inter-dependencies. Specifically, we observe that their joint optimization, rather than separate or sequential optimization, leads to improved command recognition accuracy.
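The benefit of joint over sequential tuning can be demonstrated on a toy objective with interacting parameters; the objective, grid, and parameter names below are invented for illustration and are not the pipeline's actual tuning problem:

```python
from itertools import product

def accuracy(sad_thr: float, kws_thr: float) -> float:
    """Toy interacting objective: the best SAD threshold depends on the
    key-phrase (KWS) threshold, mimicking component inter-dependency."""
    return 1.0 - (sad_thr - kws_thr) ** 2 - (kws_thr - 0.6) ** 2

GRID = [0.2, 0.4, 0.6, 0.8]

def sequential_opt():
    """Tune SAD first (with the KWS threshold fixed), then KWS: this can get
    stuck away from the joint optimum when parameters interact."""
    best_sad = max(GRID, key=lambda s: accuracy(s, 0.2))   # KWS fixed at 0.2
    best_kws = max(GRID, key=lambda k: accuracy(best_sad, k))
    return best_sad, best_kws

def joint_opt():
    """Exhaustive joint grid search over both parameters."""
    return max(product(GRID, GRID), key=lambda p: accuracy(*p))

print(sequential_opt(), accuracy(*sequential_opt()))  # (0.2, 0.4) 0.92
print(joint_opt(), accuracy(*joint_opt()))            # (0.6, 0.6) 1.0
```

On this toy objective the sequential procedure settles for a suboptimal pair, while the joint search reaches the optimum, mirroring the observation made above for the pipeline parameters.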
Finally, we conduct extensive experimentation on both simulated and real datasets, where the adopted system architecture is evaluated systematically. For this purpose, we employ three separate databases: (a) DIRHA-sim, a corpus of simulated long audio recordings inside a real multi-room apartment (Cristoforetti et al., 2014); (b) ATHENA-real, a set of real recordings in a two-room office environment (Tsiami et al., 2014b); and (c) DIRHA-real, a corpus of real recordings captured inside the multi-room apartment also used for the first set. The first two consist of both development and test subsets, allowing for model adaptation and system optimization, while the third one is employed for testing the proposed pipeline on real data, unseen during its training. Reported results vary due to different characteristics and challenges of each dataset, reaching 76.6% in command recognition accuracy on the ATHENA-real corpus.
The rest of the paper is organized as follows: Section 2 overviews related work in the literature. Section 3 presents the proposed system, describes its components in detail, and reviews the adopted robust modeling and multichannel processing methods. Section 4 describes the databases used for the development and evaluation of the system pipeline. Section 5 introduces the adopted experimental framework and presents results of both the isolated components and the integrated system. Details on system optimization, final pipeline evaluation, and an error analysis are also included. Finally, Section 6 concludes the paper with a brief discussion.
Related work
Several projects and challenges have been launched over the last decade targeting intelligent interfaces for indoors smart environments and addressing DSR via multiple distributed microphones. Initially, the community focused on single-room setups for the analysis of lectures and meetings. Research projects like CHIL (Chu et al., 2006) and AMI (Hain et al., 2008) produced a wide range of results under the framework of the NIST Rich Transcription evaluation campaigns (Fiscus et al., 2008).
Proposed multichannel, always-listening, distant speech recognition pipeline
As already outlined, the proposed speech processing pipeline aims at recognizing spoken commands for home and office automation. The user is potentially able to address the system from any position in the multi-room space. This is achieved by designing it to operate in parallelized room-dependent speech processing cascades, consisting of (a) microphone selection, (b) command detection, and (c) command recognition, all driven by multi-room SAD that provides candidate speech segments for each
Simulated and real corpora for indoor automation
Three challenging multichannel datasets are employed for the development and evaluation of the proposed system: (a) the DIRHA simulated corpus (DIRHA-sim), (b) the DIRHA real corpus (DIRHA-real), and (c) the ATHENA real database (ATHENA-real). All sets have been acquired in multi-room smart environments and include one-minute long recordings of a variety of commands and activation phrases in Greek, as well as non-speech events and background noises, rendering the recordings very realistic for
Experimental framework and system evaluation
The design of the experimental framework for the development and evaluation of the presented system pipeline is complex due to the inter-dependency of the connected modules. To account for the behavior of each component individually and relative to the others, we group experimental tasks into three categories, discussed in detail in the following subsections:
- 1. Individual: every module of the pipeline is tested separately in terms of standard evaluation metrics such as precision, recall,
Conclusions, discussion and future work
In this work, we detail the design, optimization, and systematic evaluation of a speech processing and recognition pipeline for an always-listening voice-enabled user interface in Greek. The pipeline aims at robust far-field spoken command recognition in challenging multi-room smart environments such as homes and offices equipped with sparsely distributed microphone arrays. The proposed system architecture is based on the synergy between multichannel speech activity detection, key-phrase detection,
References (57)
- A review of smart homes - present state and future challenges. Comput. Methods Progr. Biomed., 2008.
- A generalized estimation approach for linear and nonlinear microphone array post-filters. Speech Commun., 2007.
- Hidden Markov model training with contaminated speech material for distant-talking speech recognition. Comput. Speech Lang., 2002.
- An integrated system for voice command recognition and emergency detection based on audio signals. J. Expert Syst. Appl., 2015.
- Multi-source far-distance microphone selection and combination for automatic transcription of lectures. Proceedings of the International Conference on Speech Communication and Technology (Interspeech), 2006.
- A French corpus for distant-microphone speech processing in real homes. Proceedings of the International Conference on Speech Communication and Technology (Interspeech), 2016.
- Microphone Arrays: Signal Processing Techniques and Applications. 2001.
- The AMI meeting corpus: a pre-announcement. Proceedings of the International Workshop on Machine Learning for Multimodal Interaction, 2006.
- The N-best algorithm: an efficient procedure for finding top N sentence hypotheses. Proceedings of the ACM Workshop on Speech and Natural Language, 1989.
- Automatic speech recognition and speech activity detection in the CHIL smart room. Proceedings of the International Workshop on Machine Learning for Multimodal Interaction, 2006.
- The DIRHA simulated corpus. Proceedings of the International Conference on Language Resources and Evaluation (LREC).
- Strategies for distant speech recognition in reverberant environments. EURASIP J. Adv. Signal Process.
- Large vocabulary continuous speech recognition in Greek: corpus and an automatic dictation system. Proceedings of the International Conference on Speech Communication and Technology (Interspeech).
- GridNews: a distributed automatic Greek broadcast transcription system. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- At home with ubiquitous computing: seven challenges. Ubicomp 2001: Ubiquitous Computing.
- Simultaneous measurement of impulse response and distortion with a swept-sine technique. Proceedings of the 108th Audio Engineering Society Convention.
- The rich transcription 2007 meeting recognition evaluation. Multimodal Technologies for Perception of Humans.
- A French corpus of audio and multimodal interactions in a health smart home. J. Multimodal User Interfaces.
- The Greek language in the digital age.
- A practical, real-time speech-driven home automation front-end. IEEE Trans. Consum. Electron.
- Multi-room speech activity detection using a distributed microphone network in domestic environments. Proceedings of the European Signal Processing Conference (EUSIPCO).
- The 2007 AMI(DA) system for meeting transcription. Multimodal Technologies for Perception of Humans.
- Transcribing meetings with the AMIDA systems. IEEE Trans. Audio, Speech, Lang. Process.
- The automatic speech recognition in reverberant environments (ASpIRE) challenge. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
- Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag.
- Using KL-divergence and multilingual information to improve ASR for under-resourced languages. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- The ICSI meeting corpus. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- Robust far-field spoken command recognition for home automation combining adaptation and multichannel processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- ☆ This research was partially supported by EU project DIRHA, grant no. FP7-ICT-2011-7-288121.
- ☆☆ This paper has been recommended for acceptance by Prof. R. K. Moore.