
Computer Speech & Language

Volume 46, November 2017, Pages 419-443

Room-localized spoken command recognition in multi-room, multi-microphone environments

https://doi.org/10.1016/j.csl.2017.02.004

Highlights

  • Always-listening recognition pipeline for multi-room smart spaces.

  • Room-localized operation based on multi-room speech activity detection.

  • Channel selection and decision fusion approaches in all pipeline components.

  • Robust acoustic modeling based on far-field data simulation and per-channel adaptation.

  • Systematic pipeline evaluation and optimization on both simulated and real corpora.

Abstract

The paper focuses on the design of a practical system pipeline for always-listening, far-field spoken command recognition in everyday smart indoor environments that consist of multiple rooms equipped with sparsely distributed microphone arrays. Such environments, for example domestic spaces and multi-room offices, present challenging acoustic scenes to state-of-the-art speech recognizers, especially under always-listening operation, due to low signal-to-noise ratios, frequent overlaps of target speech, acoustic events, and background noise, as well as inter-room interference and reverberation. In addition, recognition of target commands often needs to be accompanied by their spatial localization, at least at the room level, to account for users in different rooms, providing command disambiguation and room-localized feedback. To address the above requirements, the use of parallel recognition pipelines is proposed, one per room of interest. The approach is enabled by a room-dependent speech activity detection module that employs appropriate multichannel features to determine speech segments and their room of origin, feeding them to the corresponding room-dependent pipelines for further processing. These consist of the traditional cascade of far-field spoken command detection and recognition, the former based on the detection of “activating” key-phrases. Robustness to the challenging environments is pursued by a number of multichannel combination and acoustic modeling techniques, thoroughly investigated in the paper. In particular, channel selection, beamforming, and decision fusion of single-channel results are considered, with the latter performing best. Additional gains are observed when the employed acoustic models are trained on appropriately simulated reverberant and noisy speech data, and are channel-adapted to the target environments. Further issues investigated concern the inter-dependencies of the various system components, demonstrating the superiority of joint optimization of the component tunable parameters over their separate or sequential optimization. The proposed approach is developed for the Greek language, exhibiting promising performance in real recordings in a four-room apartment, as well as a two-room office. For example, in the latter, a 76.6% command recognition accuracy is achieved on a speaker-independent test, employing a 180-sentence decoding grammar. This result represents a 46% relative improvement over conventional beamforming.

Introduction

Significant research effort has been devoted over the past decades to the design of Voice-enabled User Interfaces (VUIs) for natural, hands-free human-computer interaction. Such interfaces have typically been employed in interactive voice response systems at call centers and, more recently, in personal assistant applications on personal computers or smartphones (Schalkwyk et al., 2010). State-of-the-art developments in acoustic modeling for speech recognition (Hinton et al., 2012; Yu and Deng, 2015) have certainly contributed substantially to making VUIs practically usable in a variety of everyday environments; however, untethered, far-field, and always-listening operation, robust to noise, still constitutes a challenge that limits their universal applicability.

This challenge remains prominent in the very active research area of ambient assisted living inside smart homes, where, among others, VUIs are seen as crucial to the occupants’ safety and well-being (Edwards and Grinter, 2001; Chan et al., 2008; Vacher et al., 2015). Indeed, domestic environments typically exhibit inter-room interference, frequent overlaps of various acoustic events and background noise with target speech, and moderate-to-high reverberation, when the acoustic scene is captured by far-field microphones, as is desired in an always-listening, untethered operation scenario. Similar conditions are present in additional everyday indoor environments, for example multi-room offices. Not surprisingly, Distant Speech Recognition (DSR) performance under such conditions lags dramatically behind that of close-talking, noise-free scenarios (Kumatani et al., 2012).

A promising course for improving DSR in indoor environments is to exploit information from multiple audio channels, when such is available from distributed microphone arrays (Brandstein and Ward, 2001) located inside the smart space and providing sufficient spatio-temporal sampling of the acoustic scene. Such a solution has been investigated, for example, in the recent EU-funded project DIRHA.1 The project focused on the design of a VUI for home automation, supporting distant speech interaction in different languages and targeting, in particular, people with kinetic disabilities. The basic use-case involved command-like voice-control of automated home equipment, for example of the room lights, temperature settings, and door, window, and shutter operation. To enable hands-free operation, the VUI was designed to be always-listening, employing key-phrase based activation. Further, to achieve appropriate disambiguation of uttered commands, allow possible interaction with multiple users in different rooms, and provide localized feedback (VUI confirmation using room loudspeakers), room-level localization of the recognized commands was also performed. An example of the DIRHA challenging acoustic scene is depicted in Fig. 1.

In this paper, we describe in detail the design of a robust multichannel distant speech processing pipeline, developed for the purposes of the aforementioned DIRHA domestic interaction scenario in the Greek language. The adopted methodology is rather general, being readily applicable to support VUIs in other everyday indoor multi-room environments equipped with multiple microphone sensors, such as smart offices. The work deals with a wide range of challenging topics in the area of distant speech processing, where its contributions lie, namely:

  • Always-listening operation, achieved by employing Speech Activity Detection (SAD), key-phrase detection, and DSR.

  • Room-localized operation, based on a multi-room SAD component used to drive separate, parallel cascades of key-phrase detection and DSR for each room of the smart space.

  • Multichannel speech processing beyond beamforming, such as channel selection and decision fusion of single-channel results, considered in all pipeline components.

  • Robust acoustic modeling, based on far-field data simulation and per-channel adaptation with little training data available in the target environment.

  • Pipeline component optimization, studying component inter-dependencies and optimizing their tunable parameters separately, sequentially, or jointly.

  • System and pipeline component evaluation on both simulated and real corpora in two multi-room, multichannel smart environments.

In more detail, to support always-listening operation, we build on the widely used cascade of three speech processing stages, as overviewed in Fig. 2, namely: (a) SAD, to separate speech from non-speech events; (b) key-phrase detection, to identify a predefined system activation phrase; and (c) DSR, to recognize the issued command. Combinations of some of the above components can be found in a variety of VUIs, providing partial robustness against non-speech events and increased efficiency, by processing only the speech segments of the incoming signals.
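
To make the flow of this cascade concrete, the following Python sketch chains the three stages: SAD segments the incoming stream, the key-phrase detector gates on the activation phrase, and only then is the command decoded. The `sad`, `keyphrase_detector`, and `recognizer` objects and their method names are hypothetical placeholders, not the authors' implementation.

```python
from typing import List


def always_listening_cascade(audio_stream, sad, keyphrase_detector,
                             recognizer) -> List[str]:
    """Cascade of (a) SAD, (b) key-phrase detection, (c) DSR per speech segment."""
    recognized_commands = []
    for segment in sad.detect(audio_stream):                  # (a) speech vs. non-speech
        if keyphrase_detector.contains_activation(segment):   # (b) activation phrase present?
            command = recognizer.decode(segment)              # (c) decode the spoken command
            if command:
                recognized_commands.append(command)
    return recognized_commands
```

Only speech segments that contain the activation key-phrase ever reach the recognizer, which is what yields the efficiency and robustness benefits noted above.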

Further, to allow room-localized operation, we modify the aforementioned traditional cascade by designing a multi-room SAD component, instead of employing a generic, room-independent SAD. This component is able to identify speech segments in conjunction with their room of origin, robustly addressing the problem of inter-room interference. It is used to drive separate cascades of key-phrase detection and DSR for each room of the smart space, operating in parallel. The process yields room-localized speech command recognition, as required by the VUI scenario considered in this paper.
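
A minimal sketch of this room-localized routing is shown below, assuming a multi-room SAD that tags each detected speech segment with a room identifier; the `multi_room_sad` and `room_pipelines` interfaces are hypothetical illustrations rather than the actual system components.

```python
def route_segments_by_room(multi_room_sad, audio_streams, room_pipelines):
    """Feed every detected speech segment to the cascade of its estimated room."""
    for segment, room_id in multi_room_sad.detect(audio_streams):
        # Only the key-phrase detection / DSR cascade of the originating room
        # processes the segment; the cascades of the other rooms stay idle.
        room_pipelines[room_id].process(segment)
```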

To fulfill the needs of the detection and recognition tasks involved in the system, we elaborate on and combine multichannel speech processing methods that have been explored in our previous preliminary studies (Giannoulis et al., 2015; Katsamanis et al., 2014; Tsiami et al., 2014a), achieving promising results and robustness in the challenging conditions considered. The implemented components make extensive use of channel selection and combination strategies to benefit from the available network of microphones inside the rooms. The advantage of these approaches is that they require no prior information regarding microphone network topology, other than mere room-microphone association. The proposed channel combination methods are based on decision fusion schemes, and they appear to outperform beamforming in most cases.
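
As a hedged illustration of one simple decision-fusion scheme of this kind, the snippet below decodes every microphone channel of a room independently and keeps the hypothesis with the largest total confidence; the fusion rules actually used in the system are detailed later in the paper and may differ from this toy rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fuse_channel_decisions(channel_hypotheses: List[Tuple[str, float]]) -> str:
    """Fuse (command hypothesis, confidence) pairs, one per microphone channel."""
    scores: Dict[str, float] = defaultdict(float)
    for hypothesis, confidence in channel_hypotheses:
        scores[hypothesis] += confidence
    # The command backed by the largest total confidence mass wins.
    return max(scores, key=scores.get)


# Example: three channels of the same room vote on the decoded command.
print(fuse_channel_decisions([("turn on the lights", 0.8),
                              ("turn on the lights", 0.6),
                              ("turn off the lights", 0.7)]))
```

Note that only per-channel decisions and scores are combined, so no microphone geometry is required beyond knowing which microphones belong to each room.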

We gain additional benefits by employing robust modeling in order to reduce the mismatch between training and test conditions. In particular, we generate artificial training data simulating the test conditions and, furthermore, we employ statistical model adaptation for each microphone channel, using a small amount of data from the target environment, if available.
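
The kind of far-field training-data simulation referred to here can be sketched as follows: a clean close-talk utterance is convolved with a room impulse response and mixed with background noise at a target SNR. The snippet assumes 1-D NumPy arrays at a common sampling rate and is only a simplified stand-in for the simulation procedure described later in the paper.

```python
import numpy as np


def simulate_far_field(clean: np.ndarray, rir: np.ndarray,
                       noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Create a reverberant, noisy copy of a clean utterance (assumes len(noise) >= len(clean))."""
    reverberant = np.convolve(clean, rir)[: len(clean)]   # add reverberation
    noise = noise[: len(reverberant)]                      # trim noise to the utterance length
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that the mixture reaches the requested signal-to-noise ratio.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```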

Further, we consider the optimization of a number of tunable system component parameters, taking their inter-dependencies into account. Specifically, we observe that their joint optimization, rather than separate or sequential optimization, leads to improved command recognition accuracy.
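
The difference between joint and separate tuning can be illustrated with a simple exhaustive grid search over a development set: the joint search scores every combination of parameter values with the end-to-end command recognition accuracy, thereby capturing parameter interactions that per-component tuning misses. The parameter names and the `evaluate` callback below are hypothetical examples, not the actual tunables of the system.

```python
import itertools


def joint_grid_search(param_grid, evaluate):
    """Score all parameter combinations end-to-end and keep the best one."""
    names = list(param_grid)
    best_params, best_acc = None, -1.0
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        acc = evaluate(params)   # end-to-end dev-set accuracy, not a per-component metric
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


# Example grid over two (hypothetical) thresholds of the SAD and key-phrase stages.
grid = {"sad_threshold": [0.3, 0.5, 0.7],
        "keyphrase_threshold": [0.4, 0.6, 0.8]}
```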

Finally, we conduct extensive experimentation on both simulated and real datasets, where the adopted system architecture is evaluated systematically. For this purpose, we employ three separate databases: (a) DIRHA-sim, a corpus of simulated long audio recordings inside a real multi-room apartment (Cristoforetti et al., 2014); (b) ATHENA-real, a set of real recordings in a two-room office environment (Tsiami et al., 2014b); and (c) DIRHA-real, a corpus of real recordings captured inside the multi-room apartment also used for the first set. The first two consist of both development and test subsets, allowing for model adaptation and system optimization, while the third one is employed for testing the proposed pipeline on real data, unseen during its training. Reported results vary due to the different characteristics and challenges of each dataset, reaching 76.6% command recognition accuracy on the ATHENA-real corpus.

The rest of the paper is organized as follows: Section 2 overviews related work in the literature. Section 3 presents the proposed system, describes its components in detail, and reviews the adopted robust modeling and multichannel processing methods. Section 4 describes the databases used for the development and evaluation of the system pipeline. Section 5 introduces the adopted experimental framework and presents results of both the isolated components and the integrated system. Details on system optimization, final pipeline evaluation, and an error analysis are also included. Finally, Section 6 concludes the paper with a brief discussion.

Section snippets

Related work

Several projects and challenges have been launched over the last decade targeting intelligent interfaces for indoor smart environments and addressing DSR via multiple distributed microphones. Initially, the community focused on single-room setups for the analysis of lectures and meetings. Research projects like CHIL (Chu et al., 2006) and AMI (Hain et al., 2008) produced a wide range of results under the framework of the NIST Rich Transcription evaluation campaigns (Fiscus et al., 2008).

Proposed multichannel, always-listening, distant speech recognition pipeline

As already outlined, the proposed speech processing pipeline aims at recognizing spoken commands for home and office automation. The user is potentially able to address the system from any position in the multi-room space. This is achieved by designing it to operate in parallelized room-dependent speech processing cascades, consisting of (a) microphone selection, (b) command detection, and (c) command recognition, all driven by multi-room SAD that provides candidate speech segments for each

Simulated and real corpora for indoor automation

Three challenging multichannel datasets are employed for the development and evaluation of the proposed system: (a) the DIRHA simulated corpus (DIRHA-sim), (b) the DIRHA real corpus (DIRHA-real), and (c) the ATHENA real database (ATHENA-real). All sets have been acquired in multi-room smart environments and include one-minute-long recordings of a variety of commands and activation phrases in Greek, as well as non-speech events and background noises, rendering the recordings very realistic for

Experimental framework and system evaluation

The design of the experimental framework for the development and evaluation of the presented system pipeline is complex due to the inter-dependency of the connected modules. To account for the behavior of each component individually and relative to the others, we group experimental tasks into three categories, discussing details in the following subsections, namely:

  • 1.

    Individual: every module of the pipeline is tested separately in terms of standard evaluation metrics such as precision, recall,

Conclusions, discussion and future work

In this work, we detail the design, optimization, and systematic evaluation of a speech processing and recognition pipeline for an always-listening voice-enabled user interface in Greek. The pipeline aims at robust far-field spoken command recognition in challenging multi-room smart environments, such as homes and offices, equipped with sparsely distributed microphone arrays. The proposed system architecture is based on the synergy between multichannel speech activity detection, key-phrase detection,

References

  • L. Cristoforetti et al.

    The DIRHA simulated corpus

    Proceedings of the International Conference on Language Resources and Evaluation (LREC)

    (2014)
  • M. Delcroix et al.

    Strategies for distant speech recognition in reverberant environments

    EURASIP J. Adv. Signal Process.

    (2015)
  • V. Digalakis et al.

    Large vocabulary continuous speech recognition in Greek: corpus and an automatic dictation system

    Proceedings of the International Conference on Speech Communication and Technology (Interspeech)

    (2003)
  • D. Dimitriadis et al.

    GridNews: a distributed automatic Greek broadcast transcription system

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (2009)
  • W.K. Edwards et al.

    At home with ubiquitous computing: seven challenges

    Ubicomp 2001: Ubiquitous Computing

    (2001)
  • A. Farina

    Simultaneous measurement of impulse response and distortion with a swept-sine technique

Proceedings of the 108th Audio Engineering Society Convention

    (2000)
  • J.G. Fiscus et al.

    The rich transcription 2007 meeting recognition evaluation

    Multimodal Technologies for Perception of Humans

    (2008)
  • A. Fleury et al.

    A French corpus of audio and multimodal interactions in a health smart home

    J. Multimodal User Interfaces

    (2013)
  • M. Gavrilidou et al.

    The Greek language in the digital age

  • T. Giannakopoulos et al.

    A practical, real-time speech-driven home automation front-end

    IEEE Trans. Consum. Electron.

    (2005)
  • P. Giannoulis et al.

    Multi-room speech activity detection using a distributed microphone network in domestic environments

    Proceedings of the European Signal Processing Conference (EUSIPCO)

    (2015)
  • T. Hain et al.

    The 2007 AMI(DA) system for meeting transcription

    Multimodal Technologies for Perception of Humans

    (2008)
  • T. Hain et al.

    Transcribing meetings with the AMIDA systems

    IEEE Trans. Audio, Speech, Lang. Process.

    (2012)
  • M. Harper

    The automatic speech recognition in reverberant environments (ASpIRE) challenge

    Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)

    (2015)
  • G. Hinton et al.

    Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups

    IEEE Signal Process. Mag.

    (2012)
  • D. Imseng et al.

    Using KL-divergence and multilingual information to improve ASR for under-resourced languages

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (2012)
  • A. Janin et al.

    The ICSI meeting corpus

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (2003)
  • A. Katsamanis et al.

    Robust far-field spoken command recognition for home automation combining adaptation and multichannel processing

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (2014)
This research was partially supported by EU project DIRHA, grant no. FP7-ICT-2011-7-288121.

This paper has been recommended for acceptance by Prof. R. K. Moore.
