
Computer Speech & Language

Volume 46, November 2017, Pages 284-310

Improving the understanding of spoken referring expressions through syntactic-semantic and contextual-phonetic error-correction

https://doi.org/10.1016/j.csl.2017.05.005

Highlights

  • We present a classifier for detecting Automatic Speech Recognition (ASR) errors.

  • We offer a mechanism that uses shallow semantic parsing to break up the referring expressions heard by the ASR into labelled semantic segments, which are then used to set up syntactic expectations.

  • We describe a syntactic-semantic error-correction model that decides how to modify the output of the ASR on the basis of the syntactic expectations of its semantic segments.

  • We propose a contextual-phonetic model that re-ranks the output of a Spoken Language Understanding (SLU) system on the basis of the phonetic similarity between words mis-heard by the ASR and the contextually-valid candidate interpretations returned by the SLU system.

Abstract

Despite recent advances in automatic speech recognition, one of the main stumbling blocks to the widespread adoption of Spoken Dialogue Systems is the lack of reliability of automatic speech recognizers. In this paper, we offer a two-tier error-correction process that harnesses syntactic, semantic and pragmatic information to improve the understanding of spoken referring expressions, specifically descriptions of objects in physical spaces. A syntactic-semantic tier offers generic corrections to perceived ASR errors on the basis of syntactic expectations of a semantic model, and passes the corrected texts to a language understanding system. The output of this system, which consists of pragmatic interpretations, is then refined by a contextual-phonetic tier, which prefers interpretations that are phonetically similar to the mis-heard words. Our results, obtained on a corpus of 341 referring expressions, show that syntactic-semantic error correction significantly improves interpretation performance, and contextual-phonetic refinements yield further improvements.

Introduction

In recent times, there have been significant improvements in Automatic Speech Recognition (ASR) (Pellegrini and Trancoso, 2010; Chorowski et al., 2015). Nonetheless, ASR errors and inconsistent performance across domains still impede the widespread adoption of Spoken Dialogue Systems (SDSs). For example, a research prototype of a spoken slot-filling dialogue system reported a Word Error Rate (WER) of 13.8% when using “a generic dictation ASR system” (Mesnil et al., 2015), and Google reported an 8% WER for its ASR API, but this API had a WER of 54.6% when applied to the Let’s Go corpus (Lange and Suendermann-Oeft, 2014). The accuracy of the commercial ASR employed in this research (Microsoft Speech SDK 6.1) falls between these numbers, with a WER of 30% for descriptions of household objects. On one hand, these descriptions are more open-ended than the utterances employed in slot-filling applications, but on the other hand, they do not contain proper nouns, such as names of cities and foreign entities, which are rather error prone (Bulyko et al., 2005).

ASR errors not only produce mis-heard (wrongly recognized) entities or actions, but may also yield ungrammatical utterances that cannot be processed by subsequent interpretation modules of a Spoken Language Understanding (SLU) system (e.g., “the plate inside the microwave” being mis-heard as “of plating sight the microwave”), or yield incorrect results when processed by these modules (e.g., hesitations, often accompanied by fillers such as “hmm” or “ah”, being mis-heard as “and” or “on”) – all of which happened in our trials. Thus, further improvements are required in order to enable the widespread adoption of SDSs.

Two approaches for achieving such improvements are (1) enhancing ASR performance, e.g., through the use of deep neural networks (Hinton et al., 2012; Bahdanau et al., 2016) or by reducing the noise in the input signal (Maas et al., 2012); and (2) performing post-ASR error correction (e.g., Alumäe and Kurimo, 2010; Béchet et al., 2014; Fusayasu et al., 2015). Clearly, perfect ASR would obviate the need for the latter approach, but we are not there yet, even with recent advances in ASRs based on deep neural networks (Tam et al., 2014). Further, Ringger and Allen (1996) argue that “even if the SR engine’s language model can be updated with new domain-specific data, the post-processor trained on the same new data can provide additional improvements in accuracy”. Hence, at present, post-ASR error correction is an active area of research (Section 9), which offers a practical way of decoupling innovation cycles within target applications from ASR innovation cycles (Feld et al., 2012), and increases system portability (Ringger and Allen, 1996).

In this paper, we offer a mechanism for improving the understanding of spoken referring expressions, specifically descriptions of objects in physical spaces, by harnessing syntactic, semantic and pragmatic information after obtaining output from the ASR. Our mechanism consists of two main stages: syntactic-semantic error correction (Kim et al., 2013) and contextual-phonetic error-correction (Zukerman et al., 2015b).

The syntactic-semantic tier receives as input textual alternatives returned by the ASR. It first invokes a classifier that postulates wrong words in the ASR output (Zavareh et al., 2013) (Section 4) and a shallow semantic parser that breaks up these texts into semantically labelled segments (Kim et al., 2013) (Section 5). It then activates a syntactic-semantic error-correction model that proposes modifications for selected words in each text on the basis of the information provided by the word-error classifier and syntactic expectations of the semantic segments. The following modifications are considered during this stage: removal of noise (which is often due to filled pauses), insertion of missing prepositions, and replacement of mis-heard words. We consider two main types of replacements: (1) words or phrases expected to be closed class are replaced with phonetically similar closed-class words or phrases, e.g., “for there a wave” in prepositional phrase position is replaced with “further away”; and (2) words or phrases expected to be open class incur generic replacements, e.g., given a text such as “the played inside the microwave” (mis-heard by our ASR from the description “the plate inside the microwave” in the context of Fig. 1b), “played” is replaced with the generic noun “thing”, yielding “the thing inside the microwave”. The rationale for this replacement is that “played” would lead a parser astray, while the proposed replacement allows an SLU system to proceed.
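To make these two replacement strategies concrete, the following is a minimal sketch in Python. The closed-class inventory, the similarity function (difflib as a character-level stand-in for true phonetic matching) and the acceptance threshold are illustrative assumptions, not the components used in the paper.

```python
import difflib

# Illustrative inventory of closed-class phrases expected in, e.g.,
# prepositional-phrase position; the paper's actual lexicon is not shown here.
CLOSED_CLASS_PHRASES = ["further away", "on the left", "inside", "next to", "under"]

def similarity(a: str, b: str) -> float:
    # Character-level stand-in for phonetic similarity, in [0, 1].
    return difflib.SequenceMatcher(None, a, b).ratio()

def repair_segment(segment: str, expected_class: str, threshold: float = 0.5) -> str:
    """Replace a segment flagged as mis-heard: a closed-class position gets the
    most similar closed-class phrase; an open-class (noun) position gets the
    generic noun "thing"."""
    if expected_class == "closed":
        best = max(CLOSED_CLASS_PHRASES, key=lambda p: similarity(segment, p))
        return best if similarity(segment, best) >= threshold else segment
    return "thing"  # generic replacement for an open-class (noun) position

print(repair_segment("for there a wave", "closed"))  # -> "further away"
print(repair_segment("played", "open"))              # -> "thing"
```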

The resultant, possibly modified, texts are then given as input to the Scusi? SLU system (Zukerman et al., 2015a), which is a component of an SDS for human-robot interactions. Scusi? generates a ranked list of pragmatic interpretations for these texts, where an interpretation comprises candidate things in a physical space and the spatial relations between them, e.g., plateD-location_in-microwave1, and the ranking of an interpretation corresponds to the extent to which it matches the description (Section 2). In our example, the interpretation comprising Plate D is ranked first, as it is the only object inside the microwave. When the ASR made the same error for the description “the plate on the table” (which ambiguously refers to Plate E in Fig. 1b), yielding “the played on the table”, the syntactic-semantic tier generated “the thing on the table”. At this stage, this change offers little benefit in the context of Fig. 1b, as most items are on the table. However, in the context of Fig. 1a, the three objects on the table are ranked equal first, ahead of the other objects in the scene.

The contextual-phonetic tier receives as input the top-N pragmatic interpretations produced by Scusi?, and re-ranks them according to the phonetic similarity between the mis-heard head nouns in the texts that led to each interpretation and the terms used to designate the referents of these head nouns (Section 7). To illustrate, let us reconsider “the played on the table” in the context of Fig. 1a, and the three candidates: the pot plant, the plate and the laptop. Some of the words that designate a plate are “plate”, “dish” and “saucer”; “plate” is more similar to “played” than the other words, and is therefore selected to replace “played” as a reference to the red plate. The same procedure is followed for the laptop and the pot plant: “plant” is also quite similar to “played”, but there are no words that designate the laptop and sound like “played”. As a result, the red plate is ranked first, followed by the pot plant.

Note that this procedure does not simply match potentially mis-heard words with phonetically similar words (Mangu and Padmanabhan, 2001), as this tends to introduce noise. Rather, our approach is inspired by what people would do if they heard most of a description, but mis-heard a few words, which is what happens for ASR output. Most of the text produced by an ASR is correct, which enables us to first propose pragmatically plausible candidate interpretations based on the words presumed correct (e.g., “in the microwave”, “on the table”). Only then do we compare the mis-heard words with the terms that designate their referents within these candidate interpretations. This approach is currently applied only to the head noun of a description, because the head noun phrase incurs a high WER of 53%, compared to around 20% for its complements. The extension of this approach to complements is left for future work.
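As a minimal sketch of this re-ranking step (with difflib standing in for the phoneme-level similarity used in the paper, and illustrative designation lists for the three candidates of Fig. 1a):

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Stand-in for phonetic similarity; the paper compares phoneme sequences,
    here we approximate with a character-level ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Illustrative candidate referents for "the played on the table", each with
# the terms that designate it (e.g., drawn from a lexicon such as WordNet).
candidates = {
    "plate":     ["plate", "dish", "saucer"],
    "pot plant": ["plant", "pot plant", "flower"],
    "laptop":    ["laptop", "computer", "notebook"],
}

misheard = "played"
# For each referent, keep its best-matching designation, then rank referents.
ranked = sorted(
    ((max(similarity(misheard, t) for t in terms), ref)
     for ref, terms in candidates.items()),
    reverse=True,
)
for score, ref in ranked:
    print(f"{ref}: {score:.2f}")  # plate ranked first, pot plant second
```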

Our mechanism was evaluated on the corpus of 341 spoken referring expressions used to evaluate the original Scusi? system (Zukerman et al., 2015a). Our ASR, which has a WER of 30%, did not return a completely correct textual output for 219 of these descriptions (15% of these incorrect texts have only minor errors, e.g., “a” instead of “the” or missing “the”). The modifications made by the syntactic-semantic error-correction model statistically significantly improve the interpretation performance of the original Scusi? system, and the modifications made by the contextual-phonetic model yield further improvements (Section 8).

To summarize, the main contribution of this paper is a two-tier mechanism for improving the performance of SLU systems by identifying and correcting ASR errors: a syntactic-semantic tier performs generic corrections of the ASR output prior to passing it to the SLU system, and a contextual-phonetic tier harnesses phonetic information to refine the output of the SLU system. Specific components that support the syntactic-semantic tier are: (1) a classifier for detecting ASR errors, and (2) a shallow semantic parser that breaks up the referring expressions heard by the ASR into labelled semantic segments, which are then used to set up syntactic expectations.

The rest of this paper is organized as follows. In the next section, we briefly describe the workings of our SLU system, and outline our two-tier process, followed by a description of our dataset in Section 3. Sections 4 and 5 describe the word-error classifier and the shallow semantic parser respectively. The syntactic-semantic error-correction mechanism is described in Section 6, followed by the contextual-phonetic error-correction model in Section 7. In Section 8, we discuss our evaluation. We round off the paper with a discussion of related work, and concluding remarks.

Section snippets

System Design

In this section, we briefly describe our SLU system Scusi? (Zukerman et al., 2015a), and outline the application of the error-recovery mechanism as a pre-processing and post-processing step of Scusi?.

The dataset

All the components of our system were evaluated using the dataset employed to evaluate the Scusi? system (Zukerman et al., 2015a). This dataset comprises 400 spoken descriptions generated by 26 speakers – 13 native English speakers and 13 non-native. The descriptions were obtained by asking trial subjects to describe 12 designated objects (labelled A–L) in the scenarios depicted in Fig. 1 (three objects per scenario, where a scenario contains between 8 and 16 objects). The subjects spoke into a…

Word error classifier

Our word-error detector classifies the words in the textual outputs produced by an ASR into two classes: Correct and Wrong.
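The excerpt does not show the classifier’s features or learning algorithm; the sketch below assumes simple per-word features (ASR confidence, language-model log-probability, word length) and a Naive Bayes classifier (cf. the Domingos et al., 1997 reference on the simple Bayesian classifier). It merely illustrates how a downstream module could obtain P(Wrong) for each word.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical per-word features: [ASR confidence, LM log-prob, word length].
# The actual feature set of the classifier is not shown in this excerpt.
X_train = np.array([
    [0.92, -2.1, 5],   # likely Correct
    [0.35, -7.8, 4],   # likely Wrong
    [0.88, -3.0, 8],
    [0.21, -9.2, 6],
])
y_train = ["Correct", "Wrong", "Correct", "Wrong"]

clf = GaussianNB().fit(X_train, y_train)

# Probability that a new word is wrong, as consumed downstream by the
# syntactic-semantic error-correction model (Section 6).
x_new = np.array([[0.40, -6.5, 6]])
p_wrong = clf.predict_proba(x_new)[0][list(clf.classes_).index("Wrong")]
print(f"P(Wrong) = {p_wrong:.2f}")
```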

Shallow Semantic Parser (SSP)

An SSP groups syntactic units into chunks, and assigns to each chunk a label corresponding to its semantic role. SSPs have been used for SLU by Coppola et al. (2009) and Geertzen (2009). Coppola et al. used FrameNet (Baker et al., 1998) to detect and filter the frames of target words, and employed SVMs to perform semantic labelling. Geertzen used a shallow parser to detect semantic units only when a dependency parser failed to produce a parse tree. Our SSP is part of an error-correction model…
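To illustrate the idea of semantically labelled segments on the kind of descriptions used in this paper, here is a toy chunker; the label set and rules are illustrative only, and much simpler than the SSP described in the paper (and in Kim et al., 2013).

```python
import re

# Minimal, illustrative shallow parse of a referring expression into
# semantically labelled segments (Object / Relation / Landmark), for simple
# "NP preposition NP" descriptions only.
PREPOSITIONS = r"(inside|in|on|under|near|next to)"

def shallow_parse(text: str):
    m = re.match(rf"(?P<object>.+?)\s+(?P<relation>{PREPOSITIONS})\s+(?P<landmark>.+)", text)
    if m is None:
        return [("Object", text)]          # no spatial relation found
    return [("Object", m.group("object")),
            ("Relation", m.group("relation")),
            ("Landmark", m.group("landmark"))]

print(shallow_parse("the plate inside the microwave"))
# [('Object', 'the plate'), ('Relation', 'inside'), ('Landmark', 'the microwave')]
```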

Syntactic-semantic error-correction model

Given a textual output produced by the ASR, our syntactic-semantic error-correction mechanism removes Noise, inserts missing prepositions and replaces erroneous words (Fig. 3). The decision to perform removals and replacements depends on the probability assigned by the word-error classifier to the words in question being wrong, and the impact of the action on the probability of the resultant word sequence according to the syntactic-semantic model. Specifically, if a word is deemed wrong by the…
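The excerpt describes removals and replacements as conditioned on two signals: the classifier’s probability that a word is wrong, and the effect of the edit on the probability of the resulting word sequence. A hedged sketch of that accept/reject logic follows; the threshold, the language-model interface and the scoring are assumptions for illustration, not the paper’s tuned procedure.

```python
def apply_edit_if_better(words, idx, candidate, p_wrong, lm_logprob,
                         wrong_threshold=0.5):
    """Replace words[idx] with `candidate` only if (a) the word-error
    classifier deems the word likely wrong, and (b) the edit increases the
    probability of the word sequence under a language model.

    `p_wrong` maps a word to the classifier's P(Wrong); `lm_logprob` maps a
    word sequence to a log-probability. Both are assumed interfaces."""
    if p_wrong(words[idx]) < wrong_threshold:
        return words                              # leave trusted words alone
    edited = words[:idx] + [candidate] + words[idx + 1:]
    return edited if lm_logprob(edited) > lm_logprob(words) else words
```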

Contextual-phonetic error-correction model for objects and landmarks

As mentioned in Section 2, the score of an ICG incorporates the score of the lexical match between instantiated concepts and the corresponding uninstantiated concepts in its parent UCGs. At this stage in the interpretation process, an uninstantiated concept may be the noun “thing” (which replaces non-nouns) or a noun heard by the ASR. The contextual-phonetic error-correction model receives as input a ranked list of top-N ICGs, and re-ranks them based on the phonetic similarity between mis-heard…
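The excerpt does not specify how the phonetic evidence is merged with an ICG’s original score; the sketch below assumes a simple linear combination purely for illustration.

```python
def rerank_icgs(icgs, phonetic_sim, alpha=0.5):
    """Re-rank top-N interpretations (ICGs) by combining each ICG's original
    score with the phonetic similarity between the mis-heard head noun and
    the best-matching term designating the ICG's referent.

    `icgs` is a list of (score, misheard_noun, designating_terms, referent)
    tuples; the linear combination and the weight `alpha` are illustrative
    assumptions about how the two signals could be merged."""
    def combined(icg):
        score, noun, terms, _ = icg
        return alpha * score + (1 - alpha) * max(phonetic_sim(noun, t) for t in terms)
    return sorted(icgs, key=combined, reverse=True)
```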

Evaluation

In this section, we describe our evaluation metrics, and compare the results obtained by the original Scusi? system with those obtained by complementing Scusi? with the syntactic-semantic error-correction model (Section 6) and the contextual-phonetic error-correction model (Section 7). Performance was evaluated using 13-fold cross validation on 341 descriptions obtained from the dataset described in Section 3.

Related research

In this section we provide a brief overview of evaluation metrics for SLU systems, and describe related work. This work mainly focuses on improving ASR performance, except for the research described in (Deoras et al., 2013; Tur et al., 2013), which jointly performs ASR error correction and semantic labelling in a slot-filling application. In contrast, our work combines ASR error correction with SLU to obtain pragmatic interpretations in the context of a…

Conclusion and future work

In this paper, we have offered a two-tier error-correction process that harnesses syntactic, semantic and pragmatic information to improve the understanding of spoken referring expressions, specifically descriptions of objects in physical spaces. The syntactic-semantic tier offers generic corrections to perceived ASR errors on the basis of syntactic expectations of a semantic model, and passes the corrected texts to our SLU system Scusi?. The output of this system, which consists of pragmatic…

Acknowledgments

This research was supported in part by grant DP120100103 from the Australian Research Council. The authors thank Farshid Zavareh and Hung Quan Tran for their work on the ASR error classifier.

References (68)

  • E. Brill et al. An improved error model for noisy channel spelling correction. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong (2000)
  • A. Brooks et al. Working with robots and objects: Revisiting deictic reference for achieving spatial common ground. Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, Salt Lake City, Utah (2006)
  • P. Brown et al. A statistical approach to machine translation. Comput. Ling. (1990)
  • J.K. Chorowski et al. Attention-based models for speech recognition. Advances in Neural Information Processing Systems (2015)
  • B. Coppola et al. Shallow semantic parsing for spoken language understanding. Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado (2009)
  • A. Deoras et al. Joint discriminative decoding of words and semantic tags for spoken language understanding. IEEE Trans. Audio, Speech Lang. Process. (2013)
  • D. DeVault et al. Can I finish? Learning when to respond to incremental interpretation results in interactive dialogue. Proceedings of the 10th SIGdial Meeting on Discourse and Dialogue, London, United Kingdom (2009)
  • P. Domingos et al. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. (1997)
  • M. Feld et al. Mobile texting: Can post-ASR correction solve the issues? An experimental study on gain vs. costs. Proceedings of the ACM International Conference on Intelligent User Interfaces, Lisbon, Portugal (2012)
  • C. Fellbaum. WordNet: An Electronic Lexical Database (Language, Speech, and Communication) (1998)
  • Y. Fusayasu et al. Word-error correction of continuous speech recognition based on normalized relevance distance. Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina (2015)
  • S. Gandrabur et al. Confidence estimation for NLP applications. ACM Trans. Speech Lang. Process. (2006)
  • J. Geertzen. Semantic interpretation of Dutch spoken dialogue. Proceedings of the 8th International Conference on Computational Semantics, Tilburg, The Netherlands (2009)
  • S. Ghannay et al. Word embeddings combination and neural networks for robustness in ASR error detection. Proceedings of the 23rd European Signal Processing Conference, Nice, France (2015)
  • P. Gorniak et al. Probabilistic grounding of situated speech using plan recognition and reference resolution. Proceedings of the 7th International Conference on Multimodal Interfaces, Trento, Italy (2005)
  • M. Hacker et al. A phonetic similarity based noisy channel approach to ASR hypothesis re-ranking and error detection. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy (2014)
  • G. Hinton et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. (2012)
  • L. Hirschman. The evolution of evaluation: Lessons from the Message Understanding Conferences. Comput. Speech Lang. (1998)
  • K. Järvelin et al. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inform. Syst. (2002)
  • M. Jeong et al. Speech recognition error correction using maximum entropy language model. Proceedings of Interspeech 2004, Jeju Island, Korea (2004)
  • M. Jeong et al. Using higher-level linguistic knowledge for speech recognition error correction in a spoken Q/A dialog. Proceedings of the HLT-NAACL Workshop on Higher-Level Linguistic Information for Speech Processing, Boston, Massachusetts (2004)
  • M. Johnson et al. A TAG-based noisy channel model of speech repairs. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain (2004)
  • K. Jokinen et al. Spoken Dialogue Systems (2010)
  • S. Kaki et al. A method for correcting errors in speech recognition using the statistical features of character co-occurrence. Proceedings of the 17th International Conference on Computational Linguistics, Montreal, Canada (1998)
This paper has been recommended for acceptance by Prof. R. K. Moore.
