
1 Introduction

In recent years, intelligent speech interaction systems (such as Apple Siri, Microsoft Cortana, Google Now, Amazon Alexa, and the Samsung and Sogou voice assistants) have gained popularity in all segments of people's life and work [1]. According to Forrester Research, by 2018, 25% of businesses had adopted conversational user interfaces, in which speech is the most direct and important channel of communication, to complement mouse-and-click analytic tools. Currently, these systems typically first transcribe speech into text with automatic speech recognition (ASR) techniques, and then use natural language processing (NLP) techniques to extract the semantic information of the utterances. However, existing ASR techniques model only pronunciation and grammar, ignoring the guidance of relevant domain knowledge. Moreover, owing to the diversity and complexity of natural language and the differences among dialects and speaking habits, the accuracy of ASR is still not high enough to meet the requirements of specific application scenarios.

To address these problems, many error correction approaches have been proposed in recent years. For example, Zhou et al. [2] proposed a speech recognition error detection and correction algorithm that generates 20 candidates for each word, scores each resulting sentence with a linear scoring system, and selects the sentence with the highest score as the final result. Mangu et al. [3] proposed a transformation-based learning algorithm that uses a confusion network model to detect and correct errors. Che et al. [4] proposed a post-editing Chinese text correction and intention recognition method for the Chinese speech interaction context. However, these methods still cannot obtain good results in specific domains.

In this paper, we propose a domain knowledge enhanced error correction approach that uses words with the same or similar pronunciations from different domains as the candidates for each word. We then combine Chinese phonetic editing distance with a language model to select the candidate with the highest score as the final result. In addition, to improve concurrency under multiple requests, we use the Flask + Gunicorn + Nginx [5] framework to encapsulate the algorithm as an application programming interface (API), allowing it to take full advantage of multiple CPU cores. Experimental results demonstrate the effectiveness and efficiency of the proposed method.

The rest of the paper is structured as follows. Section 2 gives a brief overview of existing error correction methods and points out their disadvantages. Section 3 presents the proposed framework for error correction as a service. Section 4 reports an extensive set of experiments that evaluate the performance of the proposed method. Section 5 summarizes the paper and briefly introduces our future work.

2 Related Work

Error correction for speech recognition has become a hot topic in the NLP field. Recently, researchers from both academia and industry have proposed a variety of error correction methods for speech interaction scenarios. For example, Wang et al. [6] proposed a method that combines statistics and rules to translate Chinese phonetics into actual text. Zhou et al. [2] proposed an error detection and correction algorithm that first generates 20 candidates for each word, and then uses a linear scoring system to evaluate each sentence and select the sentence with the highest score as the actual content. However, since this method targets only specific domains, the words that can be retrieved are very limited. To solve this problem, Mangu et al. [3] proposed a transformation-based learning algorithm in which a confusion network model [7] is used to detect and correct potential errors.

Another kind of error correction algorithm is based on post-processing [8, 9]. These methods add an additional layer behind the speech recognition system to post-process its results. For instance, Ringger et al. [10] used a noisy-channel model to detect and correct speech recognition results. In 2012, Bassil et al. [11] proposed a post-processing speech recognition error correction algorithm based on Bing's online spelling suggestions; experiments in several languages verified its effectiveness in improving text correction accuracy. In 2016, Fujiwara et al. [12] designed a self-defined phonetic alphabet method to improve the speech recognition algorithm and the accuracy of word input in noisy environments. In 2018, Che et al. [4] proposed an improved phonetic editing distance [13] method to correct possible errors in the Chinese speech interaction context. However, when two candidates have the same editing distance, it is difficult to find the suitable one.

Fig. 1. Overall architecture of our proposed method.

3 Methodology

3.1 Overall Architecture

As shown in Fig. 1, the overall architecture of our proposed method is divided into a corpus processing phase and an error correction phase. In the corpus processing phase, the utterances are first tokenized by the word segmentation [14] module. For each word, on the one hand, every character of the word is replaced by characters with the same pronunciation from the character-level confusion set to form new words; on the other hand, the phonetics of the word is generated to build a candidate-phonetics dictionary. In the error correction phase, the recognized text is first separated into word segments. For each segment, the corresponding phonetics is generated, and similarity scores between the segment and each candidate in the dictionary are calculated. Finally, the candidate with the best score is selected as the final result. In this paper, we use the weighted score of phonetic editing distance and language model [15, 16] as the similarity score.

3.2 Corpus Processing

Corpus Construction. In the NLP field, the corpus plays a very important role in training a model or constructing a candidate set. Although there are many general-purpose corpora, corpora for specific application scenarios are still very limited in scale. To improve the accuracy of error correction, we carefully designed a set of text prompts for enterprise scenarios. This corpus contains 700 correct utterances and 196 erroneous utterances. The correct utterances are further divided into four micro-scenarios: travel application, operational data query, reimbursement, and enterprise news broadcasting.

Domain Knowledge Construction. Each industry or enterprise scenario has its own unique domain knowledge [17], which plays an important role in decoding the conceptual representation of user utterances in speech interaction scenarios. For each micro-scenario, to build the relations between text prompts and core semantics, we use dependency syntax analysis [18] to extract the core components of each utterance. We then extend the corpus with Word2Vec [4] to generate synonyms that have the same or similar semantics as the core components. Finally, the combination of the core components and their synonyms is integrated into the domain knowledge to form a scenario-specific candidate set for error correction. To further improve error correction performance, we obtain the Chinese phonetics of each word in the domain knowledge set in advance to form a candidate-phonetics pair dictionary.
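The paper cites Word2Vec only generically, so the following is a minimal sketch of the synonym expansion and dictionary construction, assuming the gensim (4.x) Word2Vec implementation and pypinyin for phonetics; the toy corpus and core words are placeholders, not the authors' data.

```python
# Sketch of synonym expansion for the domain-knowledge candidate set.
from gensim.models import Word2Vec
from pypinyin import lazy_pinyin

# Tokenized utterances from the enterprise corpus (placeholder data).
sentences = [["查询", "上月", "营业", "数据"], ["申请", "出差", "报销"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

def expand_with_synonyms(core_words, topn=5):
    """Return a {word: pinyin} dictionary for core words and their neighbours."""
    candidates = set(core_words)
    for w in core_words:
        if w in model.wv:
            candidates.update(s for s, _ in model.wv.most_similar(w, topn=topn))
    # candidate-phonetics pair dictionary used later for error correction
    return {w: lazy_pinyin(w) for w in candidates}

candidate_phonetics = expand_with_synonyms(["报销", "查询"])
print(candidate_phonetics)
```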

Word-Level Confusion Set Generation. Since the error correction method proposed in this paper works at the word level, the quality of word segmentation affects the accuracy of error correction. In addition, the scale of the word set (called the user dictionary) plays a significant role for mainstream word segmentation tools such as Jieba [19] and HanLP [20] used in industry. To generate a large word set, we first perform word segmentation on each utterance. Then, for each segment, we replace each character with the characters that have the same or similar pronunciation in the character-level confusion set. Every combination of such characters forms a new word that is put into the word-level confusion set.
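A minimal sketch of this generation step is shown below, assuming jieba for segmentation and pypinyin for pronunciation (tones ignored); the character inventory used to build the character-level confusion set is a toy example.

```python
from itertools import product

import jieba
from pypinyin import lazy_pinyin

# Toy character-level confusion set: characters grouped by identical pinyin.
# A real set would be built from a large character inventory.
confusion = {}
for ch in "是事视试市式实":
    confusion.setdefault(lazy_pinyin(ch)[0], set()).add(ch)

def word_confusion_set(utterance):
    """Generate same-pronunciation variants for every segment of an utterance."""
    variants = set()
    for word in jieba.cut(utterance):
        # candidate characters for each position, falling back to the original
        choices = [confusion.get(lazy_pinyin(c)[0], {c}) for c in word]
        variants.update("".join(p) for p in product(*choices))
    return variants

print(word_confusion_set("实事求是"))
```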

Language Model Generation. When the domain knowledge set contains multiple candidates, it is difficult to pick the suitable one. To address this problem, statistical models such as n-gram language models are usually used to evaluate how fluent the sentence would be if a given candidate were used. In this paper, we use all the correct utterances to train a trigram language model with the KenLM toolkit [21].
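A minimal sketch of this step is given below: the trigram model is trained offline with the KenLM command-line tools and then queried from Python; the file names and example sentence are placeholders, and the training corpus is assumed to be word-segmented in the same way as the query.

```python
# Offline training with the KenLM tools (file names are placeholders):
#   lmplz -o 3 < correct_utterances.txt > trigram.arpa
#   build_binary trigram.arpa trigram.bin
import jieba
import kenlm

lm = kenlm.Model("trigram.bin")

def fluency(sentence):
    """Log10 probability of a word-segmented sentence under the trigram LM."""
    return lm.score(" ".join(jieba.cut(sentence)), bos=True, eos=True)

print(fluency("查询上月营业数据"))
```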

3.3 Error Correction

For the text output by the ASR system, we first perform word segmentation with the same algorithm as in Sect. 3.2. To ensure the quality of word segmentation, the word-level confusion set is loaded into the word segmentation model in advance. We then convert the text into Chinese phonetics. To reduce the impact of environmental noise and dialects, fuzzy sounds are also unified. For each segment, we calculate the phonetic editing distance between the segment and each word in the domain knowledge set in turn. With the traditional editing distance, if the lengths of the two phonetic strings differ greatly, the distance between them cannot be well represented. To solve this problem, we use the improved editing distance defined in Eq. (1) below.
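Before turning to Eq. (1), the fuzzy-sound unification step mentioned above can be sketched as follows. The paper does not list the fuzzy-sound rules, so the mapping below uses a few common Mandarin examples (zh/z, ch/c, sh/s and the -ng finals) as an assumption, with pypinyin supplying the phonetics.

```python
from pypinyin import lazy_pinyin

# Illustrative fuzzy-sound rules; the actual rules used in the paper are not
# specified, so this mapping is an assumption.
FUZZY_INITIALS = {"zh": "z", "ch": "c", "sh": "s"}
FUZZY_FINALS = {"ang": "an", "eng": "en", "ing": "in"}

def normalize_pinyin(word):
    """Convert a word to pinyin and unify fuzzy initials and finals."""
    syllables = []
    for syl in lazy_pinyin(word):
        for k, v in FUZZY_INITIALS.items():
            if syl.startswith(k):
                syl = v + syl[len(k):]
                break
        for k, v in FUZZY_FINALS.items():
            if syl.endswith(k):
                syl = syl[:-len(k)] + v
                break
        syllables.append(syl)
    return syllables

print(normalize_pinyin("生成"))  # ['sen', 'cen'] after fuzzy unification
```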

Suppose \(t_0\) is the text segment to be corrected and \(t_i\) is a word from the domain knowledge set. The distance function between \(t_0\) and \(t_i\) is defined as follows:

$$\begin{aligned} distance(t_0,t_i) = \left| len(t_0) - len(t_i) \right| \cdot \frac{\sum _{w \in t_0} len_p(w) + \sum _{w \in t_i} len_p(w)}{len(t_0) + len(t_i)} \end{aligned}$$
(1)

where len(x) is the number of words in x, and \(len_p(x)\) is the number of characters in the phonetics of x.
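A direct transcription of Eq. (1) is sketched below, assuming jieba to count words and pypinyin to obtain the phonetics; both library choices and the example inputs are assumptions rather than part of the original implementation.

```python
import jieba
from pypinyin import lazy_pinyin

def phonetic_distance(t0, ti):
    """Improved editing distance of Eq. (1) between segment t0 and candidate ti."""
    words0, wordsi = list(jieba.cut(t0)), list(jieba.cut(ti))   # len(t0), len(ti)
    p0 = sum(len("".join(lazy_pinyin(w))) for w in words0)      # sum of len_p(w), w in t0
    pi = sum(len("".join(lazy_pinyin(w))) for w in wordsi)      # sum of len_p(w), w in ti
    return abs(len(words0) - len(wordsi)) * (p0 + pi) / (len(words0) + len(wordsi))

print(phonetic_distance("报销", "办理出差报销"))
```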

Although the phonetic editing distance alone usually yields good results, when multiple candidates fall within the distance threshold, it is difficult to choose among them. To solve this problem, we further use the language model to evaluate the fluency of the sentence formed with each candidate. The candidate that maximizes the sentence score is selected as the final result.
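A minimal sketch of this selection step is given below. It reuses phonetic_distance from the sketch above and the kenlm/jieba setup from Sect. 3.2; the weights alpha and beta and the linear combination are illustrative, since the paper does not specify the exact weighting scheme.

```python
import jieba
import kenlm

lm = kenlm.Model("trigram.bin")  # trigram model from Sect. 3.2 (placeholder path)

def select_candidate(segment, candidates, sentence_prefix, alpha=1.0, beta=1.0):
    """Pick the candidate with the best weighted score of phonetic distance
    (lower is better) and language-model fluency (higher is better)."""
    def score(cand):
        sentence = sentence_prefix + cand
        fluency = lm.score(" ".join(jieba.cut(sentence)), bos=True, eos=True)
        return -alpha * phonetic_distance(segment, cand) + beta * fluency
    return max(candidates, key=score)

# Example (hypothetical ASR error "报效" scored against domain candidates):
# select_candidate("报效", ["报销", "汇报"], "申请出差")
```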

Fig. 2. The Flask + Gunicorn + Nginx error correction service framework.

3.4 Error Correction as a Service

To improve the concurrency of the error correction task, we use the Flask + Gunicorn + Nginx framework, which has been widely used in recent years and provides high performance. As shown in Fig. 2, the error correction service built on this framework is divided into three layers: the reverse proxy layer, the WSGI HTTP server layer, and the worker layer. When a new HTTP request arrives from a client, it first reaches the reverse proxy layer, which forwards the request to the WSGI HTTP server layer according to the routes configured in the Nginx server. The WSGI server then parses the request according to the WSGI protocol and calls the Flask framework in the worker layer to handle it. The Gunicorn server in the WSGI layer is a Python WSGI HTTP server for UNIX; it is broadly compatible with various web frameworks and offers high scalability and performance, so it can handle many requests concurrently. Finally, the Flask framework in the worker layer instantiates multiple Flask workers to handle the error correction requests and return the final results to the clients.
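The service layer can be sketched as follows; the route name, JSON fields, and the correct() helper (standing in for the error correction pipeline of Sect. 3.3) are illustrative placeholders, not the authors' released code.

```python
# app.py -- minimal sketch of the error-correction API.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/correct", methods=["POST"])
def correct_endpoint():
    text = request.get_json(force=True).get("text", "")
    corrected = correct(text)  # correct(): the pipeline of Sect. 3.3 (assumed helper)
    return jsonify({"input": text, "corrected": corrected})

# Gunicorn serves the Flask app with multiple workers behind Nginx, e.g.:
#   gunicorn -w 10 -b 127.0.0.1:8000 app:app
# while an Nginx location block reverse-proxies port 80 to 127.0.0.1:8000.
```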

4 Experiments

4.1 Experimental Setup

Data Set. We established a benchmark data set for error correction from XiaoK Digital Speech Assistant, a real-world intelligent speech interaction system for enterprise application scenarios. The raw data set includes 960 Mandarin utterances recorded in 2018. To evaluate the accuracy of our method, we invited three well-trained human labelers to mark the correct sentence for each utterance. When the labelers disagreed, they discussed the inconsistent parts until an agreement was reached.

Comparison Methods. We compare three settings: (1) a character-level language model (language model); (2) our proposed method implemented only with the Flask framework (flask); and (3) our proposed method implemented with the Flask + Gunicorn + Nginx framework (flask + gunicorn + nginx).

Evaluation Metrics. As in other NLP tasks, we evaluate the correction performance in terms of sentence-level accuracy. Of the whole data set, 700 utterances are used for constructing the model and 196 utterances for testing.

4.2 Experimental Results

Error Correction Performance with Different Methods. Table 1 shows the error correction accuracy of the compared methods. From the results, we can see that our method outperforms the other machine learning model. The reason might be that our method better leverages domain knowledge for error correction: by comparing the similarity between the segment and the candidates, it can select the most suitable one effectively.

Table 1. Error correction performance with different methods.

Concurrency Performance with Different Methods. To demonstrate the usability of our method in real-world enterprise scenarios, we also compare the concurrency performance of the different methods. For each method, 10 worker threads are used to run the error correction service, and 100 simulated concurrent clients send requests to the service simultaneously. Table 2 lists the experimental results. As can be seen from the table, with the Flask + Gunicorn + Nginx framework our method achieves much higher concurrency performance than the other methods (almost 40 and 20 times faster than the language model and the Flask-only method, respectively).

Table 2. Concurrency performance with different methods.
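This concurrency test can be reproduced with a simple client-side sketch such as the one below; the URL, payload, and timing logic are placeholders rather than the original benchmark code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost/correct"        # placeholder service endpoint
PAYLOAD = {"text": "查询上月营业数据"}    # placeholder request body

def one_request(_):
    return requests.post(URL, json=PAYLOAD, timeout=30).status_code

start = time.time()
with ThreadPoolExecutor(max_workers=100) as pool:  # 100 simulated clients
    codes = list(pool.map(one_request, range(100)))
print(f"{codes.count(200)}/100 succeeded in {time.time() - start:.2f}s")
```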

5 Conclusion and Future Work

This paper investigates implementing error correction as a service in intelligent speech interaction systems for enterprise scenarios. We propose a domain knowledge enhanced error correction approach that first adopts an improved phonetic editing distance to find candidates whose phonetics are the same as or similar to the erroneous segment, and then uses a language model to select the most suitable one as the final result. In addition, we encapsulate the error correction task as a service with the Flask + Gunicorn + Nginx framework. Experimental results indicate that, compared with other methods, our method achieves not only much higher accuracy but also much higher concurrency performance.

Future work will be dedicated to reducing the proportion of correct segments that are wrongly corrected.