Abstract
Intelligent speech interaction systems have gained great popularity in recent years. For these systems, the accuracy of automatic speech recognition (ASR) has become a key factor in determining user experience. Due to the influence of environmental noise and the diversity and complexity of natural language, the performance of ASR still cannot meet the requirements of real-world application scenarios. To improve the accuracy of ASR, in this paper we propose a domain knowledge enhanced error correction method which first uses an improved phonetic editing distance to select candidates that have the same or similar phonetics as the erroneous segment, and then adopts a language model to find the most appropriate candidate from the domain knowledge set as the final result. We also encapsulate the method as a service with the Flask + Gunicorn + Nginx framework to improve performance under high concurrency. Experimental results demonstrate that our proposed method outperforms the comparison methods by over 48.4% in terms of accuracy and achieves roughly 20–40 times higher concurrency performance.
1 Introduction
In recent years, intelligent speech interaction systems (such as Apple Siri, Microsoft Cortana, Google Now, Amazon Alexa, and Samsung or Sougou Voice Assistant) have gained popularity in all segments of people's life and work [1]. According to Forrester Research, in 2018, 25% of businesses had adopted conversational user interfaces, in which speech is the most direct and important communication channel, to complement mouse-click analytic tools. Currently, these systems normally first transcribe speech into text with automatic speech recognition (ASR) techniques, and then use natural language processing (NLP) techniques to extract the semantic information of utterances. However, existing ASR techniques only model pronunciation and grammar, ignoring the guidance of relevant domain knowledge. Besides, due to the diversity and complexity of natural language, and the differences between human dialects and habits, the accuracy of ASR is still not high enough to meet the requirements of specific application scenarios.
To address these problems, many error correction approaches have been proposed in recent years. For example, Zhou et al. [2] proposed a speech recognition error detection and correction algorithm which generates 20 candidates for each individual word, then uses a linear scoring system to score each sentence and selects the sentence with the highest score as the final result. Mangu et al. [3] proposed a transformation-based learning algorithm which uses a confusion network model to detect and correct errors. Che et al. [4] proposed a post-editing Chinese text correction and intention recognition method for the Chinese speech interaction context. However, these methods still cannot obtain good results in specific areas.
In this paper, we propose a domain knowledge enhanced error correction approach which uses words with the same or similar pronunciations from different domains as the candidates for each individual word. We then use a combination of Chinese phonetic editing distance and a language model to select the word with the highest score as the final result. Besides, to improve the concurrency performance under multiple requests, we use the Flask + Gunicorn + Nginx [5] framework to encapsulate the algorithm as an application programming interface (API), making it take full advantage of the multiple cores of the CPU. Experimental results demonstrate the effectiveness and efficiency of our proposed method.
The rest of the paper is structured as follows. Section 2 gives a brief overview of existing error correction methods and points out their disadvantages. Section 3 presents the proposed framework for error correction as a service. Section 4 carries out an extensive set of experiments to evaluate the performance of the proposed method. Section 5 summarizes the paper and gives a brief introduction to our future work.
2 Related Work
Error correction of speech recognition has become a hot topic in the NLP field. Recently, researchers from both academia and industry have proposed a variety of error correction methods for speech interaction scenarios. For example, Wang et al. [6] proposed a method which combines statistics and rules to realize the translation from Chinese phonetics to actual texts. Zhou et al. [2] proposed an error detection and correction algorithm that first generates 20 candidates for each word, then uses a linear scoring system to evaluate each sentence and selects the sentence with the highest score as the actual content. However, since this method only targets specific areas, the words that can be retrieved are very limited. To solve this problem, Mangu et al. [3] proposed a transformation-based learning algorithm, in which a confusion network model [7] is used to detect and correct potential errors.
Another kind of error correction algorithm is based on post-processing [8, 9]. This approach adds an additional layer behind the speech recognition system to post-process the recognition results. For instance, Ringger et al. [10] used a noisy channel model to detect and correct speech recognition results. In 2012, Bassil et al. [11] proposed a post-processing speech recognition text error correction algorithm based on Bing's online spelling suggestion; a large number of experiments in different languages verified the effectiveness of this method in improving text error correction accuracy. In 2016, Fujiwara et al. [12] designed a custom phonetic alphabet method to improve speech recognition and the accuracy of word input in noisy environments. In 2018, Che et al. [4] proposed an improved phonetic editing distance [13] method to correct possible errors in the Chinese speech interaction context. However, when two candidates have the same editing distance, it is difficult to find the suitable one.
3 Methodology
3.1 Overall Architecture
As can be seen from Fig. 1, the overall architecture of our proposed method can be divided into a corpus processing phase and an error correction phase. In the corpus processing phase, the utterances are first tokenized by the word segmentation [14] module. For each word, on the one hand, each character of the word is replaced by a character with the same phonetics from the character-level confusion set to form a new word. On the other hand, the phonetics of the word is also generated to form a candidate-phonetics dictionary. In the error correction phase, the recognized text is first separated into several word segments. For each segment, the corresponding phonetics is generated. Then similarity scores are calculated between the segment and each candidate from the dictionary. Finally, the candidate with the best score is selected as the final result. In this paper, we use the weighted score of phonetic editing distance and language model [15, 16] as the similarity score.
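The weighted scoring step described above can be sketched as follows. Note that this is an illustrative sketch, not the paper's implementation: `difflib` stands in for the improved phonetic editing distance, the language-model score is assumed to be pre-normalized to [0, 1], and the weight `alpha` and all function names are our own assumptions.

```python
import difflib

def phonetic_similarity(pinyin_a: str, pinyin_b: str) -> float:
    """Similarity in [0, 1] between two pinyin strings (a stand-in for the
    paper's improved phonetic editing distance)."""
    return difflib.SequenceMatcher(None, pinyin_a, pinyin_b).ratio()

def weighted_score(segment_pinyin: str, candidate_pinyin: str,
                   lm_score: float, alpha: float = 0.7) -> float:
    """Weighted combination of phonetic similarity and a language-model
    score already normalized to [0, 1]; alpha is a hypothetical weight."""
    return alpha * phonetic_similarity(segment_pinyin, candidate_pinyin) \
        + (1 - alpha) * lm_score

def best_candidate(segment_pinyin: str, candidates: dict,
                   lm_scores: dict, alpha: float = 0.7) -> str:
    """candidates maps word -> pinyin; lm_scores maps word -> [0, 1]."""
    return max(candidates,
               key=lambda w: weighted_score(segment_pinyin, candidates[w],
                                            lm_scores[w], alpha))
```

For example, a mis-recognized segment whose pinyin is "chu cai" would be matched against the candidate dictionary, and the candidate with the highest combined score wins.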
3.2 Corpus Processing
Corpus Construction. In the NLP field, a corpus plays a very important role in training a model or constructing a candidate set. Although there are many general-purpose corpora, the scale of corpora for specific application scenarios is still very limited. To improve the accuracy of error correction, a set of text prompts that can be used in enterprise scenarios is carefully designed. The corpus contains 700 correct utterances and 196 erroneous utterances. We further divide the correct utterances into four micro-scenarios: travel application, operational data query, reimbursement, and enterprise news broadcasting.
Domain Knowledge Construction. Each industry or enterprise scenario has its own unique domain knowledge [17], which plays an important role in decoding the conceptual representation of user utterances in speech interaction scenarios. For each micro-scenario, to construct the relations between text prompts and core semantics, we use dependency syntax analysis [18] to extract the core components of the utterance. We then extend the corpus with Word2Vec [4] to generate synonyms that have the same or similar semantics as the core components. Finally, the combination of the core components and their synonyms is integrated into the domain knowledge to form a scenario-specific candidate set for error correction. To improve the performance of error correction, we first obtain the Chinese phonetics of each word in the domain knowledge set to form a candidate-phonetics pair dictionary.
Word-Level Confusion Set Generation. Since the error correction method proposed in this paper works at the word level, the effectiveness of word segmentation affects the accuracy of error correction. Besides, the scale of the word set (called the user dictionary) also plays a significant role for mainstream word segmentation tools used in industrial fields, such as Jieba [19] or HanLP [20]. To generate a large-scale word set, we first conduct word segmentation on each utterance. Then, for each segment, we replace each character with the characters that have the same or similar phonetics in the character-level confusion set. All combinations of characters with the same or similar phonetics as the segment thus form new words, which are put into the word-level confusion set.
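The combinatorial replacement step can be sketched with `itertools.product`. The character-level confusion set below is a tiny hand-written example for illustration; the paper's actual set would be built from a full homophone dictionary.

```python
from itertools import product

# Illustrative character-level confusion set: each character maps to
# characters sharing the same or similar pronunciation (tones ignored).
# These particular groupings are hypothetical examples.
CHAR_CONFUSION = {
    "出": ["出", "初", "除"],   # chu
    "差": ["差", "茶", "查"],   # cha
}

def word_confusion_set(word: str) -> set:
    """All words formed by replacing each character of `word` with a
    same-phonetics character from the character-level confusion set."""
    choices = [CHAR_CONFUSION.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*choices)}
```

For the two-character word 出差 with three options per character, this yields 3 × 3 = 9 candidate words for the word-level confusion set.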
Language Model Generation. When there are multiple candidates in the domain knowledge set, it is difficult to choose a suitable one. To address this problem, statistical models such as n-gram language models are usually used to evaluate how fluent the sentence would be if a given candidate were used. In this paper, we use all the correct utterances to train a trigram language model with the KenLM toolkit [21].
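To make the role of the trigram model concrete, here is a toy add-one-smoothed trigram scorer over character sequences. It only illustrates what an n-gram fluency score does; it is not KenLM's API, and a real deployment would use the trained KenLM model as the paper does.

```python
import math
from collections import Counter

class TinyTrigramLM:
    """Add-one-smoothed character trigram model -- a toy stand-in for the
    KenLM model trained in the paper."""

    def __init__(self, sentences):
        self.tri = Counter()   # trigram counts
        self.ctx = Counter()   # bigram context counts
        self.vocab = set()
        for s in sentences:
            toks = ["<s>", "<s>"] + list(s) + ["</s>"]
            self.vocab.update(toks)
            for i in range(len(toks) - 2):
                self.ctx[(toks[i], toks[i + 1])] += 1
                self.tri[(toks[i], toks[i + 1], toks[i + 2])] += 1

    def score(self, sentence: str) -> float:
        """Log10 probability of the sentence under the trigram model."""
        toks = ["<s>", "<s>"] + list(sentence) + ["</s>"]
        v = len(self.vocab)
        logp = 0.0
        for i in range(len(toks) - 2):
            num = self.tri[(toks[i], toks[i + 1], toks[i + 2])] + 1
            den = self.ctx[(toks[i], toks[i + 1])] + v
            logp += math.log10(num / den)
        return logp
```

A sentence seen in (or similar to) the correct-utterance corpus scores higher than one containing a homophone error, which is exactly the signal used to break ties among candidates.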
3.3 Error Correction
For the text information output by the ASR system, we first conduct word segmentation with the same algorithm as in Sect. 3.2. To ensure the performance of word segmentation, the word-level confusion set is loaded into the word segmentation model in advance. Then we convert the text into Chinese phonetics. To counter the impact of environmental noise and dialects, fuzzy phonetics are also unified. For each segment, we calculate the phonetic editing distance between the segment and each word in the domain knowledge set. With the traditional editing distance, if the lengths of the two phonetic strings differ greatly, the distance between them cannot be well represented. To solve this problem, we use an improved editing distance.
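The fuzzy-phonetics unification step can be sketched as a small rewrite pass over pinyin syllables. The paper does not list its exact fuzzy rules; the merges below (retroflex/flat initials and front/back nasal finals) are common Mandarin fuzzy-pinyin pairs used here purely as an example.

```python
import re

# Illustrative Mandarin fuzzy-pinyin merges (not the paper's exact set).
FUZZY_RULES = [
    (r"\bzh", "z"), (r"\bch", "c"), (r"\bsh", "s"),      # zh/z, ch/c, sh/s
    (r"ang\b", "an"), (r"eng\b", "en"), (r"ing\b", "in"),  # nasal finals
]

def normalize_pinyin(pinyin: str) -> str:
    """Map each syllable to a canonical fuzzy form, so that e.g. 'zhang'
    and 'zan' compare as equal before the distance is computed."""
    out = []
    for syl in pinyin.split():
        for pat, rep in FUZZY_RULES:
            syl = re.sub(pat, rep, syl)
        out.append(syl)
    return " ".join(out)
```

Applying this normalization before the distance computation prevents dialect-driven confusions (such as zh/z) from inflating the phonetic distance.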
Suppose \(t_0\) is the text segment that needs to be corrected and \(t_i\) is a word from the domain knowledge set. The distance function between \(t_0\) and \(t_i\) is defined as follows:
where len(x) is the number of words in x and \(len_p(x)\) refers to the number of characters in phonetics of x.
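The distance formula itself did not survive in this version of the text. A plausible length-normalized form, consistent with the definitions above, would be the following (this is a reconstruction under stated assumptions, not necessarily the authors' exact formula):

\[
d(t_0, t_i) = \frac{\mathrm{EditDist}\big(p(t_0),\, p(t_i)\big)}{\max\big(len_p(t_0),\, len_p(t_i)\big)}
\]

where \(p(x)\) denotes the phonetic (pinyin) string of \(x\); the word counts \(len(t_0)\) and \(len(t_i)\) would additionally be compared to discard candidates whose lengths differ too much from the segment.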
Although the phonetic editing distance usually obtains good results, when multiple candidates fall within the distance requirement, it is difficult to choose among them. To solve this problem, we further use the language model to evaluate the fluency of the sentence formed with each candidate. The candidate that maximizes the sentence score is selected as the final result.
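The normalized phonetic distance described above can be implemented with a standard Levenshtein dynamic program over pinyin strings. Normalizing by the longer string's length (an assumption about the "improved" distance; the paper's exact formula may differ) keeps distances comparable when phonetic lengths differ greatly.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_phonetic_distance(pinyin_a: str, pinyin_b: str) -> float:
    """Edit distance on pinyin strings, normalized to [0, 1] by the longer
    length -- a sketch of the length-aware improved distance."""
    if not pinyin_a and not pinyin_b:
        return 0.0
    return edit_distance(pinyin_a, pinyin_b) / max(len(pinyin_a), len(pinyin_b))
```

For example, "chu chai" vs. "chu cai" differ by one deletion, giving a normalized distance of 1/8 = 0.125.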
3.4 Error Correction as a Service
To improve the concurrency of the error correction task, we use the Flask + Gunicorn + Nginx framework, which has been widely used and provides high performance. As can be seen in Fig. 2, the error correction service built on this framework is divided into three layers: a reverse proxy layer, a WSGI HTTP server layer, and a worker layer. When a new HTTP request arrives from a client, it first reaches the reverse proxy layer, which forwards it to the WSGI HTTP server layer according to the routes configured in the Nginx server. The WSGI server then parses the request based on the WSGI protocol and calls the Flask framework in the worker layer to handle it. The Gunicorn server in the WSGI layer is in fact a Python WSGI HTTP server for UNIX. It is broadly compatible with various web frameworks and has high scalability and performance, so it can handle multiple requests with high concurrency. Finally, the Flask framework in the worker layer instantiates multiple Flask instances to handle the error correction requests and returns the final results to the clients.
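The glue between these layers is the WSGI protocol: Gunicorn can serve any WSGI callable, and a Flask app is exactly such a callable. The stdlib-only sketch below shows the worker layer's interface; the endpoint shape, the `correct_text` placeholder, and the JSON payload format are our own illustrative assumptions, not the paper's actual API.

```python
import json

def correct_text(text: str) -> str:
    """Placeholder for the error-correction pipeline described above;
    the real worker would run segmentation, distance and LM scoring."""
    return text

def application(environ, start_response):
    """Minimal WSGI app. A Flask app exposes this same callable
    interface, which is why Gunicorn can serve it directly."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(size) if size else b"{}"
    text = json.loads(body.decode("utf-8")).get("text", "")
    payload = json.dumps({"corrected": correct_text(text)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [payload]
```

Such an app would typically be launched as, e.g., `gunicorn -w 10 app:application` (the worker count mirroring the 10 threads used in the paper's experiments) behind an Nginx `location` block with `proxy_pass` pointing at the Gunicorn socket.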
4 Experiments
4.1 Experimental Setup
Data Set. We established a benchmark data set for error correction from a real-world intelligent speech interaction system for enterprise application scenarios, the XiaoK Digital Speech Assistant. The raw data set includes 960 Mandarin utterances recorded in 2018. To evaluate the accuracy of our method, we invited three well-trained human labelers to mark the correct sentence for each utterance. To address inconsistencies, whenever the labelers had different opinions, they discussed the inconsistent parts until they reached an agreement.
Comparison Methods. We compare the following methods: (1) a character-level language model (language model); (2) our proposed method implemented only with the Flask framework (flask); (3) our proposed method implemented with the Flask + Gunicorn + Nginx framework (flask + gunicorn + nginx).
Evaluation Metrics. As in other NLP tasks, we evaluate the correction performance in terms of sentence-level accuracy. Of the whole data set, 700 utterances are used for constructing the model and 196 utterances for testing.
4.2 Experimental Results
Error Correction Performance with Different Methods. Table 1 shows the accuracy of error correction for the different methods. From the results, we can see that our method performs better than the other machine learning models. The reason might be that our method better leverages domain knowledge for error correction: by comparing the similarity between each segment and the candidates, it can effectively select the most suitable one.
Concurrency Performance with Different Methods. To demonstrate the usability of our method in real-world enterprise scenarios, we also compare the concurrency performance of the different methods. For each method, 10 threads are used to run the error correction service, and 100 simulated concurrent clients send requests to the service simultaneously. Table 2 lists the experimental results. As can be seen from this table, by using the Flask + Gunicorn + Nginx framework, our method achieves much higher concurrency performance than the other methods (almost 40 and 20 times faster than the language model and the Flask-only implementation, respectively).
5 Conclusion and Future Work
This paper investigates implementing error correction as a service in intelligent speech interaction systems for enterprise scenarios. We propose a domain knowledge enhanced error correction approach which first adopts an improved phonetic editing distance to find candidates that have the same or similar phonetics as the erroneous segment, and then uses a language model to select the most suitable one as the final result. Besides, we encapsulate the error correction task as a service with the Flask + Gunicorn + Nginx framework. Experimental results indicate that, compared with the other methods, our method achieves not only much higher accuracy but also much higher concurrency performance.
Future work will be dedicated to reducing the ratio of the correct segments that are wrongly corrected.
References
Ning, Y.S., et al.: Multi-task deep learning for user intention understanding in speech interaction systems. In: Proceedings of AAAI Conference on Artificial Intelligence, San Francisco (2017)
Zhou, Z.Y., Meng, H., Lo, W.K.: A multi-pass error detection and correction framework for Mandarin LVCSR. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 1646–1649 (2006)
Mangu, L., Padmanabhan, M.: Error corrective mechanisms for speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 29–32 (2001)
Che, J., Chen, H., Zeng, J., Zhang, L.J.: A Chinese text correction and intention identification method for speech interactive context. In: Proceedings of the 2018 International Conference on AI & Mobile Services (AIMS) (2018)
Flask Application Example 3 - Construct Web Services through Nginx+Gunicorn+Flask, 12 April 2018. https://www.jianshu.com/p/d71d6d793aaa
Zhang, R., Wang, Z.: Chinese pinyin to text translation technique with error correction used for continuous speech recognition. Tsinghua University (1997)
Bertoldi, N., Zens, R., Federico, M.: Speech translation by confusion network decoding. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2007)
Frankel, A., Santisteban, A.: System and method for post processing speech recognition output. U.S. Patent 7,996,223 (2011)
Xu, Y., Du, J., Huang, Z., Dai, L.R., Lee, C.H.: Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. arXiv preprint arXiv:1703.07172 (2017)
Ringger, E.K., Allen, J.F.: Error correction via a post-processor for continuous speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 427–430 (1996)
Bassil, Y., Alwani, M.: Post-editing error correction algorithm for speech recognition using Bing's online spelling suggestion. arXiv preprint arXiv:1203.5255 (2012)
Fujiwara, K.: Error correction of speech recognition by custom phonetic alphabet input for ultra-small devices. In: Proceedings of CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 104–109 (2016)
Pucher, M., Türk, A., Ajmera, J., Fecher, N.: Phonetic distance measures for speech recognition vocabulary and grammar optimization. In: Proceedings of the 3rd Congress of the Alps Adria Acoustics Association, September 2007
Chen, X., Qiu, X., Zhu, C., Liu, P., Huang, X.: Long short-term memory neural networks for Chinese word segmentation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1197–1206 (2015)
Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI), March 2016
Ballinger, B.M., Schalkwyk, J., Cohen, M.H., Allauzen, C.G.L.: Language model selection for speech-to-text conversion. U.S. Patent 9,495,127 (2016)
Stewart, R., Ermon, S.: Label-free supervision of neural networks with physics and domain knowledge. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2017)
Ye, Z.L., Zhao, H.X.: Syntactic word embedding based on dependency syntax and polysemous analysis. Front. Inf. Technol. Electr. Eng. 19(4), 524–535 (2018)
Jieba Chinese Word Segmentation, 05 March 2018. https://github.com/fxsjy/jieba
Python Interface of Natural Language Processing Toolkit-HanLP, 05 March 2018. https://github.com/hankcs/pyhanlp
KenLM: Faster and Smaller Language Model Queries, 20 March 2018. https://github.com/kpu/kenlm
Acknowledgements
This work is partially supported by technical projects No. c1533411500138 and No. 2017YFB0802700, by NSFC (91646202), the 1000-Talent program, and the China Postdoctoral Science Foundation (2019M652949).
© 2019 Springer Nature Switzerland AG

Ning, Y., Xing, C., Zhang, L.J. (2019). Domain Knowledge Enhanced Error Correction Service for Intelligent Speech Interaction. In: Wang, D., Zhang, L.J. (eds.) Artificial Intelligence and Mobile Services – AIMS 2019. Lecture Notes in Computer Science, vol. 11516. Springer, Cham. https://doi.org/10.1007/978-3-030-23367-9_13

Print ISBN: 978-3-030-23366-2. Online ISBN: 978-3-030-23367-9.