1 Introduction

According to Statista’s estimate, the global smart home market will reach 79.3 billion U.S. dollars in 2021. ABI Research’s predictions are more optimistic: it forecasts that the global smart home market will reach 70 billion U.S. dollars in 2018 and 100 billion U.S. dollars in 2021. In addition, the major Internet giants have entered the market to stake out positions; Apple’s Siri, Amazon’s Alexa, Google’s Google Now, and Microsoft’s Cortana are all competing for the smart home market. However, due to the complexity of natural language, the large differences between speakers, and the susceptibility of the speech signal to environmental interference, the accuracy of current Automatic Speech Recognition (ASR) is still not high enough.

In recent years, many researchers have studied text correction for speech interaction scenarios. Back in 1997, Zhang and Wang [1] used a hybrid statistical and rule-based approach to translate Chinese Pinyin to text and proposed a method for correcting certain Pinyin errors. In 2012, Bassil and Alwani [2] proposed a post-editing ASR error correction method and algorithm based on Bing’s online spelling suggestions; experiments on speeches in different languages showed a successful decrease in the number of ASR errors and an improvement in the overall error correction rate. In 2016, Fujiwara [3] designed a custom phonetic alphabet optimized for ASR. It enables the user to input words more accurately than spelling them out directly or using the NATO phonetic alphabet, the standardized phonetic alphabet used for human-human speech interaction under noise. Wang et al. [4] divided the speech error correction process into four steps: (1) initial recognition; (2) detecting repeated words by computing the phonetic similarity between collected words and the CCN; (3) automatically correcting recognition errors of repeated words; (4) extracting new words from the recognition result of the current utterance.

In this paper, we propose a post-editing Chinese text correction and intention recognition method for the speech interaction context. First, all the text material is syntactically analyzed to extract the core components, the corpus is extended with word2vec, and the Pinyin of the corpus is extracted to build an inverted index. For the output of ASR speech recognition, the Pinyin representation is extracted first, the fuzzy sounds are then unified, a search is performed in the inverted index, and finally the distance is calculated using the improved edit distance.

2 Related Research

Pronunciation primitive:

Chinese is a language composed of syllables, and syllables can be used as the basis for Chinese speech recognition. Chinese syllables have a regular internal structure: in general, each syllable can be divided into two parts, an initial and a final [5]. Using initials and finals as Chinese-specific pronunciation recognition primitives is therefore a very good choice. Table 1 lists all Chinese initials and finals.

Table 1. Chinese initials and finals
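As a minimal sketch of how these primitives decompose a syllable, the function below splits a toneless Pinyin syllable into its initial and final by longest-prefix matching. The initials list is a simplifying assumption (in particular, treating y/w as initials is a convention, not part of the paper):

```python
# A minimal sketch: split a toneless Pinyin syllable into initial and final.
# The INITIALS list is an assumed rendering of Table 1; syllables with a
# "zero initial" (e.g. "an", "er") get an empty initial.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable):
    for ini in INITIALS:          # two-letter initials are tried first
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable           # zero-initial syllable

print(split_syllable("zhang"))    # ('zh', 'ang')
print(split_syllable("an"))       # ('', 'an')
```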

Fuzzy tone:

A “fuzzy tone” [6] is a pair of syllables that are easily confused and hard to distinguish. Fuzzy tones are used most often in Pinyin input methods, but they are also useful in the field of speech recognition: because some Chinese pronunciations are very similar, they cause considerable trouble for speech recognition, and unifying these sounds by means of fuzzy tones benefits recognition. Most of the fuzzy sounds are listed in Table 2.

Table 2. Fuzzy sound classification and statistics
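A minimal sketch of the unification step, using an assumed subset of the fuzzy-tone pairs (the full list is in Table 2): each member of an easily-confused pair is mapped to a single canonical form before matching.

```python
# A minimal sketch of fuzzy-tone unification. FUZZY_MAP is an illustrative
# subset of common pairs (zh/z, ch/c, sh/s, n/l, f/h, ang/an, ...), not the
# paper's complete Table 2.
FUZZY_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l", "f": "h",
             "ang": "an", "eng": "en", "ing": "in"}

def unify_fuzzy(initial, final):
    # Replace both parts of the syllable with their canonical form, if any.
    return FUZZY_MAP.get(initial, initial), FUZZY_MAP.get(final, final)

print(unify_fuzzy("zh", "ang"))   # ('z', 'an')
```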

3 The Framework of Speech Correction Overall Structure

This paper divides text correction and intention recognition of speech interaction context into two phases: corpus processing and text correction (Fig. 1).

Fig. 1. Speech correction and intention recognition method for speech interaction context

In the corpus processing stage, we first obtain the question-and-answer corpus of the required context, remove the colloquial stop words, and then perform dependency analysis on the corpus; the main purpose is to extract the core components. These core component words can roughly express the overall meaning of a sentence. However, because of the diversity of language, many words can express the same or similar meanings, so we use the Word2Vec model to replace the core components, generating more similar question-and-answer corpus and extending the coverage of the model. Next, we convert the generated linguistic data into Chinese Pinyin form and establish an inverted index to support fast retrieval and timely response.

In the text correction stage, we first obtain text data from ASR, convert it into Chinese Pinyin, replace all the Pinyin with unified fuzzy sounds, and query the inverted index for the n entries whose edit distance is less than k. We then use the improved edit distance method designed in this paper to calculate distances for these n entries and output the most likely text error correction and intention recognition results.
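The two-stage flow above can be sketched end to end as follows. Every helper here is a toy stand-in: the Pinyin lookup, the candidate list, and the distance function are illustrative placeholders, not the paper's actual components.

```python
# A minimal end-to-end sketch of the text correction stage. All data and
# helpers are illustrative stand-ins.
def to_pinyin(text):
    # ASR text -> "-"-joined, fuzzy-unified toneless Pinyin (stub lookup)
    return {"登录经典云": "den-lu-jin-dian-yun"}[text]

def candidates(pinyin, k=3):
    # Stand-in for the inverted-index query returning nearby corpus entries
    corpus = {"den-lu-jin-die-yun": "login to Kingdee cloud"}
    return list(corpus.items())

def distance(a, b):
    # Stand-in for the improved edit distance described later in Sect. 3.3
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def correct(t0):
    query = to_pinyin(t0)
    scored = [(distance(query, p), intent) for p, intent in candidates(query)]
    return min(scored)[1]          # intent of the closest candidate

print(correct("登录经典云"))       # login to Kingdee cloud
```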

3.1 Speech Recognition Error Analysis

In ASR systems, textual errors can generally be divided into three types:

  1. (1)

    The pronunciation is the same (similar) but the characters are different

    Since the ASR system is essentially a model mapping sounds to characters (words), the model can often accurately recognize a word’s sound (or a similar sound) but output the wrong characters. For example, ‘‘’’ (meaning login to Kingdee cloud) was recognized as ‘‘’’ (meaning login classic cloud) or ‘‘’’ (meaning login classic language). In this case, text correction is needed to replace the text with the correct representation.

  2. (2)

    The meaning is the same (similar) but the characters are different

    Due to the diversity of natural language, we usually have many ways to express the same thing, and speech interaction usually uses colloquial language containing many meaningless stop words (such as “” (meaning let me) and “” (meaning a bit)), whose presence or absence does not affect the core meaning of the whole sentence. For example, we may say ‘‘’’ (meaning open this application “everyone’s performance” and let me take a look) to express the “” (meaning open “everyone’s performance”) command. In this case, intention recognition is needed to identify the true intention.

  3. (3)

    Mixed

    This type is a mixture of the above two: the recognized text itself is wrong, and its semantics are also only an approximate representation of the user’s question. In this case, text correction and intention recognition must be combined organically.

3.2 Corpus Processing Stage

A speech interaction application needs a corpus as support: first to train the speech recognition model, and second to define its range of abilities. We need to perform some semantic extensions to support more comprehensive intention recognition, and we need to establish a suitable model to support sorting and distance calculation.

First, we analyze the corpus with dependency parsing to find the core components of each sentence. Dependency Parsing (DP) [7] reveals the syntactic structure of a sentence by analyzing the dependencies between the components of a linguistic unit. Intuitively, grammatical components such as “subject-verb-object” and “attributive-adverbial-complement” are identified in the sentence, and the relationships among the components are analyzed. The core component can roughly express the approximate meaning of a sentence, so it plays an extremely important role. In the voice interaction scenario, the core component of a user’s question is often a representative verb, such as “” (meaning propose), “” (meaning open), or “” (meaning play). These words are often highly substitutable; for example, “” can be replaced by “”.
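Extracting the core component amounts to picking the root (head) word of the dependency tree. A minimal sketch, with a hand-written toy parse standing in for the output of a real dependency parser such as LTP or HanLP:

```python
# A minimal sketch of core-component extraction from a dependency parse.
# The parse triples are hand-written for illustration; a real system would
# obtain them from a dependency parser (e.g. LTP or HanLP).
def core_word(parse):
    # parse: list of (word, head_index, relation); the root has head 0.
    for word, head, rel in parse:
        if head == 0:              # root of the dependency tree
            return word
    return None

# "(please) open everyone's performance" as a toy parse
parse = [("open", 0, "HED"), ("everyone's performance", 1, "VOB")]
print(core_word(parse))            # open
```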

Then, we use a trained Word2Vec model [8, 9] to generate synonyms for the core words and replace the core words in the original sentences to generate a new corpus. Word2Vec is a three-layer neural network model; trained on a large number of texts, it can vectorize words very well, and with additional data structures synonyms can be computed. The new corpus generated by replacing the core words with synonyms is semantically the same as the previous sentences, which expands the coverage of intention recognition.
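The synonym lookup is a nearest-neighbour query by cosine similarity over the word vectors. A minimal sketch with hand-picked toy vectors (a real system would use vectors trained with, e.g., gensim's Word2Vec on the domain corpus):

```python
import math

# A minimal sketch of synonym lookup by cosine similarity, the operation a
# Word2Vec nearest-neighbour query performs. The 2-d vectors are hand-picked
# toys, not trained embeddings.
VECS = {"open": [0.9, 0.1], "launch": [0.85, 0.2], "close": [-0.9, 0.1]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(word, topn=1):
    scores = [(w, cosine(VECS[word], v)) for w, v in VECS.items() if w != word]
    return sorted(scores, key=lambda p: -p[1])[:topn]

print(most_similar("open"))        # [('launch', ...)]
```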

Then, all the corpus is converted into Pinyin representations and the fuzzy sounds are replaced: all phonetic forms with similar pronunciations are replaced by a single representation. After this unified replacement, the accuracy of speech recognition can be greatly improved.
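A minimal sketch of this conversion step. The character-to-Pinyin lexicon and the fuzzy-sound table here are tiny illustrative stand-ins (a real system would use a library such as pypinyin and the full fuzzy list of Table 2); the “-” separator anticipates the improved edit distance of Sect. 3.3.

```python
# A minimal sketch: characters -> toneless Pinyin via a toy lexicon, then
# fuzzy-sound unification, then "-"-joined output. LEXICON and FUZZY are
# illustrative stand-ins only.
LEXICON = {"登": "deng", "录": "lu", "云": "yun"}
FUZZY = {"zh": "z", "ch": "c", "sh": "s", "eng": "en"}

def to_pinyin(text):
    sylls = [LEXICON[ch] for ch in text]
    for pat, rep in FUZZY.items():          # unify fuzzy sounds
        sylls = [s.replace(pat, rep) for s in sylls]
    return "-".join(sylls)

print(to_pinyin("登录云"))                  # den-lu-yun
```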

Finally, an inverted index is built over all the Pinyin for storage. The inverted index improves the efficiency of fuzzy queries, and common inverted index engines can quickly perform searches within a specified edit distance range. This operation helps us filter candidate results.
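Conceptually, the inverted index maps each Pinyin syllable to the corpus entries containing it, so candidates sharing syllables with a query can be fetched without scanning the whole corpus. A minimal in-memory sketch (the paper uses Solr for this, see Sect. 3.3):

```python
from collections import defaultdict

# A minimal in-memory sketch of an inverted index over Pinyin syllables.
# The corpus entries are illustrative.
def build_index(corpus):
    index = defaultdict(set)
    for doc_id, pinyin in enumerate(corpus):
        for syll in pinyin.split("-"):
            index[syll].add(doc_id)        # syllable -> documents containing it
    return index

corpus = ["den-lu-jin-die-yun", "da-kai-ying-yong"]
index = build_index(corpus)
print(sorted(index["yun"]))                # [0]
```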

3.3 Error Correction Stage

For the text t0 output by the ASR system, since the speech input to ASR is often spoken text, we need to preprocess it. The purpose of preprocessing is to remove redundant characters and words while retaining as much of the core sentence as possible, keeping the sentence streamlined for algorithmic analysis. The common preprocessing method is to filter out stop words with a stop-word list after segmentation. In this paper we use the stop-word data set provided by the HanLP project [10], which contains 1,208 Chinese and English stop words.
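A minimal sketch of this filtering step. The stop-word set and the pre-segmented tokens are illustrative; the paper's actual list is the 1,208-entry set from the HanLP project.

```python
# A minimal sketch of stop-word removal after segmentation. STOP_WORDS is an
# illustrative stand-in for the HanLP stop-word list.
STOP_WORDS = {"please", "let", "me", "a", "bit"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["please", "open", "everyone's", "performance"]
print(remove_stop_words(tokens))   # ['open', "everyone's", 'performance']
```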

Similarly, after text preprocessing, we also need to convert the text into Chinese Pinyin form and then unify the fuzzy sounds.

Next, we search the inverted index database generated in the corpus processing stage for the n pieces of corpus closest in edit distance to t0 for further analysis. This paper uses Solr [11] to construct the inverted index. Solr is a high-performance full-text search server developed in Java on top of Lucene [12]. It extends Lucene, provides a richer query language, and is configurable, extensible, and optimized for query performance; it is a very good full-text search engine. Solr helps us quickly build inverted indexes and perform distance-based searches. By querying Solr, we obtain the n pieces of corpus nearest to t0 in edit distance. In this way, we do not need to calculate the distance between t0 and every corpus entry, which greatly reduces system response time and load.

Next, we calculate distances between t0 and each of the n acquired corpus entries t1 – tn one by one to further narrow the candidate set. This paper uses a modified edit distance for this calculation.

The traditional edit distance has a problem in the speech interaction context: if the lengths of two strings differ greatly, the edit distance cannot represent the distance between them well.

To solve this problem, this paper improves the traditional edit distance algorithm. The improvements are as follows:

  1. (1)

    Pinyin texts are separated with the separator “-”, which avoids Pinyin ambiguity and increases the edit distance between words.

  2. (2)

    Introducing a Pinyin word length regular term:

$$ \mathrm{lr} = \mathrm{abs}\left( \mathrm{len}(t_0) - \mathrm{len}(t_i) \right) \times \frac{\sum_{w \in t_0} \mathrm{len}_{\mathrm{p}}(w) + \sum_{w \in t_i} \mathrm{len}_{\mathrm{p}}(w)}{\mathrm{len}(t_0) + \mathrm{len}(t_i)} $$

Here abs(x) denotes the absolute value of x, len(x) denotes the number of words in x, and lenp(x) denotes the number of letters in the Pinyin of x. Adding the Pinyin word length regular term to the Pinyin edit distance gives a better distance for texts with a large difference in length, which is more suitable for speech interaction contexts.
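A sketch of the improved distance under the assumption (consistent with the text, though not stated as an explicit formula) that the regular term lr is added to the ordinary edit distance over the “-”-separated Pinyin. Syllables stand in for the “words” of the formula:

```python
# A sketch of the improved edit distance: ordinary Levenshtein distance over
# "-"-separated Pinyin, plus the word-length regular term lr. Adding lr to
# the base distance is an assumption about how the two are combined.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def length_regular_term(t0, ti):
    # t0, ti: lists of Pinyin syllables (the "words" in the formula);
    # len() counts words, len(w) counts the Pinyin letters of a word.
    letters = sum(len(w) for w in t0) + sum(len(w) for w in ti)
    return abs(len(t0) - len(ti)) * letters / (len(t0) + len(ti))

def improved_distance(t0, ti):
    return edit_distance("-".join(t0), "-".join(ti)) + length_regular_term(t0, ti)

t0 = ["den", "lu", "jin", "die", "yun"]
ti = ["den", "lu", "jin", "dian", "yun"]
print(improved_distance(t0, ti))   # 2.0 (equal word counts, so lr = 0)
```

Note that when the two texts have the same number of words, lr vanishes and the measure reduces to the ordinary Pinyin edit distance; the penalty only kicks in for length-mismatched pairs, which is exactly the failure case of the traditional distance described above.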

4 Experiments

To verify the effectiveness of our method in the Chinese speech interaction context, the method designed in this paper was connected to a smart ASR system, and a validity test was conducted using a multi-person round-call question test. Finally, the results were manually labeled. A test example is shown in Table 3.

Table 3. Test example

We can see that some of these ASR misrecognition cases are similar in pronunciation and some are semantically similar. We tested on such a data set; the test results are shown in Table 4.

Table 4. Test results

We can see that our method can greatly improve the accuracy of text recognition in the ASR system. By analyzing the erroneous cases, we found that, apart from invalid (i.e. meaningless) exceptions, errors can be categorized as follows:

  1. (1)

    Intention recognition failure. For example, “” should be recognized as “” or “” (meaning playing channel Kingdee cloud), but since the corpus processing stage did not treat “play channel” as a synonym of the core word “open”, intention recognition failed.

  2. (2)

    The input text is too long. For example, with “ ” (meaning hello little K, I’m very glad to meet you. I want you to help me run the “performance of everyone”. Let me see if it’s OK), the text is so long that the calculated distance deviates greatly, and error correction fails.

  3. (3)

    Incomplete ASR output. Due to the ASR system itself or the environment, the recognized text differs from the real text or is incompletely recognized. For example, for “” (meaningless), we guessed that the speaker’s intention was “” (meaning login to the Kingdee cloud), but because some information was missing, error correction failed.

5 Conclusions

This paper first briefly introduced industry development and research trends in the field of voice interaction and pointed out the problems and deficiencies in text correction and intention recognition. On this basis, the text error correction and intention recognition method for the speech interaction context described in this paper was proposed. The method uses both semantics and speech to perform error correction and can deal with mixed, complex contexts. Finally, a large number of tests were carried out using test cases. The experiments show that the accuracy of an ASR system can be greatly improved by the proposed text error correction and intention recognition method.

At the same time, however, we discovered some deficiencies in the system:

  1. (1)

    The traditional edit distance algorithm is used in the search, while the improved edit distance algorithm is used when the final result set is calculated. Therefore, in some cases the result set obtained is not the one we hoped for.

  2. (2)

    Since all texts are unreliable and error-prone, the improved optimization method designed in this paper can fail in some cases and not have the desired effect. When the improvement fails, the method degenerates into the ordinary edit distance algorithm.

  3. (3)

    Analysis of the error cases shows that there is still room for further optimization in intention recognition. The method in this paper is not effective in dealing with long texts and incomplete recognition.