Improvement of the LR parsing table and its application to grammatical error correction

https://doi.org/10.1016/S0020-0255(02)00272-4

Abstract

An LR parsing table is generally used for parsing natural languages with a context-free grammar. Beyond parsing, it can serve as an index for approximate pattern matching and error correction, because it can predict the next character in a sentence. The traditional LR parsing table has a drawback, however: as the number of sequences to be processed grows, many reduce actions are created in the table, and parsing a sentence takes a great deal of time. In this paper, we propose a method for constructing a new LR parsing table without reduce actions from a generalized context-free grammar. The new table gives the states to transition to after accepting each symbol. We also apply the new table to detecting and correcting erroneous sentences that contain syntax errors, unknown words, and misspellings. With this table, the symbol immediately after the error position can be used to select correction symbols; as a result, fewer candidates are produced during correction and a fast system can be realized. Experiments on 1050 sentences containing error characters show that this method corrects error points 69 times faster than the traditional method while keeping almost the same correction accuracy.

Introduction

An LR parsing table is generally used for parsing natural languages with a context-free grammar. Beyond parsing, it can serve as an index for approximate pattern matching and error correction, because it can predict the next character in a sentence. It is impossible, however, to write a context-free grammar that accepts all of natural language. Generally speaking, a context-free grammar can accept only 70% of natural language. When the target domain is limited, on the other hand, a task-specific grammar can be prepared [5], and a grammar-based method can then achieve high correction accuracy. The traditional LR parsing table also has a drawback: as the number of sequences to be processed grows, many reduce actions are created in the table, and parsing a sentence takes a great deal of time.

In this paper, we propose a method for constructing a new LR parsing table without reduce actions from a generalized context-free grammar. The new table gives the states to transition to after accepting each symbol. It contains only the states reached by shift moves; the reduce moves are calculated beforehand and do not appear in the table. We call this table a Descendant Set table (DS table). With the DS table, the symbols that admit a transition can be determined all at once, without building a new stack. As a result, fewer candidates are produced during correction and a fast system can be realized.

We also applied this new parsing table to detecting and correcting erroneous sentences that contain syntax errors, unknown words, and misspellings. For automatic correction of erroneous sentences, methods that narrow down the correct symbols using character and word n-gram models have already been proposed [4], [6], [13]. Methods that select candidate words by computing hash values and Levenshtein distances between the input word and each word in the dictionary have also been proposed [3], [12]. Both approaches, however, use only local statistical information and no grammatical information, so they cannot exploit global information in the erroneous sentence. On the other hand, Saito et al. [5], [9], [10] noted that context-free grammars [1] accept a wider class of languages than regular grammars and n-gram models, and they applied the LR parsing method [1], [14] to error correction. The LR parsing algorithm is a table-driven shift-reduce parsing algorithm that can handle arbitrary context-free grammars; it parses a sentence while looking ahead to the symbols that the grammar permits next. Because the permitted symbols are known in advance, their method can infer that the symbol at the point where parsing fails may be wrong, and can correct the erroneous sentence according to the context-free grammar. With the DS table, the symbol immediately after the error position can additionally be used to select correction symbols; as a result, fewer candidates are produced during correction and a fast system can be realized.

To observe the effect of this method, we ran an experiment on 1050 sentences (28.5 Kbyte) containing 418 error characters. The results show that this method corrects error points 69 times faster than the traditional method while keeping almost the same correction accuracy (90%).

The following sections describe the proposed fast error correction method in detail. Section 2 outlines the LR parsing algorithm. Section 3 first explains the traditional error correction method based on LR parsing, following Saito et al. [5], [9], [10], and then presents the construction algorithm for the DS table used by our method. Section 4 introduces the three error correction processes that use the DS table: the replacement process for altered errors, the deletion process for extra errors, and the insertion process for missing errors. Section 5 gives experimental results with various key sets. Finally, Section 6 summarizes our results and discusses future research.

Section snippets

LR parsing algorithm

The LR parsing algorithm is a table-driven shift-reduce parsing algorithm that can handle arbitrary context-free grammars in polynomial time. The LR parser analyzes the input sequence using the LR parsing table and a stack. The LR parsing table consists of the action table, which indicates the next action, and the goto table, which determines the state the parser should move to. The table is built beforehand from the grammar by a table generator. The LR parser scans the input
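As an illustration of how the action table, the goto table, and the stack interact, the following is a minimal sketch of a table-driven shift-reduce loop. The toy grammar, the hand-written tables `ACTION` and `GOTO`, and the function name `lr_parse` are assumptions made for this example; they are not taken from the paper, whose grammar is far larger.

```python
# Toy grammar (an illustrative assumption, not the paper's grammar):
#   rule 1: E -> E + T     rule 2: E -> T     rule 3: T -> id
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 1)}  # rule -> (lhs, rhs length)

# Action table: (state, lookahead terminal) -> (action kind, argument).
ACTION = {
    (0, "id"): ("shift", 3),
    (1, "+"): ("shift", 4),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 3), (3, "$"): ("reduce", 3),
    (4, "id"): ("shift", 3),
    (5, "+"): ("reduce", 1), (5, "$"): ("reduce", 1),
}

# Goto table: (state, nonterminal) -> next state.
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 5}


def lr_parse(tokens):
    """Return (accepted, error_position) for a list of terminal symbols."""
    stack = [0]                      # stack of parser states
    symbols = list(tokens) + ["$"]   # "$" marks the end of the input
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], symbols[pos]))
        if entry is None:
            # No action is defined: the parse fails at this symbol.
            return False, pos
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[len(stack) - length:]           # pop the rule's right-hand side
            stack.append(GOTO[(stack[-1], lhs)])      # follow the goto entry
        else:
            return True, None                         # accept
```

With these tables, `lr_parse(["id", "+", "id"])` returns `(True, None)`, while `lr_parse(["id", "id"])` returns `(False, 1)` because no action exists for the second `id`. Note how each shifted symbol may be preceded by several reduce steps on the stack.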

Descendant set table

The traditional LR parsing table has a drawback: as the number of sequences to be processed grows, many reduce actions are created in the table, and parsing a sentence takes a great deal of time. In this paper, we propose a method for constructing a new LR parsing table without reduce actions from a generalized context-free grammar. For each state and symbol, the new table gives the next state to transition to when the symbol is accepted. This table
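The construction algorithm itself is not included in this excerpt. As a rough sketch of the idea only, one plausible reading is a map from each (state, terminal) pair to the set of states in which that terminal ends up being shifted, with every intervening reduce resolved ahead of time. The toy grammar, the tables, and the names `states_below`, `shift_targets`, and `build_ds_table` below are all assumptions for illustration, and the backward walk over the automaton may over-approximate on larger grammars.

```python
# Toy grammar (illustrative assumption): 1: E -> E + T   2: E -> T   3: T -> id
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 1)}  # rule -> (lhs, rhs length)
ACTION = {
    (0, "id"): ("shift", 3),
    (1, "+"): ("shift", 4),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 3), (3, "$"): ("reduce", 3),
    (4, "id"): ("shift", 3),
    (5, "+"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 5}
TERMINALS = ["id", "+"]
STATES = range(6)

# Every edge of the LR automaton: shifts on terminals plus gotos on nonterminals.
EDGES = {key: arg for key, (kind, arg) in ACTION.items() if kind == "shift"}
EDGES.update(GOTO)


def states_below(state, depth):
    """States that can lie `depth` entries below `state` on a parser stack."""
    frontier = {state}
    for _ in range(depth):
        frontier = {src for (src, _sym), dst in EDGES.items() if dst in frontier}
    return frontier


def shift_targets(state, terminal, visiting=frozenset()):
    """States reached by shifting `terminal` from `state`, reduces included."""
    if state in visiting:            # guard against cycles (may under-report)
        return set()
    entry = ACTION.get((state, terminal))
    if entry is None or entry[0] == "accept":
        return set()
    kind, arg = entry
    if kind == "shift":
        return {arg}
    lhs, length = RULES[arg]         # reduce: pop `length` states, then goto
    targets = set()
    for below in states_below(state, length):
        nxt = GOTO.get((below, lhs))
        if nxt is not None:
            targets |= shift_targets(nxt, terminal, visiting | {state})
    return targets


def build_ds_table():
    """Precompute (state, terminal) -> set of shift-target states."""
    table = {}
    for state in STATES:
        for terminal in TERMINALS:
            targets = shift_targets(state, terminal)
            if targets:
                table[(state, terminal)] = targets
    return table
```

In this toy table the entry for state 3 and `+` is `{4}`: the two reduces that a conventional parser would perform before shifting `+` have been folded into a single lookup, so the set of admissible next symbols for a state can be read off without simulating the stack.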

Three correction processes

This method handles three kinds of errors: altered, extra and missing errors.

  • 1. Altered errors: Correct symbols are replaced with wrong symbols.

  • 2. Extra errors: Extra symbols are inserted into the correct input sequence.

  • 3. Missing errors: Arbitrary symbols of the correct input sequence are missing.

In order to correct the above errors, this method prepares three correction processes, namely the replacement process for altered errors, the deletion process for extra errors, and the insertion process for
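The three processes can be sketched as candidate generation around a detected error position. This is a brute-force illustration, not the DS-table-based procedure of the paper: the function names `correction_candidates` and `toy_accepts`, the vocabulary, and the acceptance check (a stand-in for a grammar that accepts `id (+ id)*`) are all assumptions made for this example.

```python
def toy_accepts(tokens):
    """Hypothetical stand-in for a grammar check: accepts id (+ id)*."""
    if len(tokens) % 2 == 0:
        return False
    return all(tok == ("id" if i % 2 == 0 else "+") for i, tok in enumerate(tokens))


def correction_candidates(symbols, error_pos, vocabulary, accepts):
    """Return (process, corrected sequence) pairs that `accepts` approves.

    deletion    -> extra error:   drop the symbol at the error position
    replacement -> altered error: swap the symbol at the error position
    insertion   -> missing error: insert a symbol before the error position
    """
    results = []
    head = symbols[:error_pos]
    has_symbol = error_pos < len(symbols)   # False when the error is at end of input
    tail = symbols[error_pos + 1:]

    if has_symbol:
        candidate = head + tail
        if accepts(candidate):
            results.append(("deletion", candidate))

    for word in vocabulary:
        if has_symbol and word != symbols[error_pos]:
            candidate = head + [word] + tail
            if accepts(candidate):
                results.append(("replacement", candidate))
        candidate = head + [word] + symbols[error_pos:]
        if accepts(candidate):
            results.append(("insertion", candidate))
    return results
```

For the input `["id", "id", "+", "id"]` with the error at position 1, the sketch returns a deletion candidate `["id", "+", "id"]` and an insertion candidate `["id", "+", "id", "+", "id"]`. Every vocabulary word is tried at the error position here, which is the kind of candidate growth that pruning with the symbol after the error position is meant to reduce.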

Evaluation

This system has been implemented in C on a SUPER WORKSTATION AS 5000, and the grammar used is the task-specific grammar [5] for queries concerning conference registration. The details of this grammar are as follows: the number of rules is 2048, the number of grammar symbols is 797, the vocabulary size is 770, the number of LR states is 3465, and the target language is Japanese. The parsing table is compiled with canonical LR(1), and the number of retreats [7],

Conclusions

This paper proposed a method that improves the time efficiency of the traditional method by using the DS table instead of the parsing table for automatic error correction, and the validity of the method has been supported by empirical observations. The method corrects error points of input sequences 68 times faster than the traditional method while keeping the same correction accuracy.

As future improvements, an efficient data structure for DS table

References (14)

  • A.V. Aho et al., Principles of Compiler Design (1977)
  • J. Aoe et al., The construction of weak precedence parsers by parsing tables, Transactions of Information Processing Society of Japan (1977)
  • M. Hatada et al., Spelling correction method for English and Katakana in Japanese OCR text, Transactions of Information Processing Society of Japan (1997)
  • N. Itoh et al., A method of detecting and correcting errors in the results of Japanese OCR, Transactions of Information Processing Society of Japan (1992)
  • K. Kita et al., HMM continuous speech recognition using generalized LR parsing, Transactions of Information Processing Society of Japan (1990)
  • T. Kurita et al., A method for correcting errors on Japanese words input and its application to spoken word recognition with large vocabulary, Transactions of Information Processing Society of Japan (1984)
  • M.D. Mickunas et al., Automatic error recovery for LR parsers, Communications of the ACM (1978)
There are more references available in the full text version of this article.
