Abstract
This paper describes a novel approach to the automatic creation of a Bangla error corpus for training and evaluating grammar checker systems. The procedure begins by automatically generating a large number of erroneous sentences from a set of grammatically correct sentences. A statistical Confidence Score Filter then selects suitable samples from the generated sentences, assigning lower confidence scores to sentences with less probable word sequences and higher scores to those with more probable ones. A rule-based Mal-rule Filter, combined with an HMM-based semi-supervised POS tagger, collects the sentences that have improper tag sequences. Combining these two filters makes the approach robust, so that no valid construction is selected into the synthetically generated error corpus. Although the present work focuses on the most frequent grammatical errors in written Bangla, a detailed taxonomy of grammatical errors in Bangla is also presented, with the aim of increasing the coverage of the error corpus in future. The proposed approach is language independent and could easily be applied to create similar corpora in other languages.
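The two-filter idea can be illustrated with a minimal sketch. The abstract does not give the scoring formula, the mal-rules, or how the two filters are combined, so everything below is an assumption made for illustration: an add-one smoothed bigram model stands in for the confidence score, mal-rules are modelled as forbidden adjacent POS-tag pairs, tags are supplied directly rather than produced by an HMM tagger, and a candidate is kept only when both filters flag it. All function names, the tag labels, and the threshold are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenised sentences padded with <s>, </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def confidence_score(tokens, unigrams, bigrams):
    """Length-normalised log-probability under an add-one smoothed bigram model.

    Assumption: this stands in for the paper's confidence score, whose
    exact definition is not given in the abstract.
    """
    padded = ["<s>"] + tokens + ["</s>"]
    vocab = len(unigrams)
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return logp / (len(padded) - 1)


def violates_mal_rule(tags, mal_rules):
    """Return True if any adjacent tag pair is listed as an improper sequence."""
    return any(pair in mal_rules for pair in zip(tags, tags[1:]))


def select_errors(candidates, unigrams, bigrams, mal_rules, threshold):
    """Keep candidates that score below the threshold AND match a mal-rule.

    Assumption: requiring both filters to agree is one plausible way to
    combine them; the abstract does not specify the combination.
    """
    return [
        tokens
        for tokens, tags in candidates
        if confidence_score(tokens, unigrams, bigrams) < threshold
        and violates_mal_rule(tags, mal_rules)
    ]
```

In use, `train_bigram` would be run on the grammatically correct sentences, and `select_errors` on the generated candidates paired with their POS tags. Requiring both conditions is the conservative choice here: a sentence that merely contains a rare word sequence, or that merely trips a tag rule, is not admitted to the error corpus.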
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kundu, B., Chakraborti, S., Choudhury, S.K. (2012). Combining Confidence Score and Mal-rule Filters for Automatic Creation of Bangla Error Corpus: Grammar Checker Perspective. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_39
DOI: https://doi.org/10.1007/978-3-642-28601-8_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28600-1
Online ISBN: 978-3-642-28601-8
eBook Packages: Computer Science (R0)