Unsupervised Grammar Inference Using the Minimum Description Length Principle

Sapkota, Upendra; Bryant, Barrett R.; Sprague, Alan

doi:10.1007/978-3-642-31537-4_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7376)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

Context-free grammars (CFGs) are widely used in programming language descriptions, natural language processing, compilers, and other areas of software engineering that need to describe the syntactic structure of programs. Grammar inference (GI), the induction of CFGs from sample programs, is a challenging problem. We describe an unsupervised GI approach that uses simplicity as the criterion for directing the inference process and beam search for moving from a complex grammar to a simpler one. We use several operators to modify a grammar and the Minimum Description Length (MDL) principle to favor simple and compact grammars. The effectiveness of this approach is shown by a case study of a domain-specific language. The experimental results show that an accurate grammar can be inferred in a reasonable amount of time.
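
The search loop outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the operators or the cost function, so everything below is an assumption made for illustration. The sketch starts from a flat grammar with one alternative of S per sample, uses a single hypothetical operator (chunk, which replaces a repeated pair of adjacent symbols with a new nonterminal), and scores a grammar by a crude description length in which every symbol occurrence costs log2 of the alphabet size. Because each sample stays spelled out under S, this one cost stands in for both the grammar and the data, and the sketch compresses without generalizing.

```python
import math
from collections import Counter

# A grammar maps each nonterminal to a tuple of alternatives (tuples of symbols).
# Assumption: sample tokens never collide with the names "S", "N1", "N2", ...

def description_length(grammar):
    """Crude cost in bits: each symbol occurrence costs log2(alphabet size)."""
    symbols = set(grammar)
    size = 0
    for alternatives in grammar.values():
        for rhs in alternatives:
            symbols.update(rhs)
            size += 1 + len(rhs)  # left-hand side plus rule body
    return size * math.log2(max(len(symbols), 2))

def bigram_counts(grammar):
    """Count adjacent symbol pairs over all rule bodies."""
    counts = Counter()
    for alternatives in grammar.values():
        for rhs in alternatives:
            counts.update(zip(rhs, rhs[1:]))
    return counts

def replace_pair(rhs, pair, name):
    """Replace non-overlapping occurrences of pair in rhs, left to right."""
    out, i = [], 0
    while i < len(rhs):
        if tuple(rhs[i:i + 2]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)

def chunk(grammar, pair, name):
    """Hypothetical operator: introduce name -> pair and rewrite every rule body."""
    new = {lhs: tuple(replace_pair(rhs, pair, name) for rhs in alts)
           for lhs, alts in grammar.items()}
    new[name] = (pair,)
    return new

def beam_search(samples, beam_width=3, max_steps=20):
    """Keep the beam_width cheapest grammars per step; stop when cost stops falling."""
    start = {"S": tuple(tuple(s) for s in samples)}
    beam, best = [start], start
    for _ in range(max_steps):
        candidates = [chunk(g, pair, f"N{len(g)}")
                      for g in beam
                      for pair, n in bigram_counts(g).items() if n >= 2]
        if not candidates:
            break
        candidates.sort(key=description_length)
        beam = candidates[:beam_width]
        if description_length(beam[0]) >= description_length(best):
            break
        best = beam[0]
    return best

def expand(grammar, rhs):
    """Expand a rule body to terminals (every chunk rule has one alternative)."""
    out = []
    for sym in rhs:
        if sym in grammar:
            out.extend(expand(grammar, grammar[sym][0]))
        else:
            out.append(sym)
    return out

if __name__ == "__main__":
    demo = [["a", "b"] * 4 for _ in range(4)]
    result = beam_search(demo)
    for lhs, alts in result.items():
        print(lhs, "->", " | ".join(" ".join(rhs) for rhs in alts))
```

A real inference system would also need operators that generalize (for example, merging nonterminals) and a separate cost for encoding the samples given the grammar; both are deliberately left out here to keep the search loop visible.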

Author information

Authors and Affiliations

Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL, 35294-1170, USA
Upendra Sapkota & Alan Sprague
Department of Computer Science and Engineering, University of North Texas, Denton, TX, 76203-5017, USA
Barrett R. Bryant

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Kohlenstraße 2, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Sapkota, U., Bryant, B.R., Sprague, A. (2012). Unsupervised Grammar Inference Using the Minimum Description Length Principle. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_12

DOI: https://doi.org/10.1007/978-3-642-31537-4_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31536-7
Online ISBN: 978-3-642-31537-4
eBook Packages: Computer Science, Computer Science (R0)
