Using MDL for Grammar Induction

Adriaans, Pieter; Jacobs, Ceriel

doi:10.1007/11872436_24

Pieter Adriaans²³ &
Ceriel Jacobs²⁴

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4201))

Included in the following conference series:

International Colloquium on Grammatical Inference

641 Accesses

Abstract

In this paper we study the application of the Minimum Description Length principle (or two-part-code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency. This is a measure of how typical the data set is for the theory. It can not be computed, but it can in many relevant cases be approximated. An optimal theory has minimal randomness deficiency. Using results from [4] and [2] we show that:

– Shorter code not necessarily leads to better theories. We prove that, in DFA induction, already as a result of a single deterministic merge of two nodes, divergence of randomness deficiency and MDL code can occur.

– Contrary to what is suggested by the results of [6] there is no fundamental difference between positive and negative data from an MDL perspective.

– MDL is extremely sensitive to the correct calculation of code length: model code and data-to-model code.

These results show why the applications of MDL to grammar induction so far have been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, we believe that, since MDL is a global optimization criterion, MDL based solutions will in many cases be less effective in problem domains where local optimization criteria can be easily calculated. The algorithms were tested on the Abbadingo problems ([10]). The code was in Java, using the Satin ([17]) divide-and-conquer system that runs on top of the Ibis ([18]) Grid programming environment.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

IDL-PMCFG, a Grammar Formalism for Describing Free Word Order Languages

Article Open access 29 April 2022

Computing $$\textit{pre}^{*}$$ for General Context Free Grammars

Improved Descriptional Complexity Results on Generalized Forbidding Grammars

References

Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
MATH Google Scholar
Adriaans, P., Vitányi, P.M.B.: The power and perils of MDL, Human Computer Studies Lab, Universiteit van Amsterdam (2005)
Google Scholar
Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)
MATH Google Scholar
Vereshchagin, N.K., Vitányi, P.M.B.: Kolmogorov’s structure functions and model selection. IEEE Trans. Information Theory 50(12), 3265–3290 (2004)
Article Google Scholar
Grünwald, P.D., Langford, J.: Suboptimal behaviour of Bayes and MDL in classification under misspecification. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp. 331–347. Springer, Heidelberg (2004)
Chapter Google Scholar
Gold, E.M.: Mark, Language Identification in the Limit. Information and Control 10(5), 447–474 (1967)
Article MATH Google Scholar
Pitt, L., Warmuth, M.K.: The Minimum Consistent DFA Problem Cannot be Approximated within any Polynomial. Journal of the ACM 40(1), 95–142 (1993)
Article MATH MathSciNet Google Scholar
Adriaans, P., Vervoort, M.: The EMILE 4.1 grammar induction toolbox, in Grammatical Inference: Algorithms and Applications. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 293–295. Springer, Heidelberg (2002)
Chapter Google Scholar
Vervoort, M.: Games, walks and Grammars, Thesis University of Amsterdam (2000)
Google Scholar
Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 1–12. Springer, Heidelberg (2002)
Chapter Google Scholar
van Zaanen, M., Adriaans, P.: Alignment-Based Learning versus EMILE: A Comparison. In: Proceedings of the Belgian-Dutch Conference on Artificial Intelligence (BNAIC), Amsterdam, The Netherlands, pp. 315–322 (2001)
Google Scholar
Solan, Z., Horn, D., Ruppin, E., Edelman, S.: Unsupervised learning of natural languages. PNAS 102(33), 11629–11634 (2005)
Article Google Scholar
Curnéjols, A., Miclet, L.: Apprentissage artificiel, concepts et algorithmes, Eyrolles (2003)
Google Scholar
Wolff, J.G.: Computing As Compression: An Overview of the SP Theory and System. New Generation Comput. 13(2), 187–214 (1995)
Article Google Scholar
Wolff, J.G.: Information Compression by Multiple Alignment, Unification and Search as a Unifying Principle in Computing and Cognition. Journal of Artificial Intelligence Research 19(3), 193–230 (2003)
Article Google Scholar
de la Higuera, C., Adriaans, P. W., van Zaanen, M., Oncina, J.(eds.): Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars held at the 14th European Conference on Machine Learning (ECML) and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Dubrovnik, Croatia (2003)
Google Scholar
van Nieuwpoort, R.V., Maassen, J., Kielmann, T., Bal, H.E.: Simple and Efficient Java-based Grid Programming. Scalable Computing: Practice and Experience 6(3), 19–32 (2005)
Google Scholar
van Nieuwpoort, R.V., Maassen, J., Wrzesinska, G., Hofman, R., Jacobs, C., Kielmann, T., Bal, H.E.: Ibis: a Flexible and Efficient Java based Grid Programming Environment. Concurrency and Computation: Practice and Experience 17(7-8), 1079–1107 (2005)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Amsterdam, Kruislaan 419, 1098VA, Amsterdam, The Netherlands
Pieter Adriaans
Department of Computer Science, Vrije Universiteit Amsterdam, De Boelelaan 1081a, 1081HV, Amsterdam, The Netherlands
Ceriel Jacobs

Authors

Pieter Adriaans
View author publications
You can also search for this author in PubMed Google Scholar
Ceriel Jacobs
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, 223-8522, Yokohama, Japan
Yasubumi Sakakibara
Dept. of Computer Science, Kyoto Sangyo University, Kamigamo Motoyama, Kita-ku, Kyoto, Japan
Satoshi Kobayashi
Japan Biological Informatics Consortium, 10F TIME24 Building, 2-45 Aomi, Koto-ku, 135-8073, Tokyo, Japan
Kengo Sato
Department of Information and Communication Engineering, Graduate School of Electro-Communications, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, 182-8585, Tokyo, Japan
Tetsuro Nishino
Department of Information and Communication Engineering, Faculty of Electro-Communications, The University of Electro-Communications, Chofugaoka 1–5–1, Chofu, 182-8585, Tokyo, Japan
Etsuji Tomita

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Adriaans, P., Jacobs, C. (2006). Using MDL for Grammar Induction. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds) Grammatical Inference: Algorithms and Applications. ICGI 2006. Lecture Notes in Computer Science(), vol 4201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11872436_24

Download citation

DOI: https://doi.org/10.1007/11872436_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45264-5
Online ISBN: 978-3-540-45265-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics