Skip to main content

Using MDL for Grammar Induction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4201))

Abstract

In this paper we study the application of the Minimum Description Length principle (or two-part-code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency. This is a measure of how typical the data set is for the theory. It can not be computed, but it can in many relevant cases be approximated. An optimal theory has minimal randomness deficiency. Using results from [4] and [2] we show that:

– Shorter code not necessarily leads to better theories. We prove that, in DFA induction, already as a result of a single deterministic merge of two nodes, divergence of randomness deficiency and MDL code can occur.

– Contrary to what is suggested by the results of [6] there is no fundamental difference between positive and negative data from an MDL perspective.

– MDL is extremely sensitive to the correct calculation of code length: model code and data-to-model code.

These results show why the applications of MDL to grammar induction so far have been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, we believe that, since MDL is a global optimization criterion, MDL based solutions will in many cases be less effective in problem domains where local optimization criteria can be easily calculated. The algorithms were tested on the Abbadingo problems ([10]). The code was in Java, using the Satin ([17]) divide-and-conquer system that runs on top of the Ibis ([18]) Grid programming environment.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)

    MATH  Google Scholar 

  2. Adriaans, P., Vitányi, P.M.B.: The power and perils of MDL, Human Computer Studies Lab, Universiteit van Amsterdam (2005)

    Google Scholar 

  3. Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)

    MATH  Google Scholar 

  4. Vereshchagin, N.K., Vitányi, P.M.B.: Kolmogorov’s structure functions and model selection. IEEE Trans. Information Theory 50(12), 3265–3290 (2004)

    Article  Google Scholar 

  5. Grünwald, P.D., Langford, J.: Suboptimal behaviour of Bayes and MDL in classification under misspecification. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI), vol. 3120, pp. 331–347. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  6. Gold, E.M.: Mark, Language Identification in the Limit. Information and Control 10(5), 447–474 (1967)

    Article  MATH  Google Scholar 

  7. Pitt, L., Warmuth, M.K.: The Minimum Consistent DFA Problem Cannot be Approximated within any Polynomial. Journal of the ACM 40(1), 95–142 (1993)

    Article  MATH  MathSciNet  Google Scholar 

  8. Adriaans, P., Vervoort, M.: The EMILE 4.1 grammar induction toolbox, in Grammatical Inference: Algorithms and Applications. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 293–295. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  9. Vervoort, M.: Games, walks and Grammars, Thesis University of Amsterdam (2000)

    Google Scholar 

  10. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Adriaans, P.W., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, pp. 1–12. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  11. van Zaanen, M., Adriaans, P.: Alignment-Based Learning versus EMILE: A Comparison. In: Proceedings of the Belgian-Dutch Conference on Artificial Intelligence (BNAIC), Amsterdam, The Netherlands, pp. 315–322 (2001)

    Google Scholar 

  12. Solan, Z., Horn, D., Ruppin, E., Edelman, S.: Unsupervised learning of natural languages. PNAS 102(33), 11629–11634 (2005)

    Article  Google Scholar 

  13. Curnéjols, A., Miclet, L.: Apprentissage artificiel, concepts et algorithmes, Eyrolles (2003)

    Google Scholar 

  14. Wolff, J.G.: Computing As Compression: An Overview of the SP Theory and System. New Generation Comput. 13(2), 187–214 (1995)

    Article  Google Scholar 

  15. Wolff, J.G.: Information Compression by Multiple Alignment, Unification and Search as a Unifying Principle in Computing and Cognition. Journal of Artificial Intelligence Research 19(3), 193–230 (2003)

    Article  Google Scholar 

  16. de la Higuera, C., Adriaans, P. W., van Zaanen, M., Oncina, J.(eds.): Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars held at the 14th European Conference on Machine Learning (ECML) and the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Dubrovnik, Croatia (2003)

    Google Scholar 

  17. van Nieuwpoort, R.V., Maassen, J., Kielmann, T., Bal, H.E.: Simple and Efficient Java-based Grid Programming. Scalable Computing: Practice and Experience 6(3), 19–32 (2005)

    Google Scholar 

  18. van Nieuwpoort, R.V., Maassen, J., Wrzesinska, G., Hofman, R., Jacobs, C., Kielmann, T., Bal, H.E.: Ibis: a Flexible and Efficient Java based Grid Programming Environment. Concurrency and Computation: Practice and Experience 17(7-8), 1079–1107 (2005)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Adriaans, P., Jacobs, C. (2006). Using MDL for Grammar Induction. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds) Grammatical Inference: Algorithms and Applications. ICGI 2006. Lecture Notes in Computer Science(), vol 4201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11872436_24

Download citation

  • DOI: https://doi.org/10.1007/11872436_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45264-5

  • Online ISBN: 978-3-540-45265-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics