
A Study of Two Sampling Methods for Analyzing Large Datasets with ILP

Published in: Data Mining and Knowledge Discovery

Abstract

This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, “subsampling”, is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, “logical windowing”, is a multiple-sample design in which a partially correct theory is tested against the data and the errors it makes are sequentially included in the training sample. Both schemes are derived from techniques developed to enable propositional learning methods (such as decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets—one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask two questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to that obtained if all the data were used (that is, no sampling was employed)? and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answers to both questions are “yes”. This suggests that an ILP program equipped with an appropriate sampling method could begin to address satisfactorily problems that have hitherto been inaccessible simply due to data extent.
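The two schemes can be sketched in outline. The sketch below is illustrative only: it assumes a generic learner interface rather than CProgol's actual implementation, and the function names, the toy memorising learner, and the example dataset are all hypothetical. Subsampling scores a candidate rule on one random sub-sample; logical windowing repeatedly learns from a small window, tests the resulting theory on the full data, and adds the misclassified examples to the window.

```python
import random

def subsample_utility(rule_covers, data, sample_size, rng):
    """Subsampling: estimate a candidate rule's utility (here, its accuracy)
    on a single random sub-sample rather than on the full dataset."""
    sample = rng.sample(data, min(sample_size, len(data)))
    correct = sum(1 for x, y in sample if rule_covers(x) == y)
    return correct / len(sample)

def logical_windowing(learn, data, initial_size, rng, max_rounds=10):
    """Windowing: learn a theory from a small window, test it on the data,
    sequentially include the errors in the window, and re-learn until the
    theory makes no errors (or a round limit is reached)."""
    window = rng.sample(data, min(initial_size, len(data)))
    theory = learn(window)
    for _ in range(max_rounds):
        errors = [(x, y) for x, y in data if theory(x) != y]
        if not errors:
            break
        window.extend(e for e in errors if e not in window)
        theory = learn(window)
    return theory

def memorize(window):
    """A toy stand-in for an ILP learner: memorise the training pairs and
    predict 0 for anything unseen."""
    table = dict(window)
    return lambda x: table.get(x, 0)
```

With the toy learner above, windowing converges quickly because each round folds exactly the misclassified positives back into the window; a real ILP learner such as CProgol would generalise from the window instead of memorising it.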



Cite this article

Srinivasan, A. A Study of Two Sampling Methods for Analyzing Large Datasets with ILP. Data Mining and Knowledge Discovery 3, 95–123 (1999). https://doi.org/10.1023/A:1009824123462
