
A Study of Two Sampling Methods for Analyzing Large Datasets with ILP

Published in: Data Mining and Knowledge Discovery

Abstract

This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, “subsampling”, is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, “logical windowing”, is a multiple-sample design in which a partially correct theory is tested against the data and the errors it makes are sequentially included in the training sample. Both schemes are derived from techniques developed to enable propositional learning methods (such as decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets—one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask two questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to that obtained if all the data were used (that is, no sampling was employed)? and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answers to both questions are “yes”. This suggests that an ILP program equipped with an appropriate sampling method could begin to address satisfactorily problems that have hitherto been inaccessible simply due to data extent.
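The two schemes can be sketched in outline. The sketch below is illustrative only: it assumes a generic learner interface rather than CProgol's actual implementation, and the function names, the toy memorising learner, and the example dataset are all hypothetical. Subsampling scores a candidate rule on one random sub-sample; logical windowing repeatedly learns from a small window, tests the resulting theory on the full data, and adds the misclassified examples to the window.

```python
import random

def subsample_utility(rule_covers, data, sample_size, rng):
    """Subsampling: estimate a candidate rule's utility (here, its accuracy)
    on a single random sub-sample rather than on the full dataset."""
    sample = rng.sample(data, min(sample_size, len(data)))
    correct = sum(1 for x, y in sample if rule_covers(x) == y)
    return correct / len(sample)

def logical_windowing(learn, data, initial_size, rng, max_rounds=10):
    """Windowing: learn a theory from a small window, test it on the data,
    sequentially include the errors in the window, and re-learn until the
    theory makes no errors (or a round limit is reached)."""
    window = rng.sample(data, min(initial_size, len(data)))
    theory = learn(window)
    for _ in range(max_rounds):
        errors = [(x, y) for x, y in data if theory(x) != y]
        if not errors:
            break
        window.extend(e for e in errors if e not in window)
        theory = learn(window)
    return theory

def memorize(window):
    """A toy stand-in for an ILP learner: memorise the training pairs and
    predict 0 for anything unseen."""
    table = dict(window)
    return lambda x: table.get(x, 0)
```

With the toy learner above, windowing converges quickly because each round folds exactly the misclassified positives back into the window; a real ILP learner such as CProgol would generalise from the window instead of memorising it.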



Cite this article

Srinivasan, A. A Study of Two Sampling Methods for Analyzing Large Datasets with ILP. Data Mining and Knowledge Discovery 3, 95–123 (1999). https://doi.org/10.1023/A:1009824123462
